Preparing data
Every data visualization starts with data, and having your data properly organized makes it easier to visualize and to convey your message accurately. The process of organizing your data before uploading it into the Charts app is called data preprocessing. After reading this guide, you should understand the basic ideas behind the process and know how to apply it in your daily work.
Unedited data collected directly from the source is called raw data. Data preprocessing (sometimes also called data cleansing) is the process of altering raw data to remove inconsistencies such as data entry errors or outliers. This can be done using spreadsheet tools such as Excel.
Collecting the data
Make sure that your data source is credible, and try to find the one that provides the most complete datasets. When downloading the data, choose a common file type such as CSV or XLS. Data from a good source is easier to handle once you start preprocessing it.
In recent years, the amount of publicly available data has increased significantly, due to greater transparency of government data and to data collected by global organizations such as the United Nations and the World Bank. These organizations even have dedicated data web pages where you can easily search for and download the data you need.
Preprocessing the data
Preprocessing data is a broad term that covers many actions, but you’ll most likely need to filter data, label rows and columns, choose the right data format, and apply calculations.
Filter the data
The first thing to do when you start cleaning your data is to filter out all the data you don't need. This can mean removing entries for a certain category or observation, or removing records for a time period that is not of interest to you. You can do this by hand or, for bigger datasets, with filtering options. Spreadsheet tools come with predefined filtering options you can apply to rows and columns, but you can also write and apply your own filter rules.
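A custom filter rule can also be written outside a spreadsheet. The following is a minimal Python sketch, assuming the data is a CSV with hypothetical columns named region, year, and sales; the values are made up for illustration, and the helper filter_rows is not part of the Charts app.

```python
import csv
import io

# Made-up sample data; in practice this would come from a downloaded CSV file.
RAW_CSV = """region,year,sales
North,2014,120
North,2015,135
South,2014,90
South,2015,110
West,2015,75
"""

def filter_rows(rows, keep_years, drop_regions):
    """Keep rows from the wanted years and drop the unwanted categories."""
    return [
        row for row in rows
        if int(row["year"]) in keep_years and row["region"] not in drop_regions
    ]

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
filtered = filter_rows(rows, keep_years={2015}, drop_regions={"West"})
for row in filtered:
    print(row["region"], row["year"], row["sales"])
```

Here the rule keeps only records for 2015 and removes one category, which mirrors the two kinds of filtering described above.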
Label rows and columns
Proper naming of rows and columns in the dataset will help you navigate the data more easily. Use descriptive labels, but keep them as short as possible to avoid clutter. Use abbreviations for long words, such as "gov_dev" for government development, and include time periods in the names, such as "gdp_2015" for GDP data for the year 2015.
Try to establish and follow one naming convention for all labels in the dataset. This covers the use of upper and lower case letters and of symbols such as the underscore to separate words.
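One way to enforce a naming convention is to normalize every label with the same rule. The sketch below assumes a lower-case, underscore-separated convention like the "gdp_2015" example; the function normalize_label and the sample headers are hypothetical.

```python
import re

def normalize_label(label):
    """Convert a label to lower-case words separated by underscores."""
    # Split camelCase: add a space between a lower-case letter or digit
    # and a capital letter that follows it ("unemploymentRate").
    label = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label.strip())
    # Replace every run of non-alphanumeric characters with one underscore.
    label = re.sub(r"[^A-Za-z0-9]+", "_", label)
    return label.strip("_").lower()

headers = ["GDP 2015", "Gov-Dev", "  Population (total) ", "unemploymentRate"]
print([normalize_label(h) for h in headers])
# ['gdp_2015', 'gov_dev', 'population_total', 'unemployment_rate']
```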
Choose a data format
The format of your data is essential when deciding which type of visualization to use. Charts recognizes three data types (text, number, and date&time) and uses them when creating visualization recommendations.
The text data type is used for textual data points. Make sure your text data uses the same letter casing throughout, so that identical values are written identically.
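As an illustration of uniform casing, the following sketch trims stray whitespace and applies title case to a few made-up category values. Title case is only one possible convention, chosen here as an assumption; Python's str.title() also capitalizes after apostrophes, so check the results for names that contain them.

```python
def normalize_text(value):
    """Trim whitespace, collapse repeated spaces, and apply title case."""
    return " ".join(value.split()).title()

categories = ["north america", "North America ", "NORTH  AMERICA", "europe"]
cleaned = [normalize_text(c) for c in categories]
print(sorted(set(cleaned)))
# ['Europe', 'North America']
```

After normalization, the three spellings of the same category collapse into one value.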
The number data type is used for numerical values. Numerical data can contain outliers, so keep an eye out for them when working on your dataset. Most formatting applied before uploading the dataset into the Charts app will be ignored, so avoid formats such as percentages or dollar values. Decimals are chart friendly: you can set the number of decimals according to your preference, and they will be imported into the app with the upload.
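The two checks above, removing formatting and spotting outliers, can be sketched as follows. The values are made up, the helper names are hypothetical, and the interquartile-range (IQR) rule is just one common way to flag outliers, not a method prescribed by Charts.

```python
import statistics

def to_number(text):
    """Strip dollar signs, thousands separators, and percent signs, then parse."""
    cleaned = text.strip().replace("$", "").replace(",", "").replace("%", "")
    return float(cleaned)

def flag_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

raw = ["$1,200.50", "1,150", "$1,310", "1,275.25",
       "1,225", "$1,240", "$98,000", "1,190"]
values = [to_number(r) for r in raw]
print(flag_outliers(values))
# [98000.0]
```

A flagged value is not necessarily wrong. It is a prompt to check the source before deciding whether to keep, correct, or remove it.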
The date&time data type is used for time intervals. These can be years, quarters, months, or even days, hours, and minutes. If your dataset contains this data type, pay attention to how you format it. Charts requires date&time data to be in a specific format in order to recognize it.
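When dates arrive in mixed styles, they can be rewritten into one pattern before upload. The sketch below is only illustrative: the target pattern YYYY-MM-DD is an assumption and should be replaced with the format Charts actually requires, and the list of input patterns is hypothetical.

```python
from datetime import datetime

# Input patterns expected in the raw data (assumed for this example).
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y"]
# Target pattern: an assumption, not the documented Charts format.
TARGET_FORMAT = "%Y-%m-%d"

def normalize_date(text):
    """Try each known input pattern and return the date in TARGET_FORMAT."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime(TARGET_FORMAT)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {text!r}")

dates = ["2015-03-01", "15/04/2015", "20.05.2015"]
print([normalize_date(d) for d in dates])
# ['2015-03-01', '2015-04-15', '2015-05-20']
```

Values such as 03/04/2015 are ambiguous between day-first and month-first order, so confirm which order your source uses before listing a pattern.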
Apply calculations
Calculations are a common part of data preprocessing. Sometimes the data you need is not available in raw form, but you can apply calculations to the raw data to get the numbers you need. These calculations range from basic arithmetic operations to advanced statistical calculations. Spreadsheet tools come with a variety of formulas that can be used to derive new data from the existing dataset.
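As a small example of deriving new data, the sketch below computes a year-over-year percentage change and an average from made-up yearly values. The series and the helper percent_change are hypothetical; a spreadsheet formula would achieve the same result.

```python
import statistics

# Made-up yearly values for illustration.
revenue = {2013: 200.0, 2014: 250.0, 2015: 225.0}

def percent_change(series):
    """Return the change in percent from each year to the next."""
    years = sorted(series)
    return {
        year: round((series[year] - series[prev]) / series[prev] * 100, 1)
        for prev, year in zip(years, years[1:])
    }

print(percent_change(revenue))
# {2014: 25.0, 2015: -10.0}
print(statistics.mean(revenue.values()))
# 225.0
```

The derived values are stored as plain decimals rather than formatted percentages, which matches the earlier advice to avoid percentage formatting before upload.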