Preparing data
Every data visualization begins with data, and organizing it properly makes it easier to create clear visuals and communicate your message effectively. Data collected directly from a source is known as raw data. Preparing that data before uploading it is called data preprocessing (also called data cleansing): refining the raw data to remove inconsistencies, such as entry errors or outliers. This can be done using spreadsheet tools like Excel. This guide explains the basic concepts behind the process and shows how to apply them in your daily work.
Collecting the data
Make sure your data source is reliable, and aim to find one that offers the most complete dataset available. When downloading data, choose common file formats such as CSV or XLS. Downloading data from a good source will make the preprocessing step much easier.
In recent years, the availability of public data has grown significantly, driven by increased transparency from governments and contributions from global organizations like the United Nations and the World Bank. Many of these organizations provide dedicated data portals where you can easily search for and download the data you need.
Preprocessing the data
Data preprocessing covers a range of actions, but most commonly, you’ll need to filter data, select the right data format, and apply calculations.
Filter the data
The first step in cleaning your data is to remove any unnecessary entries. This could mean excluding certain categories, observations, or time periods that are not relevant to your analysis. For smaller datasets, you can do this manually, while larger datasets benefit from the filtering tools available in spreadsheet programs. You can use predefined filters or create your own rules to narrow down rows and columns efficiently.
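If you prefer a script to a spreadsheet, the same filtering idea can be sketched in a few lines of Python. This is only an illustration: the column names (country, year, gdp_growth), the sample values, and the helper filter_rows are made up for the example and are not part of any specific tool.

```python
import csv
import io

# Hypothetical sample data with made-up values; in practice this would be
# the contents of a CSV file downloaded from your data source.
RAW_CSV = """country,year,gdp_growth
Austria,2014,0.7
Austria,2015,1.0
Belgium,2015,2.0
Belgium,2016,1.3
"""


def filter_rows(rows, keep):
    """Return only the rows for which keep(row) is true."""
    return [row for row in rows if keep(row)]


# csv.DictReader yields one dict per row, keyed by the header line.
rows = list(csv.DictReader(io.StringIO(RAW_CSV)))

# Keep only observations from 2015 onward; earlier years are dropped.
recent = filter_rows(rows, lambda row: int(row["year"]) >= 2015)
```

The rule passed to filter_rows plays the role of a custom spreadsheet filter, so you can swap in any condition, such as matching a category or excluding empty values.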
Label rows and columns
Descriptive row and column names make your dataset easier to navigate. Use concise but clear labels, and consider abbreviations for long words, such as gov_dev for government development data. Include time periods in labels where they apply, such as gdp_2015 for GDP data in 2015.
Try to establish and follow a consistent naming convention throughout your dataset. This can include using upper and lower case letters consistently or using symbols like underscores (_) to separate words.
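One way to enforce a naming convention is to normalize every label with the same rule. The sketch below assumes a lowercase, underscore-separated convention. The function name to_snake_case and the sample labels are hypothetical, and a different convention would work just as well as long as it is applied consistently.

```python
import re


def to_snake_case(label):
    """Convert a column label to lowercase words joined by underscores."""
    # Replace every run of characters other than letters and digits
    # with a single underscore.
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", label.strip())
    # Drop any leading or trailing underscores, then lowercase the result.
    return cleaned.strip("_").lower()


# Hypothetical labels as they might appear in a raw spreadsheet.
labels = ["GDP 2015", "Gov. Dev", " Population (total) "]
normalized = [to_snake_case(label) for label in labels]
```

Here the three labels become gdp_2015, gov_dev, and population_total.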
Choose a data format
The type of data you have will influence the visualization you can create. The Charts app recognizes three data types: Text, Number, and Date, and uses them to suggest appropriate visualizations.
- Text – use for textual data points and ensure consistent letter casing for uniformity.
- Number – use for numerical values. Watch out for outliers. Avoid formatting like percentages or currency, as most formatting applied in spreadsheets will be ignored by Vizualist Chart. Decimals are supported, and you can specify the number of decimal places before uploading.
- Date – use for time intervals such as years, quarters, months, or even days, hours, and minutes. If your dataset contains dates, format them carefully: Charts recognizes date data only when it is in a specific format.
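The Number and Date advice above can also be applied with a short script before uploading. The sketch below rests on several assumptions: numbers use a comma as the thousands separator and a period as the decimal mark, and dates are rewritten as YYYY-MM-DD. This guide does not state which date format is required, so YYYY-MM-DD is only a placeholder to replace with the format your tool expects. The helper names clean_number and normalize_date are hypothetical.

```python
from datetime import datetime


def clean_number(text):
    """Strip common spreadsheet formatting and return a plain float.

    Assumes "," is a thousands separator and "." is the decimal mark.
    A percentage such as "45%" becomes 45.0, not 0.45.
    """
    stripped = text.strip()
    for symbol in ("$", "€", "£", ",", "%", " "):
        stripped = stripped.replace(symbol, "")
    return float(stripped)


def normalize_date(text, input_format):
    """Parse a date written in input_format and return it as YYYY-MM-DD.

    YYYY-MM-DD is an assumed target format chosen for illustration.
    """
    return datetime.strptime(text, input_format).date().isoformat()


price = clean_number("$1,234.50")
share = clean_number("45%")
day = normalize_date("31/12/2015", "%d/%m/%Y")
```

Passing the input format explicitly avoids guessing whether a value like 03/04/2015 means 3 April or 4 March.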
Apply calculations
Calculations are a common part of data preprocessing. Sometimes the data you need is not available in raw form, but you can apply calculations to the raw data to get the numbers you need. These calculations range from basic arithmetic operations to advanced statistical calculations. Spreadsheet tools come with a variety of formulas that can be used to derive new data from the existing dataset.
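As an example of deriving new data from raw values, the sketch below computes the year-over-year percentage change of a series. The values and the function name percent_change are made up for illustration. The same calculation can be written as a spreadsheet formula that subtracts the previous value from the current one, divides by the previous value, and multiplies by 100.

```python
def percent_change(previous, current):
    """Return the change from previous to current as a percentage."""
    if previous == 0:
        raise ValueError("percent change is undefined when the previous value is 0")
    return (current - previous) / previous * 100


# Made-up raw values keyed by year.
values_by_year = {2014: 200.0, 2015: 210.0, 2016: 199.5}

# Pair each year with the one before it and compute the change,
# rounded to two decimal places.
years = sorted(values_by_year)
changes = {
    year: round(percent_change(values_by_year[prev], values_by_year[year]), 2)
    for prev, year in zip(years, years[1:])
}
```

The first year has no earlier value to compare against, so the derived series starts one year later than the raw data.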