Data cleaning is one of the most important procedures you should learn in data analysis. You will constantly be working with different sets of data and the accuracy or completeness of the same is never guaranteed. Because of this reason, you should learn how to handle such data and make sure the incompleteness or errors present do not affect the final outcome.
Why should you clean data?
Especially if you did not produce it in the first place? Using unclean data is a sure way to get poor results. You might be using a very powerful computer capable of performing calculations at a very high speed, but what they lack is intuition. Without this, you must make a judgement call each time you go through a set of data. In data analysis, your final presentation should be a reflection of the reality in the data you use. For this reason, you must eliminate any erroneous entries.
Possible Causes of Dirty Data
One of the most expensive overheads in many organizations is data cleaning. Unclean data is present in different forms. Your company might suffer in the form of omissions and errors present in the master data you need for analytical purposes. Since this data is used in important decision-making processes, the effects are costly. By understanding the different ways dirty data finds its way into your organization, you can find ways of preventing it, thereby improving the quality of data you use.
In most instances, automation is applied in data collection. Because of this, you might experience some challenges with the quality of data collected or consistency of the same. Since some data is obtained from different sources, they must be collated into one file before processing. It is during this process that concerns as to the integrity of the data might arise. The following are some explanations as to why you have unclean data:
The problem of incomplete data is very common in most organizations. When using incomplete data, you end up with many important parts of the data blank. For example, if you are yet to categorize your customers according to the target industry, it is impossible to create a segment in your sales report according to industry classification. This is an important part of your data analysis that will be missing, hence your efforts will be futile, or expensive in terms of time and resources invested before you get the complete and appropriate data.
Errors at input
Most of the mistakes that lead to erroneous data happen at data entry points. The individual in charge might enter the wrong data, use the wrong formula, misread the data, or innocently mistype the wrong data. In the case of an open-ended report like questionnaires, the respondents might input data with typos or use words and phrases that computers cannot decipher appropriately. Human error at input points is always the biggest challenge in data accuracy.
Inaccurate data is in most cases a matter of context. You could have the correct data, but for the wrong purpose. Using such data can have far-reaching effects, most of which are very costly in the long run. Think about the example of a data analyst preparing a delivery schedule for clients, but the addresses are inaccurate. The company could end up delivering products to their customers, but with the wrong address details. As a matter of context, the company does have the correct addresses for their clients, but they are not matched correctly.
In cases where you collect data from different sources, there is always a high chance of data duplication. You must have a lot of checks in place to ensure that duplicates are identified. For example, one report might list student scores under Results, while another will have them under Performance. The data under these tags will be similar, but your sensors will consider them as two independent entities.
Unless you are using a machine that periodically checks for errors and corrects them or alerts you, it is possible to encounter errors as a result of problematic sensors. Machines can be faulty or breakdown too, which increases the likelihood of a problematic data entry.
Entries An incorrect entry will always deliver the wrong result. Incorrect entry happens when your dataset includes entries that are not within the acceptable range. For example, data for the month of February should range from 1 to 28 or 29. If you have data for February ranging up to 31, there is definitely an error in your entries.
If at your data entry point you use a machine with problematic sensors, it is possible to record erroneous values. You might be recording people’s ages, and the machine inputs a negative figure. In some cases, the machine could actually record correct data, but between the input point and the data collection point, the data might be mungled, hence the erroneous results. If you are accessing data from a public internet connection, a network outage during data transmission might also affect the integrity of the data.
For data obtained from different sources, one of the concerns is often how to standardize the data. You should have a system or method in place to identify similar data and represent them accordingly. Unfortunately, it is not easy to manage this level of standardization. As a result, you end up with erroneous entries. Apart from data obtained from multiple sources, you can also experience challenges dealing with data obtained from the same source. Everyone inputs data uniquely, and this might pose a challenge in data analysis.
How to Identify Inaccurate Data
More often, you need to make a judgement call to determine whether the data you are accessing is accurate or not. As you go through data, you must make a logical decision based on what you see. The following are some factors you should think about:
Study the range
First, check the range of data. This is usually one of the easiest problems to identify. Let’s say you are working on data for primary school kids. You know the definitive age bracket for the students. If you identify age entries that are either too young or too old for primary school kids whose data you have, you need to investigate further.
Essentially what you are doing here is an overview of a max-min approach. With these ranges in mind, you can skim through data and identify erroneous entries. Skimming through is easy if you are working with a few entries. If you have thousands or millions of data entries, a max-min function code can help you identify the wrong entries in an instant. You can also plot the data on a graph and visually detect the values that don’t fall within the required distribution pattern.
Investigate the categories
How many categories of data do you expect? This is another important factor that will help you determine whether your data is accurate or not. If you expect a dataset with nine categories, anything less is acceptable, but not more. If you have more than nine categories, you should investigate to determine the legitimacy of the additional categories. Say you are working with data on marital status, and your expected options are single, married, divorced, or widowed. If the data has six categories, you should investigate to determine why there are two more.
Look at the data in question and ensure all entries are consistent. In some cases, inaccuracies appear as a result of inconsistency. This is common when working with percentages. Percentages can either be fed into data sets as basis points or decimal points. If you have data that has both sets of entries, they might be incompatible.
Inaccuracies across multiple fields
This is perhaps one of the most difficult challenges you will overcome when cleaning inaccurate data. The following entries, for example, are valid individually. A 4-year old girl is a valid age entry. 5 children is also a valid entry. However, a datapoint that depicts Grace as a 4-year old girl with 5 children is absurd. You would need to check for inconsistencies and inaccuracies in several rows and columns.
Plotting data in visual form is one of the easiest ways of identifying abnormal distributions or any other errors in the data. Say you are working with data whose visualization should result in a bimodal distribution, but when you plot the data you end up with a normal distribution. This would immediately alert you that something is not right, and you need to check your data for accuracy.
Number of errors
in your data set Having identified the unique errors in the data set, you must enumerate them. Enumeration will help you make a final decision on how and whether to use the data. How many errors are there? If you have more than half of the data as inaccurate, it is obvious that your presentation would be greatly flawed. You must then follow up with the individuals who prepared the data for clarification or find an alternative.
A common data concern that data analysts deal with is working with datasets missing some entries. Missing entries is relative. If you are missing two or three entries, this should not be a big issue. However, if your data set is missing many entries, you have to find out the reason behind this.
Missing entries usually happen when you are collating data from multiple sources, and in the process some of the data is either deleted, overwritten, or skipped. You must investigate the missing entries because the answer might help you determine whether you are missing only a few entries that might be insignificant going forward, or important entries whose absence affects the outcome.
How to Clean Data
Having gone through the procedures described above and identified unclean data, your next challenge is how to clean it and use accurate data for analysis. You have five possible alternatives for handling such a situation:
If you are unable to find the necessary values, you can impute them by filling in the gaps for the inaccurate values. The closest explanation for imputation is that it is a clever way of guessing the missing values, but through a data-driven scientific procedure. Some of the techniques you can use to impute missing data include stratification and statistical indicators like mode, mean and median.
If you have studied the data and identified unique patterns, you can stratify the missing values based on the trend identified. For example, men are generally taller than women. You can use this presumption to fill in missing values based on the data you already have.
The most important thing, however, is to try and seek a second opinion on the data before imputing your new values. Some datasets are very critical, and imputing might introduce a personal bias which eventually affects the outcome.
Data scaling is a process where you change the data range so that you have a reasonable range. Without this, some values that might appear larger than others might be given prominence by some algorithms.
For example, the age of a sample population generally exists within a smaller range compared to the average population of a city. Some algorithms will give the population priority over age, and might ignore the age variable altogether.
By scaling such entries, you maintain a proportional relationship between different variables, ensuring that they are within a similar range. A simple way of doing this is to use a baseline for the large values, or use percentage values for the variables.
Correcting data is a far better alternative than removing data. This involves intuition and clarification. If you are concerned about the accuracy of some data, getting clarification can help allay your fears. With the new information, you can fix the problems you identified and use data you are confident about in your analysis.
One of the first things you could think about is to eliminate the missing entries from your dataset. Before you do this, it is advisable that you investigate to determine why the entries are missing. In some cases, the best option is to remove the data from your analysis altogether. If, for example, more than 80% of entries in a row is missing and you cannot replace them from any other source, that row will not be useful to your analysis. It makes sense to remove it.
Data removal comes with caveats. If you have to eliminate any data from your analysis, you must give a reason for this decision in a report accompanying your analysis. This is important so as to safeguard yourself from claims of data manipulation or doctoring data to suit a narrative.
Some types of data are irreplaceable, so you must consult experts in the associated fields before you remove them. Most of the time, data removal is applied when you identify duplicates in the data, especially if removing the duplicates does not affect the outcome of your analysis.
There are situations where you have columns missing some values, but you cannot afford to eliminate all of them. If you are working with numeric data, a reprieve would be to introduce a new column where you indicate all the missing values. The algorithm you are using should identify these values as such. In case the flagged values are necessary in your analysis, you can impute them or find a better way to correct them then use them in your analysis. In case this is not possible, make sure you highlight this in your report.
Cleaning erroneous data can be a difficult process. A lot of data scientists generally hope to avoid it, especially since it is time-consuming. However, it is a necessary process that will bring you closer to using appropriate data for objective is to use clean data that will give you the closest reflection of the true picture of events.
How to Avoid Data Contamination
From empty data fields to data duplication and invalid addresses, there are so many ways you can end up with contaminated data. Having looked at possible causes and methods of cleaning data, it is important for an expert in your capacity to put measures in place to prevent data contamination in the future. The challenges you experienced in cleaning data could easily be avoided, especially if the data collection processes are within your control.
Looking back to the losses your business suffers in dealing with contaminated data and the resource wastage in terms of time, you can take significant measures to reduce inefficiencies, which will eventually have an impact on your customers and their level of satisfaction.
One of the most important steps today is to invest in the appropriate CRM programs to help in data handling. Having data in one place makes it easier to verify the credibility and integrity of data within your database. The following are some simple methods you can employ in your organization to prevent data contamination, and ensure you are using quality data for decision-making.
Irrespective of the data handling programs you use, one of the most important things is to make sure you configure applications properly. Your company could be using CRM programs or simple Excel sheets. Whichever the case, it is important to configure your programs properly. Start with the critical information. Make sure the entries are accurate and complete.
One of the challenges of incomplete data is that there is always the possibility that someone could complete them with inaccurate data to make them presentable, when this is not the real picture.
Data integrity is just as important, so make sure you have the appropriate data privileges in place for anyone who has to access critical information. Set the correct range for your data entries. This way, anyone keying in data will be unable to enter incorrect data not within the appropriate range. Where possible, set your system up such that you can receive notifications whenever someone enters the wrong range, or is struggling, so that you can follow up later on and ensure you captured the correct data.
Human error is one of a data analyst’s worst nightmares when trying to prevent data contamination. Other than innocent mistakes, many errors from human entry are usually about context. It is important that you train everyone handling data on how to go about it. This is a good way to improve accuracy and data integrity from the foundation – data entry. Your team must also understand the challenges you experience when using contaminated data, and more importantly why they need to be keen at data entry. If you are using CRM programs, make sure they understand different functionality levels so they know the type of data they should enter.
Another issue is how to find the data they need. When under duress, most people key in random or inaccurate data to get some work done or bypass some restrictions. By training them on how to search for specific data, it is easier to avoid unnecessary challenges with erroneous entries. This is usually a problem when you have new members joining your team. Ensure you train them accordingly, and encourage them to ask for help whenever they are unsure of anything.
The data format is equally important as the desired level of accuracy. Think about this from a logical perspective. If someone sends you a text message written in all capital letters, you will probably disregard it or be offended by the tone of the message. However, if the same message is sent with proper formatting, your response is more positive. The same applies to data entry. Try and make sure that everyone who participates in data handling is careful enough to enter data using the correct format. Ensure the formats are easy to understand, and remind the team to update data they come across if they realize it is not in the correct format. Such changes will go a long way in making your work easier during analysis.
Empower data handlers
Beyond training your team, you also need to make sure they are empowered and aware of their roles in data handling. One of the best ways of doing this is to assign someone the data advocacy role. A data advocate is someone whose role is to ensure and champion consistency in data handling. Such a person will essentially be your data administrator. Their role is usually important, especially when implementing new systems. They come up with a plan to ensure data is cleaned and organized. One of their deliverables should include proper data collection procedures to help you improve the results obtained from using the data in question.
Overcoming data duplication
Data duplication happens in so many organizations because the same data is processed at different levels. Duplication might eventually see you discard important and accurate data accidentally, affecting any results derived from the said data.
For example, ensure your team searches for specific items before they create new ones. Provide an in-depth search process that increases the search results and reduces the possibility of data duplication. For example, beyond looking for a customer’s name, the entry should also include contact information. Provide as many relevant fields that can be searched into, thereby increasing the possibility of arresting and avoiding duplicates.
You can find data for a customer named Charles McCarthy in different databases labeled as Charles MacCarthy or Charles Mc Carthy. The moment you come across such duplicates, the last thing you want to do is to eliminate them from the database. Instead, investigate further to ascertain the similarities and differences between the entries. Consult, verify, and update the correct entry accordingly. Alternatively, you can escalate such issues to your data advocate for further action. At the same time, put measures in place that scans your database to warn users whenever they are about to create a duplicate entry.
Perhaps one of the best solutions would be cleaning data before it gets into your database. A good way of doing this would be creating clear outlines on the correct data format to use. With such procedures in place, you have an easier time handling data. If all the conditions are met, you will probably handle data cleaning at the entry point instead of once the data is in your database, making your work easier.
Create filters to determine the right data to collect and the data that can be updated later. It doesn’t make sense to collect a lot of information to give you the illusion of a complete and elaborate database, when in a real sense very little of what you have is relevant to your cause.
The misinformation that arises from inaccurate data can be avoided if you take the right precautionary measures in data handling. Data security is also important, especially if you are using data sources where lots of other users have access. Restrict access to data where possible, and make sure you create different access privileges for all users.