Data cleaning is a critical step in data analysis, and it can take up to 80% of the time spent on a project.
Data cleaning is not merely a chore but a craft that shapes the destiny of your analysis.
Reasons for unclean data
Data comes from people (who can mistakes) and machines (which make different types of mistakes).
Data entry clerks accidentally misname files or enter data into the wrong field.
Automatic sensors being offline or broken.
Servers being down for a day or misconfigured XML feed-reading system.
Routers were being rebooted or a backhoe cut the internet line.
A brief exploration of data cleaning techniques that transform raw data into a valuable asset: include
Data is a precious thing and will last longer than the systems themselves."
More data-cleaning techniques include;
Outlier Detection- The process of identifying and managing values that significantly deviate from the rest of the data. Outliers are values that are unusually high or low compared to the rest of the data and can have a substantial impact on statistical analyses and machine learning models.
Data profiling - It is the process of studying data to determine what is wrong and it is important before moving to the next step.
It is the process of analyzing and examining the content, structure, and quality of a dataset.
It helps ensure that subsequent analyses are based on a solid understanding of the data and that potential issues are addressed early in the process.
In the era of big data, where the volume and diversity of information continue to grow, the importance of sound data-cleaning practices becomes increasingly evident. By adopting and adapting these techniques to the unique characteristics of each dataset, analysts and data scientists pave the way for more trustworthy, insightful, and impactful results.
Below is a table showing what it means to "clean data"
Cleaning Empty Cells.
Empty cells can potentially give you wrong results when you analyze data.
One way to deal with empty cells is to remove rows that contain empty cells.
By default, the dropna() method returns a new DataFrame, and will not change the original.
Now, the dropna(inplace = True) will NOT return a new DataFrame, but it will remove all rows containing NULL values from the original DataFrame.
Replace Empty Values
Another way of dealing with empty cells is to insert a new value instead.
This way you do not have to delete entire rows just because of some empty cells.
df.fillna (130,inplace= True)
This code remember it replaces all the empty cells in the whole data, but still, u can replace only specified columns
- To only replace empty values for one column, specify the column name for the DataFrame.
df["calories"].fillna(130,inplace = True)
- A common way to replace empty cells, is to calculate the mean, median, or mode value of the column. for mean the code will be
x = df["Calories"].mean()
df["Calories"].fillna(x, inplace = True)
x=df["Calories"].median()
df["Calories"].fillna(x, inplace =True)
Cleaning Data of Wrong Format.
Cells with data in the wrong format can make it difficult, or even impossible, to analyze data.
To fix the cells u have to remove the rows or convert all cells in the columns into the same format.
pandas have a method to convert datetime. to_datetime()
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())
Replacing Values
U can use loc to replace missing values for small data
df.loc[7, 'Duration'] = 45
where 7 is the row index and the Duration is the Column name.
df.replace - replaces values in one or more columns with other values
df.replace('a':{'b':'c'),'d')
Removing Duplicates
Discovering Duplicates
Duplicate rows are rows that have been registered more than one time.
To discover duplicates, we can use the duplicated() method.
The duplicated method return a boolean values for each row.
df = pd.read_csv
print(df.duplicated())
To remove duplicates, use the drop_duplicates() method.
Note The (inplace = True) will make sure that the method does
NOT return a new DataFrame, but it will remove all
duplicates from the original DataFrame.
Data cleaning also involves data pre-processing such as renaming the column, converting it to datetime variable and sorting the data in ascending order of date.
Renaming the data columns include;
df =df.rename(columns=
converting the date column to datetime,
df['Date'] =pd.to_datetime(df['Date'])
sorting the dataset in ascending order of date
df =df.sort_values(by = 'Date")
follow the following steps to do EDA,
- Read the dataset and examine the first five rows(using the head method)
.head()
- Get information about the data type of each column and the number of non-null values in each column (using the info _method) and the dimensions of the dataset (using the shape_ attribute).
- Get summary statistics for each column ( using the describe method) and obtain the values of the count, min, max standard deviation, and percentiles.
Data Wrangling,
- Check if the data types of the columns are correctly identified (using the info or type method). `
If not change the data types using the
astype` method - Rename the columns if necessary or redundant columns or rows using the _drop _method.
- Rename the columns if necessary using the_ rename _method.
- Make the data tidy,if needed by restructuring it using the melt or stack method.
- Remove any extraneous data(blank values ,special characters, etc ) that does not add any value using the replace method.
- Check for the presence of null values, using the_ isna_ method, and drop or fill the null values using the_ dropna_ or fillna method.
- Add a column if it adds value to your analysis.
- Aggregate the data if the data is in a disaggregated format,using the groupby method.
Top comments (0)