shiningamour
Data Cleaning with Python: A Step-by-Step Guide Using a Kaggle Titanic Dataset

Introduction

An essential step in any data analytics project is the data cleaning phase. This process entails identifying and fixing missing data, inaccuracies, and inconsistencies in the chosen dataset to make sure the data is complete, accurate, and reliable. This article walks through the main steps involved in data cleaning, using the Titanic dataset from Kaggle with Python code snippets for each step.

Dataset Description

This article uses the "Titanic: Machine Learning from Disaster" dataset from Kaggle, a collection of passenger records from the Titanic. The dataset includes each passenger's sex, age, class, fare, and survival status. You can download the dataset from the link below:

https://www.kaggle.com/c/titanic/data

Step 1: Importing Libraries and Loading Data

To begin, import the required Python libraries and load the dataset. We will use the pandas library to read the dataset into a pandas DataFrame. Below is a Python code snippet for this step:

# Importing libraries

import pandas as pd


# Loading dataset

df = pd.read_csv('train.csv')


In the above code snippet, we imported the pandas library using the "import pandas as pd" syntax, which lets us refer to pandas by the alias "pd". Next, we loaded the dataset with "pd.read_csv('train.csv')", which reads the CSV file and stores it as a pandas DataFrame.
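To see what "pd.read_csv()" actually does with the raw file, here is a minimal sketch that reads a tiny CSV held in memory instead of 'train.csv'; the column names and values are hypothetical stand-ins. Note how an empty field is parsed as NaN, which forces the 'Age' column to a float type:

```python
import io

import pandas as pd

# A tiny CSV held in memory, standing in for train.csv (hypothetical values)
csv_text = "PassengerId,Age,Fare\n1,22,7.25\n2,,71.28\n"

df = pd.read_csv(io.StringIO(csv_text))

# Empty fields are parsed as NaN, so Age is read as float64 to hold them
print(df.dtypes)
print(df['Age'].isnull().sum())
```

This is the same behaviour you will see on the real dataset: columns with missing entries such as 'Age' come back as floats, even though ages look like whole numbers.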

Step 2: Exploring Data

After loading the data, it is worthwhile to explore the dataset to better understand it. This includes checking for missing values, inspecting data types, and summarizing the data. Below is a Python code snippet for these steps:

# Checking for missing values

print(df.isnull().sum())

# Checking data types

print(df.dtypes)

# Summarizing data

print(df.describe())


The above code snippet checks for missing data using "df.isnull().sum()", which returns the total number of missing values in each column of the Titanic dataset. Next, we used the "df.dtypes" attribute to determine the data type of each column. Lastly, we called "df.describe()" to summarize the data; it returns statistical measures such as the mean, standard deviation, minimum, maximum, and quartiles.
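Raw counts from "df.isnull().sum()" can be hard to compare across columns of different sizes, so a common follow-up is to express missingness as a percentage. Here is a small sketch using a toy DataFrame with hypothetical values in place of the real dataset:

```python
import numpy as np
import pandas as pd

# A small frame standing in for the Titanic data (hypothetical values)
df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Embarked': ['S', 'C', None, 'S'],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives a boolean frame; mean() of booleans is the missing fraction
missing_pct = df.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Sorting largest-first makes it easy to spot the columns (here 'Age', then 'Embarked') that will need attention in the cleaning step.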

Step 3: Cleaning Data

The data exploration process will usually reveal issues that need to be fixed to properly clean the dataset. Filling in missing data, converting data types, removing duplicate values, and correcting inconsistencies are common cleaning measures. Below is a Python code snippet for the data cleaning step.

# Filling in missing values

df['Age'] = df['Age'].fillna(df['Age'].median())

df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])


# Converting data types

df['Pclass'] = df['Pclass'].astype('category')


# Removing duplicates

df.drop_duplicates(inplace=True)


# Correcting inconsistencies

df.loc[df['Age'] < 0, 'Age'] = df['Age'].median()


Here is an explanation of the code snippet above. First, we applied the "fillna()" method to fill in missing values in the 'Age' and 'Embarked' columns, using the median value for 'Age' and the mode for 'Embarked'. Using the "astype()" method we converted the 'Pclass' column to a categorical data type. We then used "drop_duplicates()" to remove duplicate rows. Lastly, we replaced any negative values in the 'Age' column with the median age to correct inconsistencies.
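Filling every missing age with the overall median is the simplest option, but it flattens real differences between groups. A slightly finer alternative is to impute per group, for example the median age within each passenger class. Here is a sketch on a toy frame with hypothetical values; the real dataset would use the same pattern:

```python
import numpy as np
import pandas as pd

# Hypothetical passengers: two classes, two missing ages
df = pd.DataFrame({
    'Pclass': [1, 1, 3, 3, 3],
    'Age': [38.0, np.nan, 22.0, 26.0, np.nan],
})

# transform('median') returns a per-row median for that row's class,
# so fillna() fills each gap with its own group's median
df['Age'] = df['Age'].fillna(df.groupby('Pclass')['Age'].transform('median'))
print(df)
```

In this toy example the missing first-class age is filled with 38.0 and the missing third-class age with 24.0, rather than one global value for both.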

Step 4: Validating Cleaned Data

When you are done cleaning the data, it is necessary to confirm that the cleaning process was successful. This can be done by checking for missing values and data types, and then summarizing the data again. Below is a Python code snippet for this process.

# Checking for missing values

print(df.isnull().sum())


# Checking data types

print(df.dtypes)


# Summarizing data

print(df.describe())



The code snippet above checks data types and missing values, and summarizes the data again, to confirm that the cleaning process was successful.
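Eyeballing printed summaries works, but validation can also be made automatic with plain "assert" statements that fail loudly if any cleaning step was missed. Here is a sketch run against a small already-cleaned frame with hypothetical values:

```python
import pandas as pd

# A small cleaned frame (hypothetical values) to run the checks against
df = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0],
    'Embarked': ['S', 'C', 'S'],
    'Pclass': pd.Categorical([3, 1, 3]),
})

# Hard checks that raise immediately if cleaning left problems behind
assert df['Age'].isnull().sum() == 0, "Age still has missing values"
assert df['Embarked'].isnull().sum() == 0, "Embarked still has missing values"
assert (df['Age'] >= 0).all(), "Age contains negative values"
assert df['Pclass'].dtype.name == 'category', "Pclass is not categorical"
assert not df.duplicated().any(), "duplicate rows remain"

print("All validation checks passed")
```

Dropping a block like this at the end of a cleaning script turns the manual inspection into a repeatable check that runs every time the pipeline does.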

Below is the complete Python script, which also displays the first five rows of the cleaned data:

# Importing libraries

import pandas as pd


# Loading dataset

df = pd.read_csv('train.csv')


# Filling in missing values

df['Age'] = df['Age'].fillna(df['Age'].median())

df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])


# Converting data types

df['Pclass'] = df['Pclass'].astype('category')


# Removing duplicates

df.drop_duplicates(inplace=True)


# Correcting inconsistencies

df.loc[df['Age'] < 0, 'Age'] = df['Age'].median()


# Checking the cleansed data

print(df.head())


Output:

  PassengerId Survived Pclass  \

0            1        0      3   

1            2        1      1   

2            3        1      3   

3            4        1      1   

4            5        0      3   


                                                Name     Sex   Age  SibSp  \

0                            Braund, Mr. Owen Harris    male  22.0      1   

1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   

2                             Heikkinen, Miss. Laina  female  26.0      0   

3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   

4                           Allen, Mr. William Henry    male  35.0      0   


   Parch            Ticket     Fare Embarked  

0      0         A/5 21171   7.2500        S  

1      0          PC 17599  71.2833        C  

2      0  STON/O2. 3101282   7.9250        S  

3      0            113803  53.1000        S  

4      0            373450   8.0500        S 


As shown above, the missing values in the 'Age' and 'Embarked' columns have been filled in, and the 'Pclass' column has been converted to a categorical data type. Duplicate rows have been removed and any negative values in the 'Age' column corrected. The data is now clean and ready for further analysis.
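Once the data is clean, you will usually want to save it so downstream analysis does not repeat the cleaning every run. A minimal sketch, using a hypothetical output file name and a small stand-in frame:

```python
import os
import tempfile

import pandas as pd

# A small cleaned frame (hypothetical values)
df = pd.DataFrame({
    'PassengerId': [1, 2],
    'Survived': [0, 1],
    'Age': [22.0, 38.0],
})

# Write without the row index, then reload to confirm a clean round trip
path = os.path.join(tempfile.gettempdir(), 'train_clean.csv')
df.to_csv(path, index=False)

check = pd.read_csv(path)
print(check.shape)
```

Passing "index=False" keeps pandas from writing the row index as an extra unnamed column, which would otherwise reappear on every reload.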

Conclusion

In conclusion, we took you through the entire data cleaning process using Python and the "Titanic" dataset from Kaggle. First, we imported the required libraries and loaded the Titanic dataset into Python. Then we conducted data exploration to better understand the dataset and identify diverse problems such as missing and duplicate values. These problems needed to be fixed to clean the dataset. To clean the data, we filled in missing values, converted data types, removed duplicates, and corrected inconsistencies in the data. The final step was to validate the cleaning process to ensure it was successful. To achieve this we checked for data types, missing values, and summarized the data again.

In any data analytics project, data cleaning is a crucial step that guarantees data reliability, accuracy and completeness. By following the guidelines in this article you will be able to clean your data effectively and make it fit for further analysis.
