DEV Community

justkmike

Data Cleaning with Pandas

In this guide, we'll explore various data-cleaning techniques using Python and the Pandas library. We'll cover the inspection methods and attributes head(), tail(), info(), describe(), shape, and size, then demonstrate how to remove empty cells, fix wrong data formats, remove duplicates, and access data. The snippets assume that pandas is imported as pd and that df is an existing DataFrame.

DataFrame Basics

head() and tail()

These methods return the first and last n rows of a DataFrame, respectively. When called without an argument, n defaults to 5.

# Display the first 5 rows
df.head()

# Display the last 5 rows
df.tail()
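
To look at a different number of rows, pass n explicitly. The sketch below uses a small made-up DataFrame (the column names and values are assumptions for illustration only) and also shows that a negative n means "everything except that many rows".

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    "id": range(1, 11),
    "score": [55, 67, 72, 48, 90, 81, 63, 77, 59, 88],
})

first_three = df.head(3)        # first 3 rows
last_two = df.tail(2)           # last 2 rows
all_but_last_two = df.head(-2)  # negative n: every row except the last 2

print(first_three)
print(last_two)
```

Both methods return a new DataFrame, so the result can be stored or chained with other calls.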

info()

info() provides essential information about the DataFrame, including column data types, non-null counts, and memory usage.

df.info()

describe()

describe() returns summary statistics for the numeric columns by default: count, mean, standard deviation, minimum, maximum, and the quartiles. The median is reported as the 50% row.

df.describe()
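
As a rough illustration of what describe() includes, the sketch below runs it on a hypothetical DataFrame with one numeric and one text column (the data is invented for this example). It reads the median from the "50%" row and uses include="all" to bring non-numeric columns into the summary.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    "age": [23, 35, 31, 52, 46],
    "city": ["Nairobi", "Mombasa", "Nairobi", "Kisumu", "Nairobi"],
})

# Default: only numeric columns are summarised
numeric_summary = df.describe()

# The median is the value in the "50%" row
median_age = numeric_summary.loc["50%", "age"]

# include="all" adds non-numeric columns (unique, top, freq rows)
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```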

shape

shape returns the dimensions of the DataFrame as a tuple (number of rows, number of columns).

df.shape

size

size returns the total number of elements in the DataFrame.

df.size
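
The two attributes are related: size equals the number of rows multiplied by the number of columns. A minimal sketch with an invented DataFrame:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

rows, cols = df.shape   # shape is a (rows, columns) tuple
total_cells = df.size   # size counts every cell, including missing ones

print(rows, cols, total_cells)
```

Note that shape and size are attributes, so they are written without parentheses.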

Data Cleaning

Removing Empty Cells

dropna()

dropna() removes rows that contain at least one empty (missing) cell. By default it returns a new DataFrame and leaves the original unchanged. To modify the existing DataFrame instead, pass inplace=True.

# Create a new DataFrame with empty cells removed
new_df = df.dropna()

# Modify the existing DataFrame in-place
df.dropna(inplace=True)
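
Dropping every row with any missing cell can discard a lot of data. The sketch below, built on a hypothetical DataFrame, shows three narrower options that dropna() supports: how="all", subset, and thresh.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    "name": ["Ann", "Ben", None, "Dee"],
    "age": [28, np.nan, np.nan, 41],
    "city": ["Nairobi", "Kisumu", None, None],
})

# Default: drop rows with at least one missing value
any_missing_dropped = df.dropna()

# Drop only rows where every value is missing
all_missing_dropped = df.dropna(how="all")

# Only consider the "age" column when deciding what to drop
has_age = df.dropna(subset=["age"])

# Keep rows that have at least 2 non-missing values
at_least_two = df.dropna(thresh=2)

print(any_missing_dropped)
```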

[fillna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html#pandas.DataFrame.fillna)()

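fillna("Value to replace with") replaces empty cells with the specified value. It also accepts parameters such as axis and limit. The method parameter is deprecated in recent pandas releases; use ffill() or bfill() for forward or backward filling instead.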

# Replace empty cells with a specific value
df.fillna("Replacement Value", inplace=True)
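
A single replacement value rarely suits every column. One common approach, sketched here on invented data, is to pass fillna() a dictionary that maps column names to replacement values; the choice of the mean for "age" and "Unknown" for "city" is only an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    "age": [28, np.nan, 41],
    "city": ["Nairobi", None, "Kisumu"],
})

# Use a different replacement for each column
filled = df.fillna({"age": df["age"].mean(), "city": "Unknown"})

# Forward fill: copy the previous valid value down into the gap
age_forward_filled = df["age"].ffill()

print(filled)
```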

Handling Wrong Data Formats

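Columns sometimes load with the wrong type, for example dates stored as plain strings. To convert a column named "date" to datetime format (with pandas imported as pd):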

df["date"] = pd.to_datetime(df["date"])
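
By default, pd.to_datetime() raises an error when it meets a value it cannot parse. The sketch below assumes a hypothetical DataFrame containing some bad entries and uses errors="coerce" so that unparseable values become missing values (NaT or NaN), which can then be handled with dropna() or fillna(). The "amount" column and pd.to_numeric() are included as an assumption to show the same idea for numbers.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    "date": ["2023-12-23", "2023-12-24", "not a date"],
    "amount": ["10", "20.5", "n/a"],
})

# Unparseable dates become NaT instead of raising an error
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Same idea for numbers stored as text: bad values become NaN
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

print(df.dtypes)
print(df)
```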

Removing Duplicates

To identify and remove duplicate rows:

duplicated()

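duplicated() returns a Boolean Series indicating whether each row is a duplicate (True) or not (False). By default, the first occurrence of a repeated row is marked False and only the later repeats are marked True.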

duplicate_rows = df.duplicated()

drop_duplicates()

drop_duplicates() removes duplicate rows, keeping the first occurrence by default. It returns a new DataFrame unless you pass inplace=True to modify the existing one.

df.drop_duplicates(inplace=True)
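
Both methods can also compare only selected columns and choose which occurrence to keep. The following sketch uses invented data to show the Boolean mask from duplicated(), a duplicate count, and drop_duplicates() with the subset and keep parameters.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Ann", "Ann"],
    "city": ["Nairobi", "Kisumu", "Nairobi", "Mombasa"],
})

# True only for rows that repeat an earlier row in full
dup_mask = df.duplicated()
duplicate_count = int(dup_mask.sum())

# Remove fully identical rows, keeping the first occurrence
deduped = df.drop_duplicates()

# Compare on "name" only and keep the last occurrence of each name
latest_per_name = df.drop_duplicates(subset=["name"], keep="last")

print(dup_mask)
print(latest_per_name)
```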

Accessing Data in a DataFrame

at and iat

at

at gets or sets a single value, identified by its row label and column label.

# Get the value at row label 2, column "name"
value = df.at[2, "name"]

# Assign a new value to the selected element
df.at[2, "name"] = "Justkmike"

iat

iat gets or sets a single value by integer position. Row and column positions are zero-based.

# Get the value in the second row, third column (positions 1 and 2)
value = df.iat[1, 2]

# Update the value at the same position
df.iat[1, 2] = 10
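
The difference between labels and positions is easiest to see when the index is not the default 0, 1, 2. In the sketch below, the DataFrame and its index values (10, 20, 30) are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data with a non-default index
df = pd.DataFrame(
    {"name": ["Ann", "Ben", "Cy"], "age": [28, 35, 41]},
    index=[10, 20, 30],
)

# at uses labels: row label 20, column label "name"
by_label = df.at[20, "name"]

# iat uses positions: second row (1), first column (0)
by_position = df.iat[1, 0]

# Both can also assign a single value
df.at[30, "age"] = 42   # row label 30
df.iat[0, 1] = 29       # first row, second column

print(by_label, by_position)
print(df)
```

Here df.at[1, "name"] would raise a KeyError, because 1 is a position and not one of the row labels.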

[loc](https://www.statology.org/pandas-loc-vs-iloc/) and iloc

loc

loc selects rows, and optionally columns, by label.

# Select a row with the index label "12-23-23"
selected_row = df.loc["12-23-23"]

iloc

iloc selects rows and columns by integer position. Slices exclude the end position, so 0:2 covers positions 0 and 1.

# Select the first two rows and the first two columns
selected_data = df.iloc[0:2, 0:2]
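
One difference that often causes confusion is slicing: loc includes the end label, while iloc excludes the end position. The sketch below uses a hypothetical DataFrame indexed by date strings in the same style as the "12-23-23" label above, and also shows loc with a Boolean condition.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame(
    {"sales": [100, 150, 120, 90], "returns": [5, 3, 8, 2]},
    index=["12-21-23", "12-22-23", "12-23-23", "12-24-23"],
)

# A single label returns that row as a Series
selected_row = df.loc["12-23-23"]

# Label slice: the end label is included (3 rows)
label_slice = df.loc["12-21-23":"12-23-23", "sales"]

# Position slice: the end position is excluded (2 rows, 2 columns)
position_slice = df.iloc[0:2, 0:2]

# loc also accepts a Boolean condition to filter rows
high_sales = df.loc[df["sales"] > 100, ["sales"]]

print(label_slice)
print(position_slice)
print(high_sales)
```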

Remember that this is just the tip of the iceberg regarding Pandas. There are many more operations and functions available for data manipulation. If you encounter any issues or need further assistance, feel free to contact mwkariuki2e@gmail.com. Stay tuned for our next guide on data visualization. See you! 😊
