DEV Community

Piyush Raj

Pandas - Basic Exploratory Data Analysis - 7 Days of Pandas

Welcome to the third article in the "7 Days of Pandas" series, which covers pandas, the Python library for data manipulation.

In the first article of the series, we looked at how to read and write CSV files with Pandas.
In the second article, we looked at how to perform basic data manipulation.
In this tutorial, we will look at some of the common operations that we perform on a dataframe during the exploratory data analysis (EDA) phase.

Exploratory data analysis (EDA) helps us better understand the data at hand. In this phase, we use descriptive statistics and visualizations to find patterns in the data and to spot problems with it early.

The pandas library comes with a number of useful functions that help us explore the data. In this tutorial, we will cover the following topics:

  1. Get the first and the last N rows of a dataframe.
  2. Get a summary of the dataframe with the info() function.
  3. Get descriptive statistics with the describe() function.

Before we begin, let's first import pandas and create a sample dataframe that we will be using throughout this tutorial.

import pandas as pd

# employee data
data = {
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Age": [26, 28, 27, 32, 24, 31, 33],
    "Department": ["Marketing", "Product", "Product", "HR", "Product", "HR", "Marketing"],
    "Salary": [60000, 70000, 82000, 55000, 58000, 55000, 65000]
}

# create pandas dataframe
df = pd.DataFrame(data)

# display the dataframe
df

    Name  Age Department  Salary
0    Tim   26  Marketing   60000
1  Shaym   28    Product   70000
2   Noor   27    Product   82000
3   Esha   32         HR   55000
4    Sam   24    Product   58000
5  James   31         HR   55000
6   Lily   33  Marketing   65000

We now have a dataframe with information about seven employees in an office.

Get the first and the last N rows of a dataframe

After loading or creating a dataframe, a good first step is to look at the first few rows to check that the data is as expected and that there are no obvious issues with it (for example, missing fields).

You can use the pandas dataframe head() function to get the first n rows of the dataframe. Pass the number of rows you want from the top as an argument. By default, n is 5.

# get the first five rows
df.head(5)

    Name  Age Department  Salary
0    Tim   26  Marketing   60000
1  Shaym   28    Product   70000
2   Noor   27    Product   82000
3   Esha   32         HR   55000
4    Sam   24    Product   58000

Similarly, you can use the pandas dataframe tail() function to get the last n rows of the dataframe. Pass the number of rows you want from the bottom as an argument. By default, n is 5.

# get the last five rows
df.tail(5)

    Name  Age Department  Salary
2   Noor   27    Product   82000
3   Esha   32         HR   55000
4    Sam   24    Product   58000
5  James   31         HR   55000
6   Lily   33  Marketing   65000
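
As a small sketch of how the n argument behaves, the snippet below rebuilds the same sample dataframe and asks for a different number of rows. The variable names (first_two, last_three, all_but_last_two) are only illustrative. It also shows that head() accepts a negative n, which returns every row except the last |n| rows.

```python
import pandas as pd

# same sample employee data as in the tutorial
data = {
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Age": [26, 28, 27, 32, 24, 31, 33],
    "Department": ["Marketing", "Product", "Product", "HR", "Product", "HR", "Marketing"],
    "Salary": [60000, 70000, 82000, 55000, 58000, 55000, 65000],
}
df = pd.DataFrame(data)

# first two rows
first_two = df.head(2)

# last three rows
last_three = df.tail(3)

# negative n: all rows except the last two
all_but_last_two = df.head(-2)

print(first_two)
print(last_three)
print(all_but_last_two)
```

Note that head() and tail() return new dataframes and keep the original index labels, so the last three rows are still labelled 4, 5 and 6.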

Use the info() function

You can use the pandas dataframe info() function to get a concise summary of the dataframe. It gives information such as the column dtypes, count of non-null values in each column, the memory usage of the dataframe, etc.

# summary of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        7 non-null      object
 1   Age         7 non-null      int64 
 2   Department  7 non-null      object
 3   Salary      7 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 352.0+ bytes
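
The info() function prints its summary rather than returning it. If you need the same facts as values you can work with, one possible approach (a sketch, not the only way) is to read them from the shape and dtypes attributes and from isna(). The variable names below are illustrative.

```python
import pandas as pd

# same sample employee data as in the tutorial
data = {
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Age": [26, 28, 27, 32, 24, 31, 33],
    "Department": ["Marketing", "Product", "Product", "HR", "Product", "HR", "Marketing"],
    "Salary": [60000, 70000, 82000, 55000, 58000, 55000, 65000],
}
df = pd.DataFrame(data)

# (number of rows, number of columns)
shape = df.shape

# dtype of each column, as a Series indexed by column name
dtypes = df.dtypes

# number of missing values in each column
missing = df.isna().sum()

print(shape)
print(dtypes)
print(missing)
```

The exact dtype reported for the text columns can differ between pandas versions, so treat the printed dtypes as dependent on your installation.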

Get descriptive statistics with the describe() function

The pandas dataframe describe() function returns some descriptive statistics for a dataframe. For example, for numerical columns, it returns the count, mean, standard deviation, min, max, percentile values, etc.

# get dataframe's descriptive statistics
df.describe()

             Age        Salary
count   7.000000      7.000000
mean   28.714286  63571.428571
std     3.352327   9778.499252
min    24.000000  55000.000000
25%    26.500000  56500.000000
50%    28.000000  60000.000000
75%    31.500000  67500.000000
max    33.000000  82000.000000

Note that, by default, the pandas dataframe describe() function includes only the numeric columns when generating the dataframe’s description.

You can, however, use the include parameter to specify which column types (or all the columns) to generate statistics for.

# get descriptive statistics for the object type columns
df.describe(include='object')

       Name Department
count     7          7
unique    7          3
top     Tim    Product
freq      1          3

For object type columns, we get the information about the count, number of unique values, top (the most frequent value), and freq (the count of the most frequent value in the column).
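
To illustrate the "all the columns" case, the sketch below passes include="all" so that numeric and non-numeric columns appear in one table. Statistics that do not apply to a column show up as NaN. It also uses the percentiles parameter of describe(), which is not covered above, to request the 10th and 90th percentiles. The variable names are illustrative.

```python
import pandas as pd

# same sample employee data as in the tutorial
data = {
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Age": [26, 28, 27, 32, 24, 31, 33],
    "Department": ["Marketing", "Product", "Product", "HR", "Product", "HR", "Marketing"],
    "Salary": [60000, 70000, 82000, 55000, 58000, 55000, 65000],
}
df = pd.DataFrame(data)

# statistics for every column, whatever its type
summary_all = df.describe(include="all")

# numeric columns only, with custom percentiles
summary_tails = df.describe(percentiles=[0.1, 0.9])

print(summary_all)
print(summary_tails)
```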

These descriptive statistics give us valuable insights into the distribution of the data in different columns.
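
As one possible follow-up, and going slightly beyond the three functions covered here, the sketch below looks at the distribution of a single column with value_counts() and compares the mean salary across departments with groupby(). The variable names are illustrative.

```python
import pandas as pd

# same sample employee data as in the tutorial
data = {
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Age": [26, 28, 27, 32, 24, 31, 33],
    "Department": ["Marketing", "Product", "Product", "HR", "Product", "HR", "Marketing"],
    "Salary": [60000, 70000, 82000, 55000, 58000, 55000, 65000],
}
df = pd.DataFrame(data)

# number of employees in each department
dept_counts = df["Department"].value_counts()

# mean salary in each department
avg_salary = df.groupby("Department")["Salary"].mean()

print(dept_counts)
print(avg_salary)
```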
