Arum Puri

Posted on Sep 17, 2024 • Edited on Nov 7, 2024

Mastering Pandas in Python: A Beginner's Guide to Data Analysis

#pandas #datascience #machinelearning #beginners

In today’s data-driven world, the ability to efficiently clean and analyze large datasets is a key skill. This is where Pandas, one of Python’s most powerful libraries, comes into play. Whether you're handling time series data, numerical data, or categorical data, Pandas provides you with tools that make data manipulation easy and intuitive. Let's jump into Pandas and see how it can transform your approach to data analysis.

Installing pandas

To start using Pandas, you’ll need to install it. Like any other Python library, Pandas can be installed via pip by running the following command:

pip install pandas
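To confirm the installation worked, a quick check like the one below can help. It is only a minimal sketch: it imports the library under the usual pd alias and prints whichever version is installed on your machine.

```python
import pandas as pd

# If the import succeeds, pandas is installed; the printed version depends on your setup
print(pd.__version__)
```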

Pandas Data Structures

Pandas has two core data structures: Series and DataFrame. They provide a solid foundation for a wide variety of data tasks.

1. Series

According to the pandas documentation, a Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).

import pandas as pd

# General form: pd.Series(data, index=index), for example:
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Creating a Series from a list
data = pd.Series([10, 20, 30, 40])

# Creating a Series from a dictionary
data_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})
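Because a Series is labeled, you can read values by label as well as by position. The sketch below reuses the dictionary-based Series from above; the variable names (scores, by_label, by_position, doubled) are chosen here for illustration only.

```python
import pandas as pd

# The dictionary keys become the index labels
scores = pd.Series({'a': 10, 'b': 20, 'c': 30})

by_label = scores['b']         # label-based access
by_position = scores.iloc[0]   # position-based access
doubled = scores * 2           # arithmetic is applied element-wise and keeps the labels

print(by_label, by_position)
print(doubled)
```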

2. DataFrame

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different value types (numeric, string, Boolean, etc.). You can think of it like a spreadsheet, a SQL table, or a dict of Series objects.

import pandas as pd

data = {
    'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley', 'Draco Malfoy', 'Luna Lovegood'],
    'House': ['Gryffindor', 'Gryffindor', 'Gryffindor', 'Slytherin', 'Ravenclaw'],
    'Patronus': ['Stag', 'Otter', 'Jack Russell Terrier', 'None', 'Hare'],
    'Favorite Subject': ['Defense Against the Dark Arts', 'Arithmancy', 'Divination', 'Potions', 'Charms'],
    'Quidditch Position': ['Seeker', 'None', 'Keeper', 'None', 'None'],
    'OWL Scores': [7, 11, 7, 8, 9]
}

df = pd.DataFrame(data)
print(df)
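The "dict of Series" idea is easy to see in practice: selecting a single column gives you back a Series. The following is a small sketch using a trimmed-down version of the data above (three rows, two columns) to keep it short.

```python
import pandas as pd

# A smaller frame with two of the columns from the example above
df = pd.DataFrame({
    'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley'],
    'OWL Scores': [7, 11, 7],
})

scores = df['OWL Scores']       # selecting one column returns a Series
print(type(scores).__name__)    # class name of the selected column
print(df.shape)                 # (number of rows, number of columns)
print(list(df.columns))         # column labels
```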

Data Manipulation with Pandas

Once you have your data in a DataFrame, Pandas provides powerful methods to explore, clean, and transform it. Let’s start with some of the most commonly used methods for exploring data.

1. Exploring Data

head()

The head() method returns the first rows of the DataFrame along with the column headers. It shows five rows by default, but you can pass a different number.

>>> df.head(3)
              Name       House             Patronus                  Favorite Subject Quidditch Position  OWL Scores
0     Harry Potter  Gryffindor                  Stag   Defense Against the Dark Arts            Seeker           7
1  Hermione Granger  Gryffindor                Otter                        Arithmancy               None          11
2      Ron Weasley  Gryffindor  Jack Russell Terrier                      Divination             Keeper           7

tail()

The tail() method returns the headers and a specified number of rows, starting from the bottom.

>>> df.tail(2)
              Name       House  Patronus Favorite Subject Quidditch Position  OWL Scores
3     Draco Malfoy   Slytherin      None           Potions               None           8
4    Luna Lovegood  Ravenclaw      Hare             Charms               None           9

info()

A DataFrame has a method called info() that prints a summary of the dataset: the index range, each column's name, non-null count, and dtype, plus the memory usage.

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Name               5 non-null      object
 1   House              5 non-null      object
 2   Patronus           5 non-null      object
 3   Favorite Subject   5 non-null      object
 4   Quidditch Position 5 non-null      object
 5   OWL Scores         5 non-null      int64 
dtypes: int64(1), object(5)
memory usage: 368.0 bytes

describe()

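The describe() method gives summary statistics for the dataset: count, mean, standard deviation, min, max, and the 25%, 50%, and 75% percentiles. By default it covers only numeric columns, which is why only OWL Scores appears in the output.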

>>> df.describe()
       OWL Scores
count    5.000000
mean     8.400000
std      1.673320
min      7.000000
25%      7.000000
50%      8.000000
75%      9.000000
max     11.000000
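If you want to double-check individual numbers from describe(), the same statistics are available as separate Series methods. This is a small sketch using only the OWL scores from the example; the name owl is just a placeholder.

```python
import pandas as pd

owl = pd.Series([7, 11, 7, 8, 9], name='OWL Scores')

print(owl.mean())          # arithmetic mean
print(owl.std())           # sample standard deviation (pandas uses ddof=1 by default)
print(owl.quantile(0.5))   # 50th percentile, i.e. the median
```

These values should line up with the mean, std, and 50% rows of the describe() output.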

2. Filtering

In data analysis, filtering helps you narrow down the data you're interested in. Pandas has several ways to filter data. The simplest is direct Boolean indexing, which keeps the rows that satisfy a condition (for example, a condition on a column's values). Let’s look at a few examples. In the first example, we’re selecting rows where the House value is Gryffindor:

import pandas as pd

data = {
    'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley', 'Draco Malfoy', 'Luna Lovegood'],
    'House': ['Gryffindor', 'Gryffindor', 'Gryffindor', 'Slytherin', 'Ravenclaw'],
    'Patronus': ['Stag', 'Otter', 'Jack Russell Terrier', 'None', 'Hare'],
    'Favorite Subject': ['Defense Against the Dark Arts', 'Arithmancy', 'Divination', 'Potions', 'Charms'],
    'Quidditch Position': ['Seeker', 'None', 'Keeper', 'None', 'None'],
    'OWL Scores': [7, 11, 7, 8, 9]
}

df = pd.DataFrame(data)

# Filter rows where the House is Gryffindor
gryffindor_students = df[df['House'] == 'Gryffindor']
print(gryffindor_students)

output

               Name       House             Patronus                  Favorite Subject Quidditch Position  OWL Scores
0     Harry Potter  Gryffindor                  Stag   Defense Against the Dark Arts            Seeker           7
1  Hermione Granger  Gryffindor                Otter                        Arithmancy               None          11
2      Ron Weasley  Gryffindor  Jack Russell Terrier                      Divination             Keeper           7

In the second example, we’re filtering data where the OWL score (think of it as a magical equivalent to the SAT in the Harry Potter world) is greater than 8:

# Filter students with OWL Scores greater than 8
high_scorers = df[df['OWL Scores'] > 8]
print(high_scorers)

output

               Name       House Patronus Favorite Subject Quidditch Position  OWL Scores
1  Hermione Granger  Gryffindor    Otter       Arithmancy               None          11
4    Luna Lovegood  Ravenclaw     Hare           Charms               None           9
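Conditions can also be combined. The sketch below, which uses a three-column subset of the same data, shows two common patterns: joining conditions with & and matching against a list with isin(). The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley', 'Draco Malfoy', 'Luna Lovegood'],
    'House': ['Gryffindor', 'Gryffindor', 'Gryffindor', 'Slytherin', 'Ravenclaw'],
    'OWL Scores': [7, 11, 7, 8, 9],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
gryffindor_high = df[(df['House'] == 'Gryffindor') & (df['OWL Scores'] > 8)]

# isin() keeps rows whose value matches any item in the list
other_houses = df[df['House'].isin(['Slytherin', 'Ravenclaw'])]

print(gryffindor_high)
print(other_houses)
```

Note that Python's and / or keywords do not work here; element-wise comparisons need the & and | operators.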

Another way to filter data is with .loc. It selects by label or by Boolean condition, for both rows and columns. If a label you ask for doesn’t exist, it raises a KeyError:

# Use .loc[] to filter students who scored more than 8 OWLs
high_owl_scores_loc = df.loc[df['OWL Scores'] > 8]
print(high_owl_scores_loc)

output

              Name       House Patronus Favorite Subject Quidditch Position  OWL Scores
1  Hermione Granger  Gryffindor    Otter        Arithmancy               None         11
4    Luna Lovegood  Ravenclaw     Hare           Charms               None          9

At first glance, this looks the same as direct Boolean indexing, but there’s a key difference: .loc lets you select rows and columns in one step, while plain Boolean indexing filters rows only:

# Use .loc[] to filter and select specific columns
gryffindor_students = df.loc[df['House'] == 'Gryffindor', ['Name', 'OWL Scores']]
print(gryffindor_students)

output

            Name  OWL Scores
0   Harry Potter           7
1  Hermione Granger       11
2   Ron Weasley            7
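The KeyError mentioned earlier is easy to reproduce. In this sketch, 'Wand' is a made-up column name that does not exist in the DataFrame, and the flag missing_label_raised exists only to record what happened.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Harry Potter', 'Draco Malfoy'],
    'House': ['Gryffindor', 'Slytherin'],
})

try:
    # 'Wand' is a hypothetical column that is not in the DataFrame
    df.loc[:, ['Name', 'Wand']]
    missing_label_raised = False
except KeyError as err:
    missing_label_raised = True
    print('KeyError:', err)
```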

Finally, we have .iloc. It is used for position-based selection, meaning you select rows and columns by their integer positions rather than their labels:

third_character = df.iloc[2]
print(third_character)

output

Name                           Ron Weasley
House                           Gryffindor
Patronus              Jack Russell Terrier
Favorite Subject                Divination
Quidditch Position                  Keeper
OWL Scores                               7
Name: 2, dtype: object

You can also pass lists of positions. Here we select the first and last rows (positions 0 and 4) for the columns "House" and "OWL Scores" (positions 1 and 5):

first_last_info = df.iloc[[0, 4], [1, 5]]
print(first_last_info)

output

        House  OWL Scores
0  Gryffindor           7
4  Ravenclaw            9
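The difference between labels and positions matters most after filtering, because a filtered DataFrame keeps its original index labels. The sketch below uses a two-column subset of the data; the variable names are placeholders.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley', 'Draco Malfoy', 'Luna Lovegood'],
    'OWL Scores': [7, 11, 7, 8, 9],
})

high_scorers = df[df['OWL Scores'] > 8]    # keeps the original labels 1 and 4

first_by_position = high_scorers.iloc[0]   # first row of the filtered frame
first_by_label = high_scorers.loc[1]       # row whose index label is 1

# iloc slices exclude the end position, like Python list slices
first_two = df.iloc[0:2]

print(first_by_position['Name'], first_by_label['Name'])
print(first_two)
```

Here high_scorers.iloc[0] and high_scorers.loc[1] point at the same row, while high_scorers.loc[0] would raise a KeyError because label 0 was filtered out.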

3. Sorting

Sorting data with pandas is straightforward and can be done using the sort_values() method. For example, you can sort the students by their OWL scores in ascending order:

# Sort by 'OWL Scores' in ascending order (default)
sorted_by_owl = df.sort_values(by='OWL Scores')
print(sorted_by_owl)

output:

              Name       House              Patronus                  Favorite Subject Quidditch Position  OWL Scores
0     Harry Potter  Gryffindor                   Stag  Defense Against the Dark Arts            Seeker            7
2      Ron Weasley  Gryffindor   Jack Russell Terrier                    Divination            Keeper            7
3      Draco Malfoy  Slytherin                  None                         Potions               None           8
4    Luna Lovegood  Ravenclaw                   Hare                           Charms               None          9
1  Hermione Granger  Gryffindor                   Otter                    Arithmancy               None         11

To sort in descending order, set the ascending parameter to False:

# Sort by 'OWL Scores' in descending order
sorted_by_owl_desc = df.sort_values(by='OWL Scores', ascending=False)
print(sorted_by_owl_desc)

output:

              Name       House              Patronus                  Favorite Subject Quidditch Position  OWL Scores
1  Hermione Granger  Gryffindor                   Otter                    Arithmancy               None         11
4    Luna Lovegood  Ravenclaw                   Hare                           Charms               None          9
3      Draco Malfoy  Slytherin                  None                         Potions               None           8
0     Harry Potter  Gryffindor                   Stag  Defense Against the Dark Arts            Seeker            7
2      Ron Weasley  Gryffindor   Jack Russell Terrier                    Divination            Keeper            7

One of the powerful features of sort_values() is that it allows you to sort by multiple columns. In the example below, students are sorted first by their OWL scores and then by their house:

# Sort by 'OWL Scores' first in descending order, then by 'House' in ascending order
sorted_by_owl_first = df.sort_values(by=['OWL Scores', 'House'], ascending=[False, True])
print(sorted_by_owl_first)

output:

              Name       House              Patronus                  Favorite Subject Quidditch Position  OWL Scores
1  Hermione Granger  Gryffindor                   Otter                    Arithmancy               None         11
4    Luna Lovegood  Ravenclaw                   Hare                           Charms               None          9
3      Draco Malfoy  Slytherin                  None                         Potions               None           8
0     Harry Potter  Gryffindor                   Stag  Defense Against the Dark Arts            Seeker            7
2      Ron Weasley  Gryffindor   Jack Russell Terrier                    Divination            Keeper            7

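In this case, the OWL score is the primary criterion for sorting, meaning pandas will prioritize it. If two students have the same OWL score, the house value is used as the secondary criterion. In this dataset the only tie is between Harry and Ron (both scored 7), and both are in Gryffindor, so the result matches the single-column descending sort.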
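To see the secondary criterion actually change the order, one option is to swap the keys so that House comes first. The following is a sketch on a three-column subset of the data; reset_index(drop=True) is an optional extra step that renumbers the rows after sorting.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley', 'Draco Malfoy', 'Luna Lovegood'],
    'House': ['Gryffindor', 'Gryffindor', 'Gryffindor', 'Slytherin', 'Ravenclaw'],
    'OWL Scores': [7, 11, 7, 8, 9],
})

# Primary key: House (A to Z); secondary key: OWL Scores (highest first)
by_house_then_score = df.sort_values(by=['House', 'OWL Scores'], ascending=[True, False])

# Renumber the rows 0..n-1 so the index reflects the new order
by_house_then_score = by_house_then_score.reset_index(drop=True)
print(by_house_then_score)
```

Within Gryffindor, Hermione now comes first because her score is the highest in that house.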

Exploring, filtering, and sorting data are essential first steps before jumping into tasks like data cleaning or wrangling in the data analysis process. Pandas offers a range of built-in methods that help organize and accelerate these operations. Additionally, Pandas integrates well with other libraries, such as NumPy or SciPy for numerical computations, Matplotlib for data visualization, and analytical tools like Statsmodels and Scikit-learn. By learning Pandas, you can significantly boost your efficiency in handling and analyzing data, making it a valuable skill for any data professional. Happy coding!

DEV Community
