Arum Puri


Mastering Pandas in Python: A Beginner's Guide to Data Analysis

In today’s data-driven world, the ability to efficiently clean and analyze large datasets is a key skill. This is where Pandas, one of Python’s most powerful libraries, comes into play. Whether you're handling time series data, numerical data, or categorical data, Pandas provides you with tools that make data manipulation easy and intuitive. Let's jump into Pandas and see how it can transform your approach to data analysis.

Installing pandas

To start using Pandas, you’ll need to install it. Like any other Python library, Pandas can be installed via pip by running the following command:

pip install pandas
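
To confirm the installation worked, one quick check is to import the library and print its version. This is only a minimal sketch; the variable name installed_version is illustrative, and the exact version string depends on your environment.

```python
import pandas as pd

# pd.__version__ is a plain string; its value depends on what pip installed
installed_version = pd.__version__
print(installed_version)
```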

Pandas Data Structures

Pandas has two core data structures, Series and DataFrame. Together they provide a solid foundation for a wide variety of data tasks.

1. Series

According to the pandas documentation, a Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).

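import pandas as pd

# General form: pd.Series(data, index=index)

# Creating a Series from a list (the index defaults to 0, 1, 2, 3)
data = pd.Series([10, 20, 30, 40])

# Creating a Series from a list with an explicit index
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Creating a Series from a dictionary (the keys become the index)
data_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})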

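To make the "labeled array" idea more concrete, here is a minimal sketch of what the labels allow. The Series name scores and its values are made up for illustration.

```python
import pandas as pd

# A Series with explicit string labels (illustrative values)
scores = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Look up a single value by its label
b_value = scores['b']

# Arithmetic is applied element by element and keeps the labels
doubled = scores * 2

# A Boolean condition selects only the matching entries
large = scores[scores > 15]

print(b_value)
print(doubled)
print(large)
```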

2. DataFrame

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different value types (numeric, string, Boolean, etc.). You can think of it like a spreadsheet, a SQL table, or a dict of Series objects.

import pandas as pd

data = {
    'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley', 'Draco Malfoy', 'Luna Lovegood'],
    'House': ['Gryffindor', 'Gryffindor', 'Gryffindor', 'Slytherin', 'Ravenclaw'],
    'Patronus': ['Stag', 'Otter', 'Jack Russell Terrier', 'None', 'Hare'],
    'Favorite Subject': ['Defense Against the Dark Arts', 'Arithmancy', 'Divination', 'Potions', 'Charms'],
    'Quidditch Position': ['Seeker', 'None', 'Keeper', 'None', 'None'],
    'OWL Scores': [7, 11, 7, 8, 9]
}

df = pd.DataFrame(data)
print(df)

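The "dict of Series" description can be checked directly: selecting one column gives back a Series. The sketch below uses a trimmed, three-row copy of the example data so it can run on its own.

```python
import pandas as pd

# A trimmed copy of the example data (two columns, three rows)
df = pd.DataFrame({
    'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley'],
    'OWL Scores': [7, 11, 7],
})

# Selecting one column returns a Series
owl = df['OWL Scores']
print(type(owl))

# shape is (rows, columns); columns holds the column labels
print(df.shape)
print(list(df.columns))

# Series methods work on a column, for example the mean score
print(owl.mean())
```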


Data Manipulation with Pandas

Once you have your data in a DataFrame, Pandas provides powerful methods to explore, clean, and transform it. Let’s start with some of the most commonly used methods for exploring data.

1. Exploring Data

  • head()

The head() method returns the column headers and a specified number of rows, starting from the top. By default it returns five rows, but you can pass a different number.

>>> df.head(3)
               Name       House              Patronus               Favorite Subject  Quidditch Position  OWL Scores
0      Harry Potter  Gryffindor                  Stag  Defense Against the Dark Arts              Seeker           7
1  Hermione Granger  Gryffindor                 Otter                     Arithmancy                None          11
2       Ron Weasley  Gryffindor  Jack Russell Terrier                     Divination              Keeper           7

  • tail()

The tail() method returns the headers and a specified number of rows, starting from the bottom.

>>> df.tail(2)
            Name      House  Patronus  Favorite Subject  Quidditch Position  OWL Scores
3   Draco Malfoy  Slytherin      None           Potions                None           8
4  Luna Lovegood  Ravenclaw      Hare            Charms                None           9

  • info()

A DataFrame has an info() method that summarizes the dataset: the index range, each column's name, non-null count, and dtype, and the memory usage.

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Name               5 non-null      object
 1   House              5 non-null      object
 2   Patronus           5 non-null      object
 3   Favorite Subject   5 non-null      object
 4   Quidditch Position 5 non-null      object
 5   OWL Scores         5 non-null      int64 
dtypes: int64(1), object(5)
memory usage: 368.0 bytes

  • describe()

The describe() method returns summary statistics for the numeric columns of the dataset: count, mean, standard deviation, min, max, and the 25%, 50%, and 75% quartiles.

>>> df.describe()
       OWL Scores
count    5.000000
mean     8.400000
std      1.673320
min      7.000000
25%      7.000000
50%      8.000000
75%      9.000000
max     11.000000

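The output above covers only OWL Scores because, by default, describe() on a mixed DataFrame summarizes the numeric columns. For a text column such as House, a rough sketch of two alternatives is to call describe() on the column itself or to count values with value_counts(). The frame below is a trimmed copy of the example data.

```python
import pandas as pd

df = pd.DataFrame({
    'House': ['Gryffindor', 'Gryffindor', 'Gryffindor', 'Slytherin', 'Ravenclaw'],
    'OWL Scores': [7, 11, 7, 8, 9],
})

# For a text column, describe() reports count, unique, top and freq
house_summary = df['House'].describe()
print(house_summary)

# value_counts() shows how often each value occurs
house_counts = df['House'].value_counts()
print(house_counts)
```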

2. Filtering

In data analysis, filtering helps you narrow down the data you're interested in. Pandas has several ways to filter data. The simplest is direct Boolean indexing, which keeps the rows that satisfy a condition (e.g., a condition on a column's values). Let’s look at a few examples. In the first example, we’re selecting rows where the House value is Gryffindor:

import pandas as pd

data = {
    'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley', 'Draco Malfoy', 'Luna Lovegood'],
    'House': ['Gryffindor', 'Gryffindor', 'Gryffindor', 'Slytherin', 'Ravenclaw'],
    'Patronus': ['Stag', 'Otter', 'Jack Russell Terrier', 'None', 'Hare'],
    'Favorite Subject': ['Defense Against the Dark Arts', 'Arithmancy', 'Divination', 'Potions', 'Charms'],
    'Quidditch Position': ['Seeker', 'None', 'Keeper', 'None', 'None'],
    'OWL Scores': [7, 11, 7, 8, 9]
}

df = pd.DataFrame(data)


# Filter rows where the House is Gryffindor
gryffindor_students = df[df['House'] == 'Gryffindor']
print(gryffindor_students)


output

               Name       House              Patronus               Favorite Subject  Quidditch Position  OWL Scores
0      Harry Potter  Gryffindor                  Stag  Defense Against the Dark Arts              Seeker           7
1  Hermione Granger  Gryffindor                 Otter                     Arithmancy                None          11
2       Ron Weasley  Gryffindor  Jack Russell Terrier                     Divination              Keeper           7


In the second example, we’re filtering data where the OWL score (think of it as a magical equivalent to the SAT in the Harry Potter world) is greater than 8:

# Filter students with OWL Scores greater than 8
high_scorers = df[df['OWL Scores'] > 8]
print(high_scorers)


output

               Name       House  Patronus  Favorite Subject  Quidditch Position  OWL Scores
1  Hermione Granger  Gryffindor     Otter        Arithmancy                None          11
4     Luna Lovegood   Ravenclaw      Hare            Charms                None           9

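Boolean conditions can also be combined. The sketch below is an illustrative extension of the two examples above, using a trimmed copy of the data. It relies on the element-wise operators & (and), | (or), and ~ (not), plus the isin() method. The result variable names are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley', 'Draco Malfoy', 'Luna Lovegood'],
    'House': ['Gryffindor', 'Gryffindor', 'Gryffindor', 'Slytherin', 'Ravenclaw'],
    'OWL Scores': [7, 11, 7, 8, 9],
})

# Combine conditions with &; each condition needs its own parentheses
gryffindor_high = df[(df['House'] == 'Gryffindor') & (df['OWL Scores'] > 8)]

# isin() keeps rows whose value appears in the given list
not_gryffindor = df[df['House'].isin(['Slytherin', 'Ravenclaw'])]

# ~ negates a condition: scores that are NOT greater than 8
low_scorers = df[~(df['OWL Scores'] > 8)]

print(gryffindor_high)
print(not_gryffindor)
print(low_scorers)
```

The parentheses matter because & and | bind more tightly than comparison operators such as == and >.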

Another way to filter data is the .loc indexer. It lets you select by Boolean conditions and by labels, for both rows and columns. If a label you ask for doesn’t exist, it raises a KeyError:

# Use .loc[] to filter students who scored more than 8 OWLs
high_owl_scores_loc = df.loc[df['OWL Scores'] > 8]
print(high_owl_scores_loc)



output

               Name       House  Patronus  Favorite Subject  Quidditch Position  OWL Scores
1  Hermione Granger  Gryffindor     Otter        Arithmancy                None          11
4     Luna Lovegood   Ravenclaw      Hare            Charms                None           9


At first glance, this may look like direct Boolean indexing. Still, there’s a key difference: .loc provides finer control, letting you select both rows and columns simultaneously, while Boolean indexing primarily filters rows:

# Use .loc[] to filter and select specific columns
gryffindor_students = df.loc[df['House'] == 'Gryffindor', ['Name', 'OWL Scores']]
print(gryffindor_students)


output

               Name  OWL Scores
0      Harry Potter           7
1  Hermione Granger          11
2       Ron Weasley           7

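Label-based selection is easier to see when the row labels are meaningful. The following sketch is an assumption-laden illustration: it uses set_index() to turn Name into the row labels, which the examples above do not do, and then shows both a successful lookup and the KeyError for a missing label. The name "Neville Longbottom" is simply a label that is not in the data.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley'],
    'House': ['Gryffindor', 'Gryffindor', 'Gryffindor'],
    'OWL Scores': [7, 11, 7],
})

# Use the names as row labels so .loc can look rows up by name
by_name = df.set_index('Name')

# A row label and a column label together return a single value
hermione_score = by_name.loc['Hermione Granger', 'OWL Scores']
print(hermione_score)

# A label that is not in the index raises KeyError
try:
    by_name.loc['Neville Longbottom']
    missing_label_raised = False
except KeyError:
    missing_label_raised = True
print(missing_label_raised)
```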

Finally, we have the .iloc indexer. It is used for position-based selection, meaning you select rows and columns by their integer positions (starting at 0) rather than by their labels:

third_character = df.iloc[2]
print(third_character)


output

Name                           Ron Weasley
House                           Gryffindor
Patronus              Jack Russell Terrier
Favorite Subject                Divination
Quidditch Position                  Keeper
OWL Scores                               7
Name: 2, dtype: object


You can also pass lists of positions. The next example selects the first and last rows (positions 0 and 4) and the "House" and "OWL Scores" columns (positions 1 and 5):

first_last_info = df.iloc[[0, 4], [1, 5]]
print(first_last_info)


output

        House  OWL Scores
0  Gryffindor           7
4   Ravenclaw           9

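One detail worth knowing when slicing: .iloc excludes the end position, as Python lists do, while .loc includes the end label. Here is a minimal sketch using a trimmed copy of the example data; the result variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley', 'Draco Malfoy', 'Luna Lovegood'],
    'OWL Scores': [7, 11, 7, 8, 9],
})

# .iloc slices exclude the end position: this returns positions 0 and 1
first_two = df.iloc[0:2]

# .loc slices include the end label: this returns labels 0, 1 and 2
first_three = df.loc[0:2]

# Negative positions count from the end, so -1 is the last row
last_row = df.iloc[-1]

print(first_two)
print(first_three)
print(last_row)
```

With the default integer index the labels and positions happen to coincide, which is why the two slices look so similar yet return a different number of rows.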

3. Sorting

Sorting data with pandas is straightforward and can be done using the sort_values() method. For example, you can sort a list of students by their OWL scores in ascending order:

# Sort by 'OWL Scores' in ascending order (default)
sorted_by_owl = df.sort_values(by='OWL Scores')
print(sorted_by_owl)


output:

               Name       House              Patronus               Favorite Subject  Quidditch Position  OWL Scores
0      Harry Potter  Gryffindor                  Stag  Defense Against the Dark Arts              Seeker           7
2       Ron Weasley  Gryffindor  Jack Russell Terrier                     Divination              Keeper           7
3      Draco Malfoy   Slytherin                  None                        Potions                None           8
4     Luna Lovegood   Ravenclaw                  Hare                         Charms                None           9
1  Hermione Granger  Gryffindor                 Otter                     Arithmancy                None          11


To sort in descending order, set the ascending parameter to False:

# Sort by 'OWL Scores' in descending order
sorted_by_owl_desc = df.sort_values(by='OWL Scores', ascending=False)
print(sorted_by_owl_desc)


output:

               Name       House              Patronus               Favorite Subject  Quidditch Position  OWL Scores
1  Hermione Granger  Gryffindor                 Otter                     Arithmancy                None          11
4     Luna Lovegood   Ravenclaw                  Hare                         Charms                None           9
3      Draco Malfoy   Slytherin                  None                        Potions                None           8
0      Harry Potter  Gryffindor                  Stag  Defense Against the Dark Arts              Seeker           7
2       Ron Weasley  Gryffindor  Jack Russell Terrier                     Divination              Keeper           7


One of the powerful features of sort_values() is that it allows you to sort by multiple columns. In the example below, students are sorted first by their OWL scores and then by their house:

# Sort by 'OWL Scores' first in descending order, then by 'House' in ascending order
sorted_by_owl_first = df.sort_values(by=['OWL Scores', 'House'], ascending=[False, True])
print(sorted_by_owl_first)



output:

               Name       House              Patronus               Favorite Subject  Quidditch Position  OWL Scores
1  Hermione Granger  Gryffindor                 Otter                     Arithmancy                None          11
4     Luna Lovegood   Ravenclaw                  Hare                         Charms                None           9
3      Draco Malfoy   Slytherin                  None                        Potions                None           8
0      Harry Potter  Gryffindor                  Stag  Defense Against the Dark Arts              Seeker           7
2       Ron Weasley  Gryffindor  Jack Russell Terrier                     Divination              Keeper           7


In this case, the OWL score is the primary sort key, so pandas orders the rows by it first. The house value is the secondary key and only matters when two students have the same OWL score.
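
In the output above the secondary key has no visible effect, because the only tied students (Harry and Ron, both with 7) are in the same house. The sketch below is an illustrative variation, not the original example: it breaks ties by Name instead, renumbers the rows with ignore_index=True, and shows nlargest() as a shortcut for picking the top rows.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley', 'Draco Malfoy', 'Luna Lovegood'],
    'House': ['Gryffindor', 'Gryffindor', 'Gryffindor', 'Slytherin', 'Ravenclaw'],
    'OWL Scores': [7, 11, 7, 8, 9],
})

# Sort by score (descending), then by name (ascending) to break ties;
# ignore_index=True renumbers the rows 0..n-1 after sorting
ranked = df.sort_values(by=['OWL Scores', 'Name'], ascending=[False, True], ignore_index=True)
print(ranked)

# nlargest() returns the n rows with the highest values in a column
top_two = df.nlargest(2, 'OWL Scores')
print(top_two)
```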

Exploring, filtering, and sorting data is an essential first step before jumping into tasks like data cleaning or wrangling in the data analysis process. Pandas offers a range of built-in methods that help organize and accelerate these operations. Additionally, Pandas integrates seamlessly with other libraries, such as NumPy or SciPy for numerical computations, Matplotlib for data visualization, and analytical tools like Statsmodels and Scikit-learn. By learning Pandas, you can significantly boost your efficiency in handling and analyzing data, making it a valuable skill for any data professional. Happy coding!
