DEV Community

Piyush Raj
Pandas - Basic Data Manipulation - 7 Days of Pandas

Welcome to the second article in the "7 Days of Pandas" series where we cover the pandas library in Python which is used for data manipulation.

In the first article of the series, we looked at how to read and write CSV files with Pandas. In this tutorial, we will look at some of the most common operations that we perform on a dataframe in Pandas.

Pandas is a powerful Python library that is widely used for data manipulation and analysis. It provides a range of functions and methods that allow you to easily manipulate and transform data in a variety of formats. In this tutorial, we will cover the following topics:

  1. Selecting rows and columns
  2. Filtering data
  3. Sorting data
  4. Adding and deleting columns

Before we begin, let's first import pandas and read in a sample data file. We will use the pandas.read_csv() function to read in a CSV file and store it in a DataFrame object.

We'll assume that a CSV file named "sample_data.csv" exists in the current working directory, and we read it into a DataFrame.

import pandas as pd

df = pd.read_csv("sample_data.csv")
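If you don't have a CSV file at hand, the following minimal sketch builds a comparable DataFrame from an in-memory string. The CSV contents and the column names "column_name" and "other_column" are made up for illustration and are not the actual contents of "sample_data.csv".

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for "sample_data.csv".
csv_text = """column_name,other_column
3,12
8,7
6,15
10,2
"""

# read_csv accepts a file-like object as well as a file path.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first rows of the DataFrame
```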

Now that we have a DataFrame, let's dive into the first topic: selecting rows and columns.

Selecting Rows and Columns

There are several ways to select specific rows and columns from a pandas DataFrame. One way is to use the loc attribute, which selects rows and columns based on their labels. With the default integer index, the first row has the label 0, so you can select it with the following code:

# select the row labeled 0 (the first row with the default index)
df.loc[0]

To select a specific column, you can pass the column name as a string:

# select column by its name
df.loc[:, "column_name"]

You can also use the iloc attribute to select rows and columns based on their integer positions, regardless of their labels. For example, to select the first row using iloc, you can use the following code:

# select the first row
df.iloc[0]

To select a specific column, you can pass its integer position:

# select column by column index
df.iloc[:, 0]
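The difference between loc and iloc is easier to see when the row labels are not the default 0, 1, 2, and so on. Here is a small sketch that assumes a made-up DataFrame with string row labels; the data and labels are purely illustrative.

```python
import pandas as pd

# Hypothetical data with string row labels instead of the default 0, 1, 2, ...
df = pd.DataFrame(
    {"column_name": [3, 8, 6], "other_column": [12, 7, 15]},
    index=["a", "b", "c"],
)

first_by_label = df.loc["a"]    # row whose label is "a"
first_by_position = df.iloc[0]  # row at position 0, whatever its label

# Both return the same row here because "a" happens to be the first label.
print(first_by_label.equals(first_by_position))

# Slices behave differently: loc includes the end label,
# while iloc excludes the end position.
print(df.loc["a":"b", "column_name"])  # rows "a" and "b"
print(df.iloc[0:1, 0])                 # row at position 0 only
```

Note that df.loc[0] would raise a KeyError on this DataFrame, because no row carries the label 0.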

Filtering Data

In addition to selecting rows and columns, you can also use pandas to filter your data based on specific conditions.

You can use boolean indexing to filter the data in a dataframe. Boolean indexing allows you to filter a DataFrame based on the values in one or more columns. The idea is to write a boolean expression that produces a Series of True/False values (a boolean index), which we then use to keep only the matching rows of the original data.

To do this, you pass a boolean expression to the DataFrame's indexing operator, []. For example, to filter the DataFrame to only include rows where the value in the "column_name" column is greater than 5, you can use the following code:

# filter dataframe
df[df["column_name"] > 5]
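To make the intermediate step visible, the sketch below stores the boolean index in a variable before applying it. The sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"column_name": [3, 8, 6, 10]})

# The comparison yields a boolean Series with one entry per row.
mask = df["column_name"] > 5

# Indexing with the mask keeps only the rows where it is True.
filtered = df[mask]

print(mask.tolist())
print(filtered)
```

The filtered result keeps the original row labels, so its index here no longer starts at 0.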

You can also filter the dataframe on multiple conditions by using the logical operators & (and) and | (or). Wrap each condition in parentheses, because & and | bind more tightly than comparison operators such as > and <. For example, to filter the DataFrame to only include rows where the value in the "column_name" column is greater than 5 and the value in the "other_column" column is less than 10, you can use the following code:

# filter dataframe on multiple conditions
df[(df["column_name"] > 5) & (df["other_column"] < 10)]

Alternatively, you can use the DataFrame's query() method, which takes the filter condition as a string.
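As a rough illustration, the sketch below applies the same two conditions once with boolean indexing and once with query(). The data is made up, and the column names are assumed to be valid Python identifiers so they can appear directly in the query string.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {"column_name": [3, 8, 6, 10], "other_column": [12, 7, 15, 2]}
)

# Boolean indexing with two conditions combined by &.
by_mask = df[(df["column_name"] > 5) & (df["other_column"] < 10)]

# The same filter written as a query string.
by_query = df.query("column_name > 5 and other_column < 10")

print(by_query)
print(by_mask.equals(by_query))  # both approaches select the same rows
```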

Sorting Data

To sort a pandas DataFrame, you can use its sort_values() method. This method lets you specify one or more columns to sort by, as well as the sort order (ascending or descending). It returns a new, sorted DataFrame and leaves the original unchanged unless you pass inplace=True.

For example, to sort the DataFrame by the "column_name" column in ascending order, you can use the following code:

# sort dataframe by "column_name" in ascending order
df.sort_values("column_name")

To sort in descending order, you can set the ascending parameter to False:

# sort dataframe by "column_name" in descending order
df.sort_values("column_name", ascending=False)

You can also sort by multiple columns by passing a list of column names:

# sort dataframe by multiple columns
df.sort_values(["column_name_1", "column_name_2"])
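When sorting by several columns, the ascending parameter also accepts a list with one value per column. The sketch below, using made-up data, sorts the first column in ascending order and breaks ties with the second column in descending order.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {"column_name_1": ["b", "a", "b", "a"], "column_name_2": [1, 2, 3, 4]}
)

# One ascending flag per sort column: first ascending, second descending.
sorted_df = df.sort_values(
    ["column_name_1", "column_name_2"], ascending=[True, False]
)

print(sorted_df)

# sort_values returns a new DataFrame, so df keeps its original row order.
print(df.index.tolist())
```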

Adding and Deleting Columns

To add a new column to a pandas DataFrame, you can simply assign a value to a column name that doesn't exist yet. For example, to add a new column called "new_column" with a default value of 0 for all rows, you can use the following code:

# create a new column with all values as 0
df["new_column"] = 0

You can also assign different values to each row using a list or another Series object.

Other methods, such as assign() and insert(), can also add a column.
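The following sketch shows a few of these options on made-up data. The column names "from_list", "from_series", "doubled", and "total" are hypothetical and chosen only for this example.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"column_name": [3, 8, 6]})

# A list is assigned by position, so its length must match the row count.
df["from_list"] = [10, 20, 30]

# A Series is aligned on the index; rows without a matching label get NaN.
df["from_series"] = pd.Series([1.5, 2.5], index=[0, 2])

# A new column can also be computed from existing columns.
df["doubled"] = df["column_name"] * 2

# assign() returns a new DataFrame with the extra column and leaves df as is.
with_total = df.assign(total=df["column_name"] + df["from_list"])

print(df)
print(with_total)
```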

To delete a column from a DataFrame, you can use the drop() method, passing the column name and setting the axis parameter to 1 (columns). drop() returns a new DataFrame, so assign the result back to keep the change. For example, to delete the "new_column" column from the DataFrame, you can use the following code:

# remove the column "new_column" from the dataframe
df = df.drop("new_column", axis=1)
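drop() also accepts a columns keyword, which avoids the axis argument and takes a list when you want to remove several columns at once. A minimal sketch with made-up data, where "extra" is a hypothetical second column:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {"column_name": [3, 8], "new_column": [0, 0], "extra": [1, 1]}
)

# Remove two columns in one call using the columns keyword.
trimmed = df.drop(columns=["new_column", "extra"])

print(trimmed.columns.tolist())

# The original DataFrame still has all three columns.
print(df.columns.tolist())
```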

That concludes this tutorial on basic data manipulation with pandas. We hope that you found it useful.

In the coming articles, we'll look at other useful operations in Pandas.