DEV Community

Lians


Pandas Walkthrough

If you're thinking about a career in data science, Pandas is one of the first tools you should learn. It is an excellent Python library, particularly for data analysis. The name is said to be derived from the term "panel data," and it is also read as "Python Data Analysis Library."

The package can read SQL, Excel (.xlsx), and JSON data in addition to .csv files. As a result, there is no need to convert files to CSV format first. Pandas' ability to arrange data into tidy rows and columns is a fascinating feature!

Let’s get started

The simplest way to get Pandas is to install the Anaconda distribution, which bundles Pandas and Jupyter Notebook (Pandas can also be installed on its own with pip). Alternatively, use a hosted notebook such as Google Colab. After installing Anaconda, you can open Jupyter Notebook by typing "Jupyter Notebook" into your computer's search bar. The application starts a kernel and opens a localhost page in your browser. If this works, you're ready to begin coding with Pandas. Create a new Python 3 notebook and type the following in a code cell:

    import pandas as pd

If you've prepared a dataset, the next step is to import it into your notebook. This can be done with:

    df = pd.read_csv(r'_file location_')
    df

However, if you want to use your own data, you can do so by creating a variable inside the code cell. Consider the following example:

    data = {'first': ["Lians", "Shem", "Zainab"],
            'last': ["Wanjiku", "Githinji", "Buno"],
            'email': ["lianswanjiku@gmail.com", "smaina@gmail.com", "zainab@gmail.com"]}
    df = pd.DataFrame(data)
    df

Isn't it straightforward? The next step would be to learn how to use the Pandas tools that are required. In this guide, I'll go through a list of topics that you must study and comprehend in order to be proficient with Pandas.

Series and DataFrames

A Pandas Series is a one-dimensional data structure in which each value is paired with an index label, much like a set of key-value pairs. A DataFrame is a two-dimensional table made up of many Series that share the same index. A dataset imported into Pandas becomes a DataFrame; if you select a single column from it, you get a Series.
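As a rough illustration of the difference, the sketch below builds a small DataFrame and checks what type comes back when one column or several columns are selected. The data and the variable names (`last_names`, `names`, `s`) are made up for this example.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"first": ["Lians", "Shem", "Zainab"],
                   "last": ["Wanjiku", "Githinji", "Buno"]})

# Selecting one column with single brackets returns a Series.
last_names = df["last"]
print(type(last_names).__name__)

# Selecting a list of columns (double brackets) returns a DataFrame.
names = df[["first", "last"]]
print(type(names).__name__)

# A Series can also be built directly: each value is paired with an index label.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])  # look up a value by its label
```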

Indexes

An index is the set of labels that identifies each row of a Series or DataFrame. It is worth understanding how to set an index, how to use it to select a row or column and filter data, and how to reset it afterwards. iloc and loc are the two main tools for indexing. The distinction between the two is that iloc selects data by integer position, whereas loc selects it by label.

iloc

    # Select rows by integer position. A list of positions returns a DataFrame.
    df.iloc[[0, 1]]

loc

    # Select rows by index label. A single label returns that row as a Series.
    df.loc[2]

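Setting and resetting an index, and the difference between loc and iloc, can be sketched as follows. This is a minimal example that assumes the small name-and-email DataFrame from earlier; using the email column as the index is just one possible choice, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"first": ["Lians", "Shem", "Zainab"],
                   "last": ["Wanjiku", "Githinji", "Buno"],
                   "email": ["lianswanjiku@gmail.com", "smaina@gmail.com",
                             "zainab@gmail.com"]})

# Use the email column as the index (the row labels).
by_email = df.set_index("email")

# loc selects by label, iloc by integer position.
row_by_label = by_email.loc["zainab@gmail.com"]
row_by_pos = by_email.iloc[2]
print(row_by_label.equals(row_by_pos))  # both refer to the third row

# loc also accepts a row label together with a column label.
last = by_email.loc["smaina@gmail.com", "last"]
print(last)

# reset_index moves the index back into an ordinary column.
restored = by_email.reset_index()
print(restored.columns.tolist())
```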

Filtering

Filtering separates selected rows and columns from the rest of the dataset, often during data cleaning. For example, if you're trying to predict house prices based on a property's features, it is best to filter the data down to those features. That way, the analyst has a simpler dataset from which to forecast a price. In Pandas, filter selects by label, for example 'Country'.

    # Keep only the columns labelled 'one' and 'three'
    df.filter(items=['one', 'three'])
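Here is a small sketch of label-based filtering on the name-and-email DataFrame used earlier, followed by a boolean mask for filtering on cell values, which `filter` itself does not do. The column choices and variable names are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({"first": ["Lians", "Shem", "Zainab"],
                   "last": ["Wanjiku", "Githinji", "Buno"],
                   "email": ["lianswanjiku@gmail.com", "smaina@gmail.com",
                             "zainab@gmail.com"]})

# filter() works on labels: keep only the named columns.
cols = df.filter(items=["first", "email"])

# like= keeps every column whose label contains the given substring.
cols_like = df.filter(like="st")

# To filter rows by their values, use a boolean mask instead.
mask = df["last"].str.startswith("G")
g_rows = df[mask]

print(cols.columns.tolist())
print(cols_like.columns.tolist())
print(g_rows)
```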

Updating, adding and removing rows and columns

These operations are also part of data cleaning. You can, for example, rename the labels in a dataset if they are confusing. Data often contains missing values, and a common fix is to remove (drop) the affected rows or columns.

1. # Combine two columns into a new column
       df['full_name'] = df['first'] + ' ' + df['last']
2. # Remove a bunch of columns from the DataFrame
       df.drop(columns=['first', 'last'], inplace=True)
3. # Add a new row. DataFrame.append was removed in pandas 2.0, so pd.concat is used instead; it returns a new DataFrame
       pd.concat([df, pd.DataFrame([{'first': 'Tony'}])], ignore_index=True)

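Renaming labels and dropping missing values, both mentioned above, might look like the sketch below. The missing last name is inserted on purpose for the example, and the new column names are hypothetical.

```python
import pandas as pd

# Illustrative data with one deliberately missing value.
df = pd.DataFrame({"first": ["Lians", "Shem", "Zainab"],
                   "last": ["Wanjiku", None, "Buno"],
                   "email": ["lianswanjiku@gmail.com", "smaina@gmail.com",
                             "zainab@gmail.com"]})

# Rename confusing labels; rename returns a new DataFrame.
renamed = df.rename(columns={"first": "first_name", "last": "last_name"})

# Drop every row that contains at least one missing value.
complete_rows = renamed.dropna()

# Or drop every column that contains at least one missing value.
complete_cols = renamed.dropna(axis="columns")

print(complete_rows)
print(complete_cols.columns.tolist())
```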

To add several rows at once, we can build a second DataFrame and combine it with the existing one.
Let's start by creating a new DataFrame and calling it df1:

    people = {'first': ["Stella", "Natalie"],
              'last': ["Smith", "Mikaelson"],
              'email': ["stelasmith@gmail.com", "mnatalie@gmail.com"]}
    df1 = pd.DataFrame(people)
    df1

    # Then we can combine it with the existing df.
    # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
    pd.concat([df, df1], ignore_index=True)

Sorting

Sorting data means arranging it according to a set of criteria. Data can be sorted in Pandas in two ways: by index or by the values in one or more columns, or by a combination of the two. Let's take a look at a code snippet for data sorting:

    # Sort df alphabetically by last name, in ascending order
    df.sort_values(by='last')

In this case, we arrange the rows in alphabetical order of the last names, so we sort by the values in the 'last' column.
The following example sorts the data by index instead:

    df.sort_index()
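Sorting on more than one column, and then returning to the original order with the index, could be sketched like this. The extra row is made up so that two people share a last name and the second sort key has something to do.

```python
import pandas as pd

# Illustrative data; two rows share the same last name.
df = pd.DataFrame({"first": ["Lians", "Shem", "Zainab", "Natalie"],
                   "last": ["Wanjiku", "Githinji", "Buno", "Githinji"]})

# Sort by last name descending, then by first name ascending within ties.
sorted_df = df.sort_values(by=["last", "first"], ascending=[False, True])
print(sorted_df)

# The original index labels travel with the rows,
# so sort_index() puts them back in their original order.
restored = sorted_df.sort_index()
print(restored)
```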

Aggregation and grouping

Like filtering, grouping separates related rows so that you can summarize them and arrive at a conclusion faster: groupby splits the data into groups, and an aggregation then computes a value for each group. I won't go into great depth about this because I prefer filtering to grouping and aggregation. However, here's a bit of code for the operations:

    # Grouping can use column labels, index levels, or a combination of both.
    # This assumes the dataset has a 'GENDER' column.
    gender_grp = df.groupby('GENDER')
    gender_grp.get_group('Female')

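To show the aggregation half of the idea, here is a minimal sketch on a made-up dataset. The `GENDER` and `AGE` columns and their values are assumptions for illustration, not taken from a real dataset.

```python
import pandas as pd

# Hypothetical survey data.
df = pd.DataFrame({"GENDER": ["Female", "Male", "Female", "Male", "Female"],
                   "AGE": [30, 40, 20, 50, 40]})

gender_grp = df.groupby("GENDER")

# One aggregated value per group.
mean_age = gender_grp["AGE"].mean()

# Number of rows in each group.
counts = gender_grp.size()

# Several aggregations at once; the result has one row per group.
summary = gender_grp["AGE"].agg(["min", "max", "mean"])

print(mean_age)
print(counts)
print(summary)
```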

Date and Time

If you're working with a dataset that includes a date or time column, you'll need to know how to manipulate it. First, as shown below, we tell Pandas how to parse the date strings when loading the file.

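    # pd.datetime is no longer available in current pandas and date_parser is deprecated,
    # so the column is converted with pd.to_datetime and an explicit format instead.
    df = pd.read_csv(r'C:\Users\lian.s\Downloads\ETH_1h.csv')
    df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d %I-%p')
    df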

We can also create a new column that gives the day of the week for each date:

df['DayOfWeek'] = df['Date'].dt.day_name()
df
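If you don't have the file at hand, the same steps can be tried on a small in-memory CSV. This is only a sketch: the two rows and the `Close` column are made up, and only the date format matches the one used above.

```python
import io
import pandas as pd

# Made-up sample in the same date format ('%Y-%m-%d %I-%p').
csv_text = "Date,Close\n2020-03-13 08-PM,100.0\n2020-03-14 09-AM,101.5\n"

df = pd.read_csv(io.StringIO(csv_text))

# Convert the text column to real datetime values.
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d %I-%p")

# The .dt accessor exposes date and time parts of a datetime column.
df["DayOfWeek"] = df["Date"].dt.day_name()
df["Hour"] = df["Date"].dt.hour

print(df)
```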

File handling

As previously stated, Pandas can read several file formats; in this tutorial, we focused primarily on reading CSV files. It is worth going through each of the other supported formats as you learn Pandas, specifically SQL, Excel, and JSON.
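As a starting point, the sketch below writes a small DataFrame out and reads it back as CSV, JSON, and SQL without touching the disk. The data is illustrative, the table name `people` is an assumption, and an in-memory SQLite database stands in for a real one. Excel works the same way through `pd.read_excel`, but it needs an extra package such as openpyxl, so it is left out here.

```python
import io
import sqlite3
import pandas as pd

df = pd.DataFrame({"first": ["Lians", "Shem", "Zainab"],
                   "last": ["Wanjiku", "Githinji", "Buno"]})

# CSV: write to a string, then read it back.
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: one object per row.
json_text = df.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text))

# SQL: an in-memory SQLite database stands in for a real one.
conn = sqlite3.connect(":memory:")
df.to_sql("people", conn, index=False)
from_sql = pd.read_sql("SELECT * FROM people", conn)
conn.close()

print(from_csv)
print(from_json)
print(from_sql)
```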

Conclusion

This tutorial has covered the most important topics to study when learning Pandas, but there are many more concepts to learn. I recommend that you look through the Pandas documentation.

I hope you find this guide useful as you embark on your Data Science journey. Don't forget to share your thoughts and comments in the section below.
