
Zaynul Abedin Miah


Pandas Library

Pandas is an open-source library built on top of NumPy. It supports fast data analysis as well as data cleaning and preparation, offers high performance and productivity, and includes built-in visualization features.

Pandas Series
You can create a Series in pandas from many kinds of data, including a list, a dictionary, a NumPy array, or a scalar value.

A NumPy array built with numpy.array() can be converted into a Series by passing it to pd.Series().
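Here is a minimal sketch of the different ways to build a Series. The values and the labels "a", "b", "c" are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a NumPy array: the default index is 0, 1, 2, ...
arr = np.array([10, 20, 30])
s_from_array = pd.Series(arr)

# From a list, with explicit index labels (illustrative labels)
s_from_list = pd.Series([10, 20, 30], index=["a", "b", "c"])

# From a dictionary: the keys become the index labels
s_from_dict = pd.Series({"a": 10, "b": 20, "c": 30})

# From a scalar: the value is repeated for every index label
s_from_scalar = pd.Series(5, index=["a", "b", "c"])

print(s_from_dict["b"])  # 20
```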


Missing data occurs when no value is stored for a unit or object. It is a common problem in real-world datasets. Pandas refers to missing data as NA values. Many datasets loaded into a DataFrame include missing data, either because the value never existed or because it was never collected.
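As a small sketch of how missing values show up and how to spot them, the example below uses an invented two-column table. isna() marks each missing cell, and summing the result counts the missing values per column.

```python
import numpy as np
import pandas as pd

# Illustrative data: one missing name and one missing score
df = pd.DataFrame({
    "name": ["Ann", "Bob", None],
    "score": [90.0, np.nan, 75.0],
})

# Boolean DataFrame: True wherever a value is missing
mask = df.isna()

# Number of missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)
```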

Pandas Data Frames
A pandas DataFrame is a two-dimensional, size-mutable tabular data structure with labeled axes (rows and columns), so the data is laid out in rows and columns like a table. A DataFrame is made up of three main parts: rows, columns, and data.
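The three parts can be inspected directly. This is a minimal sketch with made-up fruit data and made-up row labels "r1" to "r3".

```python
import pandas as pd

# Illustrative data with explicit row labels
df = pd.DataFrame(
    {"fruit": ["apple", "banana", "grape"], "price": [21, 14, 38]},
    index=["r1", "r2", "r3"],
)

print(df.index.tolist())    # row labels
print(df.columns.tolist())  # column labels
print(df.to_numpy())        # the underlying data
print(df.shape)             # (3, 2)

# The size can change: adding a column grows the DataFrame
df["in_stock"] = [True, False, True]
```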



pandas.DataFrame.drop
DataFrame.drop(labels=None, *, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

You can remove rows or columns by specifying label names together with the corresponding axis, or by passing index or column names directly through the index and columns parameters. With a MultiIndex, you can remove labels on a particular level by specifying the level.
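A short sketch of the equivalent ways to call drop(), using invented data. The column names, row labels, and the "region"/"quarter" level names are assumptions for the example only.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=["x", "y", "z"],
)

# Drop a row: by label and axis (axis=0 is the default), or with index=
no_row = df.drop("y")
same_no_row = df.drop(index="y")

# Drop a column: by label and axis=1, or with columns=
no_col = df.drop("b", axis=1)
same_no_col = df.drop(columns="b")

# errors="ignore" skips labels that do not exist instead of raising
unchanged = df.drop(columns="missing", errors="ignore")

# With a MultiIndex, level= says which level the label belongs to
mi = pd.MultiIndex.from_tuples(
    [("north", "q1"), ("north", "q2"), ("south", "q1")],
    names=["region", "quarter"],
)
sales = pd.DataFrame({"units": [5, 7, 3]}, index=mi)
no_q1 = sales.drop(index="q1", level="quarter")
```

Because inplace defaults to False, each call returns a new DataFrame and leaves df untouched.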



loc vs iloc
The .loc[] indexer selects by the names or labels of the index. The .iloc[] indexer, on the other hand, selects by integer position. It works like normal Python slicing: we give the positional index numbers and get the corresponding slice.
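The difference is easiest to see side by side. In this sketch (fruit names and prices are illustrative), note that a label slice with .loc[] includes both endpoints, while a positional slice with .iloc[] excludes the end, as in plain Python.

```python
import pandas as pd

df = pd.DataFrame(
    {"price": [21, 14, 35, 38]},
    index=["apple", "banana", "avocado", "grape"],
)

# Same cell, addressed by label and by position
by_label_value = df.loc["banana", "price"]
by_position_value = df.iloc[1, 0]

# Label slice: both endpoints are included (banana and avocado)
by_label = df.loc["banana":"avocado"]

# Positional slice: the end position is excluded (only position 1)
by_position = df.iloc[1:2]
```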

Boolean Dataframes
Pandas DataFrames support boolean indexing, an effective technique for filtering a DataFrame on one or more conditions. In boolean indexing, a boolean vector generated from the conditions is used to select the rows to keep.
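A minimal sketch of boolean indexing on made-up price data. Conditions are combined with & (and), | (or), and ~ (not), and each condition needs its own parentheses.

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "banana", "avocado", "grape"],
    "price": [21, 14, 35, 38],
})

# The comparison produces a boolean Series, one value per row
mask = df["price"] > 20

# Rows where the mask is True are kept
expensive = df[mask]

# Two conditions combined with &
mid_range = df[(df["price"] > 20) & (df["price"] < 38)]
```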

Subset selection

Indexing in pandas means selecting particular rows and columns of data from a DataFrame. That can mean selecting all the rows and some of the columns, some of the rows and all the columns, or some of the rows and some of the columns. Indexing is also known as subset selection.
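The three cases can be sketched as follows, on a small invented DataFrame with the default integer index.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [5, 6, 7, 8],
    "c": [9, 10, 11, 12],
})

# All rows, some columns
all_rows_some_cols = df[["a", "c"]]

# Some rows (the first two, by position), all columns
some_rows_all_cols = df.iloc[0:2]

# Some rows and some columns (label slice 0:1 includes both 0 and 1)
some_rows_some_cols = df.loc[0:1, ["a", "b"]]
```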


Working with missing data

To fill in NaN values in a dataset we can use the fillna(), replace(), and interpolate() functions. fillna() and replace() substitute a value that you supply, while interpolate() fills in NA values by estimating them with an interpolation technique (linear by default) instead of a hard-coded value. The simplest case is filling every null value with a single value using fillna().
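Here is a small sketch comparing the three functions on the same made-up Series. The fill values 0 and -1 are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# fillna: put a single value into every missing slot
filled = s.fillna(0)

# replace: swap NaN for another value
replaced = s.replace(np.nan, -1)

# interpolate: estimate missing values from their neighbours
# (linear interpolation is the default method)
interpolated = s.interpolate()

print(interpolated.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```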


groupby
Pandas groupby is used to put data into groups based on their categories and apply a function to each group. It also makes it easier to aggregate data efficiently.

With the DataFrame.groupby() function, the data is split into groups based on some criteria. Pandas objects can be split on any of their axes. Abstractly, a grouping is a mapping from labels to group names.
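A minimal sketch of grouping and then applying a function to each group. The "category" and "price" columns and their values are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["fruit", "veg", "fruit", "veg"],
    "price": [21, 14, 35, 10],
})

# Split by category, then take the mean price of each group
mean_price = df.groupby("category")["price"].mean()

# Several aggregations at once
summary = df.groupby("category")["price"].agg(["min", "max", "sum"])

print(mean_price)
```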

pandas.concat
pandas.concat(objs, *, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)

Let’s look at how to concatenate two or more DataFrames. This is done with the pandas.concat() function, which combines DataFrames either along rows (axis=0) or along columns (axis=1).
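Both directions can be sketched with two small made-up DataFrames that share the same columns.

```python
import pandas as pd

df1 = pd.DataFrame({"fruit": ["apple", "banana"], "price": [21, 14]})
df2 = pd.DataFrame({"fruit": ["avocado", "grape"], "price": [35, 38]})

# axis=0 (default): stack the rows; ignore_index=True renumbers them 0..3
stacked = pd.concat([df1, df2], ignore_index=True)

# axis=1: place the DataFrames side by side, aligned on the row index
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked.shape)       # (4, 2)
print(side_by_side.shape)  # (2, 4)
```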

pandas.DataFrame.merge

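# importing the modules
import pandas as pd
# display() is available by default in Jupyter; the import makes it explicit
from IPython.display import display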
# creating the first DataFrame
df1 = pd.DataFrame({"fruit" : ["apple", "banana", 
                               "avocado", "grape"],
                    "market_price" : [21, 14, 35, 38]})
display("The first DataFrame")
display(df1)

# creating the second DataFrame
df2 = pd.DataFrame({"fruit" : ["apple", "banana", "grape"],
                    "wholesaler_price" : [65, 68, 71]})
display("The second DataFrame")
display(df2)

# joining the DataFrames
# an inner join keeps only the fruits present in both
# df1 and df2: apple, banana, and grape
display("The merged DataFrame")
display(pd.merge(df1, df2, on="fruit", how="inner"))



DataFrame.join()
Pandas DataFrame.join() combines the columns of two or more DataFrames by matching their rows. By default the rows are matched on the index. The index or columns used for matching are called the join key.
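As a minimal sketch, the two made-up DataFrames below use the fruit name as their index, so the index acts as the join key.

```python
import pandas as pd

prices = pd.DataFrame(
    {"market_price": [21, 14, 35]},
    index=["apple", "banana", "avocado"],
)
wholesale = pd.DataFrame(
    {"wholesaler_price": [65, 68]},
    index=["apple", "banana"],
)

# Default how="left": keep every row of prices; avocado has no match,
# so its wholesaler_price is NaN
left_joined = prices.join(wholesale)

# how="inner": keep only the labels present in both indexes
inner_joined = prices.join(wholesale, how="inner")
```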


Data Input and Output
You can also read data from sources such as HTML, Excel, SQL, and CSV.
To work with HTML files and SQL databases, we need to install the following libraries alongside pandas:

  • conda install sqlalchemy

  • conda install lxml

  • conda install html5lib

  • conda install beautifulsoup4
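CSV needs none of these extra libraries. Below is a minimal round-trip sketch that writes a made-up DataFrame to CSV text and reads it back. It uses an in-memory buffer so no file is created. With a real file you would pass a path instead. pd.read_excel(), pd.read_html(), and pd.read_sql() are the corresponding readers for the other formats.

```python
import io

import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "banana"], "price": [21, 14]})

# Write to CSV; index=False leaves the row labels out of the output
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a DataFrame
buffer.seek(0)
loaded = pd.read_csv(buffer)

print(loaded)
```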

All the pandas exercises I've solved are available here:

https://github.com/azaynul10/Python-For-Data-Science-And-Machine-Learing-Bootcamp-Exercise-Solutions/blob/78aa3f5acb9bea8a751ebb5395af72925e1a74ad/Pandas_Library1.py
