- You know basic data structures(
set) in python.
- You are familiar with NumPy Basics. If not, check out my colab notebook where I have explained it from the ground up :)
- You have already setup your Data Science environment.
- Numpy has the following limitations:
- No support for column names.
- datatype of all elements must be the same.
- No pre-built methods for common analysis tasks.
- Pandas can handle a large amount of data at ease!
Pandas library overcomes the limitations of NumPy, and sometimes it is also referred to as a swis army knife of Data Analysis!
And don't you worry about losing the vectorization power, Pandas is built upon NumPy so internally, it makes use of NumPy code extensively!
By the way, the above image is not an over-exaggeration of pandas' capabilities!
Enough of the theory, let me show you some code otherwise you might leave this blog 😛 So fire up your Jupyter notebook/lab or whatever IDE you use for Data Science and let's start!
Lastly, we are going to use a dataset on pokemon to keep things fun and interesting 🐬 You can find it on kaggle.
It is the primary data structure provided by
pandas. Formally, a
Dataframe is a 2-Dimensional labeled tabular data structure. It is a 2D NumPy array on steroids 💪🏻 in a way.
Here's how you read a CSV file as a pandas
Dataframe by passing the file path to the
#it's a convention to import pandas as 'pd' import pandas as pd #read a dataset on pokemon stats pokemon_stats = pd.read_csv("pokemon_stats.csv") pokemon_stats #since it's jupyternotebook, no need of print
You will see a table of this sort in your jupyter notebook. (Not all the rows(1028) and columns(51) are shown so that your entire screen is not occupied!)
There's also a
read_excel() function if your dataset is in excel spreadsheet form.
#read excel file as a dataframe df = pd.read_excel("path/to/my_dataset.xlsx")
🗒️ whenever I say
df, I am referring to
It is a common practice to have a quick glance at some rows from beginning or end.
For this, we have
df.tail(n=5) methods where
n is number of rows to display and its default value is 5.
#display first four rows pokemon_stats.head(4)
#display last four rows pokemon_stats.tail(4)
We'll go into the anatomy of
Dataframe in the next post. You will get to know that it's more than just a table. Until then, enjoy data science!
It is useful whenever the data is structured - data stored in CSV files, excel files, database tables, or simply whenever there is a notion of rows and columns.
⚠️ There was a bit of jargon involved in the explanations but don't worry we are going to visit these concepts again and again in forthcoming posts :)