What is Pandas?
Pandas is a simple yet powerful library used for data manipulation and analysis.
The name Pandas is derived from "panel data", a term for multidimensional structured data sets, and not from the animal. It's just a cute name for a super-useful Python library! Pandas runs on top of NumPy, so let us take a little detour and talk about NumPy.
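What is NumPy?
NumPy stands for Numerical Python. It is fast: it can crunch numbers far more efficiently than a Python list or a hand-written loop.
NumPy arrays are the basis of all computations performed by the NumPy library. They look a lot like Python lists, but they are a separate type: every element of an array shares the same data type, and operations apply to the whole array at once instead of one element at a time.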
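To make the difference between a list and an array concrete, here is a minimal sketch. The variable names and values are made up for illustration.

```python
import numpy as np

values = [1, 2, 3, 4]        # a plain Python list
arr = np.array(values)       # the same numbers as a NumPy array

# With a list you need a loop (or comprehension) to double every element.
doubled_list = [v * 2 for v in values]

# With an array the multiplication is applied to every element at once.
doubled_arr = arr * 2

print(doubled_list)   # → [2, 4, 6, 8]
print(doubled_arr)    # → [2 4 6 8]
print(arr.dtype)      # all elements share one numeric type
```

Note that `values * 2` on the plain list would repeat the list rather than double its elements, which is one reason arrays are the better fit for numeric work.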
Now back to Pandas.
Pandas is quite a game changer when it comes to analyzing data with Python, and it is one of the most widely used tools for data munging/wrangling, if not THE most used one. Wes McKinney began developing Pandas in 2008, and since then more and more people have gravitated to Python for data mining and manipulation.
Features of Pandas:
- Series and DataFrame objects
- Handling of missing data
- Data alignment
- Group by functionality
- Slicing, indexing and subsetting
- Merging and joining
- Hierarchical labeling of axes
- Robust input/output tools
- Time series-specific functionality
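A few of the features listed above can be sketched in a handful of lines. The data here (cities, amounts, countries) is made up purely for illustration, and this is only one of several ways to do each step.

```python
import numpy as np
import pandas as pd

# Hypothetical sales data, invented for this example.
sales = pd.DataFrame({
    "city": ["Accra", "Lagos", "Accra", "Lagos"],
    "amount": [100.0, np.nan, 50.0, 200.0],
})

# Handling of missing data: replace the missing amount with 0.
sales_filled = sales.fillna({"amount": 0.0})

# Group by: total amount per city.
totals = sales_filled.groupby("city")["amount"].sum()

# Merging and joining: attach a country to each row by matching on "city".
countries = pd.DataFrame({
    "city": ["Accra", "Lagos"],
    "country": ["Ghana", "Nigeria"],
})
merged = sales_filled.merge(countries, on="city", how="left")

# Slicing and subsetting: keep only the rows where amount exceeds 75.
large = merged[merged["amount"] > 75]

print(totals)
print(large)
```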
Why use Pandas and not just stick with NumPy?
|Pandas||NumPy|
|Tends to perform better on larger data sets (roughly 500k rows or more)||Tends to perform better on smaller data sets (roughly 50k rows or less)|
|Provides rich time series functionality, data alignment, friendly statistics, group by, merge and join methods, and lots of other conveniences||Is a fairly low-level tool by itself; working with it feels much like using MATLAB|
Let us get to the practicals. How do we work with Pandas?
When you want to use Pandas for data analysis, you’ll usually use it in one of three different ways:
- Convert a Python list, dictionary or NumPy array to a Pandas DataFrame
- Open a local file using Pandas, usually a CSV file, but it could also be a delimited text file (like TSV), an Excel file, etc.
- Open a remote file or database, like a CSV or JSON file on a website through a URL, or read from a SQL table/database
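The three routes above might look roughly like this. This is a sketch under assumptions: the names, ages, file paths and URL are all hypothetical, and `io.StringIO` stands in for a real file so the example runs without anything on disk.

```python
import io

import numpy as np
import pandas as pd

# 1. From objects already in memory.
from_dict = pd.DataFrame({"name": ["Eddy", "Bob"], "age": [30, 25]})
from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["a", "b"])

# 2. From a local file. io.StringIO plays the role of the file here.
csv_text = "name,age\nLinda,41\nElla,36\n"
from_csv = pd.read_csv(io.StringIO(csv_text))
# With a real file: pd.read_csv("people.csv")            # hypothetical path
# Tab-separated:    pd.read_csv("people.tsv", sep="\t")  # hypothetical path

# 3. From a remote source. read_csv also accepts a URL (not run here):
# pd.read_csv("https://example.com/people.csv")          # hypothetical URL

print(from_dict)
print(from_array)
print(from_csv)
```

Pandas also provides `pd.read_excel`, `pd.read_json` and `pd.read_sql` for the other sources mentioned in the list.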
Ok, so how do we use it? I suggest working with the Anaconda Python distribution.
Anaconda is popular because it brings many of the tools used in data science and machine learning in just one install, so it's great for a short and simple setup. It also comes with NumPy, Jupyter Notebook, Pandas, and Python 3 installed.
I use Jupyter Notebook as my preferred editor for data science. I still love Visual Studio Code, but with data science you can't help but fall in love with Jupyter and the Anaconda family.
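Datasets in Pandas are either one-dimensional or multi-dimensional:
|One dimensional||Two dimensional||Three dimensional|
|Series object||DataFrame||Panel Data|
Note that the Panel object has been removed from recent versions of Pandas (it was dropped in version 0.25). Three-dimensional data is now usually handled with a DataFrame that has a hierarchical index.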
Let us work with the Series object first. It is a one-dimensional labelled array.
To use pandas, you have to first import it.
import pandas as pd
You need to have data before you can do anything with it.
So let us create a Python list.
data = [1, 2, 3, 4]  # creating a Python list
series1 = pd.Series(data)  # convert this list into a Series
series1  # display the result
Use Shift + Enter or the 'Run' button in your Jupyter notebook to run your code.
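Since a Series is a labelled array, it may help to see where the labels come from. In this small sketch the labels "a" to "d" are made up for illustration.

```python
import pandas as pd

data = [1, 2, 3, 4]

# Without an index, pandas labels the values 0, 1, 2, 3.
default_labels = pd.Series(data)

# You can also supply your own labels through the index argument.
custom_labels = pd.Series(data, index=["a", "b", "c", "d"])

print(default_labels[0])     # → 1
print(custom_labels["c"])    # → 3
print(custom_labels.sum())   # → 10
```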
How to create a DataFrame using a list.
general = ['Eddy', 'Bob', 'Linda', 'Ella']  # Python list
generald = pd.DataFrame(general)  # creating the DataFrame; note it is pd.DataFrame, not pd.dataframe
generald  # print result
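A DataFrame built from a bare list gets a single column named 0. As a possible next step, here is a sketch of naming that column and of building a two-column DataFrame from a dictionary. The column names and ages are made up for this example.

```python
import pandas as pd

general = ["Eddy", "Bob", "Linda", "Ella"]

# A single list becomes one column; `columns` gives it a name.
names_df = pd.DataFrame(general, columns=["name"])

# A dictionary maps column names to lists of equal length.
# The ages are invented values for illustration.
people_df = pd.DataFrame({
    "name": general,
    "age": [30, 25, 41, 36],
})

print(names_df)
print(people_df.shape)   # → (4, 2)
```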
Ok, I need to go back to work; I paused to put this together. Hopefully this was helpful. Let me know if you have any questions or comments.
Enjoy the rest of your week.