Let’s face it: the last thing the world needs is another “ Intro to Pandas ” post. Anybody strange enough to read this blog surely had the same reaction to discovering Pandas as I did: a manic euphoria that can only be described as love at first sight. We wanted to tell the world, and that we did. A lot. Yet here I am, about to helplessly sing cliche praises one more time.
I’m a prisoner of circumstance here. As it turns out, the vast (and I mean _ vast _) majority of our fans have a raging Pandas addiction. They come to our humble mom-and-pop shop here at Hackers and Slackers foaming at the mouth, falling into an absolute raging benders for all Pandas-related content. If I had half a brain, I’d rename this site Pandas and Pandas and delete all non-Pandas-related content. Talk about cash money.
As a middle-ground, I’ve decided to do a bit of housekeeping. My previous “Intro to Pandas” post was an unflattering belligerent mess jotted into a Confluence instance long ago during a Friday night pregame. That mess snuck its way on to this blog, and has gone virtually unnoticed for a year now. I've decided that this probably wasn't the best way to open up a series about the most influential Python library of all time. We're going to try this one more time. For the Pandas.
Pandas is used to analyze and modify tabular data in Python. When we say “tabular data,” we mean any instance in life where data is represented in a table format. Excel, SQL databases, shitty HTML tables.... they’ve all been the same thing with different syntax this whole time. Pandas can achieve anything that any other table can.
If you’re reasonably green to data analysis and are experiencing the “oh-my-God-all-data-professions-are-kinda-just-Excel” realization as we speak, feel free to take a moment. Great, that’s behind us now.
Tabular data in Pandas is referred to as a “DataFrame.” We can’t call everything “tables-” otherwise, our choice of vague terminology would grow horribly confusing when we refer to data in different systems. Between you and me though, DataFrames are basically tables.
So how do we represent two-dimensional data via command line: a concept which inherently interprets and displays information one-dimensionally?
DataFrames consist of individual parts which are easy-to-understand at face value. It’s the complexity of these things together, creating a sum greater than the whole of its parts, which fuels the seemingly endless power of DataFrames. If we want any hope of contributing to the field of Data Science, we need to not only understand the terminology but at least be aware of core concepts of what a DataFrame is beneath the hood. This understanding is what separates engineers from Excel monkeys.
Were you expecting this post just to be a bunch of one-liners in Pandas? Good, I hope you're disappointed. Strap yourself in, we might actually learn something today. Class is now in session, baby. Let's break apart what makes a DataFrame, piece-by-piece:
The most basic description of any table would be a collection of columns and rows. Thinking abstractly, we can use the same definition for columns as we do for rows: a sequence of values separated by cells. The only distinction between the two is the direction (horizontal or vertical). Considering we can flip any table on its side and retain the same meaning, we could almost argue that rows and columns are actually the same thing. In Pandas, that's exactly what's happening: both rows and columns are considered to be a Series.
- Series' are objects native to Pandas (and Numpy) which refer to one-dimensional sequences of data. Another example of a one-dimensional sequence of data could be an array , but series' are much more than arrays: they're a class of their own for many powerful reasons, which we'll see in a moment.
Axis refers to the 'direction' of a series, or in other words, whether a series is a column or a row. A series with an axis of
0is a row, whereas a series with an axis of
1is a column. This should help break the conception that columns are separate entities from rows: instead, they're the same object with a different attribute.
- A series contains labels , which are given visual names for a row/column specifying labels allows us to call upon any labeled series in the same way we would access a value in a Python dictionary. For instance, accessing
dataframe['awayTeamName']returns the entire column matching the header "awayTeamName".
- Every row and column has a numerical index. Most of the time, a row's label will be equivalent to the row's index. While it's common practice to define headers for columns, columns have indexes as well, which simply aren't shown. In this regard, Series share an attribute with lists/arrays, in that they are a collection of indexed values
Consider the last two points: we just described a series to work the same way as a Python dictionary, but also the same way as a Python list. That's right: series' objects are like the biracial offspring of lists and dicts. We can access any column by either its name or its index, and the same goes for rows. Even if we rip a column out from a DataFrame, each cell in that series will still retain the row labels for every cell. This means we can say things like get me column #3, and then find me the value for whatever was in the row labeled "Y". Of course, this works in the reverse as well. It's crazy how things get exponentially more powerful and complicated when we add entire dimensions, isn't it?
If you've made it this far, you've earned the right to start getting hands-on. Luckily, Pandas has plenty of methods to load tabular data into DataFrames, regardless if you're using static files, SQL, or quirkier methods, Pandas has you covered. Here are some of my favorite examples:
import pandas as pd # Reads a local CSV file. csv_df = pd.read_csv('data.csv') # Similar to above excel_df = pd.read_excel('data.xlsx') # Creating tabular data from non-tabular JSON json_df = pd.read_json('data.json') # Direct db access utilizing SQLAlchemy read_sql = read_sql('SELECT * FROM blah', conn=sqlalchemy_engine) # My personal ridiculous favorite: HTML table to DataFrame. read_html = read_html('examplePageWithTable.html) # The strength of Google BigQuery: already officially supported by Pandas read_gbq = read_gbq('SELECT * FROM test_dataset.test_table', projectid)
All of these achieve the same result of creating a DataFrame. No matter what horrible data sources you may have been forced to inherit, Pandas is here to help. Pandas knows our pain. Pandas is love. Pandas is life.
With data loaded, let's see how we can apply our new knowledge of series' to interact with out data.
Pandas has a method for finding a series by label, as well as a separate method for finding a series by index. These methods are
.loc, respectively. Let's say our DataFrame from the example above is stored as a variable named
baseball_df. To get the values of a column by name, we would do the following:
baseball_df = baseball_df.iloc['homeTeamName'] print(baseball_df)
This would return the following:
0 Cubs 1 Indians 2 Padres 3 Diamondbacks 4 Giants 5 Blue Jays 6 Reds 7 Cubs 8 Rockies 9 Yankees Name: homeTeamName, dtype: object
That's our column! We can see the row labels being listed alongside each row's value. Told ya so. Getting a column will also return the column's dtype , or data type. Data types can be set on columns explicitly. If they aren't, Pandas will generally either default to detecting that the data in the column is a float (returned for any column which only holds numerical values, despite number of decimal points) or an ' object' , which is a fancy catch-all meaning "fuck if I know, there's letters and shit in there, it could be anything probably." Pandas doesn't try hard on its own to discern the types of data in each field.
If you're thinking ahead, you might see a looming conflict of interest with
iloc. Since we've established that columns and rows are the same, and we're accessing series' based on criteria that is met by both columns and rows (every table has a first row and a first column), how does Pandas know what we want with
.loc()? Short answer: It doesn't, so it just returns both!
baseball_df = baseball_df.loc print(baseball_df) homeTeamName awayTeamName startTime duration_minutes 0 Cubs Reds 188 1 Indians Astros 194 2 Padres Giants 185 3 Diamondbacks Brewers 211
Ahhh, a 4x4 grid! This does, in fact, satisfy what we asked for- albiet in a clever, intentional way. " Clever and intentional" is actually a great way to describe Pandas as a library. This combination of ease and power is what makes Pandas so magnetic to curious newcomers.
Want another example? How about leveraging the unique attributes of series' to splice DataFrames as though they were arrays?
sliced_df = df.loc['homeTeamName':'awayTeamName'] print(sliced_df) homeTeamName awayTeamName 0 Cubs Reds 1 Indians Astros 2 Padres Giants 3 Diamondbacks Brewers
...Did we just do that? We totally did. We were able to slice a two-dimensional set of data by using the same syntax that we'd used to slice arrays, thanks to the power of the series object.
There are a lot more entertaining, mind-blowing ways to introduce people to Pandas. If our goal had been sheer amusement, we would have leveraged the cookie-cutter route to Pandas tutorials: overloading readers with Pandas "tricks" displaying immense power in minimal effort. Unfortunately, we took the applicable approach to actually retaining information. Surely this model of "informational and time consuming" will beat out "useless but instantly gratifying," right? RIGHT?
Fuck it, let's just rebrand to become Pandas and Pandas next week. From now on when people want that quick fix, you can call me Pablo Escobar. Join us next time when we use Pandas data analysis to determine which private Caribbean island offers the best return on investment with all the filthy money we'll make.
And in case you were wondering: it's definitely not the Fyre festival one.