SQL is a must have skill if you are working with data. Pandas is as essential for data analysis in the python. If you're someone who knows pandas and want to learn SQL or you know SQL and want to learn pandas, this post will introduce you to some basic skills.
To work on SQL you need access to a database hosted locally or on the cloud. Some popular databases you can work with are the following:
There are many more but I've tried to list down the popular ones.
To show side by side examples, however, I am not using any database. I have using a SQL engine called pandassql. So to run these examples, you will have to read some data into pandas dataframe and then use pandasql to run SQL queries.
So the steps are as follows:
- !pip install -U pandassql
- import pandas as pd
- from pandassql import sqldf
- data = pd.read_csv()
- Run these examples
SELECTing Data
Once you import the data, select 5 rows from the data just to inspects. Below is an example of how to select top 5 rows in pandas:
data.head()
ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN |
1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN |
2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | NaN |
3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NaN |
You can achieve the same result in SQL:
sqldf("SELECT * FROM data LIMIT 5")
ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | None |
1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | None |
2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | None |
3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | None |
Both pandas and SQL allow you to select specific columns from the dataframe(pandas) or table(SQL).
data[['Name', 'Age', 'Team', 'Year']].head()
Name | Age | Team | Year | |
---|---|---|---|---|
0 | A Dijiang | 24.0 | China | 1992 |
1 | A Lamusi | 23.0 | China | 2012 |
2 | Gunnar Nielsen Aaby | 24.0 | Denmark | 1920 |
3 | Edgar Lindenau Aabye | 34.0 | Denmark/Sweden | 1900 |
4 | Christine Jacoba Aaftink | 21.0 | Netherlands | 1988 |
sqldf('SELECT Name, Age, Team, Year FROM data LIMIT 5')
Name | Age | Team | Year | |
---|---|---|---|---|
0 | A Dijiang | 24.0 | China | 1992 |
1 | A Lamusi | 23.0 | China | 2012 |
2 | Gunnar Nielsen Aaby | 24.0 | Denmark | 1920 |
3 | Edgar Lindenau Aabye | 34.0 | Denmark/Sweden | 1900 |
4 | Christine Jacoba Aaftink | 21.0 | Netherlands | 1988 |
Filtering with constraints
iloc
property of a pandas dataframe allows you to select a row by its index as shown below. Here we are particulary selecting the 6th row by applying the filter on index.
data.iloc[6]
ID 5
Name Christine Jacoba Aaftink
Sex F
Age 25.0
Height 185.0
Weight 82.0
Team Netherlands
NOC NED
Games 1992 Winter
Year 1992
Season Winter
City Albertville
Sport Speed Skating
Event Speed Skating Women's 500 metres
Medal NaN
Name: 6, dtype: object
In SQL most databases provide a rowid index to select a specific row from the table.
sqldf("SELECT * FROM data WHERE rowid=6")
ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 1,000 metres | None |
You can filter on a condition applied on column or multiple columns. In case of multiple columns, the conditions are combined with boolen expressions such as AND, OR etc.
Here is how to count records which satisfy the condition "Age of the athelete is greater than 30".
data[data['Age']>30]['ID'].count()
42107
sqldf('SELECT COUNT(*) FROM data WHERE Age > 30')
COUNT(*) | |
---|---|
0 | 42107 |
Below is an example of multiple conditions. We are selecting all records of Hockey
for team India
.
data[(data['Team'] == 'India') & (data['Sport']=='Hockey')]
ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2513 | 1415 | Shakeel Ahmed | M | 21.0 | NaN | NaN | India | IND | 1992 Summer | 1992 | Summer | Barcelona | Hockey | Hockey Men's Hockey | NaN |
4289 | 2453 | Anil Alexander Aldrin | M | 24.0 | NaN | NaN | India | IND | 1996 Summer | 1996 | Summer | Atlanta | Hockey | Hockey Men's Hockey | NaN |
4732 | 2699 | Shaukat Ali | M | 30.0 | NaN | NaN | India | IND | 1928 Summer | 1928 | Summer | Amsterdam | Hockey | Hockey Men's Hockey | Gold |
4735 | 2702 | Syed Ali | M | 19.0 | 169.0 | 60.0 | India | IND | 1976 Summer | 1976 | Summer | Montreal | Hockey | Hockey Men's Hockey | NaN |
4736 | 2703 | Syed Mushtaq Ali | M | 22.0 | 165.0 | 61.0 | India | IND | 1964 Summer | 1964 | Summer | Tokyo | Hockey | Hockey Men's Hockey | Gold |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
253354 | 126867 | Sunil Sowmarpet Vitalacharya | M | 27.0 | 176.0 | 68.0 | India | IND | 2016 Summer | 2016 | Summer | Rio de Janeiro | Hockey | Hockey Men's Hockey | NaN |
256143 | 128249 | Devindar Sunil Walmiki | M | 24.0 | 178.0 | 69.0 | India | IND | 2016 Summer | 2016 | Summer | Rio de Janeiro | Hockey | Hockey Men's Hockey | NaN |
263723 | 131974 | William Xalco | M | 20.0 | 167.0 | 60.0 | India | IND | 2004 Summer | 2004 | Summer | Athina | Hockey | Hockey Men's Hockey | NaN |
264075 | 132142 | Renuka Yadav | F | 22.0 | 159.0 | 53.0 | India | IND | 2016 Summer | 2016 | Summer | Rio de Janeiro | Hockey | Hockey Women's Hockey | NaN |
266934 | 133554 | Sayed Muhammad Yusuf | M | NaN | NaN | NaN | India | IND | 1928 Summer | 1928 | Summer | Amsterdam | Hockey | Hockey Men's Hockey | Gold |
345 rows à 15 columns
sqldf('SELECT * FROM data WHERE Team="India" AND Sport="Hockey"')
ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1415 | Shakeel Ahmed | M | 21.0 | NaN | NaN | India | IND | 1992 Summer | 1992 | Summer | Barcelona | Hockey | Hockey Men's Hockey | None |
1 | 2453 | Anil Alexander Aldrin | M | 24.0 | NaN | NaN | India | IND | 1996 Summer | 1996 | Summer | Atlanta | Hockey | Hockey Men's Hockey | None |
2 | 2699 | Shaukat Ali | M | 30.0 | NaN | NaN | India | IND | 1928 Summer | 1928 | Summer | Amsterdam | Hockey | Hockey Men's Hockey | Gold |
3 | 2702 | Syed Ali | M | 19.0 | 169.0 | 60.0 | India | IND | 1976 Summer | 1976 | Summer | Montreal | Hockey | Hockey Men's Hockey | None |
4 | 2703 | Syed Mushtaq Ali | M | 22.0 | 165.0 | 61.0 | India | IND | 1964 Summer | 1964 | Summer | Tokyo | Hockey | Hockey Men's Hockey | Gold |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
340 | 126867 | Sunil Sowmarpet Vitalacharya | M | 27.0 | 176.0 | 68.0 | India | IND | 2016 Summer | 2016 | Summer | Rio de Janeiro | Hockey | Hockey Men's Hockey | None |
341 | 128249 | Devindar Sunil Walmiki | M | 24.0 | 178.0 | 69.0 | India | IND | 2016 Summer | 2016 | Summer | Rio de Janeiro | Hockey | Hockey Men's Hockey | None |
342 | 131974 | William Xalco | M | 20.0 | 167.0 | 60.0 | India | IND | 2004 Summer | 2004 | Summer | Athina | Hockey | Hockey Men's Hockey | None |
343 | 132142 | Renuka Yadav | F | 22.0 | 159.0 | 53.0 | India | IND | 2016 Summer | 2016 | Summer | Rio de Janeiro | Hockey | Hockey Women's Hockey | None |
344 | 133554 | Sayed Muhammad Yusuf | M | NaN | NaN | NaN | India | IND | 1928 Summer | 1928 | Summer | Amsterdam | Hockey | Hockey Men's Hockey | Gold |
345 rows à 15 columns
SORTING
Sorting is another common operation required during data analysis. You can sort on one or more columns and in ascending (low to high) or descending (high to low) order.
data.sort_values(by='Age', ascending=False).head()
ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
257054 | 128719 | John Quincy Adams Ward | M | 97.0 | NaN | NaN | United States | USA | 1928 Summer | 1928 | Summer | Amsterdam | Art Competitions | Art Competitions Mixed Sculpturing, Statues | NaN |
98118 | 49663 | Winslow Homer | M | 96.0 | NaN | NaN | United States | USA | 1932 Summer | 1932 | Summer | Los Angeles | Art Competitions | Art Competitions Mixed Painting, Unknown Event | NaN |
60863 | 31173 | Thomas Cowperthwait Eakins | M | 88.0 | NaN | NaN | United States | USA | 1932 Summer | 1932 | Summer | Los Angeles | Art Competitions | Art Competitions Mixed Painting, Unknown Event | NaN |
60861 | 31173 | Thomas Cowperthwait Eakins | M | 88.0 | NaN | NaN | United States | USA | 1932 Summer | 1932 | Summer | Los Angeles | Art Competitions | Art Competitions Mixed Painting, Unknown Event | NaN |
60862 | 31173 | Thomas Cowperthwait Eakins | M | 88.0 | NaN | NaN | United States | USA | 1932 Summer | 1932 | Summer | Los Angeles | Art Competitions | Art Competitions Mixed Painting, Unknown Event | NaN |
sqldf('SELECT * from data ORDER BY Age DESC LIMIT 5')
ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 128719 | John Quincy Adams Ward | M | 97.0 | None | None | United States | USA | 1928 Summer | 1928 | Summer | Amsterdam | Art Competitions | Art Competitions Mixed Sculpturing, Statues | None |
1 | 49663 | Winslow Homer | M | 96.0 | None | None | United States | USA | 1932 Summer | 1932 | Summer | Los Angeles | Art Competitions | Art Competitions Mixed Painting, Unknown Event | None |
2 | 31173 | Thomas Cowperthwait Eakins | M | 88.0 | None | None | United States | USA | 1932 Summer | 1932 | Summer | Los Angeles | Art Competitions | Art Competitions Mixed Painting, Unknown Event | None |
3 | 31173 | Thomas Cowperthwait Eakins | M | 88.0 | None | None | United States | USA | 1932 Summer | 1932 | Summer | Los Angeles | Art Competitions | Art Competitions Mixed Painting, Unknown Event | None |
4 | 31173 | Thomas Cowperthwait Eakins | M | 88.0 | None | None | United States | USA | 1932 Summer | 1932 | Summer | Los Angeles | Art Competitions | Art Competitions Mixed Painting, Unknown Event | None |
Joining
Joining one dataset with another is a common operation while working with data. This is a simple aspect of SQL which stumps more people than it should.
I reckon, it is important to work with a simple example using few rows of data to get this right. That is what I am going to do using synthetic tables employee
and assets
with dummy data.
Please notice the column we are joining on, usually termed as key. If the key has duplicate values, the records multiply.
EmployeeID | DepartmentID | Name | |
---|---|---|---|
0 | 100 | 100 | John Doe |
1 | 101 | 100 | Will green |
2 | 102 | 200 | Wilson Miner |
3 | 104 | 200 | Rochel Dmello |
4 | 105 | 300 | Dickie Bird |
AssetID | EmployeeID | DepartmentID | |
---|---|---|---|
0 | aab | 100 | 100 |
1 | aba | 101 | 100 |
2 | baa | 201 | 200 |
3 | cde | 202 | 200 |
4 | efg | 103 | 300 |
Inner Join
The simplest type of join is the inner
join.
This operation returns the common records from the left table (employee) and the right table (asset).
Pandas uses merge method to join to tables.
employee.merge(asset, on='EmployeeID')
\
EmployeeID | DepartmentID_x | Name | AssetID | DepartmentID_y | |
---|---|---|---|---|---|
0 | 100 | 100 | John Doe | aab | 100 |
1 | 101 | 100 | Will green | aba | 100 |
sqldf('SELECT * FROM employee E JOIN asset A ON E.EmployeeID = A.EmployeeID')
EmployeeID | DepartmentID | Name | AssetID | EmployeeID | DepartmentID | |
---|---|---|---|---|---|---|
0 | 100 | 100 | John Doe | aab | 100 | 100 |
1 | 101 | 100 | Will green | aba | 101 | 100 |
Left Outer Join
In Left Outer Join, you get all records from the left table, and only matching records from the right table.
Note the NaN
and None
values resulting in rows for which key doesn't match.
employee.merge(asset, on='EmployeeID', how='left')
EmployeeID | DepartmentID_x | Name | AssetID | DepartmentID_y | |
---|---|---|---|---|---|
0 | 100 | 100 | John Doe | aab | 100 |
1 | 101 | 100 | Will green | aba | 100 |
2 | 102 | 200 | Wilson Miner | NaN | NaN |
3 | 104 | 200 | Rochel Dmello | NaN | NaN |
4 | 105 | 300 | Dickie Bird | NaN | NaN |
sqldf('SELECT E.*, A.* FROM employee E LEFT JOIN asset A ON E.EmployeeID = A.EmployeeID')
EmployeeID | DepartmentID | Name | AssetID | EmployeeID | DepartmentID | |
---|---|---|---|---|---|---|
0 | 100 | 100 | John Doe | aab | 100.0 | 100 |
1 | 101 | 100 | Will green | aba | 101.0 | 100 |
2 | 102 | 200 | Wilson Miner | None | NaN | None |
3 | 104 | 200 | Rochel Dmello | None | NaN | None |
4 | 105 | 300 | Dickie Bird | None | NaN | None |
Right Outer Join
Right join has the opposite effect of the left join.
employee.merge(asset, on='EmployeeID', how='right')
EmployeeID | DepartmentID_x | Name | AssetID | DepartmentID_y | |
---|---|---|---|---|---|
0 | 100 | 100.0 | John Doe | aab | 100 |
1 | 101 | 100.0 | Will green | aba | 100 |
2 | 201 | NaN | NaN | baa | 200 |
3 | 202 | NaN | NaN | cde | 200 |
4 | 103 | NaN | NaN | efg | 300 |
In fact pandassql doesn't support Right Join but you can achieve the same result by calling asset
as the left table and employee
as the right table.
sqldf('SELECT E.*, A.* FROM asset A LEFT JOIN employee E ON A.EmployeeID = E.EmployeeID')
EmployeeID | DepartmentID | Name | AssetID | EmployeeID | DepartmentID | |
---|---|---|---|---|---|---|
0 | 100.0 | 100.0 | John Doe | aab | 100 | 100 |
1 | 101.0 | 100.0 | Will green | aba | 101 | 100 |
2 | NaN | NaN | None | baa | 201 | 200 |
3 | NaN | NaN | None | cde | 202 | 200 |
4 | NaN | NaN | None | efg | 103 | 300 |
Another type of join pandasql doesn't support is the Full outer join, which returns all matching as well as not matching records from both the tables.
You need not worry about these missing type of joins in pandassql, all database that I have worked with support Right Join and Full Outer Join. With some trickery you can even achieve these in pandassql but it's not required for this conversation.
Queries with aggregates
Aggregates are another SQL technique where you group your records based on one or more columns and perform some aggregate calculation on another set of columns such as MIN, MAX, SUM, COUNT, AVG, and so on.
Here is an example of count where we count Medal
s by team, sort them in descending order and show top % results.
data.groupby('Team', as_index=False)['Medal'].count().sort_values('Medal', ascending=False).head()
Team | Medal | |
---|---|---|
1095 | United States | 5219 |
976 | Soviet Union | 2451 |
398 | Germany | 1984 |
412 | Great Britain | 1673 |
361 | France | 1550 |
sqldf('SELECT Team, COUNT(Medal) FROM data GROUP BY Team ORDER BY COUNT(Medal) DESC LIMIT 5')
Team | COUNT(Medal) | |
---|---|---|
0 | United States | 5219 |
1 | Soviet Union | 2451 |
2 | Germany | 1984 |
3 | Great Britain | 1673 |
4 | France | 1550 |
Here is another example where we filter just one sport swimming
and do the earlier calculation.
data[data['Sport']=='Swimming'].groupby('Team', as_index=False)['Medal'].count().sort_values('Medal', ascending=False).head()
Team | Medal | |
---|---|---|
197 | United States | 1066 |
10 | Australia | 401 |
56 | East Germany | 152 |
70 | Germany | 152 |
72 | Great Britain | 127 |
sqldf('SELECT Team, COUNT(Medal) FROM data Where Sport = "Swimming" GROUP BY Team ORDER BY COUNT(Medal) DESC LIMIT 5')
Team | COUNT(Medal) | |
---|---|---|
0 | United States | 1066 |
1 | Australia | 401 |
2 | Germany | 152 |
3 | East Germany | 152 |
4 | Great Britain | 127 |
There is a lot more you can do with SQL, like filtering on aggregated data usin HAVING clause. Partition data by row, rank or dense rank. Use analytical functions to compute values across rows or over a window.
There is even more that you can do with pandas. Here I am listing some resources if you're ready for a deep dive.
Top comments (0)