Python dictionaries and Pandas DataFrames are two of the most frequently used data structures for working with data. The Pandas DataFrame is a popular, standard data structure for working with tabular data in advanced data analysis. In this article, we will get hands-on practice with how to
- create,
- manipulate,
- select,
- add,
- update,
- delete
data in dictionaries and dataframes.
List
First, let's talk about the basic Python data type: list. Imagine that we work for the World Bank and want to keep track of the population of each country.
Let's say we have 2021 population data of each country:
- India(1,393,409,030),
- Burma(54,806,010),
- Thailand(69,950,840),
- Singapore(5,453,570), and so on.
These data are based on Population Data | The World Bank.
To keep track of which population belongs to which country, we create two lists as follows, with the names of the countries in the same order as the populations.
# lists
countries = ['India', 'Burma', 'Thailand', 'Singapore']
populations = [1393409030, 54806010, 69950840, 5453570]
Now suppose that we want to get the population of Burma. First, we have to figure out where in the list Burma is, so that we can use this position to get the correct population. We will use the index() method to get the index.
burma_index = countries.index('Burma')
print(burma_index)
Output:
1
We get 1 as the index of 'Burma' because Python list indexes start at 0. Now we can use this index to subset the populations list and get the population corresponding to Burma.
print(populations[burma_index])
output:
54806010
As expected, we get 54806010, the population of Burma.
Motivation for Dictionaries
So we built two lists and used the index to connect corresponding elements in both lists. It worked, but it's a clumsy approach: it's neither convenient nor intuitive. Wouldn't it be easier if we had a way to connect each country directly to its population, without using an index?
Dictionary
This is where the "dictionary" comes into play. Let's convert this population data to a dictionary. To create the dictionary, we need curly brackets. Inside the curly brackets, we have a bunch of what are called key:value pairs.
my_dict = {
"key1":"value1",
"key2":"value2",
}
In our case,
- the keys are the country names, and
- the values are the corresponding populations.
The first key is India, and its corresponding value is 1,393,409,030. Notice the colon that separates the key and value here. Let's do the same thing for the three other key-value pairs, and store the dictionary under the name country_population.
country_population = {'India':1393409030, 'Burma':54806010, 'Thailand':69950840, 'Singapore':5453570}
If we want to find the population for Burma, we simply type country_population, and then the string "Burma" inside square brackets.
print(country_population["Burma"])
output:
54806010
In other words, we pass the key in square brackets, and we get the corresponding value. This approach is not only intuitive, it's also very efficient, because Python can make the lookup of these keys very fast, even for huge dictionaries.
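If the data already sits in two parallel lists, as in the List section, one possible shortcut is to pair them up with the built-in zip() and dict() functions. The sketch below reuses the countries and populations lists from earlier and assumes both lists have the same length and order.

```python
# Parallel lists from the List section
countries = ['India', 'Burma', 'Thailand', 'Singapore']
populations = [1393409030, 54806010, 69950840, 5453570]

# zip() pairs each country with the population at the same position;
# dict() turns those pairs into key:value entries
country_population = dict(zip(countries, populations))

print(country_population['Burma'])
```

After this, no index bookkeeping is needed: the country name itself is the lookup key.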
Create a Dictionary
We will create a dictionary from the countries and capitals data, where the country names are the keys and the capitals are the corresponding values.
- With the strings in countries and capitals, create a dictionary called asia with 4 key:value pairs. Beware of capitalization! Strings in the code are case-sensitive.
- Print out asia to see if the result is what we expected.
# From the strings in countries and capitals, create a dictionary called asia
asia = {'India':'New Delhi', 'Burma':'Yangon', 'Thailand':'Bangkok', 'Singapore':'Singapore'}
# Print
print(asia)
# Print type of asia
print(type(asia))
output:
{'India': 'New Delhi', 'Burma': 'Yangon', 'Thailand': 'Bangkok', 'Singapore': 'Singapore'}
<class 'dict'>
Great! <class 'dict'> means that asia is an instance of the dict class, in other words a dictionary. Classes are outside this article's scope, and we will explain them in another article that focuses on classes. Now that we've built our first dictionary, let's see how to work with it.
Manipulating a Dictionary
If the keys of a dictionary are chosen wisely, accessing the values in a dictionary is easy and intuitive. For example, to get the capital for India from asia, we can use India as the key.
print(asia['India'])
output:
New Delhi
We can check which keys are in asia by calling the keys() method on asia.
# Print out the keys in asia
print(asia.keys())
# Print out value that belongs to key 'Burma'
print(asia['Burma'])
output:
dict_keys(['India', 'Burma', 'Thailand', 'Singapore'])
Yangon
Earlier, we created the dictionary country_population, which is basically a set of key:value pairs. We can easily access the population of Burma by passing the key in square brackets, like this.
country_population = {'India':1393409030, 'Burma':54806010, 'Thailand':69950840, 'Singapore':5453570}
print(country_population['Burma'])
output:
54806010
Note: For this lookup to work properly, the keys in a dictionary should be unique.
If we try to add another key:value pair to country_population with the same key, Burma, for example,
country_population = {'India':1393409030, 'Burma':54806010, 'Thailand':69950840, 'Singapore':5453570, 'Burma':54800000,}
we'll see that the resulting country_population dictionary still contains four pairs. The last pair ('Burma':54800000) that we specified in the curly brackets was kept in the resulting dictionary.
country_population = {'India':1393409030, 'Burma':54806010, 'Thailand':69950840, 'Singapore':5453570, 'Burma':54800000,}
print(country_population)
output:
{'India': 1393409030, 'Burma': 54800000, 'Thailand': 69950840, 'Singapore': 5453570}
Let's see how we can add more data to a dictionary that already exists.
Add data to a Dictionary
Our country_population dictionary currently does not have China's data. We want to add "China":1412360000 to country_population.
# Before adding China data
country_population = {'India':1393409030, 'Burma':54806010, 'Thailand':69950840, 'Singapore':5453570}
print(country_population)
output:
{'India': 1393409030, 'Burma': 54806010, 'Thailand': 69950840, 'Singapore': 5453570}
To add this information, simply write the key "China" in square brackets and assign the population 1412360000 to it with the equals sign.
# After adding China data
country_population["China"] = 1412360000
print(country_population)
output:
{'India': 1393409030, 'Burma': 54806010, 'Thailand': 69950840, 'Singapore': 5453570, 'China': 1412360000}
Now if we check out country_population again, China is indeed in there. To check this with code, we can also write 'China' in country_population, which gives us True if the key China is in there. Note that China is a string and is case-sensitive.
print('China' in country_population)
output:
True
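Because keys are case-sensitive, a lookup with the wrong capitalization fails. As a small illustration (the trimmed-down dictionary and the keys 'Vietnam' and 'Laos' are only for this example), square brackets raise a KeyError for a missing key, while the dictionary's get() method returns None or a fallback value that we supply.

```python
country_population = {'India': 1393409030, 'Burma': 54806010, 'China': 1412360000}

# Square brackets raise KeyError when the key is missing;
# 'china' (lowercase) is not the same key as 'China'
try:
    print(country_population['china'])
except KeyError as err:
    print('Missing key:', err)

# get() returns None for a missing key instead of raising an error
vietnam_population = country_population.get('Vietnam')

# get() can also return a fallback value that we choose
laos_population = country_population.get('Laos', 0)

print(vietnam_population, laos_population)
```

Using in or get() before a lookup is one way to avoid an unexpected KeyError when the keys come from user input or a file.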
Update data in a Dictionary
With the syntax dict_name[key]=value, we can also change values, for example, to update the population of China to 1412000000. Because each key in a dictionary is unique, Python knows that we're not trying to create a new pair, but want to update the pair that's already in there.
country_population["China"] = 1412000000
print(country_population)
output:
{'India': 1393409030, 'Burma': 54806010, 'Thailand': 69950840, 'Singapore': 5453570, 'China': 1412000000}
Delete data from a Dictionary
Suppose now that we want to remove China from the dictionary. We can do this with del, again pointing to China inside square brackets. If we print country_population again, China is no longer in our dictionary.
del country_population['China']
print(country_population)
output:
{'India': 1393409030, 'Burma': 54806010, 'Thailand': 69950840, 'Singapore': 5453570}
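As an alternative to del, the dictionary's pop() method removes a key and also hands back the value that was stored under it. The short sketch below uses a trimmed-down country_population for illustration.

```python
country_population = {'India': 1393409030, 'Burma': 54806010, 'China': 1412000000}

# pop() removes the key and returns its value
removed = country_population.pop('China')
print(removed)

# Popping the same key again would raise KeyError;
# passing a default (here None) avoids that
missing = country_population.pop('China', None)
print(missing)

print(country_population)
```

This can be handy when the removed value is still needed afterwards, for example to move it into another dictionary.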
List vs Dictionary
Using lists and dictionaries is pretty similar. We can select, update and remove values with square brackets. There are some big differences though. The list is a sequence of values that are indexed by a range of numbers. The dictionary, on the other hand, is indexed by unique keys.
| | List | Dictionary |
---|---|---|
Select, update, remove | use [] | use [] |
Indexed by | range of numbers | unique keys |
Use when | a collection of values, order matters, selecting entire subsets | a lookup table with unique keys |
When to use which one? Well, if we have a collection of values where the order matters, and we want to easily select entire subsets of data, we'll want to go with a list.
If, on the other hand, we need some sort of look up table, where looking for data should be fast and where we can specify unique keys, a dictionary is the preferred option.
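A small side-by-side sketch of the two access styles, reusing values from earlier in the article:

```python
populations = [1393409030, 54806010, 69950840, 5453570]
country_population = {'India': 1393409030, 'Burma': 54806010}

# List: access by position, and slices select whole subsets
first_two = populations[:2]
print(first_two)

# Dictionary: access by key, no position needed
burma = country_population['Burma']
print(burma)
```

The list version depends on knowing where a value sits; the dictionary version depends only on knowing its key.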
Nested Dictionaries
Remember lists? They could contain anything, even other lists. Well, for dictionaries the same holds. Dictionaries can contain key:value pairs where the values are again dictionaries.
As an example, have a look at the code below, which defines another version of asia, the dictionary we've been working with all along. The keys are still the country names, but the values are dictionaries that contain more information than just the capital.
# Dictionary of dictionaries
asia = {'India': {'capital':'New Delhi', 'population':1393409030},
'Burma': {'capital':'Yangon', 'population':54806010},
'Thailand': {'capital':'Bangkok', 'population':69950840},
'Singapore': {'capital':'Singapore', 'population':5453570},
}
It's perfectly possible to chain square brackets to select elements. To fetch the population for Burma from asia,
print(asia['Burma']['population'])
output:
54806010
- Use chained square brackets to select and print out the capital of Burma.
# Print out the capital of Burma
print(asia['Burma']['capital'])
output:
Yangon
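Nested dictionaries can also be looped over and extended. The sketch below uses a shortened asia dictionary; items() yields each country together with its inner dictionary, and a new country is added by assigning a whole inner dictionary to a new key.

```python
asia = {
    'India': {'capital': 'New Delhi', 'population': 1393409030},
    'Burma': {'capital': 'Yangon', 'population': 54806010},
}

# items() yields each key together with its inner dictionary
for country, info in asia.items():
    print(country, info['capital'], info['population'])

# Add a new country by assigning a whole inner dictionary
asia['Thailand'] = {'capital': 'Bangkok', 'population': 69950840}
print(asia['Thailand']['capital'])
```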
Great! It's time to learn about a new data structure!
Tabular dataset examples
As data scientists, we'll often be working with tons of data. The form of this data can vary greatly, but we can usually bring it down to a tabular structure, that is, the form of a table like in a spreadsheet. Let's have a look at some examples.
Suppose we're working in a chemical factory and have a ton of temperature measurements to analyze. This data can come in the following form:
temperature | measured at | location |
---|---|---|
76 | 2021-03-01 12:00:01 | chamber 1 |
86 | 2021-03-01 12:00:01 | chamber 2 |
72 | 2021-03-01 12:00:01 | chamber 1 |
88 | 2021-03-01 12:00:01 | chamber 2 |
- every row is a measurement, or an observation, and
- columns are different variables.
For each measurement, there is the temperature, but also the date and time of the measurement, and the location.
Another example: we have information on India, Burma, Thailand and so on. We can again build a table with this data.
Country | Capital | Population |
---|---|---|
India | New Delhi | 1393409030 |
Burma | Yangon | 54806010 |
Thailand | Bangkok | 69950840 |
Singapore | Singapore | 5453570 |
China | Beijing | 1412360000 |
Each row is an observation and represents a country. Each observation has the same variables: the country name, the capital and the population.
Datasets in Python
To start working on this data in Python, we'll need some kind of rectangular data structure. How about the 2D NumPy array? Well, it's an option, but not necessarily the best one. The columns hold different data types, and a NumPy array stores all of its elements with a single data type, so it is not great at handling this.
Datasets containing different data types
In the above data, the country and capital are string types while the population is an integer type. Our datasets will typically comprise different data types, so we need a tool that's better suited. To easily and efficiently handle this data, there's the Pandas package.
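To make the difference concrete, here is a minimal sketch (the variable names row and df are only for illustration). Putting a row of mixed values into a NumPy array converts everything to one type, here text, while a DataFrame keeps a separate type for each column.

```python
import numpy as np
import pandas as pd

# A NumPy array holds one data type, so the number is converted to text
row = np.array(['India', 'New Delhi', 1393409030])
print(row.dtype.kind)  # 'U' means Unicode string

# A DataFrame keeps a separate data type per column
df = pd.DataFrame({'country': ['India'], 'population': [1393409030]})
print(df.dtypes)
```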
Pandas
Pandas is
- an open source library,
- built on the NumPy package,
- a provider of easy-to-use data structures, and
- a high-level data manipulation tool,
which makes it very interesting for data scientists all over the world. In Pandas, we store tabular data in an object called a DataFrame. Have a look at the Pandas DataFrame version of the data:
DataFrame
| | Country | Capital | Population |
---|---|---|---|
IND | India | New Delhi | 1393409030 |
MMR | Myanmar | Yangon | 54806010 |
THA | Thailand | Bangkok | 69950840 |
SGP | Singapore | Singapore | 5453570 |
CHN | China | Beijing | 1412360000 |
The rows represent the observations, and the columns represent the variables. Also notice that each row has a unique row label: IND for India, MMR for Myanmar, and so on. The columns, or variables, also have labels: country, capital, and so on. Notice that the values in the different columns have different types. But how can we create this DataFrame in the first place? Well, there are different ways.
Create a DataFrame from Dictionary
First of all, we can build it manually, starting from a dictionary. Using the distinctive curly brackets, we create key value pairs. The keys are the column labels, and the values are the corresponding columns, in list form.
asia_dict = {
'country':['India', 'Myanmar', 'Thailand', 'Singapore', 'China'],
'capital':['New Delhi', 'Yangon', 'Bangkok', 'Singapore', 'Beijing'],
'population':[1393409030,54806010,69950840, 5453570, 1412360000]
}
After importing the pandas package as pd, we can create a DataFrame from the dictionary using pd.DataFrame.
import pandas as pd
asia_df = pd.DataFrame(asia_dict)
print(type(asia_df))
print(asia_df)
output:
<class 'pandas.core.frame.DataFrame'>
country capital population
0 India New Delhi 1393409030
1 Myanmar Yangon 54806010
2 Thailand Bangkok 69950840
3 Singapore Singapore 5453570
4 China Beijing 1412360000
If we check out asia_df now, we see that Pandas assigned some automatic row labels, 0 up to 4. To specify them manually, we can set the index attribute of asia_df to a list with the correct labels.
asia_df.index = ['IND', 'MMR', 'THA', 'SGP', 'CHN']
print(asia_df)
output:
country capital population
IND India New Delhi 1393409030
MMR Myanmar Yangon 54806010
THA Thailand Bangkok 69950840
SGP Singapore Singapore 5453570
CHN China Beijing 1412360000
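As a variation, the row labels can also be supplied when the DataFrame is created, through the index parameter of pd.DataFrame. The sketch below should give the same labeled DataFrame in one step.

```python
import pandas as pd

asia_dict = {
    'country': ['India', 'Myanmar', 'Thailand', 'Singapore', 'China'],
    'capital': ['New Delhi', 'Yangon', 'Bangkok', 'Singapore', 'Beijing'],
    'population': [1393409030, 54806010, 69950840, 5453570, 1412360000],
}

# index= sets the row labels at creation time;
# the list must have one label per row
asia_df = pd.DataFrame(asia_dict, index=['IND', 'MMR', 'THA', 'SGP', 'CHN'])
print(asia_df)
```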
The resulting asia_df DataFrame is the same one as we saw before. Using a dictionary approach is fine, but what if we're working with tons of data, which is typically the case for a data scientist? Well, we won't build the DataFrame manually. Instead, we import data from an external file that contains all this data.
Create a DataFrame from CSV file
Suppose the countries' data that we used before comes in the form of a CSV file called countries.csv. CSV is short for comma separated values. The countries.csv file used in this article can be downloaded at this link.
Let's try to import this data using the Pandas read_csv function. We pass the path to the CSV file as an argument.
# Forward slashes avoid backslash escapes such as \t (tab) inside the string
countries = pd.read_csv('path/to/countries.csv')
print(countries)
output:
Unnamed: 0 country capital population
0 IND India New Delhi 1393409030
1 MMR Myanmar Yangon 54806010
2 THA Thailand Bangkok 69950840
3 SGP Singapore Singapore 5453570
4 CHN China Beijing 1412360000
If we print countries, there's still something wrong. The row labels are seen as a column. To solve this, we'll have to tell the read_csv function that the first column contains the row indexes. We do this by setting the index_col argument, like this.
# index_col=0 uses the first column of the file as the row labels
countries = pd.read_csv('path/to/countries.csv', index_col=0)
print(countries)
output:
country capital population
IND India New Delhi 1393409030
MMR Myanmar Yangon 54806010
THA Thailand Bangkok 69950840
SGP Singapore Singapore 5453570
CHN China Beijing 1412360000
This time countries nicely contains the row and column labels. The read_csv function features many more arguments that allow us to customize our data importing. Check out its documentation for more details.
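To try index_col without downloading anything, the sketch below feeds read_csv an in-memory stand-in for the file through io.StringIO. The text in csv_text is an assumption about what countries.csv looks like (a header row whose first column name is empty), based on the output shown above; the real file may differ.

```python
import io
import pandas as pd

# Assumed stand-in for countries.csv: the first column has no header name
csv_text = """,country,capital,population
IND,India,New Delhi,1393409030
MMR,Myanmar,Yangon,54806010
THA,Thailand,Bangkok,69950840
"""

# read_csv accepts a file path or a file-like object such as StringIO
countries = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(countries)
```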
Indexing and selecting data in DataFrames
Indexing makes accessing columns, rows and single elements in our DataFrame easy. There are numerous ways in which we can index and select data from DataFrames. We're going to see how to use
- square brackets [], and
- the advanced data access methods loc and iloc,
that make Pandas extra powerful.
Access data using square brackets [ ]
Suppose that we only want to select the country column from countries. How do we do this with square brackets? Well, we type countries, and then the column label inside square brackets. Python prints out the entire column, together with the row labels.
print(countries['country'])
output:
IND India
MMR Myanmar
THA Thailand
SGP Singapore
CHN China
Name: country, dtype: object
But there's something strange here. The last line says Name: country, dtype: object. We're clearly not dealing with a regular DataFrame here. Let's find out the type of the object that gets returned, with the type function as follows.
print(type(countries['country']))
output:
<class 'pandas.core.series.Series'>
So we're dealing with a Pandas Series here. In a simplified sense, we can think of the Series as a 1-dimensional array that can be labeled, just like the DataFrame. If we put together a bunch of Series, we can create a DataFrame.
If we want to select the country column but keep the data in a DataFrame, we'll need double square brackets, like this.
print(countries[['country']])
output:
country
IND India
MMR Myanmar
THA Thailand
SGP Singapore
CHN China
If we check out the type of this result, we will see it is DataFrame type.
print(type(countries[['country']]))
output:
<class 'pandas.core.frame.DataFrame'>
Note that the single bracket version gives a Pandas Series, the double bracket version gives a Pandas DataFrame.
We can easily extend this call to select two columns, country and capital, for example. If we look at it from a different angle, we're actually putting a list of column labels inside another set of square brackets, and we end up with a sub DataFrame containing only the country and capital columns.
print(countries[['country', 'capital']])
output:
country capital
IND India New Delhi
MMR Myanmar Yangon
THA Thailand Bangkok
SGP Singapore Singapore
CHN China Beijing
We can also use the same square brackets to select rows from a DataFrame. The way to do it is by specifying a slice. To get the second and third rows of countries, we use the slice 1:3. Remember that the end of the slice is exclusive and that the index starts at zero.
print(countries[1:3])
output:
country capital population
MMR Myanmar Yangon 54806010
THA Thailand Bangkok 69950840
Square brackets work, but they offer only limited functionality. Ideally, we'd want something similar to 2D NumPy arrays.
To do a similar thing with Pandas, we have 2 ways.
- loc is label-based, which means that we have to specify rows and columns based on their row and column labels.
- iloc is integer index based, which means that we have to specify rows and columns by their integer index.
Let's start with loc first.
Access data using loc
Let's have another look at the countries DataFrame, and try to get the row for Myanmar. We put the label of the row of interest in square brackets after loc.
print(countries.loc['MMR'])
output:
country Myanmar
capital Yangon
population 54806010
Name: MMR, dtype: object
We get a Pandas Series, containing all the row's information, rather inconveniently shown on different lines.
To get a DataFrame, we have to put the 'MMR' string inside another pair of brackets.
print(countries.loc[['MMR']])
output:
country capital population
MMR Myanmar Yangon 54806010
Selecting Rows using loc
We can also select multiple rows at the same time. Suppose we want to also include India and Thailand. Simply add some more row labels to the list.
print(countries.loc[['MMR', 'IND', 'THA']])
output:
country capital population
MMR Myanmar Yangon 54806010
IND India New Delhi 1393409030
THA Thailand Bangkok 69950840
So far we have only selected entire rows, which is something we could also do, for consecutive rows, with a slice in basic square brackets. The difference here is that we can extend our selection with a comma and a specification of the columns of interest.
Selecting Rows & Columns using loc
Let's extend the previous call to only include the country and capital columns. We add a comma, and a list of column labels we want to keep.
print(countries.loc[['MMR', 'IND', 'THA'], ['country', 'capital']])
output:
country capital
MMR Myanmar Yangon
IND India New Delhi
THA Thailand Bangkok
The intersection gets returned.
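When both the row and the column are given as plain labels rather than lists, loc returns the single value at that intersection instead of a DataFrame. A minimal sketch, with a two-row countries DataFrame built inline for illustration:

```python
import pandas as pd

countries = pd.DataFrame(
    {'country': ['India', 'Myanmar'], 'capital': ['New Delhi', 'Yangon']},
    index=['IND', 'MMR'],
)

# Plain labels (no lists) on both sides return the single cell value
capital = countries.loc['MMR', 'capital']
print(capital)
print(type(capital))
```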
Selecting Columns using loc
We can also use loc to select all rows but only a specific number of columns. Simply replace the first list that specifies the row labels with a colon, a slice going from beginning to end.
print(countries.loc[:, ['country', 'capital']])
output:
country capital
IND India New Delhi
MMR Myanmar Yangon
THA Thailand Bangkok
SGP Singapore Singapore
CHN China Beijing
This time, the result contains all rows, but only two columns.
So, let's take a step back. Simple square brackets, countries[['country', 'capital']], work fine if we want to get columns. To get rows, we can use slicing, countries[1:4].
- row access: countries[1:4]
- column access: countries[['country', 'capital']]
The loc indexer is more versatile: we can select rows, columns, or rows and columns at the same time. When we use loc, subsetting becomes remarkably simple.
- row access: countries.loc[['MMR', 'IND', 'THA']]
- column access: countries.loc[:, ['country', 'capital']]
- row and column access: countries.loc[['MMR', 'IND', 'THA'], ['country', 'capital']]
The only difference is that we use labels with loc, not the positions of the elements. If we want to subset Pandas DataFrames based on position, or integer index, we'll need iloc.
Access data using iloc
With loc, we put the 'MMR' string in double square brackets to get a DataFrame, like this.
print(countries.loc[['MMR']])
output:
country capital population
MMR Myanmar Yangon 54806010
With iloc, we use the integer position 1 instead of the label MMR. The results are exactly the same.
# return Series type
print(countries.iloc[1])
output:
country Myanmar
capital Yangon
population 54806010
Name: MMR, dtype: object
# return DataFrame type
print(countries.iloc[[1]])
output:
country capital population
MMR Myanmar Yangon 54806010
Selecting Rows using iloc
To get the rows for Myanmar, India and Thailand, the code looks like this when using loc:
print(countries.loc[['MMR', 'IND', 'THA']])
output:
country capital population
MMR Myanmar Yangon 54806010
IND India New Delhi 1393409030
THA Thailand Bangkok 69950840
With iloc, we can use a list of integer positions (in the order we want) to get the same result.
print(countries.iloc[[1,0,2]])
output:
country capital population
MMR Myanmar Yangon 54806010
IND India New Delhi 1393409030
THA Thailand Bangkok 69950840
Selecting Rows & Columns using iloc
To keep only the country and capital columns, we did the following with loc:
print(countries.loc[['IND', 'MMR', 'THA'],['country', 'capital']])
output:
country capital
IND India New Delhi
MMR Myanmar Yangon
THA Thailand Bangkok
With iloc, we put the indexes 0 and 1 in a list after the comma, referring to the country and capital columns.
print(countries.iloc[[0, 1, 2], [0, 1]])
output:
country capital
IND India New Delhi
MMR Myanmar Yangon
THA Thailand Bangkok
Selecting Columns using iloc
Finally, we can keep all rows and keep only the country and capital columns in a similar fashion. With loc, this is how it's done.
print(countries.loc[:,['country', 'capital']])
output:
country capital
IND India New Delhi
MMR Myanmar Yangon
THA Thailand Bangkok
SGP Singapore Singapore
CHN China Beijing
For iloc, it's like this.
print(countries.iloc[:,[0,1]])
output:
country capital
IND India New Delhi
MMR Myanmar Yangon
THA Thailand Bangkok
SGP Singapore Singapore
CHN China Beijing
loc and iloc are pretty similar; the only difference is how we refer to columns and rows. We aced indexing and selecting data from Pandas DataFrames!
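One further difference shows up when slicing: iloc slices work like Python list slices and exclude the end position, while loc slices include the end label. The sketch below rebuilds a four-row countries DataFrame inline and checks that the two selections match.

```python
import pandas as pd

countries = pd.DataFrame(
    {
        'country': ['India', 'Myanmar', 'Thailand', 'Singapore'],
        'capital': ['New Delhi', 'Yangon', 'Bangkok', 'Singapore'],
        'population': [1393409030, 54806010, 69950840, 5453570],
    },
    index=['IND', 'MMR', 'THA', 'SGP'],
)

# iloc slices by position: end positions (row 3, column 2) are excluded
by_position = countries.iloc[1:3, 0:2]

# loc slices by label: end labels ('THA', 'capital') are included
by_label = countries.loc['MMR':'THA', 'country':'capital']

# equals() compares the two results cell by cell
print(by_position.equals(by_label))
```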
Update data in a DataFrame
Updating data in a DataFrame is similar to selecting data from it. First we select the data we want to update, then we assign the new data to it. In the following, we will update the country name Myanmar to Myanmar(Burma). Note that we can do it using loc or iloc.
# Before updating data
print(countries)
output:
country capital population
IND India New Delhi 1393409030
MMR Myanmar Yangon 54806010
THA Thailand Bangkok 69950840
SGP Singapore Singapore 5453570
CHN China Beijing 1412360000
- Change Myanmar to Myanmar(Burma)
# Update data using loc
countries.loc[['MMR'], ['country']] = 'Myanmar(Burma)'
print(countries)
output:
country capital population
IND India New Delhi 1393409030
MMR Myanmar(Burma) Yangon 54806010
THA Thailand Bangkok 69950840
SGP Singapore Singapore 5453570
CHN China Beijing 1412360000
- Change Myanmar(Burma) to Myanmar
# Update data using iloc
countries.iloc[[1], [0]] = 'Myanmar'
print(countries)
output:
country capital population
IND India New Delhi 1393409030
MMR Myanmar Yangon 54806010
THA Thailand Bangkok 69950840
SGP Singapore Singapore 5453570
CHN China Beijing 1412360000
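For a single cell, the selection can also use plain labels instead of lists. The sketch below (with a two-row countries DataFrame built inline, and an illustrative new population value) updates one value with loc and reads it back with at, a label-based accessor for single values.

```python
import pandas as pd

countries = pd.DataFrame(
    {'country': ['India', 'Myanmar'], 'population': [1393409030, 54806010]},
    index=['IND', 'MMR'],
)

# Plain labels address exactly one cell
countries.loc['MMR', 'population'] = 54800000

# at reads (or writes) a single value by row and column label
print(countries.at['MMR', 'population'])
```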
Delete data in DataFrame
While cleaning a dataset, we might want to remove some rows from a DataFrame. We can do this with the drop method of the DataFrame. Let's try to remove the China row.
# Before delete/drop data
print(countries)
output:
country capital population
IND India New Delhi 1393409030
MMR Myanmar Yangon 54806010
THA Thailand Bangkok 69950840
SGP Singapore Singapore 5453570
CHN China Beijing 1412360000
# We pass ['CHN'] to say which row label we want to remove
# axis=0 means we want to drop row(s)
# inplace=True means the drop modifies the original DataFrame
countries.drop(['CHN'], axis=0, inplace=True)
print(countries)
output:
country capital population
IND India New Delhi 1393409030
MMR Myanmar Yangon 54806010
THA Thailand Bangkok 69950840
SGP Singapore Singapore 5453570
Printing countries shows that the row we wanted to remove is no longer in the countries DataFrame. Next, let's try to remove the population column.
# We pass ["population"] to say which column label we want to remove
# axis=1 means we want to drop column(s)
# inplace=True means the drop modifies the original DataFrame
countries.drop(["population"], axis=1, inplace=True)
print(countries)
output:
country capital
IND India New Delhi
MMR Myanmar Yangon
THA Thailand Bangkok
SGP Singapore Singapore
As we expected, the population column is dropped from the DataFrame.
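If we would rather keep the original DataFrame untouched, drop can be called without inplace=True, in which case it returns a new DataFrame. The sketch below also uses the index and columns keywords of drop, which can be an alternative to axis=0 and axis=1; the name trimmed is only for illustration.

```python
import pandas as pd

countries = pd.DataFrame(
    {
        'country': ['India', 'China'],
        'capital': ['New Delhi', 'Beijing'],
        'population': [1393409030, 1412360000],
    },
    index=['IND', 'CHN'],
)

# Without inplace=True, drop returns a new DataFrame;
# index= names rows to drop, columns= names columns to drop
trimmed = countries.drop(index=['CHN'], columns=['population'])

print(trimmed)
print(countries)  # the original still has the CHN row and population column
```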
Add data to DataFrame
What if we want to add data to a DataFrame? We can do it using square brackets []. Let's add back the population data we dropped in the previous section.
# before adding data
print(countries)
output:
country capital
IND India New Delhi
MMR Myanmar Yangon
THA Thailand Bangkok
SGP Singapore Singapore
# Add population column data
# The length of the column data must match the number of rows in the DataFrame
countries["population"] = [1393409030,54806010,69950840,5453570]
print(countries)
output:
country capital population
IND India New Delhi 1393409030
MMR Myanmar Yangon 54806010
THA Thailand Bangkok 69950840
SGP Singapore Singapore 5453570
Great! Do note that Pandas does not know which population belongs to which country when we pass a plain list; it adds the values in the order we give them. Now, let's add our China row back to the countries DataFrame. Since the new row needs the index label CHN, we add it using loc.
countries.loc['CHN'] = ['China', 'Beijing', 1412360000]
print(countries)
output:
country capital population
IND India New Delhi 1393409030
MMR Myanmar Yangon 54806010
THA Thailand Bangkok 69950840
SGP Singapore Singapore 5453570
CHN China Beijing 1412360000
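When the order of the new values is not guaranteed, one option is to assign a Pandas Series instead of a list. Pandas then matches the values to rows by index label rather than by position. A minimal sketch, with a two-row countries DataFrame built inline:

```python
import pandas as pd

countries = pd.DataFrame(
    {'country': ['India', 'Myanmar'], 'capital': ['New Delhi', 'Yangon']},
    index=['IND', 'MMR'],
)

# The Series lists MMR first, the opposite of the row order in countries
population = pd.Series({'MMR': 54806010, 'IND': 1393409030})

# Assignment aligns on the index labels, not on position
countries['population'] = population
print(countries)
```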
Super! We have now covered how to create, select, add, update and delete data in Python dictionaries and Pandas DataFrames.