DEV Community

PyProDev

Originally published at linkedin.com

Mastering Python Dictionaries & Pandas DataFrames

Python dictionaries and Pandas DataFrames are two of the most frequently used data structures for working with data. The Pandas DataFrame is the standard, popular data structure for working with tabular data in advanced data analysis. In this article, we will get hands-on practice with how to

  • create,
  • manipulate,
    • select,
    • add,
    • update,
    • delete

data in dictionaries and dataframes.

List

First, let's talk about the basic Python data type: list. Imagine that we work for the World Bank and want to keep track of the population of each country.

Let's say we have 2021 population data of each country:

  • India(1,393,409,030),
  • Burma(54,806,010),
  • Thailand(69,950,840),
  • Singapore(5,453,570), and so on.

These data are based on Population Data | The World Bank.

To keep track of which population belongs to which country, we create two lists as follows, with the names of the countries in the same order as the populations.

# lists
countries = ['India', 'Burma', 'Thailand', 'Singapore']
populations = [1393409030, 54806010, 69950840, 5453570]

Now suppose that we want to get the population of Burma. First, we have to figure out where in the list Burma is, so that we can use this position to get the correct population. We will use the method index() to get the index.

burma_index = countries.index('Burma')
print(burma_index)
output:
1

We get 1 as the index of 'Burma' because Python list indexes start at 0. Now, we can use this index to subset the populations list and get the population corresponding to Burma.

print(populations[burma_index])
output:
54806010

As expected, we get 54806010, the population of Burma.

Motivation for Dictionaries

So we have two lists, and used the index to connect corresponding elements in both lists. It worked, but it's a pretty terrible approach: it's not convenient and not intuitive. Wouldn't it be easier if we had a way to connect each country directly to its population, without using an index?

Dictionary

This is where the "dictionary" comes into play. Let's convert this population data to a dictionary. To create the dictionary, we need curly brackets. Next, inside the curly brackets, we have a bunch of what are called key:value pairs.

my_dict = {
   "key1":"value1",
   "key2":"value2",
}

In our case,

  • the keys are the country names, and
  • the values are the corresponding populations.

The first key is India, and its corresponding value is 1,393,409,030. Notice the colon that separates the key and value here. Let's do the same thing for the three other key-value pairs, and store the dictionary under the name country_population.

country_population = {'India':1393409030, 'Burma':54806010, 'Thailand':69950840, 'Singapore':5453570}

If we want to find the population for Burma, we simply type country_population, and then the string "Burma" inside square brackets.

print(country_population["Burma"])
output:
54806010

In other words, we pass the key in square brackets, and we get the corresponding value. This approach is not only intuitive, it's also very efficient, because Python can make the lookup of these keys very fast, even for huge dictionaries.
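
One caveat is worth a quick sketch: passing a key that is not in the dictionary raises a KeyError, while the built-in get() method returns None or a default of our choice. The key 'Laos' below is only an illustrative missing key and is not part of the article's data.

```python
country_population = {'India': 1393409030, 'Burma': 54806010,
                      'Thailand': 69950840, 'Singapore': 5453570}

# Square brackets raise KeyError for a key that is not in the dictionary
try:
    country_population['Laos']  # 'Laos' is a hypothetical missing key
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)

# get() returns None, or the default we pass, instead of raising
missing = country_population.get('Laos')
fallback = country_population.get('Laos', 0)
print(missing, fallback)
```

Square brackets are a good fit when a missing key should be treated as an error, and get() when a missing key is an expected case.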

Create a Dictionary

We will create a dictionary of countries and capitals data where the country names are the keys and the capitals are the corresponding values.

  • With the strings in countries and capitals, create a dictionary called asia with 4 key:value pairs. Beware of capitalization! Strings in the code are case-sensitive.
  • Print out asia to see if the result is what we expected.

# From the strings in countries and capitals, create a dictionary called asia
asia = {'India':'New Delhi', 'Burma':'Yangon', 'Thailand':'Bangkok', 'Singapore':'Singapore'}

# Print asia
print(asia)
# Print type of asia
print(type(asia))
output:
{'India': 'New Delhi', 'Burma': 'Yangon', 'Thailand': 'Bangkok', 'Singapore': 'Singapore'}
<class 'dict'>

Great! <class 'dict'> means that asia is an instance of the dict class, that is, a dictionary. Classes are outside this article's scope and will be explained in a separate article. Now that we've built our first dictionary, let's see how to work with it.
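
If the country names and capitals already live in two lists, the same dictionary can be built without typing every pair by hand. The following is a minimal sketch, assuming a capitals list that the article mentions but never defines; it uses the built-ins zip() and dict().

```python
countries = ['India', 'Burma', 'Thailand', 'Singapore']
# Assumed list: the article refers to "capitals" but does not show it
capitals = ['New Delhi', 'Yangon', 'Bangkok', 'Singapore']

# zip() pairs the items by position; dict() turns each pair into key:value
asia_from_lists = dict(zip(countries, capitals))
print(asia_from_lists)
```

This only works as intended when both lists are in the same order, which is exactly the bookkeeping a dictionary saves us from afterwards.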

Manipulating a Dictionary

If the keys of a dictionary are chosen wisely, accessing the values in a dictionary is easy and intuitive. For example, to get the capital for India from asia we can use India as the key.

print(asia['India'])
output:
New Delhi

We can check out which keys are in asia by calling the keys() method on asia.

# Print out the keys in asia
print(asia.keys())

# Print out value that belongs to key 'Burma'
print(asia['Burma'])
output:
dict_keys(['India', 'Burma', 'Thailand', 'Singapore'])
Yangon

Earlier, we created the dictionary country_population, which is basically a set of key:value pairs. We could easily access the population of Burma by passing the key in square brackets, like this.

country_population = {'India':1393409030, 'Burma':54806010, 'Thailand':69950840, 'Singapore':5453570}
print(country_population['Burma'])
output:
54806010

Note: For this lookup to work properly, the keys in a dictionary should be unique.

If we try to add another key:value pair to country_population with the same key, Burma, for example,

country_population = {'India':1393409030, 'Burma':54806010, 'Thailand':69950840, 'Singapore':5453570, 'Burma':54800000,}

we'll see that the resulting country_population dictionary still contains four pairs. Only the last pair that we specified for the key Burma in the curly brackets ('Burma':54800000) was kept in the resulting dictionary.

country_population = {'India':1393409030, 'Burma':54806010, 'Thailand':69950840, 'Singapore':5453570, 'Burma':54800000,}
print(country_population)
output:
{'India': 1393409030, 'Burma': 54800000, 'Thailand': 69950840, 'Singapore': 5453570}

Let's see how we can add more data to a dictionary that already exists.

Add data to a Dictionary

Our country_population dictionary currently does not have China's data. We want to add "China":1412360000 to country_population.

# Before adding China data
country_population = {'India':1393409030, 'Burma':54806010, 'Thailand':69950840, 'Singapore':5453570}
print(country_population)
output:
{'India': 1393409030, 'Burma': 54806010, 'Thailand': 69950840, 'Singapore': 5453570}

To add this information, simply write the key "China" in square brackets and assign population 1412360000 to it with the equals sign.

# After adding China data
country_population["China"] = 1412360000
print(country_population)
output:
{'India': 1393409030, 'Burma': 54806010, 'Thailand': 69950840, 'Singapore': 5453570, 'China': 1412360000}

Now if we check out country_population again, China is indeed in there. To check this with code, we can also write 'China' in country_population, which gives us True if the key China is in there. Note that the key 'China' is a string and is case-sensitive.

print('China' in country_population)
output:
True
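
To make the case-sensitivity point concrete, here is a small sketch that runs the same membership test with a lowercase key. The variable names has_china and has_lowercase are illustrative only.

```python
country_population = {'India': 1393409030, 'Burma': 54806010,
                      'Thailand': 69950840, 'Singapore': 5453570,
                      'China': 1412360000}

# Membership tests compare keys exactly, so capitalization matters
has_china = 'China' in country_population
has_lowercase = 'china' in country_population
print(has_china, has_lowercase)
```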

Update data in a Dictionary

With the syntax dict_name[key]=value, we can also change values, for example, to update the population of China to 1412000000. Because each key in a dictionary is unique, Python knows that we're not trying to create a new pair, but want to update the pair that's already in there.

country_population["China"] = 1412000000
print(country_population)
output:
{'India': 1393409030, 'Burma': 54806010, 'Thailand': 69950840, 'Singapore': 5453570, 'China': 1412000000}

Delete data from a Dictionary

Suppose now that we want to remove China from the dictionary. We can do this with del, again pointing to China inside square brackets. If we print country_population again, China is no longer in our dictionary.

del country_population['China']
print(country_population)
output:
{'India': 1393409030, 'Burma': 54806010, 'Thailand': 69950840, 'Singapore': 5453570}
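
As an alternative to del, the dictionary method pop() removes a key and also hands back its value. This is a minimal sketch of that variant, not the approach used in the article; passing a default as the second argument avoids a KeyError when the key is already gone.

```python
country_population = {'India': 1393409030, 'Burma': 54806010,
                      'Thailand': 69950840, 'Singapore': 5453570,
                      'China': 1412000000}

# pop() removes the key and returns the value that was stored under it
removed = country_population.pop('China')
print(removed)

# With a default, pop() does not raise if the key no longer exists
removed_again = country_population.pop('China', None)
print(removed_again)
```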

List vs Dictionary

Using lists and dictionaries is pretty similar. We can select, update and remove values with square brackets. There are some big differences though. The list is a sequence of values that are indexed by a range of numbers. The dictionary, on the other hand, is indexed by unique keys.

                        List                       Dictionary
Select, update, remove  use []                     use []
Indexed by              range of numbers           unique keys
Use when                a collection of values,    a lookup table
                        order matters,             with unique keys
                        selecting entire subsets

When to use which one? Well, if we have a collection of values where the order matters, and we want to easily select entire subsets of data, we'll want to go with a list.

If, on the other hand, we need some sort of look up table, where looking for data should be fast and where we can specify unique keys, a dictionary is the preferred option.

Nested Dictionaries

Remember lists? They could contain anything, even other lists. Well, for dictionaries the same holds. Dictionaries can contain key:value pairs where the values are again dictionaries.

As an example, have a look at the code below, which defines another version of asia, the dictionary we've been working with all along. The keys are still the country names, but the values are dictionaries that contain more information than just the capital.

# Dictionary of dictionaries
asia = {'India': {'capital':'New Delhi', 'population':1393409030},
        'Burma': {'capital':'Yangon', 'population':54806010},
        'Thailand': {'capital':'Bangkok', 'population':69950840},
        'Singapore': {'capital':'Singapore', 'population':5453570},
        }

It's perfectly possible to chain square brackets to select elements. To fetch the population for Burma from asia,

print(asia['Burma']['population'])
output:
54806010

  • Use chained square brackets to select and print out the capital of Burma.
# Print out the capital of Burma
print(asia['Burma']['capital'])
output:
Yangon
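
Nested dictionaries can be extended and looped over with the same tools we used on flat ones. The sketch below starts from a shortened version of asia, adds one country by assigning a whole inner dictionary, and then collects the capitals with items(). The capitals variable is an illustrative name.

```python
asia = {'India': {'capital': 'New Delhi', 'population': 1393409030},
        'Burma': {'capital': 'Yangon', 'population': 54806010}}

# Add a new country by assigning an inner dictionary to a new key
asia['Thailand'] = {'capital': 'Bangkok', 'population': 69950840}

# items() yields (key, value) pairs; each value is itself a dictionary
capitals = {}
for country, info in asia.items():
    capitals[country] = info['capital']
print(capitals)
```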

Great! It's time to learn about a new data structure!

Tabular dataset examples

As data scientists, we'll often be working with large amounts of data. The form of this data can vary greatly, but we can usually boil it down to a tabular structure, that is, a table like in a spreadsheet. Let's have a look at some examples.

Suppose we're working in a chemical factory and have a ton of temperature measurements to analyze. This data can come in the following form:

temperature  measured at          location
76           2021-03-01 12:00:01  chamber 1
86           2021-03-01 12:00:01  chamber 2
72           2021-03-01 12:00:01  chamber 1
88           2021-03-01 12:00:01  chamber 2

  • every row is a measurement, or an observation, and
  • columns are different variables.

For each measurement, there is the temperature, but also the date and time of the measurement, and the location.

Another example: we have information on India, Burma, Thailand and so on. We can again build a table with this data.

Country    Capital    Population
India      New Delhi  1393409030
Burma      Yangon     54806010
Thailand   Bangkok    69950840
Singapore  Singapore  5453570
China      Beijing    1412360000

Each row is an observation and represents a country. Each observation has the same variables: the country name, the capital and the population.

Datasets in Python

To start working on this data in Python, we'll need some kind of rectangular data structure. How about the 2D NumPy array? Well, it's an option, but not necessarily the best one. There are different data types and NumPy arrays are not great at handling these.

Datasets containing different data types

In the above data, the country and capital are strings while the population is an integer. Our datasets will typically comprise different data types, so we need a tool that's better suited. To easily and efficiently handle this data, there's the Pandas package.
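
To see why a NumPy array is awkward here, the following sketch puts one row of our table into an array. A NumPy array holds a single data type, so the integer population is stored as text. The variable name row is illustrative.

```python
import numpy as np

# One row of the table: two strings and one integer
row = np.array(['India', 'New Delhi', 1393409030])

# All elements share one dtype; 'U' means Unicode string
print(row.dtype.kind)

# The population is now a string, so arithmetic on it no longer works directly
print(row[2])
```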

Pandas

Pandas is

  • an open source library,
  • built on the NumPy package,
  • a provider of easy-to-use data structures, and
  • a high-level data manipulation tool,

which makes it very interesting for data scientists all over the world. In pandas, we store the tabular data in an object called a DataFrame. Have a look at the Pandas DataFrame version of the data:

DataFrame

     Country    Capital    Population
IND  India      New Delhi  1393409030
MMR  Myanmar    Yangon     54806010
THA  Thailand   Bangkok    69950840
SGP  Singapore  Singapore  5453570
CHN  China      Beijing    1412360000

The rows represent the observations, and the columns represent the variables. Also notice that each row has a unique row label: IND for India, MMR for Myanmar, and so on. The columns, or variables, also have labels: country, capital, and so on. Notice that the values in the different columns have different types. But how can we create this DataFrame in the first place? Well, there are different ways.

Create a DataFrame from Dictionary

First of all, we can build it manually, starting from a dictionary. Using the distinctive curly brackets, we create key value pairs. The keys are the column labels, and the values are the corresponding columns, in list form.

asia_dict = {
    'country':['India', 'Myanmar', 'Thailand', 'Singapore', 'China'],
    'capital':['New Delhi', 'Yangon', 'Bangkok', 'Singapore', 'Beijing'],
    'population':[1393409030,54806010,69950840, 5453570, 1412360000]
}

After importing the pandas package as pd, we can create a DataFrame from the dictionary using pd.DataFrame.

import pandas as pd
asia_df = pd.DataFrame(asia_dict)
print(type(asia_df))
print(asia_df)
output:
<class 'pandas.core.frame.DataFrame'>
     country    capital  population
0      India  New Delhi  1393409030
1    Myanmar     Yangon    54806010
2   Thailand    Bangkok    69950840
3  Singapore  Singapore     5453570
4      China    Beijing  1412360000

If we check out asia_df now, we see that Pandas assigned some automatic row labels, 0 up to 4. To specify them manually, we can set the index attribute of asia_df to a list with the correct labels.

asia_df.index = ['IND', 'MMR', 'THA', 'SGP', 'CHN']
print(asia_df)
output:
       country    capital  population
IND      India  New Delhi  1393409030
MMR    Myanmar     Yangon    54806010
THA   Thailand    Bangkok    69950840
SGP  Singapore  Singapore     5453570
CHN      China    Beijing  1412360000

The resulting asia_df DataFrame is the same one as we saw before. Using a dictionary approach is fine, but what if we're working with tons of data, which is typically the case as a data scientist? Well, we won't build the DataFrame manually. Instead, we import data from an external file that contains all this data.
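
Two small variations on the dictionary approach may be handy. This is a sketch, not the article's own code: the row labels can be passed directly through the index argument of pd.DataFrame, and a dictionary of dictionaries, like the nested asia from earlier, can be converted with pd.DataFrame.from_dict and orient='index'. The names asia_nested and asia_nested_df are illustrative.

```python
import pandas as pd

asia_dict = {
    'country': ['India', 'Myanmar', 'Thailand', 'Singapore', 'China'],
    'capital': ['New Delhi', 'Yangon', 'Bangkok', 'Singapore', 'Beijing'],
    'population': [1393409030, 54806010, 69950840, 5453570, 1412360000],
}

# Set the row labels while creating the DataFrame
asia_df = pd.DataFrame(asia_dict, index=['IND', 'MMR', 'THA', 'SGP', 'CHN'])
print(asia_df)

# A dictionary of dictionaries: outer keys become row labels with orient='index'
asia_nested = {'India': {'capital': 'New Delhi', 'population': 1393409030},
               'Burma': {'capital': 'Yangon', 'population': 54806010}}
asia_nested_df = pd.DataFrame.from_dict(asia_nested, orient='index')
print(asia_nested_df)
```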

Create a DataFrame from CSV file

Suppose the countries' data that we used before comes in the form of a CSV file called countries.csv. CSV is short for comma-separated values. The countries.csv file used in this article can be downloaded at this link.

Let's try to import this data using the Pandas read_csv function. We pass the path to the CSV file as an argument.

# The r prefix makes this a raw string, so backslashes are not read as escapes such as \t
countries = pd.read_csv(r'path\to\countries.csv')
print(countries)
output:
  Unnamed: 0    country    capital  population
0        IND      India  New Delhi  1393409030
1        MMR    Myanmar     Yangon    54806010
2        THA   Thailand    Bangkok    69950840
3        SGP  Singapore  Singapore     5453570
4        CHN      China    Beijing  1412360000

If we print countries, there's still something wrong. The row labels are seen as a column. To solve this, we'll have to tell the read_csv function that the first column contains the row indexes. We do this by setting the index_col argument, like this.

countries = pd.read_csv(r'path\to\countries.csv', index_col=0)
print(countries)
output:
       country    capital  population
IND      India  New Delhi  1393409030
MMR    Myanmar     Yangon    54806010
THA   Thailand    Bangkok    69950840
SGP  Singapore  Singapore     5453570
CHN      China    Beijing  1412360000

This time countries nicely contains the row and column labels. The read_csv function features many more arguments that allow us to customize our data importing. Check out its documentation for more details.
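
To try read_csv without downloading a file, the sketch below reads from an in-memory string via io.StringIO. The csv_text content is an assumed, shortened stand-in for countries.csv, with an empty first header field so that it behaves like the file shown above.

```python
import io
import pandas as pd

# Assumed stand-in for the first rows of countries.csv
csv_text = """,country,capital,population
IND,India,New Delhi,1393409030
MMR,Myanmar,Yangon,54806010
THA,Thailand,Bangkok,69950840
"""

# Without index_col, the unnamed first column becomes a regular column
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns[0])

# With index_col=0, the first column is used as the row labels
countries = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(countries)
```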

Indexing and selecting data in DataFrames

Indexing makes it easy to access columns, rows and single elements in our DataFrame. There are numerous ways in which we can index and select data from DataFrames. We're going to see how to use

  • square brackets [],
  • advanced data access methods,
    • loc and
    • iloc,

that make Pandas extra powerful.

Access data using square brackets [ ]

Suppose that we only want to select the country column from countries. How to do this with square brackets? Well, we type countries, and then the column label inside square brackets. Python prints out the entire column, together with the row labels.

print(countries['country'])
output:
IND        India
MMR      Myanmar
THA     Thailand
SGP    Singapore
CHN        China
Name: country, dtype: object

But there's something strange here. The last line says Name: country, dtype: object. We're clearly not dealing with a regular DataFrame here. Let's find out about the type of the object that gets returned, with the type function as follows.

print(type(countries['country']))
output:
<class 'pandas.core.series.Series'>

So we're dealing with a Pandas Series here. In a simplified sense, we can think of the Series as a 1-dimensional array that can be labeled, just like the DataFrame. If we put together a bunch of Series, we can create a DataFrame.

If we want to select the country column but keep the data in a DataFrame, we'll need double square brackets, like this.

print(countries[['country']])
output:
       country
IND      India
MMR    Myanmar
THA   Thailand
SGP  Singapore
CHN      China

If we check the type of this result, we will see that it is a DataFrame.

print(type(countries[['country']]))
output:
<class 'pandas.core.frame.DataFrame'>

Note that the single bracket version gives a Pandas Series, the double bracket version gives a Pandas DataFrame.
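
The difference also shows up in the shape of the result. The sketch below rebuilds the countries DataFrame from a dictionary so that it runs on its own, then compares both selections; to_frame() is the Series method that converts a Series into a one-column DataFrame.

```python
import pandas as pd

countries = pd.DataFrame(
    {'country': ['India', 'Myanmar', 'Thailand', 'Singapore', 'China'],
     'capital': ['New Delhi', 'Yangon', 'Bangkok', 'Singapore', 'Beijing'],
     'population': [1393409030, 54806010, 69950840, 5453570, 1412360000]},
    index=['IND', 'MMR', 'THA', 'SGP', 'CHN'])

single = countries['country']      # Series: one dimension
double = countries[['country']]    # DataFrame: rows and columns
print(single.shape, double.shape)

# A Series can be turned into a one-column DataFrame afterwards
as_frame = single.to_frame()
print(type(as_frame))
```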

We can easily extend this call to select two columns, country and capital, for example. Looked at from a different angle, we're actually putting a list of column labels inside another set of square brackets, and we end up with a sub-DataFrame containing only the country and capital columns.

print(countries[['country', 'capital']])
output:
       country    capital
IND      India  New Delhi
MMR    Myanmar     Yangon
THA   Thailand    Bangkok
SGP  Singapore  Singapore
CHN      China    Beijing

We can also use the same square brackets to select rows from a DataFrame. The way to do it is by specifying a slice. To get the second and third rows of countries, we use the slice 1:3. Remember that the end of the slice is exclusive and that the index starts at zero.

print(countries[1:3])
output:
      country  capital  population
MMR   Myanmar   Yangon    54806010
THA  Thailand  Bangkok    69950840

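These square brackets work, but they offer only limited functionality. Ideally, we'd want something similar to 2D NumPy arrays, where the index or slice before the comma refers to the rows and the one after the comma refers to the columns.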
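
One limitation is easy to run into: a single integer in plain square brackets is treated as a column label, not as a row position. The sketch below, which rebuilds a shortened countries DataFrame so that it runs on its own, shows that countries[1] raises a KeyError while the slice countries[1:3] selects rows.

```python
import pandas as pd

countries = pd.DataFrame(
    {'country': ['India', 'Myanmar', 'Thailand'],
     'capital': ['New Delhi', 'Yangon', 'Bangkok']},
    index=['IND', 'MMR', 'THA'])

# A single integer is looked up among the column labels, so this fails
try:
    countries[1]
    raised = False
except KeyError:
    raised = True
print(raised)

# A slice, on the other hand, selects rows by position
row_slice = countries[1:3]
print(row_slice)
```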

To do a similar thing with Pandas, we have 2 ways.

  • loc is label-based, which means that we have to specify rows and columns based on their row and column labels.
  • iloc is integer-position based, which means that we have to specify rows and columns by their integer positions.

Let's start with loc first.

Access data using loc

Let's have another look at the countries DataFrame, and try to get the row for Myanmar. We put the label of the row of interest in square brackets after loc.

print(countries.loc['MMR'])
output:
country        Myanmar
capital         Yangon
population    54806010
Name: MMR, dtype: object

We get a Pandas Series, containing all the row's information, rather inconveniently shown on different lines.

To get a DataFrame, we have to put the 'MMR' string inside another pair of brackets.

print(countries.loc[['MMR']])
output:
     country capital  population
MMR  Myanmar  Yangon    54806010

Selecting Rows using loc

We can also select multiple rows at the same time. Suppose we want to also include India and Thailand. Simply add some more row labels to the list.

print(countries.loc[['MMR', 'IND', 'THA']])
output:
      country    capital  population
MMR   Myanmar     Yangon    54806010
IND     India  New Delhi  1393409030
THA  Thailand    Bangkok    69950840

So far we have only selected entire rows, which is something we could also do with the basic square brackets. The difference here is that we can extend our selection with a comma and a specification of the columns of interest.

Selecting Rows & Columns using loc

Let's extend the previous call to only include the country and capital columns. We add a comma, and a list of column labels we want to keep.

print(countries.loc[['MMR', 'IND', 'THA'], ['country', 'capital']])
output:
      country    capital
MMR   Myanmar     Yangon
IND     India  New Delhi
THA  Thailand    Bangkok

The intersection gets returned.

Selecting Columns using loc

We can also use loc to select all rows but only specific columns. Simply replace the first list that specifies the row labels with a colon, a slice going from beginning to end.

print(countries.loc[:, ['country', 'capital']])
output:
       country    capital
IND      India  New Delhi
MMR    Myanmar     Yangon
THA   Thailand    Bangkok
SGP  Singapore  Singapore
CHN      China    Beijing

This time, the result contains all rows, but only two columns.

So, let's take a step back. Simple square brackets countries[['country', 'capital']] work fine if we want to get columns. To get rows, we can use slicing countries[1:4].

  • row access: countries[1:4]
  • column access: countries[['country', 'capital']]

The loc indexer is more versatile: we can select rows, columns, or rows and columns at the same time. When we use loc, subsetting becomes remarkably simple.

  • row access: countries.loc[['MMR', 'IND', 'THA']]
  • column access: countries.loc[:, ['country', 'capital']]
  • row and column access: countries.loc[['MMR', 'IND', 'THA'], ['country', 'capital']]
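
Two more loc patterns complement the list above; this is a sketch on a rebuilt countries DataFrame. A single row label and a single column label return one element, and label slices work too. Unlike ordinary Python slices, a label slice with loc includes both endpoints.

```python
import pandas as pd

countries = pd.DataFrame(
    {'country': ['India', 'Myanmar', 'Thailand', 'Singapore', 'China'],
     'capital': ['New Delhi', 'Yangon', 'Bangkok', 'Singapore', 'Beijing'],
     'population': [1393409030, 54806010, 69950840, 5453570, 1412360000]},
    index=['IND', 'MMR', 'THA', 'SGP', 'CHN'])

# One row label and one column label give a single element
capital = countries.loc['MMR', 'capital']
print(capital)

# Label slices include the end label: SGP and capital are both part of the result
subset = countries.loc['MMR':'SGP', 'country':'capital']
print(subset)
```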

The only difference is that we use labels with loc, not the positions of the elements. If we want to subset Pandas DataFrames based on position, or integer index, we'll need iloc.

Access data using iloc

With loc, we put the 'MMR' string in double square brackets to get a DataFrame, like this.

print(countries.loc[['MMR']])
output:
     country capital  population
MMR  Myanmar  Yangon    54806010

With iloc, we use the integer position 1 instead of the label MMR. A single position returns a Series and a list of positions returns a DataFrame, exactly as with loc.

# return Series type
print(countries.iloc[1])
output:
country        Myanmar
capital         Yangon
population    54806010
Name: MMR, dtype: object
# return DataFrame type
print(countries.iloc[[1]])
output:
     country capital  population
MMR  Myanmar  Yangon    54806010

Selecting Rows using iloc

To get the rows for Myanmar, India and Thailand, the code is like this when using loc,

print(countries.loc[['MMR', 'IND', 'THA']])
output:
      country    capital  population
MMR   Myanmar     Yangon    54806010
IND     India  New Delhi  1393409030
THA  Thailand    Bangkok    69950840

With iloc, we can use a list of integer positions (in the order we want) to get the same result.

print(countries.iloc[[1,0,2]])
output:
      country    capital  population
MMR   Myanmar     Yangon    54806010
IND     India  New Delhi  1393409030
THA  Thailand    Bangkok    69950840

Selecting Rows & Columns using iloc

To keep only the country and capital columns, which we did as follows with loc,

print(countries.loc[['IND', 'MMR', 'THA'],['country', 'capital']])
output:
      country    capital
IND     India  New Delhi
MMR   Myanmar     Yangon
THA  Thailand    Bangkok

we put the indexes 0 and 1 in a list after the comma, referring to the country and capital column when using iloc.

print(countries.iloc[[0, 1, 2], [0, 1]])
output:
      country    capital
IND     India  New Delhi
MMR   Myanmar     Yangon
THA  Thailand    Bangkok

Selecting Columns using iloc

Finally, you can keep all rows and keep only the country and capital column in a similar fashion. With loc, this is how it's done.

print(countries.loc[:,['country', 'capital']])
output:
       country    capital
IND      India  New Delhi
MMR    Myanmar     Yangon
THA   Thailand    Bangkok
SGP  Singapore  Singapore
CHN      China    Beijing

For iloc, it's like this.

print(countries.iloc[:,[0,1]])
output:
       country    capital
IND      India  New Delhi
MMR    Myanmar     Yangon
THA   Thailand    Bangkok
SGP  Singapore  Singapore
CHN      China    Beijing

loc and iloc are pretty similar; the only difference is how we refer to columns and rows. We aced indexing and selecting data from Pandas DataFrames!
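
One detail differs between the two when slicing, shown here as a sketch on a rebuilt countries DataFrame: iloc slices follow the usual Python rule and exclude the end position, whereas loc label slices include the end label. Two integer positions without lists return a single element.

```python
import pandas as pd

countries = pd.DataFrame(
    {'country': ['India', 'Myanmar', 'Thailand', 'Singapore', 'China'],
     'capital': ['New Delhi', 'Yangon', 'Bangkok', 'Singapore', 'Beijing'],
     'population': [1393409030, 54806010, 69950840, 5453570, 1412360000]},
    index=['IND', 'MMR', 'THA', 'SGP', 'CHN'])

# Row position 1, column position 1: a single element
capital = countries.iloc[1, 1]
print(capital)

# Positions 1 up to (not including) 4, and columns 0 up to (not including) 2
by_position = countries.iloc[1:4, 0:2]
# The equivalent label slice with loc includes both end labels
by_label = countries.loc['MMR':'SGP', 'country':'capital']
print(by_position.equals(by_label))
```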

Update data in a DataFrame

Updating data in a DataFrame is similar to selecting data from a DataFrame. First we select the data we want to update, then we assign new data to it. In the following, we will update the country name Myanmar to Myanmar(Burma). Note that we can do this using loc or iloc.

# Before updating data
print(countries)
output:
       country    capital  population
IND      India  New Delhi  1393409030
MMR    Myanmar     Yangon    54806010
THA   Thailand    Bangkok    69950840
SGP  Singapore  Singapore     5453570
CHN      China    Beijing  1412360000

  • Change Myanmar to Myanmar(Burma)
# Update data using loc
countries.loc[['MMR'], ['country']] = 'Myanmar(Burma)'
print(countries)
output:
            country    capital  population
IND           India  New Delhi  1393409030
MMR  Myanmar(Burma)     Yangon    54806010
THA        Thailand    Bangkok    69950840
SGP       Singapore  Singapore     5453570
CHN           China    Beijing  1412360000

  • Change Myanmar(Burma) to Myanmar
# Update data using iloc
countries.iloc[[1], [0]] = 'Myanmar'
print(countries)
output:
       country    capital  population
IND      India  New Delhi  1393409030
MMR    Myanmar     Yangon    54806010
THA   Thailand    Bangkok    69950840
SGP  Singapore  Singapore     5453570
CHN      China    Beijing  1412360000
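
When we do not know the row label in advance, loc also accepts a boolean condition as the row selector. The following sketch, on a shortened rebuilt countries DataFrame, updates every row whose country equals 'Myanmar'. The variable name mask is illustrative.

```python
import pandas as pd

countries = pd.DataFrame(
    {'country': ['India', 'Myanmar', 'Thailand'],
     'capital': ['New Delhi', 'Yangon', 'Bangkok'],
     'population': [1393409030, 54806010, 69950840]},
    index=['IND', 'MMR', 'THA'])

# True for the rows we want to change, False for the rest
mask = countries['country'] == 'Myanmar'

# Rows come from the mask, the column is given by its label
countries.loc[mask, 'country'] = 'Myanmar(Burma)'
print(countries)
```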

Delete data in DataFrame

While cleaning a dataset, we might want to remove some rows of data from a DataFrame. We can do this by using the drop method on the DataFrame. Let's try to remove the China row from the DataFrame.

# Before delete/drop data
print(countries)
output:
       country    capital  population
IND      India  New Delhi  1393409030
MMR    Myanmar     Yangon    54806010
THA   Thailand    Bangkok    69950840
SGP  Singapore  Singapore     5453570
CHN      China    Beijing  1412360000

# We pass ['CHN'] to say which label we want to remove
# axis=0 means we want to drop row(s)
# inplace=True means the drop modifies the original DataFrame
countries.drop(['CHN'], axis=0, inplace=True)
print(countries)
output:
       country    capital  population
IND      India  New Delhi  1393409030
MMR    Myanmar     Yangon    54806010
THA   Thailand    Bangkok    69950840
SGP  Singapore  Singapore     5453570

Printing countries shows that the row we wanted to remove is no longer in the countries DataFrame. Next, let's try to remove the population column from the DataFrame.

# We pass ["population"] to say which label we want to remove
# axis=1 means we want to drop column(s)
# inplace=True means the drop modifies the original DataFrame
countries.drop(["population"], axis=1, inplace=True)
print(countries)
output:
       country    capital
IND      India  New Delhi
MMR    Myanmar     Yangon
THA   Thailand    Bangkok
SGP  Singapore  Singapore

As we expected, the column population is dropped from the dataframe.
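
If we would rather keep the original DataFrame untouched, drop can be called without inplace=True; it then returns a new DataFrame. The sketch below also uses the index and columns keyword arguments of drop, which are an alternative to labels with axis. The name trimmed is illustrative.

```python
import pandas as pd

countries = pd.DataFrame(
    {'country': ['India', 'Myanmar', 'Thailand', 'Singapore', 'China'],
     'capital': ['New Delhi', 'Yangon', 'Bangkok', 'Singapore', 'Beijing'],
     'population': [1393409030, 54806010, 69950840, 5453570, 1412360000]},
    index=['IND', 'MMR', 'THA', 'SGP', 'CHN'])

# Without inplace=True, drop returns a new DataFrame and leaves countries as it was
trimmed = countries.drop(index=['CHN'], columns=['population'])
print(trimmed)
print(countries.shape)
```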

Add data to DataFrame

What if we want to add data to a DataFrame? We can do it using square brackets []. Let's try to add back the population data we dropped in the previous section.

# before adding data
print(countries)
output:
       country    capital
IND      India  New Delhi
MMR    Myanmar     Yangon
THA   Thailand    Bangkok
SGP  Singapore  Singapore

# Add population column data
# The length of the column data needs to be the same as the number of rows in the DataFrame
countries["population"] = [1393409030,54806010,69950840,5453570]
print(countries)
output:
       country    capital  population
IND      India  New Delhi  1393409030
MMR    Myanmar     Yangon    54806010
THA   Thailand    Bangkok    69950840
SGP  Singapore  Singapore     5453570

Great! Do note that when we assign a plain list, pandas does not know which population belongs to which country and will add the values in the order we give them. Now, let's add our China row back to the countries DataFrame. Since our rows are labeled, we add the new row under its index label CHN using loc.

countries.loc['CHN'] = ['China', 'Beijing', 1412360000]
print(countries)
output:
       country    capital  population
IND      India  New Delhi  1393409030
MMR    Myanmar     Yangon    54806010
THA   Thailand    Bangkok    69950840
SGP  Singapore  Singapore     5453570
CHN      China    Beijing  1412360000
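
Going back to the column case for a moment: a plain list is matched to the rows purely by order, but a Pandas Series carries its own labels. As a sketch of that alternative, the code below assigns a Series whose entries are deliberately given in a different order; pandas lines the values up by index label. The name population is illustrative.

```python
import pandas as pd

countries = pd.DataFrame(
    {'country': ['India', 'Myanmar', 'Thailand', 'Singapore'],
     'capital': ['New Delhi', 'Yangon', 'Bangkok', 'Singapore']},
    index=['IND', 'MMR', 'THA', 'SGP'])

# Same values as in the article, but keyed by row label and out of order
population = pd.Series({'SGP': 5453570, 'IND': 1393409030,
                        'THA': 69950840, 'MMR': 54806010})

# Assignment aligns on the index labels, not on the position in the Series
countries['population'] = population
print(countries)
```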

Super! We have now covered how to create, select, add, update, and delete data in Python dictionaries and Pandas DataFrames.


See the original article.

Connect & Discuss with us on LinkedIn

