DEV Community

Monica Powell
Monica Powell

Posted on • Updated on

How to Use the TMDB API to Find Films with the Highest Revenue

png

Get Out has been one of the most talked about films in 2017 and as of April 2017 the highest grossing debut film based on an original screenplay in history. We want to programmatically find out how Get Out ranked amongst other 2017 American films and which films have earned the most revenue in 2017. This tutorial assumes most readers have basic working knowledge of Python.

Prequisites

  • Install the following python packages and run them ideally in a virtualenv.

    • config
    • requests
    • locale
    • pandas
    • matplotlib
  • In addition to installing the above dependencies we will need to request an API key from The Movie DB (TMDB). TMDB has a free API to programmatically access information about movies.

    • In order to request an API key from TMDB:
      1. Create a free account
      2. Check your e-mail to verify your account.
      3. Visit the API Settings page in your Account Settings and request an API key
      4. You should now have an API key and be ready to go!
import config # to hide TMDB API keys
import requests # to make TMDB API calls
import locale # to format currency as USD
locale.setlocale( locale.LC_ALL, '' )

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter # to format currency on charts axis

api_key = config.tmdb_api_key # get TMDB API key from config.py file
Enter fullscreen mode Exit fullscreen mode

If you plan on committing your project to GitHub or another public repository and need help setting up config you should read this article about using config to hide API keys.

Part 1: Determine the highest earning American films of 2017

In this section we will request 2017 data from TMDB, store the data we recieve as a json into a dataframe and then use matplotlib to visualize our data.

Make API Call to TMDB to return the data of interest

In order to get the highest earning films from TMDB an API request needs to be constructed to return films with a primary_release_year of 2017 sorted in descending order by revenue.

response = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  api_key + '&primary_release_year=2017&sort_by=revenue.desc')
Enter fullscreen mode Exit fullscreen mode
highest_revenue = response.json() # store parsed json response

# uncomment the next line to get a peek at the highest_revenue json structure
# highest_revenue

highest_revenue_films = highest_revenue['results']
Enter fullscreen mode Exit fullscreen mode

Create dataframe from JSON returned from TMDB API call

Let's store the JSON data returned from our API call in a dataframe to store each film and its associated revenue.

# define column names for our new dataframe
columns = ['film', 'revenue']

# create dataframe with film and revenue columns
df = pandas.DataFrame(columns=columns)
Enter fullscreen mode Exit fullscreen mode

Now to add the data to our dataframe we will need to loop through the data.

# for each of the highest revenue films make an api call for that specific movie to return the budget and revenue
for film in highest_revenue_films:
    # print(film['title'])
    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    #print(locale.currency(film_revenue['revenue'], grouping=True ))
    df.loc[len(df)]=[film['title'],film_revenue['revenue']] # store title and revenue in our dataframe    
Enter fullscreen mode Exit fullscreen mode

Below is what the dataframe head (top 5 lines) looks like after iterating through the films our API call returned.

df.head()
Enter fullscreen mode Exit fullscreen mode
.dataframe thead tr:only-child th {
    text-align: right;
}

.dataframe thead th {
    text-align: left;
}

.dataframe tbody tr th {
    vertical-align: top;
}
Enter fullscreen mode Exit fullscreen mode
film revenue
0 Beauty and the Beast 1221782049
1 The Fate of the Furious 1212583865
2 Guardians of the Galaxy Vol. 2 744784722
3 Logan 608674100
4 Kong: Skull Island 565151307

Let's actually see the data with matplotlib

We will create a horizontal bar chart using matplotlib to display the revenue earned for each film.

matplotlib.style.use('ggplot')
fig, ax = plt.subplots()
df.plot(kind="barh", y='revenue', color = ['#624ea7', '#599ad3', '#f9a65a', '#9e66ab', 'purple'], x=df['film'], ax=ax)

#format xaxis in terms of currency
formatter = FuncFormatter(currency)
ax.xaxis.set_major_formatter(formatter)
ax.legend().set_visible(False)

avg = df['revenue'].mean()

# Add a line for the average
ax.axvline(x=avg, color='b', label='Average', linestyle='--', linewidth=1)

ax.set(title='American Films with Highest Revenue (2017)', xlabel='Revenue', ylabel='Film')
Enter fullscreen mode Exit fullscreen mode
[<matplotlib.text.Text at 0x111f8aba8>,
 <matplotlib.text.Text at 0x111f20978>,
 <matplotlib.text.Text at 0x111fad2e8>]
Enter fullscreen mode Exit fullscreen mode

png

Part 2: Determine the highest earning American films of all-time

In this section we will request all-time data from TMDB, store the data we recieve as a json into a dataframe and then use matplotlib to visualize our data. Our API call will be similar to the one we used in the previous section but sans &primary_release_year=2017.

Requesting, formatting and storing API data

response = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  api_key + '&sort_by=revenue.desc')
highest_revenue_ever = response.json()
highest_revenue_films_ever = highest_revenue_ever['results']

columns = ['film', 'revenue', 'budget', 'release_date']
highest_revenue_ever_df = pandas.DataFrame(columns=columns)

for film in highest_revenue_films_ever:
    # print(film['title'])

    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    # print(film_revenue)

    # print(locale.currency(film_revenue['revenue'], grouping=True ))

    # Lord of the Rings duplicate w/ bad data was being returned  https://www.themoviedb.org/movie/454499-the-lord-of-the-rings
    # It's budget was $281 which is way too low for a top-earning film. Therefore in order to be added to dataframe the film
    # budget must be greater than $281.

    if film_revenue['budget'] > 281:
        # print(film_revenue['budget'])
        # add film title, revenue, budget and release date to the dataframe
        highest_revenue_ever_df.loc[len(highest_revenue_ever_df)]=[film['title'],film_revenue['revenue'], (film_revenue['budget'] * -1), film_revenue['release_date']]

highest_revenue_ever_df.head()    

Enter fullscreen mode Exit fullscreen mode
.dataframe thead tr:only-child th {
    text-align: right;
}

.dataframe thead th {
    text-align: left;
}

.dataframe tbody tr th {
    vertical-align: top;
}
Enter fullscreen mode Exit fullscreen mode
film revenue budget release_date
0 Avatar 2781505847 -237000000 2009-12-10
1 Star Wars: The Force Awakens 2068223624 -245000000 2015-12-15
2 Titanic 1845034188 -200000000 1997-11-18
3 The Avengers 1519557910 -220000000 2012-04-25
4 Jurassic World 1513528810 -150000000 2015-06-09

Calculate the gross profit

We can calculate the gross profit by subtracting total revenue from amount spent. Earlier we made the budget values negative therefore we need to add the revenue to the (negative) budget to get the gross profit which is effectively subtraction.

highest_revenue_ever_df['gross'] = highest_revenue_ever_df['revenue'] + highest_revenue_ever_df['budget']
Enter fullscreen mode Exit fullscreen mode

What does the dataframe look like now?

highest_revenue_ever_df.head()
Enter fullscreen mode Exit fullscreen mode
.dataframe thead tr:only-child th {
    text-align: right;
}

.dataframe thead th {
    text-align: left;
}

.dataframe tbody tr th {
    vertical-align: top;
}
Enter fullscreen mode Exit fullscreen mode
film revenue budget release_date gross
0 Avatar 2781505847 -237000000 2009-12-10 2544505847
1 Star Wars: The Force Awakens 2068223624 -245000000 2015-12-15 1823223624
2 Titanic 1845034188 -200000000 1997-11-18 1645034188
3 The Avengers 1519557910 -220000000 2012-04-25 1299557910
4 Jurassic World 1513528810 -150000000 2015-06-09 1363528810

Plotting data in matplotlib with horizontal bar charts and a scatter plot

fig, ax = plt.subplots()
highest_revenue_ever_df.plot(kind="barh", y='revenue', color = ['#624ea7', '#599ad3', '#f9a65a', '#9e66ab', 'purple'], x=highest_revenue_ever_df['film'], ax=ax)
formatter = FuncFormatter(currency)
ax.xaxis.set_major_formatter(formatter)
ax.legend().set_visible(False)
ax.set(title='American Films with Highest Revenue (All Time)', xlabel='Revenue', ylabel='Film')
Enter fullscreen mode Exit fullscreen mode
[<matplotlib.text.Text at 0x111c90e48>,
 <matplotlib.text.Text at 0x111f85588>,
 <matplotlib.text.Text at 0x1120f0e48>]
Enter fullscreen mode Exit fullscreen mode

png

fig, ax = plt.subplots()
highest_revenue_ever_df.plot(kind="barh", y='gross', color = ['#624ea7', '#599ad3', '#f9a65a', '#9e66ab', 'purple'], x=highest_revenue_ever_df['film'], ax=ax)
formatter = FuncFormatter(currency)
ax.xaxis.set_major_formatter(formatter)
ax.legend().set_visible(False)
ax.set(title='Gross Profit of the American Films with Highest Revenue (All Time)', xlabel='Gross Profit', ylabel='Film')
Enter fullscreen mode Exit fullscreen mode
[<matplotlib.text.Text at 0x112285cf8>,
 <matplotlib.text.Text at 0x1120bf198>,
 <matplotlib.text.Text at 0x11234de10>]
Enter fullscreen mode Exit fullscreen mode

png

fig, ax = plt.subplots()
highest_revenue_ever_df.plot(kind='scatter', y='gross', x='budget', ax=ax)
formatter = FuncFormatter(currency)
ax.xaxis.set_major_formatter(formatter)
ax.yaxis.set_major_formatter(formatter)
ax.set(title='Profit vs Budget of the American Films with Highest Revenue (All Time)', xlabel='Budget', ylabel='Gross Profit')

Enter fullscreen mode Exit fullscreen mode
[<matplotlib.text.Text at 0x112b67f98>,
 <matplotlib.text.Text at 0x112b29518>,
 <matplotlib.text.Text at 0x112b8e550>]
Enter fullscreen mode Exit fullscreen mode

png

# Adding release year to dataframe
# highest_revenue_ever_df['year'] = pd.DatetimeIndex(highest_revenue_ever_df['release_date']).year
# print(highest_revenue_ever_df)
Enter fullscreen mode Exit fullscreen mode

Limitations

The above data and graphs do not account for inflation (the TMDB API returns by revenue unadjusted by inflation) therefore the earnings from more recent films are more weighted than their earlier counterparts. When looking at all time data inflation should be adjusted for however when looking over a shorter time period adjusting for inflation might not be necessary. Older films would appear above if inflation was taken into account, as it is now, the oldest film on this list was The Titanic in 1997.

Lead photo is Chris Washington, played by Daniel Kaluuya, from Get Out. Universal Pictures

A version of this article was originally published by Monica Powell on www.datalogues.com

Top comments (3)

Collapse
 
mlinardy profile image
mlinardy

Hi @m0nica ,

Thanks for writing this tutorial. I am learning Python and your tutorial helps a lot!

Just a question on this bit of code please:

film_revenue = requests.get('api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')

how did you know the format of the API link for revenue (the link inside requests.get)? Is there a specific documentation I can follow?

Thanks,
Marvin

Collapse
 
m0nica profile image
Monica Powell

I followed the documentation at developers.themoviedb.org/3/gettin.... I'd recommend starting there to get a better sense of what the TMDB API supports and how API requests should be formatted.

Collapse
 
jeffpollet profile image
Solidarity&Resist

This is so great, and exactly what I was looking for to start playing with tmdb's api, thank you! I am a newbie, but I think I found one little error. You import pandas as pd, but later do df = pandas.DataFrame(columns=columns), which should be pd.DataFrame(columns=columns).

Thanks again!