In this tutorial, we will be extracting data from the top 5 soccer leagues globally: the EPL, La Liga, Serie A, Bundesliga, and Ligue 1. However, to obtain this data, we will be extracting it from an official sports website (understat.com). We will then learn how to store and manage these data with Fauna.
Prerequisites
To fully understand this tutorial, you are required to have the following in place:
- Python 3.7 or newer.
- Basic understanding of Fauna.
- Basic knowledge of Django.
- A text editor.
With the above prerequisites out of the way, we can now begin building our web scraper application.
Introduction to Fauna
Fauna is a client-side serverless database that uses GraphQL and the Fauna Query Language (FQL) to support various data types and relational databases in a serverless API. You can learn more about Fauna in their official documentation here. If this is the first time you have heard about Fauna, visit my previous article here for a brief introduction.
Installing the required libraries
The libraries required for this tutorial are as follows:
- numpy — fundamental package for scientific computing with Python
- pandas — library providing high-performance, easy-to-use data structures, and data analysis tools
- requests — is the only Non-GMO HTTP library for Python, safe for human consumption. (love this line from official docs :D)
- BeautifulSoup — a python library for pulling data out of HTML and XML files.
To install the libraries required for this tutorial, run the following commands below:
pip install numpy
pip install pandas
pip install requests
pip install bs4
Building the Python Web Scraper
Now that we have all the required libraries installed, let’s get to building our web scraper.
Importing the Python libraries
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json
Carrying out Site Research
The first step in any web scraping project is researching the web page you want to scrape and learn how it works. That is critical to finding where to get the data from the site; therefore, this is where we’ll begin.
We can see on the home page that the site has data for six European leagues. However, we will be extracting data for just the top 5 leagues(teams excluding RFPL).
We can also notice that data on the site starts from 2014/2015 to 2020/2021. Let’s create variables to handle only the information we require.
# create urls for all seasons of all leagues
base_url = 'https://understat.com/league'
leagues = ['La_liga', 'EPL', 'Bundesliga', 'Serie_A', 'Ligue_1']
seasons = ['2016', '2017', '2018', '2019', '2020']
The next step is to figure out where the data on the web page is stored. To do so, open Developer Tools in Chrome, navigate to the Network tab, locate the data file (in this example, 2018), and select the “Response” tab. After executing requests, this is what we'll get. get(URL)
After looking through the web page's content, we discovered that the data is saved beneath the "script" element in the teamsData
variable and is JSON encoded. As a result, we'll need to track down this tag, extract JSON from it, and convert it to a Python-readable data structure.
Decoding the JSON data
season_data = dict()
for season in seasons:
url = base_url+'/'+league+'/'+season
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
# Based on the structure of the webpage, I found that data is in the JSON variable, under <script> tags
scripts = soup.find_all('script')
string_with_json_obj = ''
# Find data for teams
for el in scripts:
if 'teamsData' in str(el):
string_with_json_obj = str(el).strip()
# strip unnecessary symbols and get only JSON data
ind_start = string_with_json_obj.index("('")+2
ind_end = string_with_json_obj.index("')")
json_data = string_with_json_obj[ind_start:ind_end]
json_data = json_data.encode('utf8').decode('unicode_escape')
#print(json_data)
After running the python code above, you should get data that we’ve cleaned up.
Understanding the Data
When we start looking at the data, we realise it's a dictionary of dictionaries with three keys: id, title, and history. Ids are also used as keys in the dictionary's initial layer.
Therefore, we can deduce that history has information on every match a team has played in its league (League Cup or Champions League games are omitted).
After reviewing the first layer dictionary, we can begin to compile a list of team names.
# Get teams and their relevant ids and put them into separate dictionary
teams = {}
for id in data.keys():
teams[id] = data[id]['title']
We see that column names frequently appear; therefore, we put them in a separate list. Also, look at how the sample values appear.
columns = []
# Check the sample of values per each column
values = []
for id in data.keys():
columns = list(data[id]['history'][0].keys())
values = list(data[id]['history'][0].values())
break
Now let’s get data for all teams. Uncomment the print statement in the code below to print the data to your console.
# Getting data for all teams
dataframes = {}
for id, team in teams.items():
teams_data = []
for row in data[id]['history']:
teams_data.append(list(row.values()))
df = pd.DataFrame(teams_data, columns=columns)
dataframes[team] = df
# print('Added data for {}.'.format(team))
After you have completed this code, we will have a dictionary of DataFrames with the key being the team's name and the value being the DataFrame containing all of the team's games.
Manipulating Data to Table
When we look at the DataFrame content, we can see that metrics like PPDA and OPPDA (ppda and ppda allowed) are represented as total sums of attacking/defensive actions.
However, they are shown as coefficients in the original table. Let's clean that up.
for team, df in dataframes.items():
dataframes[team]['ppda_coef'] = dataframes[team]['ppda'].apply(lambda x: x['att']/x['def'] if x['def'] != 0 else 0)
dataframes[team]['oppda_coef'] = dataframes[team]['ppda_allowed'].apply(lambda x: x['att']/x['def'] if x['def'] != 0 else 0)
We now have all of our numbers, but for every game. The totals for the team are what we require. Let's look at the columns we need to add up. To do so, we returned to the original table on the website and discovered that all measures should be added together, with only PPDA and OPPDA remaining as means in the end. First, let’s define the columns we need to sum and mean.
cols_to_sum = ['xG', 'xGA', 'npxG', 'npxGA', 'deep', 'deep_allowed', 'scored', 'missed', 'xpts', 'wins', 'draws', 'loses', 'pts', 'npxGD']
cols_to_mean = ['ppda_coef', 'oppda_coef']
Finally, let’s calculate the totals and means.
for team, df in dataframes.items():
sum_data = pd.DataFrame(df[cols_to_sum].sum()).transpose()
mean_data = pd.DataFrame(df[cols_to_mean].mean()).transpose()
final_df = sum_data.join(mean_data)
final_df['team'] = team
final_df['matches'] = len(df)
frames.append(final_df)
full_stat = pd.concat(frames)
full_stat = full_stat[['team', 'matches', 'wins', 'draws', 'loses', 'scored', 'missed', 'pts', 'xG', 'npxG', 'xGA', 'npxGA', 'npxGD', 'ppda_coef', 'oppda_coef', 'deep', 'deep_allowed', 'xpts']]
full_stat.sort_values('pts', ascending=False, inplace=True)
full_stat.reset_index(inplace=True, drop=True)
full_stat['position'] = range(1,len(full_stat)+1)
full_stat['xG_diff'] = full_stat['xG'] - full_stat['scored']
full_stat['xGA_diff'] = full_stat['xGA'] - full_stat['missed']
full_stat['xpts_diff'] = full_stat['xpts'] - full_stat['pts']
cols_to_int = ['wins', 'draws', 'loses', 'scored', 'missed', 'pts', 'deep', 'deep_allowed']
full_stat[cols_to_int] = full_stat[cols_to_int].astype(int)
In the code above, we reordered columns for better readability, sorted rows based on points, reset the index, and added column ‘position’.
We also added the differences between the expected metrics and real metrics.
Lastly, we converted the floats to integers where appropriate.
Beautifying the Final Output of the Dataframe
Finally, let’s beautify our data to become similar to the site data in the image above. To do this, run the python code below.
col_order = ['position', 'team', 'matches', 'wins', 'draws', 'loses', 'scored', 'missed', 'pts', 'xG', 'xG_diff', 'npxG', 'xGA', 'xGA_diff', 'npxGA', 'npxGD', 'ppda_coef', 'oppda_coef', 'deep', 'deep_allowed', 'xpts', 'xpts_diff']
full_stat = full_stat[col_order]
full_stat = full_stat.set_index('position')
# print(full_stat.head(20))
To print a part of the beautified data, uncomment the print statement in the code above.
Compiling the Final Code
To get all the data, we need to loop through all the leagues and seasons then manipulate it to be exportable as a CSV file.
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json
# create urls for all seasons of all leagues
base_url = 'https://understat.com/league'
leagues = ['La_liga', 'EPL', 'Bundesliga', 'Serie_A', 'Ligue_1']
seasons = ['2016', '2017', '2018', '2019', '2020']
full_data = dict()
for league in leagues:
season_data = dict()
for season in seasons:
url = base_url+'/'+league+'/'+season
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
# Based on the structure of the webpage, I found that data is in the JSON variable, under <script> tags
scripts = soup.find_all('script')
string_with_json_obj = ''
# Find data for teams
for el in scripts:
if 'teamsData' in str(el):
string_with_json_obj = str(el).strip()
# print(string_with_json_obj)
# strip unnecessary symbols and get only JSON data
ind_start = string_with_json_obj.index("('")+2
ind_end = string_with_json_obj.index("')")
json_data = string_with_json_obj[ind_start:ind_end]
json_data = json_data.encode('utf8').decode('unicode_escape')
# convert JSON data into Python dictionary
data = json.loads(json_data)
# Get teams and their relevant ids and put them into separate dictionary
teams = {}
for id in data.keys():
teams[id] = data[id]['title']
# EDA to get a feeling of how the JSON is structured
# Column names are all the same, so we just use first element
columns = []
# Check the sample of values per each column
values = []
for id in data.keys():
columns = list(data[id]['history'][0].keys())
values = list(data[id]['history'][0].values())
break
# Getting data for all teams
dataframes = {}
for id, team in teams.items():
teams_data = []
for row in data[id]['history']:
teams_data.append(list(row.values()))
df = pd.DataFrame(teams_data, columns=columns)
dataframes[team] = df
# print('Added data for {}.'.format(team))
for team, df in dataframes.items():
dataframes[team]['ppda_coef'] = dataframes[team]['ppda'].apply(lambda x: x['att']/x['def'] if x['def'] != 0 else 0)
dataframes[team]['oppda_coef'] = dataframes[team]['ppda_allowed'].apply(lambda x: x['att']/x['def'] if x['def'] != 0 else 0)
cols_to_sum = ['xG', 'xGA', 'npxG', 'npxGA', 'deep', 'deep_allowed', 'scored', 'missed', 'xpts', 'wins', 'draws', 'loses', 'pts', 'npxGD']
cols_to_mean = ['ppda_coef', 'oppda_coef']
frames = []
for team, df in dataframes.items():
sum_data = pd.DataFrame(df[cols_to_sum].sum()).transpose()
mean_data = pd.DataFrame(df[cols_to_mean].mean()).transpose()
final_df = sum_data.join(mean_data)
final_df['team'] = team
final_df['matches'] = len(df)
frames.append(final_df)
full_stat = pd.concat(frames)
full_stat = full_stat[['team', 'matches', 'wins', 'draws', 'loses', 'scored', 'missed', 'pts', 'xG', 'npxG', 'xGA', 'npxGA', 'npxGD', 'ppda_coef', 'oppda_coef', 'deep', 'deep_allowed', 'xpts']]
full_stat.sort_values('pts', ascending=False, inplace=True)
full_stat.reset_index(inplace=True, drop=True)
full_stat['position'] = range(1,len(full_stat)+1)
full_stat['xG_diff'] = full_stat['xG'] - full_stat['scored']
full_stat['xGA_diff'] = full_stat['xGA'] - full_stat['missed']
full_stat['xpts_diff'] = full_stat['xpts'] - full_stat['pts']
cols_to_int = ['wins', 'draws', 'loses', 'scored', 'missed', 'pts', 'deep', 'deep_allowed']
full_stat[cols_to_int] = full_stat[cols_to_int].astype(int)
col_order = [‘league’,’year’,'position', 'team', 'matches', 'wins', 'draws', 'loses', 'scored', 'missed', 'pts', 'xG', 'xG_diff', 'npxG', 'xGA', 'xGA_diff', 'npxGA', 'npxGD', 'ppda_coef', 'oppda_coef', 'deep', 'deep_allowed', 'xpts', 'xpts_diff']
full_stat = full_stat[col_order]
full_stat = full_stat.set_index('position')
# print(full_stat.head(20))
season_data[season] = full_stat
df_season = pd.concat(season_data)
full_data[league] = df_season
To analyse our data with Django, we need to export the data to a CSV file. To do this, copy and paste the code below.
data = pd.concat(full_data)
data.to_csv('understat.com.csv')
Setting up the Fauna Database
You must first establish an account before you can build a Fauna database. You may create another after you've created one by clicking the CREATE DATABASE button on the dashboard.
Creating the Database Collections
Create one collection in your database, the SOCCER_DATA
collection. The SOCCER_DATA
collection will store the data we extract to our CSV file in the database. For the History days
and TTL
, use the default values provided and save.
Creating the Collection Indexes
Create two indexes for your collections; check_name
and get_all_teams_data
. The get_all_teams_data
index will allow you to scroll through data in the SOCCER_DATA
collection. It has one term, which is the matches
field. This term will enable you to match data with the matches for easy querying.
The check_name
will allow you to scroll through data in the SOCCER_DATA
collection. This index will enable matching with the team
field to perform queries.
Generating the Database API Key
To generate an API key, go to the security tab on the left side of your dashboard, then click New Key
to create a key. You will then be required to provide a database to connect to the key. After providing the information required, click the SAVE
button.
After saving your key, Fauna will present you with your secret key in the image above (hidden here for privacy). Make sure to copy your secret key from the dashboard and save it somewhere you can easily retrieve it for later use.
Setting Up the Django Application
The Django application will collect data from the CSV file, store it in the Fauna database and display it on the homepage user interface. To create a Django project, run the commands below:
django-admin startproject FAUNA_WEBSCRAPER
Django-admin startapp APP
After creating the Django project, we need to create another application before diving into the main code logic. Follow the steps below to set up your Django application.
- Add
APP
to installed apps in the settings.py file as seen in the code below:
INSTALLED_APPS = [
'django.contrib.admin',
'django.contrib.auth',
'django.contrib.contenttypes',
'django.contrib.sessions',
'django.contrib.messages',
'django.contrib.staticfiles',
'APP',
]
- Update the project
urls.py
file with python code below:
from django.contrib import admin
from django.urls import path, include
from django.conf import settings
from django.contrib.staticfiles.urls import staticfiles_urlpatterns
from django.contrib.staticfiles.urls import static
urlpatterns = [
path('admin/', admin.site.urls),
path('', include("APP.urls")),
]
urlpatterns += staticfiles_urlpatterns()
urlpatterns += static(settings.MEDIA_URL, document_root=settings.MEDIA_ROOT)
- Create a new
urls.py
file in theAPP
folder, then paste the python code below.
from django.conf import settings
from django.conf.urls.static import static
from django.urls import path, include
from . import views
app_name = "App"
urlpatterns = [
path("", views.index, name="index"),
]
- Create two new folders in your
APP
directory to handle the application user interface(i.e.templates
andstatic
folder). - Create two new files in the
templates
folder, an HTML file and a CSS file. The HTML should beindex.html
, and the CSS file should bestyle.css
. Copy and paste the code below in their respective files.
{% load static %}
<!DOCTYPE html>
<html>
<head>
<link href="{% static 'style.css' %}" rel="stylesheet">
<h1>Fauna Sport Aggregator</h1>
</head>
<body>
<table class="content-table">
<thead>
<tr>
<th>Position</th>
<th>Team</th>
<th>Matches</th>
<th>Wins</th>
<th>Draws</th>
<th>Loses</th>
<th>Points</th>
</tr>
</thead>
<tbody>
{% for team in team_list_data %}
<tr>
<td>{{team.position}}</td>
<td>{{team.team}}</td>
<td>{{team.matches}}</td>
<td>{{team.wins}}</td>
<td>{{team.draws}}</td>
<td>{{team.loses}}</td>
<td>{{team.points}}</td>
</tr>
{% endfor %}
</tbody>
</table>
</body>
</html>
#style.css
h1 {
color: #0000A0;
font-family: sans-serif;
}
.content-table {
border-collapse: collapse;
margin: 25px 0;
font-size: 1.0em;
font-weight: bold;
font-family: sans-serif;
min-width: 400px;
border-radius: 5px 5px 0 0;
overflow-hidden: hidden;
box-shadow: 0 0 20px rgba(0, 0, 0, 0.15);
}
.content-table thead tr {
background-color: #0000A0;
color: #FFFFFF;
text-align: left;
font-weight: bold;
}
.content-table th,
.content-table td {
padding: 15px 25px;
}
.content-table tbody tr {
border-bottom: 1px solid #dddddd;
}
.content-table tbody tr:nth-of-type(even) {
background-color: #f3f3f3;
color: #0000A0;
font-weight: bold
}
.content-table tbody tr:last-of-type {
border-bottom: 2px solid #0000A0;
}
.content-table tbody tr:hover {
background-color: #D3D3D3;
}
- Copy the CSV file generated earlier to the
APP
folder. - Copy and paste the python code below in the
views.py
file.
from django.shortcuts import render
import csv
from faunadb import query as q
from faunadb.objects import Ref
from faunadb.client import FaunaClient
# Create your views here.
client = FaunaClient(secret="fauna_secret_key", domain="db.eu.fauna.com",
# NOTE: Use the correct domain for your database's Region Group.
port=443,
scheme="https")
def index(request):
team_data_list = []
headers_list = []
paginate_data=client.query(q.paginate(q.match(q.index("get_all_teams_data"), "38")))
all_data = [
q.get(
q.ref(q.collection("SOCCER_DATA"), paginate.id())
) for paginate in paginate_data["data"]
]
team_list_data = [i["data"] for i in client.query(all_data)]
context={"team_list_data":client.query(team_list_data)}
index = 0
with open("understat.com.csv", 'r') as data:
for line in csv.reader(data):
index += 1
if index > 1:
team_dict = {}
for i, elem in enumerate(headers_list):
team_dict[elem] = line[i]
team_data_list.append(team_dict)
else:
headers_list = list(line)
for record in team_data_list[:20]:
try:
team_data_check = client.query(q.get(q.match(q.index('check_name'),record["team"])))
except:
team_record = client.query(q.create(q.collection("SOCCER_DATA"), {
"data": {
"matches":record["matches"],
"position":record["position"],
"team":record["team"],
"wins": record["wins"],
"draws": record["draws"],
"loses":record["loses"],
"points":record["pts"]
}
}))
return render(request,"index.html",context)
- Run the application with the following commands. 9.
python manage.py migrate
python manage.py runserver
After following the steps listed above, your SOCCER_DATA
collection should be like the one in the image below.
Go to http://127:0.0.1.8000 to see your Django application. Your homepage should be similar to the one in the image below.
The Views File.
This file is where we built the logic for the backend. We started by importing the required modules as seen in the python code below:
from django.shortcuts import render
import csv
from faunadb import query as q
from faunadb.objects import Ref
from faunadb.client import FaunaClient
Next, we initialised the Fauna client using the secret key we generated earlier.
client = FaunaClient(secret="fnAETH3xELAAxp99WYA-8_xMLqFM1uMTfAwZYmZO", domain="db.eu.fauna.com",
# NOTE: Use the correct domain for your database's Region Group.
port=443,
scheme="https")
We then needed to extract the data from our excel sheet before saving it into our SOCCER_DATA
collection. We did this by reading and saving the data in the CSV file to a list variable team_data_list
in our program.
team_data_list = []
headers_list = []
index = 0
with open("understat.com.csv", 'r') as data:
for line in csv.reader(data):
index += 1
if index > 1:
team_dict = {}
for i, elem in enumerate(headers_list):
team_dict[elem] = line[i]
team_data_list.append(team_dict)
else:
headers_list = list(line)
Next, we needed to extract data from our list team_data_list
variable, then stored them in the SOCCER_DATA
collection. To do this, we iterated every data in the list and created a request with the Fauna client to save the data passed in the dictionary we defined to the collection. We only saved the first 20 data from the list for this tutorial, so we don’t exceed our free account rate limit.
for record in team_data_list[:20]:
try:
team_data_check = client.query(q.get(q.match(q.index('check_name'),record["team"])))
except:
team_record = client.query(q.create(q.collection("SOCCER_DATA"), {
"data": {
"matches":record["matches"],
"position":record["position"],
"team":record["team"],
"wins": record["wins"],
"draws": record["draws"],
"loses":record["loses"],
"points":record["pts"]
}
}))
Next, we queried the data in the collection using the get_all_team_data1
Fauna index to match data for all teams with 38 matches. We paginated the queried data using Fauna’s paginate
method, then created a list variable called team_list_data
from it. Finally, we passed the list’s data to the front end user interface for display on a table.
Conclusion
In this article, we built a python web scraper to extract soccer data from a website. We also learned how to utilise this data in a Django application with Fauna's serverless database. We saw how easy it is to integrate Fauna into a Python application and got the chance to explore some of its core features and functionalities.
If you have any questions, don't hesitate to contact me on Twitter: @LordChuks3.
Written in connection with the Write with Fauna Program.
Top comments (1)
Awesome