Extracting Data from Transfermarkt: An Introduction to Web Scraping

#python #beginners #webscraping #beautifulsoup

This is a translated version of my tutorial originally published in Brazilian Portuguese. The repository with the code from this tutorial is on my GitLab profile.

Getting data and transforming it into information is the foundation of fields such as Data Science. Sometimes obtaining it is very simple. For example, you can visit the Brazilian government website data.gov.br right now, download raw data files published by the government, and quickly analyze a .csv file (a plain-text format that stores tabular data as comma-separated values).

However, in some situations the data is harder to obtain. For example, the data you need for an analysis may only be available on a web page. In this situation you can use Beautiful Soup, a Python library, to perform web scraping.

Beautiful Soup is one of the most popular Python libraries for web scraping. It extracts data from HTML and XML files and offers several methods that make searching for specific data on a web page simple and fast.
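To give a first feel for those search methods, here is a minimal sketch that parses a small hand-written page instead of a real website. The HTML, the class names and the variable names are all made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, used only for illustration
html = """
<html>
  <body>
    <h1>Squad</h1>
    <ul>
      <li class="player">Player A</li>
      <li class="player">Player B</li>
      <li class="staff">Coach C</li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, find_all() returns every match
heading = soup.find("h1").text
players = [li.text for li in soup.find_all("li", {"class": "player"})]

print(heading)
print(players)
```

The same two methods, find() and find_all(), are the ones we will rely on for the rest of this tutorial.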

For this tutorial, we will extract data from Transfermarkt, a web platform that contains news and data about games, transfers, clubs and players from the football/soccer world.

Transfermarkt Homepage

We will retrieve the name, the country of the previous league and the price of the 25 most expensive players in the history of AFC Ajax. This information can be found on the Transfermarkt page.

Page that contains the information about the 25 biggest AFC Ajax signings

Extracting Data

Before obtaining the data itself, we will import the libraries required to run the program: Beautiful Soup, Pandas and Requests.



import requests
from bs4 import BeautifulSoup
import pandas as pd

After that, we will download the web page in our program using the requests library, which requests the page from the server, and the BeautifulSoup library, which transforms the data received by requests (a Response object) into a BeautifulSoup object that will be used for data extraction.



"""
To make the request to the page we have to inform the
website that we are a browser and that is why we
use the headers variable
"""
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

# endereco_da_pagina stands for the data page address
endereco_da_pagina = "https://www.transfermarkt.co.uk/ajax-amsterdam/transferrekorde/verein/610/saison_id//pos//detailpos/0/w_s//altersklasse//plus/1"

# The objeto_response variable will hold the downloaded web page
objeto_response = requests.get(endereco_da_pagina, headers=headers)

"""
Now we will create a BeautifulSoup object from our objeto_response.
The 'html.parser' parameter tells BeautifulSoup which parser to use when creating our object.
A parser is a piece of software that converts its input into a data structure.
"""
pagina_bs = BeautifulSoup(objeto_response.content, 'html.parser')

pagina_bs is now a variable that contains all the HTML content inside our data page.
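Websites sometimes answer with an error page (for example when they block the request), and parsing that page silently would give us empty results later. As a hedged sketch, the helper below checks the status before parsing. The function name soup_from_response is hypothetical, and the Response objects are built by hand only so that the example runs without network access.

```python
import requests
from bs4 import BeautifulSoup


def soup_from_response(response):
    """Return a BeautifulSoup object, or raise if the request failed."""
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx codes
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")


# Responses built by hand so the sketch runs offline.
# Setting _content directly is a testing shortcut, not normal usage.
ok_response = requests.Response()
ok_response.status_code = 200
ok_response._content = b"<html><body><p>hello</p></body></html>"

blocked_response = requests.Response()
blocked_response.status_code = 403

page = soup_from_response(ok_response)
print(page.find("p").text)

try:
    soup_from_response(blocked_response)
    failed = False
except requests.HTTPError as error:
    failed = True
    print("Request failed:", error)
```

In the real program you would pass the object returned by requests.get() to the helper instead of a hand-built Response.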

Now let's extract the data stored in our variable. Note that the information we need is in a table. Each row in this table represents a player and holds three pieces of data: the player's name, represented in HTML by an anchor (<a>) with the class "spielprofil_tooltip"; the country of the league he came from, represented by a flag image (<img>) with the class "flaggenrahmen" in the seventh column (<td>) of each row; and the cost, represented by a table cell (<td>) with the class "rechts hauptlink".
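To make this description more concrete, the sketch below parses two hand-written rows that imitate that structure. The markup is an assumption made for illustration: the real page has more columns and attributes, and the player names and prices here are placeholders.

```python
from bs4 import BeautifulSoup

# Hand-written markup that imitates the structure described above.
# The real page has more columns and attributes; this is an assumption.
sample_html = """
<table>
  <tr>
    <td><a class="spielprofil_tooltip" href="#">Player A</a></td>
    <td><img class="flaggenrahmen" title="Brazil" src="br.png"></td>
    <td class="rechts hauptlink">£10.00m</td>
  </tr>
  <tr>
    <td><a class="spielprofil_tooltip" href="#">Player B</a></td>
    <td><img class="flaggenrahmen" title="Denmark" src="dk.png"></td>
    <td class="rechts hauptlink">£8.50m</td>
  </tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for row in soup.find_all("tr"):
    # Each lookup is restricted to the current row
    name = row.find("a", {"class": "spielprofil_tooltip"}).text
    country = row.find("img", {"class": "flaggenrahmen"})["title"]
    price = row.find("td", {"class": "rechts hauptlink"}).text
    rows.append((name, country, price))

print(rows)
```

Reading one row at a time keeps the three values of each player together. In this tutorial we take a simpler route and collect each kind of value in its own list.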

We will then get this data using the BeautifulSoup library.

First we will get the names of the players.



nomes_jogadores = []  # List that will receive all the players' names

# The find_all() method returns all tags that match the filters passed to it
tags_jogadores = pagina_bs.find_all("a", {"class": "spielprofil_tooltip"})
# In our case, we are finding all anchors with the class "spielprofil_tooltip"

# Now we will get only the names of all players
for tag_jogador in tags_jogadores:
    nomes_jogadores.append(tag_jogador.text)
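The .text attribute returns the text exactly as it appears in the HTML, including any surrounding whitespace. If that becomes a problem, get_text(strip=True) is one way to clean it up. A small sketch with a made-up anchor:

```python
from bs4 import BeautifulSoup

# Hypothetical anchor with stray whitespace around the name
soup = BeautifulSoup(
    '<a class="spielprofil_tooltip">\n  Player A \n</a>', "html.parser"
)
tag = soup.find("a", {"class": "spielprofil_tooltip"})

raw_name = tag.text                    # keeps the surrounding whitespace
clean_name = tag.get_text(strip=True)  # strips it

print(repr(raw_name))
print(repr(clean_name))
```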

Now we will get the countries of the players' previous leagues.



pais_jogadores = []  # List that will receive the names of the countries of the players' previous leagues

tags_ligas = pagina_bs.find_all("td", {"class": None})
# Now we have all the cells in the table that have no class attribute set

for tag_liga in tags_ligas:
    # The find() method will find the first image whose class is "flaggenrahmen" and that has a title.
    # Both filters go in the same dictionary; a third positional argument would be read as `recursive`.
    imagem_pais = tag_liga.find("img", {"class": "flaggenrahmen", "title": True})
    # The imagem_pais variable will be a structure with all the image information,
    # including the title, which contains the name of the country of the flag image
    if imagem_pais is not None:  # If we found a match, we add it to the list
        pais_jogadores.append(imagem_pais['title'])
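The two special filter values used in this step can be tried in isolation. In the sketch below, written against made-up markup, None keeps only the tags that lack an attribute and True keeps only the tags that have it. Note that find() takes a single attrs dictionary, so both image filters are passed together; a second dictionary passed positionally would be read as the recursive argument.

```python
from bs4 import BeautifulSoup

# Made-up cells: only some have a class, only some images have a title
html = """
<table><tr>
  <td class="zentriert"><img class="flaggenrahmen" title="Netherlands"></td>
  <td><img class="flaggenrahmen"></td>
  <td><img class="flaggenrahmen" title="Brazil"></td>
  <td>no image here</td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# {"class": None} keeps only the cells without a class attribute
cells = soup.find_all("td", {"class": None})

countries = []
for cell in cells:
    # "title": True matches only tags that have a title attribute
    flag = cell.find("img", {"class": "flaggenrahmen", "title": True})
    if flag is not None:
        countries.append(flag["title"])

print(len(cells))
print(countries)
```

The first cell is skipped because it has a class, the second because its image has no title, and the last because it has no image at all.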

Finally, we will get the players' prices.



custos_jogadores = []

tags_custos = pagina_bs.find_all("td", {"class": "rechts hauptlink"})

for tag_custo in tags_custos:
    texto_preco = tag_custo.text
    # The price text contains characters that we don't need, like £ (pound sign) and m (million), so we'll remove them
    texto_preco = texto_preco.replace("£", "").replace("m", "")
    # We will now convert the value to a numeric variable (float)
    preco_numerico = float(texto_preco)
    custos_jogadores.append(preco_numerico)
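The conversion above works as long as every price ends in "m". As a more defensive sketch, the hypothetical helper below also handles a thousands suffix and returns None for text that is not a number. The "Th." suffix is an assumption about how the site abbreviates smaller fees, so check it against the real page before relying on it.

```python
def parse_price_millions(text):
    """Convert a price string such as '£19.80m' into millions (float).

    Returns None when the text holds no recognisable amount.
    The 'Th.' (thousands) suffix is an assumption about the site's format.
    """
    cleaned = text.strip().replace("£", "").replace("€", "")
    divisor = 1.0
    if cleaned.endswith("m"):
        cleaned = cleaned[:-1]
    elif cleaned.endswith("Th."):
        cleaned = cleaned[:-3]
        divisor = 1000.0  # thousands expressed in millions
    try:
        return float(cleaned) / divisor
    except ValueError:
        return None


examples = ["£19.80m", " £900Th. ", "-", "£7m"]
parsed = [parse_price_millions(price) for price in examples]
print(parsed)
```

Keeping the parsing in one function also makes it easy to adjust if the site changes its price format.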

Now that we have all the data we wanted, let's organize it so that any analysis we want to do becomes easier. For this, we will use the pandas library and its DataFrame class, which represents a tabular data structure, similar to a common table.



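# Creating a DataFrame with our data. The column names are in Portuguese:
# "Player", "Price (million euros)" and "Country of Origin". Note that the
# prices scraped from the .co.uk page are in pounds (£), despite the label.
df = pd.DataFrame({"Jogador": nomes_jogadores, "Preço (milhão de euro)": custos_jogadores, "País de Origem": pais_jogadores})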

# Printing our gathered data
print(df)
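Because the three lists were collected separately, it is worth checking that they have the same length before building the table. The sketch below does that, then sorts the table and exports it as CSV. The list contents are placeholders standing in for the scraped values, and the CSV is written to an in-memory buffer only so that the example leaves no file behind.

```python
import io

import pandas as pd

# Placeholder values standing in for the scraped lists
nomes_jogadores = ["Player A", "Player B", "Player C"]
custos_jogadores = [8.5, 19.8, 10.0]
pais_jogadores = ["Denmark", "Brazil", "Netherlands"]

# pd.DataFrame raises ValueError when the lists differ in length,
# so checking first gives a clearer error message
lengths = {len(nomes_jogadores), len(custos_jogadores), len(pais_jogadores)}
if len(lengths) != 1:
    raise ValueError("The scraped lists have different lengths")

df = pd.DataFrame({
    "Jogador": nomes_jogadores,
    "Preço (milhão de euro)": custos_jogadores,
    "País de Origem": pais_jogadores,
})

# Most expensive signing first
df_sorted = df.sort_values(
    "Preço (milhão de euro)", ascending=False
).reset_index(drop=True)

# Written to an in-memory buffer here; pass a file path to save to disk
buffer = io.StringIO()
df_sorted.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

A length mismatch usually means that one of the filters matched extra tags or missed some, so it is a useful early warning.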