DEV Community

Cover image for Analysis of 120 Years of Olympics Data
Esther Njihia
Esther Njihia

Posted on

Analysis of 120 Years of Olympics Data

Overview

The Olympic Games are one of the most prestigious and widely-followed sporting events in the world. With a history dating back over a century, the Olympics have seen countless athletes from all corners of the globe compete for glory and honor. In this project, we will use Python, SQL and Tableau to analyze 120 years of Olympics data sourced from Kaggle.

Data Collection

The data for this analysis was sourced from Kaggle and includes information on athletes, teams, events, and medals. The data was collected for all Summer and Winter Olympics from 1896 to 2016.
Data source: Kaggle

Data Cleaning and Preparation

Before we could begin our analysis, the data needed to be cleaned and prepared for use. This involved removing any duplicate or irrelevant data, correcting any errors or inconsistencies, and formatting the data in a way that would be compatible with our analysis tools.

Data Cleaning with Python

Load data

df = pd.read_csv("data/athlete_events.csv")
Enter fullscreen mode Exit fullscreen mode

Calculate the number of missing values in the data

df.isnull().sum()
Enter fullscreen mode Exit fullscreen mode

We got some missing values with the age, height and weight variables. How about we fix that.
Fill in the age missing values with its mean.

df['Age'].fillna(df['Age'].mean(), inplace= True)
Enter fullscreen mode Exit fullscreen mode

Repeat the same for the height and weight variables.

SQL Analysis

Once the data was cleaned and prepared, we used SQL to extract and analyze the data. We were able to answer questions such as:
1.What is the Gender distribution of athletes across different sports and event?

SELECT Sport, Event, Sex, COUNT(Name) as athlete_count
FROM Athlete
GROUP BY Sport, Event, Sex
ORDER BY athlete_count DESC;
Enter fullscreen mode Exit fullscreen mode

2.Which city and season have the most successful athletes?

SELECT City, Season, COUNT(Medal) as Medal_count
FROM Athlete
WHERE Medal IS NOT NULL
GROUP BY City, Season
ORDER BY Medal_count DESC
Enter fullscreen mode Exit fullscreen mode

3.Which athletes are the most successful in terms of number of medals?

SELECT Name, COUNT(Medal) as medal_count
FROM Athlete
GROUP BY Name
ORDER BY medal_count DESC;
Enter fullscreen mode Exit fullscreen mode

4.Trend analysis to identify any patterns in the number of medals over time
By year

SELECT Year, COUNT(Medal) as medal_count
FROM Athlete
GROUP BY Year
ORDER BY Year ;
Enter fullscreen mode Exit fullscreen mode

5.Use statistical analysis to find factors associated with winning more medals.

  • first create a sub query to find the total count of medals per city and season
WITH city_season_medals AS (
    SELECT City, Season, COUNT(Medal) AS medal_count
    FROM Athlete
    GROUP BY City, Season
)
Enter fullscreen mode Exit fullscreen mode
  • use statistical methods to find the relationship between city & season and the total number of medals. Fo this instance I used z-score. The higher the z-score the higher the correlation.
SELECT City, Season, medal_count, (medal_count - AVG(medal_count) OVER()) / STDEV(medal_count) OVER() AS z_score
FROM city_season_medals
ORDER BY z_score DESC
Enter fullscreen mode Exit fullscreen mode

These are some of the questions that were answered during the SQL analysis.

Visualization with Tableau

To make the data more accessible and understandable, Tableau was used to create visualizations of our findings. Some of the visualizations we created include:

  • A pie chart showing the gender distribution across all sports
  • A bar chart showing the most successful athletes over time.
  • line charts to show trends and patterns in the number of medals won over time

View the whole Olympics dashboard:Olympics Dashboard

Conclusion

Through this analysis, I was able to gain a deeper understanding of the history of the Olympics and the performances of countries and athletes over the past 120 years. By using SQL and Tableau, I was able to extract valuable insights from the data and present them in a clear and accessible way.

Code and Data

The code and data used for this project is available at Github

By using the above techniques, we can also use the data to predict future results, and also to identify patterns and trends that can be used to improve performance and achieve success in future Olympic games.

Top comments (1)

Collapse
 
sourov profile image
Sourov Ahmed

Thanks for sharing.