The Olympic Games are one of the most prestigious and widely-followed sporting events in the world. With a history dating back over a century, the Olympics have seen countless athletes from all corners of the globe compete for glory and honor. In this project, we will use Python, SQL and Tableau to analyze 120 years of Olympics data sourced from Kaggle.
The data for this analysis was sourced from Kaggle and includes information on athletes, teams, events, and medals. The data was collected for all Summer and Winter Olympics from 1896 to 2016.
Data source: Kaggle
Before we could begin our analysis, the data needed to be cleaned and prepared for use. This involved removing any duplicate or irrelevant data, correcting any errors or inconsistencies, and formatting the data in a way that would be compatible with our analysis tools.
df = pd.read_csv("data/athlete_events.csv")
Calculate the number of missing values in the data
We got some missing values with the age, height and weight variables. How about we fix that.
Fill in the age missing values with its mean.
df['Age'].fillna(df['Age'].mean(), inplace= True)
Repeat the same for the height and weight variables.
Once the data was cleaned and prepared, we used SQL to extract and analyze the data. We were able to answer questions such as:
1.What is the Gender distribution of athletes across different sports and event?
SELECT Sport, Event, Sex, COUNT(Name) as athlete_count FROM Athlete GROUP BY Sport, Event, Sex ORDER BY athlete_count DESC;
2.Which city and season have the most successful athletes?
SELECT City, Season, COUNT(Medal) as Medal_count FROM Athlete WHERE Medal IS NOT NULL GROUP BY City, Season ORDER BY Medal_count DESC
3.Which athletes are the most successful in terms of number of medals?
SELECT Name, COUNT(Medal) as medal_count FROM Athlete GROUP BY Name ORDER BY medal_count DESC;
4.Trend analysis to identify any patterns in the number of medals over time
SELECT Year, COUNT(Medal) as medal_count FROM Athlete GROUP BY Year ORDER BY Year ;
5.Use statistical analysis to find factors associated with winning more medals.
- first create a sub query to find the total count of medals per city and season
WITH city_season_medals AS ( SELECT City, Season, COUNT(Medal) AS medal_count FROM Athlete GROUP BY City, Season )
- use statistical methods to find the relationship between city & season and the total number of medals. Fo this instance I used z-score. The higher the z-score the higher the correlation.
SELECT City, Season, medal_count, (medal_count - AVG(medal_count) OVER()) / STDEV(medal_count) OVER() AS z_score FROM city_season_medals ORDER BY z_score DESC
These are some of the questions that were answered during the SQL analysis.
To make the data more accessible and understandable, Tableau was used to create visualizations of our findings. Some of the visualizations we created include:
- A pie chart showing the gender distribution across all sports
- A bar chart showing the most successful athletes over time.
- line charts to show trends and patterns in the number of medals won over time
View the whole Olympics dashboard:Olympics Dashboard
Through this analysis, I was able to gain a deeper understanding of the history of the Olympics and the performances of countries and athletes over the past 120 years. By using SQL and Tableau, I was able to extract valuable insights from the data and present them in a clear and accessible way.
The code and data used for this project is available at Github
By using the above techniques, we can also use the data to predict future results, and also to identify patterns and trends that can be used to improve performance and achieve success in future Olympic games.