Pavan Belagatti

Data Analysis of the Titanic with Python!

The sinking of the RMS Titanic remains one of the most tragic maritime disasters in history. While the event itself has been extensively covered in literature and film, delving into its dataset provides a unique perspective, bringing stories of its passengers and crew to life through numbers and patterns.

Using Python, one of the most versatile programming languages, and SingleStore Notebooks, we can uncover layers of insights that narrate tales of hope, despair, survival, and loss. This article aims to guide you through an analytical journey, examining the Titanic dataset with Python. By the end, you won't just understand the data – you'll feel the stories and the emotions intertwined within. But first, let's understand some data analytics basics.

What is Data Analytics?

Data analytics is the systematic approach of examining, cleaning, transforming, and interpreting raw data to discover valuable insights and information. It leverages advanced statistical, mathematical, and computational techniques to identify underlying patterns, trends, and relationships within vast datasets.

[Image: data analytics]

By doing so, data analytics provides a foundation for decision-making, allowing organizations to understand their performance, customer behavior, and market dynamics. Furthermore, it plays a pivotal role in optimizing operations, enhancing customer experiences, and predicting future scenarios.

How Does Data Analysis Work?

[Image: data analysis]

Data analysis begins with "Data Collection," where raw data is gathered from various sources. Once collected, the next step is "Data Cleaning," where any inconsistencies, errors, or irrelevant data points are removed or corrected to ensure the quality of the data. After cleaning, the data undergoes "Data Transformation," where it is structured or reformatted to be suitable for analysis.

The "Data Analysis" step involves examining the transformed data to extract meaningful insights, patterns, or trends. This analysis is then visually represented in the "Data Visualization" stage, making it easier to interpret and understand.

Finally, the visualized data feeds "Insight & Decision Making," where informed decisions are made or strategies are formulated from the insights gained.
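
As a rough illustration of these stages, here is a minimal pandas sketch on a tiny made-up table. The records, the column names and the `age_band` cut-off of 18 are hypothetical and only show the flow from collection to insight; the visualization stage is left out to keep it short.

```python
import pandas as pd

# 1. Data collection: a tiny set of made-up records (hypothetical values).
raw = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cara", "Dan"],
    "age": [29, 41, 41, None, 8],
    "survived": [1, 0, 0, 1, 1],
})

# 2. Data cleaning: drop exact duplicate rows and rows with a missing age.
clean = raw.drop_duplicates().dropna(subset=["age"])

# 3. Data transformation: derive a categorical column from the numeric age.
clean = clean.assign(
    age_band=clean["age"].lt(18).map({True: "child", False: "adult"})
)

# 4. Data analysis: average of the 0/1 'survived' flag per age band.
summary = clean.groupby("age_band")["survived"].mean()
print(summary)
```

The same four steps reappear in the tutorial below, only with a real dataset and with plots in place of the final `print`.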

Enough of the theory, let's get our hands dirty with an amazing tutorial. Let's go!

Prerequisite:

What is SingleStore?

SingleStore (formerly known as MemSQL) is a distributed, relational database management system (RDBMS) designed for high-performance, real-time analytics, and massive data ingestion.

What is the SingleStore Notebooks Feature?

Notebooks have become increasingly popular in the data science community as they provide an efficient way to explore, analyze and visualize data, making it easier to communicate insights and results. SingleStore's Notebook feature is based on the popular Jupyter Notebook, which is widely used in data science and machine learning communities.

One interesting fact about SingleStore Notebooks is that they allow users to query SingleStore's distributed SQL database directly from within the notebook interface.

As soon as you sign up, make sure to select the 'Notebooks' tab.

[Image: SingleStore Notebooks]

Create a blank Notebook; we will start from scratch.

[Image: blank notebook]

As soon as you create a Notebook, you will land on this studio/dashboard. Here, we can add our commands and execute them.

[Image: notebooks playground]

Let's Start the Data Analysis🧐

To run the Titanic dataset analysis, you need to install the required libraries (pandas, seaborn, and matplotlib). You can install them using pip:

pip install pandas seaborn matplotlib

Run the above command in a Notebook cell, as shown below.

[Image: install dependencies]

After entering the command, click 'Run selected cell' to execute it.

  • The next step is to import the necessary libraries.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Load the dataset (Seaborn provides built-in datasets, including the Titanic dataset).

data = sns.load_dataset("titanic")
data.head()

Once you execute the above cell, you see the first five rows of the dataset (shown below).

[Image: titanic dataset]
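
Before plotting, it can help to check which columns have gaps, since plots and averages silently skip missing values. The helper below is a minimal sketch; `missing_summary` is a hypothetical name, and the `demo` frame holds made-up values so the snippet runs without downloading anything.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and share of missing values for each column that has any."""
    counts = df.isna().sum()
    result = pd.DataFrame({
        "missing": counts,
        "share": counts / len(df),
    })
    # Keep only columns with at least one missing value, worst first.
    return result[result["missing"] > 0].sort_values("missing", ascending=False)

# Small stand-in frame with made-up values (not the real dataset).
demo = pd.DataFrame({
    "age": [22.0, None, 26.0, None],
    "fare": [7.25, 71.28, 7.92, 8.05],
    "embarked": ["S", "C", None, "S"],
})
print(missing_summary(demo))
```

In the notebook, calling `missing_summary(data)` on the frame loaded above should list the columns with gaps, such as `age` and `deck`.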

  • Let's visualize the distribution of ages of the passengers.
plt.figure(figsize=(10, 6))
sns.histplot(data=data, x='age', kde=True, hue='sex')
plt.title('Age Distribution by Gender')
plt.show()

The output is shown below.

[Image: age distribution]

  • Let's analyze survival rates based on passenger class.
plt.figure(figsize=(10, 6))
sns.barplot(data=data, x='class', y='survived', hue='sex')
plt.title('Survival Rate by Passenger Class and Gender')
plt.show()

The output should look like this:

[Image: titanic data example]
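
By default, the bar heights in a seaborn bar plot are the mean of the `y` column within each group, so the same numbers can be read off a plain groupby. The sketch below assumes a frame with `class`, `sex` and a 0/1 `survived` column; `survival_rate` is a hypothetical helper and the `demo` values are made up.

```python
import pandas as pd

def survival_rate(df, by):
    """Mean of the 0/1 'survived' column for each combination of the `by` columns."""
    return df.groupby(by, observed=True)["survived"].mean()

# Made-up rows that mimic the relevant columns of the Titanic frame.
demo = pd.DataFrame({
    "class": ["First", "First", "Third", "Third", "Third", "Third"],
    "sex": ["female", "male", "female", "male", "male", "female"],
    "survived": [1, 0, 1, 0, 0, 0],
})
rates = survival_rate(demo, ["class", "sex"])

# One row per class, one column per sex, like the grouped bars in the plot.
print(rates.unstack("sex"))
```

Running `survival_rate(data, ["class", "sex"])` in the notebook gives the exact values behind the bars, which are hard to read precisely from the chart.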

  • Let's analyze survival counts based on embarkation port.
plt.figure(figsize=(10, 6))
sns.countplot(data=data, x='embarked', hue='survived')
plt.title('Survival Count based on Embarkation Port')
plt.show()

This will show the survival count based on where passengers boarded the Titanic (C = Cherbourg; Q = Queenstown; S = Southampton).

The output should look like this:

[Image: survival at port]
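
A count plot shows absolute numbers, which depend heavily on how many passengers boarded at each port. To compare ports fairly, the counts can be turned into shares with a row-normalized cross-tabulation. This is a minimal sketch with made-up rows; `survival_share` is a hypothetical helper name.

```python
import pandas as pd

def survival_share(df, column):
    """Share of non-survivors (0) and survivors (1) within each value of `column`."""
    return pd.crosstab(df[column], df["survived"], normalize="index")

# Made-up rows: four boarded at S, two at C, one at Q.
demo = pd.DataFrame({
    "embarked": ["S", "S", "S", "S", "C", "C", "Q"],
    "survived": [0, 0, 0, 1, 1, 0, 1],
})
shares = survival_share(demo, "embarked")
print(shares)
```

Each row sums to 1, so `survival_share(data, "embarked")` lets you compare the ports directly even though far more passengers boarded at one of them.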

  • Let's analyze survival based on fare and class.
plt.figure(figsize=(10, 6))
sns.boxplot(data=data, x='class', y='fare', hue='survived')
plt.ylim(0, 300)  # Limiting y-axis to 300 for better visualization
plt.title('Fare distribution by Passenger Class and Survival')
plt.show()

The output should look like this:

[Image: titanic passenger survival]
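
The center line of each box is the group's median fare, and the `plt.ylim(0, 300)` call hides any fare above 300. The sketch below computes both pieces of information numerically; `fare_overview` is a hypothetical helper and the `demo` fares are made up.

```python
import pandas as pd

def fare_overview(df, y_limit=300):
    """Group size and median fare per class and outcome, plus the number of fares above y_limit."""
    summary = (
        df.groupby(["class", "survived"], observed=True)["fare"]
        .agg(["count", "median"])
    )
    # Rows that fall outside the visible range of the box plot.
    clipped = int((df["fare"] > y_limit).sum())
    return summary, clipped

# Made-up rows, including one fare above the plot limit.
demo = pd.DataFrame({
    "class": ["First", "First", "First", "First", "Third", "Third", "Third"],
    "survived": [1, 1, 1, 0, 0, 0, 1],
    "fare": [80.0, 100.0, 512.0, 50.0, 7.0, 9.0, 8.0],
})
summary, clipped = fare_overview(demo)
print(summary)
print("fares above the y-axis limit:", clipped)
```

Checking `clipped` on the real frame is a quick way to confirm that the axis limit hides only a handful of extreme fares and not a meaningful part of the data.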

  • Let's see the survival count based on family size. We create a new column 'family_size' by adding 'sibsp' and 'parch' (the number of relatives aboard, not counting the passenger).
data['family_size'] = data['sibsp'] + data['parch']
plt.figure(figsize=(10, 6))
sns.countplot(data=data, x='family_size', hue='survived')
plt.title('Survival Count based on Family Size')
plt.show()

The output is shown below.

[Image: family size]
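
With many distinct family sizes, the count plot gets hard to compare, so it can help to bucket the sizes and look at rates. The sketch below is one possible grouping; the bucket boundaries (0, 1-3, 4 or more relatives) and the name `family_bucket_rates` are assumptions for illustration, and the `demo` rows are made up.

```python
import pandas as pd

def family_bucket_rates(df):
    """Survival rate for solo travelers, small families and large families."""
    # Relatives aboard, excluding the passenger, as in the 'family_size' column above.
    family_size = df["sibsp"] + df["parch"]
    buckets = pd.cut(
        family_size,
        bins=[-1, 0, 3, float("inf")],
        labels=["solo", "small (1-3)", "large (4+)"],
    )
    return df.groupby(buckets, observed=True)["survived"].mean()

# Made-up rows covering all three buckets.
demo = pd.DataFrame({
    "sibsp": [0, 0, 1, 1, 2, 4, 5],
    "parch": [0, 0, 0, 2, 1, 2, 2],
    "survived": [0, 1, 1, 1, 0, 0, 0],
})
rates = family_bucket_rates(demo)
print(rates)
```

Calling `family_bucket_rates(data)` in the notebook gives one rate per bucket, which is easier to compare than the raw counts.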

  • Let's analyze survival counts by the number of siblings/spouses aboard.
plt.figure(figsize=(10, 6))
sns.countplot(data=data, x='sibsp', hue='survived')
plt.title('Survival Count based on Number of Siblings/Spouses Aboard')
plt.show()

The output is shown below.

[Image: sibling titanic]

We can do a lot more analysis and understand more about the statistical numbers involved.

Data Analysis Findings

Based on the analysis we've discussed above, here's a summary of findings for the Titanic incident:

  • Gender and Survival: Women had a significantly higher survival rate than men.

  • Passenger Class: First-class passengers had a higher survival rate, indicating socio-economic status played a role in survival chances.

  • Embarkation Port: The survival count varied based on the embarkation port, potentially reflecting the socio-economic distribution of passengers from these ports.

  • Fare Distribution: The majority of passengers paid lower fares, aligning with a larger number of third-class tickets.

  • Fare and Survival: Within each passenger class, there wasn't a consistent pattern to suggest that higher fares directly led to better survival chances.

  • Siblings/Spouses: Those with one sibling or spouse onboard seemed to have a slightly better survival rate than those alone or with many siblings/spouses.

  • Parents/Children: Passengers traveling alone or with one parent/child had higher survival rates compared to larger families.

  • Family Size: Passengers with a small family aboard (1-3 relatives) had better survival outcomes than both solo travelers and larger families.

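  • Titles and Survival: Titles extracted from passenger names, which may indicate social status or profession, showed varied survival rates. The seaborn version of the dataset has no name column, so this finding requires a version that includes names, such as the Kaggle Titanic dataset.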

  • Age Distribution: Younger passengers (children) had a better survival rate, while the elderly had lower survival chances. Middle-aged individuals, especially males, formed the bulk of casualties.
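
The age finding is not backed by a cell above, so here is a minimal sketch of how it could be checked. The age cut-offs (12 and 60) and the name `survival_by_age_group` are assumptions chosen for illustration, and the `demo` rows are made up; passengers with a missing age are left out.

```python
import pandas as pd

def survival_by_age_group(df, bins=(0, 12, 60, 120), labels=("child", "adult", "senior")):
    """Group size and survival rate per age group, ignoring rows without an age."""
    known_age = df.dropna(subset=["age"])
    # Right-inclusive intervals: (0, 12], (12, 60], (60, 120].
    groups = pd.cut(known_age["age"], bins=list(bins), labels=list(labels))
    return known_age.groupby(groups, observed=True)["survived"].agg(["count", "mean"])

# Made-up rows, including one passenger with an unknown age.
demo = pd.DataFrame({
    "age": [4, 10, 30, 45, None, 70, 65],
    "survived": [1, 1, 0, 1, 1, 0, 0],
})
by_age = survival_by_age_group(demo)
print(by_age)
```

The `count` column matters here: a rate computed from only a few passengers in a group should be read with caution.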

The analysis of the Titanic dataset isn't just about understanding a shipwreck; it's about understanding humanity, society, and the interplay of various factors during critical events. Through such datasets, we bridge the past with the present, gaining insights that are both retrospective and forward-looking.

Top comments (1)

Pizofreude

Exactly what I've been looking for! Data Engineering ftw!