Pavan Belagatti

Posted on Oct 10, 2023

Data Analysis of the Titanic with Python!

#python #datascience #dataengineering #database

The sinking of the RMS Titanic remains one of the most tragic maritime disasters in history. While the event itself has been extensively covered in literature and film, delving into its dataset provides a unique perspective, bringing stories of its passengers and crew to life through numbers and patterns.

Using Python, one of the most versatile programming languages, and SingleStore Notebooks, we can uncover layers of insights that narrate tales of hope, despair, survival, and loss. This article aims to guide you through an analytical journey, examining the Titanic dataset with Python. By the end, you won't just understand the data – you'll feel the stories and the emotions intertwined within. But first, let's understand some data analytics basics.

What is Data Analytics?

Data analytics is the systematic approach of examining, cleaning, transforming, and interpreting raw data to discover valuable insights and information. It leverages advanced statistical, mathematical, and computational techniques to identify underlying patterns, trends, and relationships within vast datasets.

By doing so, data analytics provides a foundation for decision-making, allowing organizations to understand their performance, customer behavior, and market dynamics. Furthermore, it plays a pivotal role in optimizing operations, enhancing customer experiences, and predicting future scenarios.

How Does Data Analysis Work?

Data analysis begins with "Data Collection," where raw data is gathered from various sources. Once collected, the next step is "Data Cleaning," where any inconsistencies, errors, or irrelevant data points are removed or corrected to ensure the quality of the data. After cleaning, the data undergoes "Data Transformation," where it is structured or reformatted to be suitable for analysis.

The "Data Analysis" step involves examining the transformed data to extract meaningful insights, patterns, or trends. This analysis is then visually represented in the "Data Visualization" stage, making it easier to interpret and understand.

Finally comes "Insight & Decision Making," where informed decisions are made or strategies are formulated based on the insights gained from the analysis and its visualizations.
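
To make these stages concrete, here is a minimal sketch of the collection, cleaning, transformation, and analysis steps using pandas. The records, column names, and helper functions (`clean`, `transform`, `analyze`) are all made up for illustration; a real pipeline would read from files or a database instead.

```python
import pandas as pd

# Collection: a handful of made-up records standing in for raw data.
raw = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cy", "Dee"],
    "age": ["34", "29", "29", None, "41"],
    "city": [" paris", "Oslo ", "Oslo ", "Rome", "paris"],
})

def clean(df):
    """Cleaning: drop exact duplicate rows and rows with no age."""
    return df.drop_duplicates().dropna(subset=["age"]).reset_index(drop=True)

def transform(df):
    """Transformation: convert age to integers and normalise city names."""
    out = df.copy()
    out["age"] = pd.to_numeric(out["age"]).astype(int)
    out["city"] = out["city"].str.strip().str.title()
    return out

def analyze(df):
    """Analysis: mean age per city."""
    return df.groupby("city")["age"].mean()

summary = analyze(transform(clean(raw)))
print(summary)
```

Keeping each stage in its own small function makes it easy to inspect the intermediate result after every step.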

Enough of the theory. Let's get our hands dirty with a hands-on tutorial. Let's go!

Prerequisite:

What is SingleStore?

SingleStore (formerly known as MemSQL) is a distributed, relational database management system (RDBMS) designed for high-performance, real-time analytics, and massive data ingestion.

What is the SingleStore Notebooks Feature?

Notebooks have become increasingly popular in the data science community as they provide an efficient way to explore, analyze and visualize data, making it easier to communicate insights and results. SingleStore's Notebook feature is based on the popular Jupyter Notebook, which is widely used in data science and machine learning communities.

One interesting fact about SingleStore Notebooks is that they allow users to query SingleStore's distributed SQL database directly from within the notebook interface.

As soon as you sign up, make sure to select the 'Notebooks' tab.

Create a blank Notebook; we will start from scratch.

As soon as you create a Notebook, you will land on this studio/dashboard. Here, we can add our commands and execute them.

Let's Start the Data Analysis🧐

To run the Titanic dataset analysis, you need to install the required libraries (pandas, seaborn, and matplotlib).
You can install them using pip:



pip install pandas seaborn matplotlib

Make sure to run the above command inside the Notebook's dashboard as shown below.

Click on 'Run selected cell' once you place your command to make sure it executes.

The next step is to import the necessary libraries.



import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Load the dataset. Seaborn provides built-in example datasets, including the Titanic dataset; load_dataset downloads it on first use, so the Notebook needs internet access.



data = sns.load_dataset("titanic")
data.head()

Once you execute the above command, data.head() displays the first five rows of the dataset.
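
Beyond `head()`, it helps to check for missing values before plotting, since columns such as `age` and `deck` have gaps in this dataset. The helper below, `missing_summary`, is a hypothetical name introduced here, and the small `sample` frame is made up so the snippet runs on its own; in the Notebook you could call `missing_summary(data)` instead.

```python
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    percent = (counts / len(df) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only the columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Tiny made-up frame reusing a few of the dataset's column names.
sample = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "age": [22.0, None, 26.0, None],
    "embarked": ["S", "C", None, "S"],
})
report = missing_summary(sample)
print(report)
```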

Let's visualize the distribution of ages of the passengers.



plt.figure(figsize=(10, 6))
sns.histplot(data=data, x='age', kde=True, hue='sex')
plt.title('Age Distribution by Gender')
plt.show()

The output is shown below.

Let's analyze survival rates based on passenger class.



plt.figure(figsize=(10, 6))
sns.barplot(data=data, x='class', y='survived', hue='sex')
plt.title('Survival Rate by Passenger Class and Gender')
plt.show()

The output should be as shown below.
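
By default, `sns.barplot` draws the mean of the y variable for each group, so the bar heights in this chart are survival rates. If you want the numbers themselves, a groupby gives the same values as a table. This is a small sketch: `survival_rate` is a helper name made up for this example, and `sample` is invented data; in the Notebook you would pass `data`.

```python
import pandas as pd

def survival_rate(df, by):
    """Mean of the 0/1 'survived' column for each group in `by`."""
    # observed=True skips unused categories when a grouping column is categorical.
    return df.groupby(by, observed=True)["survived"].mean()

# Made-up rows, only to show the shape of the result.
sample = pd.DataFrame({
    "class": ["First", "First", "Third", "Third", "Third", "Third"],
    "sex": ["female", "male", "female", "male", "male", "female"],
    "survived": [1, 0, 1, 0, 0, 0],
})
rates = survival_rate(sample, ["class", "sex"])
print(rates)
```

Calling `survival_rate(data, ["class", "sex"]).unstack()` on the real dataset should lay the result out as a class-by-gender grid.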

Let's analyze survival counts based on embarkation port.



plt.figure(figsize=(10, 6))
sns.countplot(data=data, x='embarked', hue='survived')
plt.title('Survival Count based on Embarkation Port')
plt.show()

This will show the survival count based on where passengers boarded the Titanic (C = Cherbourg; Q = Queenstown; S = Southampton).

The output should be as below.
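
A count plot shows how many passengers survived from each port, but ports with more passengers naturally have taller bars. To compare ports fairly, you can normalise the counts within each port. The sketch below uses `pd.crosstab` with `normalize="index"`; the function name `survival_share_by` and the sample rows are assumptions made for this illustration.

```python
import pandas as pd

def survival_share_by(df, column):
    """Row-normalised crosstab: share of each outcome within every group."""
    return pd.crosstab(df[column], df["survived"], normalize="index")

# Made-up rows; in the Notebook, use survival_share_by(data, "embarked").
sample = pd.DataFrame({
    "embarked": ["S", "S", "S", "S", "C", "C", "Q"],
    "survived": [0, 0, 0, 1, 1, 0, 1],
})
shares = survival_share_by(sample, "embarked")
print(shares)
```

Each row of the result sums to 1, and the column labelled 1 is the survival rate for that port.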

Let's analyze survival based on fare and class.



plt.figure(figsize=(10, 6))
sns.boxplot(data=data, x='class', y='fare', hue='survived')
plt.ylim(0, 300)  # Limiting y-axis to 300 for better visualization
plt.title('Fare distribution by Passenger Class and Survival')
plt.show()

The output should be as shown below.
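
A box plot summarises each group by its quartiles, so the same comparison can be read off as numbers. Here is a hedged sketch that computes the 25th, 50th, and 75th percentiles of fare for every class and survival outcome. `fare_quartiles` is a hypothetical helper, and the fares in `sample` are invented.

```python
import pandas as pd

def fare_quartiles(df):
    """Fare quartiles for every (class, survived) combination."""
    grouped = df.groupby(["class", "survived"], observed=True)["fare"]
    # quantile() with a list returns one row per group and quantile;
    # unstack() moves the quantiles into columns 0.25, 0.5 and 0.75.
    return grouped.quantile([0.25, 0.5, 0.75]).unstack()

# Made-up fares; in the Notebook, call fare_quartiles(data).
sample = pd.DataFrame({
    "class": ["First", "First", "First", "Third", "Third", "Third"],
    "survived": [1, 1, 0, 0, 0, 1],
    "fare": [80.0, 120.0, 50.0, 8.0, 10.0, 9.0],
})
quartiles = fare_quartiles(sample)
print(quartiles)
```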

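Let's see the survival count based on family size. We create a new column 'family_size' by adding 'sibsp' (siblings/spouses aboard) and 'parch' (parents/children aboard), so a value of 0 means the passenger traveled alone.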



data['family_size'] = data['sibsp'] + data['parch']
plt.figure(figsize=(10, 6))
sns.countplot(data=data, x='family_size', hue='survived')
plt.title('Survival Count based on Family Size')
plt.show()

The output is shown below.
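
Large families are rare in the data, so it is worth looking at the survival rate next to the number of passengers in each group before drawing conclusions. The sketch below does that with a named aggregation. The function name `family_survival` and the sample rows are assumptions for illustration; unlike the cell above, it does not add a column to the frame.

```python
import pandas as pd

def family_survival(df):
    """Passenger count and survival rate for each family size (sibsp + parch)."""
    family_size = (df["sibsp"] + df["parch"]).rename("family_size")
    return df.groupby(family_size)["survived"].agg(
        passengers="count", survival_rate="mean"
    )

# Made-up rows; in the Notebook, call family_survival(data).
sample = pd.DataFrame({
    "sibsp": [0, 0, 1, 1, 4, 0],
    "parch": [0, 0, 0, 1, 2, 0],
    "survived": [0, 1, 1, 1, 0, 0],
})
table = family_survival(sample)
print(table)
```

A rate based on a handful of passengers is much less reliable than one based on hundreds, which is why the count column sits beside it.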

Let's analyze survival counts by the number of siblings/spouses aboard.



plt.figure(figsize=(10, 6))
sns.countplot(data=data, x='sibsp', hue='survived')
plt.title('Survival Count based on Number of Siblings/Spouses Aboard')
plt.show()

The output is shown below.

We can do a lot more analysis and understand more about the statistical numbers involved.
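
As one example of going further, here is a sketch that groups passengers into age bands with `pd.cut` and computes the survival rate per band. The band edges are an arbitrary choice made for this example, `survival_by_age_band` is a hypothetical helper, and the sample rows are invented. Passengers with a missing age are left out of the result.

```python
import pandas as pd

def survival_by_age_band(df, edges=(0, 12, 18, 40, 60, 120)):
    """Survival rate per age band; rows with a missing age are excluded."""
    labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]
    # right=False makes each band include its lower edge: [0, 12), [12, 18), ...
    bands = pd.cut(df["age"], bins=list(edges), labels=labels, right=False)
    return df.groupby(bands, observed=True)["survived"].mean()

# Made-up rows; in the Notebook, call survival_by_age_band(data).
sample = pd.DataFrame({
    "age": [4, 10, 30, 35, None, 70],
    "survived": [1, 1, 0, 1, 1, 0],
})
age_rates = survival_by_age_band(sample)
print(age_rates)
```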

Data Analysis Findings

Based on the analysis we've discussed above, here's a summary of findings for the Titanic incident:

Gender and Survival: Women had a significantly higher survival rate than men.
Passenger Class: First-class passengers had a higher survival rate, indicating socio-economic status played a role in survival chances.
Embarkation Port: The survival count varied based on the embarkation port, potentially reflecting the socio-economic distribution of passengers from these ports.
Fare Distribution: The majority of passengers paid lower fares, aligning with a larger number of third-class tickets.
Fare and Survival: Within each passenger class, there wasn't a consistent pattern to suggest that higher fares directly led to better survival chances.
Siblings/Spouses: Those with one sibling or spouse onboard seemed to have a slightly better survival rate than those alone or with many siblings/spouses.
Parents/Children: Passengers traveling alone or with one parent/child had higher survival rates compared to larger families.
Family Size: Passengers traveling with a small family (1-3 relatives aboard) had better survival outcomes than both solo travelers and larger families.
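Titles and Survival: Titles extracted from passenger names, potentially indicating social status or profession, had varied survival rates. This requires a name column, which the seaborn copy of the dataset used above does not include; the Kaggle version does.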
Age Distribution: Younger passengers (children) had a better survival rate, while the elderly had lower survival chances. Middle-aged individuals, especially males, formed the bulk of casualties.

The analysis of the Titanic dataset isn't just about understanding a shipwreck; it's about understanding humanity, society, and the interplay of various factors during critical events. Through such datasets, we bridge the past with the present, gaining insights that are both retrospective and forward-looking.
