6. Visualizing Data with Charts
Our previous quest to unlock the secrets of sorting at Hogwarts is well underway! We've gathered our essential spellbooks (Python libraries) and mended the forgetful pages (filled in missing data). Now, it's time to unleash the true power of data science – the magic of data visualization! 🪄
Imagine Professor Dumbledore himself, his eyes twinkling with wisdom, holding a magical artifact – a shimmering chart. This isn't your ordinary piece of parchment, mind you! It's a canvas upon which raw data is transformed into a breathtaking spectacle, revealing hidden patterns and trends just like a Marauder's Map unveils secret passages.
6.1 Distribution of Students Across Houses
Now that we've filled those forgetful pages in our book, it's time to delve deeper into the fascinating world of Hogwarts houses! Remember how Harry, Ron, and Hermione were sorted into their houses based on their unique talents and personalities? Well, we're about to embark on a similar quest, using a magical tool called Matplotlib to create a visual map of how the Hogwarts students are distributed across their houses. ✨
With a wave of our metaphorical wand (or a line of Python code!), Matplotlib, with a little help from Seaborn, will conjure a magnificent bar chart. Think of it like a giant sorting hat, but instead of a tear on its brim, this hat boasts colorful bars that reach for the ceiling. Each bar represents a Hogwarts house: Gryffindor, Ravenclaw, Hufflepuff, and Slytherin. 🪄
# Importing visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
# Setting the aesthetic style for our plots
sns.set(style="whitegrid")
# Visualizing the distribution of students across houses
plt.figure(figsize=(15, 10))
sns.countplot(x='house', data=hogwarts_df, hue='house', legend=False)
plt.title('Distribution of Students Across Houses')
plt.xlabel('House')
plt.ylabel('Students')
plt.show()
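A count plot draws one bar per category, and each bar's height is simply the number of rows holding that value. As a minimal sketch, assuming a tiny made-up DataFrame (sample_df and its six rows are illustrative, not taken from the real dataset), the same numbers can be read directly with pandas:

```python
import pandas as pd

# Hypothetical sample rows for illustration only (not the real dataset)
sample_df = pd.DataFrame({
    "house": ["Gryffindor", "Slytherin", "Gryffindor",
              "Ravenclaw", "Hufflepuff", "Gryffindor"]
})

# value_counts() tallies how many rows fall into each category,
# which is the height of each bar in a count plot
house_counts = sample_df["house"].value_counts()
print(house_counts)
```

Running the same call on hogwarts_df['house'] is a quick way to double-check what the chart shows.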
6.2 Distribution of Students Across Houses (With a Twist)
But this isn't just any ordinary painting. We're going to use the magic of data to bring our picture to life. With a flick of our wand (or a click of a mouse), we'll transform cold numbers into a vibrant tapestry that tells a tale as enchanting as any fairy tale. This time, let's add a twist to the spell: we'll write the exact student count on top of each bar, so the chart becomes more informative. 💫
# Importing visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
# Setting the aesthetic style for our plots
sns.set(style="whitegrid")
# Visualizing the distribution of students across houses
plt.figure(figsize=(15, 10))
ax = sns.countplot(x='house', data=hogwarts_df, hue='house', legend=False)
# Adding numerical information on top of each bar
for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}',
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='bottom',
                fontsize=12, color='black',
                xytext=(0, 5),  # Offset the text slightly above the bar
                textcoords='offset points')
plt.title('Distribution of Students Across Houses')
plt.xlabel('House')
plt.ylabel('Students')
plt.show()
6.3 Visualizing Age Distribution
But what if we want to see how the students' ages are spread out? Fear not, for we have another spell, the Histogram. This spell raises one tower for each age, as tall as the number of students of that age, and the smooth curve drawn over the towers (the KDE) traces the overall shape of the distribution. It's like the rival houses competing for the tallest tower. ⚔️
# Visualizing the age distribution
plt.figure(figsize=(10, 6))
sns.histplot(hogwarts_df['age'], kde=True, color='blue')
plt.title('Age Distribution of Hogwarts Students')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
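The histogram above pools every student together. To compare boys and girls age by age, one option is a grouped count plot, where hue splits each age into one bar per gender. This is only a sketch on made-up rows (sample_df and its values are assumptions for illustration); with the real data you would pass hogwarts_df instead.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical sample rows for illustration only (not the real dataset)
sample_df = pd.DataFrame({
    "age":    [11, 11, 11, 12, 12, 13, 13, 13],
    "gender": ["Male", "Female", "Female", "Male", "Male",
               "Female", "Male", "Female"],
})

# The numbers behind the chart: students per age, split by gender
age_by_gender = pd.crosstab(sample_df["age"], sample_df["gender"])
print(age_by_gender)

# One group of side-by-side bars per age, one bar per gender
plt.figure(figsize=(10, 6))
sns.countplot(x="age", hue="gender", data=sample_df)
plt.title("Students per Age, Split by Gender")
plt.xlabel("Age")
plt.ylabel("Students")
plt.show()
```

Printing the cross-tabulation first makes it easy to confirm that the bars match the underlying counts.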
6.4 Visualizing Relationships Between Features
Next, we weave a more intricate spell, exploring the relationships between different features in our dataset. For instance, does a student’s heritage influence their choice of pet, or is there a connection between a student’s age and the type of wand they use? This step is like exploring the Forbidden Forest, uncovering the connections and mysteries that lie within. We begin with one such pairing: a student's house and their choice of pet.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Path to your dataset
dataset_path = 'data/hogwarts-students-02.csv'
# Reading the dataset
hogwarts_df = pd.read_csv(dataset_path)
# Plotting the number of students in each house, split by pet type
plt.figure(figsize=(10, 5))
sns.countplot(x='house', hue='pet', data=hogwarts_df, palette='viridis')
# Add data labels (student counts) on top of each bar
for container in plt.gca().containers:
    plt.bar_label(container)
plt.title('Relationship between "House" and "Choice of Pet"')
plt.xlabel('House')
plt.ylabel('Number of Students')
plt.legend(title='Pet Type')
plt.show()
Through this visualization, we might discover that one house has a penchant for owls while another prefers cats. Swapping x='house' for x='blood_status' would let us ask the same question about Muggle-born and Pure-blood students. These insights are akin to understanding the habits of magical creatures, revealing the subtle nuances that define the Hogwarts community.
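To put numbers on a question like "does blood status go hand in hand with pet choice?", a cross-tabulation is a handy companion to the chart. The sketch below uses a few made-up rows (sample_df is hypothetical, not the real dataset). Passing normalize='index' turns each row into shares that sum to 1, so groups of different sizes can be compared fairly.

```python
import pandas as pd

# Hypothetical sample rows for illustration only (not the real dataset)
sample_df = pd.DataFrame({
    "blood_status": ["Muggle-born", "Muggle-born", "Pure-blood",
                     "Pure-blood", "Half-blood", "Half-blood"],
    "pet": ["Owl", "Owl", "Cat", "Owl", "Owl", "Cat"],
})

# Raw counts: one row per blood status, one column per pet
pet_counts = pd.crosstab(sample_df["blood_status"], sample_df["pet"])
print(pet_counts)

# Row-wise shares: what fraction of each group keeps each pet
pet_shares = pd.crosstab(sample_df["blood_status"], sample_df["pet"],
                         normalize="index")
print(pet_shares)
```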
6.5 Summarizing the Data
The describe() method summarizes the whole dataset in a single table. It provides key statistics such as the mean, median, and standard deviation of numerical columns, and unique counts and modes for categorical columns. For instance, we might find that the most common house is Gryffindor, or that the average age of students is about 15 years.
summary = hogwarts_df.describe(include='all')
print(summary)
        Unnamed: 0          name  gender        age   origin  specialty  \
count    52.000000            52      52  52.000000       52         52
unique         NaN            52       2        NaN        9         24
top            NaN  Harry Potter    Male        NaN  England     Charms
freq           NaN             1      27        NaN       35          7
mean     25.500000           NaN     NaN  14.942308      NaN        NaN
std      15.154757           NaN     NaN   2.492447      NaN        NaN
min       0.000000           NaN     NaN  11.000000      NaN        NaN
25%      12.750000           NaN     NaN  13.250000      NaN        NaN
50%      25.500000           NaN     NaN  16.000000      NaN        NaN
75%      38.250000           NaN     NaN  17.000000      NaN        NaN
max      51.000000           NaN     NaN  18.000000      NaN        NaN

             house  blood_status  pet  wand_type       patronus  \
count           52            52   52         52             52
unique           6             4    9         28             15
top     Gryffindor    Half-blood  Owl        Ash  Non-corporeal
freq            18            25   36          4             36
mean           NaN           NaN  NaN        NaN            NaN
std            NaN           NaN  NaN        NaN            NaN
min            NaN           NaN  NaN        NaN            NaN
25%            NaN           NaN  NaN        NaN            NaN
50%            NaN           NaN  NaN        NaN            NaN
75%            NaN           NaN  NaN        NaN            NaN
max            NaN           NaN  NaN        NaN            NaN

        quidditch_position  boggart  favorite_class  house_points
count                   52       52              52     52.000000
unique                   5       11              21           NaN
top                 Seeker  Failure          Charms           NaN
freq                    47       40               9           NaN
mean                   NaN      NaN             NaN    119.200000
std                    NaN      NaN             NaN     53.057128
min                    NaN      NaN             NaN     10.000000
25%                    NaN      NaN             NaN     77.500000
50%                    NaN      NaN             NaN    119.600000
75%                    NaN      NaN             NaN    160.000000
max                    NaN      NaN             NaN    200.000000
6.5.1 Summary of the Results
- Count: The number of non-null values in each column.
- Unique: The number of unique values in each column.
- Top: The most frequent value in each column.
- Freq: The number of times the most frequent value appears.
- Mean: The arithmetic mean of the values in each column.
- Std: The standard deviation of the values in each column.
- Min: The minimum value in each column.
- 25%: The 25th percentile (lower quartile) of the values in each column.
- 50%: The 50th percentile (median) of the values in each column.
- 75%: The 75th percentile (upper quartile) of the values in each column.
- Max: The maximum value in each column.
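To make these rows concrete, here is a dependency-free sketch that recomputes them by hand for two tiny made-up columns (the ages and houses lists are assumptions for illustration, not the real data). It uses the sample standard deviation and linearly interpolated quartiles, which are the defaults pandas applies in describe().

```python
from collections import Counter
from statistics import mean, quantiles, stdev

# Hypothetical columns for illustration only
ages = [11, 12, 13, 14, 15, 16, 17, 18]
houses = ["Gryffindor", "Slytherin", "Gryffindor", "Ravenclaw",
          "Hufflepuff", "Gryffindor", "Slytherin", "Ravenclaw"]

# Numerical column: count, mean, std, min, quartiles, max
q1, median, q3 = quantiles(ages, n=4, method="inclusive")
numeric_summary = {
    "count": len(ages),
    "mean": mean(ages),
    "std": stdev(ages),   # sample standard deviation (divides by n - 1)
    "min": min(ages),
    "25%": q1,
    "50%": median,
    "75%": q3,
    "max": max(ages),
}

# Categorical column: count, unique, top (the mode), freq (its count)
top, freq = Counter(houses).most_common(1)[0]
categorical_summary = {
    "count": len(houses),
    "unique": len(set(houses)),
    "top": top,
    "freq": freq,
}

print(numeric_summary)
print(categorical_summary)
```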
6.5.2 Key Observations
- Age: The mean age is 14.942308, with a standard deviation of 2.492447. The age range is from 11 to 18.
- Gender: There are only two unique values: Male and Female.
- Origin: There are nine unique values, with England being the most frequent.
- Specialty: There are 24 unique values, with Charms being the most frequent.
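- House: There are six unique values, with Gryffindor being the most frequent. That is two more than the four Hogwarts houses, so the column holds some additional labels worth inspecting.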
- Blood Status: There are four unique values, with Half-blood being the most frequent.
- Pet: There are nine unique values, with Owl being the most frequent.
- Wand Type: There are 28 unique values, with Ash being the most frequent.
- Patronus: There are 15 unique values, with Non-corporeal being the most frequent.
- Quidditch Position: There are five unique values, with Seeker being the most frequent.
- Boggart: There are 11 unique values, with Failure being the most frequent.
- Favorite Class: There are 21 unique values, with Charms being the most frequent.
- House Points: The mean is 119.200000, with a standard deviation of 53.057128. The range is from 10 to 200.
6.5.3 Insights
- Age Distribution: Ages span the full range from 11 to 18, and the middle half of the students are between about 13 and 17 years old.
- Gender: The dataset is nearly balanced, with slightly more males (27 of 52).
- Specialty and House: Charms is the most common specialty, but with only 7 of 52 students spread across 24 specialties it is far from dominant. Gryffindor is the largest house, with 18 of 52 students.
- Blood Status: Half-blood is the largest group (25 of 52), though it falls just short of a majority.
- Pet and Wand Type: Owls are clearly the favorite pet (36 of 52). Wands are far more varied: Ash is the most common of 28 types, yet only 4 students use it.
- Patronus: Most students (36 of 52) have a Non-corporeal patronus.
- Quidditch Position: Seeker is listed for 47 of 52 students, a share so high that the column deserves a closer look.
- Boggart and Favorite Class: Failure is by far the most common boggart (40 of 52), while Charms leads the favorite classes with 9 of 52 students.
- House Points: The mean and range of house points suggest that students in this dataset have varying levels of achievement and participation.
These insights can help you better understand the characteristics of the students in the Hogwarts dataset.
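Before reading too much into a "most frequent" value, it helps to divide freq by count and see how large the share really is. The figures below are copied from the describe() output above; only the variable names are illustrative.

```python
# (top value, freq) pairs copied from the describe() output
total_students = 52
top_values = {
    "gender": ("Male", 27),
    "house": ("Gryffindor", 18),
    "blood_status": ("Half-blood", 25),
    "pet": ("Owl", 36),
    "wand_type": ("Ash", 4),
    "quidditch_position": ("Seeker", 47),
}

# Share of all students that the most frequent value accounts for
shares = {column: freq / total_students
          for column, (_, freq) in top_values.items()}

for column, (top, freq) in top_values.items():
    print(f"{column:<20}{top:<12}{freq:>3} / {total_students} = {shares[column]:.0%}")
```

A top value that covers barely half the students (or far less) is a much weaker "typical" value than one that covers nearly all of them.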
6.6 Correlation Matrix
Finally, we perform statistical analysis to quantify relationships and trends within our data. This step is akin to Snape carefully measuring potion ingredients to ensure the perfect brew. The correlation matrix and its visualization show us how different features relate to each other. For example, we might find a strong correlation between age and year at Hogwarts, as expected. Understanding these relationships helps us build more accurate models and make informed predictions.
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Loading the dataset
dataset_path = 'data/hogwarts-students-02.csv' # Path to our dataset
hogwarts_df = pd.read_csv(dataset_path)
# Displaying the first few rows to understand the structure of the dataset
print(hogwarts_df.head())
# Checking the data types of each column to identify numerical and categorical data
print(hogwarts_df.dtypes)
# Selecting only numerical columns for correlation matrix
numerical_df = hogwarts_df.select_dtypes(include=[np.number])
# Calculating the correlation matrix using only numerical data
correlation_matrix = numerical_df.corr()
print(correlation_matrix)
# Visualizing the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix of Hogwarts Student Features')
plt.show()
               name  gender  age   origin                      specialty  \
0      Harry Potter    Male   11  England  Defense Against the Dark Arts
1  Hermione Granger  Female   11  England                Transfiguration
2       Ron Weasley    Male   11  England                          Chess
3      Draco Malfoy    Male   11  England                        Potions
4     Luna Lovegood  Female   11  Ireland                      Creatures

        house  blood_status  pet  wand_type              patronus  \
0  Gryffindor    Half-blood  Owl      Holly                  Stag
1  Gryffindor   Muggle-born  Cat       Vine                 Otter
2  Gryffindor    Pure-blood  Rat        Ash  Jack Russell Terrier
3   Slytherin    Pure-blood  Owl   Hawthorn         Non-corporeal
4   Ravenclaw    Half-blood  Owl        Fir                  Hare

   quidditch_position         boggart                 favorite_class  \
0              Seeker        Dementor  Defense Against the Dark Arts
1              Seeker         Failure                     Arithmancy
2              Keeper          Spider                         Charms
3              Seeker  Lord Voldemort                        Potions
4              Seeker      Her mother                      Creatures

   house_points
0         150.0
1         200.0
2          50.0
3         100.0
4         120.0
name                   object
gender                 object
age                     int64
origin                 object
specialty              object
house                  object
blood_status           object
pet                    object
wand_type              object
patronus               object
quidditch_position     object
boggart                object
favorite_class         object
house_points          float64
dtype: object
                   age  house_points
age           1.000000      0.315227
house_points  0.315227      1.000000
The correlation analysis results show the correlation coefficient between the age and house_points columns in the dataset. Here is a breakdown of what these results imply.
6.6.1 Correlation Coefficients Interpretation
A correlation coefficient is like a magical measuring tape, helping us understand how closely two things are linked. It's a number between -1 and 1, and the closer it is to either end, the stronger the connection. Think of it as a magical spell that reveals hidden relationships!
                   age  house_points
age           1.000000      0.315227
house_points  0.315227      1.000000
A positive correlation is like a friendship charm; as one thing increases, so does the other. For instance, if height and weight have a strong positive correlation, taller students tend to weigh more. On the other hand, a negative correlation is like a mischievous pixie's prank; as one thing increases, the other decreases. If hours of sleep and tiredness have a strong negative correlation, those who sleep more tend to be less tired.
6.6.2 Correlation Value Analysis
The correlation coefficient between age and house_points is 0.315227. This value indicates a positive correlation between the two variables. In general, correlation coefficients range from -1 to 1:
- 1 indicates a perfect positive correlation.
- 0 indicates no correlation.
- -1 indicates a perfect negative correlation.
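Under the hood, DataFrame.corr() computes the Pearson coefficient by default: the covariance of two columns divided by the product of their standard deviations. Here is a minimal pure-Python sketch of that formula, using made-up numbers (the pearson_r helper and the lists below are illustrative, not from the dataset):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (covariance, up to a constant factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances, up to the same factor)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Hypothetical values for illustration only
ages = [11, 12, 13, 14, 15]
points_rising = [50, 60, 70, 80, 90]    # climbs in lockstep with age
points_falling = [90, 80, 70, 60, 50]   # drops in lockstep with age
points_mixed = [70, 50, 90, 60, 80]     # only loosely follows age

print(pearson_r(ages, points_rising))
print(pearson_r(ages, points_falling))
print(pearson_r(ages, points_mixed))
```

The first two lists land on the two ends of the scale, while the third falls in between, much like the age and house_points columns.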
6.6.3 Strength of the Correlation
A correlation of 0.315 suggests a weak to moderate positive correlation. This means that as the age of the students increases, their house points tend to increase as well, but the relationship is not very strong.
6.6.4 Implications
- Age and Performance: The positive correlation may imply that older students tend to accumulate more house points. This could be due to increased experience, maturity, or participation in activities that earn house points.
- Further Investigation Needed: While there is a correlation, it does not imply causation. Other factors could be influencing both age and house points, such as the year of study, involvement in extracurricular activities, or differences in house dynamics.
- Potential Analysis: Further analysis could involve looking at other variables (like specialty or house) to see if they mediate or moderate the relationship between age and house points.
6.6.5 Correlation Coefficients Summary
In summary, the correlation analysis indicates a weak to moderate positive relationship between age and house points among Hogwarts students. While older students may tend to earn more points, further analysis is necessary to understand the underlying factors contributing to this correlation.
But beware, young wizard! Correlation doesn't always equal causation. Just because two things are linked doesn't mean one causes the other. It's like finding a lost sock and a lucky penny on the same day; they might be connected, but it doesn't mean one caused the other. 🪄✨
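A small simulation can make this warning tangible. In the sketch below, everything is invented for illustration (the variable names, the formulas, and the noise levels are all assumptions): a hidden driver, the school year, pushes up both the number of spells a student knows and their shoe size. Neither quantity is computed from the other, yet across all students the coefficient should come out strongly positive. Once the hidden driver is held fixed by looking at fourth-years only, it should hover near zero.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

random.seed(7)  # fixed seed so the run is repeatable

# Hidden driver: school year (1 to 7) for 500 simulated students
years = [random.randint(1, 7) for _ in range(500)]

# Both quantities depend on the year plus their own independent noise;
# neither one is computed from the other
spells_known = [10 * y + random.gauss(0, 5) for y in years]
shoe_size = [30 + 2 * y + random.gauss(0, 2) for y in years]

r_all = pearson_r(spells_known, shoe_size)

# Hold the hidden driver fixed: look only at fourth-years
fourth = [(s, z) for s, z, y in zip(spells_known, shoe_size, years) if y == 4]
r_fourth = pearson_r([s for s, _ in fourth], [z for _, z in fourth])

print(f"all students: r = {r_all:.2f}")
print(f"fourth-years: r = {r_fourth:.2f}")
```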
6.7 Gemika's Pop-Up Quiz: Visualizing Data with Charts 🧙‍♂️🪄
And now, dear reader, my son Gemika Haziq Nugroho appears with a twinkle in his eye and a quiz in hand. Are you ready to test your knowledge and prove your mastery of data exploration?
- Which magical Python libraries did we use to create our visualizations?
- Which summary statistic tells you how many times the most frequent value appears?
- What can be inferred from the "Blood Status" insight?
Answer these questions with confidence, and you will demonstrate your prowess in the art of data exploration. With our dataset now fully explored and understood, we are ready to embark on the next phase of our magical journey. Onward to deeper discoveries and greater insights! 🌟✨🧙‍♂️