Basic Data Analysis on the Iris Flower Dataset (HNG 11)

#datascience #intern #analyst

This task was part of my data analysis internship with HNG11. Every intern must complete it in stage zero to proceed to the next stage. The task was relatively simple: I only had to review one dataset from a list of given options. The objectives were to identify initial insights from the dataset at first glance and to discover patterns, trends, or anomalies.

I chose the Iris flower dataset and performed basic exploratory data analysis using Python and its data libraries in a notebook file. At first glance, the file containing the dataset (.data) has 150 rows of 5 comma-separated values each (5 columns). The first four values are numerical variables, while the last is a categorical variable that could immediately be identified as the label of the dataset. However, the file itself contained no description of any of the variables.

Accompanying the data file was a text file giving a clearer description of the variables it contains. With this information, I imported the data into the notebook and read it into a DataFrame object using the pandas library, assigning appropriate names to the columns. In order, the columns are ‘sepal length (cm),’ ‘sepal width (cm),’ ‘petal length (cm),’ ‘petal width (cm),’ and ‘class.’
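
A minimal sketch of that loading step, assuming the file has no header row. The helper name `load_iris_frame` and the file name `iris.data` are my own placeholders, and the three embedded rows only stand in for the full 150-row file so the snippet runs on its own.

```python
import io

import pandas as pd

# Column names as listed above, in file order.
COLUMNS = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
    "class",
]

# Three rows in the same comma-delimited layout as the data file
# (a stand-in for the real file, not the whole dataset).
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""


def load_iris_frame(source):
    """Read a headerless, comma-delimited Iris file into a DataFrame."""
    return pd.read_csv(source, header=None, names=COLUMNS)


# With the real file this would be something like load_iris_frame("iris.data").
df = load_iris_frame(io.StringIO(SAMPLE))
print(df.shape)
print(df.dtypes)
```

Passing `header=None` matters here: without it, pandas would treat the first flower as the header row and silently drop it from the data.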

Using appropriate methods in pandas, I found the means of the numerical variables ‘sepal length (cm),’ ‘sepal width (cm),’ ‘petal length (cm),’ and ‘petal width (cm)’ to be 5.84, 3.05, 3.76 and 1.20 respectively (to 2 d.p.). I also observed that the categorical variable ‘class’ had only three unique values, one for each kind of Iris flower: ‘Iris-setosa,’ ‘Iris-versicolor’ and ‘Iris-virginica.’ All of this information was also pointed out in the text description file. Another observation was that each of the three classes appears the same number of times in the dataset, which means there were 50 Iris-setosa flowers, 50 Iris-versicolor flowers and 50 Iris-virginica flowers.
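
The summary steps above can be sketched roughly as follows. This is one way to do it in pandas, not necessarily the exact calls used in the notebook. The nine embedded rows are a small stand-in for the full file, so the printed means will not match the figures quoted above.

```python
import io

import pandas as pd

COLUMNS = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
    "class",
]

# Nine rows (three per class) standing in for the 150-row file.
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
"""

df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=COLUMNS)

# Mean of each numerical column, rounded to 2 decimal places.
numeric_means = df.drop(columns="class").mean().round(2)

# Number of distinct labels and how often each one occurs.
unique_classes = df["class"].nunique()
class_counts = df["class"].value_counts()

print(numeric_means)
print(unique_classes)
print(class_counts)
```

On the full dataset, `value_counts()` is what would show the balanced 50/50/50 split.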

With the aid of plotting tools, it was clear that a linear relationship exists between petal width and petal length, as well as between petal length and sepal length. The Iris-virginica flowers had the longest petals and sepals, and the Iris-setosa flowers had the shortest. This can be seen in the graph below.
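
As a numeric complement to the plot, the same relationships can be checked with Pearson correlation and per-class means. This is a sketch of my own rather than the notebook's code, and it runs on a nine-row stand-in sample, so the exact values are only indicative.

```python
import io

import pandas as pd

COLUMNS = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
    "class",
]

# Nine rows (three per class) standing in for the 150-row file.
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
"""

df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=COLUMNS)

# Pairwise Pearson correlation between the numerical columns.
corr = df.drop(columns="class").corr()
petal_corr = corr.loc["petal length (cm)", "petal width (cm)"]
petal_sepal_corr = corr.loc["petal length (cm)", "sepal length (cm)"]

# Average petal length per class, to compare the three kinds of flower.
petal_length_by_class = df.groupby("class")["petal length (cm)"].mean()

print(round(petal_corr, 2), round(petal_sepal_corr, 2))
print(petal_length_by_class)
```

A correlation close to 1 is consistent with the roughly straight-line pattern in a scatter plot, although it does not replace looking at the plot itself.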

The sepal and petal measurements of the flowers are clearly related to their class. The graph also suggests that petal length and petal width have a stronger influence in determining the flower class than sepal width does. This could be useful when making inferences about a new dataset that has no labels.
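
One rough way to put a number on that impression is to compare how far apart the class means are for each feature, relative to the feature's overall spread. The "separation" score below is a heuristic of my own for illustration, not a standard statistic and not something taken from the notebook. It is computed on a nine-row stand-in sample.

```python
import io

import pandas as pd

COLUMNS = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
    "class",
]

# Nine rows (three per class) standing in for the 150-row file.
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
"""

df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=COLUMNS)

features = df.drop(columns="class")
class_means = df.groupby("class")[list(features.columns)].mean()

# Gap between the highest and lowest class mean, in units of the
# feature's overall standard deviation (hypothetical heuristic).
separation = (class_means.max() - class_means.min()) / features.std()

print(separation.sort_values(ascending=False))
```

A larger score means the classes sit further apart on that feature, which is the sense in which the petal measurements appear more informative than sepal width.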
