DEV Community

Jahnavi-Jonnalagadda

Exploratory Data Analysis – A Key Step in Machine Learning

This blog is part of the MSP Developer Stories initiative by the Microsoft Student Partners (India) Program (https://studentpartners.microsoft.com/), which aims to help student communities Learn, Lead and Empower.

The goal of this post is to emphasize the role of Exploratory Data Analysis in solving business problems with Machine Learning and Artificial Intelligence, using a detailed case study walkthrough.

A 360° data mindset : In this information-driven age, the extraordinary volume of data now available (historic, current and predictive) calls for a 360° view, so that the right data can be extracted to make better business decisions.

Exploratory Data Analysis (EDA) is an observational approach to understanding the characteristics of the data. EDA is essential for a well-defined and structured data science project, and it should be performed before any machine learning modelling phase. It helps in identifying patterns and developing hypotheses.

Case Study : A medium-sized bikes & cycling accessories manufacturing consultancy is keen on growing its business. We’ll help them analyze their customer and transaction data to optimize their marketing strategy.

Preliminary Data Exploration – Identify ways to improve the quality of data

Environment and Code Readiness

  • Create a Jupyter Notebook hosted on Azure
  • Import the pandas package to read and write Excel data
  • Import matplotlib & seaborn for data visualization
  • Upload the customer data into the Azure Notebook path
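
As a minimal sketch of the setup steps above, the snippet below loads every sheet of an Excel workbook into pandas. The file name `customer_data.xlsx` and the helper `load_all_sheets` are hypothetical and not taken from the original notebook. In the notebook itself, `import matplotlib.pyplot as plt` and `import seaborn as sns` would sit alongside the pandas import.

```python
import pandas as pd


def load_all_sheets(path):
    """Read every sheet of an Excel workbook into a dict of DataFrames."""
    # sheet_name=None asks pandas for all sheets, keyed by sheet name.
    return pd.read_excel(path, sheet_name=None)


# Hypothetical usage once the workbook has been uploaded to the notebook path:
# sheets = load_all_sheets("customer_data.xlsx")
# for name, frame in sheets.items():
#     print(name, frame.shape)
```

Reading `.xlsx` files with pandas needs an Excel engine such as openpyxl to be installed in the notebook environment.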

Let’s organize the analysis that follows by data quality dimension, in a table

[Image: table of data quality dimensions]

Identify Missing Values
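
One way to identify missing values, sketched here on a small made-up table, is to count nulls per column and express them as a percentage of rows. The column names and the helper `missing_value_report` are assumptions for illustration, not the original notebook's code.

```python
import pandas as pd


def missing_value_report(df):
    """Count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    return report.sort_values("missing_count", ascending=False)


# Hypothetical sample standing in for the customer data.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "last_name": ["Rao", None, "Iyer", None],
    "job_title": ["Engineer", "Analyst", None, "Manager"],
})

report = missing_value_report(customers)
print(report)
```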

A column can be dropped if it has no relevance to the analysis
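
A short sketch of dropping an irrelevant column with pandas follows. The column name `default` and its junk contents are hypothetical; which column is irrelevant depends on the actual data.

```python
import pandas as pd

# Hypothetical sample: "default" stands in for a column with no analytical value.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "default": ["%$#", "1/0", "<>?"],
    "gender": ["Female", "Male", "Female"],
})

irrelevant_columns = ["default"]

# drop() returns a new DataFrame, so the raw data stays available for reference.
cleaned = customers.drop(columns=irrelevant_columns)
print(cleaned.columns.tolist())
```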

Gender data should be consistent: each value should be either Male or Female
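
As an illustration, inconsistent gender labels can be normalized with a lookup table. The variant spellings below ("F", "Femal", "U") are assumed examples, and the helper `normalize_gender` is hypothetical. Values that cannot be mapped become missing so they can be reviewed rather than guessed.

```python
import pandas as pd

# Assumed variants; extend this after inspecting the real column's unique values.
GENDER_MAP = {
    "f": "Female", "femal": "Female", "female": "Female",
    "m": "Male", "male": "Male",
}


def normalize_gender(series):
    """Map known spellings to Male/Female; anything else becomes NaN."""
    cleaned = series.str.strip().str.lower()
    return cleaned.map(GENDER_MAP)


raw = pd.Series(["F", "Female", "Femal", "M", " male ", "U", None])
gender = normalize_gender(raw)
print(gender.value_counts(dropna=False))
```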

Check the validity of the Transactions data : the product first sold date is stored as a float and should be converted into datetime format
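
The conversion could look like the sketch below. It assumes the float values are Excel serial day numbers, which is common for dates exported from Excel but is not confirmed by the post, and the column name `product_first_sold_date` is likewise an assumption.

```python
import pandas as pd

# Hypothetical sample of float-encoded dates (assumed to be Excel serial days).
transactions = pd.DataFrame({
    "product_first_sold_date": [43831.0, 43832.0, 41275.0],
})

# Excel counts days from 1899-12-30 for dates after February 1900.
transactions["product_first_sold_date"] = pd.to_datetime(
    transactions["product_first_sold_date"],
    unit="D",
    origin="1899-12-30",
)
print(transactions.dtypes)
```

If the floats turn out to be something else, for example Unix timestamps, the `unit` and `origin` arguments would need to change accordingly.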

Repeat the same checks for the other data sets

Here is the Data Quality Analysis Summary

[Image: Data Quality Analysis Summary]

Data Exploration, Model Development and Interpretation : Understanding the data distributions, feature engineering, data transformations, modelling, results interpretation and reporting.

Customer Age & Gender Distribution : Female customers outnumber male customers; the recommended age range for targeting new customers is 30 to 60 years

Calculate the age of each customer from their date of birth before plotting the graph
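
A possible way to derive age from date of birth is sketched below. The reference date, the sample birth dates, the age bands and the helper `age_in_years` are all assumptions for illustration; the original notebook may compute age differently.

```python
import pandas as pd


def age_in_years(dob, as_of):
    """Completed years between each date of birth and the reference date."""
    dob = pd.to_datetime(dob)
    # True where the birthday has already occurred in the reference year.
    had_birthday = (dob.dt.month < as_of.month) | (
        (dob.dt.month == as_of.month) & (dob.dt.day <= as_of.day)
    )
    return as_of.year - dob.dt.year - (~had_birthday).astype(int)


dob = pd.Series(["1980-06-15", "1990-12-31", "2000-01-01"])
ages = age_in_years(dob, pd.Timestamp("2020-06-15"))

# Group ages into bands; right=False makes each band include its lower edge.
bands = pd.cut(ages, bins=[0, 30, 60, 120], right=False,
               labels=["<30", "30-59", "60+"])
print(pd.DataFrame({"age": ages, "band": bands}))
```

The resulting `ages` or `bands` series can then be passed to a histogram or count plot.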

Within the wealth segment, Mass Customers are the largest group
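
A count per category is usually enough to check a finding like this. The sketch below uses made-up rows; the column name `wealth_segment`, the segment labels other than Mass Customer, and the helper `category_share` are hypothetical.

```python
import pandas as pd


def category_share(df, column):
    """Customer count and percentage share for each value of a column."""
    counts = df[column].value_counts()  # sorted from largest to smallest
    share = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"customers": counts, "share_pct": share})


# Hypothetical sample standing in for the customer data.
customers = pd.DataFrame({
    "wealth_segment": [
        "Mass Customer", "Affluent Customer", "Mass Customer",
        "High Net Worth", "Mass Customer", "Affluent Customer",
    ],
})

summary = category_share(customers, "wealth_segment")
print(summary)
```

Calling `summary["customers"].plot(kind="bar")` would draw the bar chart, assuming matplotlib is installed. The same helper can be pointed at other categorical columns, such as industry or car ownership.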

New customers come mainly from the Manufacturing & Finance industries

Customer cars owned data

Visualizations & Interactive Dashboard : These help us highlight key findings and convey ideas more succinctly. The dashboards below were built in Power BI Desktop. A walkthrough of building dashboards in Power BI is out of scope for this blog.

[Images: Power BI dashboards]

Conclusion : Exploratory Data Analysis is a key process in Machine Learning / Data Science projects. The main pillars of EDA are data cleaning, data preparation, data exploration, and data visualization.