INTRODUCTION.
Just like everything in this world, data has its imperfections. Raw data is usually skewed and may have outliers or too many missing values. A model built on such data performs sub-optimally. In a hurry to get to the machine learning stage, some data professionals either skip the exploratory data analysis process entirely or do a mediocre job of it. This is a mistake with many implications, which include generating inaccurate models, generating accurate models but on the wrong data, not creating the right types of variables in data preparation, and using resources inefficiently.
What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is the process of describing data using statistical and visualization techniques to bring its important aspects into focus for further analysis. This involves inspecting the dataset from many angles, and describing and summarizing it without making any assumptions about its contents.
Why is exploratory data analysis important in data science?
The main purpose of EDA is to help look at data before making any assumptions. It can help identify obvious errors, as well as better understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.
Data scientists can use exploratory analysis to ensure the results they produce are valid and applicable to the desired business outcomes and goals. EDA also helps stakeholders by confirming they are asking the right questions. It can help answer questions about standard deviations, categorical variables, and confidence intervals. Once EDA is complete and insights are drawn, its findings can then be used for more sophisticated data analysis or modeling, including machine learning.
Exploratory data analysis techniques
Specific statistical functions and techniques you can perform with EDA tools include:
- Clustering and dimension reduction techniques, which help create graphical displays of high-dimensional data containing many variables.
- Univariate visualization of each field in the raw dataset, with summary statistics.
- Bivariate visualizations and summary statistics that allow you to assess the relationship between each variable in the dataset and the target variable you’re looking at.
- Multivariate visualizations, for mapping and understanding interactions between different fields in the data.
- K-means clustering, an unsupervised learning method in which data points are assigned to K groups (K being the number of clusters) based on their distance from each group’s centroid. The data points closest to a particular centroid are clustered under the same category. K-means clustering is commonly used in market segmentation, pattern recognition, and image compression.
- Predictive models, such as linear regression, use statistics and data to predict outcomes.
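The K-means description above can be illustrated with a minimal sketch in plain Python. The `kmeans` function, the `customers` data, and the choice to seed the centroids with the first K points are all assumptions made for illustration; production code would normally rely on a library implementation with random or smarter seeding.

```python
import math

def kmeans(points, k, iterations=10):
    """Assign each point to the nearest of k centroids, then recompute centroids."""
    # Assumption: the first k points seed the centroids; real code usually
    # picks seeds at random or with a smarter scheme.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its closest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical customers: (annual spend in thousands, visits per month).
customers = [(1.0, 2.0), (9.0, 8.0), (1.5, 1.8), (8.5, 9.0), (1.2, 2.2), (9.2, 8.4)]
labels, centroids = kmeans(customers, k=2)
print(labels)     # cluster index for each customer
print(centroids)  # final centre of each cluster
```

With data this cleanly separated, the low-spend and high-spend customers end up in different clusters, which is the kind of grouping market segmentation looks for.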
Types of exploratory data analysis
There are four primary types of EDA:
- Univariate non-graphical: This is the simplest form of data analysis, where the data being analyzed consists of just one variable. Since it’s a single variable, it doesn’t deal with causes or relationships. The main purpose of univariate analysis is to describe the data and find patterns that exist within it.
- Univariate graphical: Non-graphical methods don’t provide a full picture of the data, so graphical methods are also needed. Common types of univariate graphics include: stem-and-leaf plots, which show all data values and the shape of the distribution; histograms, which are bar plots in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values; and box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile, and maximum.
- Multivariate non-graphical: Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques generally show the relationship between two or more variables of the data through cross-tabulation or statistics.
- Multivariate graphical: Multivariate graphical EDA uses graphics to display relationships between two or more sets of data. The most used graphic is a grouped bar plot or bar chart, with each group representing one level of one of the variables and each bar within a group representing the levels of the other variable. Other common types of multivariate graphics include: scatter plots, which plot data points on a horizontal and a vertical axis to show how much one variable is affected by another; multivariate charts, which are graphical representations of the relationships between factors and a response; run charts, which are line graphs of data plotted over time; bubble charts, which display multiple circles (bubbles) in a two-dimensional plot; and heat maps, which are graphical representations of data where values are depicted by color.
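Two of the non-graphical techniques named above, the five-number summary and cross-tabulation, can be sketched with the Python standard library. The `records` data and both helper names are hypothetical, and the quartiles use the "inclusive" interpolation method, which is only one of several conventions.

```python
import statistics
from collections import Counter

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum), the numbers a box plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def cross_tabulate(rows, row_key, col_key):
    """Count how often each (row_key, col_key) pair of categories occurs."""
    return Counter((row[row_key], row[col_key]) for row in rows)

# Hypothetical customer records.
records = [
    {"age": 23, "plan": "basic", "churned": "yes"},
    {"age": 35, "plan": "premium", "churned": "no"},
    {"age": 31, "plan": "basic", "churned": "no"},
    {"age": 52, "plan": "premium", "churned": "no"},
    {"age": 46, "plan": "basic", "churned": "yes"},
    {"age": 28, "plan": "premium", "churned": "yes"},
    {"age": 40, "plan": "basic", "churned": "no"},
]

# Univariate non-graphical: describe a single variable.
summary = five_number_summary([r["age"] for r in records])
print(summary)

# Multivariate non-graphical: relate two categorical variables.
table = cross_tabulate(records, "plan", "churned")
for (plan, churned), count in sorted(table.items()):
    print(f"plan={plan:8} churned={churned:4} count={count}")
```

The summary tuple is exactly what a box plot would draw, and the cross-tabulation is the table behind a grouped bar chart.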
Exploratory Data Analysis Tools
The most common data science tools used to create an EDA include:
- Python: An interpreted, object-oriented programming language with dynamic semantics. Its high-level, built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for rapid application development, as well as for use as a scripting or glue language to connect existing components. Python and EDA can be used together to identify missing values in a data set, which is important so you can decide how to handle missing values for machine learning.
- R: An open-source programming language and free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. The R language is widely used by statisticians and data scientists for developing statistical observations and performing data analysis.
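As a small illustration of using Python to identify missing values, the sketch below counts them per column with only the standard library. The CSV content, the `MISSING_MARKERS` set, and the `count_missing` helper are assumptions for this example; which strings count as "missing" depends on the dataset.

```python
import csv
import io

# Hypothetical CSV extract; empty fields and "NA" stand in for missing values.
RAW = """id,age,city,income
1,34,Nairobi,52000
2,,Mombasa,NA
3,41,,61000
4,29,Kisumu,
"""

MISSING_MARKERS = {"", "NA", "NaN", "null"}  # assumption: adjust per dataset

def count_missing(rows, columns):
    """Return {column: number of rows whose value is a missing marker}."""
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            if (row[col] or "").strip() in MISSING_MARKERS:
                counts[col] += 1
    return counts

reader = csv.DictReader(io.StringIO(RAW))
rows = list(reader)
missing = count_missing(rows, reader.fieldnames)
for col, n in missing.items():
    print(f"{col}: {n} missing ({n / len(rows):.0%})")
```

A per-column count like this is usually the first input to the decision on whether to drop a column, drop rows, or impute values before modeling.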
DATA VISUALIZATION.
What is data visualization?
Data visualization is the practice of translating information into a visual context, such as a map or graph, to make data easier for the human brain to understand and pull insights from. The main goal of data visualization is to make it easier to identify patterns, trends, and outliers in large data sets. Data visualization is one of the steps of the data science process, which states that after data has been collected, processed, and modeled, it must be visualized for conclusions to be made. Data visualization is also an element of the broader data presentation architecture (DPA) discipline, which aims to identify, locate, manipulate, format, and deliver data in the most efficient way possible.
Why is data visualization important?
- Data visualization provides a quick and effective way to communicate information in a universal manner using visual information.
- It also helps businesses identify which factors affect customer behavior; pinpoint areas that need to be improved or need more attention; make data more memorable for stakeholders; understand when and where to place specific products; and predict sales volumes.
- It also helps people absorb information quickly, improve insights, and make faster decisions.
- Increases understanding of the next steps that must be taken to improve the organization.
- Improves ability to maintain the audience's interest with information they can understand.
- Ensures easy distribution of information that increases the opportunity to share insights with everyone involved.
- Reduces reliance on data scientists for routine questions, since data is more accessible and understandable.
- Increases the ability to act on findings quickly and achieve success with greater speed and fewer mistakes.
Data Visualization Techniques:
- Charts: line charts, pie charts, column charts, bar charts, pictogram charts, histograms, waterfall charts, etc., often built with charting libraries such as FusionCharts or Highcharts.
- Plots: line plots, bar plots, box-and-whisker plots, scatter plots, bubble plots, violin plots, distribution plots, etc.
- Maps: heat maps, treemaps, choropleth maps, cartograms, etc.
- Diagrams and matrices: correlation matrices, network diagrams, word clouds, bullet graphs, highlight tables, timelines, etc.
Exploring data using visualization techniques.
For exploratory data analysis, several visualization tools and techniques are in use:
Charts for:
- Comparison: comparing variables and values in a dataset.
- Distributions: checking the distribution of variables in a dataset.
- Proportions: checking what share of the whole each category or value makes up.
Plots for:
- Trends: viewing how variables in a dataset change over time and where they may be heading.
- Relationships: viewing the correlations between different variables in a dataset.
- Outliers: checking for values that fall outside the expected range.
Maps for:
- Patterns: identifying special and regular patterns in the dataset variables.
- Structures: identifying the hierarchy of data and the composition of different variables in a dataset.
- Intensity: identifying how extreme the values of a variable are.
- Density: identifying where values are concentrated in a dataset.
Diagrams and matrices for:
- Connections: showing entity relations between variables in a dataset.
- Summaries: showcasing summaries of the data, which helps identify key performance indicators and quick insights.
- Comparison: using keys (legends) to identify differences and compare variables in a dataset.
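The outlier check mentioned under plots is what a box plot does visually, and it can also be sketched numerically. The example below uses the common 1.5 × IQR rule, which is an assumption here rather than something the text above prescribes; the `orders` data and the `iqr_outliers` helper are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical daily order counts with one suspicious entry.
orders = [12, 15, 14, 13, 16, 15, 14, 95, 13, 15]
outliers = iqr_outliers(orders)
print(outliers)
```

Values flagged this way are candidates for a closer look, not automatic deletions: an outlier may be a data-entry error or a genuinely unusual event.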
Steps to explore data using visualization techniques:
1. Apply data cleaning and transformation.
Before you create any visualization, you need to make sure that your data is accurate, consistent, and ready for analysis. Data cleaning and transformation are the steps of preparing and modifying your data, such as removing errors, missing values, or duplicates, standardizing formats, merging or splitting variables, or creating new features.
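The cleaning steps described above (standardizing formats, handling missing values, removing duplicates) might look like the following minimal sketch. The `raw_rows` records, the two date formats, and the decision to drop rows with a missing `spend` are all assumptions for illustration; imputing the missing value would be an equally valid choice.

```python
from datetime import datetime

# Hypothetical raw records with a duplicate, a missing value, and mixed date formats.
raw_rows = [
    {"customer": " Alice ", "signup": "2023-01-15", "spend": "120.50"},
    {"customer": "bob", "signup": "15/02/2023", "spend": "80"},
    {"customer": "Alice", "signup": "2023-01-15", "spend": "120.50"},
    {"customer": "Carol", "signup": "2023-03-01", "spend": ""},
]

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumption: the only formats present

def parse_date(text):
    """Try each known format and return an ISO date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")

def clean(rows):
    """Standardize formats, drop rows with missing spend, and remove duplicates."""
    seen, cleaned = set(), []
    for row in rows:
        if not row["spend"].strip():
            continue  # missing value: dropped here, though imputing is another option
        record = (
            row["customer"].strip().title(),
            parse_date(row["signup"]),
            float(row["spend"]),
        )
        if record not in seen:  # duplicates only match after standardizing
            seen.add(record)
            cleaned.append(record)
    return cleaned

cleaned_rows = clean(raw_rows)
for record in cleaned_rows:
    print(record)
```

Note the order of operations: the two "Alice" rows only compare equal after whitespace is stripped, so standardizing before deduplicating matters.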
2. Use multiple and interactive visualizations.
Sometimes, one visualization is not enough to explore and confirm your data, especially if you have complex or multidimensional data. You may need to use multiple visualizations to show different aspects, perspectives, or levels of detail of your data.
3. Evaluate and refine your visualizations.
After you create your visualizations, you need to evaluate and refine them to ensure that they are clear, accurate, and relevant. You can use various criteria and methods to assess your visualizations, such as the purpose, audience, message, design, data quality, ethics, and feedback. You can also use visualization tools or libraries, such as Tableau, Power BI, or Matplotlib, to edit and improve your visualizations.
4. Communicate and share your visualizations.
Finally, you need to communicate and share your visualizations with your intended audience, whether that is your colleagues, clients, or the public. You can use different formats and platforms to present and distribute your visualizations, such as reports, dashboards, slides, blogs, or social media. You should also consider the context, tone, and style of your communication, as well as the feedback and response of your audience. You should aim to tell a compelling and trustworthy story with your visualizations that can inform, persuade, or inspire your audience.
CONCLUSION
It’s easy to collect data, and some people become preoccupied with simply accumulating more complex data or data in mass quantities. But more data is not inherently better and often serves to confuse the situation. Just because something can be measured doesn’t mean it should be. Finding the smallest amount of data that can still convey something meaningful about the contents of the data set is important. EDA and data visualization depend on each other, and expertise in this field comes from knowing which tools to use for a given domain.