Exploratory Data Analysis (EDA) is the process of analyzing and investigating a data set to discover patterns, characteristics, trends, anomalies, and relationships. This critical process relies on data visualization methods to do so.
The process involves data cleaning, data exploration, feature engineering, and data visualization.
A variable is a characteristic that can be measured and that can assume different values, for example height, age, income, or province.
Common EDA techniques include:
- Missing values treatment - This involves identifying and treating missing and null values in a dataset. The approach is either to delete the affected rows or columns or to fill the gaps with substitute values, such as the mean or median.
- Outlier Treatment - This involves handling extreme values, that is, values that lie far above or below the rest of the data. Outliers can distort results. They are often removed when they are likely to be the result of an error.
- Variable Transformation - Variables are transformed to improve their normality, linearity, and stability. This applies functions that change the state or form of a variable so that the data becomes usable. The variables are either numerical or categorical.
- Feature Engineering - This involves creating new features based on existing ones. It involves identifying and extracting features from a dataset.
- Correlation Analysis - This involves discovering relationships between variables and measuring their strength. The result guides how each relationship is treated in later analysis.
EDA is also grouped by the number of variables examined at a time:
- Univariate EDA - involves looking at a single variable at a time.
- Bivariate EDA - involves looking at two variables at a time.
- Multivariate EDA - involves looking at three or more variables at a time.
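The techniques above can be sketched with pandas. This is a minimal illustration only: the toy data, the column names, the choice of median filling, the 1.5 * IQR rule for outliers, and the log transform are all assumptions made for the example, not methods prescribed by this article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
df = pd.DataFrame({
    "age": [23, 35, np.nan, 41, 29, 38],
    "income": [32000, 54000, 61000, np.nan, 45000, 900000],
    "province": ["A", "B", "B", None, "A", "B"],
})

# Missing values treatment: fill numeric gaps with the column median,
# then delete rows whose categorical value is missing.
for col in ["age", "income"]:
    df[col] = df[col].fillna(df[col].median())
df = df.dropna(subset=["province"]).reset_index(drop=True)

# Outlier treatment: keep incomes within 1.5 * IQR of the quartiles
# (one common rule of thumb, assumed here).
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = df[in_range].copy()

# Variable transformation: log-transform the income column.
clean["log_income"] = np.log1p(clean["income"])

# Feature engineering: derive a new feature from existing ones.
clean["income_per_year_of_age"] = clean["income"] / clean["age"]

# Correlation analysis (a bivariate view): Pearson correlation matrix.
corr = clean[["age", "income"]].corr()
print(corr)
```

Whether an outlier should be dropped, capped, or kept depends on whether it is an error or a genuine observation, so the filtering step is a judgment call rather than a fixed rule.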
Data visualization is the representation of data in graphical form. It involves the use of charts, graphs, plots, infographics, animations, and many other visual techniques.
Data visualization helps us discover trends, features, data point patterns, and outlying business parameters.
- Charts - line charts, pie charts, column charts, bar charts, pictogram charts, histograms, waterfall charts, etc.
- Plots - line plots, bar plots, box and whisker plots, scatter plots, bubble plots, violin plots, distribution plots, etc.
- Maps - heat maps, treemaps, choropleth maps, cartograms, etc.
- Diagrams and Matrices - correlation matrices, network diagrams, word clouds, bullet graphs, highlight tables, timelines, etc.
These techniques are implemented with various tools and technologies. The choice of tool depends on the domain and on the purpose of the visualization. Tableau is one example.
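As a rough illustration, a few of the techniques listed above can be drawn with matplotlib, which is one possible tool among many. The data here is synthetic and the variable names (`height`, `weight`) are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data, generated only for illustration.
rng = np.random.default_rng(seed=0)
height = rng.normal(loc=170, scale=10, size=200)
weight = 0.5 * height + rng.normal(loc=0, scale=5, size=200)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Histogram: distribution of a single variable.
axes[0].hist(height, bins=20)
axes[0].set_title("Histogram")
axes[0].set_xlabel("height")

# Box and whisker plot: spread and possible outliers of a single variable.
axes[1].boxplot(height)
axes[1].set_title("Box and whisker plot")
axes[1].set_ylabel("height")

# Scatter plot: relationship between two variables.
axes[2].scatter(height, weight, s=10)
axes[2].set_title("Scatter plot")
axes[2].set_xlabel("height")
axes[2].set_ylabel("weight")

fig.tight_layout()
# To keep the figure, call fig.savefig("some_file.png") with a path of your choice.
```

The first two panels are univariate views and the third is bivariate, so the same figure also shows how the choice of plot follows from the number of variables being examined.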
Let's now explore our data. We mostly use ...
Charts - for
- Comparison - comparing variables and values in a dataset.
- Distributions - checking how the values of a variable are distributed.
- Proportions - checking the share that each category contributes to the whole.
Plots - for
- Trends - viewing how variables behave over time and where they appear to be heading.
- Relationships - viewing the correlations between different variables in a dataset.
- Outliers - checking for values that fall above or below the expected range.
Maps - for
- Patterns - identifying special and regular patterns in the dataset variables.
- Structures - identifying the hierarchy of the data and the composition of different variables in a dataset.
- Intensity - identifying how extreme the values of a variable are.
- Density - identifying where values are concentrated in a dataset.
Diagrams and Matrices - for
- Connections - showing the relations between entities and variables in a dataset.
- Summaries - showing summaries of the data, which helps identify key performance indicators and gives quick insights.
- Comparison - using keys to identify differences and compare variables in a dataset.
A typical workflow looks like this:
- Understand the data - know whether your data is numerical, categorical, or time-based. This prepares you to transform the data into the appropriate data type and range of values.
- Identify the problem or question - know the purpose and expectations of your data, and the idea and hypothesis behind the EDA.
- Choose the most appropriate visualization techniques - once you understand the data, you can identify the best techniques to use. The choice depends on whether the data is numerical, categorical, time-based, or geographical.
- Visualize the data - use appropriate tools such as matplotlib, Tableau, seaborn, or plotly.
- Interpret the data - look for patterns, features, trends, outliers, correlations, and relationships. If expectations are unclear or errors are spotted at this point, go back and refine the data. The feedback you gather determines whether another iteration is needed.
- Communicate the findings - present and describe the insights gained. Use visuals and reports to communicate them.
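The first steps of this workflow can be sketched in pandas. The dataset below is invented, and `suggest_plot` is a hypothetical helper written for this example, not part of any library. Its mapping from data type to plot is only a starting point, and it does not attempt to detect geographical data.

```python
import pandas as pd

# Hypothetical dataset, invented for illustration.
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "sales": [120.0, 135.5, 128.0],
    "region": ["north", "south", "north"],
})

# Understand the data: inspect column types and basic statistics.
print(df.dtypes)
print(df.describe())


def suggest_plot(series: pd.Series) -> str:
    """Hypothetical helper: suggest a starting-point plot from a column's type."""
    if pd.api.types.is_datetime64_any_dtype(series):
        return "line plot"   # time-based data: trends over time
    if pd.api.types.is_numeric_dtype(series):
        return "histogram"   # numerical data: distribution
    return "bar chart"       # categorical data: comparison of counts


# Choose a visualization technique for each column.
suggestions = {col: suggest_plot(df[col]) for col in df.columns}
print(suggestions)
```

In practice the question being asked matters as much as the data type, so a suggestion like this would be revised during the interpretation step.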
Different exploratory data analysis methods require different data visualization techniques, and the choice must take the domain and purpose into account.
EDA involves various processes to prepare and craft the datasets used by models. If EDA fails or is poorly carried out, the data visualization techniques that rely on it also fail to reveal patterns and trends in the datasets. The two processes depend on each other. Expertise in this field comes from knowing which tools suit a given domain.