Exploratory Data Analysis
Exploratory Data Analysis (EDA) involves analyzing data using statistics and graphs to gain insight. It is used to spot anomalies, identify patterns, establish possible relationships between variables, and form hypotheses based on statistical methods.
Importance;
The aim is to understand the data. While exploring it, we have to make sure the data is clean and free of redundancy and of missing or null values, so that we can draw conclusions from the insights gathered while interpreting the data.
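As a minimal sketch of such a cleanliness check, the snippet below counts missing values per column and drops exact duplicate rows using only the Python standard library. The records, the column names, and the helper functions `is_missing`, `missing_counts`, and `drop_duplicates` are hypothetical and chosen purely for illustration; treating empty strings as missing is an assumption that may not suit every data set.

```python
# Hypothetical records; None and "" stand in for missing values.
rows = [
    {"id": 1, "age": 34, "city": "Nairobi"},
    {"id": 2, "age": None, "city": "Mombasa"},
    {"id": 3, "age": 29, "city": ""},
    {"id": 1, "age": 34, "city": "Nairobi"},  # exact duplicate of the first row
]


def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing (an assumption)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def missing_counts(records):
    """Count missing values per column."""
    counts = {}
    for record in records:
        for column, value in record.items():
            counts[column] = counts.get(column, 0) + (1 if is_missing(value) else 0)
    return counts


def drop_duplicates(records):
    """Keep the first occurrence of each identical row, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


missing = missing_counts(rows)
deduplicated = drop_duplicates(rows)
print("Missing values per column:", missing)
print(len(rows) - len(deduplicated), "duplicate row(s) removed")
```

Whether missing values should then be dropped or filled in depends on the data and the question being asked, so the sketch only reports them.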
Goals;
EDA is a crucial step before making any data-based prediction. It involves spotting errors and establishing trends and relationships so that the results obtained are valid and applicable, visualizing the data through charts or graphs to present information accurately, and finally applying statistical analysis to the data. Its goals include:
-Finding the distribution of variables in a data set
-Generating a good model to ensure no data quality problems
-Obtaining accurate data estimates
-Forecasting the potential errors in the data estimates
-Making statistical conclusions
-Eliminating anomalies and extra values from the data
-Preparing the dataset for analysis
-Enhancing machine learning ability to predict the dataset effectively
-Providing more precise outcomes
-Selecting a more effective machine learning model
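One of the goals above, spotting anomalies, can be sketched with the interquartile range (IQR) rule. This is only one common heuristic, not a method prescribed here; the sample values, the function name `iqr_outliers`, and the conventional multiplier of 1.5 are assumptions for illustration.

```python
import statistics

# Hypothetical measurements with one suspicious value at the end.
values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]


def iqr_outliers(data, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _median, q3 = statistics.quantiles(data, n=4)  # three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


outliers = iqr_outliers(values)
print("Flagged as possible anomalies:", outliers)
```

A flagged value is a candidate for inspection rather than automatic removal, since an extreme value may be a genuine observation.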
Steps
-Know the problem and questions to answer
-Understand the dataset
-Define the data
-Choose the type of descriptive statistic
-Visualize the data
-Analyze the possible interactions between the variables of the dataset
-Draw conclusions from the analysis
Types of Exploratory Data Analysis
Univariate;
The data has only one variable. This method is used to describe the data, estimate the population distribution, and find any existing patterns.
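A minimal univariate sketch, assuming a single hypothetical numeric variable `ages`: it computes common descriptive statistics with the standard `statistics` module and prints a rough text histogram of the value frequencies.

```python
import statistics
from collections import Counter

# Hypothetical single variable.
ages = [23, 25, 25, 29, 31, 31, 31, 38, 42, 55]

summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "mode": statistics.mode(ages),
    "stdev": statistics.stdev(ages),  # sample standard deviation
    "min": min(ages),
    "max": max(ages),
}
for name, value in summary.items():
    print(f"{name:>6}: {value}")

# Frequency of each distinct value, shown as a simple text histogram.
frequencies = Counter(ages)
for value, count in sorted(frequencies.items()):
    print(f"{value:>3} | {'#' * count}")
```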
Bivariate;
Explores the relationship between two variables using cross-tabulation or statistics.
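Both approaches can be sketched briefly: a cross-tabulation for two categorical variables, and a Pearson correlation coefficient as one example of a bivariate statistic for two numeric variables. All data and names here (`plan`, `churned`, `hours_studied`, `exam_scores`, `pearson`) are hypothetical.

```python
import math
from collections import Counter

# Cross-tabulation: count how often each (plan, churned) pair occurs.
customers = [
    ("basic", "yes"), ("basic", "no"), ("premium", "no"),
    ("basic", "yes"), ("premium", "no"), ("premium", "yes"),
]
crosstab = Counter(customers)
for (plan, churned), count in sorted(crosstab.items()):
    print(f"plan={plan:<8} churned={churned:<4} count={count}")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


hours_studied = [1, 2, 3, 4, 5]
exam_scores = [52, 55, 61, 70, 72]
r = pearson(hours_studied, exam_scores)
print(f"Pearson r = {r:.3f}")
```

A coefficient close to 1 or -1 suggests a strong linear relationship, but it does not by itself establish that one variable causes the other.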
Multivariate;
Examines the relationships among more than two variables at once, often displayed using a bar plot or a bar chart.
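One simple way to look at several variables together is a pairwise correlation matrix. The sketch below builds one for three hypothetical numeric variables; it substitutes a numeric summary for the chart mentioned above, and the variable names and the `pearson` helper are assumptions for illustration.

```python
import math

# Hypothetical data set with three numeric variables.
data = {
    "hours_studied": [1, 2, 3, 4, 5],
    "hours_slept": [8, 7, 7, 6, 5],
    "exam_score": [52, 55, 61, 70, 72],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Correlation of every variable with every other variable.
matrix = {a: {b: pearson(data[a], data[b]) for b in data} for a in data}

for a, row in matrix.items():
    cells = "  ".join(f"{value:+.2f}" for value in row.values())
    print(f"{a:<14} {cells}")
```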
Exploratory Data Analysis Tools
Python.
A general-purpose programming language widely used as a glue language to connect existing components. In EDA it is commonly used to identify missing values in a data set.
Matplotlib.
A Python-based library that enables the creation of explanatory graphs from highly complex data.
R.
An open-source programming language for statistical computing and graphics, widely used for statistical analysis.