Exploratory data analysis (EDA) is the process by which we try to understand the data we want to analyze. For instance, we might want to know the size of our data, the data types, whether there are any outliers or anomalies, and what relationships exist between the variables.
We do this using a number of techniques.
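As a first look, the size and data types mentioned above can be checked in a few lines. This is a minimal sketch that assumes the data is loaded into a pandas DataFrame; the toy dataset and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy dataset; the column names are invented for this example.
df = pd.DataFrame(
    {
        "age": [23, 35, 41, 29, 52],
        "income": [32000.0, 58000.0, 61000.0, None, 87000.0],
        "city": ["Nairobi", "Mombasa", "Nairobi", "Kisumu", "Nairobi"],
    }
)

n_rows, n_cols = df.shape   # size of the data
column_types = df.dtypes    # data type of each column

print(f"{n_rows} rows x {n_cols} columns")
print(column_types)
print(df.head(3))           # first few records
```

In practice the DataFrame would usually come from a file or database rather than being typed in by hand.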
A basic first step is to compute the statistical properties of our data. These include measures of central tendency such as the mean and median, as well as measures of dispersion such as the variance and interquartile range. They come in handy when we try to identify outliers and anomalies and impute missing values. They also inform our decision to transform some variables if the scales of the selected features are too different, especially if we are using models that are sensitive to feature scale, such as distance-based models like k-nearest neighbours or regression models with regularization.
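The statistics described above could be computed as follows. This is a sketch that assumes pandas; the two columns are hypothetical and were chosen so that their scales differ by several orders of magnitude. Standardization (z-scores) is shown as one possible transformation, not the only one.

```python
import pandas as pd

# Hypothetical data: two features on very different scales.
df = pd.DataFrame(
    {
        "height_m": [1.60, 1.72, 1.68, 1.81, 1.75],
        "income": [32000, 58000, 61000, 45000, 250000],
    }
)

summary = df.describe()     # count, mean, std, min, quartiles, max
means = df.mean()           # central tendency
medians = df.median()
variances = df.var()        # sample variance (ddof=1)
iqr = df.quantile(0.75) - df.quantile(0.25)   # interquartile range

# One common transformation: rescale each column to mean 0 and std 1.
standardized = (df - df.mean()) / df.std()

print(summary)
print(iqr)
```

Here the mean income is well above the median because of the single large value, which is the kind of gap that hints at an outlier or a skewed distribution.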
As stated earlier, we are interested in finding out whether our data has any missing values or anomalies. This matters because we may want to use the data to train predictive models, and anomalies can heavily skew the data, leading to incorrect predictions. From this exploration we can decide whether to fill in the missing values with a suitable replacement, drop some variables from our dataset, or identify the cause of the anomalies and refine our data collection methods.
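A rough sketch of these checks, again assuming pandas and a made-up dataset. The outlier test uses the common 1.5 × IQR rule and the imputation uses the column median; both are illustrative choices, and a different rule or replacement value may suit your data better.

```python
import pandas as pd

# Hypothetical data with one missing value per column and one extreme income.
df = pd.DataFrame(
    {
        "age": [23, 35, None, 29, 52, 31],
        "income": [32000, 58000, 61000, None, 250000, 45000],
    }
)

# Count the missing values in each column.
missing_per_column = df.isna().sum()

# Flag incomes outside 1.5 * IQR of the quartiles (missing values are never flagged).
q1 = df["income"].quantile(0.25)
q3 = df["income"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)
outliers = df.loc[is_outlier, "income"]

# One possible replacement: fill each gap with that column's median.
imputed = df.fillna(df.median())

print(missing_per_column)
print(outliers)
print(imputed)
```

The median is used here rather than the mean because it is less affected by the extreme income value.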
We could also use graphical representations such as bar graphs and heatmaps to show relationships between variables, such as correlation. This again helps with feature selection for our training data, and it makes properties like outliers and skewness easy to see.
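The numbers behind a correlation heatmap can be computed before anything is drawn. The sketch below assumes pandas and a hypothetical dataset; it produces the correlation matrix and a skewness value per column, and leaves the actual plotting to whichever library you prefer.

```python
import pandas as pd

# Hypothetical data; the column names are invented for this example.
df = pd.DataFrame(
    {
        "hours_studied": [1, 2, 3, 4, 5, 6],
        "exam_score": [52, 55, 61, 68, 70, 79],
        "hours_slept": [9, 8, 8, 7, 6, 5],
    }
)

corr = df.corr()        # pairwise Pearson correlation (the default method)
skewness = df.skew()    # sample skewness of each column

print(corr.round(2))
print(skewness.round(2))
```

A matrix like `corr` is what a heatmap colours cell by cell, for example by passing it to seaborn's `heatmap` function if seaborn is installed.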