In this article, we will explore fundamental techniques of exploratory data analysis (EDA) on a dataset.
Statistical study was once largely limited to inference, but John Tukey proposed a new scientific discipline called data analysis that included statistical inference as just one component.
With the ready availability of computing power and expressive data analysis software, exploratory data analysis has evolved well beyond its original scope.
Data comes from many sources: sensor measurements, events, text, images, and videos.
The Internet of Things (IoT) is spewing out streams of information. Much of this data is unstructured: images are collections of pixels, with each pixel containing RGB (red, green, blue) color information.
Texts are sequences of words and nonword characters, often organized by sections, subsections, and so on.
To apply statistical concepts, unstructured raw data has to be converted into structured data.
There are mainly two types of structured data:
- Numeric Type
  - Continuous: Data that can take on any value in an interval.
  - Discrete: Data that can take on only integer values, such as counts.
- Categorical Type
  - Binary Data (Special Case): A special case of categorical data with just two categories of values, e.g., 0/1, true/false.
  - Ordinal Data: Categorical data that has an explicit ordering. (Synonym: ordered factor.)
The typical frame of reference for analysis in data science is a rectangular data object, like a spreadsheet or database table.
Rectangular data is the general term for a two-dimensional matrix with rows indicating records (cases) and columns indicating features (variables).
The data frame is the specific format in R and Python.
Key Terms for Rectangular Data
Feature: A column within a table. Synonyms: attribute, predictor, variable.
Record: A row within a data frame. Synonyms: case, example, instance, observation.
Below is a typical data frame object read with the pandas library in Python.
Dataset: Wine Quality by UCI
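As a sketch of how such a frame might be loaded (assuming the semicolon-separated red-wine CSV from the UCI repository is saved locally as `winequality-red.csv`; the small stand-in frame below only illustrates the column layout, with made-up rows):

```python
import pandas as pd

# With the file on disk, the real dataset could be read like this:
# data = pd.read_csv('winequality-red.csv', sep=';')

# Illustrative stand-in with the same rectangular shape: rows are records, columns are features
data = pd.DataFrame({
    'fixed acidity': [7.4, 7.8, 7.8],
    'volatile acidity': [0.70, 0.88, 0.76],
    'quality': [5, 5, 5],
})
print(data.head())
print(data.shape)  # (number of records, number of features)
```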
Non-Rectangular Data Structure
There are data structures other than rectangular data.
Time series data records successive measurements of the same variable. It is the raw material for statistical forecasting methods, and it is also a key component of the data produced by devices—the Internet of Things.
Spatial data structures, which are used in mapping and location analytics, are more complex and varied than rectangular data structures.
Graph (or network) data structures are used to represent physical, social, and abstract relationships.
Variables with measured or count data (Numerical) might have thousands of distinct values.
A basic step in exploring your data is getting a “typical value” for each feature (variable): an estimate of where most of the data is located (i.e., its central tendency).
At first glance, summarizing data might seem fairly trivial: just take the mean of the data. In fact, while the mean is easy to compute and expedient to use, it may not always be the best measure for a central value.
The most basic estimate of location is the mean or average value. The mean is the sum of all values divided by the number of values.
N (or n) refers to the total number of records or observations. In statistics, it is capitalized if it is referring to a population, and lowercase if it refers to a sample from a population.
A variation of the mean is a trimmed mean, which you calculate by dropping a fixed number of sorted values at each end and then taking an average of the remaining values.
An advantage of using a trimmed mean is that it removes the influence of extreme values. It is more robust than the regular mean.
Another type of mean is a weighted mean, which you calculate by multiplying each data value by a user-specified weight and dividing their sum by the sum of the weights.
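The weighted mean can be sketched directly from its definition (the values and weights here are illustrative, not from the wine dataset):

```python
import numpy as np

values = np.array([4.0, 7.0, 10.0])
weights = np.array([1.0, 2.0, 1.0])  # user-specified weights

# Multiply each value by its weight, sum, and divide by the total weight
weighted_mean = np.sum(values * weights) / np.sum(weights)
print(weighted_mean)  # 7.0

# numpy provides the same calculation directly:
print(np.average(values, weights=weights))  # 7.0
```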
- Median and Robust Measures
The median is the middle number on a sorted list of the data. If there is an even number of data values, the middle value is one that is not actually in the data set, but rather the average of the two values that divide the sorted data into upper and lower halves.
Compared to the mean, the median depends only on the values in the center of the sorted data, which makes it more robust. In many use cases, the median is a better measure of central tendency.
The median is referred to as a robust estimate of location since it is not influenced by outliers (extreme cases) that could skew the results.
An outlier is any value that is very distant from the other values in a data set.
In fact, a trimmed mean is widely used to avoid the influence of outliers. For example, trimming the bottom and top 10% (a common choice) of the data will provide protection against outliers in all but the smallest data sets.
```python
from scipy.stats import trim_mean

# Mean, trimmed mean, and median of the feature: fixed acidity of wine
print('Mean of Fixed Acidity of Wine:', data['fixed acidity'].mean())

# Trim 10% of the values at each end before averaging
print('Trimmed Mean of Fixed Acidity of Wine: ', trim_mean(data['fixed acidity'], 0.1))

print('Median of Fixed Acidity of Wine: ', data['fixed acidity'].median())
```

Output:

```
Mean of Fixed Acidity of Wine: 8.319637273295838
Trimmed Mean of Fixed Acidity of Wine:  8.152537080405933
Median of Fixed Acidity of Wine:  7.9
```
Location is just one dimension in summarizing a feature.
A second dimension, variability, also referred to as dispersion, measures whether the data values are tightly clustered or spread out.
- Standard Deviation and Related Estimates
The most widely used estimates of variation are based on the differences, or deviations, between the estimate of location and the observed data.
Averaging the raw deviations from the mean does not work: the positive and negative deviations offset one another, so their sum is precisely zero. A simple alternative is to take the average of the absolute values of the deviations from the mean.
This is known as the mean absolute deviation and is computed with the formula: Mean absolute deviation = Σ |xᵢ − x̄| / n, where x̄ is the sample mean and n is the number of values.
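The mean absolute deviation can be sketched in a few lines (the values here are illustrative):

```python
import numpy as np

x = np.array([1.0, 4.0, 4.0])

mean = x.mean()                    # 3.0
deviations = np.abs(x - mean)      # absolute deviations: [2, 1, 1]
mad_from_mean = deviations.mean()  # average of the absolute deviations
print(mad_from_mean)               # 1.333...
```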
The best-known estimates of variability are the variance and the standard deviation, which are based on squared deviations.
The standard deviation is much easier to interpret than the variance since it is on the same scale as the original data.
The variance and standard deviation are especially sensitive to outliers since they are based on the squared deviations.
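A small illustrative demo (made-up numbers, not the wine data) shows how a single outlier inflates the standard deviation:

```python
import numpy as np

clean = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
with_outlier = np.array([10.0, 11.0, 12.0, 13.0, 100.0])

# ddof=1 gives the sample standard deviation
print(np.std(clean, ddof=1))         # ~1.58
print(np.std(with_outlier, ddof=1))  # ~39.6 -- one extreme value dominates
```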
A robust estimate of variability is the median absolute deviation (MAD): the median of the absolute deviations from the median.
- Estimates based on Percentiles
A different approach to estimating dispersion is based on looking at the spread of the sorted data. Statistics based on sorted (ranked) data are referred to as order statistics.
The most basic measure is the range, but it is sensitive to outliers and not a great measure of dispersion.
In a data set, the Pth percentile is a value such that at least P percent of the values take on this value or less, and at least (100 – P) percent of the values take on this value or more.
For example, to find the 80th percentile, sort the data; then, starting with the smallest value, proceed 80 percent of the way to the largest value.
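This procedure is what `numpy.percentile` does under the hood, interpolating between the two nearest sorted values when the position falls between ranks (the data here is illustrative):

```python
import numpy as np

x = np.array([3, 1, 5, 3, 8, 10, 6, 7, 2, 9])

# numpy sorts internally; the default method interpolates linearly between ranks
p80 = np.percentile(x, 80)
print(p80)  # 8.2 -- 80% of the way between the 8th and 9th sorted values
```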
A common measurement of variability is the difference between the 25th percentile and the 75th percentile, called the interquartile range (or IQR).
For very large data sets, calculating exact percentiles can be computationally very expensive since it requires sorting all the data values.
```python
from statsmodels.robust.scale import mad

# Measures of variability for free sulfur dioxide

# Standard deviation
print('Standard Deviation for Sulfur Dioxide in Wine: ', data['free sulfur dioxide'].std())

# Interquartile range
print('IQR of Sulfur Dioxide: ',
      data['free sulfur dioxide'].quantile(0.75) - data['free sulfur dioxide'].quantile(0.25))

# Median absolute deviation (a robust measure)
print('Median Absolute Deviation: ', mad(data['free sulfur dioxide']))
```

Output:

```
Standard Deviation for Sulfur Dioxide in Wine: 10.46015696980973
IQR of Sulfur Dioxide: 14.0
Median Absolute Deviation: 10.378215529539213
```
In this article, we explored the basics of the EDA process: estimates of central tendency and measures of variability.
Part-B will focus on Data Distributions, Exploring Categorical Variables, and Correlations.