DEV Community

Samuel Kamuli

Understanding Your Data: The Essentials of Exploratory Data Analysis

INTRODUCTION

Data, in the simplest of terms, can be defined as factual information collected together for reference or analysis. Data can be grouped, first and foremost, into qualitative (categorical) data and quantitative (numerical) data. Qualitative data represents information and concepts that cannot be expressed by numbers, whereas quantitative data is data that can be represented numerically, i.e. anything that can be counted or measured.
Exploratory data analysis (EDA) is an analytical approach used to examine datasets for the purpose of testing hypotheses, summarizing their general characteristics, uncovering underlying patterns and spotting anomalies.

ESSENTIALS OF EDA

  1. **Understanding your Data Structure**
    Before you delve into your data and begin crunching the numbers, it is essential that you first get a good grasp of your dataset. Know what data types are in your dataset, be they `date`, `datetime`, `boolean`, `string`, `integer`, `floating point number`, etc.
    It is also important to know whether your dataset falls under categorical or numerical data and the subcategories found therein. Understanding your dataset will guide you in knowing which type of EDA to perform, whether it is multivariate non-graphical, univariate non-graphical or univariate graphical EDA.
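As a quick way to get this first overview in Python, here is a minimal sketch using pandas; the dataset and column names are hypothetical, purely for illustration:

```python
import pandas as pd

# Hypothetical dataset: one categorical, one numerical and one datetime column
df = pd.DataFrame({
    "city": ["Nairobi", "Mombasa", "Kisumu"],        # categorical (string)
    "population": [4_397_073, 1_208_333, 610_082],   # numerical (integer)
    "census_date": pd.to_datetime(["2019-08-24"] * 3),
})

# Inspect the data type of each column
print(df.dtypes)

# Split the columns into numerical and categorical groups
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
print(numeric_cols)      # ['population']
print(categorical_cols)  # ['city']
```

Knowing which columns fall into which group then tells you which univariate or multivariate techniques apply to each.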

  2. **Cleaning your Dataset**
    When your dataset is first loaded into the coding environment of your choice, the most crucial step is to clean it before analysis begins, as a 'dirty' dataset is compromised and will affect the accuracy of your analysis. Some of the key steps in this stage include:
    - Checking for null values: once you have identified any null values in your dataset, you can replace them using the mean, median or mode of that column. In instances where one column has too many null values, you can drop the entire column.
    - Checking for outliers: outliers are data points that deviate significantly from the norm of your dataset. They can impact your data visualizations, distort your summary statistics and negatively affect your models.
    - Identifying duplicate data: duplicate data is another factor that affects the integrity of your data and the accuracy of your analysis. The most common practice when dealing with duplicate data is to drop the duplicates.
    - Ensuring data uniformity: the final stage of data cleaning is to make sure each column holds a single, consistent data type, with no column mixing two or more distinct data types.
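The cleaning steps above can be sketched in pandas as follows; the dataset is hypothetical, and the IQR rule used for outliers is one common convention among several:

```python
import numpy as np
import pandas as pd

# Hypothetical 'dirty' dataset with a null value, a duplicate row and an outlier
df = pd.DataFrame({
    "age":    [10, 20, np.nan, 30, 40, 1000, 40],
    "gender": ["F", "M", "M", "F", "M", "F", "M"],
})

# 1. Check for null values, then fill them with the column median
print(df.isnull().sum())
df["age"] = df["age"].fillna(df["age"].median())

# 2. Flag outliers using the interquartile range (IQR) rule
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
print(outliers)

# 3. Drop duplicate rows
df = df.drop_duplicates()
```

Whether to fill, flag or drop at each step depends on your dataset and your analysis goals; this is one reasonable sequence, not the only one.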

  3. **Visualize your Dataset**
    Once you have cleaned up your original dataset, you can now visualize what remains. Depending on the number and types of variables, you can choose any suitable means of visualization. For instance, you can opt for correlation matrices or scatter plots to visualize data with two or more variables, bar graphs or pie charts to visualize categorical data, and box plots for visualizing data with one variable.
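The actual rendering is typically done with a plotting library such as Matplotlib or Seaborn; the sketch below computes the numbers those charts would display, using a hypothetical dataset:

```python
import pandas as pd

# Hypothetical cleaned dataset: two numerical variables and one categorical
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [50, 60, 70, 80, 90],
    "gender":    ["F", "F", "M", "M", "M"],
})

# Correlation matrix: the values a heatmap (e.g. seaborn.heatmap) would display
corr = df[["height_cm", "weight_kg"]].corr()
print(corr)

# Category counts: the values a bar graph or pie chart would display
counts = df["gender"].value_counts()
print(counts)
```

With a plotting library installed, `corr` would feed directly into a heatmap and `counts` into a bar chart, e.g. `counts.plot.bar()`.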

  4. **Perform analyses on your variables**
    This step will help us gain insight into the distribution of, and correlation between, our variables. Once again, the technique of analysis varies depending on the number of variables and their data types. Once we have analyzed our variables, we can then identify the relationships between them.
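A minimal sketch of univariate and bivariate analysis in pandas, again on a hypothetical dataset:

```python
import pandas as pd

# Hypothetical dataset: exam scores with a categorical grouping variable
df = pd.DataFrame({
    "score": [55, 60, 62, 70, 71, 75, 80, 90],
    "group": ["A", "A", "B", "A", "B", "B", "B", "B"],
})

# Univariate analysis: distribution of a single numerical variable
summary = df["score"].describe()   # count, mean, std, min, quartiles, max
skewness = df["score"].skew()      # asymmetry of the distribution
print(summary)
print(skewness)

# Bivariate analysis: relationship between a categorical and a numerical variable
group_means = df.groupby("group")["score"].mean()
print(group_means)
```

Here the group means reveal a relationship between the two variables: one group scores higher on average than the other, which is exactly the kind of insight this step is after.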

  5. **Identifying Data Patterns**
    This step is crucial because it allows us to observe the behavior of our variables, both independent and dependent, and in the long term make predictions about them. It is a major step because identifying patterns is a core reason why EDA is performed in the first place.
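One simple way to surface a pattern is to smooth a series over time; the monthly sales figures below are hypothetical:

```python
import pandas as pd

# Hypothetical monthly sales series with an underlying upward trend
sales = pd.Series(
    [100, 105, 103, 110, 115, 113, 120, 125],
    index=pd.date_range("2024-01-01", periods=8, freq="MS"),
)

# A rolling mean smooths out month-to-month noise so the trend stands out
trend = sales.rolling(window=3).mean()
print(trend)

# A crude check on the pattern: the later half averages higher than the earlier
first_half_mean = sales.iloc[:4].mean()
second_half_mean = sales.iloc[4:].mean()
print(second_half_mean > first_half_mean)
```

Spotting a trend like this is what later lets us move from description to prediction.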

The final step of EDA is documentation and reporting, as you will need to present your findings in an easy-to-understand manner. After all, the whole point of data analysis is to make sense of facts and figures.
Some of the tools commonly used for EDA are Python, R and, in some cases, even SQL.
