Nginacloud

# What is EDA?

Exploratory data analysis (EDA) is the process of examining, summarizing, and manipulating a dataset to get the answers one needs. It helps data analysts discover patterns, check assumptions, test hypotheses, and gain a better understanding of the dataset.

# Four primary types of EDA

Univariate non-graphical
This type focuses on analyzing a single variable at a time without using visualizations.

• Descriptive Statistics: Measures like mean, median, mode, variance, standard deviation, and range.
• Frequency Distribution: Count of occurrences for each value in the dataset.
• Percentiles and Quartiles: Identifying specific points in the data distribution (e.g., 25th, 50th, and 75th percentiles).
```python
# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns
print(df.describe())
```

For example, the percentiles are read as follows:
• 25th Percentile: The value below which 25% of the data falls.
• 50th Percentile: The median, the value below which 50% of the data falls.
• 75th Percentile: The value below which 75% of the data falls.
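
The frequency distribution and percentile ideas above can be sketched with pandas. This is a minimal illustration, not the dataset used in this post: the values are made up, `Temp_C` mirrors the column name used elsewhere here, and `Weather` is a hypothetical categorical column.

```python
import pandas as pd

# Made-up sample data, for illustration only
df = pd.DataFrame({
    "Temp_C": [10.0, 12.0, 14.0, 16.0, 18.0],
    "Weather": ["Clear", "Rain", "Clear", "Fog", "Clear"],  # hypothetical column
})

# Frequency distribution: count of occurrences of each distinct value
weather_counts = df["Weather"].value_counts()
print(weather_counts)

# Percentiles/quartiles: 25th, 50th (median), and 75th
quartiles = df["Temp_C"].quantile([0.25, 0.50, 0.75])
print(quartiles)

# Range and interquartile range (IQR) as simple measures of spread
temp_range = df["Temp_C"].max() - df["Temp_C"].min()
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]
print(temp_range, iqr)
```

`value_counts()` suits categorical or discrete columns; for a continuous column such as temperature, quantiles usually say more than raw counts.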

Univariate graphical
This type also focuses on a single variable but uses visualizations to better understand its distribution. Common visual tools include:

• Histograms: Show the distribution of a variable by grouping data into bins.
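```python
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of temperature with a kernel density estimate (KDE) curve overlaid
plt.figure(figsize=(10, 6))
sns.histplot(df['Temp_C'], kde=True)
plt.title('Distribution of Temperature')
plt.xlabel('Temperature')
plt.ylabel('Frequency')
plt.show()
```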

• Box Plots: Display the distribution of data based on five summary statistics (minimum, first quartile, median, third quartile, and maximum).
• Density Plots: A smoothed version of a histogram that shows the data distribution.

Multivariate non-graphical
This type analyzes relationships between two or more variables without visual aids:

• Correlation Analysis: Examining the linear relationship between two variables using correlation coefficients.
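```python
# Pairwise Pearson correlation between numeric columns (needs pandas 1.5+)
# numeric_only=True skips text columns, which otherwise raise an error in pandas 2.0+
correlation_matrix = df.corr(numeric_only=True)
print(correlation_matrix)
```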

• Cross-tabulation: Summarizing data by showing the relationship between categorical variables.
• Covariance: Measuring the extent to which two variables change together.
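
Cross-tabulation and covariance can be sketched with `pd.crosstab` and `DataFrame.cov`. The data below is made up for illustration, and `Humidity`, `Weather`, and `Season` are hypothetical column names rather than columns confirmed to exist in the dataset used in this post.

```python
import pandas as pd

# Made-up sample data, for illustration only
df = pd.DataFrame({
    "Temp_C": [10.0, 12.0, 14.0, 16.0, 18.0],
    "Humidity": [90.0, 80.0, 70.0, 60.0, 50.0],             # hypothetical column
    "Weather": ["Rain", "Rain", "Clear", "Clear", "Clear"],  # hypothetical column
    "Season": ["Winter", "Winter", "Winter", "Summer", "Summer"],  # hypothetical column
})

# Cross-tabulation: counts for each combination of two categorical variables
weather_by_season = pd.crosstab(df["Season"], df["Weather"])
print(weather_by_season)

# Covariance: sample covariance (divides by n - 1) between numeric columns
cov_matrix = df[["Temp_C", "Humidity"]].cov()
print(cov_matrix)

# Correlation is covariance rescaled to the range [-1, 1]
corr_value = df["Temp_C"].corr(df["Humidity"])
print(corr_value)
```

Covariance depends on the units of both variables, so its magnitude is hard to compare across pairs; the correlation coefficient is the unit-free version.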

Multivariate graphical
This type involves visualizing relationships between multiple variables to identify patterns and interactions. Common visual tools include:

• Scatter Plots: Show the relationship between two continuous variables.
• Pair Plots: Provide scatter plots for all possible pairs of variables in the dataset.
• Heatmaps: Display correlation or other matrix-based data, using color to represent values.
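```python
# Correlation heatmap of the numeric columns (numeric_only needs pandas 1.5+)
correlation_matrix = df.corr(numeric_only=True)
plt.figure(figsize=(14, 7))
# vmin/vmax pin the color scale to [-1, 1] so the midpoint color means zero correlation
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix')
plt.show()
```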

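With the coolwarm colormap, red shades (closer to 1) represent positive correlation, blue shades (closer to -1) represent negative correlation, and pale, near-white shades represent little to no correlation, provided the color scale is centered on 0.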

• 3D Plots: Visualize the relationship between three variables simultaneously.

# Tools and Libraries

Python-Based Tools

• Pandas
A powerful data manipulation library that offers tools for data cleaning, aggregation, and simple statistical analysis. It integrates well with other Python libraries for visualizations.

• Matplotlib
A plotting library for creating static, animated, and interactive visualizations in Python. It’s often used for basic graphs like histograms, scatter plots, and line plots.

• Seaborn
Seaborn provides a high-level interface for drawing attractive and informative statistical graphics, such as pair plots, heatmaps, and box plots.

Jupyter Notebooks
Jupyter Notebooks let you create and share documents containing live code, equations, visualizations, and narrative text. They are highly flexible for combining code, output, and documentation in one place.

BI Tools

• Tableau: A popular business intelligence tool that allows for drag-and-drop creation of interactive dashboards, visualizations, and in-depth data analysis.
• Power BI: Microsoft’s business analytics service that offers powerful data visualization and reporting capabilities, making it a strong tool for EDA in a business context.