**Exploratory Data Analysis** (**EDA**) is an approach to analyzing datasets to summarize their main characteristics. It is used to understand data, get some context regarding it, understand the variables and the relationships between them, and formulate hypotheses that could be useful when building predictive models.

All data analysis must be guided by some key questions or objectives. Before starting any data analysis tasks, you should have a clear goal in mind. When your goal allows you to understand your data and the problem, you will be in a good position to get meaningful results out of your analysis!

In this tutorial, we will learn how to perform EDA using **data visualization.** Specifically, we will focus on ** seaborn**, a Python library that is built on top of

`matplotlib`

and has support for `NumPy`

and `pandas`

.`seaborn`

allows us to make attractive and informative statistical graphics. Although `matplotlib`

makes it possible to visualize essentially anything, it is often difficult and tedious to make the plots visually attractive. `seaborn`

is often used to make default `matplotlib`

plots look nicer, and also introduces some additional plot types.

We will cover how to visually analyze:

- Numerical variables with histograms,
- Categorical variables with count plots,
- Relationships between numerical variables with scatter plots, joint plots, and pair plots, and
- Relationships between numerical and categorical variables with box-and-whisker plots and complex conditional plots.

By effectively visualizing a dataset’s variables and their relationships, a data analyst or data scientist is able to quickly understand trends, outliers, and patterns. This understanding can then be used to tell a story, drive decisions, and create predictive models.

###
This brief tutorial is adapted from Next Tech’s full Data Analysis with Python course, which includes an in-browser sandboxed environment with Python, Jupyter Notebooks, and `seaborn`

pre-installed. You can get started with this course for free!

# Data Preparation

Data preparation is the first step of any data analysis to ensure data is cleaned and transformed in a form that can be analyzed.

We will be performing EDA on the Ames Housing dataset. This dataset is popular among those beginning to learn data science and machine learning as it contains data about almost every characteristic of different houses that were sold Ames, Iowa. This data can then be used to try to predict sale prices.

This dataset is already cleaned and ready for analysis. All we will be doing is filtering some variables to simplify our task. Let’s begin by reading our data as a `pandas`

DataFrame:

```
import pandas as pd
import matplotlib as plt
housing = pd.read_csv('house.csv')
housing.info()
```

If you run this code in Next Tech’s sandbox which already has the dataset imported, or in a Jupyter notebook, you can see that there are 1,460 observations and 81 columns. Each column represents a variable in the DataFrame. We can see from the data type of each column what type of variable it is.

We will only be working with some of the variables — let’s filter and store their names in two lists called `numerical`

and `categorical`

, then redefine our `housing`

DataFrame to contain only these variables:

```
numerical = [
'SalePrice', 'LotArea', 'OverallQual', 'OverallCond', '1stFlrSF', '2ndFlrSF', 'BedroomAbvGr'
]
categorical = [
'MSZoning', 'LotShape', 'Neighborhood', 'CentralAir', 'SaleCondition', 'MoSold', 'YrSold'
]
housing = housing[numerical + categorical]
housing.shape
```

From `housing.shape`

, we can see that our DataFrame now only has 14 columns. Let’s move onto some analysis!

# Analyzing Numerical Variables

Our EDA objective will be to understand how the variables in this dataset relate to the sale price of the house.

Before we can do that, we need to first understand our variables. Let’s start with numerical variables, specifically our target variable, `SalePrice`

.

Numerical variables are simply those for which the values are numbers. The first thing that we do when we have numerical variables is to understand what values the variable can take, as well as the distribution and dispersion. This can be achieved with a **histogram**:

```
import seaborn as sns
sns.set(style='whitegrid', palette="deep", font_scale=1.1, rc={"figure.figsize": [8, 5]})
sns.distplot(
housing['SalePrice'], norm_hist=False, kde=False, bins=20, hist_kws={"alpha": 1}
).set(xlabel='Sale Price', ylabel='Count');
```

Note that, due to an inside joke, the `seaborn`

library is imported as `sns`

.

With just one method `sns.set()`

, we are able to style our figure, change the color, increase font size for readability, and change the figure size.

We use `distplot`

to plot histograms in `seaborn`

. This by default plots a histogram with a kernel density estimation (KDE). You can try changing the parameter `kde=True`

to see what this looks like.

Taking a look at the histogram, we can see that very few houses are priced below $100,000, most of the houses sold between $100,000 and $200,000, and very few houses sold for above $400,000.

If we want to create histograms for all of our numerical variables, `pandas`

offers the simplest solution:

```
housing[numerical].hist(bins=15, figsize=(15, 6), layout=(2, 4));
```

From this visualization, we get a lot of information. We can see that `1stFlrSF`

(square footage of the first floor) is heavily skewed right, most houses do not have a second floor, and have 3 `BedroomAbvGr`

(bedrooms above ground). Most houses were sold at an `OverallCond`

of 5 and an `OverallQual`

of 5 or higher. The `LotArea`

visual is more difficult to decipher — however we can tell that there is one or more outliers that may need to be removed before any modeling.

Note that the figure keeps the style that we set previously using `seaborn`

.

# Analyzing Categorical Variables

Categorical variables are those for which the values are labeled categories. The values, distribution, and dispersion of categorical variables are best understood with **bar plots**. Let’s analyze the `SaleCondition`

variable. `seaborn`

gives us a very simple method to show the counts of observations in each category: the `countplot`

.

```
sns.countplot(housing['SaleCondition']);
```

From the visualization, we can easily see that most houses were sold in `Normal`

condition, and very few were sold in `AjdLand`

(adjoining land purchase), `Alloca`

(allocation: two linked properties with separate deeds), and `Family`

(sale between family members) conditions.

In order to visualize all the categorical variables in our dataset, just as we did with the numerical variables, we can loop through `pandas`

series to create subplots.

Using `plt.subplots`

, we can create a figure with a grid of 2 rows and 4 columns. Then we iterate over every categorical variable to create a `countplot`

with `seaborn`

:

```
fig, ax = plt.subplots(2, 4, figsize=(20, 10))
for variable, subplot in zip(categorical, ax.flatten()):
sns.countplot(housing[variable], ax=subplot)
for label in subplot.get_xticklabels():
label.set_rotation(90)
```

The second `for`

loop simply gets each *x*-tick label and rotates it 90 degrees to make the text fit on the plots better (you can remove these two lines if you want to know how the text looks without rotation).

As with our numerical variable histograms, we can gather lots of information from this visual — most houses have `RL`

(Residential Low Density) zoning classification, have `Regular`

lot shape, and have `CentralAir`

. We can also see that houses were sold more frequently during the summer months, the most houses were sold in the `NAmes`

(North Ames) neighborhood, and there was a dip in sale in 2010.

However, if we inspect the `YrSold`

variable further, we can see that this “dip” is actually due to the fact that only data up to July was collected.

```
housing[housing['YrSold'] == 2010].groupby('MoSold')['YrSold'].count()
```

As you can see, thorough exploration of variables and their values is incredibly important — if we built a model to predict sale prices under the assumption that there was a decrease in sales in 2010, this model would likely be very inaccurate.

Now that we have explored our numerical and categorical variables, let’s take a look at the relationship between these variables — more importantly, how these variables impact our target variable, `SalePrice`

!

# Analyzing Relationships Between Numerical Variables

Plotting relationships between variables allows us to easily get a visual understanding of patterns and correlations.

The **scatter plot** is often used for visualizing relationships between two numerical variables. The `seaborn`

method to create a scatter plot is very simple:

```
sns.scatterplot(x=housing['1stFlrSF'], y=housing['SalePrice']);
```

From the scatter plot, we see here that we have a positive relationship between the `1stFlrSF`

of the house and the `SalePrice`

of the house. In other words, the larger the first floor of a house, the higher the likely sale price.

You can also see that the axis labels are added for us by default, and the markers are automatically outlined to make them clearer — this is opposed to `matplotlib`

in which these are not the default.

`seaborn`

also provides us with a nice function called `jointplot`

which will give you a scatter plot showing the relationship between two variables along with histograms of each variable in the margins — also known as a **marginal plot**.

```
sns.jointplot(x=housing['1stFlrSF'], y=housing['SalePrice']);
```

Not only can you see the relationships between the two variables, but also how they are distributed individually.

# Analyzing Relationships Between Numerical and Categorical Variables

The **box-and-whisker plot** is commonly used for visualizing relationships between numerical variables and categorical variables, and **complex conditional plots** are used to visualize conditional relationships.

Let’s start by creating box-and-whisker plots with `seaborn`

’s `boxplot`

method:

```
fig, ax = plt.subplots(3, 3, figsize=(15, 10))
for var, subplot in zip(categorical, ax.flatten()):
sns.boxplot(x=var, y='SalePrice', data=housing, ax=subplot)
```

Here, we have iterated through every subplot to produce the visualization between all categorical variables and the `SalePrice`

.

We can see that houses with `FV`

(Floating Village Residential) zoning classification have a higher average `SalePrice`

than other zoning classifications, as do houses with `CentralAir`

, and houses with a `Partial`

(home not completed when last assessed) `SaleCondition`

. We can also see that there is little variance in average `SalePrice`

between houses with different `LotShapes`

, or between `MoSold`

and `YrSold`

.

Let’s take a closer look at the `Neighborhood`

variable. We see that there is definitely a different distribution for different neighborhoods, but the visualization is a bit difficult to decipher. Let’s sort our box plots by cheapest neighborhood to most expensive (by median price) using the additional argument `order`

.

```
sorted_nb = housing.groupby(['Neighborhood'])['SalePrice'].median().sort_values()
sns.boxplot(x=housing['Neighborhood'], y=housing['SalePrice'], order=list(sorted_nb.index))
```

In the above snippet, we sorted our neighborhoods by median price and stored this in `sorted_nb`

. Then, we passed this list of neighborhood names into the `order`

argument to create a sorted box plot.

This figure gives us a lot of information. We can see that in the cheapest neighborhoods houses sell for a median price of around $100,000, and in the most expensive neighborhoods houses sell for around $300,000. We can also see that for some neighborhoods, dispersion between the prices is very low, meaning that all the prices are close to each other. In the most expensive neighborhood `NridgHt`

, however, we see a large box — there is large dispersion in the distribution of prices.

Finally, `seaborn`

also allows us to create plots that show conditional relationships. For example, if we are conditioning on the `Neighborhood`

, using the `FacetGrid`

function we can visualize a scatter plot between the `OverallQual`

and the `SalePrice`

variables:

```
cond_plot = sns.FacetGrid(data=housing, col='Neighborhood', hue='CentralAir', col_wrap=4)
cond_plot.map(sns.scatterplot, 'OverallQual', 'SalePrice');
```

For each individual neighborhood we can see the relationship between `OverallQual`

and `SalePrice`

.

We also added another categorical variable `CentralAir`

to the (optional) `hue`

parameter — the orange points correspond to houses that do not have `CentralAir`

. As you can see, these houses tend to sell at a lower price.

The `FacetGrid`

method makes it incredibly easy to produce complex visualizations and to get valuable information. It is good practice to produce these visualizations to get quick insights about variable relationships.

I hope you’ve enjoyed this brief tutorial on exploratory data analysis and data visualization with `seaborn`

! I covered how to create histograms, count plots, scatter plots, marginal plots, box-and-whisker plots, and conditional plots.

During our exploration, we discovered outliers and trends within individual variables, and relationships between variables. This knowledge can be used to build a model to predict the `SalePrice`

of houses in Ames. For example, since we found a correlation between `SalePrice`

and the variables `CentralAir`

, `1stFlrSf`

, `SaleCondition`

, and `Neighborhood`

, we can start with a simple model using these variables.

###
If you are interesting in learning how to do this and more, we have a full **Data Analysis with Python** course available at Next Tech. Get started for free here!

This course explores vectorizing operations with `NumPy`

, EDA using `pandas`

, data visualization with `matplotlib`

, additional EDA and visualization techniques using `seaborn`

, statistical computing with `SciPy`

, and machine learning with `scikit-learn`

.

## Discussion