So you'd like to build a prediction model from some data, but some of the columns seem a bit . . . useless? Alternatively, the column of data might accidentally infer meaning where there shouldn't be any. Let's try this out by engineering some columns with location data to help predict the price of a home using a small slice of the King County Housing Data set from Washington State. Let's explore how we can turn difficult columns into something useful instead of just throwing them away or using them incorrectly.
import pandas as pd df = pd.read_csv('kc_house_data.csv') df.head().T
What do each of these columns mean?
from IPython.display import display, Markdown with open('column_names.md', 'r') as fh: content = fh.read() display(Markdown(content))
COLUMN NAMES AND DESCRIPTIONS FOR KING COUNTRY DATA SET
- id - unique identified for a house
- dateDate - house was sold
- pricePrice - is prediction target
- bedroomsNumber - of Bedrooms/House
- bathroomsNumber - of bathrooms/bedrooms
- sqft_livingsquare - footage of the home
- sqft_lotsquare - footage of the lot
- floorsTotal - floors (levels) in house
- waterfront - House which has a view to a waterfront
- view - Has been viewed
- condition - How good the condition is ( Overall )
- grade - overall grade given to the housing unit, based on King County grading system
- sqft_above - square footage of house apart from basement
- sqft_basement - square footage of the basement
- yr_built - Built Year
- yr_renovated - Year when house was renovated
- zipcode - zip
- lat - Latitude coordinate
- long - Longitude coordinate
- sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors
- sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors
Many of these columns seem like they should have some kind of an influence on the price. In particular, the common saying "the three most important things to think about when buying are home are location, location, location" tells us we should probably care a lot about things like zipcode, latitude, and longitude. Also, just to note, 'sqft_living15' and 'sqft_lot15' are metadata about each house in the database that someone else has already engineered for us. Thanks, King Co. data scientist!
There are plenty of columns we could use, but for this tutorial with Haversine, let's just focus on latitude & longitude. Let's get a visual of what this data looks like.
target = df['price'].copy() needs_work = df[['long', 'lat']].copy() import seaborn as sns import matplotlib.pyplot as plt for col in needs_work: x = needs_work[col] y = target plt.figure(figsize=(16,8)) plt.subplot(1,2,1) sns.distplot(x) plt.subplot(1,2,2) sns.scatterplot(x, y) plt.suptitle(col) plt.tight_layout plt.savefig(col + '.png')
Even though these pairs of graphs look similar, notice the y-axes on these plots are measuring different things. The histogram is simply counting how many observations of the x-value we have in the data, while the scatterplot is showing the price in USD of the house sold at the x-value. It's almost like more houses sold (histogram height) correlates rather strongly with a higher sale price (scatterplot height). The Law of Supply & Demand seems alive and well in King County, WA. We'll discuss how to deal with this data in regard to predicting sale price using Haversine's 3D distance formula.
Before we get into the details of the formula, let's talk about why we should consider augmenting the location data at all. The problem here is that ONLY looking at one-dimensional movement on a map doesn't particularly generate accurate housing price patterns. Another way of saying this is there is no linear relationship between either latitude vs price nor longitude vs price. For example, if there are multiple wealthy neighborhoods that have lower-priced neighborhoods in between them, using just that one dimension of data won't accurately predict the value. In the following image (shamelessly lifted from http://www.peteryu.ca/tutorials/matlab/image_in_3d_surface_plot_with_multiple_colormaps) imagine the Z axis isn't elevation, but rather home price. As you can see in the heatmap below the surface, there isn't a clear linear pattern when you move along JUST the X-direction or just the Y-direction, but proximity to the dark red spot near the back right part of the graph would have a clear pattern for predicting price.
(Note the red spot near the bottom left should actually be blue on the heatmap, as in this example, it represents very low house prices.)
When modeling with multivariable linear regression, simply having both latitude and longitude doesn't really help either. Instead, let's think about why home values in some neighborhoods are high vs low. It could simply be the houses in those neighborhoods have a view of the mountains & water, but the only way any one home sells for a lot of money is if someone pays a lot of money for it. I know this seems simplistic, but high home values require high paying jobs, and people with high paying jobs will pay more to be conveniently close their work (especially if the home has a view of the mountains or water). This leads me to believe a "radius from high-paying job cluster" might be a much better indicator of home value instead of using the two columns of 'longitude' and 'latitude' on their own. Let's build one! In fact, if there are multiple employment clusters, maybe building a single column for each "job hub" would make sense.
If you are willing to accept that we live on a round planet, we can utilize the Haversine formula, which measures 3D arc-length on the surface of a sphere. This is really just an adaption of the Pythagorean theorem.
We adapt the python code from: https://stackoverflow.com/questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points/4913653
We choose to pass in two arguments, each in the form
[longitude, latitude]. To do this, we first build a new column to serve as the input to the function.
needs_work['long_lat'] = list(zip(needs_work['long'], needs_work['lat'])) needs_work.head(2)
Now we define the Haversine function. Alternatively, you can simply import it from https://pypi.org/project/haversine/ but we want to make sure we can control exactly what is going on and it's fun to see trigonometry in action!
from math import radians, cos, sin, asin, sqrt def haversine(list_long_lat, other=[-122.336283, 47.609395]): """ Calculate the great circle distance between two points on the earth (specified in decimal degrees), in this case the 2nd point is in the Pike Pine Retail Core of Seattle, WA. """ lon1, lat1 = list_long_lat, list_long_lat lon2, lat2 = other, other # convert decimal degrees to radians lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2]) # haversine formula dlon = lon2 - lon1 dlat = lat2 - lat1 a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2 c = 2 * asin(sqrt(a)) # Radius of earth in kilometers is 6371 km = 6371 * c return km
We're finally ready to build our useful output column using
series.apply() which yields the distance to Seattle.
needs_work['dist_to_seattle'] = needs_work['long_lat'].apply(haversine)
In fact, it would very easy to build another column for a different "job cluster location" at this same point by passing in the optional arguments for haversine() doing a different
series.apply(function, opt_arg=______). Seattle's "suburb" of Bellevue has actually grown into its own city with skyscrapers and high paying jobs of its own. Let's create the
'dist_to_bellevue' column as well.
needs_work['dist_to_bellevue'] = needs_work['long_lat'].apply(haversine, other=[-122.198985, 47.615577])
Finally, let's drop the columns we used as the building blocks for these two new columns. (Although instead of using
df.drop([columns], axis=1) we actively select the columns to keep.)
engineered_columns = needs_work.loc[:,['dist_to_seattle', 'dist_to_bellevue']]
Now you can add these back into to your original dataframe and begin improving your model with these new feature columns.