David Kaspar

Posted on Apr 19, 2019

Engineering Location Features with Haversine's Formula for Prediction Modeling

#python #linearregression #machinelearning #distancecalculation

Here is the full git repo: Predicting King Co home prices This blog post is just a snippet of one strategy we used.
Co-written with Natasha Kacoroski

So you'd like to build a prediction model from some data, but some of the columns seem a bit . . . useless? Alternatively, the column of data might accidentally infer meaning where there shouldn't be any. Let's try this out by engineering some columns with location data to help predict the price of a home using a small slice of the King County Housing Data set from Washington State. Let's explore how we can turn difficult columns into something useful instead of just throwing them away or using them incorrectly.

import pandas as pd

df = pd.read_csv('kc_house_data.csv')
df.head().T

	0	1	2	3	4
id	7129300520	6414100192	5631500400	2487200875	1954400510
date	10/13/2014	12/9/2014	2/25/2015	12/9/2014	2/18/2015
price	221900	538000	180000	604000	510000
bedrooms	3	3	2	4	3
bathrooms	1	2.25	1	3	2
sqft_living	1180	2570	770	1960	1680
sqft_lot	5650	7242	10000	5000	8080
floors	1	2	1	1	1
waterfront	NaN	0	0	0	0
view	0	0	0	0	0
condition	3	3	3	5	3
grade	7	7	6	7	8
sqft_above	1180	2170	770	1050	1680
sqft_basement	0.0	400.0	0.0	910.0	0.0
yr_built	1955	1951	1933	1965	1987
yr_renovated	0	1991	NaN	0	0
zipcode	98178	98125	98028	98136	98074
lat	47.5112	47.721	47.7379	47.5208	47.6168
long	-122.257	-122.319	-122.233	-122.393	-122.045
sqft_living15	1340	1690	2720	1360	1800
sqft_lot15	5650	7639	8062	5000	7503

What do each of these columns mean?

from IPython.display import display, Markdown

with open('column_names.md', 'r') as fh:
    content = fh.read()

display(Markdown(content))

COLUMN NAMES AND DESCRIPTIONS FOR KING COUNTRY DATA SET

id - unique identified for a house
dateDate - house was sold
pricePrice - is prediction target
bedroomsNumber - of Bedrooms/House
bathroomsNumber - of bathrooms/bedrooms
sqft_livingsquare - footage of the home
sqft_lotsquare - footage of the lot
floorsTotal - floors (levels) in house
waterfront - House which has a view to a waterfront
view - Has been viewed
condition - How good the condition is ( Overall )
grade - overall grade given to the housing unit, based on King County grading system
sqft_above - square footage of house apart from basement
sqft_basement - square footage of the basement
yr_built - Built Year
yr_renovated - Year when house was renovated
zipcode - zip
lat - Latitude coordinate
long - Longitude coordinate
sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors
sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors

Many of these columns seem like they should have some kind of an influence on the price. In particular, the common saying "the three most important things to think about when buying are home are location, location, location" tells us we should probably care a lot about things like zipcode, latitude, and longitude. Also, just to note, 'sqft_living15' and 'sqft_lot15' are metadata about each house in the database that someone else has already engineered for us. Thanks, King Co. data scientist!

There are plenty of columns we could use, but for this tutorial with Haversine, let's just focus on latitude & longitude. Let's get a visual of what this data looks like.

target = df['price'].copy()
needs_work = df[['long', 'lat']].copy()

import seaborn as sns
import matplotlib.pyplot as plt

for col in needs_work:
    x = needs_work[col]
    y = target

    plt.figure(figsize=(16,8))
    plt.subplot(1,2,1)
    sns.distplot(x)
    plt.subplot(1,2,2)

    sns.scatterplot(x, y)
    plt.suptitle(col)
    plt.tight_layout
    plt.savefig(col + '.png')

Even though these pairs of graphs look similar, notice the y-axes on these plots are measuring different things. The histogram is simply counting how many observations of the x-value we have in the data, while the scatterplot is showing the price in USD of the house sold at the x-value. It's almost like more houses sold (histogram height) correlates rather strongly with a higher sale price (scatterplot height). The Law of Supply & Demand seems alive and well in King County, WA. We'll discuss how to deal with this data in regard to predicting sale price using Haversine's 3D distance formula.

Before we get into the details of the formula, let's talk about why we should consider augmenting the location data at all. The problem here is that ONLY looking at one-dimensional movement on a map doesn't particularly generate accurate housing price patterns. Another way of saying this is there is no linear relationship between either latitude vs price nor longitude vs price. For example, if there are multiple wealthy neighborhoods that have lower-priced neighborhoods in between them, using just that one dimension of data won't accurately predict the value. In the following image (shamelessly lifted from http://www.peteryu.ca/tutorials/matlab/image_in_3d_surface_plot_with_multiple_colormaps) imagine the Z axis isn't elevation, but rather home price. As you can see in the heatmap below the surface, there isn't a clear linear pattern when you move along JUST the X-direction or just the Y-direction, but proximity to the dark red spot near the back right part of the graph would have a clear pattern for predicting price.
(Note the red spot near the bottom left should actually be blue on the heatmap, as in this example, it represents very low house prices.)

When modeling with multivariable linear regression, simply having both latitude and longitude doesn't really help either. Instead, let's think about why home values in some neighborhoods are high vs low. It could simply be the houses in those neighborhoods have a view of the mountains & water, but the only way any one home sells for a lot of money is if someone pays a lot of money for it. I know this seems simplistic, but high home values require high paying jobs, and people with high paying jobs will pay more to be conveniently close their work (especially if the home has a view of the mountains or water). This leads me to believe a "radius from high-paying job cluster" might be a much better indicator of home value instead of using the two columns of 'longitude' and 'latitude' on their own. Let's build one! In fact, if there are multiple employment clusters, maybe building a single column for each "job hub" would make sense.

If you are willing to accept that we live on a round planet, we can utilize the Haversine formula, which measures 3D arc-length on the surface of a sphere. This is really just an adaption of the Pythagorean theorem.

We adapt the python code from: https://stackoverflow.com/questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points/4913653

We choose to pass in two arguments, each in the form [longitude, latitude]. To do this, we first build a new column to serve as the input to the function.

needs_work.head(2)

	long	lat
0	-122.257	47.5112
1	-122.319	47.7210

needs_work['long_lat'] = list(zip(needs_work['long'], needs_work['lat']))
needs_work.head(2)

	long	lat	long_lat
0	-122.257	47.5112	(-122.257, 47.5112)
1	-122.319	47.7210	(-122.319, 47.721000000000004)

Now we define the Haversine function. Alternatively, you can simply import it from https://pypi.org/project/haversine/ but we want to make sure we can control exactly what is going on and it's fun to see trigonometry in action!

from math import radians, cos, sin, asin, sqrt

def haversine(list_long_lat, other=[-122.336283, 47.609395]):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees), in this case the 2nd point is in the 
    Pike Pine Retail Core of Seattle, WA.
    """
    lon1, lat1 = list_long_lat[0], list_long_lat[1]
    lon2, lat2 = other[0], other[1]
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    # Radius of earth in kilometers is 6371
    km = 6371 * c
    return km

We're finally ready to build our useful output column using series.apply() which yields the distance to Seattle.

needs_work['dist_to_seattle'] = needs_work['long_lat'].apply(haversine)

In fact, it would very easy to build another column for a different "job cluster location" at this same point by passing in the optional arguments for haversine() doing a different series.apply(function, opt_arg=______). Seattle's "suburb" of Bellevue has actually grown into its own city with skyscrapers and high paying jobs of its own. Let's create the 'dist_to_bellevue' column as well.

needs_work['dist_to_bellevue'] = needs_work['long_lat'].apply(haversine, other=[-122.198985, 47.615577])

Finally, let's drop the columns we used as the building blocks for these two new columns. (Although instead of using df.drop([columns], axis=1) we actively select the columns to keep.)

engineered_columns = needs_work.loc[:,['dist_to_seattle', 'dist_to_bellevue']]

engineered_columns.head(10)

	dist_to_seattle	dist_to_bellevue
0	12.434278	12.395639
1	12.477217	14.770934
2	16.247460	13.838051
3	10.731122	17.970486
4	21.850148	11.542868
5	25.361129	15.217257
6	33.331870	35.347235
7	22.284717	24.515385
8	10.796605	15.463271
9	35.274167	30.244215

Now you can add these back into to your original dataframe and begin improving your model with these new feature columns.

DEV Community

Engineering Location Features with Haversine's Formula for Prediction Modeling

Top comments (0)