Xavier Bas

Posted on Jan 2, 2019

Clustering Strava club riders

#python #kmeans #strava

Have you ever wondered how well you rank in your Strava club? Do you want to know the level of your friend's club? Would you like to find the riders in your Strava club whose riding performance is similar to yours?

Well, I asked myself similar questions and decided to investigate the Strava API and run a small analysis that groups the members of a Strava club into n clusters based on riding performance 🚴‍♀️🚴‍ 🚴🏿‍🚴🏻‍♀️🚴🏼‍♀️ 🚴🏼‍♂️🚴🏽‍♂️🚴🏾‍♀️🚴🏾‍♂️🚴🏽‍♀️🚴🏿‍♀️🚴‍♂️🚴🏻‍♂️💨💨💨 what a nice club ride this is!

Here I would like to focus on the clustering process rather than the use of the Strava API, because I think the latter topic is widely covered out there. If you are not familiar with the API, you might want to have a look at the official documentation.

Let's make a start, shall we?

What is a club activity?

I feel I have to start here: with definitions.

To me a club activity, and more specifically a club ride, is what happens on a typical Sunday: you wake up early, have a nice breakfast, dress yourself up in lycra and off you go for a bunch of hours with the guys from the club.

Well, it turns out this definition is not shared within the Strava world. Stay with me here. Strava considers the club activities of a specific club to be all the activities from all the users of that club. In plain English, if you join a club, all your activities will be listed as club activities of that club.

The plan

The plan is simple (the simpler the better, they say). First and foremost, we will make sure we have supplies of your favorite hot beverage, in my case Earl Grey tea 🍵. IMHO this should always be the first step before attempting anything glorious. OK, moving on..

The idea is to retrieve as much data as possible about the rides of the club of interest and do some data cleansing. Once we are happy with the data, we will get excited, as we will be ready to cluster the club riders.

Cracking on

My kettle is on, my Earl Grey is almost ready ☑️

It's time to look at some code:

import requests

ACCESS_TOKEN = 'your_access_token_here'
n_clubs = 30  # maximum number of clubs to list
endpoint = "https://www.strava.com/api/v3/athlete/clubs?page=1&per_page={}&access_token={}"
r = requests.get(endpoint.format(n_clubs, ACCESS_TOKEN))
my_clubs = r.json()  # list with one dictionary per club

I think this step was not detailed in the plan 🙃 Anyway, what this does is get a list of all the Strava clubs you have joined. In there you should be able to find the key id for each of your clubs. Once you have identified the club you are interested in, make a note of its id. From now on we will call this club_id.
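As a small sketch of that lookup, assuming my_clubs is a list of dictionaries that each carry an id and a name key. The sample entries and the helper find_club_id are made up for illustration and are not part of the Strava API:

```python
# Made-up sample standing in for the parsed JSON response (my_clubs).
sample_clubs = [
    {"id": 111, "name": "Village Velo"},
    {"id": 222, "name": "Sunday Climbers"},
]


def find_club_id(clubs, name):
    """Return the id of the first club whose name matches, or None."""
    for club in clubs:
        if club.get("name") == name:
            return club.get("id")
    return None


# Print every club so the id of interest can also be picked by eye.
for club in sample_clubs:
    print(club["id"], club["name"])

club_id = find_club_id(sample_clubs, "Sunday Climbers")
```

With the real response you would pass my_clubs instead of sample_clubs.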

Note that you need the access token here. If you know how to get it, that is good news for you. If not, I'm afraid I won't be covering this here, sorry. I believe other people who can communicate far better than me have already posted the way to get yours.

Now, as planned, we will use the club_id to retrieve as much data as we are allowed to:

import pandas as pd

# club_id is the id noted in the previous step
endpoint = "https://www.strava.com/api/v3/clubs/{}/activities?page={}&per_page={}&access_token={}"
pages = []
for page in range(1, 3):  # first two pages, 100 activities each
    r = requests.get(endpoint.format(club_id, page, 100, ACCESS_TOKEN))
    pages.append(pd.DataFrame(r.json()))
df = pd.concat(pages, ignore_index=True)
# Unpack the nested athlete dictionary into columns
# (Series.iteritems() was removed in pandas 2.0, so use tolist() instead)
df = pd.concat([df, pd.DataFrame(df['athlete'].tolist())], axis=1)
df = df.drop(['athlete', 'resource_state'], axis=1)
df['full_name'] = df.firstname + ' ' + df.lastname

This should result in a DataFrame with basic information about the club activities of the club of interest.
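The loop above fetches a fixed two pages. If you would rather keep going until the results run out, a sketch could look like the following. Here fetch_page is a hypothetical callable standing in for the requests.get(...).json() call, and stopping on a short or empty page is my assumption about how the pagination ends, not something I have checked against the API's limits.

```python
def fetch_all(fetch_page, per_page=100, max_pages=10):
    """Collect results page by page until a page comes back short or empty."""
    results = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, per_page)
        results.extend(batch)
        if len(batch) < per_page:  # short page: assume nothing is left
            break
    return results


# Fake data source with 250 activities, used here instead of the real API.
fake_activities = [{"distance": float(i)} for i in range(250)]


def fake_fetch(page, per_page):
    start = (page - 1) * per_page
    return fake_activities[start:start + per_page]


activities = fetch_all(fake_fetch)
print(len(activities))
```

The max_pages cap is there so that a misbehaving data source cannot keep the loop running forever.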

In[]: df.head()
Out[]: 
   distance  elapsed_time  moving_time             name 
0   55359.5          6709         6709   Afternoon Ride
1   23363.7          5911         5595   Afternoon Ride
2   28746.8          4961         4823   Afternoon Ride
3   64576.7         13551        10647   Afternoon Ride
4   24094.0          2712         2712     Morning Ride


   total_elevation_gain         type  workout_type        full_name(*)
0                 816.0         Ride          10.0          Sanglier
1                 427.0         Ride           NaN  Julius Pompilius
2                 724.0         Ride           NaN      Moralélastix
3                1343.7         Ride           NaN           Amnésix
4                 146.0  VirtualRide           NaN         Sténograf

(*) For privacy reasons I will display Astérix characters instead of the actual names.

A few notes here:

thankfully units seem to be in SI, that's a nice touch! 🙌🙌
we have two features describing time in the data above: elapsed_time should include breaks, whereas moving_time should be what the name describes. If that is the case, I would expect elapsed time to be higher than moving time whenever a rider stops. As you can see, some rides show identical values for both, which makes me think those rides are not logged with autopause turned on 🤦‍♂️ augh, come on guys!
average speed is not shown, so we compute it with df['speed_kph'] = df.distance/df.moving_time*3.6. Sorry for those folks who don't autopause, as their speed will come out lower
it appears that not everybody is hitting the road, Sténograf was pretty comfortable doing an early session at home!
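To make the speed conversion concrete: distance is in metres and moving time in seconds, and 1 m/s equals 3.6 km/h. A minimal sketch using three rows copied from the df.head() output above (the helper name speed_kph is only for illustration):

```python
# (distance in metres, elapsed_time in seconds, moving_time in seconds),
# copied from rows 0, 1 and 3 of the df.head() output.
rides = [
    (55359.5, 6709, 6709),
    (23363.7, 5911, 5595),
    (64576.7, 13551, 10647),
]


def speed_kph(distance_m, time_s):
    """Convert metres per second to kilometres per hour."""
    return distance_m / time_s * 3.6


for distance, elapsed, moving in rides:
    # Identical elapsed and moving time hints that no pause was logged.
    no_pause_logged = elapsed == moving
    print(round(speed_kph(distance, moving), 1), no_pause_logged)
```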

Rearranging the data

We would like to have the data indexed by athlete. One way to achieve this is the groupby method chained with the mean statistic.

summary = df[df.type=='Ride'].groupby('full_name')[['distance','total_elevation_gain','speed_kph']].mean()

In[]: summary.head()
Out[]: 
                  distance  total_elevation_gain  speed_kph
full_name                                                                 
Abraracourcix      34325.1                 502.1       15.2
Absolumentexclus   50507.7                 796.8       23.7
Amnésix            48812.6                 981.7       21.5
Amonbofis          54889.8                1014.0       20.5
Aplusbégalix       92074.0                 956.0       27.5

So you see, this gives us the 3 features (distance, total elevation gain and speed) for each rider. Note that we disregard virtual rides by filtering on the type of ride. Next, we will use precisely this data to cluster the riders into groups.
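If the filter-then-groupby step feels opaque, here is the same pattern on a tiny made-up table. The riders and numbers are invented, and the column names simply mirror the ones used above:

```python
import pandas as pd

# Made-up activities: two riders, one virtual ride that should be ignored.
toy = pd.DataFrame({
    "full_name": ["Rider A", "Rider A", "Rider B", "Rider B"],
    "type": ["Ride", "Ride", "Ride", "VirtualRide"],
    "distance": [40000.0, 60000.0, 30000.0, 90000.0],
    "total_elevation_gain": [500.0, 700.0, 200.0, 0.0],
    "speed_kph": [24.0, 26.0, 18.0, 35.0],
})

features = ["distance", "total_elevation_gain", "speed_kph"]
# Keep outdoor rides only, then average each feature per rider.
toy_summary = toy[toy.type == "Ride"].groupby("full_name")[features].mean()
print(toy_summary)
```

Rider B's virtual ride does not contribute to the averages, which is the behaviour we want for the real data too.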

Clustering

I'm running low on tea.. hold on a minute this section deserves a bit more than just tea. I think biscuits will do 😋

A good practice when dealing with machine learning algorithms is rescaling your data. In this case we will preprocess the data with the minmax scaler. This scales each feature so that its values fall within a given range, typically between 0 and 1. Then we will feed these values to the K-means clustering algorithm.
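For reference, min-max scaling maps each value x of a feature to (x - min) / (max - min). Before the actual code, here is a hand-rolled sketch on a single column, using the speed values from the summary.head() output above. The helper minmax is only for illustration and assumes the column is not constant:

```python
def minmax(values):
    """Scale values linearly so the smallest maps to 0 and the largest to 1.

    Assumes max(values) > min(values); a constant column would divide by zero.
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


speeds = [15.2, 23.7, 21.5, 20.5, 27.5]  # speed_kph column from summary.head()
scaled = minmax(speeds)
print([round(s, 2) for s in scaled])
```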

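import numpy as np
from sklearn.preprocessing import minmax_scale
from sklearn.cluster import KMeans

# Scale each feature column to the [0, 1] range
X = minmax_scale(np.array(summary))
# Group the riders into 3 clusters; random_state makes the run repeatable
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
summary['cluster'] = kmeans.labels_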

Pretty quick, isn't it? Well, let's have a look at the results before rushing to conclusions. I would like to plot the athletes' performance across our 3 features while displaying the groups we've just made.

import matplotlib.pyplot as plt
import seaborn as sns

_= plt.figure()
_= plt.subplots_adjust(hspace=0,wspace=0)
_= plt.subplot(221)
_= sns.scatterplot(x=summary.distance/1000,y='total_elevation_gain',data=summary,hue=summary.cluster,legend=False)
_= plt.subplot(223)
_= sns.scatterplot(x=summary.distance/1000,y='speed_kph',data=summary,hue=summary.cluster,legend=False)
_= plt.subplot(224)
plt.yticks([])
_= sns.scatterplot(x='total_elevation_gain',y='speed_kph',data=summary,hue=summary.cluster,legend=False)

Nice plot, but I'm not entirely satisfied with it. Sure, we have succeeded in clustering the riders into 3 groups, or say 3 teams. Hold the champagne for now: it is good news that the riders are well grouped by the distance they cover, but looking a bit closer, some of these teams are quite unbalanced in terms of speed 😓. Look at the bottom subplots (distance vs speed and total elevation gain vs speed) and pay attention to the blue team. Their range in speed is huge, and remember this is average speed!! I personally wouldn't like to be in the blue team: if you are a top rider you do nothing but wait for the rest, and if you are the slowest rider there.. what a nightmare that must be!!

We need a second attempt.

We would like a smaller range of speed in each group so that all riders can easily keep up with the pace of the group. This means the speed feature needs to matter more than the rest. How do you implement this? The key is in the scaling. After the minmax scaling, we multiply the scaled speed by a factor of 2 and leave the other features as they are. This will do the trick.

# Double the weight of speed_kph (third column); broadcasting applies it to every row
X_weighted = X * np.array([1, 1, 2])
kmeans = KMeans(n_clusters=3, random_state=0).fit(X_weighted)
summary['cluster2'] = kmeans.labels_
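To see why the factor of 2 matters: K-means groups points by Euclidean distance, so stretching one axis makes differences along it count for more. A small sketch with two made-up riders who differ only in their scaled speed (the points and the helper apply_weights are invented for illustration):

```python
import math

# Two made-up riders in scaled (distance, elevation, speed) space.
rider_a = [0.5, 0.5, 0.2]
rider_b = [0.5, 0.5, 0.6]
weights = [1, 1, 2]  # same weights as used for X_weighted


def apply_weights(point, weights):
    """Multiply each coordinate by its feature weight."""
    return [p * w for p, w in zip(point, weights)]


plain = math.dist(rider_a, rider_b)
weighted = math.dist(apply_weights(rider_a, weights),
                     apply_weights(rider_b, weights))
print(plain, weighted)
```

The speed gap between the two riders doubles after weighting, so the algorithm is less inclined to put them in the same group.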

And now we create the same figure:

_= plt.figure()
_= plt.subplots_adjust(hspace=0,wspace=0)
_= plt.subplot(221)
_= sns.scatterplot(x=summary.distance/1000,y='total_elevation_gain',data=summary,hue=summary.cluster2,legend=False)
_= plt.subplot(223)
_= sns.scatterplot(x=summary.distance/1000,y='speed_kph',data=summary,hue=summary.cluster2,legend=False)
_= plt.subplot(224)
plt.yticks([])
_= sns.scatterplot(x='total_elevation_gain',y='speed_kph',data=summary,hue=summary.cluster2,legend=False)

This looks much, much better now: riders are grouped by the distance they cover, how high they climb and how fast they ride, while the spread in average speed within each group is kept low.
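Rather than judging the spread by eye, one could also put a number on it. A sketch, using a made-up stand-in for summary with both label columns. The speeds and cluster labels below are invented, and speed_range is a hypothetical helper:

```python
import pandas as pd

toy = pd.DataFrame({
    "speed_kph": [15.0, 27.0, 21.0, 22.0, 16.0, 26.0],
    "cluster": [0, 0, 1, 1, 2, 2],    # invented labels, first attempt
    "cluster2": [0, 1, 2, 2, 0, 1],   # invented labels, second attempt
})


def speed_range(frame, label_column):
    """Max minus min of speed_kph within each cluster."""
    grouped = frame.groupby(label_column)["speed_kph"]
    return grouped.max() - grouped.min()


print(speed_range(toy, "cluster"))
print(speed_range(toy, "cluster2"))
```

On the real data, comparing speed_range(summary, 'cluster') with speed_range(summary, 'cluster2') would show whether the second attempt actually narrowed the speed range per group.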

And there you go, how to cluster your Strava club rides. Time to open the bottle of champagne 🍾
