Vladyslav Hutov

Posted on Jan 11, 2023 • Edited on Jan 23, 2023

Building Personalised Music Recommendation System

#python #tutorial #machinelearning #showdev

I was excited to find a playlist dataset from Spotify because I have always been unhappy with the recommendations from music streaming platforms. I wanted to use this dataset as an opportunity to learn how to build a personalised recommendation system.

In this series, I'll walk you through the process of building such a system using Python for data processing and Node.js for recommendations retrieval. In this first article, we'll focus on the data processing flow. In the second article, we'll be building a real-time recommendation system.
If you're interested, join me to see the final result!

Conceptually, this is what I wanted to implement:

The Data

The dataset we'll be working with consists of 1000 JSON files, and contains around 1 million public playlists.

Here's an example of a single playlist from the JSON:

playlist content

{
  "name": "Throwbacks",
  "collaborative": "false",
  "pid": 0,
  "modified_at": 1493424000,
  "num_tracks": 52,
  "num_albums": 47,
  "num_followers": 1,
  "tracks": [
    {
      "pos": 0,
      "artist_name": "Missy Elliott",
      "track_uri": "spotify:track:0UaMYEvWZi0ZqiDOoHU3YI",
      "artist_uri": "spotify:artist:2wIVse2owClT7go1WT98tk",
      "track_name": "Lose Control (feat. Ciara & Fat Man Scoop)",
      "album_uri": "spotify:album:6vV5UrXcfyQD1wu4Qo2I9K",
      "duration_ms": 226863,
      "album_name": "The Cookbook"
    },
    // ...
    {
      "pos": 51,
      "artist_name": "Boys Like Girls",
      "track_uri": "spotify:track:6GIrIt2M39wEGwjCQjGChX",
      "artist_uri": "spotify:artist:0vWCyXMrrvMlCcepuOJaGI",
      "track_name": "The Great Escape",
      "album_uri": "spotify:album:4WqgusSAgXkrjbXzqdBY68",
      "duration_ms": 206520,
      "album_name": "Boys Like Girls"
    }
  ],
  "num_edits": 6,
  "duration_ms": 11532414,
  "num_artists": 37
}

Each playlist consists of a name, a PID, and a list of tracks, along with some additional data.

When building this recommendation system, I made the assumption that "people are more likely to add similar songs in a playlist." While this may not hold true for every playlist, I believe that on a larger scale, it's a reasonable assumption. As a result, I'll be treating each playlist (identified by its PID) as a unit of taste preferences, and modeling item similarities based on them using various algorithms.

Talk is cheap. Show me the code ― Linus Torvalds

Preparing Data

Before diving into running different ML algorithms, we need to prepare the data for ingestion and do some cleaning.

The goal is to parse the URIs, select the necessary fields, unfold the nested playlist tracks structure, and keep only the top 10K artists. Also, converting the JSON files to CSV and parsing the fields will save time when experimenting.

While this section may be basic, I've included it for completeness. If you're familiar with these concepts, feel free to skip ahead to the next section.

Below are some common imports that I'll be using throughout the examples.

imports

import collections
import functools
import itertools
import json
import math

from collections import Counter, namedtuple
from pathlib import Path

import pandas as pd
import tqdm

from fastai.collab import *
from fastai.tabular.all import *
from sklearn.neighbors import KNeighborsClassifier

Reading JSON dataset

We'll need a few helper functions. The first one is simply reading a file and parsing the JSON contained within it:

def read_file(path):
    with open(path, 'r') as f:
      return json.load(f)

We'll also need a few field getters for the track entity. Notice that the URI parsing is just a simple substring operation:

artist_uri_prefix = len('spotify:artist:')
track_uri_prefix = len('spotify:track:')


def artist_name(track):
    return track['artist_name']

def artist_uri(track):
    return track['artist_uri'][artist_uri_prefix:]

def track_name(track):
    return track['track_name']

def track_uri(track):
    return track['track_uri'][track_uri_prefix:]

The next function reads the JSON files, parses the required fields, and returns an iterator of processed rows. This part might be confusing if you're not familiar with the concept of generators. Since we don't need to store the intermediate JSON files or the unfolded structure in memory, I'm using generators to lazily process them.

root_dir = '/kaggle/input/spotify-millions-playlist/spotify/data'
p = Path(root_dir)
paths = list(p.iterdir())

def read_files(num_files=5):

    lookup_path = paths if num_files is None else paths[:num_files]

    jsons = (read_file(path) for path in lookup_path)

    return (
        [
            row['pid'], 
            artist_name(track), 
            artist_uri(track), 
            track_name(track), 
            track_uri(track)
        ]
        for data in jsons
        for row in data['playlists']
        for track in row['tracks']
    )

example of a row

next(read_files(1))

[35000,
 'Marian Hill',
 '1xHQO9GJIW9OXHxGBISYc5',
 'Lovit',
 '2Ag3LUfgpN6ymEMwDOqKdg']

Most popular artists

Despite having a million of playlists, these data may not be enough to produce accurate predictions for every track and artist in the dataset. As a result, I want to limit the task to providing recommendations for the N most popular artists.

To do this, I'll need to count the number of artist mentions in the playlists:

artist_uri_pos = 2
c = collections.Counter(r[artist_uri_pos] for r in read_files(None))

After executing this code, I have the following structure:

c.most_common(1)

[('3TVXtAsR1Inumwj472S9r4', 846937)]

You can visit https://open.spotify.com/artist/3TVXtAsR1Inumwj472S9r4 to see the most popular artist according to the data we have from Spotify.

I should have actually counted unique artist mentions in a playlist here, but in this case it's not a problem since I'm not planning to use all 10K.

Let's save the top artists in a CSV for later use:

df = pd.DataFrame(data=c.most_common(10000), columns=['artist_uri', 'count'])
df.to_csv('common_artist_uri.csv', index=False)

Transforming JSON to CSV

Now, we want to convert the initial JSON files to processed CSV. To do this, we'll start by reading the most popular artists created in the previous step:

artists = pd.read_csv('/kaggle/input/common-spotify-artists/common_artist_uri.csv')
artists = artists.set_index('artist_uri')

I'll also modify the previously introduced parsing function by filtering out artists that are not in the allowlist and playlists that have too few tracks. This will help ensure that the models will produce better predictions:

def read_files(artist_allowlist, num_files=5):

    lookup_path = paths if num_files is None else paths[:num_files]

    jsons = (read_file(path) for path in lookup_path)

    return (
        [
            row['pid'], 
            artist_name(track), 
            artist_uri(track), 
            track_name(track), 
            track_uri(track)
        ]
        for data in jsons
        for row in data['playlists']
        if row['num_tracks'] > 5 # added filter or playlist number of tracks
        for track in row['tracks']
        if artist_uri(track) in artist_allowlist # and filter on popular artists
    )

Since we don't want to write everything in a single huge file, I'll split the rows into batches of 500K.

Here's the code for writing the CSV files:

gen = read_files(artists.index, None)
n = 500_000
i = 1
column_names = ['pid', 'artist_name', 'artist_uri', 'track_name', 'track_uri']

while (True):
    batch_slice = itertools.islice(gen, n)
    df = pd.DataFrame(batch_slice, columns=column_names)
    if df.empty:
        break
    df.to_csv(f'playlist_{i}.csv', index=False)
    i += 1

This leaves us with 5 GB of CSV files containing playlist-track data.

Similar items

In this section, we will compute the similarity between items in the dataset, such as artists or tracks. To do this, we will use three different algorithms: two variants of Collaborative Filtering and one based on Co-Occurrence.

Collaborative Filtering is a popular approach for recommendation systems, as it relies on the ratings or preferences of users to make predictions about what items a given user might like. In our case, we will use the presence of an artist in a playlist as a proxy for a "rating" of 1, indicating that the artist is relevant to the playlist.

The Co-Occurrence based algorithm, on the other hand, will measure the similarity between items based on how often they appear in the same playlist.

Preparing training Dataset

We'll start by creating a training dataset.

The first function reads every CSV file and combines them into a single pandas DataFrame. For artist similarity, we are only interested in the presence of artists in a playlist. Here's the code for reading the raw dataset:

def read_for_artists(allowlist):
    p = Path('/kaggle/input/spotify-playlists-csv')
    paths = list(p.iterdir())
    df = pd.concat((pd.read_csv(f, usecols=['pid', 'artist_uri']) for f in paths), ignore_index=True)
    df = df[df.artist_uri.isin(allowlist)]
    return df

As previously mentioned, I'll further limit the number of artists. By playing with the data, I noticed that it's reasonable to keep around 1K of them, corresponding to 10K playlist mentions. Here's the code for preparing the artist allowlist:

N_10k = 10_000 # leaves ~1K  artists
N_50k = 50_000 # leaves ~200 artists
N_88k = 88_000 # leaves ~100 artists

def get_common_artists_index(mentions):
    common_artists = pd.read_csv('/kaggle/input/common-spotify-artists/common_artist_uri.csv')
    df = common_artists[common_artists['count'] > mentions] 
    df = df.set_index('artist_uri')
    return df.index

To remove noise, I suggest to remove playlists that have too few or too many artists. Having too many artists might mean that this is a playlist with random songs, while having too few may risk losing signals and overfitting the model.

def clean_by_pid(df, p_min, p_max):    
    pid_counts = df.groupby('pid')['pid'].count()

    drop_df = ((pid_counts < p_min) | (pid_counts > p_max))
    drop_index = drop_df[drop_df].index
    df = df[~df.pid.isin(drop_index)]

    return df

Having more than one track from the same artist in a playlist might signal something, but I didn't want to introduce the complexity of a value function, so let's leave only unique mentions of artists in the training set.

def preprocess_artists(df):
    df.drop_duplicates(subset=['pid', 'artist_uri'], inplace=True)
    return df

After performing all necessary manipulations on the dataset, we are left with relevant (pid, artist) pairs.

The first two algorithms we will consider are variants of Collaborative Filtering. Collaborative Filtering relies on the notion of ratings, so for every pair of pid and artist, we’ll assign a rating of 1 to indicate that the artist is relevant to the playlist.

Additionally, it will be useful to index the dataset by the (pid, artist) pair. This will allow us to efficiently union dataset when we add negative data.

def prepare_positive(df, item_name):
    df['rating'] = 1
    df = df.set_index(['pid', item_name])
    return df

Now, let's move on to generating negative labels as promised. To do this, we will need to obtain the unique playlist IDs and unique artists in our dataset. We can then randomly match them a specified number of times to create our negative examples:

def generate_negative_pairs(N, unique_pids, unique_items):    
    items = np.random.choice(unique_items, size=N)
    pids = np.random.choice(unique_pids, size=N)

    df = pd.concat([pids, items], axis=1)
    df['rating'] = 0
    return df

def generate_negative_artists(N, pids, artists):
    df = generate_negative_pairs(N, pids, artists)
    df = df.set_index(['pid', 'artist_uri'])
    return df

To create a dataset, we can generate two random arrays: one for artists and one for playlists, this process can duplicate items if necessary. Then we can join the arrays into a single dataset by adding them as columns. It is important to note that we will need to index this new dataset accordingly.

To finalize the creation of our positive and negative datasets, we join them together and reset the indexed columns:

def join_dfs(df1, df2):
    res = pd.concat([df1, df2], sort=True, copy=False)
    # make sure to free the occupied memory
    df1.drop(df1.index, inplace=True)
    df2.drop(df2.index, inplace=True)
    return res


def reindex(df):
    df.reset_index(inplace=True)
    return df

Here is a summary of the steps we have discussed for creating a training dataset for collaborative filtering:

Obtain the list of the most popular artists to use as an allowlist.
Read in the raw data.
Clean up the noisy playlists.
Generate the negative dataset by randomly generating pairs of playlist IDs and artists that do not already exist in the dataset.
Assign positive and negative labels to the data accordingly.
Join datasets together into a single dataset.

Here is a function that combines these steps into a single flow for creating the training dataset for collaborative filtering:

def artist_training_set(n_common, pid_min_max, negative_ratio):
    # read most popular artists
    allowed_artists = get_common_artists_index(n_common)
    # read and preprocess positive labels data
    df = read_for_artists(allowed_artists)
    df = clean_by_pid(df, pid_min_max[0], pid_min_max[1])
    df = preprocess_artists(df)
    # generate negative labels
    df2 = generate_negative_artists(math.ceil(negative_ratio * len(df)), df.pid.drop_duplicates(), allowed_artists.to_series())
    # prepare negative labels
    df = prepare_positive(df, 'artist_uri')
    # finilise resulting dataset
    df = join_dfs(df, df2)
    return reindex(df)

If we execute this function, the resulting data will look this way:

training set

training_df = artist_training_set(n_common=N_10k,                                                     
                                  users_min_max=(5, 40),                                               
                                  negative_ratio=1,)

pid     artist_uri              rating
896625  7DMveApC7UnC2NPfPvlHSU  1
896625  23fqKkggKUBHNkbKtXEls4  1
896625  45eNHdiiabvmbp4erw26rg  1
896625  0L8ExT028jH3ddEcZwqJJ5  1
896625  2wY79sveU1sp5g7SokKOiI  1
... ... ... ...
957041  70BYFdaZbEKbeauJ670ysI  0
587936  4Z8W4fKeB5YxbusRsdQVPb  0
656290  0uq5PttqEjj3IH1bzwcrXF  0
   757  2RhgnQNC74QoBlaUvT4MEe  0
677025  2NhdGz9EDv2FeUw6udu2g1  0

Training Models

Now that the training dataset is prepared, we can experiment with various item similarity algorithms. As I mentioned, we will begin with Collaborative Filtering, which has two variations: Matrix Factorization and Collaborative Filtering with Multi-Layer Perceptron (MLP). Both methods follow a similar process: we create a vector representation for each artist, treat this vector as a location in N-dimensional space, and find the K closest artists to a given artist. The main difference lies in how these vectors are computed.

I have tried to make this article as practical as possible. If you are interested in learning more about the underlying theory behind these algorithms, I recommend the following resources:

The Netflix Tech Blog has an in-depth article on the collaborative filtering algorithms used by Netflix.
This tutorial from the DataCamp blog provides a practical introduction to implementing collaborative filtering in Python:
This course from Coursera, "Recommendation Systems and Deep Learning in Python," covers collaborative filtering in detail and includes hands-on exercises:
Article with theory behind how Collaborative Filtering algorithms work.

We will be utilizing the fastai library, so let's go ahead and create a Dataset Loader that retrieves data from the created DataFrame:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
dls = CollabDataLoaders.from_df(df, item_name='artist_uri', rating_name='rating', username='pid', bs=4096, device=device)

Internally, the Dataset Loader will load the data into a device, encode URIs, shuffle and batch the data, and create a validation subset. Let's make sure it has got the correct data:

data batch

dls.show_batch()
pid     artist_uri              rating
902723  61lyPtntblHJvA7FMMhi7E  1
28782   7iZtZyCzp3LItcw1wtPI3D  1
41514   11Y54BxlxC3UIAUkU2eadQ  0
682607  4EVpmkEwrLYEg6jIsiPMIb  0
166133  3iri9nBFs9e4wN7PLIetAw  0
886234  7jefIIksOi1EazgRTfW2Pk  1
688805  1f5GqyOPo0CkotzzRwviBu  1
645015  1yAwtBaoHLEDWAnWR87hBT  1
14524   0n94vC3S9c3mb2HyNAOcjg  0
909486  7H55rcKCfwqkyDFH9wpKM6  1

Matrix Factorization Collaborative Filtering

I would like to begin by using the Matrix Factorization Collaborative Filtering item similarity method. Let's create and train the model:

learn = collab_learner(dls, n_factors=50, y_range=(0, 1))
epochs = 5
learning_rate = 5e-3
weight_decay = 0.1
learn.fit_one_cycle(epochs, learning_rate, wd=weight_decay)

Here are model metrics:

epoch	train_loss	valid_loss	time
0	0.217431	0.217432	01:46
1	0.216207	0.216138	01:47
2	0.193768	0.194263	01:48
3	0.171074	0.176515	01:47
4	0.160071	0.174029	01:49

One of the most important parameters when creating the model is the n_factors argument. This parameter specifies the size of the N-dimensional vector. A larger value allows the model to capture more concepts, but it will also require more data to be trained effectively.

After completing the model fitting process, we can check the embeddings of the items (artists):

learn.weight(['61lyPtntblHJvA7FMMhi7E'], is_item=True)

tensor([[ 0.1625, -0.3502, -0.5267, -0.4166, -0.5810,  0.8513, -0.4677,  0.8700,
         -0.4015,  0.9194,  0.4940,  0.5556, -0.6228, -0.0939, -0.1931,  0.3866,
          0.5970, -0.7663,  0.7816,  0.3170, -0.8104,  0.1340, -0.5477,  0.9246,
         -0.2875, -0.0404,  0.0587,  0.8526, -0.3287, -0.6467,  0.2383,  0.0480,
          0.3321, -0.5684, -0.6459,  0.6053,  0.2226,  0.2793, -0.3024,  0.0718,
         -0.4428, -0.1309, -0.2367, -1.0821,  0.3814, -0.0182, -0.1722,  0.4477,
          0.1380,  0.3781]])

These vectors may not be interpretable to humans, but we can use them to find similar items by calculating the cosine (or some another) distance between two vectors.

Now, we can construct a similarity matrix using another popular algorithm: a KNN classifier. To do this, we will use the sklearn library.

The process is as follows: train the KNN model and then, for each artist in the dataset, retrieve the N most similar artists.

Here is how we train the model:

weights = learn.model.i_weight.weight.cpu().detach().numpy()
items = dls.classes['artist_uri'].items
knn = KNeighborsClassifier(n_neighbors=20, algorithm='kd_tree')
knn.fit(weights, items)

Here is how we can retrieve similar items:

item = learn.weight(['3TVXtAsR1Inumwj472S9r4'], is_item=True).cpu().numpy()
knn.kneighbors(item, return_distance=False)

array([[ 520, 1079,  973,  412,   91,  268, 1133,  952,   36,   50,  816,
        1137,  886,  670,  775,   48,  644,  317,  346,  148]])

As you can see, the returned items are not URIs but indices, so let's convert them to URIs using the DatasetLoader:

see similar items

[dls.classes['artist_uri'].items[i] for i in neighb[0,1:]]

['7CCjtD0hCK005Bvg2WG1a7',
 '6d47Z08T4snK50HgTEHo5Z',
 '2jku7tDXc6XoB6MO2hFuqg',
 '0cGUm45nv7Z6M6qdXYQGTX',
 '1tqhsYv8yBBdwANFNzHtcr',
 '7kFfY4UjNdNyaeUgLIEbIF',
 '6TIYQ3jFPwQSRmorSezPxX',
 '0HjYETXAvcL6mzaKjAmH2K',
 '0NWbwDZY1VkRqFafuQm6wk',
 '5bgfj5zUoWpyeVatGDjn6H',
 '7mDU6nMUJnOSY2Hkjz5oqM',
 '67nwj3Y5sZQLl72VNUHEYE',
 '4TsHKU8l8Wq7n7OPVikirn',
 '5G9kmDLg3OeUyj8KVBLzbu',
 '0NIIxcxNHmOoyBx03SfTCD',
 '4LLpKhyESsyAXpc4laK94U',
 '27mFvqQj8KpjmdKIqcw1mG',
 '2HPaUgqeutzr3jx5a9WyDV',
 '137W8MRPWKqSmrBGDBFSop']

As previously, you can follow https://open.spotify.com/artist/URI link to see artists,
i.e. https://open.spotify.com/artist/7CCjtD0hCK005Bvg2WG1a7

Multi-Layer Perceptron Collaborative Filtering

Now I would like to try a different algorithm for computing embeddings: the Multi-Layer Perceptron.

The data preparation step is the same, as well as instantiation of a DataLoader. The main difference lies in the model definition, with fastai we can do this:

learn = collab_learner(dls, use_nn=True, layers=[50, 50], y_range=(0, 1))

In this case, we will let the fastai library to decide on the size of the embeddings and only specify the NN layers.

Let's train the model using the same parameters:

learn.fit_one_cycle(5, 5e-3, wd=0.1)

Model metrics:

epoch	train_loss	valid_loss	time
0	0.167988	0.167022	07:35
1	0.142461	0.144335	07:45
2	0.127324	0.134424	07:27
3	0.099502	0.128019	07:14
4	0.056717	0.137510	07:13

Training and using the KNN model is slightly different due to internal structure difference in fastai embeddings model:

weights = learn.model.embeds[1].weight.cpu().detach().numpy()
items = dls.classes['artist_uri'].items
knn = KNeighborsClassifier(n_neighbors=20, algorithm='kd_tree')
knn.fit(weights, items)

And this is how we use it:

item_idx = dls.classes['artist_uri'].o2i['3TVXtAsR1Inumwj472S9r4']
item = learn.model.embeds[1].weight[[item_idx]].cpu().detach().numpy()
neighb = knn.kneighbors(item, return_distance=False)

array([[ 520,  220,  780,  846,   90, 1081,  387, 1054,   57,  151,  754,
         894, 1000,  104,  148, 1029, 1128,  199,  156,  300]])

converted artist URIs

['1Xyo4u8uXC1ZmMpatF05PJ',
 '5K4W6rqBFWDnAN6FQUkS6x',
 '5pKCCKE2ajJHZ9KAiaK11H',
 '0c173mlxpT3dSFRgMO8XPh',
 '7CajNmpbOovFoOoasH2HaY',
 '2YZyLoL8N0Wb9xBt1NhZWg',
 '73sIBHcqh3Z3NyqHKZ7FOL',
 '0QHgL1lAIqAw0HtD7YldmP',
 '13ubrt8QOOCPljQ2FL1Kca',
 '55Aa2cqylxrFIXC767Z865',
 '69GGBxA162lTqCwzJG5jLp',
 '6l3HvQ5sa6mXTsMTB19rO5',
 '0fA0VVWsXO9YnASrzqfmYu',
 '137W8MRPWKqSmrBGDBFSop',
 '6vWDO969PvNqNYHIOW5v0m',
 '7iZtZyCzp3LItcw1wtPI3D',
 '1RyvyyTE3xzB2ZywiAwp0i',
 '163tK9Wjr9P9DmM0AVK7lm',
 '246dkjvS1zLTtiykXe5h60']

Storing similarities

We are now able to query similar items for a given item, so the next step is to compute a similarity matrix and store it in a file which we will use in the second part of this tutorial.

I plan to use the following structure and write it in a plain text:

seed_item1 similar_11 similar_12 similar_13 ...
seed_item2 similar_21 similar_22 similar_23 ...
...
seed_itemN similar_N1 similar_N2 similar_N3 ...

You may find this structure too simple, but in the next part I'll explain why we need it this way.

The first step is to compute the similarity matrix. To do this, we will need a few helper functions, the most notable of them is get_similar, which queries similar items for a given batch of items. The other functions should be self-explanatory:

def get_emb(learn, items):
    if isinstance(learn.model, EmbeddingNN):
        emb = learn.model.embeds[1].weight[items]
    else:
        emb = learn.model.i_weight.weight[items]
    return emb.cpu().detach().numpy()

def get_similar(knn, learn, items):
    emb = get_emb(learn, items)
    return knn.kneighbors(emb, return_distance=False)

def to_uri(dls, i):
    return dls.classes['artist_uri'].items[i]

def total_items(dls):
    return len(dls.classes['artist_uri'].items)

We can now construct a generator that will compute the similarity matrix, but won't store it in memory. This generator operates in batches when querying the KNN model, but later flattens the batch and iterates over a single row. Querying the KNN model in batches, hopefully, increases the throughput as it can benefit from CPU caches and vector instructions.

def compute_similarity(learn, dls, knn, bs):
    items = total_items(dls)

    stream = (
        (idx, similar_arr)
        for start_idx in range(0, items, bs)
        for (idx, similar_arr) in zip(
            range(start_idx, start_idx+bs),
            get_similar(knn, learn, slice(start_idx, start_idx+bs)),
        )
    )

    return (
        (to_uri(dls, idx), [to_uri(dls, i) for i in similar if i != idx])
        for idx, similar in stream
    )

The final piece of the puzzle is a loop that goes through the generator and writes the rows to a file:

def store_similarity_pairs(stream, name):
    with open(name, 'w') as f:
        for k, v in stream:
            f.write(k)
            f.write(' ')
            f.write(' '.join(v))
            f.write('\n')

With that, we have completed the Collaborative Filtering part. Let's move on to another similarity algorithm.

Co-ocurrence Counting

The next similarity algorithm is called Co-ocurrence Counting. Its idea is very simple: it counts the co-occurrences of items within a given window of events. Co-occurrences that happen often (more than the specified threshold) can be considered similar. In the case of Spotify dataset, instead of windows, we can count the co-occurrences of artists within a playlist.

For this algorithm, we are only interested in pairs of playlist ID and artist, and we don't need negative data or ratings.

One good aspect is that we can reuse most of the data processing logic from the previous example. This is how we read and process the data:

allowed_artists = get_common_artists_index(N_10k)
df = read_for_artists(allowed_artists)
df = clean_by_pid(df, 5, 30)
df = preprocess_artists(df)

pid     artist_uri
896625  7DMveApC7UnC2NPfPvlHSU
896625  23fqKkggKUBHNkbKtXEls4
896625  45eNHdiiabvmbp4erw26rg
896625  0L8ExT028jH3ddEcZwqJJ5
896625  2wY79sveU1sp5g7SokKOiI
...     ...
59361   2Q0MyH5YMI5HPQjFjlq5g3
59361   3XHO7cRUPCLOr6jwp8vsx5
59361   42vg2T0Xg9yPaAgogJzoQH
59361   0YhUSm86okLWldQVwJkLlP
59364   3IYUhFvPQItj6xySrBmZkd

The number of unique artist pairs can increase by several orders of magnitude from the initial dataset size. In such case, using a pair of strings (string, string) is very inefficient and will have a large memory footprint. However, we can encode (factorize) the artist IDs into unique integers and store pairs of integers (int, int) instead. This will reduce the memory footprint by a factor of 10.

df['artist_idx'], i2a = df.artist_uri.factorize()

pid artist_uri  artist_idx
896625  7DMveApC7UnC2NPfPvlHSU  0
896625  23fqKkggKUBHNkbKtXEls4  1
896625  45eNHdiiabvmbp4erw26rg  2
896625  01aC2ikO4Xgb2LUpf9JfKp  3
896625  0L8ExT028jH3ddEcZwqJJ5  4
... ... ...
59361   42vg2T0Xg9yPaAgogJzoQH  1348
59361   0YhUSm86okLWldQVwJkLlP  1305
59361   42crL07E4WPfVovyUtMpvC  311
59364   2qFr8w5sWUITRlzZ9kZotF  1330
59364   3IYUhFvPQItj6xySrBmZkd  489

Now, we can generate all present combinations of artist pairs from every playlist:

def coco_pairs(df):
    groupped = df.groupby('pid')['artist_idx'].apply(list)
    return ((item1, item2) if item1 < item2 else (item2, item1)
            for _, arr in groupped.items()
            for item1, item2 in itertools.combinations(arr, r=2))

The logic is as follows: we group the dataset by playlist ID and aggregate the artists into a list. Then, for each list, generate combinations, ensuring that the pair is ordered so that (A, B) is the same as (B, A). This is because the order of the pair does not matter. I am using a generator because storing every pair before aggregating them would almost certainly not fit in memory. The output of this generator is a single artist co-occurrence pair from a playlist.

We can now use Python's Counter class to count the co-occurrences of pairs:

c = Counter(coco_pairs(df))

Let's check the most common co-occurrences:

c.most_common(3)

[((77, 434), 9175), ((30, 77), 8889), ((77, 97), 8474)]

And convert the encoded numerical IDs back to artist URIs using the previously saved index:

{i: i2a[i] for i in [77, 434, 30]}

{77: '3TVXtAsR1Inumwj472S9r4',
 434: '5K4W6rqBFWDnAN6FQUkS6x',
 30: '2YZyLoL8N0Wb9xBt1NhZWg'}

We will create a DataFrame with three columns: item1, item2, and count. Then, group the DataFrame by item1 and get the item2 values that are similar to it, so the item order now matters. Therefore, we will need to duplicate the counter and store both the (A, B) pair and the (B, A) pair:

o1 = ((i1, i2, c[i1, i2]) for (i1, i2) in c)
o2 = ((i2, i1, c[i1, i2]) for (i1, i2) in c)

coco = pd.DataFrame(itertools.chain(o1, o2), columns=['item1', 'item2', 'count'])

item1   item2   count
869     1623    8
535     869     8
536     869     10
224     869     131
869     1279    6
...     ...     ...
811     120     1
251     120     1
1119    120     1
1352    120     1
1582    1566    1

Let's apply a threshold and see what similarities the co-occurrence counting model yields:

similarities = res[res['count'] > 1000]

item1   item2   count
142     714     1085
236     280     1531
142     236     1046
280     471     1105
142     280     1349
...     ...     ...
450     252     1085
219     218     1379
450     198     1116
677     1       1162
677     259     1347

And convert encoded ints for the first result:

i2a[142], i2a[714]
('1dfeR4HaWDbWqFHLkxsg1d', '26bcq2nyj5GB7uRr558iQg')

This might be a final solution, but we're overlooking an important aspect when we filter pairs in this way: less popular artists will always have fewer co-occurrences, yet they may still be similar to each other. To address this, we can add a normalisation step.

First, let's count the total number of mentions of each artist:

item_counts = res.groupby('item1')['count'].sum()

item1
0        71840
1       193403
2        29229
3       106297
4       182767
         ...  
1168     18829
1169      9948
1170     24442
1171     12039
1172      8835

Then, we can add a normalised count to each pair, which is computed as pair count diveded by a sum of total counts of individual items:

pair_count = res['count']
item1_total = res.item1.map(item_counts)
item2_total = res.item2.map(item_counts)
res['normalised_count'] = pair_count / (item1_total + item2_total)

This normalised count penalises popular artists and boosts less popular.

item1   item2   count   normalised_count
320     714     10      0.000180
236     714     178     0.001518
280     714     202     0.001310
471     714     348     0.003135
714     1071    224     0.003058
...     ...     ...     ...
1088    616     1       0.000015
616     492     1       0.000016
616     517     1       0.000021
710     660     1       0.000030
591     219     1       0.000012

If we apply the threshold to the normalised count, we can see how different pairs that were previously missing or present have been affected:

similarities = res[res.normalised_count > 0.004]

item1   item2   count   normalised_count
472     714     312     0.004030
142     714     1085    0.005807
647     714     605     0.005391
236     280     1531    0.008224
236     471     915     0.006402
...     ...     ...     ...
1172    1169    111     0.005910
1169    1166    89      0.005227
1135    1093    206     0.010754
1137    744     178     0.004037
1169    696     78      0.004469

For example, the last pair (1169, 696) which has only 78 co-occurrences would have certainly been filtered out previously. These artist URIs are shown below, you can judge for yourself how similar they are:

i2a[1169], i2a[696]
('4sTFGCigAQIUiEy8wSSQNF', '1YSA4byX5AL1zoTsSTlB03')

Storing similarities

Let's now create the similarity matrix and write it to a file in the same format as before:

similar_items = similarities.groupby('item1').item2.apply(list)

item1
0                                                                     [1, 408]
1                [200, 252, 259, 57, 202, 408, 896, 66, 467, 285, 334, 677, 0]
2                                               [333, 284, 331, 858, 330, 674]
3                                               [234, 276, 648, 767, 238, 631]
4                [126, 204, 228, 998, 506, 167, 131, 326, 1122, 103, 192, 288]
                                         ...                                  
1169    [1172, 1167, 862, 863, 864, 1165, 779, 726, 731, 636, 1059, 1166, 696]
1170                                      [608, 145, 1038, 473, 989, 147, 387]
1171                [700, 746, 1074, 744, 602, 1154, 114, 745, 1130, 747, 611]
1172                                        [696, 1091, 1166, 744, 1165, 1169]

One downside of this algorithm is that it doesn't provide a guaranteed number of recommendations. For example, item {idx=0} has only 2 similar items, while item {idx=1169} has 13, etc.

And finally we can write them to the text file:

fname = 'coco.txt'
with open(fname, 'w') as f:
    for k, v in similar.items():
        f.write(i2a[k])
        f.write(' ')
        v = [i2a[i] for i in v[:20]]
        f.write(' '.join(v))
        f.write('\n')

Since this article is getting quite long, I'll leave it as an exercise for you to model the track similarities.

This concludes the section on finding similar items. We overviewed how to build similarity matrices using Collaborative Filtering and Co-occurence counting algorithms.
You can find an example matrix files in this GitHub Gist.

You can also try other algorithms that produce embeddings, such as Word2Vec, or even possibly content-based algorithms that take artist or track features into account.
Share your artist or track similarity results in the comments!

In the next chapter, I'll cover how we can use similarities to retrieve recommendations. If you don't want to miss it, be sure to hit the follow button.

Thank you for reading this far. I hope it was interesting!

Top comments (1)

Lana • Jan 11 '23

Waiting for the next chapter, thank you