In this article, you will learn what cosine similarity is and how to calculate it using Python.
What is cosine similarity?
Cosine similarity is a metric used to measure how similar two entities are, irrespective of their size. It measures the cosine of the angle between two vectors projected in a multi-dimensional space.
In this context, the two vectors I am talking about are arrays of numbers (like a list in Python), and the angle between them is a measure of how similar they are. The more closely the two vectors point in the same direction, the smaller the angle and the closer the cosine is to 1; the further apart they point, the larger the angle and the smaller the cosine. This metric is a measurement of orientation (not magnitude).
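To make the "orientation, not magnitude" point concrete, here is a minimal sketch. The helper cos_sim and the sample vectors are my own illustration (not the function built later in this article), and it assumes both vectors have the same length and neither is all zeros.

```python
from math import hypot, isclose

def cos_sim(a, b):
    """Hypothetical helper: dot product divided by the product of the magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (hypot(*a) * hypot(*b))

base = [1.0, 2.0]
scaled = [10.0, 20.0]  # same direction as base, ten times the magnitude
other = [2.0, 1.0]     # same magnitude as base, different direction

same_direction = cos_sim(base, scaled)
different_direction = cos_sim(base, other)

print(isclose(same_direction, 1.0))       # scaling a vector does not change the similarity
print(isclose(different_direction, 0.8))  # changing the direction does
```

Stretching a vector leaves the similarity at 1, while rotating it lowers the value even though the length stays the same.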
Now, if the vectors are at a 90-degree angle, the data sets are unrelated, giving a cosine similarity of 0. So, in short, cosine similarity is a way of measuring how related two sets of data are. The similarity ranges from -1 to 1, where:
- 1 means the vectors point in exactly the same direction (identical orientation, whatever their magnitudes)
- 0 means the vectors are perpendicular (unrelated, not similar)
- -1 means the vectors are diametrically opposed (completely dissimilar)
In the image above, you can visually see the cosine similarity, and its classification for two distinct vectors.
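The three reference values of 1, 0 and -1 can also be checked numerically. This is only a sketch: the cosine helper and the two-dimensional sample vectors are hypothetical, chosen so that the angles are easy to picture.

```python
from math import sqrt, isclose

def cosine(a, b):
    """Hypothetical helper for equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same = cosine([1, 2], [2, 4])           # 0 degrees apart
perpendicular = cosine([1, 0], [0, 3])  # 90 degrees apart
opposite = cosine([1, 2], [-1, -2])     # 180 degrees apart

print(isclose(same, 1.0), perpendicular == 0, isclose(opposite, -1.0))
```

I compare with isclose because floating-point rounding can leave the result a hair away from exactly 1 or -1.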
If you want to learn more about vectors, I have an article explaining them in more detail: What is a vector embedding?
Cosine similarity formula
The mathematical formula for calculating cosine similarity is:
$$\cos(\theta) = \frac{a \cdot b}{\|a\| \, \|b\|}$$
Where:
- a and b are our vectors
- a · b is the dot product of a and b
- ||a|| and ||b|| are the magnitudes (lengths) of the vectors
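As a worked example, take a = [1, 2, 3] and b = [2, 3, 4], the same vectors used in the usage section further down. Assuming the usual definition (the dot product divided by the product of the two magnitudes), the arithmetic goes like this:

```latex
\begin{aligned}
a \cdot b &= (1)(2) + (2)(3) + (3)(4) = 2 + 6 + 12 = 20 \\
\|a\| &= \sqrt{1^2 + 2^2 + 3^2} = \sqrt{14} \\
\|b\| &= \sqrt{2^2 + 3^2 + 4^2} = \sqrt{29} \\
\cos(\theta) &= \frac{a \cdot b}{\|a\| \, \|b\|} = \frac{20}{\sqrt{14}\,\sqrt{29}} = \frac{20}{\sqrt{406}} \approx 0.9926
\end{aligned}
```

A value this close to 1 tells us the two vectors point in almost the same direction.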
Calculating it with Python
The Python function cosine_similarity(vector1: list[float], vector2: list[float]) -> float takes two vectors as input and calculates their cosine similarity.
Let's see the full code:
from math import sqrt, pow

def cosine_similarity(vector1: list[float], vector2: list[float]) -> float:
    """Returns the cosine of the angle between two vectors."""
    # the cosine similarity between two vectors is the dot product of the
    # two vectors divided by the product of their magnitudes
    dot_product = 0
    magnitude_vector1 = 0
    magnitude_vector2 = 0
    vector1_length = len(vector1)
    vector2_length = len(vector2)
    if vector1_length > vector2_length:
        # fill vector2 with 0s until it is the same length as vector1 (required for dot product)
        vector2 = vector2 + [0] * (vector1_length - vector2_length)
    elif vector2_length > vector1_length:
        # fill vector1 with 0s until it is the same length as vector2 (required for dot product)
        vector1 = vector1 + [0] * (vector2_length - vector1_length)
    # dot product calculation
    for i in range(len(vector1)):
        dot_product += vector1[i] * vector2[i]
    # vector1 magnitude calculation (sum of squares; the square root is taken below)
    for i in range(len(vector1)):
        magnitude_vector1 += pow(vector1[i], 2)
    # vector2 magnitude calculation (sum of squares; the square root is taken below)
    for i in range(len(vector2)):
        magnitude_vector2 += pow(vector2[i], 2)
    # final magnitude calculation
    magnitude = sqrt(magnitude_vector1) * sqrt(magnitude_vector2)
    # return cosine similarity
    return dot_product / magnitude
The code begins by initializing the variables for dot product and magnitudes of the vectors. It then checks the lengths of the two input vectors and pads the shorter one with zeros so that they have the same length. This step is necessary for calculating the dot product.
Then, the dot product of the two vectors and the magnitude of each vector are calculated using for loops. Finally, the cosine similarity is calculated by dividing the dot product by the product of the magnitudes.
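The same steps (pad with zeros, compute the dot product, compute the magnitudes, divide) can also be written more compactly. This is a sketch, not a replacement for the function above: the name cosine_similarity_compact is hypothetical, and the zero-vector check is an extra I added, because dividing by a magnitude of 0 would otherwise raise a ZeroDivisionError.

```python
from itertools import zip_longest
from math import sqrt, isclose

def cosine_similarity_compact(vector1, vector2):
    """Hypothetical compact variant: zero-pads the shorter vector, then applies the formula."""
    # zip_longest pairs up the components and fills the shorter vector with 0.0
    pairs = list(zip_longest(vector1, vector2, fillvalue=0.0))
    dot_product = sum(x * y for x, y in pairs)
    magnitude1 = sqrt(sum(x * x for x, _ in pairs))
    magnitude2 = sqrt(sum(y * y for _, y in pairs))
    if magnitude1 == 0 or magnitude2 == 0:
        # assumption: treat an all-zero (or empty) vector as an error
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product / (magnitude1 * magnitude2)

result = cosine_similarity_compact([1, 2, 3], [2, 3, 4])
padded = cosine_similarity_compact([1, 2, 3], [1, 2])  # second vector is treated as [1, 2, 0]

print(isclose(result, 20 / sqrt(406)))
print(isclose(padded, 5 / sqrt(70)))
```

Zero-padding means a missing component counts as 0. That is a design choice carried over from the function above, so in your own project you may prefer to raise an error when the lengths differ.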
Usage
You can use this function in your code as follows:
vector1 = [1, 2, 3]
vector2 = [2, 3, 4]
similarity = cosine_similarity(vector1, vector2)
print("The cosine similarity between vector1 and vector2 is: ", similarity)
Conclusion
If you want a more complete explanation of cosine similarity and the code, I published a video on YouTube that teaches it step by step.