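Cosine similarity is a measure between two one-dimensional vectors of the same length that tells us how closely their directions match. Its value ranges from -1 to 1; for vectors with no negative components (word counts, for example) it stays between 0 and 1. The formula is below: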
Cosine Similarity = (A . B) / (||A||.||B||)
Here (A . B) is the dot product of vectors A and B: the sum of the element-by-element products of A and B. For example,
A = [1, 2, 3]
B = [4, 5, 6]
A . B
>> 32
# (1 * 4) + (2 * 5) + (3 * 6) = 32
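To make the arithmetic concrete, here is a minimal plain-Python sketch of the dot product. The function name `dot` is just an illustrative choice, not something from a library:

```python
def dot(a, b):
    """Sum of the element-by-element products of two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

A = [1, 2, 3]
B = [4, 5, 6]
print(dot(A, B))  # 32, the same value as the worked example
```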
Meanwhile, ||A|| is the notation for the L2 norm of a vector. The L2 norm is the length of a vector in Euclidean space: for a vector with N components, think of it as the length of the line you would get by drawing the vector on an N-dimensional graph. To compute it, sum the squares of the values in each dimension and take the square root of the sum.
A = [1, 2, 3]
norm(A)
>> 3.7416573
# (1^2 + 2^2 + 3^2)^0.5 = 3.7416573
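The same calculation can be sketched by hand in plain Python. The helper name `l2_norm` is hypothetical and only here for illustration:

```python
import math

def l2_norm(v):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in v))

A = [1, 2, 3]
print(l2_norm(A))  # roughly 3.7416573, i.e. the square root of 14
```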
Numpy has a bunch of helpers so we don't need to run all of these calculations manually:
import numpy as np
from numpy.linalg import norm
# define two arrays of the same length
A = np.array([1,2,3,4])
B = np.array([1,2,3,5])
# cosine similarity
cosine = np.dot(A, B) / (norm(A) * norm(B))
print("cosine similarity:", cosine)
>> cosine similarity: 0.9939990885479664
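A cosine similarity score near 1 means the vectors point in nearly the same direction, regardless of their lengths. A score of 0 means they are orthogonal, which is usually read as unrelated, and a score near -1 means they point in opposite directions.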
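To see the ends of that range, here is a small sketch that wraps the calculation in a function and tries it on three pairs of vectors. The name `cosine_similarity` and the zero-vector check are my own additions for illustration (the formula divides by the norms, so it is undefined when either vector is all zeros):

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = norm(a) * norm(b)
    if denom == 0:
        # Assumption: we choose to raise here rather than return nan
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)

print(cosine_similarity([1, 2, 3], [2, 4, 6]))     # same direction: approximately 1
print(cosine_similarity([1, 0], [0, 1]))           # orthogonal: 0
print(cosine_similarity([1, 2, 3], [-1, -2, -3]))  # opposite direction: approximately -1
```

Note that [1, 2, 3] and [2, 4, 6] score about 1 even though one is twice as long as the other: only direction matters, not magnitude.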