Dot products, cosine similarity, text vectors

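Cosine similarity measures how similar two vectors (one-dimensional arrays of the same length) are, based on the angle between them. In general the value ranges from -1 to 1; for vectors with no negative components, such as word counts, it stays between 0 and 1. The formula is below: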

Cosine Similarity = (A . B) / (||A|| * ||B||)

Where (A . B) is the dot product of vectors A and B. The dot product is the sum of the element-by-element products of A and B. For example,

A = [1, 2, 3]
B = [4, 5, 6]


A . B
>> 32
# (1 * 4) + (2 * 5) + (3 * 6) = 32
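
The same calculation can be written out in plain Python. This is only a minimal sketch to show what the dot product does step by step; the variable name `dot` is chosen here for illustration.

```python
A = [1, 2, 3]
B = [4, 5, 6]

# Pair up matching elements, multiply each pair, then add up the products.
dot = sum(a * b for a, b in zip(A, B))

print(dot)  # 32
```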

Meanwhile, ||A|| denotes the L2 norm of a vector. The L2 norm is the length of the vector in Euclidean space: if you drew a vector with N elements as a line from the origin on an N-dimensional graph, the L2 norm would be the length of that line. To compute it, sum the squares of the values in each dimension and take the square root of the sum.

A = [1, 2, 3]

norm(A)

>> 3.7416573
# (1^2 + 2^2 + 3^2)^0.5 = 3.7416573
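
Spelled out with only the standard library, the norm calculation might look like the sketch below. The name `l2` is just an illustrative choice.

```python
import math

A = [1, 2, 3]

# Square each element, sum the squares, then take the square root.
l2 = math.sqrt(sum(a * a for a in A))

print(round(l2, 7))  # 3.7416574
```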

NumPy has helpers for both steps, so we don't need to run these calculations manually:

import numpy as np
from numpy.linalg import norm

# define two arrays of the same length
A = np.array([1,2,3,4])
B = np.array([1,2,3,5])

# cosine similarity
cosine = np.dot(A, B) / (norm(A) * norm(B))
print("cosine similarity:", cosine)

>> cosine similarity: 0.9939990885479664

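A cosine similarity near 1 means the two vectors point in almost the same direction, regardless of their lengths. A value of 0 means they are orthogonal (they share nothing in common), and a value near -1 means they point in opposite directions.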
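
To make these readings concrete, here is a small sketch that checks three easy cases. The helper `cosine_similarity` is a hypothetical wrapper around the NumPy expression used earlier, not a NumPy function, and the test vectors are made up for illustration.

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(a, b):
    """Hypothetical helper: dot product divided by the product of the L2 norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.dot(a, b) / (norm(a) * norm(b))

# Same direction, different lengths: similarity is (approximately) 1.
same = cosine_similarity([1, 2, 3], [2, 4, 6])

# Perpendicular vectors: similarity is 0.
orthogonal = cosine_similarity([1, 0], [0, 1])

# Opposite directions: similarity is (approximately) -1.
opposite = cosine_similarity([1, 2, 3], [-1, -2, -3])

print(same, orthogonal, opposite)
```

Note that the formula is undefined when either vector is all zeros, because its norm is 0 and the division has no meaningful result.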