Abel Peter
Understanding Vector Metrics (Cosine Similarity, Euclidean Distance, Dot Product)

Euclidean distance

Euclidean distance is a measure of the straight-line distance between two points in a plane or space. It calculates the geometric distance between two vectors by summing the squared differences between their corresponding elements and taking the square root of the result. In other words, it measures the length of the line connecting two points in a multidimensional space. The Euclidean distance is commonly used in various applications, such as image similarity search, where the goal is to find the most similar images based on their features. When using the Euclidean distance metric, the most similar results are those with the lowest distance score.
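As a minimal sketch, the calculation described above (sum the squared differences, then take the square root) can be written in plain Python. The function name and the toy vectors are illustrative, not from any particular library:

```python
import math

def euclidean_distance(a, b):
    # Sum the squared differences of corresponding elements,
    # then take the square root of the result.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two toy 3-dimensional vectors
v1 = [1.0, 2.0, 3.0]
v2 = [4.0, 6.0, 3.0]
print(euclidean_distance(v1, v2))  # 5.0
```

The lower the returned distance, the more similar the two vectors are, so a similarity search ranks results by ascending distance.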

_Consider the triangle below_

Assuming the y-axis and x-axis represent vectors, the hypotenuse is the Euclidean distance between the two vectors.

The triangle is a very simple vector representation; if a more complex vector is plotted, the Euclidean distance intuition still holds. The smaller the distance, the more similarity we can infer between the vectors.

If we represent the hypotenuse as a vector from the origin, we get the image below.

NOTE: the vectors X and Y are really high-dimensional vectors in hyperspace, but for illustration we plot them along the x and y axes; the hypotenuse (the Euclidean distance) is the subject here.

_A representation of vectors with a large Euclidean distance._

The Euclidean distance between these two vectors is large, so we can infer that vectors X and Y are not very similar compared to the vectors below.

_A representation of vectors with a very small Euclidean distance._

Euclidean distance can be useful in various scenarios, such as measuring the distance between two locations on a map, calculating the similarity between two images based on their pixel values, or determining the difference between two sets of data points in a scientific experiment.

Cosine similarity

Cosine similarity is a measure of similarity between two vectors in a high-dimensional space. It determines the cosine of the angle between the vectors, which represents their orientation or direction. Imagine you have two vectors (like arrows) pointing in different directions. Cosine similarity tells us how much these vectors align or point in the same direction.

The advantage of using cosine similarity is that it provides a normalized score ranging from -1 to 1, where 1 indicates identical directions, 0 indicates orthogonality (no similarity), and -1 indicates completely opposite directions. Illustrated below.

_Orthogonality between the vectors (0)._

_Both vectors are on the x-axis (1)._

In a cosine-similarity search, these vectors score as very similar. With real data the vectors obviously won't be this close, but the score still shows which vectors are nearest to the query vector.

_Completely opposite directions (-1)._

Cosine similarity can be useful in text analysis. Suppose you have two documents, and you want to find out how similar they are in terms of their word frequencies. By representing each document as a vector where each element represents the frequency of a specific word, you can calculate the cosine similarity between the two vectors to measure their similarity. This can be used for tasks like document clustering, plagiarism detection, or recommendation systems.
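To make the word-frequency example concrete, here is a small sketch in plain Python: the cosine similarity is the dot product of the two vectors divided by the product of their lengths. The document vectors below are made-up frequency counts, chosen so one pair points in the same direction and another pair is orthogonal:

```python
import math

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical word-frequency vectors for three documents
doc1 = [3, 0, 1]
doc2 = [6, 0, 2]   # same direction as doc1, just twice the counts
doc3 = [0, 4, 0]   # shares no words with doc1

print(round(cosine_similarity(doc1, doc2), 4))  # 1.0  (identical direction)
print(cosine_similarity(doc1, doc3))            # 0.0  (orthogonal)
```

Note that doc2 has double the word counts of doc1 yet still scores 1.0: cosine similarity only compares direction, so document length does not affect the score.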

Dot product

The dot product is a way to measure how much two vectors "overlap" or are similar in terms of their directions. Imagine you have two vectors and you want to know how much they are aligned or pointing in the same direction.
The dot product takes two vectors and returns a scalar value. It calculates the sum of the products of the corresponding elements in the vectors. A higher positive dot product indicates a closer alignment, while a negative dot product suggests misalignment or opposite directions.
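The "sum of the products of corresponding elements" is a one-liner; as a sketch (with made-up 2-D vectors chosen to show each case):

```python
def dot_product(a, b):
    # Sum of the products of corresponding elements.
    return sum(x * y for x, y in zip(a, b))

aligned    = dot_product([2, 3], [4, 1])    # 2*4 + 3*1 = 11   -> positive: aligned
opposite   = dot_product([2, 3], [-2, -3])  # -4 + -9   = -13  -> negative: opposite directions
orthogonal = dot_product([1, 0], [0, 5])    # 0                -> perpendicular
print(aligned, opposite, orthogonal)
```

Unlike cosine similarity, the dot product is not normalized, so longer vectors produce larger scores even at the same angle.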

Consider the vector plots below

_Vector plots_

After a dot product operation on both vectors, we get:
_Vectors and dot product_

A negative dot product indicating misalignment looks like the plot below

_A negative dot product vector_

The dot product can be useful in various applications. For instance, in image processing, you can use the dot product to compare two image feature vectors and determine how similar they are. In machine learning, the dot product is used in algorithms like support vector machines (SVM) to classify data points into different categories based on their features.

Summary

These three metrics provide different ways to measure similarity or dissimilarity between vectors or data points. Euclidean distance measures geometric distance, cosine similarity measures directional similarity, and the dot product measures alignment or overlap. They have applications in fields such as image processing, text analysis, recommendation systems, and machine learning.

Good learning!
