## DEV Community

Akmal Chaudhri for SingleStore

Posted on • Updated on

# Quick tip: Dot Product, Euclidean Distance and Cosine Similarity in SingleStoreDB

## Abstract

SingleStoreDB has supported vector functions since 2017. In this article, we'll see examples of using Dot Product, Euclidean Distance and Cosine Similarity in SingleStoreDB to assess vector relationships. New vector capabilities will be available in the next major release of SingleStoreDB - stay tuned.

## Introduction

### Dot Product

The Dot Product is a mathematical procedure that combines two lists of numbers into a single value. This is achieved by multiplying corresponding numbers from each list and then adding up the products. This operation is useful for tasks like measuring the commonality between two sets of values or assessing the influence of one set on another.

### Euclidean Distance

Euclidean Distance is a mathematical metric that measures the straight-line distance between two points in a Cartesian coordinate system. To compute it, square the differences between the x and y coordinates of the two points, sum these squared differences, and then take the square root of the result. The final output is a single numerical representation of the spatial separation between the points, like the direct walking distance between them on a graph.

### Cosine Similarity

Cosine Similarity is a mathematical measure that gauges the similarity between two sets, such as two lists of words. Instead of focusing on the specific words, it considers the angle between the sets, irrespective of their sizes. Visualise these sets as vectors in a multi-dimensional space. Cosine Similarity essentially assesses how well the directions of these vectors align. High similarity arises when the vectors point in the same direction, while low similarity occurs when they are perpendicular. This method is commonly employed in comparing documents or texts, providing insights into whether they discuss similar topics, regardless of the specific wording.

### SingleStoreDB

SingleStoreDB provides direct support for Dot Product and Euclidean Distance using the vector functions `DOT_PRODUCT` and `EUCLIDEAN_DISTANCE`, respectively. Cosine Similarity is supported by combining the `DOT_PRODUCT` and `SQRT` functions. In this article, we'll see some simple examples of each function.

For further details, see:

## Create a SingleStoreDB Cloud account

A previous article showed the steps to create a free SingleStoreDB Cloud account. We'll use the following settings:

• Workspace Group Name: Iris Demo Group
• Cloud Provider: AWS
• Region: US East 1 (N. Virginia)
• Workspace Name: iris-demo
• Size: S-00

## Create a Database and Table

In our SingleStore Cloud account, let's use the SQL Editor to create a new database. Call this `iris_db`, as follows:

``````CREATE DATABASE IF NOT EXISTS iris_db;
``````

We'll also create the `iris` table, as follows:

``````USE iris_db;

CREATE TABLE IF NOT EXISTS iris (
vector BLOB,
species VARCHAR(20)
);
``````

The Iris data set contains 150 rows of data for three different species of flowers. Each flower has four columns of data: `sepal_length`, `sepal_width`, `petal_length` and `petal_width`. We can store these four column values together using `JSON_ARRAY_PACK` in a BLOB format, as follows:

``````INSERT INTO iris VALUES
(JSON_ARRAY_PACK('[5.1,3.5,1.4,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.9,3,1.4,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.7,3.2,1.3,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.6,3.1,1.5,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5,3.6,1.4,0.2]'),'Iris-setosa'),
...
(JSON_ARRAY_PACK('[6.7,3,5.2,2.3]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.3,2.5,5,1.9]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.5,3,5.2,2]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.2,3.4,5.4,2.3]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[5.9,3,5.1,1.8]'),'Iris-virginica');
``````

Only the first five and last five rows are shown above. The complete `INSERT` code listing is available in a GitHub Gist.

## Query 1

First, let's try a query where we want to find the name of the flower species using an exact match for the `sepal_length`, `sepal_width`, `petal_length` and `petal_width`. We'll use the values `[5.9,3,5.1,1.8]` from the last row of the `iris` table, shown above.

### Dot Product

``````SELECT species
FROM iris
WHERE DOT_PRODUCT(vector, JSON_ARRAY_PACK('[5.9,3,5.1,1.8]')) = DOT_PRODUCT(vector, vector);
``````

This SQL query retrieves the `species` from the `iris` table for rows where the Dot Product of the vector in the `vector` column with the specified vector `[5.9,3,5.1,1.8]` is equal to the Dot Product of the `vector` column with itself. Since there's only one row with the specified values in the `vector` column, the query essentially identifies the `species` for that particular set of values in the `iris` table.

### Euclidean Distance

``````SELECT species
FROM iris
WHERE EUCLIDEAN_DISTANCE(vector, JSON_ARRAY_PACK('[5.9,3,5.1,1.8]')) = 0;
``````

This SQL query retrieves the `species` from the `iris` table for rows where the Euclidean Distance between the vector in the `vector` column and the specified vector `[5.9,3,5.1,1.8]` is equal to 0. In other words, they are one and the same.

### Cosine Similarity

``````SELECT species
FROM iris
WHERE DOT_PRODUCT(vector, JSON_ARRAY_PACK('[5.9,3,5.1,1.8]')) / SQRT(DOT_PRODUCT(vector, vector) * DOT_PRODUCT(JSON_ARRAY_PACK('[5.9,3,5.1,1.8]'), JSON_ARRAY_PACK('[5.9,3,5.1,1.8]'))) = 1;
``````

This SQL query retrieves the `species` values from the `iris` table for rows where the normalised Dot Product of the vector in the `vector` column with the specified vector `[5.9,3,5.1,1.8]` equals 1. The condition checks for parallel vectors, indicating a high similarity or identical direction between the two vectors.

The result in each case should be:

``````+----------------+
| species        |
+----------------+
| Iris-virginica |
+----------------+
``````

## Query 2

Now, let's use some fictitious data values `[5.2,3.6,1.5,0.3]` to make a prediction.

### Dot Product

``````SELECT species
FROM iris
ORDER BY DOT_PRODUCT(vector, JSON_ARRAY_PACK('[5.2,3.6,1.5,0.3]')) DESC
LIMIT 1;
``````

This SQL query retrieves the `species` from the `iris` table and orders the results in descending order based on the Dot Product between the vectors in the `vector` column and the specified vector `[5.2,3.6,1.5,0.3]`. The LIMIT 1 ensures that only the top result is returned.

The result should be:

``````+----------------+
| species        |
+----------------+
| Iris-virginica |
+----------------+
``````

### Euclidean Distance

``````SELECT species
FROM iris
ORDER BY EUCLIDEAN_DISTANCE(vector, JSON_ARRAY_PACK('[5.2,3.6,1.5,0.3]'))
LIMIT 1;
``````

This SQL query retrieves the `species` from the `iris` table and orders the results based on the Euclidean Distance between the vectors in the `vector` column and the specified vector `[5.2,3.6,1.5,0.3]`. The LIMIT 1 ensures that only the top result, closest in Euclidean Distance, is returned.

The result should be:

``````+-------------+
| species     |
+-------------+
| Iris-setosa |
+-------------+
``````

### Cosine Similarity

If we want to use `DOT_PRODUCT` and achieve a similar result to `EUCLIDEAN_DISTANCE`, we can use the following approach. The `DOT_PRODUCT` doesn't directly represent distance, so we need to modify the query. One possible adjustment is to consider the angle between vectors.

``````SELECT species
FROM iris
ORDER BY DOT_PRODUCT(vector, JSON_ARRAY_PACK('[5.2,3.6,1.5,0.3]')) / SQRT(DOT_PRODUCT(vector, vector) * DOT_PRODUCT(JSON_ARRAY_PACK('[5.2,3.6,1.5,0.3]'), JSON_ARRAY_PACK('[5.2,3.6,1.5,0.3]'))) DESC
LIMIT 1;
``````

This SQL query normalises the `DOT_PRODUCT` by dividing it by the product of the magnitudes of the vectors. This adjustment helps to account for differences in vector magnitudes and makes the `DOT_PRODUCT` more comparable to a Cosine Similarity, which considers the angle between vectors. The ordering is done in descending order to prioritise higher similarity.

The result should be:

``````+-------------+
| species     |
+-------------+
| Iris-setosa |
+-------------+
``````

## Summary

In this short article, we've seen several examples of how to use SingleStoreDB's built-in `DOT_PRODUCT` and `EUCLIDEAN_DISTANCE` vector functions. We've also seen how we can easily represent Cosine Similarity by combining the `DOT_PRODUCT` and `SQRT` functions.

We've used the functions to find an exact match and made predictions based on new, previously unknown, values.