The Subtleties of Vector Similarity Scales (part 4)

Stefano Lottini for DataStax

Here’s the fourth and final installment of this series on learnings from building a vector database. We covered the basic notions of vectors and interacting with them, the behavior of vector similarities, and their usage with Apache Cassandra and DataStax Astra DB. Here, we’ll explore the pitfalls associated with rescaling similarities, and bring it to life with an end-to-end migration example.

In this series (as is the case for most applications out there), we've preferred to work with "similarities" rather than "distances." This is because the former lend themselves more readily to being cast into the notion of a "score" bounded between known values.

For most applications, knowing that zero means "least similar" and one means "most similar" is all that counts. However, a few points should be kept in mind.

Scale

The choice of scaling this score between zero and one is just for convenience; there is nothing special in this, except the fact that it "feels natural." And, sure enough, this is what Apache Cassandra and DataStax Astra DB do, as can be checked by looking at the definitions given in part 1. These achieve a final result bound to lie between zero and one, albeit through very different formulae for the Cosine and the Euclidean cases.
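As a quick refresher, here is a minimal pure-Python sketch of those two definitions (the helper names are just for this illustration; they are not the database's API):

import math

# Cassandra / Astra DB similarities, recapped from part 1:
# both formulae land in the [0:1] interval.
def similarity_cosine(v1, v2):
    cos = sum(a * b for a, b in zip(v1, v2)) / (
        math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    )
    return (1 + cos) / 2  # linear rescaling of the cosine from [-1:+1] to [0:1]

def similarity_euclidean(v1, v2):
    dist2 = sum((a - b) ** 2 for a, b in zip(v1, v2))  # squared Euclidean distance
    return 1 / (1 + dist2)  # maps distances in [0:inf) to scores in (0:1]

print(similarity_cosine([1, 0], [-1, 0]))     # 0.0 (opposite directions)
print(similarity_euclidean([1, 0], [-1, 0]))  # 0.2 (squared distance is 4)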

Alternate cosine similarity

When working with the cosine similarity, however, it is important to note that another, different scale is very common in textbooks and references (such as Wikipedia). Especially in more mathematically-oriented applications, one often prefers the following definition (denoted with a superscript star in this writeup):

S*cos(v1, v2) = (v1 · v2) / (|v1| |v2|)

Definition of the "starred" cosine similarity, which lies between -1 and +1.

This definition is such that identically-oriented vectors have S*cos = 1, while exactly opposed vectors yield S*cos = -1. In other words, the two cosine similarities are related by the simple linear rescaling Scos = (1 + S*cos) / 2.
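To make the two conventions concrete, here is a small self-contained sketch in plain Python (the function names are illustrative only):

import math

def cos_star(v1, v2):
    # "Textbook" cosine similarity, ranging over [-1:+1]
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.hypot(*v1) * math.hypot(*v2))

def cos_zero_one(v1, v2):
    # The [0:1]-rescaled variant used throughout this series
    return (1 + cos_star(v1, v2)) / 2

v1, v2 = [1.0, 0.0], [-1.0, 0.0]  # exactly opposed vectors
print(cos_star(v1, v2))      # -1.0
print(cos_zero_one(v1, v2))  # 0.0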

The meaning of the scale

At this point it is clear that the numeric values of similarities have no intrinsic meaning by themselves. They are very useful to anchor comparisons, such as when determining – and then applying – a cutoff threshold, but not much more. Stated differently, just knowing that "vectors v1 and v2 have a similarity of 0.8" is of little importance without a comparison context. This is even more true across measures: a 0.8 with Euclidean, for example, has nothing to do with a 0.8 from cosine (earlier I gave an explicit function to translate values, but that holds on the sphere only).

Mathematically speaking, one could have chosen any of the infinitely many ways to construct a well-behaved "similarity function" of two vectors; while there is no strong formal principle to favor one over another, all these candidate similarities may well yield different numeric values for the same pair of vectors. This is the reason for the claim that similarity values are an arbitrary, conventional notion.

Intermezzo on vector embeddings

The special case of vectors from embedding models comes with its own problems and caveats, and is not the main focus of this article. Yet two things are worth mentioning here: first, the same two sentences will have different values for the similarity if using different models (even when using the cosine similarity throughout!). And, second, one should not expect by any means that "extremely different sentences" will result in vectors with zero similarity. The latter is a somewhat common misconception, possibly fueled by an erroneous interpretation of this score as a "semantic-relatedness percentage." The truth is, with most embedding models one would have a hard time coming up with two sentences whose similarity (Scos) goes below 0.75 or so. The lesson here is: rescale your expectations accordingly. There'll be a follow-up article specifically targeted at embeddings-related issues.

The pesky dot, again

I just mentioned how the various similarities are engineered to be all bound in the very handy [0:1] interval. Well, strictly speaking, that's a lie: the dot-product similarity is designed to be used only as a replacement for the cosine where the two coincide (i.e. on unit-norm vectors). So, once again, if you use the dot-product for arbitrary vectors (which at this point you will surely see as a weird choice anyway), do not expect the similarities to be bounded in any way. In fact, as the formulae given earlier would show, your dot-product similarity between arbitrary vectors can be anything from negative infinity all the way to positive infinity!
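A tiny numeric sketch makes the point (assuming, as in part 1, the rescaled form (1 + v1 · v2) / 2 for the dot-product similarity):

def similarity_dot(v1, v2):
    # Rescaled dot-product similarity; coincides with cosine on unit-norm vectors
    return (1 + sum(a * b for a, b in zip(v1, v2))) / 2

print(similarity_dot([1, 0], [1, 0]))       # 1.0 (unit norms: same as cosine)
print(similarity_dot([100, 0], [100, 0]))   # 5000.5, far above 1
print(similarity_dot([100, 0], [-100, 0]))  # -4999.5, arbitrarily negative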

Similarity of one

One must not assume that a similarity of one means coinciding vectors. This is true only for the Euclidean similarity, or for the other measures on the unit sphere. The counterexample is the cosine similarity between two vectors, one a multiple of the other (even worse, for the dot-product off the sphere, you have seen that 1 is not a "special" value at all).
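In code, the counterexample takes a single line (a self-contained sketch using the [0:1] cosine convention from above):

import math

def cos_zero_one(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    return (1 + dot / (math.hypot(*v1) * math.hypot(*v2))) / 2

print(cos_zero_one([1, 0], [3, 0]))  # 1.0, although the two vectors differ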

Case study: Migration of a vector app

One of the lessons from this (admittedly a bit theoretical) exposition is that you should always read the fine print when it comes to vector stores and the specific mathematical definitions that are used for the similarities.

To illustrate this point in a practical manner, let's look at the kind of care that should be taken concerning similarities when migrating a typical vector-powered application between vector stores. Let's say you are moving from Chroma to Cassandra / Astra DB. Your application stores vectors and runs ANN search queries, possibly involving a cutoff threshold on the results' "scores" (whatever those are), previously determined through dedicated analysis. Our task now is to ensure the application behaves exactly the same after the migration.

Note: below you'll see a detailed investigation on how Chroma behaves. This has been chosen just as a representative example, the main point being that such level of care should be exercised when migrating vector-based workloads between any two databases!

Fulfilling the stated goal requires:

  • using the same "kind of similarity" (what I called measure earlier)
  • being aware of the precise definition for the similarity (different scales and such), and correcting for any difference
  • of course, adapting the code to use another library!

The third point is not really in scope for this illustrative example; we are most interested in the first two. Let's start!

The Chroma-backed "app" you're migrating is the following Python script. It creates a vector store (with the cosine measure), puts a few vectors in it, and runs an ANN search to print the resulting matches, the associated numeric score, and whether these are "close enough to the query" (for some unspecified purpose). All vectors are guaranteed to have unit norm.

import chromadb
chroma_client = chromadb.Client()

# Creating a Vector store
cos_coll = chroma_client.create_collection(
    name="cosine_coll",
    metadata={"hnsw:space": "cosine"},
)

# Saving vector entries
cos_coll.add(
    documents=["3 o-clock", "6 o'clock", "9 o'clock"],
    embeddings=[[1, 0], [0, -1], [-1, 0]],
    ids=["3:00", "6:00", "9:00"],
)

# Running ANN search
cos_matches = cos_coll.query(
    query_embeddings=[[1, 0]],
    n_results=3
)

chroma_threshold = 1.5

# Printing the results and their "distance"
match_ids = cos_matches["ids"][0]
match_distances = cos_matches["distances"][0]
for m_id, m_distance in zip(match_ids, match_distances):
    status = "ok" if m_distance <= chroma_threshold else "NO!"
    print(f"d([1,0], '{m_id})' = {m_distance}: {status}")

For illustrative purposes, the script inserts two-dimensional vectors arranged as the hour hand of a clock at various times, the query vector being the "3 o'clock" right-pointing direction.

The three sample vectors in the "app", represented as the hour hand on a clock

Caption: The "clock model" illustrates the vectors used in the "sample application". The red vectors are the inserted rows, and the blue vector is the query vector used throughout.

Running the above program (as tested with chromadb==0.4.21) produces this output:

d([1,0], '3:00') = 0.0: ok
d([1,0], '6:00') = 1.0: ok
d([1,0], '9:00') = 2.0: NO!

Do you notice anything here? Well, the number Chroma returns with the matches is not a similarity at all, but rather a distance! Indeed, it increases from the closest to the farthest match. This can be verified on the Chroma docs page, where all relevant formulae are provided. This is very useful information if one is to port an application to a different vector store!

One finds out that, regardless of the measure, Chroma always works in terms of a distance, and that the Cosine choice is no exception, with a "Cosine distance" defined as:

dcosChroma(v1, v2) = 1 - (v1 · v2) / (|v1| |v2|)

Chroma's definition of "cosine distance".

In other words, one can relate this quantity to the familiar similarity through dcosChroma(v1, v2) = 1 - S*cos(v1, v2) = 2 - 2 Scos(v1, v2), equivalent to the inverse mapping Scos = 1 - dcosChroma / 2.

But there is more in the way of translations: indeed, the inequalities in the original code have to be reversed to keep their meaning. Where the Chroma code has distance <= chroma_threshold, for example, you'll need to place the condition similarity >= cass_threshold in the ported code, where cass_threshold = 1 - chroma_threshold / 2, following the mapping above.

Side note: When possible, it’s better to translate thresholds rather than similarities/distances. This can be done "at coding time," generally minimizing the chance of errors/inconsistencies, and in some cases (e.g. when using higher abstractions around a vector store) might be the only feasible choice.
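Here is a sketch of such a coding-time translation for the cosine case (the function name is purely illustrative):

def chroma_cos_threshold_to_cass(chroma_threshold):
    # Invert d_chroma = 2 - 2 * s_cass, i.e. s_cass = 1 - d_chroma / 2
    return 1 - chroma_threshold / 2

chroma_threshold = 1.5
cass_threshold = chroma_cos_threshold_to_cass(chroma_threshold)  # 0.25

# "distance <= 1.5" in the Chroma app becomes "similarity >= 0.25" here:
# the inequality is reversed (and stays inclusive).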

Finally, it should be noted that whereas in Chroma the default measure is Euclidean, Cassandra and Astra DB employ cosine when none is explicitly chosen: it may be safer and less prone to surprises to always spell it out when creating vector stores.

So, the "application," once migrated to Astra DB, consists of a CQL schema creation script, looking like:

// Table creation (CQL)
CREATE TABLE cos_table (
  id TEXT PRIMARY KEY, my_vector VECTOR<FLOAT, 2>
);

// Vector index creation (CQL)
CREATE CUSTOM INDEX cos_table_v_index ON cos_table(my_vector)
  USING 'StorageAttachedIndex'
  WITH OPTIONS = {'similarity_function': 'COSINE'};

plus the "app" itself, the following Python script:

# Connecting to DB
from cassandra.cluster import Cluster
cluster = Cluster(...)  # connection to DB
session = cluster.connect()

# Saving vector entries
session.execute("""
    INSERT INTO cos_table (id, my_vector)
    VALUES ('3:00', [1, 0]);
""")
session.execute("""
    INSERT INTO cos_table (id, my_vector)
    VALUES ('6:00', [0, -1]);
""")
session.execute("""
    INSERT INTO cos_table (id, my_vector)
    VALUES ('9:00', [-1, 0]);
""")

# Running ANN search
ann_query = """
SELECT
  id,
  my_vector,
  similarity_cosine([1, 0], my_vector) as sim
FROM cos_table
ORDER BY my_vector ANN OF [1, 0]
LIMIT 3;"""
cos_matches = session.execute(ann_query)

chroma_threshold = 1.5
cass_threshold = 1 - chroma_threshold / 2

# Printing the results and their "similarity"
for match in cos_matches:
    # While we're at it, we recast to Chroma distance
    chroma_dist = 2 - 2 * match.sim  # d_chroma = 2 - 2 * S_cos, per the mapping above
    #
    status = "ok" if match.sim > cass_threshold else "NO!"
    print(
        f"d([1,0], '{match.id})' = {match.sim}: {status} "
        f"(d_chroma = {chroma_dist})"
    )

The output of this, as expected, will be:

d([1,0], '3:00') = 1: ok (d_chroma = 0)
d([1,0], '6:00') = 0.5: ok (d_chroma = 1)
d([1,0], '9:00') = 0: NO! (d_chroma = 2)

As you see, one has to pay some attention to avoid getting caught in the subtleties of distances, similarities, and definitions. It's definitely better to always read the fine print and play with a toy model to check one's assumptions on known cases (such as the "clock model" vectors used above).

Were the original application using the Euclidean measure (but still working on the unit sphere), one would be in for another surprise: namely, what Chroma calls "Euclidean distance" is actually the squared distance! In other words, deuclChroma(v1, v2) = [δeucl(v1, v2)]².

Once this bit is acknowledged, the rest proceeds in the same manner as seen above. Distances (Chroma) grow when similarities (Cassandra / Astra DB) decrease, inequalities have to be reversed, and the following mapping needs to be used: Seucl = 1 / (1 + deuclChroma), i.e. deuclChroma = (1/Seucl) - 1. Note that a consequence is that, on the sphere, the Chroma Euclidean distance ranges from zero (most similar) to four (most dissimilar, i.e. antipodal vectors on the sphere).
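Both relations can be verified directly on the "clock" vectors (a quick sketch; all vectors are unit-norm, as in the example):

query = [1, 0]
clock = {"3:00": [1, 0], "6:00": [0, -1], "9:00": [-1, 0]}

for label, vec in clock.items():
    d2 = sum((a - b) ** 2 for a, b in zip(query, vec))  # Chroma "Euclidean": the squared distance
    s_eucl = 1 / (1 + d2)                               # Cassandra / Astra DB similarity
    print(f"{label}: d_chroma = {d2}, S_eucl = {s_eucl:.4f}")

# 3:00: d_chroma = 0, S_eucl = 1.0000
# 6:00: d_chroma = 2, S_eucl = 0.3333
# 9:00: d_chroma = 4, S_eucl = 0.2000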

The sheer number of possible ways to quantify the position of two vectors, with different stores and different similarities, is enough to make you feel a bit dizzy – the lesson here is that one should make no unwarranted assumptions and verify definitions thoroughly. Test with known vectors, check the docs for formulae! To complete the exercise, here is a complete "translation map" between all distances/similarities encountered in this migration example:

Conversion table between all similarities and "distances" for Chroma and Cassandra / Astra DB

In the table above, which expresses each quantity as a function of any other, the white cells are always valid, while the darkened ones are relations that hold only on the sphere (i.e. where it makes sense to recast Euclidean notions to Cosine, and vice versa, unambiguously).

You can also check the values these quantities assume with the three "clock" vector positions that were used in the example code (remember these are unit-norm vectors):

Values of the quantities discussed above, computed between the query vector [1, 0] and each of the three "clock" vectors:

| Vector | S*cos | Scos | dcosChroma | deuclChroma | Seucl |
| ------ | ----- | ---- | ---------- | ----------- | ----- |
| 3:00 = [1, 0] | 1 | 1 | 0 | 0 | 1 |
| 6:00 = [0, -1] | 0 | 0.5 | 1 | 2 | 1/3 |
| 9:00 = [-1, 0] | -1 | 0 | 2 | 4 | 1/5 |

Embedded in LangChain

Your original code to migrate might be using a framework rather than directly accessing the Chroma primitives; for example, it might be a LangChain application leveraging the langchain.vectorstores.Chroma vector store abstraction. As can be verified by inspecting the plugin source code (or by running suitable test code, although this turns out to be more convoluted due to LangChain's choice of abstractions around embeddings), essentially the same API as before is exposed through the LangChain object, so one should specify the cosine measure by passing a specific parameter when creating the store:

from langchain_community.vectorstores import Chroma
my_store = Chroma(
    ...,
    collection_metadata={"hnsw:space": "cosine"},
)

The "score" returned by methods such as similarity_search_with_score, likewise, is the very "distance" coming from the Chroma methods, so the same conversions seen above are required.
Likewise, when using the langchain.vectorstores.Cassandra class, the "score" will be exactly the similarity Seucl seen earlier and bound in the [0:1] interval.
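To make the difference tangible, here is a hedged sketch of score handling for the two stores (it assumes vector stores named chroma_store and cassandra_store have already been created and populated; the conversion line is the whole point):

# Chroma-backed store: the "score" is a distance; convert it (cosine case)
for doc, score in chroma_store.similarity_search_with_score("my query", k=3):
    similarity = 1 - score / 2  # Chroma cosine distance -> [0:1] similarity
    print(doc.page_content, similarity)

# Cassandra / Astra DB-backed store: the "score" is already a [0:1] similarity
for doc, score in cassandra_store.similarity_search_with_score("my query", k=3):
    print(doc.page_content, score)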

Conclusion

This technical deep dive has highlighted the definitions, the quirks, and the caveats to keep in mind when approaching the concept of similarity while querying vector stores. As you have seen, subtleties abound. Luckily, awareness of the underlying mathematical structure helps avoid fruitless pursuits and actively counterproductive choices.

So, armed with all this knowledge … why not create a free account on Astra DB and start playing with vector search?
