In part 1, we explored how to build a naïve natural language image search engine using OpenAI's CLIP. But our implementation had a major drawback: every search compared the query against every image, giving linear time complexity.
To speed up the process, let's understand how CLIP works.
CLIP can be instructed, in natural language, to predict the most relevant text snippet for a given image, without being directly optimized for that task.
To do so, it translates the input (text or image) into a relatively low-dimensional space in which similar images and texts are close to each other. This process is called embedding.
In the embedding space, each input maps to a vector.
A vector is just an object in a (vector) space. More generally, all n-tuples (sequences of length n) (a1, a2, ..., an) of elements ai form a vector space of dimension n.
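As a toy illustration (not part of the search engine itself), n-tuples of numbers behave as vectors: component-wise addition and scalar multiplication keep us inside the space.

```python
# Toy illustration: 3-tuples of numbers form a 3-dimensional vector space.
u = (1.0, 2.0, 3.0)
v = (4.0, 5.0, 6.0)

# Vector addition: (a1 + b1, a2 + b2, a3 + b3)
u_plus_v = tuple(a + b for a, b in zip(u, v))

# Scalar multiplication: (c * a1, c * a2, c * a3)
two_u = tuple(2 * a for a in u)

print(u_plus_v)  # (5.0, 7.0, 9.0)
print(two_u)     # (2.0, 4.0, 6.0)
```

A CLIP embedding is the same idea at a larger scale: a tuple of a few hundred floats, i.e. a vector in a space of that dimension.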
The nearest neighbors of the query's vector are the best matches for our search. We can use cosine similarity to retrieve them.
Cosine similarity is a measure of similarity between two non-zero vectors, defined as the cosine of the angle between them. One advantage of cosine similarity is its low complexity: only the non-zero dimensions need to be considered. It is also bounded between -1 and 1, which makes it convenient for information retrieval.
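As a quick sketch, cosine similarity fits in a few lines of NumPy (the function name here is our own, chosen for illustration):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two non-zero vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
print(cosine_similarity(a, a))   # 1.0: same direction
print(cosine_similarity(a, -a))  # -1.0: opposite direction
print(round(cosine_similarity(a, b), 4))  # 0.7071, i.e. cos(45 degrees)
```

Note the bounds in action: identical directions score 1, opposite directions score -1, and everything else falls in between.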
Our previous implementation could be split into two phases: Indexing and Searching.
During indexing, we embed each of our images and store the resulting vector somewhere.
During a search, we embed our search query and compute the cosine similarity of the query's vector with each of the previously stored vectors. This operation takes O(n)! Finally, we return the image with the best similarity score.
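The two phases can be sketched as follows. Here `embed` is a purely illustrative stand-in for CLIP's encoders (a seeded random projection, not a real model), and the search is the brute-force O(n) scan described above:

```python
import numpy as np

# Stand-in for CLIP's encoder: maps any input to a 512-dim unit vector.
# Purely illustrative; a real system would call CLIP's image/text encoders.
def embed(item: str, dim: int = 512) -> np.ndarray:
    seed = abs(hash(item)) % (2 ** 32)
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

# Indexing phase: embed every image and store the resulting vectors.
image_ids = ["cat.jpg", "dog.jpg", "car.jpg"]
index = {img: embed(img) for img in image_ids}

# Search phase: embed the query, then scan every stored vector -- O(n).
def search(query: str) -> str:
    q = embed(query)
    # All vectors are unit-length, so the dot product equals cosine similarity.
    return max(index, key=lambda img: float(np.dot(q, index[img])))

print(search("cat.jpg"))
```

The scan in `search` is exactly the bottleneck: its cost grows linearly with the number of indexed images, which is what part 3 will address.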
In part 3, we will learn about existing techniques to retrieve similar vectors faster.