Akmal Chaudhri for SingleStore

Quick tip: Visualising Similarities Between CLIP Text and Image Embeddings

Abstract

In this article, we'll use OpenAI's CLIP model (Contrastive Language-Image Pre-training) to analyse the relationship between text and visual data by encoding and comparing their feature representations. Cosine similarity is calculated between image and text embeddings, and several dimensionality reduction techniques are used to create 2D visualisations of these relationships.

The notebook file used in this article is available on GitHub.

Introduction

In this article, we'll explore OpenAI's CLIP model to evaluate the relationship between image and text data. The CLIP model encodes both text and image features, which are then normalised so that cosine similarity can be computed to measure the relevance between the two modalities.
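
As a refresher, the cosine similarity between two embedding vectors u and v is their dot product divided by the product of their magnitudes:

cos(u, v) = (u · v) / (‖u‖ ‖v‖)

so once the embeddings are L2-normalised, it reduces to a simple dot product, ranging from -1 (opposite) to 1 (identical).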

Create a SingleStore Cloud account

A previous article showed the steps to create a free SingleStore Cloud account. We'll use the Free Shared Tier and take the default names for the Workspace and Database.

Import the notebook

We'll download the notebook from GitHub.

From the left navigation pane in the SingleStore cloud portal, we'll select DEVELOP > Data Studio.

In the top right of the web page, we'll select New Notebook > Import From File. We'll use the wizard to locate and import the notebook we downloaded from GitHub.

Run the notebook

After checking that we are connected to our SingleStore workspace, we'll run the cells one by one.

We'll begin by installing the necessary libraries and importing dependencies.
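
The exact install and import cells live in the notebook; a minimal set, inferred from the code used later in this article, might look like the following (the package names, such as ftfy, regex, umap-learn, and the CLIP GitHub install, are assumptions rather than the notebook's exact cells):

!pip install ftfy regex tqdm umap-learn plotly scikit-learn
!pip install git+https://github.com/openai/CLIP.git

# Imports inferred from the code shown below
import torch
import clip
import requests
import umap
import plotly.express as px
from io import BytesIO
from PIL import Image as PILImage
from IPython.display import Image, display
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE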

Next, we'll load the CLIP model and preprocess function:

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device = device)

We'll then download a sample image, preprocess it, and define some sample text, as follows:

image_url = "https://github.com/VeryFatBoy/clip-demo/raw/main/thumbnails/1_what_makes_singlestore_unique.png"
response = requests.get(image_url)
display(Image(url = image_url))

image = preprocess(
    PILImage.open(
        BytesIO(response.content)
    )
).unsqueeze(0).to(device)

texts = [
    "What makes SingleStoreDB unique",
    "Ultra-Fast Ingestion",
    "Pipelines"
]

Next, we'll encode the image and text features:

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(
        clip.tokenize(texts).to(device)
    )

We'll normalise the features:

image_features /= image_features.norm(dim = -1, keepdim = True)
text_features /= text_features.norm(dim = -1, keepdim = True)

then combine the embeddings:

combined_features = torch.cat([
    image_features,
    text_features
], dim = 0).cpu().numpy()

and compute the cosine similarities:

similarities = [
    calculate_similarity(image_features, text_features[i])
    for i in range(len(texts))
]
labels = ["What makes SingleStoreDB unique (Image)"] + [
    f"{text} (Cosine Similarity: {similarity:.6f})" for text, similarity in zip(texts, similarities)
]
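
Note that calculate_similarity is a helper defined in the notebook but not reproduced above. Since both embeddings are already L2-normalised, cosine similarity reduces to a dot product, so a minimal sketch of such a helper (an assumption, not necessarily the notebook's exact code) might look like this:

# Hypothetical sketch: with L2-normalised vectors, cosine similarity
# is just the dot product of the image and text embeddings.
def calculate_similarity(image_features, text_feature):
    return (image_features @ text_feature).item()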

Before plotting, we'll print the similarity scores:

print(f"{'Text':<35} {'Cosine Similarity':<10}")
print("-" * 60)

for text, similarity in zip(texts, similarities):
    print(f"{text:<35} {similarity:<10.6f}")

Example output:

Text                                Cosine Similarity
------------------------------------------------------------
What makes SingleStoreDB unique     0.265887  
Ultra-Fast Ingestion                0.155181  
Pipelines                           0.153016

We'll create a function to handle the different plots:

def plot_reduction(data, title, similarities):
    fig = px.scatter(
        x = data[:, 0],
        y = data[:, 1],
        color = labels,
        title = title,
        labels = {"x": "x", "y": "y"},
        size = similarities
    )
    # fig.update_traces(marker = dict(sizemode = "diameter", sizemin = 5))
    fig.show()

image_marker_size = 1

First, we'll plot PCA:

pca = PCA(n_components = 2)
pca_result = pca.fit_transform(combined_features)
plot_reduction(
    pca_result,
    "PCA",
    [image_marker_size] + similarities
)

Example output is shown in Figure 1.

Figure 1. PCA.

Next, we'll plot UMAP:

n_neighbors = min(15, combined_features.shape[0] - 1)
umap_model = umap.UMAP(n_components = 2, n_neighbors = n_neighbors, random_state = 42)
umap_result = umap_model.fit_transform(combined_features)
plot_reduction(
    umap_result,
    "UMAP",
    [image_marker_size] + similarities
)

Example output is shown in Figure 2.

Figure 2. UMAP.

Finally, we'll plot t-SNE:

perplexity = min(30, combined_features.shape[0] - 1)
tsne = TSNE(n_components = 2, perplexity = perplexity, random_state = 42)
tsne_result = tsne.fit_transform(combined_features)
plot_reduction(
    tsne_result,
    "t-SNE",
    [image_marker_size] + similarities
)

Example output is shown in Figure 3.

Figure 3. t-SNE.

Summary

In this article, we used Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP), and t-Distributed Stochastic Neighbour Embedding (t-SNE) to visualise the reduced feature space. Plotly charts for each method displayed the embeddings, with the text-image cosine similarities determining the marker sizes. This demonstrated CLIP's ability to integrate and interpret multi-modal data, offering a useful way to visually compare textual and visual features.
