Bolaji Bolajoko

Getting Started with AI Development: My Journey with Embeddings and Vector Databases

As a web developer, I’ve always been intrigued by AI development. I had wanted to explore it for quite a while but never gave myself the chance to dive in. Part of me was curious and eager to learn, but another part of me was lazy, and I kept procrastinating. Last weekend, I finally decided to give it a try. I added it to my weekend to-do list because if it’s not on the list, I know it won’t get done.

I started with a simple Google search for "How to get started with AI development," but I wasn’t satisfied with the results and articles I found. However, I remembered hearing terms like "embeddings" and "vector databases" from Per Borgen, the co-founder of Scrimba, during a podcast interview. These concepts stuck in my mind as something I should explore further.

Before my Google search, Scrimba was the first resource that came to mind. I browsed their AI Engineering Path, which includes lots of interesting topics and projects. If you're interested in learning about AI Engineering, I strongly recommend checking it out. However, I didn’t take the course myself because I couldn’t afford the subscription.

It All Started with Embeddings and Vector Databases

Embeddings

Embeddings are a technique in machine learning (ML) where data is transformed into vector representations. A vector, in this context, is a mathematical object that represents data as a list of numbers.

Embeddings allow high-dimensional data such as images, videos, or text to be encoded in a lower-dimensional space. In this space, similar items are represented by vectors that are close to each other.

Let’s take an example of using a simplified word embedding model with 5 dimensions to represent words. Here's how different words might be represented:

  • "cat": [0.2, 0.5, -0.3, 0.1, 0.4]
  • "dog": [0.3, 0.4, -0.2, 0.2, 0.3]
  • "fish": [-0.1, 0.2, 0.5, -0.3, 0.1]
  • "car": [-0.4, -0.2, 0.1, 0.5, -0.3]

Notice how the vectors for "cat" and "dog" are similar, with only slight differences, while the vectors for "cat" and "car" point in very different directions. There are several metrics for measuring how close two vectors are, such as cosine similarity, Euclidean distance, and dot product. Later in this article we’ll implement one of them, cosine similarity, from scratch.
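To make "similar" concrete, here is a minimal sketch that scores the toy vectors above with cosine similarity. The numbers are the illustrative values from the list, not output from a real embedding model, and the helper name `cosine` is only used for this example.

```javascript
// Toy 5-dimensional vectors from the example above (illustrative values only).
const words = {
  cat: [0.2, 0.5, -0.3, 0.1, 0.4],
  dog: [0.3, 0.4, -0.2, 0.2, 0.3],
  fish: [-0.1, 0.2, 0.5, -0.3, 0.1],
  car: [-0.4, -0.2, 0.1, 0.5, -0.3],
};

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosine(a, b) {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(a) * norm(b));
}

console.log("cat vs dog:", cosine(words.cat, words.dog).toFixed(3)); // 0.957
console.log("cat vs car:", cosine(words.cat, words.car).toFixed(3)); // -0.509
```

A score near 1 means the two vectors point in almost the same direction, while a negative score means they point away from each other.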

As mentioned, embeddings can be generated for many types of data: images, video, text, and user data. This article focuses on text data, with examples to demonstrate how it works.

Why Do We Need Embeddings?

Consider a food app that helps users learn how to prepare different dishes. Embeddings can filter out irrelevant queries like “Top 5 movies of all time,” ensuring that the app stays focused on its purpose—food-related content.

How Embeddings Help with Query Filtering

When a user enters a query, embeddings can represent the meaning of that query as a vector in high-dimensional space. A machine learning model then compares this vector with vectors of known food-related queries (e.g., recipes or cooking instructions). If the similarity score between the user’s query and the food-related records is low, the system can flag it as irrelevant and provide a response like, “Please ask food-related questions.”
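The filtering idea can be sketched in a few lines. This is only an illustration under stated assumptions: the 3-dimensional vectors are made up to stand in for real embeddings, and the cutoff `RELEVANCE_THRESHOLD` of 0.5 is a hypothetical value that you would tune for your own data.

```javascript
// Hypothetical cutoff: best scores below this are treated as off-topic.
const RELEVANCE_THRESHOLD = 0.5;

// Cosine similarity between two vectors of equal length.
function cosine(a, b) {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// Best similarity score between the query and any known food-related vector.
function bestScore(queryVector, knownVectors) {
  return Math.max(...knownVectors.map((v) => cosine(queryVector, v)));
}

// Reject the query when even its closest match is not similar enough.
function respond(queryVector, knownVectors) {
  if (bestScore(queryVector, knownVectors) < RELEVANCE_THRESHOLD) {
    return "Please ask food-related questions.";
  }
  return "OK: query looks food-related.";
}

// Made-up 3-dimensional vectors standing in for real embeddings.
const foodVectors = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1]];
const recipeQuery = [0.85, 0.2, 0.05]; // close to the food vectors
const movieQuery = [0.0, 0.1, 0.95];   // far from the food vectors

console.log(respond(recipeQuery, foodVectors));
console.log(respond(movieQuery, foodVectors));
```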

Vector Databases

To provide accurate responses to user queries, we need pre-defined related data stored somewhere. This is where vector databases come into play. A vector database is a specialized database used to store and retrieve high-dimensional vectors. Some of the core benefits include fast similarity searches, clustering, and retrieval based on distance metrics like cosine similarity, Euclidean distance, or dot product. Some popular vector databases include:
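As a rough illustration of how the three metrics named above differ, the sketch below computes each of them for the same pair of vectors. The vectors are chosen by hand so the arithmetic is easy to follow; the function names are just for this example.

```javascript
// Dot product: sum of element-wise products. Larger means more aligned (and longer).
function dotProduct(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Euclidean distance: straight-line distance between the points. Smaller means closer.
function euclideanDistance(a, b) {
  return Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0));
}

// Cosine similarity: compares direction only and ignores vector length.
function cosineSimilarity(a, b) {
  return dotProduct(a, b) / Math.sqrt(dotProduct(a, a) * dotProduct(b, b));
}

const a = [1, 2, 2];
const b = [2, 4, 4]; // same direction as a, twice the length

console.log(dotProduct(a, b));        // 18
console.log(euclideanDistance(a, b)); // 3
console.log(cosineSimilarity(a, b));  // 1
```

Cosine similarity reports a perfect match here because the two vectors point the same way, while Euclidean distance still sees them as 3 units apart. Which behavior you want depends on whether vector length carries meaning in your data.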

  • Pinecone
  • FAISS (Facebook AI Similarity Search)
  • Milvus
  • Vespa
  • Weaviate

How Embeddings and Vector Databases Work Together

When you generate embeddings (vectors) from an ML model, these vectors can be stored in a vector database. The vector database then allows you to:

  • Search for similar embeddings: You can search for vectors that are close to a given query vector. For example, if a user inputs a food-related query, you can retrieve the closest vectors representing recipes or cooking instructions from the database.
  • Semantic search: Instead of exact keyword matching, embeddings allow you to search for semantically similar content. For example, if the query contains “How do I grill a steak?”, the search may return results related to grilling or cooking, even if the exact words aren’t used.
  • Clustering and classification: You can group similar data points together. For example, food-related vectors can be grouped in one cluster, while movie-related vectors can be grouped separately.
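The grouping idea from the last bullet can be sketched with a nearest-centroid approach: average each group of vectors, then assign a new vector to the group whose average it most resembles. This is one simple technique among many, not necessarily what a given vector database does internally, and the 3-dimensional vectors and group labels below are made up for illustration.

```javascript
// Cosine similarity between two vectors of equal length.
function cosine(a, b) {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// Average a group of vectors into a single centroid vector.
function centroid(vectors) {
  return vectors[0].map(
    (_, i) => vectors.reduce((sum, v) => sum + v[i], 0) / vectors.length
  );
}

// Made-up 3-dimensional vectors standing in for real embeddings.
const groups = {
  food: [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1]],
  movies: [[0.1, 0.0, 0.9], [0.0, 0.2, 0.8]],
};

// Pick the label whose centroid is most similar to the query vector.
function classify(queryVector) {
  let best = { label: null, score: -Infinity };
  for (const [label, vectors] of Object.entries(groups)) {
    const score = cosine(queryVector, centroid(vectors));
    if (score > best.score) best = { label, score };
  }
  return best.label;
}

console.log(classify([0.7, 0.2, 0.1])); // food
console.log(classify([0.1, 0.1, 0.7])); // movies
```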

The text associated with the highest-scoring vector can also be sent to a Large Language Model (LLM) API (such as OpenAI or Gemini) to generate a more robust response.
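One way to hand the best match to an LLM is to fold it into the prompt text. The sketch below only builds the prompt string; it makes no API call, and `buildPrompt` and its wording are hypothetical, so adapt them to whichever LLM client you use.

```javascript
// Hypothetical helper: combines the user's question with the best-matching stored text.
function buildPrompt(userQuery, bestMatchText) {
  return [
    "You are a cooking assistant.",
    `Related known question: ${bestMatchText}`,
    `User question: ${userQuery}`,
    "Answer the user question.",
  ].join("\n");
}

const prompt = buildPrompt(
  "What are cake ingredients?",
  "What do I need to bake a cake?"
);
console.log(prompt);
```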

To use any of the vector databases mentioned above, visit their official websites and follow their setup documentation. In this example, I’ll use a plain in-memory array as the store so we can see how everything works, especially the similarity metric. If you’re using a vector database service, the metric options are usually available on the dashboard.

Project: The Chef

In this project, we’ll build a Node.js app that filters a given query based on predefined vectors from a list.

Project setup:

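# set up the Node.js app
npm init
# the code below uses ES module imports, so mark the package as a module
# (or add "type": "module" to package.json by hand)
npm pkg set type=module
# install TensorFlow.js and the USE (Universal Sentence Encoder) embedding model
npm install @tensorflow/tfjs @tensorflow-models/universal-sentence-encoder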

Read more about the Universal Sentence Encoder: https://www.tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder

Here is what our workflow will look like:

  1. Generate embedding
  2. Store setup (Insert embedding)
  3. Cosine similarity (metric measurement)
  4. Query Embedding

We will use USE (the Universal Sentence Encoder) to generate embeddings for our text data.

lib/generateEmbeddings.js


import use from "@tensorflow-models/universal-sentence-encoder";
import '@tensorflow/tfjs-backend-cpu'

let modelPromise; // cached so the model is loaded only once, not on every call

// generate an embedding for a given text
export async function generateEmbedding(text){
   modelPromise ??= use.load(); // start loading the model on first use
   const model = await modelPromise;
   const embeddings = await model.embed([text]); // generate embedding for the input text
   return embeddings.arraySync()[0] // returns the first embedding in array format
}

lib/storeSetup.js

import { generateEmbedding } from "./generateEmbeddings.js";

const embeddingStore = [];

// add embeddings to the store
export function addEmbedding(id, text, embedding) {
  embeddingStore.push({ id, embedding, text });
}

// get all embeddings from the store
export function getEmbeddings() {
  return embeddingStore;
}

// insert embedding
export async function insertEmbedding(id, text) {
  const embedding = await generateEmbedding(text); // generate embeddings with the input text
  addEmbedding(id, text, embedding); // store the embedding along with associated text
  console.log(`Insert embedding for: "${text}"`);
}


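lib/utils/metrics.js

We will use cosine similarity to compare the query embedding with the stored embeddings and find the closest matches. Cosine similarity is a common method for comparing vectors: it measures the angle between them and ignores their lengths, so 1 means the same direction, 0 means perpendicular, and -1 means opposite.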

export function cosineSimilarity(vecA, vecB){
    const dotProduct = vecA.reduce((sum, a, i) => sum + a * vecB[i], 0)
    const normA = Math.sqrt(vecA.reduce((sum, a) => sum + a * a, 0))
    const normB = Math.sqrt(vecB.reduce((sum, b) => sum + b * b, 0));

    return dotProduct / (normA * normB)
}

cosine_similarity(A, B) = (A · B) / (‖A‖ ‖B‖)
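To get a feel for the formula, here is a small sketch that evaluates it on hand-picked 2-dimensional vectors. `safeCosineSimilarity` is a hypothetical variant written for this example: the length check and the choice to return 0 for a zero vector are additions of mine, not part of the function above, which would return NaN in that case.

```javascript
// Variant of cosine similarity with input checks (the checks are an assumption
// made for this sketch, not part of the original function).
function safeCosineSimilarity(vecA, vecB) {
  if (vecA.length !== vecB.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let sumSqA = 0;
  let sumSqB = 0;
  for (let i = 0; i < vecA.length; i++) {
    dot += vecA[i] * vecB[i];
    sumSqA += vecA[i] * vecA[i];
    sumSqB += vecB[i] * vecB[i];
  }
  // A zero vector has no direction; returning 0 is a convention chosen here.
  if (sumSqA === 0 || sumSqB === 0) return 0;
  return dot / Math.sqrt(sumSqA * sumSqB);
}

console.log(safeCosineSimilarity([1, 0], [2, 0]));  // 1  (same direction)
console.log(safeCosineSimilarity([1, 0], [0, 3]));  // 0  (perpendicular)
console.log(safeCosineSimilarity([1, 0], [-4, 0])); // -1 (opposite direction)
console.log(safeCosineSimilarity([0, 0], [1, 1]));  // 0  (zero vector)
```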

If you’re interested in how cosine similarity works, drop a comment and I’ll do my best to explain it in more detail in the comment section.

lib/queryEmbedding.js

import { generateEmbedding } from "./generateEmbeddings.js";
import { getEmbeddings } from "./storeSetup.js";
import { cosineSimilarity } from "./utils/metrics.js";

// query for the stored embeddings most similar to a given user query
export async function queryEmbedding(userQuery){
    const queryVector = await generateEmbedding(userQuery); // generate embedding for the user query
    const embeddings = getEmbeddings(); // retrieve all stored embeddings

    // calculate the similarity score for each stored embedding
    const results = embeddings.map(entry => ({
        id: entry.id,
        text: entry.text,
        similarity: cosineSimilarity(queryVector, entry.embedding)
    }));

    // sort by similarity, highest score first
    results.sort((a, b) => b.similarity - a.similarity);

    // return the top 5 most similar results
    return results.slice(0, 5)
}

index.js

import { queryEmbedding } from "./lib/queryEmbedding.js";
import { insertEmbedding } from "./lib/storeSetup.js";

(async () => {
  await insertEmbedding("1", "How to make pizza?");
  await insertEmbedding("2", "How to bake a cake?");
  await insertEmbedding("3", "What do I need to bake a cake?");
  await insertEmbedding("4", "How to bake a cookie");

  // query for similar questions
  const userQuery = "What are cake ingredients?";
  const results = await queryEmbedding(userQuery);

  console.log(`Query: "${userQuery}"`);
  console.log("Most similar results:");
  results.forEach((result) => {
    console.log(`-[${result.similarity.toFixed(4)}] ${result.text}`);
  });

  console.log("The highest score: ");
  // you can take the text with the highest similarity score and send it to an LLM API
  // for a more robust response.
  console.log(`${results[0].similarity.toFixed(4)} - ${results[0].text}`);
})();
// note: on the CPU backend, loading the model and generating embeddings can take a while

Output:

Insert embedding for: "How to make pizza?"
Insert embedding for: "How to bake a cake?"
Insert embedding for: "What do I need to bake a cake?"
Insert embedding for: "How to bake a cookie"
Query: "What are cake ingredients?"
Most similar results:
-[0.7664] What do I need to bake a cake?
-[0.7153] How to bake a cake?
-[0.5386] How to make pizza?
-[0.4366] How to bake a cookie
The highest score:
0.7664 - What do I need to bake a cake?

Conclusion

Embeddings and vector databases offer powerful ways to process and search high-dimensional data, enabling AI applications to perform tasks like semantic search and query filtering based on meaning rather than exact keywords. By understanding these concepts and applying them in simple projects, you can unlock a wide range of possibilities in AI development. This is just the beginning; there’s much more to explore as you continue your journey into the world of machine learning and artificial intelligence.
