DEV Community

yukaty

Part 2: Implementing Vector Search with OpenAI

In Part 1, we set up PostgreSQL with pgvector. Now, let's see how vector search actually works. 🧠


Prerequisites 📋

  • Completed Part 1 setup for pgvector
  • OpenAI API key

Understanding Vector Search 🌐

A vector is a list of numbers that represents position or direction:

2D Vector: [x, y]     📍 Like coordinates on a map
3D Vector: [x, y, z]  🎲 Like a point in 3D space

When AI processes content, it creates special vectors called "embeddings" to represent meaning (1536 dimensions for OpenAI's text-embedding-3-small model, used below). These embeddings are stored in the database, allowing us to perform similarity search:

📘 "How to use Docker"
    [0.23, 0.45, 0.12, ...]  # 1536-dimensional vector
📗 "Docker tutorial"          
    [0.24, 0.44, 0.11, ...]  🤝 Very Similar! (Distance: 0.2)
📕 "Chocolate cake recipe"    
    [0.89, 0.12, 0.67, ...]  🚫 Not Related! (Distance: 0.9)
  • Vectors let AI understand similarity mathematically
  • Vector search finds similar content by comparing distances
  • pgvector stores embeddings efficiently
  • Works across any language (it's all just numbers!)
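
The toy distances above can be reproduced in a few lines of Python. This is a sketch using 3-dimensional vectors for readability (real embeddings have 1536 dimensions, and the numbers are illustrative, not actual model output):

```python
import math

def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

docker_howto = [0.23, 0.45, 0.12]  # "How to use Docker"
docker_guide = [0.24, 0.44, 0.11]  # "Docker tutorial"
cake_recipe  = [0.89, 0.12, 0.67]  # "Chocolate cake recipe"

print(euclidean_distance(docker_howto, docker_guide))  # small -> similar
print(euclidean_distance(docker_howto, cake_recipe))   # large -> not related
```

The two Docker vectors end up far closer to each other than either is to the cake recipe, which is exactly what the database will compute for us at scale.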

Project Setup ⚙️

Updated Project Structure

vector-search/
├── .env
├── compose.yml
├── requirements.txt
├── postgres/            # Part 1: Database setup
│   └── schema.sql
└── scripts/             # New: Data loading
    ├── Dockerfile
    └── load_data.py

1. Set Up OpenAI API

Create .env:

OPENAI_API_KEY=your_api_key  # Get from platform.openai.com
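
The OpenAI client reads OPENAI_API_KEY from the environment automatically, so a missing key only surfaces as an error deep inside the first API call. A quick sanity check before running the loader can save a confusing stack trace (api_key_present is just an illustrative helper, not part of the OpenAI SDK):

```python
import os

def api_key_present() -> bool:
    """Return True if OPENAI_API_KEY is set to a non-empty value."""
    return bool(os.environ.get("OPENAI_API_KEY"))
```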

2. Create Data Loading Script

Create scripts/load_data.py to fetch books and generate embeddings:

import json

import openai

client = openai.OpenAI()

def get_embedding(text: str):
    """Generate embedding using OpenAI API"""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def load_books_to_db():
    """Load books with embeddings into PostgreSQL"""

    # 1. Fetch books from the Open Library API
    #    (fetch_books and store_book are defined in the full script)
    books = fetch_books()

    for book in books:
        # 2. Create a text description for embedding
        description = f"Book titled '{book['title']}' by {', '.join(book['authors'])}. "
        description += f"Published in {book['publish_year']}. "
        description += f"This is a book about {', '.join(book['subjects'])}."

        # 3. Generate an embedding using OpenAI
        embedding = get_embedding(description)

        # 4. Store the book and embedding in PostgreSQL
        store_book(book["title"], json.dumps(book), embedding)

Full source code is available on GitHub 🐙

Also create requirements.txt and scripts/Dockerfile.
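
Based on the imports the loading script uses, a minimal requirements.txt could look like this (versions are left unpinned here as an assumption; psycopg2-binary ships prebuilt wheels, while plain psycopg2 also works once the Dockerfile installs gcc and libpq-dev):

```
openai
requests
psycopg2-binary
```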

3. Update Docker Compose

Update compose.yml to add the data loader:

services:
  # ... existing db service from Part 1

  data_loader:
    build:
      context: .
      dockerfile: scripts/Dockerfile
    environment:
      - DATABASE_URL=postgresql://postgres:password@db:5432/example_db
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    depends_on:
      - db
    command: python load_data.py

4. Load Sample Data

docker compose up --build

Exploring Vector Search 🔦

First, connect to the database:

docker exec -it pgvector-db psql -U postgres -d example_db

Inspecting Embeddings

Check what the vectors look like:

-- View first 5 dimensions of an embedding
SELECT
    name,
    (replace(replace(embedding::text, '[', '{'), ']', '}')::float[])[1:5] as first_dimensions
FROM items
LIMIT 1;

💡 Each embedding from OpenAI's model:

  • Has 1536 dimensions
  • Contains values between -1 and 1
  • Represents text meaning mathematically
  • Outputs in [...] format, which needs to be converted to PostgreSQL's {...} array format for array operations
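
On the client side that [...] text form is easy to parse directly, without converting to a PostgreSQL array first. A minimal sketch (parse_vector_text is an illustrative helper with no input validation):

```python
def parse_vector_text(s: str) -> list[float]:
    """Parse pgvector's text representation, e.g. '[0.23,0.45,0.12]'."""
    return [float(x) for x in s.strip().strip("[]").split(",")]

print(parse_vector_text("[0.23,0.45,0.12]")[:2])  # [0.23, 0.45]
```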

Finding Similar Books

Search for books about web development:

WITH web_book AS (
    SELECT embedding FROM items WHERE name LIKE '%Web%' LIMIT 1
)
SELECT
    item_data->>'title' as title,
    item_data->>'authors' as authors,
    embedding <=> (SELECT embedding FROM web_book) as distance
FROM items
ORDER BY distance
LIMIT 3;  -- Returns the 3 most similar books (smallest distance first)

Working with JSON and Vectors ⚡️

JSON Operators

Use ->> to extract text value from a JSON field:

-- Get title from the 'item_data' JSON column
SELECT item_data->>'title' FROM items;

Vector Search Operators

pgvector supports multiple distance functions. Here are the two most commonly used operators.

L2 Distance: <->

Measures straight-line (Euclidean) distance between vectors:

-- Find similar books using L2 distance
SELECT 
    name,
    embedding <-> (
        SELECT embedding FROM items WHERE name LIKE '%Web%' LIMIT 1
    ) as distance
FROM items
ORDER BY distance
LIMIT 3;

Cosine Distance: <=>

Measures angle-based (cosine) distance between vectors:

-- Find similar books using Cosine distance
SELECT 
    name,
    embedding <=> (
        SELECT embedding FROM items WHERE name LIKE '%Web%' LIMIT 1
    ) as distance
FROM items
ORDER BY distance
LIMIT 3;

💡 Tips

  • OpenAI recommends <=> (Cosine distance) for their embeddings.
  • Smaller distance means higher similarity.
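
What <=> computes can be written out in plain Python (pgvector does this in optimized C). Cosine distance is 1 minus cosine similarity, so vectors pointing in the same direction score 0 and orthogonal ones score 1; this toy sketch shows both cases:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance: 1 - (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # 0.0 -> same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 -> orthogonal
```

Note that magnitude is ignored: [1, 0] and [2, 0] are at distance 0, which is why cosine distance suits normalized embeddings like OpenAI's.
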

Performance Tips 🚀

Query Optimization

Cache query vectors instead of subquerying:

-- ❌ Inefficient: Subquery runs for every row
SELECT name, embedding <=> (
        SELECT embedding FROM items WHERE name LIKE '%Web%' LIMIT 1
) as distance
FROM items
ORDER BY distance
LIMIT 3;

-- ✅ Better: Query vector calculated once
WITH query_embedding AS (
        SELECT embedding FROM items WHERE name LIKE '%Web%' LIMIT 1
)
SELECT 
    name,
    embedding <=> (SELECT embedding FROM query_embedding) as distance
FROM items
ORDER BY distance
LIMIT 3;

Indexing

Choose an index based on your needs:

-- Option 1: IVFFlat (Less memory, good for development)
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Option 2: HNSW (Faster searches, more memory)
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);
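
To run the same cosine search from application code, the query vector can be passed as a pgvector text literal and cast in SQL. This is a sketch assuming the items schema from Part 1 and a psycopg2 connection; to_vector_literal and search_similar are illustrative names, not library functions:

```python
def to_vector_literal(embedding: list[float]) -> str:
    """Format a Python list as a pgvector literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(x) for x in embedding) + "]"

def search_similar(conn, query_embedding: list[float], limit: int = 3):
    """Return the `limit` nearest items by cosine distance."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT item_data->>'title' AS title,
                   embedding <=> %s::vector AS distance
            FROM items
            ORDER BY distance
            LIMIT %s
            """,
            (to_vector_literal(query_embedding), limit),
        )
        return cur.fetchall()
```

Usage would look like search_similar(conn, get_embedding("docker basics")), reusing get_embedding from the loading script so queries and stored rows share the same embedding model.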



Hope this helps you build something cool. Feel free to drop a comment below! 💬

Top comments (1)

Maksim Sheptiakov

Hey
Nice tutorial, but you missed some parts/files. So if someone wants to run it:

Full scripts/load_data.py file:

import openai
import requests
import json
import psycopg2
import os
from typing import List, Dict, Any

# https://dev.to/yukaty/part-3-implementing-vector-search-with-ollama-1dop
# shows how to do this locally with Ollama
client = openai.OpenAI()

def get_embedding(text: str):
    """Generate embedding using OpenAI API"""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding
def store_book(title: str, book_json: str, embedding: List[float]) -> None:
    """
    Store a book with its embedding in the PostgreSQL database.
    Args:
        title: The title of the book
        book_json: JSON string containing book data
        embedding: Vector embedding of the book description
    Returns:
        None
    """
    # Get database connection string from environment variable
    db_url = os.environ.get("DATABASE_URL", "postgresql://postgres:password@localhost:5432/example_db")
    conn = None
    cursor = None
    try:
        # Connect to the PostgreSQL database
        conn = psycopg2.connect(db_url)
        cursor = conn.cursor()
        # Insert the book data and embedding into the database
        cursor.execute(
            """
            INSERT INTO items (name, item_data, embedding)
            VALUES (%s, %s, %s)
            RETURNING id
            """,
            (title, book_json, embedding)
        )
        # Get the ID of the newly inserted book
        book_id = cursor.fetchone()[0]
        # Commit the transaction
        conn.commit()
        print(f"Successfully stored '{title}' with ID {book_id} in the database")
    except Exception as e:
        print(f"Error storing book in database: {e}")
        # Roll back the transaction in case of error
        if conn:
            conn.rollback()
    finally:
        # Close the cursor and connection
        if cursor:
            cursor.close()
        if conn:
            conn.close()
def fetch_books(search_query: str = "python programming",
                limit: int = 10) -> List[Dict[str, Any]]:
    """
    Fetch books from the Open Library API.
    Args:
        search_query: The search term for finding books (default: "python programming")
        limit: Maximum number of books to return (default: 10)
    Returns:
        A list of dictionaries containing book information
    """
    base_url = "https://openlibrary.org/search.json"
    # Define parameters for the API request
    params = {
        "q": search_query,
        "limit": limit,
        "fields": "key,title,author_name,first_publish_year,cover_i,edition_count,subject"
    }
    try:
        # Make the API request
        response = requests.get(base_url, params=params)
        response.raise_for_status()  # Raise exception for HTTP errors
        data = response.json()
        books = data.get("docs", [])
        # Process and format the book data
        formatted_books = []
        for book in books:
            # Create a cleaner book object with consistent fields
            formatted_book = {
                "title": book.get("title", "Unknown Title"),
                "authors": book.get("author_name", ["Unknown Author"]),
                "publish_year": book.get("first_publish_year", "Unknown"),
                "cover_id": book.get("cover_i"),
                "edition_count": book.get("edition_count", 0),
                "subjects": book.get("subject", []),
                "key": book.get("key", "")
            }
            formatted_books.append(formatted_book)
        print(f"Successfully fetched {len(formatted_books)} books about '{search_query}'")
        return formatted_books
    except requests.exceptions.RequestException as e:
        print(f"Error fetching books from Open Library API: {e}")
        return []
def load_books_to_db():
    """Load books with embeddings into PostgreSQL"""
    # 1. Fetches books from Open Library API
    books = fetch_books()
    for book in books:
        # 2. Create a text description for embedding
        description = f"Book titled '{book['title']}' by {', '.join(book['authors'])}. "
        description += f"Published in {book['publish_year']}. "
        description += f"This is a book about {', '.join(book['subjects'])}."
        # 3. Generate an embedding using OpenAI
        embedding = get_embedding(description)
        # 4. Store the book and embedding in PostgreSQL
        store_book(book["title"], json.dumps(book), embedding)

if __name__ == '__main__':
    load_books_to_db()

scripts/Dockerfile:

FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    gcc \
    libpq-dev \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code
COPY scripts/load_data.py /app/load_data.py

# Set environment variables (these will be overridden by docker-compose)
ENV DATABASE_URL=postgresql://postgres:password@db:5432/example_db
ENV OPENAI_API_KEY=""