Avthar Sewrathan for Timescale


🚀 pgai Vectorizer: Automate AI Embeddings With One SQL Command in PostgreSQL

Learn how to automate AI embedding creation using the PostgreSQL you know and love.

Managing embedding workflows for AI systems like RAG, search and AI agents can be a hassle: juggling multiple tools, setting up complex pipelines, and spending hours syncing data, especially if you aren't an ML or AI expert. But it doesn’t have to be that way.

With pgai Vectorizer, now in Early Access, you can automate vector embedding creation, keep embeddings automatically in sync as your data changes, and experiment with different AI models -- all with a single SQL command. No extra tools, no complex setups -- just PostgreSQL doing the heavy lifting.

-- Create a vectorizer to embed data in the blogs table
-- Use the OpenAI text-embedding-3-small model
SELECT ai.create_vectorizer(
    'public.blogs'::regclass,
    embedding => ai.embedding_openai('text-embedding-3-small', 1536),
    chunking => ai.chunking_recursive_character_text_splitter('content')
);
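
Once the vectorizer is in place, the embeddings it maintains can be queried with ordinary SQL. Here's a minimal semantic-search sketch, assuming the vectorizer's default destination view public.blogs_embedding (with chunk and embedding columns), pgvector's cosine-distance operator, and pgai's ai.openai_embed helper to embed the query on the fly:

-- Hedged sketch: semantic search over the embeddings the vectorizer maintains
-- (view name, column names, and distance operator are assumptions, not shown above)
SELECT
    title,
    chunk,
    embedding <=> ai.openai_embed('text-embedding-3-small', 'What is AI?') AS distance
FROM public.blogs_embedding
ORDER BY distance
LIMIT 5;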


What pgai Vectorizer does:

  • Embedding creation with SQL: generate vector embeddings from multiple text columns with just one command, streamlining a key part of your AI workflow.
  • Automatic sync: embeddings update as your data changes—no manual intervention needed (see the sketch after this list).
  • Quick model switching: test different AI models instantly using SQL—no data reprocessing required.
  • Test and roll out: compare models and chunking techniques, A/B test, and roll out updates with confidence and without downtime.
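
To see the automatic sync in action, just write to the source table; the vectorizer's background worker picks the change up from its queue. A minimal sketch, assuming the blogs table has title, url, and content columns and that the vectorizer exposes a public.blogs_embedding view:

-- Insert a new row; no manual embedding call is needed
INSERT INTO public.blogs (title, url, content)
VALUES ('Hello pgai', 'https://example.com/hello-pgai', 'pgai Vectorizer keeps embeddings in sync.');

-- Once the worker has drained the queue, the new chunks show up in the
-- embedding view created by the vectorizer (view name assumed)
SELECT title, chunk
FROM public.blogs_embedding
WHERE title = 'Hello pgai';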

pgai Vectorizer system architecture: pgai Vectorizer automatically creates and updates embeddings from a source data table using work queues and configuration tables housed in PostgreSQL, while the embeddings themselves are generated by an external worker that calls embedding services such as the OpenAI API.
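
Because the queue and configuration live in regular PostgreSQL tables, you can check on a vectorizer's progress with plain SQL. A minimal sketch, assuming the ai.vectorizer_status view and the ai.vectorizer_queue_pending helper that ship with the pgai extension:

-- How much work is still queued for each vectorizer? (object names assumed)
SELECT * FROM ai.vectorizer_status;

-- Pending-item count for a specific vectorizer id
SELECT ai.vectorizer_queue_pending(1);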

Here's an example of testing the RAG output of two different embedding models using pgai Vectorizer:

-- Vectorizer using OpenAI text-embedding-3-small
SELECT ai.create_vectorizer(
   'public.blogs'::regclass,
   destination => 'blogs_embedding_small',
   embedding => ai.embedding_openai('text-embedding-3-small', 1536),
   chunking => ai.chunking_recursive_character_text_splitter('content'),
   formatting => ai.formatting_python_template('Title: $title\nURL: $url\nContent: $chunk')
);

-- Vectorizer using OpenAI text-embedding-3-large
SELECT ai.create_vectorizer(
   'public.blogs'::regclass,
   destination => 'blogs_embedding_large',
   embedding => ai.embedding_openai('text-embedding-3-large', 1536),  -- larger model; 1536 dimensions requested to match the small-model vectorizer
   chunking => ai.chunking_recursive_character_text_splitter('content'),
   formatting => ai.formatting_python_template('Title: $title\nURL: $url\nContent: $chunk')
);

-- Compare results from the two vectorizers on the same RAG query
SELECT
   'text-embedding-3-small' as model,
   generate_rag_response(
       'What is AI?',
       'public.blogs_embedding_small'
   ) as response
UNION ALL
SELECT
   'text-embedding-3-large' as model,
   generate_rag_response(
       'What is AI?',
       'public.blogs_embedding_large'
   ) as response;
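
The generate_rag_response() helper used above isn't defined in this post. Here's a hedged sketch of what such a function might look like, assuming pgai's ai.openai_embed and ai.openai_chat_complete functions and an embedding view with chunk and embedding columns; it is an illustration, not the exact function the comparison ran:

-- Hedged sketch of a generate_rag_response() helper (not shown in this post).
-- NOTE: for a fair comparison, the query should be embedded with the same model
-- that built each view; this sketch hard-codes one embedding model for brevity.
CREATE OR REPLACE FUNCTION generate_rag_response(query_text text, embedding_view text)
RETURNS text AS $$
DECLARE
    context text;
    response jsonb;
BEGIN
    -- Retrieve the most relevant chunks from the given embedding view
    EXECUTE format(
        'SELECT string_agg(chunk, E''\n\n'')
           FROM (SELECT chunk
                   FROM %s
                  ORDER BY embedding <=> ai.openai_embed(''text-embedding-3-small'', $1)
                  LIMIT 3) t',
        embedding_view
    ) INTO context USING query_text;

    -- Ask a chat model to answer using only the retrieved context
    SELECT ai.openai_chat_complete(
        'gpt-4o-mini',
        jsonb_build_array(
            jsonb_build_object('role', 'system',
                'content', 'Answer the question using only the provided context.'),
            jsonb_build_object('role', 'user',
                'content', 'Context: ' || context || E'\n\nQuestion: ' || query_text)
        )
    ) INTO response;

    RETURN response->'choices'->0->'message'->>'content';
END;
$$ LANGUAGE plpgsql;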

Built to Scale

As your datasets grow, pgai Vectorizer scales with you. It automatically optimizes search performance with vector indexes (like HNSW and StreamingDiskANN) once you exceed 100,000 vectors. You’re in control—define chunking and formatting rules to tailor your embeddings to your needs.

Here's an example of an advanced vectorizer configuration, with a StreamingDiskANN index created once 100,000 rows have been added and custom chunking for HTML content:


-- Advanced vectorizer configuration
SELECT ai.create_vectorizer(
   'public.blogs'::regclass,
   destination => 'blogs_embedding_recursive',
   embedding => ai.embedding_openai('text-embedding-3-small', 1536),
   -- automatically create a StreamingDiskANN index once the table reaches 100k rows
   indexing => ai.indexing_diskann(min_rows => 100000, storage_layout => 'memory_optimized'),
   -- apply recursive chunking with specified settings for HTML content
   chunking => ai.chunking_recursive_character_text_splitter(
       'content',
       chunk_size => 800,
       chunk_overlap => 400,
       -- HTML-aware separators, ordered from highest to lowest precedence
        separators => array[
           E'</article>', -- Split on major document sections
           E'</div>',    -- Split on div boundaries
           E'</section>',
           E'</p>',      -- Split on paragraphs
           E'<br>',      -- Split on line breaks
           E'</li>',     -- Split on list items
           E'. ',        -- Fall back to sentence boundaries
           ' '          -- Last resort: split on spaces
       ]
   ),
   formatting => ai.formatting_python_template('title: $title url: $url $chunk')
);

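As mentioned above, pgai Vectorizer also supports pgvector's HNSW index. A hedged variation of the configuration, assuming an ai.indexing_hnsw configuration function with parameters similar to ai.indexing_diskann:

-- Hedged alternative: HNSW index instead of StreamingDiskANN
-- (ai.indexing_hnsw and its parameters are assumptions, not shown in this post)
SELECT ai.create_vectorizer(
   'public.blogs'::regclass,
   destination => 'blogs_embedding_hnsw',
   embedding => ai.embedding_openai('text-embedding-3-small', 1536),
   indexing => ai.indexing_hnsw(min_rows => 100000, opclass => 'vector_cosine_ops'),
   chunking => ai.chunking_recursive_character_text_splitter('content')
);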

Try pgai Vectorizer Today (Early Access)

For companies like MarketReader, pgai Vectorizer has already made AI development faster and more efficient:

“pgai Vectorizer streamlines our AI workflow, from embedding creation to real-time syncing, making AI development faster and simpler -- all in PostgreSQL.” — Web Begole, CTO at MarketReader, an AI Financial Insights Company

If you're ready to start building, we're hosting a Dev Challenge with our partners at Ollama, all about building AI apps with open-source software. We're excited to see what the community builds with PostgreSQL and pgai Vectorizer!

Save time and effort. Focus less on embeddings. Spend more time building your next killer AI app. Try pgai Vectorizer free today: get it on GitHub or fully managed on Timescale Cloud (free for a limited time during Early Access).
