vec2pg is a CLI utility for migrating data from vector databases to Supabase, or any Postgres instance with pgvector.
Objective
Our goal with https://github.com/supabase-community/vec2pg is to create an easy on-ramp to efficiently copy your data from various vector databases into Postgres with associated ids and metadata. The data loads into a new schema with a table name that matches the source e.g. vec2pg.<collection_name>
. That output table uses https://github.com/pgvector/pgvector's's vector
type for the embedding/vector and the builtin json
type for additional metadata.
Once loaded, the data can be manipulated using SQL to transform it into your preferred schema.
When migrating, be sure to increase your Supabase project's disk size so there is enough space for the vectors.
Vendors
At launch we support migrating to Postgres from Pinecone and Qdrant. You can vote for additional providers in the issue tracker and we'll reference that when deciding which vendor to support next.
Throughput when migrating workloads is measured in records-per-second and is dependent on a few factors:
- the resources of the source data
- the size of your Postgres instance
- network speed
- vector dimensionality
- metadata size
When throughput is mentioned, we assume a Small Supabase Instance, a 300 Mbps network, 1024 dimensional vectors, and reasonable geographic colocation of the developer machine, the cloud hosted source DB, and the Postgres instance.
Pinecone
vec2pg copies entire Pinecone indexes without the need to manage namespaces. It will iterate through all namespaces in the specified index and has a column for the namespace in its Postgres output table.
Given the conditions noted above, expect 700-1100 records per second.
Qdrant
The qdrant
subcommand supports migrating from cloud and locally hosted Qdrant instances
Again, with the conditions mentioned above, Qdrant collections migrate at between 900 and 2500 records per second.
Why Use Postgres/pgvector?
The main reasons to use Postgres for your vector workloads are the same reasons you use Postgres for all of your other data. Postgres is performant, scalable, and secure. Its a well understood technology with a wide ecosystem of tools that support needs from early stage startups through to large scale enterprise.
A few game changing capabilities that are old hat for Postgres that haven't made their way to upstart vector DBs include:
Backups
Postgres has extensive supports for backups and point-in-time-recovery (PITR). If your vectors are included in your Postgres instance you get backup and restore functionality for free. Combining the data results in one fewer systems to maintain. Moreover, your relational workload and your vector workload are transactionally consistent with full referential integrity so you never get dangling records.
Row Security
Row Level Security (RLS) allows you to write a SQL expression to determine which users are allowed to insert/update/select individual rows.
For example:
create policy "Individuals can view their own todos."
on public.todos
for select
using
( ( select auth.uid() ) = user_id );
Allows users of Supabase APIs to update their own records in the todos
table.
Since vector
is just another column type in Postgres, you can write policies to ensure e.g. each tenant in your application can only access their own records. That security is enforced at the database level so you can be confident each tenant only sees their own data without repeating that logic all over API endpoint code or in your client application.
Performance
pgvector has world class performance in terms of raw throughput and dominates in performance per dollar. Check out some of our prior blog posts for more information on functionality and performance:
- What's new in pgvector v0.7.0
- pgvector 0.6.0: 30x faster with parallel index builds
- Matryoshka embeddings: faster OpenAI vector search using Adaptive Retrieval
Keep an eye out for our upcoming post directly comparing pgvector with Pinecone Serverless.
Next Steps
To get started, head over to the vec2pg GitHub Page, or if you're comfortable with CLI help guides, you can install it using pip
:
pip install vec2pg
If your current vector database vendor isn't supported, be sure to weigh in on the vendor support issue.
Top comments (0)