Theo Vasilis for Apify

Posted on Oct 10, 2023 • Originally published at blog.apify.com on Jun 28, 2023

6 open-source Pinecone alternatives for LLMs

#ai #llms #integration

We're Apify, and our mission is to make the web more programmable. This article about alternatives to Pinecone was inspired by our work on getting better data for AI. Check us out.

In a previous blog post, I introduced you to Pinecone, which is one of the most popular vector databases around. If you want to know more about Pinecone or vector databases in general, I suggest you read those blog posts I just linked to.

Oh, you're still here? Then I guess you're already familiar with Pinecone and vector databases. But that doesn't mean I won't subject you to some preamble. Before we get to our list of open-source alternatives, let's at least touch upon why they're so important for large language models.

How do vector databases help with LLMs?

Large language models have brought generative AI into the mainstream, but they have a couple of drawbacks:

1) Large language models have a word limit

LLMs have limited memory (the context window), so they can accept only a certain number of tokens as input. If you want to work with more than a few thousand words at once, you need to either fine-tune the model by training it on new data or extract only the relevant text for your prompt.

That's where vector embeddings come in. An embedding is a vector of numbers that represents a piece of text and captures its semantic meaning. Once you break your content into manageable chunks and embed each one, you can look up the chunks most relevant to a question and feed only those into the limited context of a language model like ChatGPT. However, this creates a new problem. Where do you store these embeddings? Vector databases are the answer.
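To make that concrete, here's a minimal Python sketch of the chunk, embed, and retrieve idea. Everything in it is an assumption for illustration: `chunk_words`, `toy_embed`, and `top_chunks` are hypothetical helper names, and the bag-of-words counter is only a stand-in for a real embedding model, which would return dense vectors that capture meaning rather than exact word overlap.

```python
import math
from collections import Counter


def chunk_words(text, max_words):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text):
    """Stand-in for a real embedding model: a sparse bag-of-words count vector."""
    return Counter(word.strip(".,!?").lower() for word in text.split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_chunks(question, chunks, k=1):
    """Return the k chunks whose toy embeddings are most similar to the question."""
    question_vector = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(question_vector, toy_embed(c)), reverse=True)
    return ranked[:k]


document = (
    "Pinecone is a managed vector database. "
    "Qdrant is written in Rust. "
    "Chroma stores data locally."
)
chunks = chunk_words(document, max_words=6)
best = top_chunks("Which one is written in Rust?", chunks, k=1)
print(best)  # only this chunk would go into the prompt
```

In a real pipeline, the vectors for every chunk would be computed once and kept somewhere you can search them quickly, which is exactly the job a vector database does.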

🤖 Fast, reliable data for ChatGPT and LLMs

Use data extraction tools to get the data you need to feed your vector databases

2) Large language models are stuck in the past

It's well-known that LLMs only know what was in their training data, and at the time of writing, ChatGPT's training data stopped in September 2021. Want ChatGPT to help you with content related to current news or finding the best available properties on Airbnb this summer? It will do one of two things: throw its hands up and bleat on about its identity as an AI language model to justify its inability to do what you want, or - worse - hallucinate a fake answer.

📢 Update! As of September 27, 2023, GPT-4's knowledge is no longer limited to data before September 2021

Recently, paying users of ChatGPT gained access to the internet through third-party tools and the use of OpenAI plugins, but an even better solution is to use web scraping to provide GPT or whatever LLM you're using with the information needed to answer your questions. But this creates another problem: if you have a huge dataset for your LLM, you need a way to store it and pass it on to your language model. Vector databases are again the solution.

✍🏻 Need a tool to extract data for LLMs?

Website Content Crawler automatically removes headers, footers, menus, ads, and other noise from web pages in order to return only text content that can be directly fed to language models to create chatbots and other useful AI tools.

Why use a Pinecone alternative?

Pinecone is a service that stores vector data in a cloud-based Pinecone-managed database. Your applications interact with the Pinecone service through APIs to store and retrieve vector data. And while Pinecone is the industry leader when it comes to vector databases, there's one thing about it that some developers aren't too keen on: it isn't open source, which means you don't have the option to host your own instance.

We love open source at Apify (check out our open-source web scraping and automation library, Crawlee), and I'm willing to bet many of you do, too. So here are six popular open-source Pinecone alternatives you might want to explore (all links to these alternatives will take you to their GitHub repo).

🔗 Simplify data retrieval and enable advanced data analysis operations with this Apify - Pinecone integration

6 Pinecone alternatives that are open source

Weaviate

In early 2022, Weaviate's series A funding saw open-source downloads pass the two million mark. In April this year, its series B funding raised $50 million! That's certainly enough to make us pay attention to Weaviate as a Pinecone alternative. Apart from being open source, there's another difference between Pinecone and Weaviate. Pinecone is a more general-purpose vector database that can be used for multiple data types (images, audio, sensory data), while Weaviate is designed specifically for natural language or numerical data based on contextualized word embeddings.

Milvus

Like Weaviate, Milvus is an open-source vector database written largely in Go. It was created by the startup Zilliz, which reached $113 million in investment last year. The Milvus vector database is specifically designed from the ground up to handle embedding vectors converted from unstructured data. It can handle queries over input vectors and is capable of indexing vectors on a huge scale.

Chroma

Chroma is another provider that attracted lots of investors for its embedding database this year. Chroma lets you build Python or JavaScript LLM apps with memory and provides a local ephemeral storage option. That means that the vector data is stored on your local machine or the machine running your application. It doesn't require any external service or database to store the data.

Qdrant

Qdrant is a vector similarity engine developed entirely in Rust, making it fast and reliable even under high load. Each vector can carry a payload that supports a large variety of data types and query conditions, and filtering on that payload makes Qdrant useful for all sorts of neural-net or semantic-based matching, faceted search, and other applications.
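If payload filtering sounds abstract, here's a rough Python sketch of the idea: keep only the points whose payload matches a condition, then rank what's left by similarity. This is not Qdrant's API. The `filtered_search` function, the `must` argument, and the shape of the point dictionaries are all hypothetical names I've made up for illustration, and a real engine would apply the filter inside its index instead of scanning a list.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filtered_search(points, query, must, limit=3):
    """Rank points by similarity to query, keeping only those whose payload
    contains every key/value pair in `must` (hypothetical filter format)."""
    candidates = [
        point for point in points
        if all(point["payload"].get(key) == value for key, value in must.items())
    ]
    candidates.sort(key=lambda point: cosine(point["vector"], query), reverse=True)
    return [point["id"] for point in candidates[:limit]]


points = [
    {"id": 1, "vector": [1.0, 0.0], "payload": {"lang": "en"}},
    {"id": 2, "vector": [0.9, 0.1], "payload": {"lang": "de"}},
    {"id": 3, "vector": [0.0, 1.0], "payload": {"lang": "en"}},
]

# Point 2 is a close match to the query, but the payload filter excludes it.
english_only = filtered_search(points, query=[1.0, 0.0], must={"lang": "en"})
print(english_only)
```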

Faiss

Faiss stands for Facebook AI Similarity Search. It's a library that lets you quickly search for similar multimedia documents using a nearest-neighbor search implementation that works on a huge scale. Faiss is fundamentally an index rather than a database: it solves the approximate nearest neighbor problem rather than the storage problem.
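To show what "an index rather than a database" means in practice, here's a small Python sketch. It is not Faiss and doesn't use the Faiss API: `TinyFlatIndex` is a hypothetical class that does an exact, brute-force scan, whereas Faiss offers much faster and often approximate structures for the same question. The point to notice is that a search hands back positions and distances only, so storing the original documents is left to you.

```python
class TinyFlatIndex:
    """Hypothetical brute-force nearest-neighbor index (exact, not approximate)."""

    def __init__(self, dim):
        self.dim = dim
        self._vectors = []

    def add(self, vector):
        """Store a vector and return its position in the index."""
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
        self._vectors.append(list(vector))
        return len(self._vectors) - 1

    def search(self, query, k):
        """Return up to k (position, squared_distance) pairs, nearest first."""
        if len(query) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(query)}")
        scored = [
            (sum((a - b) ** 2 for a, b in zip(vector, query)), position)
            for position, vector in enumerate(self._vectors)
        ]
        scored.sort()
        return [(position, distance) for distance, position in scored[:k]]


# The index knows nothing about documents, so we keep them in our own list.
documents = ["intro.txt", "pricing.txt", "changelog.txt"]
index = TinyFlatIndex(dim=2)
for vector in ([0.0, 0.0], [1.0, 1.0], [5.0, 5.0]):
    index.add(vector)

hits = index.search([0.9, 1.2], k=2)
nearest_documents = [documents[position] for position, _ in hits]
print(nearest_documents)
```

The linear scan here checks every stored vector, which is fine for three of them and hopeless for a billion. That gap is the problem libraries like Faiss exist to solve.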

LlamaIndex

Formerly known as GPT Index, LlamaIndex is a data framework for building LLM applications. It provides tools for data ingestion, structuring, retrieval, and integration with multiple application frameworks. LlamaIndex gives you the ability to query your data for any downstream LLM use case, whether it's question-answering, summarization, or a component in a chatbot.

✍🏻 Need to load data into LlamaIndex?

Apify Actor Loader is designed to do just that and can be subsequently used as a Tool in a LangChain Agent. View it on GitHub.

Combine vector databases with LangChain

I can't leave the subject of vector databases without saying a few words about LangChain, which has quickly become the library of choice for building on top of generative AI models.

Unlike the aforementioned libraries, which are specifically designed for their vector database services or indexes, LangChain is a more generic library that simplifies the process of integrating different vector databases into an application. That means you can use multiple databases and switch between them without committing to one specific service or its implementation.
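Here's a sketch of why that kind of abstraction makes switching easy. To be clear, this is not LangChain's API: `VectorStore`, `CosineStore`, `EuclideanStore`, and `build_and_query` are hypothetical names I've invented, and the two toy backends stand in for real database clients. The application code depends only on the shared interface, so swapping one backend for another is a one-line change.

```python
import math
from typing import List, Protocol, Sequence


class VectorStore(Protocol):
    """Hypothetical common interface that every backend implements."""

    def add(self, doc_id: str, vector: Sequence[float]) -> None: ...

    def search(self, query: Sequence[float], k: int) -> List[str]: ...


class CosineStore:
    """Toy backend that ranks by cosine similarity."""

    def __init__(self):
        self._items = {}

    def add(self, doc_id, vector):
        self._items[doc_id] = list(vector)

    def search(self, query, k):
        def score(vector):
            dot = sum(a * b for a, b in zip(vector, query))
            norm = math.sqrt(sum(a * a for a in vector)) * math.sqrt(sum(b * b for b in query))
            return dot / norm if norm else 0.0

        ranked = sorted(self._items, key=lambda doc_id: score(self._items[doc_id]), reverse=True)
        return ranked[:k]


class EuclideanStore:
    """Toy backend that ranks by Euclidean distance."""

    def __init__(self):
        self._items = {}

    def add(self, doc_id, vector):
        self._items[doc_id] = list(vector)

    def search(self, query, k):
        ranked = sorted(self._items, key=lambda doc_id: math.dist(self._items[doc_id], query))
        return ranked[:k]


def build_and_query(store: VectorStore) -> List[str]:
    """Application code: it only knows about the VectorStore interface."""
    store.add("cats", [1.0, 0.0])
    store.add("dogs", [0.8, 0.6])
    store.add("cars", [0.0, 1.0])
    return store.search([0.9, 0.1], k=2)


# Swapping the backend doesn't touch the application code at all.
print(build_and_query(CosineStore()))
print(build_and_query(EuclideanStore()))
```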

You can integrate LangChain with Pinecone and all the vector databases mentioned above. You can also integrate LangChain with Apify, which you can use to collect data for your vector databases.

How to use LangChain with Apify for large language models

Top comments (1)

Navraj Chohan • Nov 10 '23

I created a simple wrapper to pgvector that is inspired by Pinecone's simplicity: github.com/UpMortem/simple-pgvecto...
