Challenges with Turkish Text in Vector Search
In one of my projects involving Turkish text data, I encountered the well-known difficulty of working with non-English languages in vector search (semantic search) processes. Despite trying various embedding methods, including OpenAI and Cohere Multilingual (V3), the results were consistently unsatisfactory.
The issues I faced were twofold: either the search results lacked accuracy and relevance, or I had to retrieve an excessive number of documents (more than 15) to achieve satisfactory results. This inefficiency highlighted the need for a more effective approach.
Exploring a Denser Context Solution
To address these challenges, I began exploring the idea of searching within a smaller but denser context. I hypothesized that utilizing optimized, keyword-based embeddings could be the key to improving accuracy and efficiency. By focusing on a concentrated set of relevant keywords rather than the entire text, the vector representations would potentially capture the semantic meaning more effectively.
Optimizing Vector Searches with a Two-Step Retrieval Approach
When working with large document collections and utilizing vector similarity searches, performance and storage considerations become crucial. Storing and searching over the full text of documents can be inefficient and resource-intensive, especially for applications that require real-time responses. However, by adopting a two-step retrieval approach, we can strike a balance between accuracy and efficiency, while also taking advantage of the strengths of vector representations.
In this approach, we separate the retrieval process into two distinct stages:
- Initial Vector Similarity Search: In the first stage, we perform a vector similarity search over a collection of compact vector representations, typically derived from document summaries or keywords. These vector embeddings are smaller in size and can be searched rapidly, allowing us to identify a subset of relevant documents efficiently.
- Full Content Retrieval: In the second stage, we retrieve the full text content of the documents identified in the initial search. This step is more computationally expensive, but by narrowing down the search space in the first stage, we minimize the amount of data that needs to be processed in this costlier operation (see the sketch after this list).
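Put together, a minimal sketch of the flow could look like the following. Note that vectorStore and fetchFullDocuments are hypothetical placeholders here; the concrete Supabase-backed implementations appear later in this post.

// Sketch of the two-step retrieval flow.
// Stage 1 searches compact (summary/keyword) embeddings; stage 2 restores the full content.
async function twoStepRetrieve(vectorStore, fetchFullDocuments, query, topN = 5) {
  // Stage 1: fast similarity search over the compact vector representations.
  const compactHits = await vectorStore.similaritySearch(query, topN);

  // Stage 2: use the document IDs carried in metadata to load the full texts.
  const ids = compactHits.map((doc) => doc.metadata.rawDocumentId);
  return fetchFullDocuments(ids);
}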
Why is this approach beneficial?
Storage Optimization: By storing only the vector representations and metadata in the initial search index, we significantly reduce the storage requirements compared to storing the entire text of all documents. This optimization becomes increasingly valuable as the document collection grows larger.
Improved Search Performance: Vector similarity searches over compact representations are generally faster than full-text searches, especially for large datasets. By performing the initial search over these compact vectors, we can quickly identify potentially relevant documents, reducing the overall search time.
Better Utilization of Vector Representations: Vector embeddings derived from summaries or keywords can often provide better representations of the core concepts and topics within a document compared to raw text. Searching over these embeddings can yield more accurate and relevant results, as the vector space captures semantic similarities more effectively.
Flexibility: This two-step approach allows for flexibility in the choice of vector embeddings and the level of summarization or keyword extraction applied. Depending on the specific use case and requirements, different techniques can be employed to optimize the trade-off between accuracy and efficiency.
In the subsequent sections, we’ll dive deeper into the implementation details, showcase code examples, and explore potential optimizations and variations of this approach.
Implementation Steps
The two-step retrieval approach involves several key steps, which we’ll explore in detail. Let’s walk through the process:
1. Document Processing
The first step is to process the raw document data into a format suitable for vector embedding and storage. This typically involves the following substeps:
- Text Extraction: If your documents are in formats like PDF, Word, or HTML, you’ll need to extract the plain text content.
- Text Splitting: Depending on the length of your documents, you may need to split them into smaller chunks or passages to improve the quality of the vector embeddings (a minimal splitting sketch follows this list).
- Summarization or Keyword Extraction: To create compact vector representations, you can either generate summaries of the text chunks using natural language processing techniques or extract relevant keywords/phrases.
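As a small illustration of the splitting substep, here is a minimal sketch using LangChain's CharacterTextSplitter. The chunk size is an assumption; the full ingestion script later in this post uses the same splitter.

import { CharacterTextSplitter } from "langchain/text_splitter";

// Split long text into roughly 1500-character chunks before summarization or keyword extraction.
const splitter = new CharacterTextSplitter({ chunkSize: 1500, keepSeparator: false });
const chunks = await splitter.splitText("...a long document goes here...");
console.log(chunks.length);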
2. Vector Embedding
Once you have the processed text (summaries or keywords), you’ll need to convert them into vector representations using a suitable embedding model. Popular choices include:
- Sentence Transformers: Pre-trained models like all-MiniLM-L6-v2 or multi-qa-MiniLM-L6-cos-v1 can generate high-quality sentence embeddings.
- Hugging Face Transformers: You can leverage pre-trained language models like bert-base-uncased or distilbert-base-uncased to generate embeddings.
- OpenAI Embeddings: OpenAI's text-embedding-ada-002 model is another option for generating embeddings.
We will use OpenAI embeddings (the newer text-embedding-3-large model) for demonstration purposes.
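For example, embedding a single piece of text with LangChain's OpenAIEmbeddings wrapper looks roughly like this (it assumes OPENAI_API_KEY is set in your environment):

import { OpenAIEmbeddings } from "@langchain/openai";

const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-large" });

// embedQuery returns an array of floats representing the text in vector space.
const vector = await embeddings.embedQuery("kısa bir Türkçe özet metni");
console.log(vector.length);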
3. Vector Storage And Raw Document Storage
After generating the vector embeddings, you’ll need to store them in a vector database or data store optimized for similarity searches.
Popular choices include:
- Supabase Vector Store: An open-source vector store built on top of Supabase's hosted Postgres database, using the pgvector extension.
- Chroma: An open-source embedding database that can run in memory or persist embeddings to local storage.
- Pinecone: A managed vector database service with advanced filtering and hybrid search capabilities.
Before saving the embeddings, we also need a second table, either in the same Supabase project or in another database solution, to hold the raw documents. When storing each embedding, we pass the raw row's ID as metadata so the full content can be fetched in the later steps, as sketched below.
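In a minimal sketch, the text we embed is the summary or keyword cloud, while the ID of the raw row travels along as metadata. The rawDocumentId key is the one used throughout the code later in this post; the ID value here is just an example.

import { Document } from "langchain/document";

// The compact text is what gets embedded; the raw row's ID is kept as metadata
// so the full content can be fetched in the second retrieval step.
const docForEmbedding = new Document({
  pageContent: "keyword cloud or summary of the chunk",
  metadata: { rawDocumentId: 42 }, // example ID of the matching row in documentsRaw
});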
4. Initial Vector Similarity Search
With the vector embeddings stored, you can now perform the initial vector similarity search. This involves:
- Converting the user’s query into a vector representation using the same embedding model as the documents.
- Querying the vector store to find the top N most similar vector embeddings to the query vector.
- Retrieving the metadata (e.g., document IDs) associated with the top results, as shown in the sketch below.
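With a LangChain vector store, these steps collapse into a single call: similaritySearch embeds the query with the same model and returns the top N matches together with their metadata. This is only a sketch; the concrete store setup appears in search.js later in this post.

// Assumes vectorStore has been initialized as shown later in search.js.
// topN controls how many compact candidates move on to full-content retrieval.
const topN = 5;
const hits = await vectorStore.similaritySearch("user query here", topN);
const documentIds = hits.map((doc) => doc.metadata.rawDocumentId);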
5. Full Content Retrieval
In the second stage, you’ll use the metadata obtained from the initial search to retrieve the full text content of the relevant documents. This may involve:
- Fetching the documents from a database, file system, or other storage based on the document IDs.
- Optionally, applying additional filtering, ranking, or re-ranking techniques on the full text content (a simple re-ranking sketch follows this list).
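If you want a simple re-ranking pass over the retrieved full texts, a naive keyword-overlap score is one option. This is purely illustrative and not part of the Supabase code that follows.

// Naive re-ranking: score each full document by how many query terms it contains.
function rerankByOverlap(query, docs) {
  const terms = query.toLowerCase().split(/\s+/);
  return docs
    .map((doc) => ({
      doc,
      score: terms.filter((t) => doc.pageContent.toLowerCase().includes(t)).length,
    }))
    .sort((a, b) => b.score - a.score)
    .map((entry) => entry.doc);
}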
Skip Theory, Let’s Code!
Supabase Helpers
For this demo, I will use Supabase as it's easy to use and free for starter projects. First, create a table named documentsRaw with an id column (default) and a content (text) column.
You either need to disable Row Level Security, or create a user, sign in with it, and create INSERT & SELECT policies. The following example keeps RLS (Row Level Security) enabled and uses an authenticated user.
Now, let's create a file named supabaseDocument.js:
import dotenv from "dotenv";
import { createClient } from "@supabase/supabase-js";
dotenv.config();

const sbApiKey = process.env.SUPABASE_API_KEY;
const sbUrl = process.env.SUPABASE_URL;

const sbClient = createClient(sbUrl, sbApiKey, {
  auth: {
    autoRefreshToken: false,
    persistSession: false,
    detectSessionInUrl: false,
  },
});

// Sign in so the INSERT & SELECT policies apply to an authenticated user.
async function authenticate() {
  const { error } = await sbClient.auth.signInWithPassword({
    email: "fill actual email",
    password: "fill with password",
  });
  if (error) {
    console.log("error", error);
  }
}

// Insert one or more raw contents into the documentsRaw table and return the created rows (with their IDs).
export async function addDocumentData(contents) {
  await authenticate();
  const documents = contents.map((content) => ({ content }));
  const { data, error } = await sbClient.from("documentsRaw").insert(documents).select();
  if (error) {
    console.log(error);
    throw error;
  }
  return data;
}

// Fetch raw documents back, either by a single ID or by an array of IDs.
export async function getDocumentData(ids) {
  await authenticate();
  let query = sbClient.from("documentsRaw").select("*");
  query = Array.isArray(ids) ? query.in("id", ids) : query.eq("id", ids);
  const { data, error } = await query;
  if (error) {
    console.log(error);
    throw error;
  }
  return Array.isArray(ids) ? data : data[0];
}
These functions are only for inserting raw documents into the table and retrieving them later by ID.
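On their own, the helpers can be used like this (the inserted texts are just examples):

import { addDocumentData, getDocumentData } from "./supabaseDocument.js";

// Insert two raw documents; Supabase returns the created rows, including their IDs.
const rows = await addDocumentData(["first raw text", "second raw text"]);
console.log(rows.map((row) => row.id));

// Fetch a single row back by ID, or several rows by passing an array of IDs.
const single = await getDocumentData(rows[0].id);
const many = await getDocumentData(rows.map((row) => row.id));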
Populating the Databases: Ingesting Files and Raw Documents
Please follow the setup guide at the following URL to configure Supabase to work with LangChain:
https://js.langchain.com/docs/integrations/vectorstores/supabase
Let's create a file named ingestor.js. I've described every critical step in a comment on the relevant line.
import { TextLoader } from "langchain/document_loaders/fs/text";
import { DirectoryLoader } from "langchain/document_loaders/fs/directory";
import dotenv from "dotenv";
dotenv.config();
import { OpenAIEmbeddings } from "@langchain/openai";
import { CharacterTextSplitter } from "langchain/text_splitter";
import { SupabaseVectorStore } from "@langchain/community/vectorstores/supabase";
import { createClient } from "@supabase/supabase-js";
import { Document } from "langchain/document";
import pit from "p-iteration"; // npm install p-iteration
import { default as summarizer } from "./summarizeChain.js"; // summarizes content with Anthropic; see the code below
import { addDocumentData } from "./supabaseDocument.js"; // we created this earlier

const sbApiKey = process.env.SUPABASE_API_KEY;
const sbUrl = process.env.SUPABASE_URL;
const client = createClient(sbUrl, sbApiKey); // initializing the Supabase client

const embeddings = new OpenAIEmbeddings({ // embedding model
  model: "text-embedding-3-large",
});

const path = "./documents/";
const directoryLoader = new DirectoryLoader(path, { // loading multiple .txt files
  ".txt": (path) => new TextLoader(path),
});

const splitter = new CharacterTextSplitter({ // splitting text into chunks
  keepSeparator: false,
  chunkSize: 1500,
});

(async () => {
  const loaded_docs = await directoryLoader.load(); // load all documents
  const text = await splitter.splitDocuments(loaded_docs); // split documents into chunks; this returns an array of Document(s)
  const textSummarized = await pit.mapSeries(text, async (document) => { // map each chunk to a new, summarized Document
    const addToSupabaseRaw = await addDocumentData([document.pageContent]); // add pageContent to the documentsRaw table in Supabase
    const summary = await summarizer(document.pageContent); // summarize content and extract keywords / keyword clouds
    console.log("created row", addToSupabaseRaw[0].id, summary);
    return new Document({
      metadata: { ...document.metadata, rawDocumentId: addToSupabaseRaw[0].id },
      pageContent: summary,
    }); // our new Document: pageContent is now the summary, and the rawDocumentId metadata lets us restore the raw text later
  });
  const vectorStore = await SupabaseVectorStore.fromDocuments(
    textSummarized,
    embeddings,
    {
      client,
    }
  ); // this step creates the embeddings and saves them in the Supabase vector store
})();
Here is the content of summarizeChain.js. Feel free to adapt the template to your needs.
import dotenv from "dotenv";
dotenv.config();
import { PromptTemplate } from "@langchain/core/prompts";
import { ChatAnthropic } from "@langchain/anthropic";
import { StringOutputParser } from "@langchain/core/output_parsers";

const anthropicModel = "claude-3-haiku-20240307";
const model = new ChatAnthropic({
  temperature: 0.9,
  model: anthropicModel,
  // In Node.js this defaults to process.env.ANTHROPIC_API_KEY,
  // apiKey: "YOUR-API-KEY",
  maxTokens: 4096,
}); // set up the LLM

const summarizeInputTemplate = `Without including personal information and private details, convert the given text into a tag cloud.
In this context, embeddings will be created so that they can be easily retrieved when another text is used.
Therefore, summarize what the text is about, what its demands are, and what it pertains to, generating keywords.
Create very detailed groups of keywords.
There's no need for proper names or place names. Look at the big picture; you're an excellent content analyst.
Here's the text: {content}`; // this is our template; invoking the chain fills in {content}

const summarizeInputPrompt = PromptTemplate.fromTemplate(summarizeInputTemplate);
const summarizeChain = summarizeInputPrompt.pipe(model).pipe(new StringOutputParser());

const run = async (content) => { // main function: receives raw content and returns the summarized keyword cloud
  const response = await summarizeChain.invoke({
    content: content,
  });
  return response;
};

export default run;
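Used on its own, the summarizer is just an async function that takes raw text and returns the keyword cloud (the Turkish sample text here is only a placeholder):

import summarizer from "./summarizeChain.js";

// Turns a raw chunk into a compact keyword/tag cloud suitable for embedding.
const keywordCloud = await summarizer("uzun bir Türkçe metin buraya gelir...");
console.log(keywordCloud);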
Now, if you place some .txt files into the "documents" folder inside your project folder and run ingestor.js, your documentsRaw table will be filled with the actual content and your documents table with the summarized embedding data.
Two-Step Retrieval
Create another file named search.js
import { SupabaseVectorStore } from "@langchain/community/vectorstores/supabase";
import { createClient } from "@supabase/supabase-js";
import { OpenAIEmbeddings } from "@langchain/openai";
import { Document } from "langchain/document";
import pit from "p-iteration"; // npm install p-iteration
import { getDocumentData } from "./supabaseDocument.js"; // we created this earlier; now we use getDocumentData
import dotenv from "dotenv";
dotenv.config();

const sbApiKey = process.env.SUPABASE_API_KEY;
const sbUrl = process.env.SUPABASE_URL;
const client = createClient(sbUrl, sbApiKey);

const embeddings = new OpenAIEmbeddings({
  model: "text-embedding-3-large",
});

const run = async () => {
  const vectorStore = new SupabaseVectorStore( // initializing the vector store
    embeddings,
    {
      client,
      tableName: "documents",
      queryName: "match_documents",
    }
  );
  const query = "your query for searching";
  // Here we could add another chain, similar to the summarizer, that analyzes the query
  // and turns it into a keyword cloud for further improvements.
  const results = await vectorStore.similaritySearch(query, 5); // search over the summarized embeddings (keyword clouds)
  const resultsRestored = await pit.mapSeries(results, async (document) => {
    const getFromSupabase = await getDocumentData(document.metadata.rawDocumentId); // use rawDocumentId from metadata to fetch the actual content from the documentsRaw table
    return new Document({
      metadata: { ...document.metadata },
      pageContent: getFromSupabase.content,
    }); // populate new Documents with the actual raw content
  });
  console.log(resultsRestored);
};

run();
The two-step retrieval approach offers an efficient solution for working with large document collections by leveraging vector embeddings while balancing performance and accuracy. While it has advantages, potential limitations like the quality of embeddings, effectiveness of summarization techniques, and choice of vector database should be considered.
Despite these challenges, this approach remains valuable in natural language processing and information retrieval, with the potential for further enhancements as new advancements emerge. Maintaining code quality, documentation, and robust error handling practices is crucial for long-term maintainability and scalability. Ultimately, this approach represents a powerful tool for enabling efficient and accurate information access, paving the way for tackling complex tasks in data and knowledge management.