
Ayush for Intuit Developers


Building AI-powered search using LangChain and Milvus

This blog is co-authored by Ayush Pandey, Senior Software Engineer, and Amit Kaushal, Software Manager, at Intuit.

Artificial Intelligence (AI) has revolutionized the way we interact with technology, and one of the most significant applications of AI is in search. With the help of AI, search tools can surface more accurate and relevant results to users. In this blog, we will discuss how to build an AI-powered search engine using LangChain and Milvus.

Before we dive into the demo, let’s talk through some of the concepts and tools involved.

What is LangChain?

LangChain is a framework for developing applications powered by language models. Use cases include applications for document question answering, building conversational interfaces for database interactions, and much more. We believe that the most powerful and differentiated applications will not only leverage a language model, but will also be:
  • Data-aware: connect a language model to other sources of data
  • Agentic: allow a language model to interact with its environment

LangChain provides the modular components used to build such applications, which can be used standalone or combined for more complexity.

What is Milvus?

Milvus was created in 2019 with a singular goal: to store, index, and manage massive embedding vectors generated by deep neural networks and other machine learning (ML) models.

As a database specifically designed to handle queries over input vectors, it is capable of indexing vectors at trillion scale. Unlike existing relational databases, which mainly deal with structured data following a pre-defined pattern, Milvus is designed from the ground up to handle embedding vectors converted from unstructured data.

Vector embeddings: why the hype?

Vector embeddings are a powerful tool for developers working with natural language processing (NLP) and ML applications. Vector embeddings are a way of representing words or phrases as vectors in a high-dimensional space, where each dimension represents a different feature of the word or phrase. This allows developers to perform complex operations on text data, such as sentiment analysis, text classification, and machine translation.

Let’s go over a simple explainer of semantic features. Scientists sort the different types of animals in the world into categories based on certain characteristics. For example, birds are warm-blooded vertebrates that are adapted to fly. Based on features like these, we can create word coordinates that represent animals by their Type and Domestication score. These scores are called “semantic features,” and each one captures part of the meaning of a word. Now that the words have corresponding numerical values, we can plot them as points on a graph, where the x-axis represents Type and the y-axis represents Domestication score.

[Figure: Word coordinates for animal types, plotted on a graph]

We can add new words to the plot based on their meanings. For example, where should the words "Lions" and "Parrots" go? How about "Whales"? Or "Snakes"?
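One way to make this placement concrete is to give each new word its own coordinates and look for the closest existing word. The snippet below is a minimal sketch of that idea. The scores are invented for illustration (they are not the values from the original table), and `nearest_word` is a hypothetical helper name.

```python
import math

# Illustrative (type, domestication) coordinates. These values are made up
# for this sketch; they are NOT the scores from the original table.
word_coordinates = {
    "dog": (1.0, 9.0),
    "cat": (1.0, 8.0),
    "wolf": (1.0, 1.0),
    "chicken": (3.0, 8.0),
    "eagle": (3.0, 1.0),
}


def nearest_word(point, coordinates):
    """Return the known word whose coordinates are closest to `point`."""
    return min(coordinates, key=lambda word: math.dist(point, coordinates[word]))


# Hypothetical placements: a lion as a wild mammal, a parrot as a fairly tame bird.
lion = (1.0, 0.5)
parrot = (3.0, 6.5)

print(nearest_word(lion, word_coordinates))
print(nearest_word(parrot, word_coordinates))
```

Real embeddings work the same way in spirit, except that the coordinates are learned by a model and there are hundreds or thousands of dimensions instead of two.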


There are also several libraries and tools available for developers who want to work with vector embeddings. Some popular libraries include Gensim, TensorFlow, and PyTorch. Gensim, for example, gives access to pre-trained word2vec and GloVe models, and all three can be used to train custom embedding models on specific datasets.

Demo: Using similarity search to answer questions about a Wikipedia article

First, let’s go through some prerequisites.

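Install LangChain and the Milvus Python client (pymilvus), along with the other packages used below. You will also need a running Milvus instance to connect to. WebBaseLoader depends on beautifulsoup4, so it is included here:

! python -m pip install --upgrade pymilvus langchain openai tiktoken beautifulsoup4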

Then, import required modules:

from langchain.text_splitter import CharacterTextSplitter  # splits documents into chunks
from langchain.vectorstores import Milvus  # LangChain wrapper around a Milvus collection
from langchain.document_loaders import WebBaseLoader  # loads web pages as documents

Next, set your OpenAI API key:

import os
import getpass

# Prompt for the key rather than hardcoding it in the notebook
os.environ['OPENAI_API_KEY'] = getpass.getpass("OpenAI API key: ")

Then, load a Wikipedia document (here we’re grabbing the article for Intuit QuickBooks) using WebBaseLoader, and split it into chunks:

loader = WebBaseLoader([
   "https://en.wikipedia.org/wiki/QuickBooks",
])

docs = loader.load()
# Split the documents into smaller chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(docs)
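To build some intuition for what chunk_size and chunk_overlap control, here is a simplified fixed-width chunker in plain Python. It is only a sketch: `chunk_text` is a hypothetical helper, and this is not how CharacterTextSplitter works internally (the LangChain splitter first splits on a separator and then merges the pieces up to the size limit).

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Split text into windows of at most chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbouring windows share up to chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "QuickBooks is an accounting software package."
chunks = chunk_text(sample, chunk_size=20, chunk_overlap=5)
for chunk in chunks:
    print(repr(chunk))
```

A larger overlap lowers the chance that a sentence relevant to a query is cut in half at a chunk boundary, at the cost of storing more (partly duplicated) text.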

Next, embed the chunks with OpenAIEmbeddings and store them in a Milvus vector database. Replace HostName with the address of your Milvus instance:

from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
vector_db = Milvus.from_documents(
   docs,
   embeddings,
   connection_args={"host": "HostName", "port": "19530"},
)

It’s time to try semantic searching! Let’s ask a question using LangChain and Milvus:

query = "What is quickbooks?"
docs = vector_db.similarity_search(query)
docs[0].page_content

Output:

'Retrieved from "https://en.wikipedia.org/w/index.php?title=QuickBooks&oldid=1155606425"\nCategories: Accounting softwareIntuit softwareHidden categories: CS1 maint: url-statusArticles with short descriptionShort description is different from WikidataUse mdy dates from March 2019Articles containing potentially dated statements from May 2014All articles containing potentially dated statements\n\n\n\n\n\n\n This page was last edited on 18 May 2023, at 23:04\xa0(UTC).\nText is available under the Creative Commons Attribution-ShareAlike License 3.0;\nadditional terms may apply. By using this site, you agree to the Terms of Use and Privacy Policy. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization.\n\n\nPrivacy policy\nAbout Wikipedia\nDisclaimers\nContact Wikipedia\nMobile view\nDevelopers\nStatistics\nCookie statement\n\n\n\n\n\n\n\n\n\n\n\nToggle limited content width'
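Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below illustrates that ranking step with a brute-force scan in plain Python. It makes several assumptions: the three-dimensional vectors are toy values rather than real embeddings, cosine similarity is just one possible metric, and `top_k` is a hypothetical helper. Milvus itself relies on vector indexes and a configurable metric rather than a linear scan like this one.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed, k=4):
    """Return the k (text, score) pairs most similar to query_vector."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings" invented for this sketch; real ones come from a model.
indexed_chunks = [
    ("QuickBooks is an accounting software package.", [0.9, 0.1, 0.0]),
    ("This page was last edited on 18 May 2023.", [0.1, 0.9, 0.2]),
    ("QuickBooks Online is a cloud service.", [0.8, 0.2, 0.1]),
]
query_vector = [1.0, 0.0, 0.0]

results = top_k(query_vector, indexed_chunks, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

Because the ranking depends only on vector closeness, a chunk of page boilerplate can still come back first if its embedding happens to sit near the query.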

The top result returned by the similarity search is raw page text (in this case, the article's footer), so it needs a lot of cleanup before it is useful as an answer. Let’s pass the retrieved documents to load_qa_with_sources_chain instead for a cleaner output:

from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.llms import OpenAI

chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type="map_reduce", return_intermediate_steps=True)
query = "What is quickbooks?"
chain({"input_documents": docs, "question": query}, return_only_outputs=True)

Output:
{'intermediate_steps': [' No relevant text.',
' QuickBooks is an accounting software package developed and marketed by Intuit. First introduced in 1983, QuickBooks products are geared mainly toward small and medium-sized businesses and offer on-premises accounting applications as well as cloud-based versions that accept business payments, manage and pay bills, and payroll functions.',
" Intuit also offers a cloud service called QuickBooks Online (QBO). The user pays a monthly subscription fee rather than an upfront fee and accesses the software exclusively through a secure logon via a Web browser. QuickBooks Online is supported on Chrome, Firefox, Internet Explorer 10, Safari 6.1, and also accessible via Chrome on Android and Safari on iOS 7. Quickbooks Online offers integration with other third-party software and financial services, such as banks, payroll companies, and expense management software. QuickBooks desktop also supports a migration feature where customers can migrate their desktop data from a pro or prem SKU's to Quickbooks Online.",
' QuickBooks - Wikipedia \nInitial release, Subsequent releases, QuickBooks Online, QuickBooks Point of Sale, Add-on programs.'],
'output_text': ' QuickBooks is an accounting software package developed and marketed by Intuit. It offers on-premises accounting applications as well as cloud-based versions that accept business payments, manage and pay bills, and payroll functions. QuickBooks Online is a cloud service that offers integration with other third-party software and financial services.\nSOURCES: https://en.wikipedia.org/wiki/QuickBooks'}
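The intermediate_steps in this output reflect the map_reduce pattern: the model is called once per retrieved document, and a final call combines those per-document extracts into one answer. Here is a rough sketch of that control flow. The prompts, the `map_reduce_qa` function, and the `recording_llm` stand-in are all hypothetical and are not LangChain's actual prompts or implementation.

```python
def map_reduce_qa(documents, question, llm):
    """Answer `question` by querying each document, then combining the results.

    `llm` is any callable that maps a prompt string to a response string.
    """
    # Map step: one call per document, extracting whatever is relevant.
    intermediate_steps = [
        llm(f"Extract text relevant to the question.\nQuestion: {question}\nDocument: {doc}")
        for doc in documents
    ]
    # Reduce step: a single final call over all of the extracts.
    combined = "\n".join(intermediate_steps)
    output_text = llm(
        f"Answer the question using these extracts.\nQuestion: {question}\nExtracts: {combined}"
    )
    return {"intermediate_steps": intermediate_steps, "output_text": output_text}


calls = []


def recording_llm(prompt):
    """Stand-in for a real model: records the prompt and returns a label."""
    calls.append(prompt)
    return f"response {len(calls)}"


result = map_reduce_qa(["doc A", "doc B", "doc C"], "What is QuickBooks?", recording_llm)
print(result)
```

With three documents the stand-in is called four times (three map calls and one reduce call), which is why the number of model calls, and the cost, grows with the number of retrieved documents.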

Other search-related use cases using LangChain and Milvus:

  • E-commerce search engine: product descriptions and reviews can be converted into vectors using an embedding model. The vectors can then be indexed in Milvus, and a search interface can be built to retrieve relevant products based on user queries.
  • Image search engine: images, or their captions and tags, can be converted into vectors using an embedding model. The vectors can then be indexed in Milvus, and a search interface can be built to retrieve relevant images based on user queries.
  • Video search engine: videos, or their titles and descriptions, can be converted into vectors using an embedding model. The vectors can then be indexed in Milvus, and a search interface can be built to retrieve relevant videos based on user queries.

By following the simple steps we’ve outlined here, developers can use LangChain and Milvus to build search engines for use cases ranging from simple document search to e-commerce, image, and video search. We hope this was a helpful starter guide. Please leave a comment if you have any further questions!
