Danny Chan for MongoDB Builders

🚀 Tutorial: Embed financial PDF reports locally and query them with MongoDB vector search

Step 1: Create a database cluster
Step 2: Enter the cluster information
Step 3: Wait for the cluster to deploy
Step 4: Add an IP address to the network access list
Step 5: Add a database user
Step 6: Connect to your local Atlas deployment or Atlas cluster
Step 7: Retrieve text from the PDF
Step 8: Embed the PDF text locally and insert it into MongoDB
Step 9: Check the documents in the collection
Step 10: Create a vector search index
Step 11: Query with the vector search index



Step 1: Create a database cluster




Step 2: Enter the cluster information




Step 3: Wait for the cluster to deploy




Step 4: Add an IP address to the network access list




Step 5: Add a database user




Step 6: Connect to your local Atlas deployment or Atlas cluster

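The connection string used in Steps 8 and 11 contains `<username>` and `<password>` placeholders. A small sketch of filling them in, assuming the credentials may contain characters such as `@` or `/` that have to be percent-encoded inside a URI. `build_atlas_uri`, the credentials, and the host name are hypothetical and only serve as an illustration:

```python
from urllib.parse import quote_plus


def build_atlas_uri(username: str, password: str, host: str) -> str:
    """Return a mongodb+srv URI with percent-encoded credentials."""
    # quote_plus escapes reserved characters (@, :, /, ...) that would
    # otherwise be misread as URI delimiters.
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}/"


# Hypothetical credentials and host, for illustration only.
uri = build_atlas_uri("app_user", "p@ss/word", "cluster0.example.mongodb.net")
print(uri)
```

The resulting string can be passed to `MongoClient(uri)`. Running `client.admin.command("ping")` afterwards is a quick way to check that the network access entry from Step 4 and the database user from Step 5 work.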



Step 7: Retrieve text from the PDF

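This step does not state which tool was used for the extraction, so the following is only one possible approach: a minimal sketch that assumes the `pypdf` package is installed and that the report is a text-based (not scanned) PDF. The function names are hypothetical.

```python
from pathlib import Path


def extract_pages(pdf_path: str) -> list[str]:
    """Return the text of each page of a PDF, in page order."""
    # Imported here so save_text can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can yield nothing for image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


def save_text(pages: list[str], out_path: str) -> None:
    """Write all page texts to one UTF-8 file, one page after another."""
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(pages), encoding="utf8")
```

A call such as `save_text(extract_pages("report.pdf"), "./txt_final/payment.txt")` would produce the file that Step 8 reads; `report.pdf` is a placeholder name.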



Step 8: Embed the PDF text locally and insert it into MongoDB

Install the dependencies:

pip install sentence-transformers==2.7.0
pip install pymongo==4.7.2
pip install langchain==0.2.6
pip install pandas==2.2.0
pip install langchain-openai==0.1.20
pip install langchain-chroma==0.1.0
pip install langchain-core==0.2.26
pip install langchain-huggingface==0.0.3
pip install langchain-mongodb==0.1.4

Then read the extracted text, embed each page locally, and insert the results into MongoDB:

from pymongo import MongoClient
from langchain_huggingface import HuggingFaceEmbeddings

print("get documents")
# Text extracted from the PDF in Step 7
with open("./txt_final/payment.txt", "r", encoding="utf8") as file:
    data = file.read()

print("Split txt into documents by page")
# The footer text is used as the page separator
splits = [page.strip() for page in data.split("www.iresearch.com.cn")]
# Drop empty chunks (for example, the one after the last footer)
splits = [page for page in splits if page]

print("get model then embedding")
# The model is downloaded on first use and then runs locally
model = HuggingFaceEmbeddings(model_name="BAAI/bge-large-zh-v1.5")

print("Connect to your local Atlas deployment or Atlas Cluster")
# Replace <username>, <password>, and the host with your own values
mongo_client = MongoClient("mongodb+srv://<username>:<password>@internal-knowledge-base.xxxxx.mongodb.net/")

collection = mongo_client["internal-knowledge-base"]["papers"]

# Store each page together with its embedding
for split in splits:
    embedding = model.embed_query(split)
    collection.insert_one({"text_embedding": embedding, "summary": split})
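The loop above issues one `insert_one` call per page. As an alternative sketch, the documents can be prepared first and written in a single `insert_many` call. `build_documents` is a hypothetical helper; it takes the embedding function as a parameter so that it can be tried without downloading the model.

```python
from typing import Callable


def build_documents(
    pages: list[str],
    embed: Callable[[str], list[float]],
) -> list[dict]:
    """Pair each non-empty page with its embedding, using this step's field names."""
    documents = []
    for page in pages:
        text = page.strip()
        if not text:
            # Splitting on a footer marker can leave empty chunks behind.
            continue
        documents.append({"text_embedding": embed(text), "summary": text})
    return documents


def fake_embed(text: str) -> list[float]:
    """Stand-in for model.embed_query, for illustration only."""
    return [float(len(text))]


docs = build_documents(["first page ", "", "second page"], fake_embed)
print(len(docs))
```

With the real model and collection, this would become `collection.insert_many(build_documents(splits, model.embed_query))`.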



Step 9: Check the documents in the collection


[Image: PDF page 3]

[Image: PDF page 4]

Data structure of a stored document (the embedding is shortened to two values here; each stored vector has 1024 values, matching the index in Step 10):

{
    "_id": "66b79fd22e6781dc9195820f",
    "text_embedding": [0.019098538905382156, -0.0010181389516219497],
    "summary": "Diversified development paths for third-party payment platforms Third-party payment platforms integrate into every detail of consumer life through lightweight reach...."
}



Step 10: Create a vector search index


{
  "fields": [
    {
      "type": "vector",
      "path": "text_embedding",
      "numDimensions": 1024,
      "similarity": "cosine"
    }
  ]
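}

Name the index vector_index, because Step 11 refers to it by that name. numDimensions is 1024 because BAAI/bge-large-zh-v1.5 produces 1024-dimensional vectors, and path points at the text_embedding field written in Step 8.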
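The `similarity` field selects how two embeddings are compared. As an illustration only (Atlas computes this internally), cosine similarity is the dot product of two vectors divided by the product of their lengths. The function below is a plain-Python sketch, not part of the original tutorial:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 2-dimensional vectors; the embeddings in this tutorial have 1024 values.
same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Vectors pointing in the same direction score close to 1 and unrelated (orthogonal) vectors score close to 0, which is why pages about the queried topic rank first.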

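The same index can also be created from code instead of the Atlas UI. A hedged sketch, assuming PyMongo 4.7 or later (where `SearchIndexModel` accepts `type="vectorSearch"`) and the index name `vector_index` used in Step 11. Both helper names are hypothetical:

```python
def vector_index_definition(path: str, dimensions: int, similarity: str = "cosine") -> dict:
    """Build an Atlas Vector Search index definition like the JSON above."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": dimensions,
                "similarity": similarity,
            }
        ]
    }


def create_vector_index(collection, name: str = "vector_index") -> str:
    """Create the index on a PyMongo collection and return its name."""
    # Imported here so the definition helper can be used without PyMongo.
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(
        definition=vector_index_definition("text_embedding", 1024),
        name=name,
        type="vectorSearch",
    )
    return collection.create_search_index(model)


definition = vector_index_definition("text_embedding", 1024)
print(definition)
```

Index builds run in the background, so a newly created index may need a short while before queries can use it.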



Step 11: Query with the vector search index

from pymongo import MongoClient
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch
import pprint

print("Connect to your local Atlas deployment or Atlas Cluster")
# Replace <username>, <password>, and the host with your own values
mongo_client = MongoClient("mongodb+srv://<username>:<password>@internal-knowledge-base.xxxxx.mongodb.net/")

collection = mongo_client["internal-knowledge-base"]["papers"]

# Use the same model that produced the stored embeddings in Step 8
model = HuggingFaceEmbeddings(model_name="BAAI/bge-large-zh-v1.5")

vector_store = MongoDBAtlasVectorSearch(
    collection=collection,
    embedding=model,
    index_name="vector_index",       # index created in Step 10
    embedding_key="text_embedding",  # field that holds the vectors
    text_key="summary",              # field that holds the page text
)

query = "蚂蚁集团"  # "Ant Group"
results = vector_store.similarity_search(query)
pprint.pprint(results)

Result:
English version

[
    Document(metadata={'_id': {'$oid': '66b79fdc2e6781dc91958211'}}, page_content='Ant Group-Alipay Ecological Foundation'),
    Document(metadata={'_id': {'$oid': '66b79fcd2e6781dc9195820e'}}, page_content='The competitive landscape of independent third-party payment platforms has formed, led by Alipay'),
    Document(metadata={'_id': {'$oid': '66b79fc32e6781dc9195820c'}}, page_content='Aikan Series Monthly Inventory of Tourism Activity in Scenic Areas')
]

Chinese version

[
    Document(metadata={'_id': {'$oid': '66b79fdc2e6781dc91958211'}}, page_content='蚂蚁集团—支付宝生态筑基'),
    Document(metadata={'_id': {'$oid': '66b79fcd2e6781dc9195820e'}}, page_content='独立第三方支付平台竞争格局形成以支付宝为首'),
    Document(metadata={'_id': {'$oid': '66b79fc32e6781dc9195820c'}}, page_content='-艾瞰系列-景区旅游活跃度盘点月报')
]
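`similarity_search` hides the underlying query. For readers who prefer plain PyMongo, the search can also be written as an aggregation with the `$vectorSearch` stage. This is a sketch under assumptions rather than the exact query LangChain sends: `build_vector_search_pipeline` is a hypothetical helper, and the `numCandidates` value is an arbitrary choice.

```python
def build_vector_search_pipeline(
    query_vector: list[float],
    limit: int = 4,
    num_candidates: int = 100,
) -> list[dict]:
    """Aggregation pipeline for the index and field names used in this tutorial."""
    return [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "text_embedding",
                "queryVector": query_vector,
                # Number of nearest neighbours considered; must be >= limit.
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            # Return the page text and the similarity score, not the raw vector.
            "$project": {
                "_id": 0,
                "summary": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


# Toy 3-value vector; a real query vector comes from model.embed_query(query).
pipeline = build_vector_search_pipeline([0.1, 0.2, 0.3], limit=3)
print(pipeline[0]["$vectorSearch"]["limit"])
```

With the objects from this step, `collection.aggregate(build_vector_search_pipeline(model.embed_query(query)))` would return the matching pages together with their scores.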



References:

https://python.langchain.com/v0.2/docs/tutorials/pdf_qa/
Build a PDF ingestion and Question/Answering system

https://www.mongodb.com/docs/atlas/atlas-vector-search/create-embeddings/
How to Create Vector Embeddings

https://www.mongodb.com/docs/atlas/atlas-vector-search/tutorials/local-rag/#std-label-local-rag
Build a Local RAG Implementation with Atlas Vector Search

https://www.mongodb.com/docs/atlas/atlas-vector-search/ai-integrations/langchain/#std-label-langchain
Get Started with the LangChain Integration

https://docs.llamaindex.ai/en/stable/examples/embeddings/huggingface/
Local Embeddings with HuggingFace


Editors

Danny Chan, specializing in FSI and serverless

Kenny Chan, specializing in FSI and machine learning