DEV Community

Danny Chan for MongoDB Builders

🚀 Tutorial: Vector Search over Financial PDF Reports with MongoDB (Local Embeddings)

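Step 1: Create a database cluster
Step 2: Enter the database cluster information
Step 3: Wait for the cluster to deploy
Step 4: Add an IP address to the network access list
Step 5: Add a database access user
Step 6: Connect to a local Atlas deployment or an Atlas cluster
Step 7: Extract text from the PDF
Step 8: Embed the PDF text locally, then insert it into MongoDB
Step 9: Check the documents in the collection
Step 10: Create a vector search index
Step 11: Query through the vector search index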



Step 1: Create a database cluster




Step 2: Enter the database cluster information




Step 3: Wait for the cluster to deploy




Step 4: Add an IP address to the network access list




Step 5: Add a database access user




Step 6: Connect to a local Atlas deployment or an Atlas cluster

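One detail that can trip up this step: if the database user's name or password from Step 5 contains characters such as `@`, `:` or `/`, they need to be percent-encoded before they go into the connection string. The sketch below shows one way to build the URI with the standard library. The host is the placeholder used later in this tutorial, the credentials are made up, and `build_atlas_uri` is a hypothetical helper name, not part of PyMongo.

```python
from urllib.parse import quote_plus


def build_atlas_uri(username: str, password: str, host: str) -> str:
    """Return a mongodb+srv URI with percent-encoded credentials.

    Illustrative helper only; it is not a PyMongo function.
    """
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}/"


# Made-up credentials; replace them with the values from Steps 5 and 6.
uri = build_atlas_uri(
    "report_reader",
    "p@ss:word/1",
    "internal-knowledge-base.xxxxx.mongodb.net",
)
print(uri)
```

The resulting string can be passed to `MongoClient(uri)`. Calling `client.admin.command("ping")` afterwards is a quick way to confirm that the network access list and the database user are set up correctly.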



Step 7: Extract text from the PDF

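This step does not name the tool used for extraction, so the following is only one possible approach. It assumes the third-party `pypdf` package (`pip install pypdf`), which is not in this tutorial's dependency list. The sketch reads each page's text and writes everything to the `./txt_final/payment.txt` file that Step 8 reads. The input path `payment.pdf` is a placeholder, and both function names are hypothetical.

```python
from pathlib import Path


def extract_pages(pdf_path: str) -> list[str]:
    """Return the text of each PDF page (assumes the third-party pypdf package)."""
    from pypdf import PdfReader  # imported lazily so save_text works without pypdf

    reader = PdfReader(pdf_path)
    # extract_text() may yield nothing for image-only pages, hence the fallback.
    return [page.extract_text() or "" for page in reader.pages]


def save_text(pages: list[str], out_path: str) -> None:
    """Join the page texts with newlines and write them as UTF-8."""
    out_file = Path(out_path)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    out_file.write_text("\n".join(pages), encoding="utf8")


# Usage (both paths are placeholders):
# save_text(extract_pages("payment.pdf"), "./txt_final/payment.txt")
```

Scanned reports without a text layer would need OCR instead, which this sketch does not cover.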



Step 8: Embed the PDF text locally, then insert it into MongoDB

Install the dependencies:

pip install sentence-transformers==2.7.0
pip install pymongo==4.7.2
pip install langchain==0.2.6
pip install pandas==2.2.0
pip install langchain-openai==0.1.20
pip install langchain-chroma==0.1.0
pip install langchain-core==0.2.26
pip install langchain-huggingface==0.0.3
pip install langchain-mongodb==0.1.4

Then run the following script:

from pymongo import MongoClient
from langchain_huggingface import HuggingFaceEmbeddings

# Read the text extracted from the PDF in Step 7.
print("get documents")
with open("./txt_final/payment.txt", "r", encoding="utf8") as file:
    data = file.read()

# The string "www.iresearch.com.cn" marks the page boundaries in the extracted
# text, so splitting on it gives one chunk per page. Empty chunks are dropped.
print("Split txt into documents by page")
splits = [s.strip() for s in data.split("www.iresearch.com.cn") if s.strip()]

# Load the embedding model locally (it is downloaded from Hugging Face on first use).
print("get model then embedding")
model = HuggingFaceEmbeddings(model_name="BAAI/bge-large-zh-v1.5")

print("Connect to your local Atlas deployment or Atlas Cluster")
mongo_client = MongoClient("mongodb+srv://<username>:<password>@internal-knowledge-base.xxxxx.mongodb.net/")
collection = mongo_client["internal-knowledge-base"]["papers"]

# Store each page's text together with its embedding vector.
for split in splits:
    embedding = model.embed_query(split)
    collection.insert_one({"text_embedding": embedding, "summary": split})
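The loop above makes one round trip to the database per page. For longer reports it may be worth grouping the documents and sending them with `collection.insert_many`. The sketch below shows only the batching logic. `build_documents` and `batched` are hypothetical helper names, and `fake_embed` stands in for `model.embed_query` so the example runs without downloading the model.

```python
from typing import Callable, Iterator


def build_documents(splits: list[str], embed: Callable[[str], list[float]]) -> list[dict]:
    """Pair each non-empty chunk with its embedding, using this step's field names."""
    return [
        {"text_embedding": embed(text), "summary": text}
        for text in (s.strip() for s in splits)
        if text
    ]


def batched(items: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fake_embed(text: str) -> list[float]:
    """Stand-in for model.embed_query; returns a tiny dummy vector."""
    return [float(len(text)), 0.0]


docs = build_documents(["page 1 text", "  ", "page 2 text", "page 3 text"], fake_embed)
for batch in batched(docs, size=2):
    # With a real collection this line would be: collection.insert_many(batch)
    print(len(batch), [d["summary"] for d in batch])
```

The batch size of 2 is only for illustration; a larger value would be more typical in practice.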



Step 9: Check the documents in the collection


[Image: PDF page 3]

[Image: PDF page 4]

Data structure (the embedding is truncated here; the stored vector has 1024 values)

{
    "_id": "66b79fd22e6781dc9195820f",
    "text_embedding": [0.019098538905382156, -0.0010181389516219497],
    "summary": "Diversified development paths for third-party payment platforms Third-party payment platforms integrate into every detail of consumer life through lightweight reach...."
}
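When spot-checking the collection, two things matter for the next steps: every document should have a non-empty `summary`, and every `text_embedding` should have as many values as the `numDimensions` setting of the index (1024 in this tutorial). A small checker along these lines could be run over the documents returned by `collection.find()`. `check_document` is a hypothetical helper, and the sample documents are made up.

```python
EXPECTED_DIMENSIONS = 1024  # matches numDimensions in the index definition


def check_document(doc: dict, dims: int = EXPECTED_DIMENSIONS) -> list[str]:
    """Return a list of problems found in one stored document (empty if none)."""
    problems = []
    summary = doc.get("summary")
    if not isinstance(summary, str) or not summary.strip():
        problems.append("summary is missing or empty")
    embedding = doc.get("text_embedding")
    if not isinstance(embedding, list):
        problems.append("text_embedding is missing or not a list")
    elif len(embedding) != dims:
        problems.append(f"text_embedding has {len(embedding)} values, expected {dims}")
    return problems


# Made-up sample documents: one well-formed, one broken.
good = {"summary": "some page text", "text_embedding": [0.0] * 1024}
bad = {"summary": "", "text_embedding": [0.1, 0.2]}
print(check_document(good))
print(check_document(bad))
```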



Step 10: Create a vector search index


{
  "fields": [
    {
      "type": "vector",
      "path": "text_embedding",
      "numDimensions": 1024,
      "similarity": "cosine"
    }
  ]
}
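The `"similarity": "cosine"` setting means documents are ranked by the angle between their vector and the query vector, regardless of vector length. As a rough illustration of the measure itself, cosine similarity can be written in plain Python. Atlas reports its own normalized score, so the numbers here are not expected to match the scores Atlas returns.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors; the embeddings in this tutorial have 1024 values.
print(cosine_similarity([1.0, 0.0, 0.0], [2.0, 0.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal
```

The length check mirrors why `numDimensions` has to match the embedding model: vectors of different lengths cannot be compared.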



Step 11: Query through the vector search index

from pymongo import MongoClient
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch
import pprint

print("Connect to your local Atlas deployment or Atlas Cluster")
mongo_client = MongoClient("mongodb+srv://<username>:<password>@internal-knowledge-base.xxxxx.mongodb.net/")
collection = mongo_client["internal-knowledge-base"]["papers"]

# Use the same embedding model as in Step 8 so query and document vectors are comparable.
model = HuggingFaceEmbeddings(model_name="BAAI/bge-large-zh-v1.5")

# embedding_key and text_key must match the field names written in Step 8;
# index_name must match the name given to the index in Step 10.
vector_store = MongoDBAtlasVectorSearch(
    collection=collection,
    embedding=model,
    index_name="vector_index",
    embedding_key="text_embedding",
    text_key="summary",
)

query = "蚂蚁集团"  # "Ant Group", a payment company
results = vector_store.similarity_search(query)
pprint.pprint(results)

Result:

English version

[
    Document(metadata={'_id': {'$oid': '66b79fdc2e6781dc91958211'}}, page_content='Ant Group-Alipay Ecological Foundation'),
    Document(metadata={'_id': {'$oid': '66b79fcd2e6781dc9195820e'}}, page_content='The competitive landscape of independent third-party payment platforms has formed, led by Alipay'),
    Document(metadata={'_id': {'$oid': '66b79fc32e6781dc9195820c'}}, page_content='Aikan Series Monthly Inventory of Tourism Activity in Scenic Areas')
]

Chinese version

[
    Document(metadata={'_id': {'$oid': '66b79fdc2e6781dc91958211'}}, page_content='蚂蚁集团—支付宝生态筑基'),
    Document(metadata={'_id': {'$oid': '66b79fcd2e6781dc9195820e'}}, page_content='独立第三方支付平台竞争格局形成以支付宝为首'),
    Document(metadata={'_id': {'$oid': '66b79fc32e6781dc9195820c'}}, page_content='-艾瞰系列-景区旅游活跃度盘点月报')
]
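`similarity_search` hides the underlying query. For readers who want the similarity score or more control, roughly the same search can be expressed as a `$vectorSearch` aggregation stage and run with `collection.aggregate`. The sketch below only builds the pipeline as a Python structure. `build_pipeline` is a hypothetical helper, the `limit` and `num_candidates` defaults are arbitrary choices, and in real use the query vector would come from `model.embed_query(query)`.

```python
def build_pipeline(query_vector: list[float], limit: int = 3, num_candidates: int = 50) -> list[dict]:
    """Build an aggregation pipeline that runs $vectorSearch and returns text plus score."""
    if num_candidates < limit:
        raise ValueError("num_candidates must be at least as large as limit")
    return [
        {
            "$vectorSearch": {
                "index": "vector_index",          # index name used in this tutorial
                "path": "text_embedding",         # field that stores the vectors
                "queryVector": query_vector,
                "numCandidates": num_candidates,  # neighbours considered before ranking
                "limit": limit,                   # number of documents returned
            }
        },
        {
            "$project": {
                "summary": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


# A dummy 4-value vector keeps the example short; a real one has 1024 values.
pipeline = build_pipeline([0.1, 0.2, 0.3, 0.4])
print(pipeline[0]["$vectorSearch"]["limit"])
# With a live collection: results = list(collection.aggregate(pipeline))
```

Raising `num_candidates` generally trades query speed for recall, so the value is worth tuning against the size of the collection.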



Reference:

https://python.langchain.com/v0.2/docs/tutorials/pdf_qa/
Build a PDF ingestion and Question/Answering system

https://www.mongodb.com/docs/atlas/atlas-vector-search/create-embeddings/
How to Create Vector Embeddings

https://www.mongodb.com/docs/atlas/atlas-vector-search/tutorials/local-rag/#std-label-local-rag
Build a Local RAG Implementation with Atlas Vector Search

https://www.mongodb.com/docs/atlas/atlas-vector-search/ai-integrations/langchain/#std-label-langchain
Get Started with the LangChain Integration

https://docs.llamaindex.ai/en/stable/examples/embeddings/huggingface/
Local Embeddings with HuggingFace


Editor

Danny Chan, specializing in FSI and Serverless

Kenny Chan, specializing in FSI and Machine Learning