Jina.ai recently upgraded and expanded the capabilities of their previous embedding model in a v2 release.
With the Jina Haystack extension, you can now take advantage of these new text embedders in your Haystack pipelines! In this post, we'll show what's cool about Jina Embeddings v2 and how to use them.
You can follow along in the accompanying Colab notebook of a RAG pipeline that uses the Jina Haystack extension.
Advantages of Jina Embeddings v2
- Handling long documents. The large context window, accommodating up to 8192 tokens, lets you split your documents into larger chunks. It's more computationally and memory-efficient to use a few large vectors than many small ones, so this allows Jina v2 to process long documents efficiently.
- Improved semantic understanding. Larger chunks also carry more context, which can help LLMs better understand your documents. Improved understanding means better long-document retrieval, semantic textual similarity, text reranking, recommendation, RAG, and LLM-based generative search.
- Short vector length: Jina Embeddings v2 emits embedding vectors of length 768 (base model) or 512 (small model), both significantly shorter than the output of the only other embedding model that supports an 8k-token input length, without compromising quality on retrieval, similarity, reranking, or other downstream tasks. Shorter vectors mean cost savings in your vector database, since pricing typically scales with the number of stored vector dimensions (see the back-of-the-envelope sketch after this list).
- Fully open source 💙 There are both small and base embedding models available, depending on your computing resources and requirements. To run the embedding models yourself, check out this documentation on Hugging Face. Alternatively, you can use Jina's fully managed embedding service to handle that for you, which we'll be doing for this demo.
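To put the vector-length point in perspective, here's a quick back-of-the-envelope sketch. It assumes float32 storage (4 bytes per dimension) and a 1536-dimension figure for the larger alternative model; both numbers are our assumptions for illustration, not specs quoted above.

```python
# Rough storage footprint of 1M embeddings at different dimensionalities.
# Assumes float32 (4 bytes/dim); the 1536-dim "alternative" is an assumption.
BYTES_PER_DIM = 4
NUM_VECTORS = 1_000_000

for name, dims in [("jina-v2-small", 512), ("jina-v2-base", 768), ("1536-dim alternative", 1536)]:
    gigabytes = dims * BYTES_PER_DIM * NUM_VECTORS / 1e9
    print(f"{name}: {dims} dims = {gigabytes:.1f} GB per million vectors")
```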
Getting started using Jina Embeddings v2 with Haystack
To use the integration you'll need a free Jina API key - get one here.
You can use Jina Embedding models with two Haystack components: `JinaTextEmbedder` and `JinaDocumentEmbedder`.
To create semantic embeddings for documents, use `JinaDocumentEmbedder` in your indexing pipeline. For generating embeddings for queries, use `JinaTextEmbedder`.
In the following code we'll demonstrate how to use both components. You can also see the Haystack docs for some minimum viable code examples.
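Before we build full pipelines, here's a minimal sketch of each component on its own. The api_key placeholder and the sample strings are ours; the output keys ("documents" and "embedding") are the same ones we'll wire into the pipelines below.

```python
from haystack import Document
from jina_haystack.document_embedder import JinaDocumentEmbedder
from jina_haystack.text_embedder import JinaTextEmbedder

# Embed documents at indexing time
doc_embedder = JinaDocumentEmbedder(api_key="<your-jina-key>", model_name="jina-embeddings-v2-base-en")
docs_with_embeddings = doc_embedder.run(documents=[Document(content="Sonos sued Google.")])["documents"]

# Embed a query at search time
text_embedder = JinaTextEmbedder(api_key="<your-jina-key>", model_name="jina-embeddings-v2-base-en")
query_embedding = text_embedder.run(text="What happened in Google v. Sonos?")["embedding"]
print(len(query_embedding))  # 768 dimensions for the base model
```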
Summarizing legal text with a Haystack RAG pipeline
I'm not a lawyer, and neither are large language models. But LLMs are good at analyzing long, complex documents. So let's try using the Jina v2 embedding models for some legal summarization.
In October 2023, I narrowly escaped jury duty. I had slight FOMO since the case sounded interesting (Google v. Sonos). Let's see how it turned out.
To follow along with this demo, in addition to a Jina API key you'll also need a Hugging Face access token, since we'll use the Mixtral 8x7b LLM for question answering.
First, let's install all the packages we'll need.
```bash
pip install jina-haystack chroma-haystack pypdf
```
Then let's input our credentials. Or you can set them as environment variables instead if you're feeling fancy (a sketch of that follows the next snippet).
```python
import getpass

jina_api_key = getpass.getpass("Enter your Jina API key: ")
hf_token = getpass.getpass("Enter your Hugging Face API token: ")
```
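If you'd rather go the environment-variable route, a minimal sketch looks like this (the variable names JINA_API_KEY and HF_API_TOKEN are our own choice, not something either library requires):

```python
import os

# Read credentials from the environment instead of prompting interactively.
jina_api_key = os.environ["JINA_API_KEY"]
hf_token = os.environ["HF_API_TOKEN"]
```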
Building the indexing pipeline
Our indexing pipeline will preprocess the legal document, turn it into vectors, and store them. We'll use the Chroma DocumentStore to store the vector embeddings, via the Chroma Document Store Haystack integration.
```python
from chroma_haystack.document_store import ChromaDocumentStore

document_store = ChromaDocumentStore()
```
At a high level, the `LinkContentFetcher` pulls this document from its URL. Then we convert it from a PDF into a `Document` object Haystack can understand. We preprocess it by removing whitespace and redundant substrings, then split it into chunks, generate embeddings, and write these embeddings into the `ChromaDocumentStore`.
```python
from haystack import Pipeline
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import PyPDFToDocument
from haystack.components.writers import DocumentWriter
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from chroma_haystack.retriever import ChromaEmbeddingRetriever
from haystack.document_stores import DuplicatePolicy
from jina_haystack.document_embedder import JinaDocumentEmbedder
from jina_haystack.text_embedder import JinaTextEmbedder
fetcher = LinkContentFetcher()
converter = PyPDFToDocument()
# remove repeated substrings to get rid of headers/footers
cleaner = DocumentCleaner(remove_repeated_substrings=True)
# Since jina-v2 can handle 8192 tokens, 500 words seems like a safe chunk size
splitter = DocumentSplitter(split_by="word", split_length=500)
# DuplicatePolicy.SKIP is optional but helps avoid errors if you want to re-run the pipeline
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)
retriever = ChromaEmbeddingRetriever(document_store=document_store)
document_embedder = JinaDocumentEmbedder(api_key=jina_api_key, model_name="jina-embeddings-v2-base-en")
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=fetcher, name="fetcher")
indexing_pipeline.add_component(instance=converter, name="converter")
indexing_pipeline.add_component(instance=cleaner, name="cleaner")
indexing_pipeline.add_component(instance=splitter, name="splitter")
indexing_pipeline.add_component(instance=document_embedder, name="embedder")
indexing_pipeline.add_component(instance=writer, name="writer")
indexing_pipeline.connect("fetcher.streams", "converter.sources")
indexing_pipeline.connect("converter.documents", "cleaner.documents")
indexing_pipeline.connect("cleaner.documents", "splitter.documents")
indexing_pipeline.connect("splitter.documents", "embedder.documents")
indexing_pipeline.connect("embedder.documents", "writer.documents")
# This case references Google v. Sonos, October 2023
urls = ["https://cases.justia.com/federal/district-courts/california/candce/3:2020cv06754/366520/813/0.pdf"]
indexing_pipeline.run(data={"fetcher": {"urls": urls}})
```
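Once this runs, the document store should hold the embedded chunks. As a quick sanity check, you can count them; count_documents is part of Haystack 2.0's standard document store interface, so the Chroma integration should support it, though this step isn't strictly required:

```python
# Sanity check: confirm the indexing pipeline wrote embedded chunks to Chroma.
print(document_store.count_documents())
```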
Building the query pipeline
Now the real fun begins. Let's create a query pipeline so we can actually start asking questions. We write a prompt allowing us to pass our documents to the Mixtral-8x7B LLM. Then we initialize the LLM via the `HuggingFaceTGIGenerator`.
In Haystack 2.0, retrievers are tightly coupled to document stores. Since the `retriever` we initialized earlier wraps our `document_store`, this pipeline can access the embeddings we generated and pass the retrieved documents to the LLM.
```python
from haystack.components.generators import HuggingFaceTGIGenerator
from haystack.components.builders.prompt_builder import PromptBuilder
from jina_haystack.text_embedder import JinaTextEmbedder
prompt = """Answer the question based on the content in the documents.
If you can't answer based on the documents, say so.
Documents:
{% for doc in documents %}
{{doc.content}}
{% endfor %}
question: {{question}}
"""
text_embedder = JinaTextEmbedder(api_key=jina_api_key, model_name="jina-embeddings-v2-base-en")
generator = HuggingFaceTGIGenerator("mistralai/Mixtral-8x7B-Instruct-v0.1", token=hf_token)
generator.warm_up()
prompt_builder = PromptBuilder(template=prompt)
query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", text_embedder)
query_pipeline.add_component(instance=prompt_builder, name="prompt_builder")
query_pipeline.add_component("retriever", retriever)
query_pipeline.add_component("generator", generator)
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
query_pipeline.connect("retriever.documents", "prompt_builder.documents")
query_pipeline.connect("prompt_builder.prompt", "generator.prompt")
```
Time to ask a question!
```python
question = "Summarize what happened in Google v. Sonos"
result = query_pipeline.run(data={"text_embedder":{"text": question},
"retriever": {"top_k": 3},
"prompt_builder":{"question": question},
"generator": {"generation_kwargs": {"max_new_tokens": 350}}})
print(result['generator']['replies'][0])
```
Answer: Google v. Sonos is a patent infringement case in which Sonos sued Google for infringing on two of its patents related to customizing and saving overlapping groups of smart speakers or other zone players according to a common theme.
Exploring more questions and documents
You can swap the `question` variable out and then call `query_pipeline.run` again (or wrap the call in a small helper, as sketched after this list):
- What role did If This Then That play in Google v. Sonos?
- What judge presided over Google v. Sonos?
- What should Sonos have done differently?
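To make re-asking less repetitive, here's a small convenience wrapper around the query_pipeline.run call above; it's just a sketch using the same parameter values we already passed:

```python
def ask(question: str, top_k: int = 3, max_new_tokens: int = 350) -> str:
    # Re-runs the query pipeline with a new question, using the same
    # parameters as the example above.
    result = query_pipeline.run(data={
        "text_embedder": {"text": question},
        "retriever": {"top_k": top_k},
        "prompt_builder": {"question": question},
        "generator": {"generation_kwargs": {"max_new_tokens": max_new_tokens}},
    })
    return result["generator"]["replies"][0]

print(ask("What role did If This Then That play in Google v. Sonos?"))
```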
The indexing pipeline is written so that you can swap in other documents and analyze them. You can try plugging the following URLs (or any PDF written in English) into the indexing pipeline and re-running all the code blocks below it (a sketch of this follows the list):
- Google v. Oracle: https://supreme.justia.com/cases/federal/us/593/18-956/case.pdf
- JACK DANIEL’S PROPERTIES, INC. v. VIP PRODUCTS LLC: https://www.supremecourt.gov/opinions/22pdf/22-148_3e04.pdf
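For example, to analyze Google v. Oracle you'd re-run the indexing pipeline with the new URL; the query pipeline itself stays the same. Keep in mind that the earlier case's embeddings remain in the document store unless you recreate the `ChromaDocumentStore`, so retrieval may mix documents:

```python
# Point the indexing pipeline at a different PDF and re-run it.
urls = ["https://supreme.justia.com/cases/federal/us/593/18-956/case.pdf"]
indexing_pipeline.run(data={"fetcher": {"urls": urls}})

# Then query as before, e.g.:
question = "Summarize what happened in Google v. Oracle"
```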
Note: if you want to change the prompt template, you'll also need to re-run the code blocks starting from where the `DocumentStore` is defined.
Wrapping it up
Thanks for reading! If you want to stay on top of the latest Haystack developments, you can subscribe to our newsletter or join our Discord community.
To learn more about the technologies used here, check out the related posts on the Haystack blog.