DEV Community

Akriti Upadhyay

Steps to Build RAG Application with Gemma 7B LLM


As large language models advance, interest in building RAG (Retrieval-Augmented Generation) applications keeps growing. Google just launched an open-source model: Gemma. RAG combines two approaches: retrieval-based techniques and generative models. Retrieval-based techniques fetch relevant information from large knowledge repositories or corpora in response to a specific query. Generative models produce new text by drawing on patterns learned from their training data. With this launch, why not try the new model in a RAG pipeline and see how it performs?

Let’s get started and break the process into these steps:

  1. Loading the Dataset: Cosmopedia
  2. Embedding Generation with Hugging Face
  3. Storing in the FAISS DB
  4. Gemma: Introducing the SOTA model
  5. Querying the RAG Pipeline

Building RAG Application on Gemma 7B

Before rolling up our sleeves, let’s install and import the required dependencies.

%pip install -q -U langchain langchain-community torch transformers sentence-transformers datasets faiss-cpu
import torch
from datasets import load_dataset
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

Loading the Dataset: Cosmopedia

To make a RAG application, we have selected a Hugging Face dataset, Cosmopedia. This dataset consists of synthetic textbooks, blog posts, stories, posts, and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1. The dataset contains over 30 million files and 25 billion tokens, which makes it the largest open synthetic dataset to date.

This dataset contains 8 subsets. We’ll go with the ‘stories’ subset and load it using the datasets library.

data = load_dataset("HuggingFaceTB/cosmopedia", "stories", split="train")

Then, we will convert it to a Pandas dataframe, and save it to a CSV file.

data = data.to_pandas()
data.to_csv("dataset.csv")

Now that the dataset is saved on our system, we will use LangChain to load the dataset.

loader = CSVLoader(file_path='./dataset.csv')
data = loader.load()

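To get a feel for what the loader hands back, here is a rough sketch using only the standard library. To our understanding, CSVLoader turns every CSV row into one document whose text is a list of "column: value" lines, with the source file and row number kept as metadata. The helper `rows_to_documents` and the two sample columns are made up for illustration; this is not LangChain's own code.

```python
import csv
import io


def rows_to_documents(csv_text, source="dataset.csv"):
    """Turn each CSV row into one document dict (hypothetical helper, not LangChain's code)."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_index, row in enumerate(reader):
        # Join every column as "name: value", one pair per line.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {
                "page_content": content,
                "metadata": {"source": source, "row": row_index},
            }
        )
    return documents


# Tiny made-up sample; the real dataset has more columns and far more rows.
sample_csv = "prompt,text\nWrite a story,Once upon a time\nWrite a poem,Roses are red\n"
documents = rows_to_documents(sample_csv)
print(documents[0]["page_content"])
```

Because every column ends up in the document text, columns we do not care about will also be embedded later on.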

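Now that the data is loaded, we need to split the documents into smaller pieces. Here, we split them into chunks of up to 1,000 characters, with a 150-character overlap between neighboring chunks so that context is not lost at the chunk boundaries. Smaller chunks keep each embedding focused and let the retriever pass only the relevant passages to the model.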

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = text_splitter.split_documents(data)

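The effect of the chunk size and overlap settings can be sketched with a plain sliding window. This is a simplified stand-in and not how RecursiveCharacterTextSplitter works internally (the real splitter prefers to cut at separators such as paragraph breaks); the function `split_with_overlap` is a hypothetical name used only here.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=150):
    """Fixed-width character chunks; each chunk starts with the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# 2,500 characters of dummy text.
sample_text = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample_text)
print([len(chunk) for chunk in chunks])
```

With these settings each new chunk starts 850 characters after the previous one, so the last 150 characters of one chunk are repeated at the start of the next.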

Embedding Generation with Hugging Face

Next, we will generate embeddings using LangChain's HuggingFaceEmbeddings wrapper around a Sentence Transformers model.

modelPath = "sentence-transformers/all-MiniLM-l6-v2"
model_kwargs = {'device':'cpu'}
encode_kwargs = {'normalize_embeddings': False}

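# The call is cut off in the source; the arguments below are assumed from the variables defined above.
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)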
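The `normalize_embeddings` flag decides whether each vector is rescaled to unit length. A small sketch in plain Python shows what that rescaling does and why it matters for similarity: once vectors have length 1, their dot product equals their cosine similarity. The helper names and the two toy vectors are ours, chosen only for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    # Cosine similarity is the dot product of the unit-length versions.
    return dot(l2_normalize(a), l2_normalize(b))


raw_a = [3.0, 4.0]
raw_b = [6.0, 8.0]  # same direction as raw_a, twice as long

print(dot(raw_a, raw_b))                # depends on vector length
print(cosine_similarity(raw_a, raw_b))  # depends only on direction
```

With `normalize_embeddings` set to False, as in this article, the stored vectors keep their original lengths.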

Storing in the FAISS DB

The embedding model is ready, but the vectors it produces need to be stored in a vector database. We’ll save them in a FAISS vector store; FAISS is a library for efficient similarity search and clustering of dense vectors.

db = FAISS.from_documents(docs, embeddings)
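Conceptually, a similarity search compares the query vector with every stored vector and keeps the closest ones. The sketch below does this by brute force with squared L2 distance in plain Python. To our understanding, LangChain's FAISS wrapper uses an exact L2 index and returns four results by default, but treat both as assumptions; FAISS itself is far more optimized than this. The names `squared_l2`, `search`, and `toy_index` are ours.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(index_vectors, query_vector, k=4):
    """Return the k (distance, position) pairs closest to the query, nearest first."""
    scored = [
        (squared_l2(vector, query_vector), position)
        for position, vector in enumerate(index_vectors)
    ]
    return heapq.nsmallest(k, scored)


# Four toy 2-dimensional "embeddings"; real ones have hundreds of dimensions.
toy_index = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
hits = search(toy_index, [1.0, 1.0], k=2)
print(hits)
```

The positions in the result point back to the stored chunks, which is how the retriever later recovers the text to hand to the model.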

Gemma: Introducing the SOTA model

Gemma comes in two sizes, with 2 billion and 7 billion parameters, to suit different computational constraints and application scenarios. Both pre-trained and fine-tuned checkpoints are provided, along with an open-source codebase for inference and serving. The models are trained on up to 6 trillion tokens of text data and use architectures, datasets, and training methodologies similar to those of the Gemini models. Both sizes show strong generalist capabilities across text domains and perform well on understanding and reasoning tasks at scale.

The release includes raw, pre-trained checkpoints as well as fine-tuned checkpoints optimized for dialogue, instruction-following, helpfulness, and safety. The models have been evaluated comprehensively to assess their performance and identify shortcomings, which supports research into model tuning regimes and the development of safer, more responsible models. Gemma outperforms comparable-scale open models across domains including question answering, commonsense reasoning, mathematics and science, and coding, as demonstrated through both automated benchmarks and human evaluations. To learn more about the Gemma model, visit this technical report.

To get started with the Gemma model, you need to accept its terms of use on Hugging Face. Then log in with your Hugging Face token.

from huggingface_hub import notebook_login
notebook_login()

Load the model and initialize its tokenizer.

model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b", padding=True, truncation=True, max_length=512)

Create a text generation pipeline.

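pipe = pipeline(
    "text-generation", model=model, tokenizer=tokenizer,  # assumed: these arguments are cut off in the source
    model_kwargs={"torch_dtype": torch.bfloat16},
)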
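The `torch.bfloat16` setting matters mostly for memory: bfloat16 stores each weight in 2 bytes, while float32 needs 4. A back-of-the-envelope sketch makes the difference concrete. The parameter count below is the nominal 7 billion from the model name, which is an assumption (the real count differs somewhat), and the estimate covers weights only, not activations or the generation cache.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gib(num_params, dtype):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Assumption: the nominal size from the model name, not the exact parameter count.
nominal_params = 7_000_000_000

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: about {weight_memory_gib(nominal_params, dtype):.1f} GiB")
```

Halving the bytes per weight halves the footprint, which can decide whether the model fits on a single GPU at all.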

Wrap the pipeline as a LangChain LLM and pass the model kwargs.

llm = HuggingFacePipeline(
    pipeline=pipe,  # assumed: this argument is cut off in the source
    model_kwargs={"temperature": 0.7, "max_length": 512},
)
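The temperature value controls how adventurous the sampling is. Here is a small sketch of the usual mechanism, written in plain Python rather than taken from any library: the model's scores (logits) are divided by the temperature before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it. The function name and the toy logits are ours.

```python
import math


def softmax_with_temperature(logits, temperature=0.7):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    max_scaled = max(scaled)
    exps = [math.exp(value - max_scaled) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
probs_default = softmax_with_temperature(toy_logits, temperature=1.0)
probs_sharper = softmax_with_temperature(toy_logits, temperature=0.7)
print(probs_default)
print(probs_sharper)
```

At 0.7 the top-scoring token gets a larger share of the probability than at 1.0. Note that in transformers, temperature only takes effect when sampling is enabled.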

Now it is time to use the vector store and the LLM for question-answering retrieval.

qa = RetrievalQA.from_chain_type(llm=llm, retriever=db.as_retriever())  # arguments assumed; the call is cut off in the source
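Under the hood, a retrieval QA chain has to combine the retrieved chunks with the question before calling the model. One common strategy, and to our understanding the default for RetrievalQA, is to "stuff" all retrieved chunks into a single prompt. The sketch below shows the idea; the prompt wording and the function `build_stuffed_prompt` are invented for illustration and are not LangChain's actual template.

```python
def build_stuffed_prompt(question, retrieved_chunks):
    """Place all retrieved chunks ahead of the question in one prompt string."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunks standing in for what the retriever would return.
toy_chunks = [
    "Once there was a kitten who loved to count the stars.",
    "The little robot learned that sharing makes everyone happy.",
]
prompt = build_stuffed_prompt("Write an educational story for young children.", toy_chunks)
print(prompt)
```

Since everything lands in one prompt, the number and size of retrieved chunks must fit within the model's context window.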

Querying the RAG Pipeline

The RAG pipeline is ready; let’s pass the queries and see how it performs.

qa.invoke("Write an educational story for young children.")

The result is:

Once upon a time, in a cozy little village nestled between rolling hills and green meadows, there lived a curious kitten named Whiskers. Whiskers loved to explore every nook and cranny of the village, from the bustling marketplace to the quiet corners where flowers bloomed. One sunny morning, as Whiskers trotted down the cobblestone path, he spotted something shimmering in the distance. With his whiskers twitching in excitement, he scampered towards it, his little paws pitter-pattering on the ground. To his delight, he found a shiny object peeking out from beneath a bush--a beautiful, colorful kite! With a twinkle in his eye, Whiskers decided to take the kite on an adventure. He tugged at the string, and the kite soared into the sky, dancing gracefully with the gentle breeze. Whiskers giggled with joy as he watched the kite soar higher and higher, painting the sky with its vibrant colors.

Final Words

The Gemma 7B model performed very well. We got to read a beautiful story about a kitten. The new SOTA model was interesting and exciting to use. With the help of the FAISS vector store, we were able to build a RAG pipeline. Thanks for reading!

This article was originally published here:
