The Adventures of Blink #28: RAGs to Riches!

Last week we began to scratch the surface... we created a simple command-line chat program using Orca that would carry on a conversation with the user.

But that's so 2023, y'all! We've all seen dozens of demos where LLMs chat with people. We're moving beyond the whiz-bang excitement of the technology and starting to think about how we apply it! Blink, you gotta up your demo game!

Never fear, friends! Today we expand on the concept by building our very own RAG application: a "social media manager" app that uses AI to help us write titles and descriptions for our video, and even find quotable spots we could use for short-form video clips to market our post!

Building on the shoulders of giants!

Any good AI-driven app, particularly a RAG app, needs input data. For this post I elected to use an interview between two friends of mine: @shreythecray and Craig Dennis! Shreya runs a YouTube channel called "Developer Diaries" where she has folks on to discuss lots of cool stuff, and her interview with Craig was about his unique career path into the tech industry. This is the video I'm using as my input - you should totally check out her channel and give her a follow too!

YouTube Link

Don't wanna read it all? Here ya go πŸ˜‰

The RAGs

For starters, let's understand the buzzword: RAG (Retrieval-Augmented Generation) is an application architecture where we join our LLM to a database of vector embeddings, which are numeric representations of our data that the model can work with. This effectively allows our model to query the data set represented by the embeddings; in the vernacular, we "teach" our model to become an "expert" on whatever data we've provided.
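
If you're curious what an embedding actually looks like, here's a tiny peek ahead - a minimal sketch that assumes you've already got Ollama running with llama3 pulled (which we set up in just a moment):

from langchain_community.embeddings import OllamaEmbeddings

# Turn one sentence into a vector of floats - this is what a vector store saves and searches
embed_model = OllamaEmbeddings(model="llama3")
vector = embed_model.embed_query("The quick brown fox jumps over the lazy dog")

print(len(vector))  # a few thousand dimensions for llama3
print(vector[:5])   # just numbers - but "close" numbers mean "similar" text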

Install Ollama

This one's nice & easy - go to ollama and download the client. This gives you a background app that runs locally... you interact with it from the command line.

(Screenshot: the Ollama front page)

After it's installed and running, you can drop to your command prompt and do this:

ollama pull llama3

There are lots of models available in the ollama library, but I picked Llama3 to build with today. Play around and see what you like!
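
Once the server's running, LangChain can talk to it from Python. Here's a minimal sanity-check sketch - MODEL_ID and BASE_URL are the same names you'll see in my code later, and the URL is simply Ollama's default local endpoint:

from langchain_community.llms import Ollama

MODEL_ID = "llama3"
BASE_URL = "http://localhost:11434"  # Ollama's default local port

# This is the llm object the retrieval chain will use later on
llm = Ollama(model=MODEL_ID, base_url=BASE_URL)
print(llm.invoke("Say hello in five words or fewer."))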

Getting data out of YouTube

This turns out to be really easy - while you could build an LLM-based tool to extract the audio and create a transcript, it's much easier to just ask the YouTube API for it! 🙃

I wrote this into a Python class just so I could easily augment it someday with something that works on locally-stored video files - after all, the goal is to improve the YouTube post, so it would be an icky flow to have to upload the video and THEN do all this processing!
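
The class itself isn't reproduced here, but if you want to follow along, here's a minimal sketch of the idea using the youtube-transcript-api package (one easy way to grab captions - swap in whatever transcript source you prefer):

from youtube_transcript_api import YouTubeTranscriptApi

def fetch_transcript(video_id: str) -> str:
    """Pull the video's caption segments and glue them into one big string."""
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    return " ".join(segment["text"] for segment in segments)

# Hypothetical usage - drop in the ID of the video you want to process
result = fetch_transcript("<your-video-id>")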

Where do embeddings come from? πŸΌπŸ‘Ά

Once you have your transcript, you have to send it to the model to create vector embeddings.

But you don't just say "Hey here's a giant string with all the things you'd ever need to know, aaaaaand GO". It won't work well!

What we have to do to prepare the data is to break it up into chunks. Think of these as bite-sized knowledge nuggets, out of which we assemble our whole knowledge base.

But it's not ONLY broken up into pieces... we have to make them overlap a little! This helps the machine understand the context a bit better. Let's imagine this with an example. If you had the following text:

The quick brown fox jumps over the lazy dog

You might chunk it up like this:

The quick
brown fox
jumps over
the lazy
dog

But if you think of these chunks as rows in a database, and then try to query "who jumped?"...

It's a hard problem for the model to figure out - "jumps" is directly related only to "over". Let's try chunking with some overlap now:

The quick brown
brown fox jumps
fox jumps over
jumps over the
over the lazy
the lazy dog

Now when we query for "who jumped?" the answer becomes much clearer, doesn't it?

This takes more memory, fills up a little more space... but the added context makes interpretation of the data much, much easier.

So we do this using langchain:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=128)
chunks = text_splitter.split_text(result)  # result holds the transcript text
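
If you want to watch the overlap happen on our fox sentence, shrink the numbers way down (sizes here are characters, not words, so the exact chunks may differ a bit from the hand-made example above) - this uses the same splitter we just imported:

toy_splitter = RecursiveCharacterTextSplitter(chunk_size=15, chunk_overlap=8)
print(toy_splitter.split_text("The quick brown fox jumps over the lazy dog"))  # prints short, overlapping chunks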

Storing embeddings for consumption πŸͺ

Now that we've created the chunks, we can get them built into a ChromaDB. You'll note in my code that I did this in a method that creates a vector store retriever along with the chunk-maker:

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Embed each chunk with the local model and return a retriever over the Chroma store
embed_model = OllamaEmbeddings(model=MODEL_ID, base_url=BASE_URL)
return Chroma.from_texts(chunks, embed_model).as_retriever()

What comes back from this method is a fully-processed vector store retriever that you can plug directly into the rest of the process. You may want to break this part up in your own code, but since this is a quick example with only one data source, I just took the liberty of smashing it all together in one method 🙃
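
For reference, here's roughly what that all-in-one method looks like when you stitch the pieces together - a sketch of the shape, not a copy-paste of my exact code (the function and variable names are just for illustration):

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

def build_retriever(result: str):
    """Chunk the transcript, embed the chunks, and hand back a retriever over them."""
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=128)
    chunks = text_splitter.split_text(result)

    # MODEL_ID and BASE_URL as defined back in the Ollama setup
    embed_model = OllamaEmbeddings(model=MODEL_ID, base_url=BASE_URL)
    return Chroma.from_texts(chunks, embed_model).as_retriever()

vector_store_retriever = build_retriever(result)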

Making a retrieval chain

Now that you have a vector store retriever, you want to create a chain that connects it to the language model.

print("Combining docs...")
retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

combine_docs_chain = create_stuff_documents_chain(llm, retrieval_qa_chat_prompt)

retrieval_chain = create_retrieval_chain(vector_store_retriever, combine_docs_chain) 

First, you need a chat prompt template to work from. This helps the model know how to process the conversation inputs.

Next, you join that template to the model - now the model has the ability to communicate on a rudimentary level.

Finally, you build the retrieval chain - this combines the vector store with the language chain (hehe, get it? lang-chain? πŸ˜‰) and effectively connects the mouth to the memory!

Putting it all together 🧩

Now you're ready to invoke your chain. I put all my invocations in a single method just so it was easy to see what was going on...

def invocations(retrieval_chain):
    response1 = retrieval_chain.invoke({"input": "Create a summary of this message that's less than 800 characters long.  Then add several hashtags that would be appropriate if this were the youtube description of the video, in order to maximize its social media reach."})

    print(f"{response1['answer']}\n")

    response2 = retrieval_chain.invoke({"input": "Create a clickbait style title for the message based on its overall theme."})

    print(f"{response2['answer']}\n")

    response3 = retrieval_chain.invoke({"input": "Locate at least 3 potential quotable snippets within the message that could make good short-form video content.  Provide ONLY the snippets, do not explain why you selected them."})

    print(f"{response3['answer']}\n")

The structure of this is totally up to you - in this case I wanted simple, one-shot interactions: given a request, put out a response. I imagine this could grow into much more interesting things - you could build in a loop to keep the conversation going (see the sketch below), you could make it update the Chroma DB to add more context and knowledge live... the possibilities are endless!
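
For instance, a bare-bones conversation loop could be as simple as this sketch:

def chat_loop(retrieval_chain):
    """Keep asking the chain questions about the video until the user types 'quit'."""
    while True:
        question = input("Ask something about the video (or 'quit'): ")
        if question.strip().lower() == "quit":
            break
        response = retrieval_chain.invoke({"input": question})
        print(f"{response['answer']}\n")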

Wrapping up

It turns out that the tooling already available makes it easy for a programmer to begin experimenting with AI applications. I was thrilled with how easy it was to set this up and get to a working product! It's always a good feeling when your imagination is the constraint - you aren't limited by the capabilities of the tech so much as what you haven't thought of yet!

I'd love to see what you're building with this - drop me a comment and let me know, or send me a PR and let's collaborate!
