
Cris Crawford


Setting up the database and search for RAG

Video 1.3 of DataTalks.Club's llm-zoomcamp focuses on retrieval. Following along with it, I set up the database and search capabilities for RAG. For now I used a minimal in-memory search engine that was built in a pre-course video. I didn't write it myself - I just downloaded it from the instructor's repository.

Next I loaded a JSON file containing the course FAQs for the three other zoomcamps. I had created this file in the first of the pre-course workshops. It is a list with one entry per course, and each entry has the form:

 {"course": <course name>,
  "documents": [{"text": <answer to question>,
                 "question": <question>,
                 "section": <section>}]
 }
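To make the shape concrete, here is a minimal sketch with made-up values. The course names and texts below are placeholders, not taken from the real file, and the flatten helper is a hypothetical name used only for illustration. It shows the flattening step that copies the course name onto each document.

```python
# Hypothetical sample in the same shape as documents.json (values are placeholders).
docs_raw = [
    {
        "course": "course-a",
        "documents": [
            {"text": "Answer 1", "question": "Question 1", "section": "General"},
            {"text": "Answer 2", "question": "Question 2", "section": "Setup"},
        ],
    },
    {
        "course": "course-b",
        "documents": [
            {"text": "Answer 3", "question": "Question 3", "section": "General"},
        ],
    },
]


def flatten(courses):
    """Return one flat list of documents, each tagged with its course name."""
    flat = []
    for course_dict in courses:
        for doc in course_dict["documents"]:
            # Copy the document so the input structure is left unmodified.
            flat.append({**doc, "course": course_dict["course"]})
    return flat


documents = flatten(docs_raw)
print(len(documents))          # number of flattened documents
print(documents[0]["course"])  # course name copied from the parent entry
```

Unlike the loop in the notebook, this version builds new dictionaries instead of adding the course key to the loaded ones. Either approach gives the same flat list.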

I flattened the file, i.e. turned it into a single list of documents, and loaded that list into the search engine. To do that I had to specify which fields are searchable text and which are keyword fields used to filter the search. I created an index, fit it on the list of documents, and ran the search. This was pretty easy and everything worked as it should. The Python notebook is as follows:

!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py

import minsearch

import json

with open('documents.json', 'rt') as f_in:
    docs_raw = json.load(f_in)

documents = []

# Flattening: copy the course name onto each document
for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)

# Indexing: text fields are searched, keyword fields are used for filtering
index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)

index.fit(documents)

q = 'the course has already started, can I still enroll?'

boost = {'question': 3.0, 'section': 0.5}

results = index.search(
    query=q,
    filter_dict={'course': 'data-engineering-zoomcamp'},
    boost_dict=boost,
    num_results=5
)

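boost_dict weights each text field's contribution to the score. Here it raises the importance of 'question' relative to the other fields and lowers the importance of 'section'. Fields that are not listed keep the default weight of 1.0. filter_dict restricts the results to data-engineering-zoomcamp and drops documents from the other courses.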
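The effect of boosting and filtering can be sketched with a toy scorer. This is not how minsearch is implemented: its README describes TF-IDF with cosine similarity for text fields, while the word-overlap score below is only a stand-in. The names field_score and toy_search and the sample documents are hypothetical.

```python
from collections import Counter


def field_score(query, text):
    """Toy relevance score: how many query-word occurrences appear in the text."""
    query_words = set(query.lower().split())
    counts = Counter(text.lower().split())
    return sum(counts[word] for word in query_words)


def toy_search(docs, query, text_fields, filter_dict=None, boost_dict=None, num_results=5):
    """Illustrative search: keyword filter, then boosted sum of per-field scores."""
    filter_dict = filter_dict or {}
    boost_dict = boost_dict or {}
    scored = []
    for doc in docs:
        # Filtering: skip documents whose keyword fields do not match exactly.
        if any(doc.get(key) != value for key, value in filter_dict.items()):
            continue
        # Boosting: each field's score is multiplied by its weight (default 1.0).
        score = sum(
            boost_dict.get(field, 1.0) * field_score(query, doc[field])
            for field in text_fields
        )
        scored.append((score, doc))
    # Highest score first; the sort is stable, so ties keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_results]]


# Placeholder documents, not taken from the real FAQ file.
docs = [
    {"question": "course schedule", "text": "you can enroll late and I can enroll you",
     "section": "general", "course": "course-a"},
    {"question": "can I enroll late", "text": "yes",
     "section": "general", "course": "course-a"},
    {"question": "can I enroll", "text": "can I enroll",
     "section": "general", "course": "course-b"},
]
fields = ["question", "text", "section"]
query = "can I enroll"

plain = toy_search(docs, query, fields, filter_dict={"course": "course-a"})
boosted = toy_search(docs, query, fields, filter_dict={"course": "course-a"},
                     boost_dict={"question": 3.0, "section": 0.5})
print([d["question"] for d in plain])
print([d["question"] for d in boosted])
```

With these sample documents, the unboosted search ranks the document with the most matches in its text first, while the boost on 'question' moves the document whose question matches the query to the top. The course-b document never appears because of the filter.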

We now have a query and an indexed knowledge base, so we can ask the knowledge base for context. The next video passes that context to OpenAI. The search above returns the following five results:

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
  'section': 'General course-related questions',
  'question': 'Course - When will the course start?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.',
  'section': 'General course-related questions',
  'question': 'Course - What can I do before the course starts?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, the slack channel remains open and you can ask questions there. But always sDocker containers exit code w search the channel first and second, check the FAQ (this document), most likely all your questions are already answered here.\nYou can also tag the bot @ZoomcampQABot to help you conduct the search, but don’t rely on its answers 100%, it is pretty good though.',
  'section': 'General course-related questions',
  'question': 'Course - Can I get support if I take the course in the self-paced mode?',
  'course': 'data-engineering-zoomcamp'}]
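As a rough illustration of what "context" could mean here, the sketch below joins results of this shape into a single text block. The build_context helper, the output layout, and the sample values are all assumptions made for illustration. They are not taken from the course or from minsearch.

```python
def build_context(results):
    """Join search results into one text block, one entry per result."""
    entries = []
    for doc in results:
        entries.append(
            f"section: {doc['section']}\n"
            f"question: {doc['question']}\n"
            f"answer: {doc['text']}"
        )
    # A blank line separates the entries.
    return "\n\n".join(entries)


# Placeholder results in the same shape as the search output above.
sample_results = [
    {"text": "Placeholder answer one", "section": "General",
     "question": "Placeholder first query", "course": "course-a"},
    {"text": "Placeholder answer two", "section": "General",
     "question": "Placeholder second query", "course": "course-a"},
]

context = build_context(sample_results)
print(context)
```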

Previous post: Learning how to make an OLIVER
Next post: Generating a result with a context
