Introduction
For reference, prior to this journey, I barely had more knowledge about AI than the average person. Sure, I fired off the occasional ChatGPT request for one task or another, but I was always more focused on coding than AI, having picked up Python and Java during quarantine.
Despite my initial skepticism about being able to understand the examples, particularly in a short time frame, I found LLMWare's "Fast Start to RAG" series highly accessible. I will cover example one of the course in this article, and hopefully it can help you as well! If you are interested in learning more about LLMWare, feel free to check out our website as well as another DEV article outlining the Fast Start to RAG examples.
To clarify, extensive knowledge of coding, specifically Python 3, is not necessarily a prerequisite for the examples that I used to get my start in AI and RAG. However, basic understanding is certainly helpful in comprehending content and parsing code.
Getting started
To run these examples, you will need to install the LLMWare package by running pip3 install llmware in the command line. Further instructions can be found in our README file. Then, you will be able to run example 1, which is directly copy-paste ready.
I will also point out that the AI community tends to use acronyms (like AI itself!) and technical language extending beyond the scope of everyday conversation. The acronym "RAG" stands for Retrieval Augmented Generation, a technique that improves the outputs of LLMs (Large Language Models) by supplying them with external knowledge. In Example 1, we will be focusing on the first step in RAG: converting a pile of files into an AI-ready knowledge base.
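To make the "retrieval" and "augmented" parts of the acronym a little more concrete, here is a toy sketch using only the Python standard library. It is not how LLMWare works internally; the function names retrieve and build_prompt and the three sample sentences are made up for illustration, and the actual generation step (sending the prompt to an LLM) is left out.

```python
def retrieve(chunks, query, top_k=2):
    """Return up to top_k chunks that contain the query (case-insensitive)."""
    hits = [c for c in chunks if query.lower() in c.lower()]
    return hits[:top_k]


def build_prompt(question, context_chunks):
    """Augment the question with the retrieved text before it goes to an LLM."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Use only this context:\n{context}\n\nQuestion: {question}"


# Invented stand-in for a knowledge base (hypothetical sentences)
knowledge_base = [
    "The executive's base salary is set out in Section 2 of the agreement.",
    "Either party may terminate with thirty days' written notice.",
    "Base salary is reviewed annually by the board.",
]

hits = retrieve(knowledge_base, "base salary")
prompt = build_prompt("What is the base salary?", hits)
print(prompt)
```

Only the two sentences that mention the query end up in the prompt, which is the whole idea: the model answers from retrieved text rather than from memory alone.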
Extra resources
For visual learners, here is a video that works through example 1. Feel free to watch the video before following the steps in this article. Also, here is a Python Notebook that breaks down this example's code alongside the output: Example 1 Notebook.
Part 1 - Execution configuration
By default, the active database being used is called "mongo", but we will select "sqlite" since it does not require a separate installation.
Additionally, we can use different debug mode options to see more or less information as it is processed. We can set debug_mode to 2 for more detailed output than we get with 0, the default.
For this example, the sample data sets are made available through from llmware.setup import Setup, and their folder names are listed in sample_folders. These sets include documents of different subject matters and sizes, but you will be able to replace them with your own data as well. We can choose a name for our library (go ahead and customize!) and select a folder from the samples before running the main script.
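import os
from llmware.library import Library
from llmware.retrieval import Query
from llmware.setup import Setup
from llmware.configs import LLMWareConfig
LLMWareConfig().set_active_db("sqlite")  # use sqlite instead of the default "mongo"
LLMWareConfig().set_config("debug_mode", 2)  # 2 = more detailed output
sample_folders = ["Agreements", "Invoices", "UN-Resolutions-500", "SmallLibrary", "FinDocs", "AgreementsLarge"]
library_name = "example1_library"
selected_folder = sample_folders[0] # e.g., "Agreements"
output = parsing_documents_into_library(library_name, selected_folder)  # the main function walked through in Part 2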
Part 2 - Main body
Step 1: Now, we can create our library! This line of code will set up the database tables as well as supporting file repositories to store information about the library.
library = Library().create_new_library(library_name)
Steps 2 and 3: However, our library is still completely empty, so we need to fill it up. To do so, we will load in the LLMWare sample files and store the path to them in sample_files_path. If you are using your own data sets, you will need to point to a local folder path with your documents.
sample_files_path = Setup().load_sample_files(over_write=False)
# sample_folder is the folder name passed into the function (selected_folder in Part 1)
ingestion_folder_path = os.path.join(sample_files_path, sample_folder)
Step 4: While adding files to a library, LLMWare performs parsing, text chunking, and indexing in the sqlite database. It will automatically choose the correct parser based on a file's extension. The parser extracts the file's content, which is then stored in the database as text chunks. Although this may seem like a lot of steps, it all happens quickly behind the scenes!
parsing_output = library.add_files(ingestion_folder_path)
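If you are curious what "pick a parser by extension, chunk the text, store the chunks" might look like in miniature, here is a rough sketch using only the standard library. To be clear, this is not LLMWare's implementation: the names parse_plain_text, PARSERS, chunk_text, and toy_add_files, the fixed 400-character chunk size, and the table layout are all assumptions chosen to keep the idea visible.

```python
import os
import sqlite3
import tempfile


def parse_plain_text(path):
    """Hypothetical parser: read a UTF-8 text file and return its contents."""
    with open(path, encoding="utf-8") as f:
        return f.read()


# Hypothetical dispatch table: file extension -> parser function
PARSERS = {".txt": parse_plain_text, ".md": parse_plain_text}


def chunk_text(text, max_chars=400):
    """Split text into consecutive blocks of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def toy_add_files(conn, folder, max_chars=400):
    """Parse every supported file in folder and store its chunks as rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS blocks "
        "(doc_id INTEGER, block_id INTEGER, file_source TEXT, text TEXT)"
    )
    docs = blocks = 0
    for name in sorted(os.listdir(folder)):
        parser = PARSERS.get(os.path.splitext(name)[1].lower())
        if parser is None:
            continue  # no parser registered for this extension
        text = parser(os.path.join(folder, name))
        for block_id, chunk in enumerate(chunk_text(text, max_chars)):
            conn.execute(
                "INSERT INTO blocks VALUES (?, ?, ?, ?)",
                (docs, block_id, name, chunk),
            )
            blocks += 1
        docs += 1
    conn.commit()
    return {"documents": docs, "blocks": blocks}


# Demo on a throwaway folder: one supported file and one with no registered parser
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "notes.txt"), "w", encoding="utf-8") as f:
        f.write("x" * 1000)
    with open(os.path.join(folder, "image.png"), "w", encoding="utf-8") as f:
        f.write("skipped")
    conn = sqlite3.connect(":memory:")
    ingest_summary = toy_add_files(conn, folder)
    stored = conn.execute("SELECT COUNT(*) FROM blocks").fetchone()[0]
    conn.close()

print(ingest_summary, stored)
```

A real parser does far more than this (PDF layout, tables, page numbers), but the overall shape of the loop is the part worth remembering.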
Step 5: To check our progress, we can look at the updated_library_card, which contains key metadata, counting data, and other important information. The .get_library_card() method can be called at any time to retrieve information about your library.
updated_library_card = library.get_library_card()
doc_count = updated_library_card["documents"]
block_count = updated_library_card["blocks"]
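As a small illustration of what you can do with those two counts, here is a helper that turns a card-like dictionary into a one-line summary. The helper name summarize_card is my own, and sample_card is a stand-in with made-up numbers; the only things taken from the example are the "documents" and "blocks" keys.

```python
def summarize_card(card):
    """Build a one-line summary from the two counts used in this example."""
    docs = card.get("documents", 0)
    blocks = card.get("blocks", 0)
    avg = blocks / docs if docs else 0.0  # avoid dividing by zero for an empty library
    return f"{docs} documents, {blocks} blocks, {avg:.1f} blocks per document"


# Stand-in dictionary with invented numbers, not real output
sample_card = {"documents": 4, "blocks": 100}
card_summary = summarize_card(sample_card)
print(card_summary)
```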
Steps 6 and 7: We could inspect the library's main folder structure at this point, but the library is already ready for queries! We will run one by instantiating a Query object and passing the library to it. The test_query string may need to be adjusted to best suit the data set. For this example, we chose the "Agreements" sample set, so we can use "base salary" as a "hello world"-esque query.
Now, a text query will look through the chunks of text and return the ones that contain "base salary" (up to 10 of them, because of result_count=10). The Query class contains many methods for different query types. Today, we will use the simplest one, the text_query method.
query_results = Query(library).text_query(test_query, result_count=10)
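To picture what a plain text query does, here is a toy version that scans a list of text blocks for a phrase. This is only a sketch of the idea, not the Query class itself: toy_text_query is a hypothetical name, the blocks are invented, and storing character offsets under "matches" is my own simplification rather than LLMWare's actual result format.

```python
import re


def toy_text_query(blocks, query, result_count=10):
    """Return up to result_count blocks that contain query, with match offsets."""
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    results = []
    for block_id, text in enumerate(blocks):
        offsets = [m.start() for m in pattern.finditer(text)]
        if offsets:
            results.append({"block_ID": block_id, "text": text, "matches": offsets})
        if len(results) == result_count:
            break  # stop once we have enough results
    return results


# Invented blocks for illustration
toy_blocks = [
    "Base salary and base salary bonus",
    "no match here",
    "the base salary",
]
toy_results = toy_text_query(toy_blocks, "base salary")
for r in toy_results:
    print(r["block_ID"], r["matches"])
```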
We can print out our results, giving us a look at the metadata and attributes of the individual text blocks we created!
for i, result in enumerate(query_results):
    # here are a few useful attributes
    text = result["text"]
    file_source = result["file_source"]
    page_number = result["page_num"]
    doc_id = result["doc_ID"]
    block_id = result["block_ID"]
    matches = result["matches"]
    print("query results: ", i, result)
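Once you have a list of result dictionaries, ordinary Python is enough to reshape them. As one possible follow-up, this sketch groups matching page numbers by file. The helper name pages_by_file is hypothetical, and fake_results uses invented values; only the "file_source", "page_num", and "text" keys come from the loop above.

```python
from collections import defaultdict


def pages_by_file(query_results):
    """Map each file_source to the sorted, de-duplicated page numbers that matched."""
    pages = defaultdict(set)
    for result in query_results:
        pages[result["file_source"]].add(result["page_num"])
    return {name: sorted(nums) for name, nums in pages.items()}


# Stand-in results with invented values, shaped like the dictionaries above
fake_results = [
    {"file_source": "a.pdf", "page_num": 3, "text": "... base salary ..."},
    {"file_source": "b.pdf", "page_num": 1, "text": "... base salary ..."},
    {"file_source": "a.pdf", "page_num": 1, "text": "... base salary ..."},
    {"file_source": "a.pdf", "page_num": 3, "text": "... base salary ..."},
]
grouped = pages_by_file(fake_results)
print(grouped)
```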
Part 3 - The results
The output summary will include key information such as total pdf files processed, total blocks created, total pages added, and time elapsed. Try and see if you can find all of them!
The LLMWare package includes C-based parsers that are able to parse files quickly and efficiently. Once parsing is complete, the parsed information is output as a dictionary. You will see the results of your work in the previous steps!
To summarize, we took our documents and broke them down into thousands of blocks. Then, we extracted text information and put it into the sqlite database. Lastly, we ran a text search against that data to retrieve our results (including details as small as pixel coordinates and character-level matches!).
You just completed your first example, but there is so much more for you to explore! I would suggest rerunning this example with varied data sets to tap into the true potential of this technology, and of course, continue on to example 2 about building embeddings!
Happy coding!
Part 4 - To see more ...
Please join our LLMWare community on Discord to learn more about RAG and LLMs! https://discord.gg/5mx42AGbHm