Joe

Elastic D&D - Update 13 - Text Chunking

In the last post we talked about how Veverbot works. If you missed it, you can check that out here!

Chunking

Chunking is the process of breaking something large into smaller, more manageable pieces. For example, the free audio transcription method uses this on the audio file. You can see that here.

While using Veverbot, I noticed that larger text passages were awful for returning relevant information back to the AI assistant. To make Veverbot better, I have been working on breaking these large text passages into smaller ones with context, meaning that consecutive text chunks share some overlapping text in order to return better responses.

Python Function

Accomplishing chunking with overlap ended up being fairly easy. Using the Natural Language Toolkit (NLTK), specifically Punkt, we are able to tokenize text passages into an array of sentences. From there, we loop through this array and compare the combined length of the current chunk and the next sentence against the chunk_size variable. If the sum reaches chunk_size, the current chunk is added to the chunks array and the overlap for the next chunk is built. The overlap is built the same way, except in reverse: starting from the previous sentence and walking backwards, whole sentences are prepended as long as the total stays under overlap_size. Both steps are simple loops over the sentence array, which makes this process quite fast. When it is finished, the function returns an array of text chunks to use in the log_payload for Elastic indexing.

import nltk


def split_text_with_overlap(text, chunk_size=500, overlap_size=100):
    # download punkt and initialize tokenizer
    nltk.download("punkt")
    tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

    # separate text into an array of sentences
    array = tokenizer.tokenize(text)

    # once chunk + next sentence would reach chunk_size, store the chunk
    # afterwards, start the next chunk with up to overlap_size characters
    # of preceding sentences for context overlap
    chunks = []
    chunk = ""
    for index, sentence in enumerate(array):
        # "chunk and" avoids storing an empty chunk when the first sentence is long
        if chunk and (len(chunk) + len(sentence)) >= chunk_size:
            chunks.append(chunk)

            # walk backwards, prepending whole sentences to the overlap
            overlap = ""
            overlap_index = index - 1
            while overlap_index >= 0 and (len(overlap) + len(array[overlap_index])) < overlap_size:
                overlap = array[overlap_index] + " " + overlap
                overlap_index = overlap_index - 1
            chunk = overlap + sentence
        else:
            # the tokenizer drops whitespace between sentences, so restore a space
            chunk = f"{chunk} {sentence}" if chunk else sentence
    # store last bit of text that may not hit length limit
    if chunk:
        chunks.append(chunk)

    return chunks
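
To see what the overlap looks like in practice, here is a minimal sketch that runs a loop similar to the function above on a list of pre-split sentences, so it works without NLTK. The chunk_sentences name, the sample sentences, the small size limits, and the payload field names are all made up for illustration; the real log_payload may look quite different.

```python
def chunk_sentences(sentences, chunk_size=500, overlap_size=100):
    """Group sentences into chunks, repeating trailing sentences as overlap."""
    chunks = []
    chunk = ""
    for index, sentence in enumerate(sentences):
        if chunk and len(chunk) + len(sentence) >= chunk_size:
            chunks.append(chunk)
            # walk backwards from the previous sentence to build the overlap
            overlap = ""
            back = index - 1
            while back >= 0 and len(overlap) + len(sentences[back]) < overlap_size:
                overlap = sentences[back] + " " + overlap
                back -= 1
            chunk = overlap + sentence
        else:
            # join sentences with a single space
            chunk = f"{chunk} {sentence}" if chunk else sentence
    if chunk:
        chunks.append(chunk)
    return chunks


# hypothetical session notes, already split into sentences
sentences = [
    "The party entered the crypt.",
    "A wight rose from the sarcophagus.",
    "The cleric turned it with a prayer.",
    "They found a silver key inside.",
]

# small limits so the overlap is easy to see
chunks = chunk_sentences(sentences, chunk_size=70, overlap_size=40)
for chunk in chunks:
    print(chunk)

# hypothetical payloads: one document per chunk (field names are assumptions)
payloads = [{"chunk_number": i, "message": chunk} for i, chunk in enumerate(chunks)]
```

With these limits, each chunk after the first starts with the last sentence of the chunk before it, so a search that matches one chunk still carries a bit of the surrounding context.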

NOTE:

I will show how this process fits into note input once I finish my rewrite of that page. It is almost done and I am super happy with it.

Closing Remarks

I am quite pleased with how this process panned out. It works very well and it is lightning fast, which is something that I was worried about.

I plan on finishing my note input rewrite by next week so I hope to talk about that in the next post. If not, I can begin talking about the new player dashboard that will be replacing the home page.

Check out the GitHub repo below. You can also find my Twitch account in the socials link, where I will be actively working on this during the week while interacting with whoever is hanging out!

GitHub Repo
Socials

Happy Coding,
Joe