Large language models are powerful tools, but they share a hard limit known as the context window: the maximum amount of text, measured in tokens, that a model can process in a single request. Take, for example, gpt-3.5-turbo, which has a context length of 4,096 tokens, roughly 3,000 English words.
But what occurs when you present these models with a document that exceeds their context window? This is where a clever strategy known as "chunking" comes into play. Chunking involves dividing the document into smaller, more manageable sections that fit comfortably within the context window of the large language model.
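To make the idea concrete, here is a minimal sketch of the most naive form of chunking: cutting the text every N characters. The helper name naive_chunks and the sample sentence are made up for illustration and are not part of LangChain. Notice that this approach happily cuts words in half, which is the problem a smarter splitter tries to avoid.

```python
def naive_chunks(text: str, chunk_size: int) -> list[str]:
    """Cut text every chunk_size characters, ignoring word and sentence boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Hypothetical sample sentence, used only for this illustration.
sample = "Large language models can only read a limited amount of text at once."
chunks = naive_chunks(sample, 20)

for chunk in chunks:
    print(repr(chunk))
```

Every chunk fits the limit and nothing is lost, but the boundaries fall wherever the character count says so, not where the text naturally breaks.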
RecursiveCharacterTextSplitter takes a large text and splits it to fit a specified chunk size, using an ordered list of separators. The default separators are ["\n\n", "\n", " ", ""]. It first tries to split the text on the first separator, "\n\n". If a resulting piece is still larger than the chunk size, it moves to the next separator, "\n", and splits that piece again. It keeps moving through the list until every piece is no larger than our specified chunk size.
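The separator fallback described above can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's actual implementation: the function name recursive_split is hypothetical, and the sketch leaves out the step that merges small pieces back together as well as chunk overlap.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Simplified sketch: split on the first separator, recurse on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []

    sep, rest = separators[0], separators[1:]
    # The empty separator is the last resort: fall back to single characters.
    pieces = text.split(sep) if sep else list(text)

    result = []
    for piece in pieces:
        if len(piece) <= chunk_size or not rest:
            if piece:
                result.append(piece)
        else:
            # Still too large: try again with the next separator in the list.
            result.extend(recursive_split(piece, chunk_size, rest))
    return result


sample = "aaa bbb\n\ncccc dddd eeee"
pieces = recursive_split(sample, chunk_size=9)
print(pieces)  # ['aaa bbb', 'cccc', 'dddd', 'eeee']
```

The first paragraph already fits in 9 characters, so it is kept whole. The second does not, so it falls through "\n" (which changes nothing) to " ", where it finally breaks into pieces that fit.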
What I Worked On
Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.
The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.
The text above is an excerpt from Paul Graham's essay What I Worked On. Let's use RecursiveCharacterTextSplitter to break it into small chunks, each with a maximum size of 100 characters.
First we import it from langchain:
from langchain.text_splitter import RecursiveCharacterTextSplitter
Let's load the text we want to split into a variable called text:
text = """What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.
"""
Next, we create a RecursiveCharacterTextSplitter instance with a chunk_size of 100 and a chunk_overlap of zero. We pass Python's built-in len as the length_function, so the size of each chunk is measured in characters.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=0,
    length_function=len,
)
RecursiveCharacterTextSplitter offers several methods for performing splits. Here we use split_text, which takes the text as a string and returns a list of strings, one per chunk.
texts = text_splitter.split_text(text)
print(len(texts))  # 11
print(texts)  # ['What I Worked On\n\nFebruary 2021', ...]
The split divided our text into a total of 11 chunks.
Just as its name suggests, RecursiveCharacterTextSplitter uses recursion as its core mechanism. Let's walk through, step by step, how our earlier code produced these chunks.
For our walkthrough, we'll use the same text and parameters as in the code: the excerpt from Paul Graham's essay and a chunk size of 100 characters. The separators we use for splitting are ['\n\n', '\n', ' ', ''].
Let's begin with our initial text. It is currently shown in human-readable form, so our next step is to view it as the raw string the splitter actually receives. Now the line breaks appear as \n, which is precisely what we need in order to carry out our splitting process.
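In Python, a quick way to see this raw form is the built-in repr function, which prints line breaks as \n escapes instead of rendering them. The short sample below contains only the title and date from our text.

```python
sample = """What I Worked On

February 2021"""

# repr shows the string as Python stores it, with newlines as escape sequences.
print(repr(sample))  # 'What I Worked On\n\nFebruary 2021'
```

The blank line between the title and the date shows up as two consecutive \n characters, which is exactly the first separator the splitter looks for.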
Let's select our text. This is like calling the split_text method on our text. As mentioned earlier, RecursiveCharacterTextSplitter tries each separator from a predefined list in turn. Its first attempt uses "\n\n", which splits the text into paragraphs. Let's now find every occurrence of this separator in our text.
Once we've located every "\n\n", the next step is to split the text using it as our separator.
We now have four splits. Our next step is to check whether each split is smaller than our chunk size of 100 characters.
The first two splits satisfy this condition, which makes them good splits. Because together they still come to fewer than 100 characters, we can combine them to create our first chunk.
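Here is a minimal sketch of this step in plain Python. To keep it short, the sample is an abbreviated version of our text (the title, the date, and the first two sentences), so it produces three splits rather than four. Joining the good splits with the separator is an assumption made for this illustration, chosen because it matches the first chunk we saw in the program's output.

```python
CHUNK_SIZE = 100

# Abbreviated sample: title, date, and the start of the first paragraph.
sample = (
    "What I Worked On\n\n"
    "February 2021\n\n"
    "Before college the two main things I worked on, outside of school, "
    "were writing and programming. I didn't write essays."
)

# Step 1: split on the paragraph separator.
splits = sample.split("\n\n")

# Step 2: check which splits already fit within the chunk size.
fits = [len(s) <= CHUNK_SIZE for s in splits]
print(fits)  # [True, True, False]

# Step 3: the first two good splits are small enough to be combined.
first_chunk = "\n\n".join(splits[:2])
print(repr(first_chunk))  # 'What I Worked On\n\nFebruary 2021'
```

The last split is the one that fails the size check, so it is the one that has to be split again with the next separator.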
Moving on to the third split, we find that it cannot be reduced any further using "\n\n". We therefore move to the next separator, "\n", and split on it to see whether the pieces get smaller.
This is like calling split_text on the text of the third split, but with "\n" as the separator. This is where the concept of recursion comes into play.
Splitting on "\n" gives us two splits. The first qualifies as a good split, since it contains only one character. The second, however, still exceeds our chunk size.
Consequently, we need to call split_text on this split once again, this time with the next separator in our list, which is ' '.
Finally, we have reduced the split size. We now iterate over the pieces and merge them, following one rule: no merged split may exceed our chunk size of 100 characters.
After the merge, we end up with four chunks, none of them longer than 100 characters.
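The merge can be sketched as a simple greedy loop: keep adding pieces to the current chunk until the next piece would push it over the limit, then start a new chunk. The helper name merge_pieces is hypothetical, and the sketch assumes a chunk overlap of zero. LangChain's own merge logic also handles overlap and is more involved. A small chunk size of 20 keeps the example easy to follow.

```python
def merge_pieces(pieces: list[str], separator: str, chunk_size: int) -> list[str]:
    """Greedily join consecutive pieces while the joined text fits in chunk_size."""
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            # Adding this piece would exceed the limit: close the current chunk.
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


words = "I wrote what beginning writers were supposed to write".split(" ")
merged = merge_pieces(words, " ", chunk_size=20)
print(merged)
# ['I wrote what', 'beginning writers', 'were supposed to', 'write']
```

Each chunk is as full as it can be without breaking the limit, which is why the chunks end up with different lengths.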
Now, let's revisit the original text splits and identify which split remains to be processed.
We still have one split that is larger than our chunk size, so we repeat the same procedure.
We initiate the split using the new line character as the separator.
We perform a split using spaces as the separators.
Next, we proceed with a merge, ensuring that no merged segments exceed the defined chunk size.
After going through the entire process, we arrive at eleven chunks, each within the 100-character limit.
This matches exactly what we got programmatically.
And there we have it. We've looked at the inner workings of LangChain's RecursiveCharacterTextSplitter. If you're curious, you can explore the source code here. If you found this article informative, please consider showing your appreciation with a reaction: 💖 🦄 🤯 🙌 🔥