If you have tried any serious work involving text analysis, natural language processing, or machine learning, you will quickly find that text splitting can either make your analysis highly effective or leave you worse off than if you had never attempted it at all.
There are many different applications and use cases for this task, but a common hurdle is how to actually perform the splitting. Most libraries expose two parameters to control it, chunk size and chunk overlap, and those parameters are the subject of this article.
Chunk size is the maximum number of characters that a chunk can contain.
Chunk overlap is the number of characters that should overlap between two adjacent chunks.
The chunk size and chunk overlap parameters control the granularity of the text splitting. A smaller chunk size produces more chunks, while a larger chunk size produces fewer. A larger chunk overlap means adjacent chunks share more characters, while a smaller overlap means they share fewer.
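To make these two parameters concrete, here is a minimal character-level sliding-window chunker in plain Python. This is only an illustrative sketch of the idea, not the algorithm LangChain actually uses:
def chunk(text, chunk_size, chunk_overlap):
    # Each chunk starts chunk_size - chunk_overlap characters after the
    # previous one, so consecutive chunks share chunk_overlap characters.
    # Assumes chunk_overlap < chunk_size.
    step = chunk_size - chunk_overlap
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + chunk_size])
        if i + chunk_size >= len(text):
            break
    return chunks
print(chunk("This is a piece of text.", chunk_size=10, chunk_overlap=5))
# ['This is a ', 'is a piece', 'piece of t', ' of text.']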
There are many different ways to split text. Some common methods include:
Character-based splitting: This method divides the text into chunks based on individual characters.
Word-based splitting: This method divides the text into chunks based on words.
Sentence-based splitting: This method divides the text into chunks based on sentences.
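As a rough illustration of all three methods in plain Python (the sentence split below is deliberately naive; real sentence splitters handle abbreviations and other edge cases):
import re
text = "This is a piece of text. It has two sentences."
# Character-based: every character becomes its own unit.
char_chunks = list(text)
# Word-based: split on whitespace.
word_chunks = text.split()
# Sentence-based: naively split after ., !, or ? followed by whitespace.
sentence_chunks = re.split(r"(?<=[.!?])\s+", text)
print(sentence_chunks)  # ['This is a piece of text.', 'It has two sentences.']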
The Recursive Text Splitter
The RecursiveCharacterTextSplitter is a class in the LangChain library that splits text recursively: it tries a list of separators in order (by default paragraph breaks, then newlines, then spaces, then individual characters) until the resulting chunks are small enough.
from langchain.text_splitter import RecursiveCharacterTextSplitter
text = "This is a piece of text."
# chunk_size and chunk_overlap are set when the splitter is constructed,
# not passed to split_text().
splitter = RecursiveCharacterTextSplitter(chunk_size=10, chunk_overlap=0)
chunks = splitter.split_text(text)
for chunk in chunks:
    print(chunk)
Output
This is a
piece of
text.
Note that if you construct the splitter with no arguments, its default chunk size is far larger than this short text, so split_text() simply returns the whole string as a single chunk.
The best choice of chunk size and chunk overlap depends on the specific problem you are trying to solve. In general, though, a small chunk size suits tasks that require a fine-grained view of the text, and a larger chunk size suits tasks that require a more holistic view.
Fine-grained view
Identifying individual words or characters can be useful for tasks such as spell-checking, grammar-checking, and text analysis.
Finding patterns in the text can be useful for tasks such as identifying spam, identifying plagiarism, and finding sentiment in the text.
Extracting keywords can be useful for tasks such as search engine optimization (SEO), topic modeling, and machine translation.
Example
# Fine-grained view: a small chunk size and no overlap
text = "This is a piece of text."
splitter = RecursiveCharacterTextSplitter(chunk_size=5, chunk_overlap=0)
chunks = splitter.split_text(text)
for chunk in chunks:
    print(chunk)
Output
This
is a
piece
of
text.
Notice that the short words "is" and "a" are merged into a single chunk because together they still fit within the chunk size.
Holistic view
Understanding the overall meaning of the text: This can be useful for tasks such as machine translation, text summarization, and question answering.
Identifying the relationships between different parts of the text: This can be useful for tasks such as natural language inference, question answering, and machine translation.
Generating new text: This can be useful for tasks such as machine translation, text summarization, and creative writing.
Example
# Holistic view: a larger chunk size with overlap
text = "This is a piece of text."
splitter = RecursiveCharacterTextSplitter(chunk_size=10, chunk_overlap=5)
chunks = splitter.split_text(text)
for chunk in chunks:
    print(chunk)
Output
This is a
is a piece
piece of
of text.
The overlap makes adjacent chunks share words ("is a", "piece", "of"), which helps preserve context across chunk boundaries. Exact boundaries may vary slightly between LangChain versions.
Here are some additional tips for using the recursive text splitter module:
Use a consistent chunk size and chunk overlap throughout your code. This will help to ensure that your results are consistent.
Consider the nature of the text you are splitting.
If the text is highly structured, such as code or HTML, you may want to use a larger chunk size. If the text is less structured, such as a novel or a news article, you may want to use a smaller chunk size.
Experiment with different chunk sizes and chunk overlaps
This will allow you to see what works best for your specific problem.
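For example, a quick loop like the following (a sketch reusing the splitter from the examples above) lets you compare several settings side by side:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text = "This is a piece of text."
# Try a few (chunk_size, chunk_overlap) combinations and compare the results.
for chunk_size, chunk_overlap in [(5, 0), (10, 0), (10, 5)]:
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    print(f"chunk_size={chunk_size}, chunk_overlap={chunk_overlap}: {splitter.split_text(text)}")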
Happy coding!
Top comments (5)
useful and detailed article, thank you!
proof?
Here's what experts say:
Interestingly, Pinecone, a popular vector database, suggests that having different segment lengths within a single database could potentially improve results. By incorporating both short and long chunks, the database can capture a wider range of context and information, accommodating different types of queries more flexibly.
ai.plainenglish.io/investigating-c...
Makes total sense. Multi-resolution approach always works best.
Newer approaches suggest chunking at the level of paragraphs, not character or sentence counts.
absolutely useless