Mrunmay Shelar for LangDB

Originally published at app.langdb.ai

Exploring Different Chunking Strategies and Working with Unstructured Data

LangDB provides a set of functions for working with unstructured data. They streamline common tasks such as data extraction and text chunking. Let's dive into the key functions and see how they can make your life easier.

load

The load function converts a file or webpage into bytes. These bytes can then be used to extract text or layout information from it.

SELECT * FROM load('s3://sample-onlineboutique-codefiles/onlineboutique-codefiles/just-deserts-spring-obooko-small.pdf');
content
[37,80,68,70,45,49,46,54,10,37,-30,-29,-49,-45,10,53,32,48,32,111,98,106,10,60,60,10,47,66,77,32,47,78,111,114,109,97,108,10,47,99,97,32,49,10,62,62,10,101,110,100,111,98,106,10,56,32,48,32,111,98,106,10,60,60,10,47,70,105,108,116,101,114,32,47,70,108,97,116,101,68,101,99,111,100,101,10,47,76,101,110,103,116,104,32,50,57,54,10,47,78,32,51,10,62,62,10,115,116,114,101,97,109,10,120,-100,125,-112,-67,74,-61,96,20,-122,31,107,65,20,-59,65,-121,14,14,25,28,92,-44,-2,104,127,-64,-91,-83,88,92,91,-123,86,-89,52,77,-117,-40,-97,-112,-90,-24,5,-24,-26,-32,-22,38,46,-34,-128,-24,101,40,8,14,-30,-32,37,-120,-96,-77,111,26,36,5,-87,-25,-16,-26,123,120,-13,-110,47,-25,64,36,-122,42,26,-121,78,-41,115,-53,-91,-126,81,-83,29,24,83,-17,76,-88,-121,101,90,125,-121,-15,-91,-44,-9,75,-112,125,94,-3,39,55,-82,-90,27,118,-33,-46,-7,33,121,-82,46,-41,39,27,-30,-59,86,-64,-89,62,-41,3,-66,-16,-7,-60,115,60,-15,-75,-49,-18,94,-71,40,-66,19,-81,-76,70,-72,62,10]
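
To make the byte output less opaque, here is a small Python sketch (not part of LangDB) that decodes the first few values of the content column above. It assumes the values are signed 8-bit integers, which is what the negative numbers suggest; the helper name to_bytes is made up for this illustration.

```python
# First values of the `content` column shown above.
# Assumption: each value is a signed 8-bit integer (-128..127).
signed = [37, 80, 68, 70, 45, 49, 46, 54, 10, 37, -30, -29, -49, -45, 10]


def to_bytes(values):
    """Map signed 8-bit integers to an unsigned Python bytes object."""
    return bytes(v & 0xFF for v in values)


raw = to_bytes(signed)
header = raw[:8]
print(header)  # b'%PDF-1.6'
```

The first eight bytes spell out the PDF file header, so the array really is the raw PDF rather than extracted text.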

extract_text

The extract_text() function extracts text from various file types, with specific options available for PDF files.

Parameters

| Parameter | Type | Optional | Description | Possible Values | Sample Value |
| --- | --- | --- | --- | --- | --- |
| path | String | No | The file path to extract text from | Any valid URL | 'https://example.com' |
| type | String | Yes | Type of file | PDF, Markdown, Text, HTML | 'pdf' |
| page_range | Array(Int) | Yes | Extra parameter for PDF files: the range of page numbers | Array of start and end page numbers | [1, 10] |
| per_page | Bool | Yes | Extra parameter for PDF files: whether to chunk per page | true, false | true |

Usage with load function

SELECT * FROM extract_text((SELECT * from load('s3://sample-onlineboutique-codefiles/onlineboutique-codefiles/just-deserts-spring-obooko-small.pdf')),
    type => 'pdf' ,
    per_page => false
);
content metadata page_no

JUST DESERTS
Aniela Spring
© Copyright Aniela Spring 2024
This is an authorised free edition from www.obooko.com
Although you do not have to pay for this book, the author’s intellectual property
rights remain fully protected by international Copyright laws. You are licensed to
use this digital copy strictly for your personal enjoyment only: it must not be
redistributed commercially or offered for sale in any form. If you paid for this free
edition, or to gain access to it, we suggest you demand an immediate refund and
report the transaction to the author and Obooko.
All characters are fictitious and any resemblance to real persons, living or dead,
is utterly coincidental.
1
{"total_pages":2,"page_range":"(0, 2)"} 0
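
The metadata column in the row above is a JSON string. As a rough sketch of how you might read it on the client side (this is plain Python, not a LangDB API), the snippet below parses the exact value shown. It assumes page_range is always a parenthesised pair of integers; whether the bounds are inclusive or exclusive is not stated here, so the code does not interpret them.

```python
import ast
import json

# The metadata value copied from the result row above.
metadata_json = '{"total_pages":2,"page_range":"(0, 2)"}'

metadata = json.loads(metadata_json)

# page_range arrives as a string such as "(0, 2)".
# Assumption: it is always a pair of integers in parentheses.
start, end = ast.literal_eval(metadata["page_range"])

print(metadata["total_pages"], start, end)  # 2 0 2
```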

These functions are best suited to raw text. If you also need layout information from a document, LangDB supports that too.

extract_layout

The extract_layout function enables structured data extraction with layout information from a document.

Parameters

| Parameter | Type | Optional | Description | Possible Values | Sample Value |
| --- | --- | --- | --- | --- | --- |
| path | String | No | The file path to extract text from | Any valid file URL | 'https://example.pdf' |
| type | String | Yes | Type of file | Raw, PDF, Image | 'pdf' |
| page_range | Array(Int) | Yes | Extra parameter for PDF files: the range of page numbers | Array of start and end page numbers | [1, 10] |
| parallelism | Int | Yes | Extra parameter for PDF files: process pages in parallel | 2, 4, 5 | 2 |

Extracting Layout information from a PDF

SELECT * FROM extract_layout(
    path => 's3://sample-onlineboutique-codefiles/onlineboutique-codefiles/just-deserts-spring-obooko-small.pdf',
    type=> 'pdf'
);
page block_idx block_id block_type row_id col_id text confidence entity_types relationships
0 0 c7261e9c-be58-4776-a1de-70adf6e4e6e6 PAGE 0 0 0 [] [["CHILD",["23112b0d-4062-424d-bbb3-4f4aa82f4d80","3e3c5562-b018-4f75-85d9-6e7771489ba0","f08a9210-eedb-4150-99e2-a5d22b26e029","f3087bee-7680-4024-aeff-60ab0bdc1dac"]]]
0 1 23112b0d-4062-424d-bbb3-4f4aa82f4d80 LINE 0 0 Don't forget about your past, because it never forgets about you. 99.88849 [] [["CHILD",["102e10d5-fd45-46ee-9890-b70279c6e532","af6bad3a-34fc-462e-9033-c1af2bd5aa1a","aab2849a-4a4b-499c-a16f-43c55fb5dffd","78ef1f76-d8a5-413f-be87-cff88194b7e1","9f41e657-f307-487f-872e-569272305ad4","01ae2b2a-755f-4ef9-9ddc-24eaefbaabd4","7484031d-5259-48ad-a3c7-bbcb862d34f0","783ddab6-47a3-48aa-b56b-adc564daa8cd","d7a69ab3-c601-4d9d-9632-7d4f176b2462","d8f537c7-4c64-4a01-9792-088660b1631d","fb7ad2cb-e72d-4013-8399-fa32d46cb21d"]]]
0 2 3e3c5562-b018-4f75-85d9-6e7771489ba0 LINE 0 0 JUST DESERTS 98.635315 [] [["CHILD",["5e4fc404-7326-4195-a3d7-343a4dea7a8f","f3efd0b2-0c54-49bc-9867-6830eab05403"]]]
0 3 f08a9210-eedb-4150-99e2-a5d22b26e029 LINE 0 0 ANIELA SPRING 99.87999 [] [["CHILD",["f4e95636-470f-45eb-a599-4d3e00f754d6","5e8676d2-f90c-4455-b179-083da72c647e"]]]
0 4 102e10d5-fd45-46ee-9890-b70279c6e532 WORD 0 0 Don't 99.96765 [] []
0 5 af6bad3a-34fc-462e-9033-c1af2bd5aa1a WORD 0 0 forget 99.908676 [] []
0 6 aab2849a-4a4b-499c-a16f-43c55fb5dffd WORD 0 0 about 99.9353 [] []
0 7 78ef1f76-d8a5-413f-be87-cff88194b7e1 WORD 0 0 your 99.92315 [] []
0 8 9f41e657-f307-487f-872e-569272305ad4 WORD 0 0 past, 99.73978 [] []
0 9 01ae2b2a-755f-4ef9-9ddc-24eaefbaabd4 WORD 0 0 because 99.9515 [] []
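
The relationships column links each block to its children: a PAGE points at its LINE blocks, and a LINE points at its WORD blocks. The Python sketch below (an illustration, not LangDB code) follows the CHILD ids of the first LINE block and looks up the WORD rows shown above. It assumes the relationships value has already been parsed into nested lists; the helper name child_texts is hypothetical.

```python
# WORD rows copied from the output above: block_id -> text.
word_text = {
    "102e10d5-fd45-46ee-9890-b70279c6e532": "Don't",
    "af6bad3a-34fc-462e-9033-c1af2bd5aa1a": "forget",
    "aab2849a-4a4b-499c-a16f-43c55fb5dffd": "about",
    "78ef1f76-d8a5-413f-be87-cff88194b7e1": "your",
    "9f41e657-f307-487f-872e-569272305ad4": "past,",
    "01ae2b2a-755f-4ef9-9ddc-24eaefbaabd4": "because",
}

# relationships of LINE block 23112b0d-4062-424d-bbb3-4f4aa82f4d80,
# copied from the same output (assumed already parsed into lists).
line_relationships = [["CHILD", [
    "102e10d5-fd45-46ee-9890-b70279c6e532",
    "af6bad3a-34fc-462e-9033-c1af2bd5aa1a",
    "aab2849a-4a4b-499c-a16f-43c55fb5dffd",
    "78ef1f76-d8a5-413f-be87-cff88194b7e1",
    "9f41e657-f307-487f-872e-569272305ad4",
    "01ae2b2a-755f-4ef9-9ddc-24eaefbaabd4",
    "7484031d-5259-48ad-a3c7-bbcb862d34f0",
    "783ddab6-47a3-48aa-b56b-adc564daa8cd",
    "d7a69ab3-c601-4d9d-9632-7d4f176b2462",
    "d8f537c7-4c64-4a01-9792-088660b1631d",
    "fb7ad2cb-e72d-4013-8399-fa32d46cb21d",
]]]


def child_texts(relationships, texts_by_id):
    """Return the text of each CHILD block we have a row for, in order."""
    found = []
    for kind, ids in relationships:
        if kind != "CHILD":
            continue
        found.extend(texts_by_id[i] for i in ids if i in texts_by_id)
    return found


words = child_texts(line_relationships, word_text)
print(" ".join(words))  # Don't forget about your past, because
```

Only the first ten rows of the result are shown above, so the last five child ids have no matching WORD row here and are skipped.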

Extracting Layout information from an Image

Similarly, you can extract layout information from an image through the following code:

SELECT * FROM extract_layout(
    path => 'https://langdb-sample-data.s3.ap-southeast-1.amazonaws.com/Screenshot+from+2024-08-09+09-49-18.png',
    type => 'image'
);

chunk

The chunk function breaks down large texts into smaller, manageable pieces. This is particularly useful for processing long documents, especially when working with models that have input size limitations.

Parameters

| Parameter | Type | Optional | Description | Possible Values | Sample Value |
| --- | --- | --- | --- | --- | --- |
| raw_text | String | No | The raw text to be chunked | Any String | 'In a quaint village...' |
| type | String | No | Unit of chunking | Char, Word, Sentence, Paragraph | 'Char' |
| chunk_size | Int | Yes | Number of units in a chunk | Any non-negative integer | 100 |
| overlap | Int | Yes | Number of units to overlap between consecutive chunks | Any non-negative integer | 20 |
| trim | Bool | Yes | Whether to trim whitespace from the start and end of each chunk | true, false | true |

Chunking Raw text into Char with Chunk Size

SELECT * FROM chunk('In a quaint village nestled in the heart of the countryside, there lived a young girl named Lily. She was known throughout the village for her vibrant imagination and her love for adventure. Every day, Lily would set out to explore the lush forests and rolling hills that surrounded her home, always eager to discover something new and exciting.

One particularly sunny morning, Lily decided to venture deeper into the woods than she ever had before. As she walked, she stumbled upon a hidden grove filled with the most beautiful flowers she had ever seen. The colors were so vivid and the petals so delicate that Lily couldnt help but marvel at their beauty. She spent hours in the grove, carefully examining each flower and breathing in their sweet fragrance.',
    type => 'Char',
    trim => true,
    chunk_size => 200);
text index
In a quaint village nestled in the heart of the countryside, there lived a young girl named Lily. She was known throughout the village for her vibrant imagination and her love for adventure. 0
Every day, Lily would set out to explore the lush forests and rolling hills that surrounded her home, always eager to discover something new and exciting. 1
One particularly sunny morning, Lily decided to venture deeper into the woods than she ever had before. 2
As she walked, she stumbled upon a hidden grove filled with the most beautiful flowers she had ever seen. 3
The colors were so vivid and the petals so delicate that Lily couldnt help but marvel at their beauty. 4
She spent hours in the grove, carefully examining each flower and breathing in their sweet fragrance. 5
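
Notice that none of the chunks above is cut mid-sentence, and each stays under 200 characters. That is consistent with a splitter that treats chunk_size as an upper bound and prefers sentence boundaries. The Python sketch below shows one way to get that behaviour by greedily packing sentences; it is an assumption about the behaviour, not LangDB's actual implementation, and the name chunk_chars is made up here.

```python
import re


def chunk_chars(text: str, chunk_size: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size chars."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A sentence longer than chunk_size becomes its own chunk.
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(chunk_chars("Aa bb. Cc dd. Ee ff gg hh.", 13))
# ['Aa bb. Cc dd.', 'Ee ff gg hh.']
```

In the example above, the first two sentences total 190 characters and fit in one chunk, while every other pair of neighbouring sentences exceeds 200, which is why the remaining chunks hold one sentence each.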

Chunking Raw text into Word with Chunk Size and Overlap

SELECT * FROM chunk('In a quaint village nestled in the heart of the countryside, there lived a young girl named Lily. She was known throughout the village for her vibrant imagination and her love for adventure. Every day, Lily would set out to explore the lush forests and rolling hills that surrounded her home, always eager to discover something new and exciting.

One particularly sunny morning, Lily decided to venture deeper into the woods than she ever had before. As she walked, she stumbled upon a hidden grove filled with the most beautiful flowers she had ever seen. The colors were so vivid and the petals so delicate that Lily couldnt help but marvel at their beauty. She spent hours in the grove, carefully examining each flower and breathing in their sweet fragrance.',
    type => 'Word',
    chunk_size => 30,
    overlap => 10);
text index
In a quaint village nestled in the heart of the countryside there lived a young girl named Lily She was known throughout the village for her vibrant imagination and her 0
known throughout the village for her vibrant imagination and her love for adventure Every day Lily would set out to explore the lush forests and rolling hills that surrounded her 1
explore the lush forests and rolling hills that surrounded her home always eager to discover something new and exciting

One particularly sunny morning Lily decided to venture deeper into the

2
particularly sunny morning Lily decided to venture deeper into the woods than she ever had before As she walked she stumbled upon a hidden grove filled with the most beautiful 3
stumbled upon a hidden grove filled with the most beautiful flowers she had ever seen The colors were so vivid and the petals so delicate that Lily couldnt help but 4
and the petals so delicate that Lily couldnt help but marvel at their beauty She spent hours in the grove carefully examining each flower and breathing in their sweet fragrance 5
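
With chunk_size 30 and overlap 10, each chunk starts 20 words after the previous one, so the last 10 words of one chunk reappear at the start of the next. Here is a minimal Python sketch of that sliding window. It is an illustration under simple assumptions (words are runs of letters and digits, and chunks are joined with single spaces), not LangDB's implementation; chunk_words is a hypothetical name.

```python
import re


def chunk_words(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Slide a window of chunk_size words, advancing chunk_size - overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    # Assumption: a word is a run of word characters; punctuation is dropped.
    # Note that this would also split contractions such as "Don't".
    words = re.findall(r"\w+", text)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the window has reached the final word.
        if start + chunk_size >= len(words):
            break
    return chunks


print(chunk_words("a b c d e f g h i j", chunk_size=4, overlap=2))
# ['a b c d', 'c d e f', 'e f g h', 'g h i j']
```

The sample text has 130 words, so windows start at words 0, 20, 40, 60, 80 and 100, which matches the six chunks in the output above.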

Chunking Raw Text into Sentences

SELECT * FROM chunk('In a quaint village nestled in the heart of the countryside, there lived a young girl named Lily. She was known throughout the village for her vibrant imagination and her love for adventure. Every day, Lily would set out to explore the lush forests and rolling hills that surrounded her home, always eager to discover something new and exciting.

One particularly sunny morning, Lily decided to venture deeper into the woods than she ever had before. As she walked, she stumbled upon a hidden grove filled with the most beautiful flowers she had ever seen. The colors were so vivid and the petals so delicate that Lily couldnt help but marvel at their beauty. She spent hours in the grove, carefully examining each flower and breathing in their sweet fragrance.',
    type => 'Sentence');
text index
In a quaint village nestled in the heart of the countryside, there lived a young girl named Lily 0
She was known throughout the village for her vibrant imagination and her love for adventure 1
Every day, Lily would set out to explore the lush forests and rolling hills that surrounded her home, always eager to discover something new and exciting 2
One particularly sunny morning, Lily decided to venture deeper into the woods than she ever had before 3
As she walked, she stumbled upon a hidden grove filled with the most beautiful flowers she had ever seen 4
The colors were so vivid and the petals so delicate that Lily couldnt help but marvel at their beauty 5
She spent hours in the grove, carefully examining each flower and breathing in their sweet fragrance 6

Chunking Raw Text into Paragraphs

SELECT * FROM chunk('In a quaint village nestled in the heart of the countryside, there lived a young girl named Lily. She was known throughout the village for her vibrant imagination and her love for adventure. Every day, Lily would set out to explore the lush forests and rolling hills that surrounded her home, always eager to discover something new and exciting.

One particularly sunny morning, Lily decided to venture deeper into the woods than she ever had before. As she walked, she stumbled upon a hidden grove filled with the most beautiful flowers she had ever seen. The colors were so vivid and the petals so delicate that Lily couldnt help but marvel at their beauty. She spent hours in the grove, carefully examining each flower and breathing in their sweet fragrance.',
    type => 'Paragraph');
text index
In a quaint village nestled in the heart of the countryside, there lived a young girl named Lily. She was known throughout the village for her vibrant imagination and her love for adventure. Every day, Lily would set out to explore the lush forests and rolling hills that surrounded her home, always eager to discover something new and exciting. 0
One particularly sunny morning, Lily decided to venture deeper into the woods than she ever had before. As she walked, she stumbled upon a hidden grove filled with the most beautiful flowers she had ever seen. The colors were so vivid and the petals so delicate that Lily couldnt help but marvel at their beauty. She spent hours in the grove, carefully examining each flower and breathing in their sweet fragrance. 1
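
For comparison, the Sentence and Paragraph outputs above can be approximated with two small regular expressions. This Python sketch is only an approximation under stated assumptions: paragraphs are separated by a blank line, and sentences end in ., ! or ? (which is dropped, as in the Sentence output above). The function names are made up for this example.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(text: str) -> list[str]:
    """Split after ., ! or ? followed by whitespace or end of text.

    The terminal punctuation is removed from each sentence.
    """
    parts = re.split(r"[.!?]+(?:\s+|$)", text)
    return [s.strip() for s in parts if s.strip()]


sample = "One. Two!\n\nThree?"
print(split_paragraphs(sample))  # ['One. Two!', 'Three?']
print(split_sentences(sample))   # ['One', 'Two', 'Three']
```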

Combining functions

We have seen how these functions behave individually, but the real power of LangDB lies in combining them. Let's take a job description PDF as an example.

First, we use load to convert the file into bytes and extract_text to get all the raw text from it.
Then we chunk the text by Char with a chunk_size of 2000.

select * from chunk(
    (
        select content from extract_text((
            select * from load('https://www.stjohneyehospital.org/wp-content/uploads/2024/05/Job-Description-Accountant.pdf',
            type=> 'pdf')
        ))
    ),
    chunk_size => 2000,
    type => 'Char',
    trim => false
)
text index
ST. JOHN EYE HOSPITAL – JERUSALEM
JOB DESCRIPTION
Title Accountant
Department Finance
Section
Reports to Director of Finance
Hours 40 hrs per week (inc of lunch breaks)
Date February 24
formulated/updated
General Statement of Duties: To play a major role in controlling the costing system of purchases and
payroll by supporting the existing accountants and providing reports as instructed by the Director of Finance.
Main Responsibilities:
  1. To act as a substitute for the senior/payroll accountant during her absence.
  2. Act as the Projects’ accountant and point of contact by providing reports and supporting documents for projects and any other assistance as needed.
  3. Act as the Cafeteria’s accountant which includes recording of expenses and income, produce reports for management, as well as reporting to the tax authority.
  4. Responsible for examining, recording, and summarizing the organization’s West Bank costs, mainly payroll and purchases. The Accountant records and classifies expenditures to create financial statements for senior management.
  5. Ensure that all costs are identified and recorded accurately.
  6. Maintaining accurate costing records in relation to labour and supplies.
  7. Process accounting transactions using the existing accounting software.
  8. Assist in the preparation of the monthly local management accounts and comparing it to budget, and report on any variance to DOF and other heads of departments.
  9. Process Palestinian payroll transactions using accounting and payroll systems and assist with the Israeli payroll system when needed (and ensure that the payroll taxes and national insurance are paid to the regulatory bodies on timely basis).
  10. Revision of purchases recorded at the pharmacy system.
  11. Monitor and coordinate payments for West Bank Suppliers
  12. Any other duties as assigned by the Director of Finance. General Responsibilities:
0
1. All staff are expected to report for work on time and fulfil their hours of duty, from time to time some flexibility may be required in order to meet the needs of the job and this may be outside regular hours of work.
  • All staff are expected to promote and contribute to a cooperative and productive work environment. Staff are also expected to show respect and consideration to their colleagues and all patients and visitors to the hospital.
  • All staff are expected to follow the dress code for their area of work. All uniforms as required by different work areas should be worn at all times. Staff who do not have a uniform are expected to wear appropriate, respectful, modest business dress. Jeans are not considered appropriate attire.
  • The hospital is a no smoking hospital and smoking is only permitted in the designated smoking areas and only during official break periods.
  • All staff will abide by confidentiality rules and will not disclose any information about patients, the staff or the workings of the hospital, except in certain circumstances where express permission is given as per the Confidentiality Policy.
  • All staff are expected to comply at all times with the requirements of Health and Safety regulations and to take responsibility for the health and safety and welfare of others in the working environment ensuring that agreed safety procedures are carried out to maintain a safe environment.
  • The Hospital has a Control of Visits in the Hospital and Security of Workers policy in order to help protect patients, visitors and staff and to safeguard their property. All employees have a responsibility to ensure that those persons using the Hospital and its service are as secure as possible.
  • The Hospital is committed to equality and all staff are expected to treat colleagues, patients and visitors to the Hospital with dignity and respect, regardless of their ethnic background, religion, race, gender, age or
1
sexual orientation.
  • All staff are expected to familiarise themselves with the requirements of the Hospitals policies and procedures for staff and also their specific area of work.
  • All appointments within the Hospital are subject to pre-employment health screening.
  • All staff are responsible for ensuring that all risks of cross infection to patients are minimised and that all policies, procedures and guidance relating to infection control practice are adhered to.
  • All staff are responsible, where relevant, for ensuring that all equipment used by patients is clean/decontaminated as instructed by manufacturers and in line with the infection control/guidelines protocol and policy.
  • The job description gives a general outline of the duties of the position and is not intended to be an inflexible or finite list of tasks. It may be varied, from time to time, after consultation with the member of staff.
  • Any other duties as designated by your manager and which are commensurate with the grade. Essential requirements for the post: Bachelor’s degree in accounting. At least one year experience in the accounting field mainly in the payable’s sections. At least one year experience in processing payroll. Knowledge and experience of the Israeli & Palestinian Payroll systems is required. Previous experience in projects is a plus. Very Good in English and Hebrew languages. Computer literate especially excel spread sheets. Good eye for details. Methodical and organised. Ability to work under pressure. Ability to meet deadlines. Ability to lead & contribute to team work as necessary. Name ______________________________________________ Date ________________________ Signed ______________________________________
2

By combining load, extract_text, extract_layout and chunk, LangDB gives developers a practical toolkit for working with unstructured data. Whether you're dealing with disorganized text, intricate document layouts, or large volumes of documents, these functions help you turn raw content into structured data that is ready for further processing.
