moses omondi

Data factory for LLM video models

Hadithi is an open-source, Bash-based command-line tool that lets AI and ML developers convert YouTube, torrent, and enterprise videos into high-quality datasets for fine-tuning large language models (LLMs).
access source code

Top comments (1)

moses omondi

Hadithi automates video processing: it organizes and renames videos with timestamps, segments them into clips, detects scenes, removes audio if needed, filters out short videos, rescales and extracts frames, batches videos, validates image counts in folders, and creates videos from images at the correct frame rate.
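
To make a few of these steps concrete, here is a minimal sketch of how segmenting, audio removal, frame extraction, and rebuilding a video from frames can be expressed with FFmpeg from Bash. The function names, the `DRY_RUN` switch, and the file naming patterns are assumptions made for illustration; they are not taken from Hadithi's source code.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: names and defaults are illustrative, not Hadithi's own.

# Run a command, or only print it when DRY_RUN=1 (useful for checking flags).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Split a video into fixed-length clips without re-encoding.
# Cuts land on keyframes, so clip lengths are approximate.
segment_clips() {  # usage: segment_clips input.mp4 seconds out_dir
  run ffmpeg -nostdin -i "$1" -map 0 -c copy -f segment \
    -segment_time "$2" -reset_timestamps 1 "$3/clip_%03d.mp4"
}

# Drop the audio track and copy the video stream untouched.
strip_audio() {  # usage: strip_audio input.mp4 output.mp4
  run ffmpeg -nostdin -i "$1" -an -c:v copy "$2"
}

# Sample frames at a fixed rate and rescale to a target width
# (-2 keeps the aspect ratio and forces an even height).
extract_frames() {  # usage: extract_frames input.mp4 fps width out_dir
  run ffmpeg -nostdin -i "$1" -vf "fps=$2,scale=$3:-2" "$4/frame_%05d.png"
}

# Rebuild a video from numbered frames at a chosen frame rate.
frames_to_video() {  # usage: frames_to_video frames_dir fps output.mp4
  run ffmpeg -nostdin -framerate "$2" -i "$1/frame_%05d.png" \
    -c:v libx264 -pix_fmt yuv420p "$3"
}
```

Setting `DRY_RUN=1` before calling, say, `strip_audio in.mp4 out.mp4` prints the FFmpeg command instead of running it. Passing the same frame rate to `extract_frames` and `frames_to_video` keeps the rebuilt video at its original speed.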

It is easy to use, open-source, and runs entirely on a CPU with minimal setup:

Developers point the tool at their dataset folder and start extracting structured datasets in a single step, a task that is usually time-consuming, expensive, and dependent on expert skill.

The source code is written in Bash, which is lightweight and easy to understand. Developers can modify the source code to suit their needs. They can even use it to set up their own data foundry!

Unlike most video processing tools, it doesn't require a GPU. Anyone with a moderate CPU and sufficient storage can create thousands of videos.
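
As a rough illustration of the folder-driven, CPU-only workflow, the sketch below walks a dataset folder, flags clips that are too short, and counts extracted frames so a folder can be validated. The helper names, the `.mp4` and `.png` extensions, and the thresholds are assumptions chosen for the example, not Hadithi's actual interface.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: names, extensions, and thresholds are illustrative.

# Print the duration of a video in seconds (ffprobe ships with FFmpeg).
video_duration() {  # usage: video_duration file.mp4
  ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Succeed when a (possibly fractional) duration is below a minimum.
is_shorter_than() {  # usage: is_shorter_than duration min_seconds
  awk -v d="$1" -v m="$2" 'BEGIN { exit !(d + 0 < m + 0) }'
}

# Count extracted PNG frames directly inside a folder.
count_frames() {  # usage: count_frames dir
  find "$1" -maxdepth 1 -type f -name '*.png' | wc -l | tr -d ' '
}

# List the clips in a dataset folder that are too short to keep.
list_short_videos() {  # usage: list_short_videos dataset_dir min_seconds
  local f d
  for f in "$1"/*.mp4; do
    [ -e "$f" ] || continue              # folder had no .mp4 files
    d=$(video_duration "$f") || continue # skip files ffprobe cannot read
    if is_shorter_than "$d" "$2"; then
      printf '%s\n' "$f"
    fi
  done
}
```

For example, `list_short_videos ./dataset 2` would print every clip under two seconds, and comparing `count_frames ./frames/clip_001` against an expected number is one simple way to validate a frame folder.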

Only Bash, FFmpeg, and ExifTool are required to set up the system. Sorry, Windows and macOS users: I developed the system on Ubuntu 18.04, but you can test it on your own operating system.
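
A quick way to confirm these requirements before a long run is to check that each tool is on the `PATH`. This is a minimal sketch; `missing_deps` and `check_setup` are hypothetical names, not part of Hadithi.

```shell
#!/usr/bin/env bash
# Hypothetical setup check: function names are illustrative.

# Print the name of each required command that is not on PATH.
missing_deps() {  # usage: missing_deps cmd [cmd...]
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Report anything missing and return non-zero so callers can stop early.
check_setup() {
  local missing
  missing=$(missing_deps bash ffmpeg ffprobe exiftool)
  if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing" >&2
    return 1
  fi
}
```

On Ubuntu, the missing pieces can usually be installed with `sudo apt install ffmpeg libimage-exiftool-perl`; `ffprobe` comes with the `ffmpeg` package.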