
Theo Vasilis for Apify

Posted on • Originally published at blog.apify.com

What is data ingestion for large language models?

Data ingestion for LLMs is super easy, said no one ever! The fact is, it's a complex process that involves collecting, preprocessing, and preparing data. Find out how to gather and process the data for your own large language models.

Robots reading books - Data ingestion for LLMs

Image produced by DeepAI

In the blink of an eye, LLMs went from being something only AI geeks knew and cared about to something everyone is trying to cash in on. Influential people famous for anything but AI are offering their take on it in interviews, and YouTubers who had never touched the subject are suddenly giving us their two cents, whether we wanted their opinion or not.

You get the idea. Everyone is going nuts for large language models because of ChatGPT and all the spin-offs and sequels that will inevitably ensue. But how do you go about getting the data needed to train one of your very own robot overlords?

Machine learning, smart apps, and real-time analytics all begin with data, and tons of it. And we're not only talking about structured data, such as databases, but also unstructured data (videos, images, text messages, and whatnot). Getting the data from your data source to your data storage for processing, preparation, and training is a vital step known as data ingestion.

➡️ Designed for generative AI and LLMs, Website Content Crawler can help you feed, fine-tune, or train your large language models, or provide context for ChatGPT prompts. In return, the model will answer questions based on your or your customers' websites and content.

What is data ingestion?

Data ingestion is basically the process of collecting, processing, and preparing data for analysis or machine learning. In the context of large language models, data ingestion involves collecting vast quantities of text data (web scraping), preprocessing it (cleaning, normalization, tokenization), and preparing it for training the LLM (feature engineering). If those terms raise more questions for you than answers, don't panic: all is explained below.

➡️ Related: Applications of ChatGPT and other large language models in web scraping

How does data ingestion work?

Data ingestion is a complex process that involves multiple layers and processes, but for the sake of time and clarity, I'll break it down into four layers (three of which I briefly mentioned earlier): collection, preprocessing, feature engineering, and storage.

Data collection

The first layer involves collecting data from various sources such as the web, social media, or text documents. The data collected needs to be relevant to the task the LLM is being trained for. For example, if the LLM is being trained to perform sentiment analysis, the data collected should include a large number of reviews, comments, and social media posts. So, the first step to data ingestion for LLMs is to define the data requirements. What types of data are needed to train the model?

Once you've figured that out, you need to start gathering the data. The most common and popular form of web data collection is web scraping, which is an automated method of extracting data from websites. Two ways to do this are by building a scraper with a web scraping library (try Crawlee) or by using a ready-made scraping tool. Two types of such tools are universal scrapers designed for web data extraction from any site, and site-specific scrapers, for example, a Google Maps scraper or a Twitter scraper.
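To make that a bit more concrete, here's a bare-bones scraping sketch in Python. It uses the requests and BeautifulSoup libraries rather than Crawlee or a ready-made scraper, and the URL is just a placeholder; a real crawler would also handle link discovery, retries, proxies, and robots.txt.

```python
# A minimal, illustrative scraper using requests + BeautifulSoup.
# This only shows the basic idea of pulling visible text from a page
# you intend to feed into an LLM pipeline.
import requests
from bs4 import BeautifulSoup


def scrape_page_text(url: str) -> str:
    """Download a page and return its visible text content."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Drop script/style tags so only human-readable text remains
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    return soup.get_text(separator=" ", strip=True)


if __name__ == "__main__":
    # Hypothetical documentation page you want in your training corpus
    text = scrape_page_text("https://example.com/docs/getting-started")
    print(text[:500])
```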

Fast, reliable data for your AI and machine learning · Apify

Get the data to train ChatGPT API and Large Language Models, fast.


Ingest entire websites automatically. Gather your customers' documentation, knowledge bases, help centers, forums, blog posts, and other sources of information to train or prompt your LLMs. Integrate Apify into your product and let your customers upload their content in minutes.


Preprocessing

Once the data has been collected, it needs to be preprocessed before it can be used to train your T-1000 LLM. Preprocessing involves several steps, including data cleaning, normalization, and tokenization.

  • Data cleaning

Data cleaning involves identifying and correcting or removing inaccurate, incomplete, or irrelevant data. If you want to ensure data quality and consistency, you've got to do some data chores. This typically involves things like removing duplicates, fixing missing or incorrect values, and removing outliers.
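Here's what those chores can look like in practice with pandas. The column names, values, and outlier threshold are all made up for illustration:

```python
# A small data-cleaning sketch with pandas.
import pandas as pd

df = pd.DataFrame({
    "text": ["Great product!", "Great product!", None,
             "Terrible. Broke in a day.", "ok"],
    "word_count": [2, 2, 0, 5, 10_000],  # 10_000 is an obvious outlier
})

df = df.drop_duplicates(subset="text")       # remove duplicate records
df = df.dropna(subset=["text"])              # drop rows with missing text
df = df[df["word_count"].between(1, 1_000)]  # remove implausible outliers

print(df)
```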

  • Normalization

Normalization means transforming data into a standard format that allows for easy comparison and analysis. This step is particularly important when dealing with text data, as it helps to reduce the dimensionality of the data and makes it easier to compare and analyze. Typical examples include converting all text to lowercase, removing punctuation, and removing stop words.
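A tiny normalization sketch in plain Python. The stop-word list is a deliberately short placeholder; in practice you'd pull a fuller one from a library like NLTK or spaCy:

```python
# Lowercase the text, strip punctuation, and drop stop words.
import string

STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to"}


def normalize(text: str) -> str:
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)


print(normalize("The quick brown fox, and the lazy dog!"))
# -> "quick brown fox lazy dog"
```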

  • Tokenization

Tokenization involves breaking down the text into individual words or phrases, which will be used to create the vocabulary for the language model. This is especially important in natural language processing (NLP) because it allows for the analysis of individual words or phrases within the text. Tokenization can be done at the word, character, or subword level.
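For example, here are the three levels applied to the same phrase. The subword split below is hand-made for illustration; real LLM pipelines use trained subword tokenizers such as BPE or SentencePiece:

```python
# Word-level, character-level, and (naive) subword tokenization
# of the same text.
sentence = "unbelievable results"

word_tokens = sentence.split()                 # ['unbelievable', 'results']
char_tokens = list(sentence.replace(" ", ""))  # ['u', 'n', 'b', ...]
subword_tokens = ["un", "believ", "able", "result", "s"]  # hand-made example

print(word_tokens, char_tokens[:5], subword_tokens, sep="\n")
```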

➡️ Related: Building functional AI models for web scraping

Feature engineering

Feature engineering involves creating features from preprocessed data. Features are numerical representations of the text that the LLM can understand.

There are several feature engineering techniques that can be used, such as word embeddings, which represent the text as dense vectors of real numbers to capture the meaning of the words. Word embeddings are produced by techniques that use neural networks, such as Word2Vec.
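Here's a toy example of training word embeddings with Word2Vec using the gensim library on a tiny, pre-tokenized corpus. Real embeddings need far more text; the corpus and parameters below are only there to show the general shape of the step:

```python
# Train a tiny Word2Vec model on a toy, pre-tokenized corpus.
from gensim.models import Word2Vec

corpus = [
    ["data", "ingestion", "feeds", "large", "language", "models"],
    ["web", "scraping", "collects", "text", "data"],
    ["clean", "normalize", "tokenize", "the", "text"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=3,
                 min_count=1, epochs=20)

vector = model.wv["data"]   # dense vector for the word "data"
print(vector.shape)         # (50,)
print(model.wv.most_similar("data", topn=2))
```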

We could divide this feature engineering stage into three steps:

  • Split
    First, you need to divide the data into training, validation, and testing sets. Use the training set to teach the LLM and the validation and testing sets to evaluate the machine's performance (there's a short sketch of the split and encode steps after this list).

  • Augment

    Next, increase the size and diversity of the data by adding new examples, synthesizing new data, or transforming existing data.

  • Encode

    Finally, do the encoding by embedding data into tokens or vectors.
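Here's a rough sketch of the split and encode steps using scikit-learn. The texts are invented, the augmentation step is skipped, and CountVectorizer stands in for the subword tokenizer or embedding layer a real LLM pipeline would use:

```python
# Split a toy dataset into train/validation/test sets, then encode
# the text with a simple bag-of-words vectorizer.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

texts = ["great product", "terrible service", "works fine",
         "would not recommend", "love it", "broke quickly",
         "does the job", "very disappointed"]

# Roughly 60% train, 20% validation, 20% test
train, temp = train_test_split(texts, test_size=0.4, random_state=42)
val, test = train_test_split(temp, test_size=0.5, random_state=42)

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train)  # fit only on training data
X_val = vectorizer.transform(val)
X_test = vectorizer.transform(test)

print(len(train), len(val), len(test))
```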

➡️ Related: What is a vector database?

Storage

Once the data has been preprocessed and features have been created, it needs to be stored in a format that can be easily accessed by the language model during training. The data can be stored in a database or file system, and the format may be structured or unstructured.
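One common, simple choice is JSON Lines: one preprocessed example per line, which streams easily into most training frameworks. The file name and record layout below are just illustrative:

```python
# Store preprocessed, tokenized examples as JSON Lines (one record
# per line), then read them back as a training loop would.
import json

examples = [
    {"tokens": ["quick", "brown", "fox"], "label": "positive"},
    {"tokens": ["lazy", "dog"], "label": "negative"},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

with open("train.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record["tokens"])
```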

What is LangChain? How it works and how to get started

Find out how LangChain overcomes the limits of ChatGPT


That's it!... Not!

Even after your data is collected, preprocessed, engineered, and stored, you should continuously monitor the quality and relevance of the data and update it as needed to improve the performance of your large language model. Otherwise, your LLM may soon become as obsolete as the T-800.

Let's sum up

Assuming you have the stomach for a recap on data ingestion, let's conclude with a quick run-down of the process:

  1. Define the data requirements for training the LLM.

  2. Collect the data (scrape websites, databases, or public datasets).

  3. Organize the data (cleaning, preprocessing, normalization, tokenization).

  4. Split, augment, and encode your data (feature engineering).

  5. Save and store for easy access by the LLM during training.

  6. Monitor to ensure data quality and relevance.

Now you have some basic idea of what training your very own LLM involves, but I can't be held responsible for what you choose to do with this information. If you lose control over your LLM once AI reaches the technological singularity, that's on you!

How I use GPT Scraper to let ChatGPT access the internet

Do you dream of letting ChatGPT roam the web?

