DEV Community

Mohsin Rashid

RAG Web Scraping

What I Built

I built a Retrieval-Augmented Generation (RAG) system that uses Ollama's nuextract model to scrape and extract specific content from HTML documents. The system extracts the text from the HTML, splits it into chunks, embeds each chunk, and stores the embeddings in a PostgreSQL database through PgVector. For a custom query, it retrieves the most relevant chunks by vector search and passes them to nuextract, which returns the requested content. Combining HTML scraping with vector search this way makes it possible to pull precise data out of complex web pages.
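To make the first two steps concrete, here is a minimal sketch of extracting text from HTML and splitting it into overlapping chunks. It uses only the Python standard library and is not the project's actual code. The names `TextExtractor`, `html_to_text`, and `split_into_chunks`, and the `chunk_size` and `overlap` defaults, are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def split_into_chunks(text, chunk_size=200, overlap=40):
    """Split text into character windows that overlap by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk. Each resulting chunk would then be embedded and written to the database.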

Demo

Link to GitHub

Ollama: I used Ollama's nuextract model to generate embeddings from the HTML content and to extract data in response to custom queries.
PgVector: This PostgreSQL extension stores and manages the embeddings. I used it for vector-based search and retrieval over the HTML data in the database.
PostgreSQL: The vectorized HTML content is stored in a PostgreSQL database, where it can be queried for relevant data.
Docker: I ran PgVector in a Docker container, which simplified setup.
LangChain: I used LangChain to build the retrieval chain that connects the stored embeddings to Ollama's nuextract model for query processing and data extraction.
Jupyter Notebook: The project runs in a Jupyter Notebook, which gives an interactive way to load, process, and query the data.
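The retrieval step that these tools perform together can be sketched in plain Python. This is an illustration under stated assumptions, not the project's LangChain code: the toy two-dimensional vectors stand in for real embeddings, `top_k` and `build_prompt` are hypothetical helpers, and the `chunks` table with `content` and `embedding` columns in the SQL string is a made-up schema (LangChain manages its own tables). The one library fact relied on is that pgvector's `<=>` operator orders rows by cosine distance.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity, the quantity pgvector's `<=>` operator computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query_vec, rows, k=2):
    """rows is a list of (chunk_text, embedding) pairs; return the k nearest texts."""
    ranked = sorted(rows, key=lambda row: cosine_distance(query_vec, row[1]))
    return [text for text, _ in ranked[:k]]


def build_prompt(question, context_chunks):
    """Assemble a retrieval-augmented prompt (hypothetical wording)."""
    context = "\n---\n".join(context_chunks)
    return (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Roughly the same ranking expressed as a pgvector query.
# Table and column names are hypothetical.
NEAREST_CHUNKS_SQL = """
SELECT content
FROM chunks
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""
```

In the database the ordering is done by PgVector rather than in Python; the in-memory version only shows what the ranking computes before the selected chunks are handed to the model.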

Sample HTML Content

[Image: sample HTML content, not reproduced here]

Code Demo

[Images: code demo screenshots, not reproduced here]

Final Thoughts

This project shows how an LLM combined with vector-based retrieval can scrape and extract meaningful information from HTML documents. Pairing PgVector with Ollama's nuextract supports a range of use cases, from automated data extraction to content aggregation.

Building this project was rewarding, especially exploring vector embeddings and retrieval-augmented generation on a real-world task like web scraping. The combination of PgVector, Ollama, LangChain, and the nuextract model is a toolset that can be extended to other AI applications that need to extract content from complex documents.

This submission is eligible for the following prize categories:

  1. Open-source Models from Ollama: This project uses Ollama's nuextract model to extract structured data from HTML content.
  2. Vectorizer Vibe: This project uses PgVector to store and retrieve document embeddings.
