DEV Community

AIRabbit
AIRabbit

Posted on • Edited on

Chat with any website (not just a single page). Complete Tutorial

Chat with any full website (not just a single page). Complete Tutorial

The web is full of information, and sometimes it's hard to find the right information for your specific needs. Even more so if the website provider has no or weak search capabilities, such as keyword only search.

What is "Chat With"?

With the "chat with website" approach, we don't just ask ChatGPT about a topic. We first feed GPT the information from the website and then start asking questions about that website. Unlike simply uploading a single file, this method allows you to ask questions about the whole website, not just a single page.

Why not just use the site search directly?.

The difference with site search is that we can actually chat and interact with the documentation in natural language, not just keyword based. You can even ask a question and get an answer to that specific query.

Why not use ChatGPT directly, i.e. without crawling?.

Unlike just asking an LLM without embedding, crawling and feeding it the documentation, there is a risk of hallucination - where the LLM makes up things that don't exist or are factually wrong, misleading the user. This can leave you lost and unable to follow the steps you are supposed to take to complete the task.

The solution: Crawl + Embed + Chat.

Luckily, there's a solution for querying this kind of unstructured but very valuable information on the web: using an LLM to interpret this information and answer our questions. There's plenty of information on the web about how to do this with a file or folder, but for the complex structure of the web - including bot protection and link following - it can be a not-so-simple undertaking.

Here are some of the things you typically need to consider if you want to go beyond analysing or crawling the text of a single web page:

  • Imitating JavaScript

  • Getting around bot protection using proxies

  • Following links to a certain depth

  • Automatic scrolling

  • Extracting the text and skipping all the unimportant stuff like navigation, etc.

  • And many more

But let's take a step back. What do we usually need to implement a complete solution? After crawling the entire website, we need to talk to it.

In a perfect world, you'd have a workflow like this:

Crawl > Extract/Cleanup > Embed for vector search > Do QA on this data with LLM.

Some commercial off-the-shelf solutions offer AI search features to crawl an entire site, which is great. But there's a problem: you have no control over that crawler.

And much more: What happens after the data is crawled, i.e. how is it Chunking and embedded for the Retrieval etc.? This part is usually limited as well.

There are dozens of out-of-the-box solutions, but most or all of them lack this kind of control. I admit that sometimes for a small site with simple links to follow, using off-the-shelf solutions may be sufficient (if it's free or the price is affordable). But in many cases you simply need more control over the crawler and the extraction process if you want to get reliable results.

If you want to build a reliable solution, I recommend breaking it down into pieces and using tools that specialise in each part, rather than relying on one vendor to do it all, but highly opinionated, with limited control and, on top of that, full vendor lock. This is my personal opinion, especially for the application that really matters to you.

OK, so let's go back to our example: Creating a RAG for chatting with a website (here: obisidan documentation)

Our pipeline, I see two main parts:

Part I: Gathering the data:

Crawl > Extract/Cleanup.

Part 2: analysing and actually using that data.

Embedding > Retrieval / QA with LLM

In this tutorial we'll use Apify for the first part of the workflow (data extraction) and Open WebUI for the second part (embedding and using the data for QA).

Read more in my Blog Post


The rest of the article continues with detailed explanations and walkthroughs.

Top comments (0)