In this blog, we will explore building a web scraper with HTML parsing in TypeScript. This is the first part of a two-part series: here, we will use the HTML scraper to extract relevant information from a website and generate text files. In the second part of the series, we'll feed these files to privateGPT, a locally running GPT-style model, and analyze whether it can produce coherent code snippets based on the provided documentation.
Here's a link to the original post.
For those eager to get their hands dirty, here's a link to the repo.
Before diving into the code and its explanation, let's have a brief overview of our project. We aim to create a web scraper that can:
- Download and parse HTML from a given target URL.
- Extract all links from the web page.
- Fetch content for each link and save it as a text file.
- Calculate a similarity index between each text file and the overall content, skipping pages whose index already exists or whose text is at least 90% similar to content already saved.
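At a high level, the steps above form a small pipeline. Here is a sketch of that flow; the function names are placeholders for the real modules discussed below, passed in as parameters purely for illustration:

```typescript
// Hypothetical sketch of the overall scraping flow. The fetch/extract/save
// functions stand in for the real modules described in the rest of the post.
type Fetch = (url: string) => Promise<string>;

async function scrape(
  targetUrl: string,
  fetchHtml: Fetch,
  extractLinks: (html: string, base: string) => string[],
  saveContent: (url: string, html: string) => Promise<void>
): Promise<void> {
  // 1. Download the HTML from the target URL.
  const html = await fetchHtml(targetUrl);
  // 2. Extract all links from the page.
  const links = extractLinks(html, targetUrl);
  // 3-4. Fetch each link's content and hand it off for similarity-checked saving.
  for (const link of links) {
    await saveContent(link, await fetchHtml(link));
  }
}
```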
The project is built in TypeScript and consists of several modules, each handling a different responsibility. Let's analyze each module in the order it comes into play, starting with the main entry point of our application –
Once URLs are collected, the scraper fetches the content of each link and saves it as a text file. For improved performance, we run this process using multiple worker threads (since our
getSimilarityIndex function takes a while to complete, depending on the length of text), implemented in Node.js as
After saving content, the similarity index of each text file against the overall content is calculated and stored; we then use it to decide whether creating a new text file for the HTML content is necessary.
This is a simple module defining constants like the default output folder paths for our web scraper.
```typescript
export const DEFAULT_OUTPUT_FOLDER = "/tmp/html-web-parser/";
export const DEFAULT_OUTPUT_FOLDER_HTML = DEFAULT_OUTPUT_FOLDER + "html-output";
export const DEFAULT_OUTPUT_FOLDER_STORAGE = DEFAULT_OUTPUT_FOLDER + "storage";
export const DEFAULT_OUTPUT_FOLDER_LOCK = DEFAULT_OUTPUT_FOLDER + "lock";
```
The lock.ts module provides a
Lock class that handles acquiring and releasing filesystem locks during concurrent read and write operations. We use this class to synchronize access to the file in which similarity index values are stored. The module also provides a
sleep helper used to wait and retry until a lock is released (i.e., the lock file is deleted).
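One common way to implement such a lock is an exclusive lock file plus polling. A minimal sketch under that assumption (the real lock.ts may differ in its details):

```typescript
import { promises as fs } from "fs";

// Hypothetical sketch of a filesystem lock: a lock file created exclusively,
// with a sleep-based retry loop. Retry interval and API are assumptions.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

class Lock {
  constructor(private lockFilePath: string) {}

  // Acquire by creating the lock file exclusively; retry while it exists.
  async acquire(retryMs = 50): Promise<void> {
    for (;;) {
      try {
        // The "wx" flag fails with EEXIST if another process holds the lock.
        await fs.writeFile(this.lockFilePath, String(process.pid), { flag: "wx" });
        return;
      } catch (err: any) {
        if (err.code !== "EEXIST") throw err;
        await sleep(retryMs);
      }
    }
  }

  // Release by deleting the lock file so waiting processes can proceed.
  async release(): Promise<void> {
    await fs.unlink(this.lockFilePath);
  }
}
```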
This module contains a function
fetchAndSaveContent, responsible for fetching HTML content from the provided URL, extracting its text content (textContent), and preparing it for further processing. It filters out irrelevant links using the
IGNORED_LINKS array and keeps only links that belong to the same domain as the target URL. The function returns a tuple: the first element is a unique identifier (filename) and the second is the extracted text content.
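The same-domain filtering could look roughly like this. The IGNORED_LINKS entries and the regex-based extraction are assumptions for the sketch; the real module may use a proper HTML parser:

```typescript
// Hypothetical sketch of same-domain link filtering. The IGNORED_LINKS
// contents are assumed; the naive href regex stands in for real HTML parsing.
const IGNORED_LINKS = ["#", "mailto:", "javascript:"];

function extractSameDomainLinks(html: string, targetUrl: string): string[] {
  const base = new URL(targetUrl);
  const links = new Set<string>();
  const hrefPattern = /href="([^"]+)"/g;
  let match: RegExpExecArray | null;
  while ((match = hrefPattern.exec(html)) !== null) {
    const href = match[1];
    // Skip anchors, mailto links, and other ignored prefixes.
    if (IGNORED_LINKS.some((ignored) => href.startsWith(ignored))) continue;
    const resolved = new URL(href, base);
    // Keep only links on the same domain as the target URL.
    if (resolved.hostname === base.hostname) links.add(resolved.href);
  }
  return Array.from(links);
}
```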
Using worker threads, the
save-html-content-to-file.js module accepts workerData containing the URL, output folder, text content, and overall content. Inside the worker, the similarity index is calculated using the
getSimilarityIndex() function, and the text content is saved to a file named by the identifier provided by the
getSimilarityIndex takes a long time to run, so we use Node.js worker threads to improve execution speed.
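A minimal sketch of how the main thread could dispatch one save job to such a worker. The workerData field names are assumptions mirroring the description above, and the worker file path is passed in as a parameter for illustration:

```typescript
import { Worker } from "worker_threads";

// Hypothetical shape of one save job; the field names are assumptions.
interface SaveJob {
  url: string;
  outputFolder: string;
  textContent: string;
  overallContent: string;
}

// Spawn a worker for the given script, passing the job as workerData,
// and resolve once the worker exits cleanly.
function runSaveWorker(workerFile: string, job: SaveJob): Promise<void> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerFile, { workerData: job });
    worker.on("error", reject);
    worker.on("exit", (code) =>
      code === 0 ? resolve() : reject(new Error(`worker stopped with exit code ${code}`))
    );
  });
}
```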
Storage class of
lib/storage.ts manages storage operations for similarity index values. It keeps an in-memory set of similarity indexes and provides methods to lock and unlock file access, so that reads and writes to the persistent storage are synchronized and writes from every running worker are committed.
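A sketch of what such a storage class could look like, assuming a simple one-value-per-line file format; the real lib/storage.ts wraps these reads and writes in the Lock described above, which is omitted here for brevity:

```typescript
import { promises as fs } from "fs";

// Hypothetical sketch of the Storage class: an in-memory set of similarity
// indexes backed by a file with one value per line. The file format and
// method names are assumptions.
class Storage {
  private indexes = new Set<number>();
  constructor(private filePath: string) {}

  // Load persisted index values into the in-memory set.
  async load(): Promise<void> {
    try {
      const raw = await fs.readFile(this.filePath, "utf8");
      for (const line of raw.split("\n")) {
        if (line.trim() !== "") this.indexes.add(Number(line));
      }
    } catch {
      // No file yet: start with an empty set.
    }
  }

  has(index: number): boolean {
    return this.indexes.has(index);
  }

  // Add a value and persist the whole set; concurrent workers' writes are
  // merged when each worker reloads under the lock.
  async add(index: number): Promise<void> {
    this.indexes.add(index);
    await fs.writeFile(this.filePath, Array.from(this.indexes).join("\n"));
  }
}
```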
This module provides a single function,
getSimilarityIndex(), which calculates text similarity based on the edit distance (also known as Levenshtein distance). It measures the minimum number of edits (insertions, deletions, or substitutions) required to transform one string into another.
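A minimal sketch of an edit-distance-based similarity index; the actual implementation in the repo may differ in its normalization details:

```typescript
// Classic single-row Levenshtein distance: the minimum number of insertions,
// deletions, or substitutions needed to turn one string into the other.
function levenshtein(a: string, b: string): number {
  const m = a.length, n = b.length;
  // dp[j] holds the distance between a.slice(0, i) and b.slice(0, j).
  const dp: number[] = Array.from({ length: n + 1 }, (_, j) => j);
  for (let i = 1; i <= m; i++) {
    let prev = dp[0]; // corresponds to dp[i-1][j-1]
    dp[0] = i;
    for (let j = 1; j <= n; j++) {
      const tmp = dp[j];
      dp[j] = a[i - 1] === b[j - 1]
        ? prev
        : 1 + Math.min(prev, dp[j], dp[j - 1]);
      prev = tmp;
    }
  }
  return dp[n];
}

// Hypothetical normalization to [0, 1]: 1 means identical strings.
function getSimilarityIndex(a: string, b: string): number {
  if (a.length === 0 && b.length === 0) return 1;
  return 1 - levenshtein(a, b) / Math.max(a.length, b.length);
}
```

With this normalization, two pages would be considered near-duplicates when the index is at least 0.9, matching the 90% threshold mentioned earlier.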
After setting up a TypeScript configuration file,
tsconfig.json, we compile the project with tsc and then start the scraper by running the compiled output:
node output/html-web-parser.js https://riverpod.dev/docs/getting_started. This command starts the web scraper, processes the given URL, and generates the text files.
Stay tuned for the second part of this series, where we will feed the generated files to privateGPT and observe whether it can produce coherent code based on the documentation.
If you have any thoughts, would like to improve different sections in this blog, or would just like to chat, feel free to reach out to me using any links on the contact page.