Web scraping with Rust

#scraping #rust #tutorial #programming

Web scraping is the process of extracting data from websites and storing it for later use. In this tutorial, we will learn how to perform web scraping in Rust, a statically typed, multi-paradigm programming language that was designed to be safe, concurrent, and fast.

To perform web scraping in Rust, we will need a few tools:

The reqwest library: This library provides a convenient API for making HTTP requests and handling responses.
The select.rs library: This library extracts data from HTML documents using predicates (such as element name, class, or attribute) that play the same role as CSS selectors.

First, let's create a new Rust project and add the reqwest and select.rs libraries as dependencies in our Cargo.toml file:

[dependencies]
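# The "blocking" feature enables reqwest's synchronous API (reqwest::blocking).
reqwest = { version = "0.10.4", features = ["blocking"] }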
select = "0.6"

Next, let's open src/main.rs (cargo new creates it for us) and replace its contents with the following code:

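use std::error::Error;

// The synchronous client requires reqwest's "blocking" feature in Cargo.toml.
use reqwest::blocking::Client;
use select::document::Document;
use select::predicate::{Attr, Name};

// Box<dyn Error> lets `?` propagate reqwest errors, which io::Result cannot hold.
fn main() -> Result<(), Box<dyn Error>> {
    // Fetch the page with a blocking GET request.
    let resp = Client::new()
        .get("https://www.rust-lang.org")
        .send()?;

    // Read the response body and parse it as HTML.
    let body = resp.text()?;
    let document = Document::from(body.as_str());

    // Find the element with id="blog-entries", then every <a> inside it.
    for node in document.find(Attr("id", "blog-entries")) {
        for entry in node.find(Name("a")) {
            let title = entry.text();
            // Skip links without an href instead of panicking on unwrap().
            if let Some(url) = entry.attr("href") {
                println!("{} ({})", title, url);
            }
        }
    }

    Ok(())
}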

In this code, we use the reqwest library to make an HTTP GET request to the Rust website, and then use the select.rs library to extract data from the HTML response. The predicate Attr("id", "blog-entries") finds the element whose id attribute is "blog-entries", and Name("a") finds every a element inside it. For each a element, we print its text (the title of the blog post) and its href attribute (the URL of the blog post).
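The href values are copied from the page as written, so some of them may be relative (for example /learn) rather than full URLs. Here is a minimal sketch of turning them into absolute URLs using only the standard library. The helper name absolutize and its simplified rules are assumptions made for illustration; they are not part of reqwest or select.rs, and a dedicated URL parsing crate covers the many cases this ignores.

```rust
/// Hypothetical helper: resolve `href` against the site `origin`
/// (scheme + host, no trailing slash), e.g. "https://www.rust-lang.org".
/// Simplified on purpose: handles absolute URLs, protocol-relative URLs,
/// and root-relative paths; anything else is appended to the origin.
fn absolutize(origin: &str, href: &str) -> String {
    if href.starts_with("http://") || href.starts_with("https://") {
        // Already absolute: keep as-is.
        href.to_string()
    } else if let Some(rest) = href.strip_prefix("//") {
        // Protocol-relative: reuse the origin's scheme.
        let scheme = origin.split("://").next().unwrap_or("https");
        format!("{}://{}", scheme, rest)
    } else if href.starts_with('/') {
        // Root-relative path.
        format!("{}{}", origin, href)
    } else {
        // Bare relative path: treated as relative to the site root (assumption).
        format!("{}/{}", origin, href)
    }
}

fn main() {
    let origin = "https://www.rust-lang.org";
    let hrefs = ["/learn", "https://blog.rust-lang.org/", "//example.com/x", "tools"];
    for href in hrefs.iter() {
        println!("{} -> {}", href, absolutize(origin, href));
    }
}
```

In the scraper, the call would wrap the value returned by entry.attr("href") before printing it.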

Now, let's run our web scraping program:

$ cargo run
   Compiling webscraper v0.1.0 (/home/user/webscraper)
    Finished dev [unoptimized + debuginfo] target(s) in 1.17s
     Running `target/debug/webscraper`
Introducing the Rust 1.52 release channel (https://blog.rust-lang.org/2022/03/03/Rust-1.52.html)
How does the Rust release process work? (https://blog.rust-lang.org/inside-rust/inside-rust-february-2022.html#how-does-the-rust-release-process-work)
…

As you can see, our web scraping program has successfully extracted the title and URL of each blog post from the Rust website.
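The program only prints what it finds. Since scraping usually means keeping the data for later use, here is a minimal sketch of writing the title and URL pairs to a tab-separated file with the standard library. The helper name write_entries, the file name entries.tsv, and the sample entry are assumptions for illustration; in the scraper, the pairs would come from the loop over a elements.

```rust
use std::fs::File;
use std::io::{self, Write};

/// Hypothetical helper: write one "title<TAB>url" line per entry.
/// Tabs and line breaks inside a title are replaced with spaces so that
/// each entry stays on a single line.
fn write_entries<W: Write>(out: &mut W, entries: &[(String, String)]) -> io::Result<()> {
    for (title, url) in entries {
        let clean: String = title
            .chars()
            .map(|c| if c == '\t' || c == '\n' || c == '\r' { ' ' } else { c })
            .collect();
        writeln!(out, "{}\t{}", clean.trim(), url)?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Placeholder data standing in for the pairs the scraper collects.
    let entries = vec![(
        "Example post".to_string(),
        "https://example.com/post".to_string(),
    )];

    // Assumed output path; any `Write` implementor works here.
    let mut file = File::create("entries.tsv")?;
    write_entries(&mut file, &entries)
}
```

Because write_entries accepts any Write implementor, the same function can target a file, standard output, or an in-memory buffer.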

In conclusion, web scraping in Rust is straightforward with the reqwest and select.rs libraries: reqwest fetches the page, and select.rs picks out the elements we need.
