Web scraping is the process of extracting data from websites and storing it for later use. In this tutorial, we will learn how to perform web scraping in Rust, a statically typed, multi-paradigm programming language that was designed to be safe, concurrent, and fast.
To perform web scraping in Rust, we will need a few tools:
- The reqwest library: provides a convenient, easy-to-use API for making HTTP requests and handling responses.
- The select.rs library: lets us extract data from HTML documents using predicates on element names, ids, classes, and attributes, much like CSS selectors.
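To get a feel for the select.rs API before we wire it up to real HTTP requests, here is a minimal, self-contained sketch; the HTML fragment is made up purely for illustration:

use select::document::Document;
use select::predicate::Name;

fn main() {
    // A tiny hard-coded HTML fragment, used only to demonstrate the predicate API.
    let html = r#"<ul><li><a href="/posts/1">First post</a></li><li><a href="/posts/2">Second post</a></li></ul>"#;
    let document = Document::from(html);

    // Name("a") matches every <a> element in the parsed document.
    for link in document.find(Name("a")) {
        println!("{} -> {}", link.text(), link.attr("href").unwrap_or(""));
    }
}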
First, let's create a new Rust project and add the reqwest and select.rs libraries as dependencies in our Cargo.toml file. Because we will use reqwest's blocking client, we also enable its blocking feature:
[dependencies]
reqwest = { version = "0.10.4", features = ["blocking"] }
select = "0.4"
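If you have not created the project yet, it can be generated with Cargo first; webscraper is the crate name that appears in the cargo run output later in this post:

$ cargo new webscraper
$ cd webscraper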
Next, let's create a new file src/main.rs and add the following code:
use reqwest::blocking::Client;
use select::document::Document;
use select::predicate::{Attr, Name};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetch the page with a blocking HTTP GET request.
    let resp = Client::new()
        .get("https://www.rust-lang.org")
        .send()?;
    let body = resp.text()?;

    // Parse the response body into a queryable HTML document.
    let document = Document::from(body.as_str());

    // Find the element with id="blog-entries", then every <a> element inside it.
    for node in document.find(Attr("id", "blog-entries")) {
        for entry in node.find(Name("a")) {
            let title = entry.text();
            let url = entry.attr("href").unwrap();
            println!("{} ({})", title, url);
        }
    }

    Ok(())
}
In this code, we use the reqwest library to make an HTTP GET request to the Rust website, and then use the select.rs library to extract data from the HTML response. The Attr("id", "blog-entries") predicate finds the element whose id attribute is "blog-entries", and the Name("a") predicate then matches every a element within it. For each a element, we print its text (the title of the blog post) and its href attribute (the URL of the blog post).
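select.rs also ships combinator predicates, so the nested loops above can be flattened if you prefer. The following sketch assumes the same page and element id as before, uses the Descendant predicate to match the a elements directly, and skips anchors without an href instead of calling unwrap():

use reqwest::blocking::Client;
use select::document::Document;
use select::predicate::{Attr, Descendant, Name};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = Client::new()
        .get("https://www.rust-lang.org")
        .send()?
        .text()?;
    let document = Document::from(body.as_str());

    // Descendant(A, B) matches nodes satisfying B that have an ancestor satisfying A.
    for entry in document.find(Descendant(Attr("id", "blog-entries"), Name("a"))) {
        // Skip anchors without an href rather than panicking.
        if let Some(url) = entry.attr("href") {
            println!("{} ({})", entry.text(), url);
        }
    }

    Ok(())
}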
Now, let's run our web scraping program:
$ cargo run
Compiling webscraper v0.1.0 (/home/user/webscraper)
Finished dev [unoptimized + debuginfo] target(s) in 1.17s
Running `target/debug/webscraper`
Introducing the Rust 1.52 release channel (https://blog.rust-lang.org/2022/03/03/Rust-1.52.html)
How does the Rust release process work? (https://blog.rust-lang.org/inside-rust/inside-rust-february-2022.html#how-does-the-rust-release-process-work)
…
As you can see, our web scraping program has successfully extracted the title and URL of each blog post from the Rust website.
In conclusion, web scraping in Rust is relatively simple and straightforward using the reqwest and select.rs libraries.