jsoup: A Powerful Java Library for Working With HTML and XML Documents

#java #html #xml #parsing

jsoup is a popular open-source Java library that enables developers to parse, manipulate, and extract data from HTML and XML documents. In this article, we will explore the basics of using jsoup, including parsing HTML documents, selecting and manipulating elements, and updating content in HTML. We'll provide code snippets along the way to help illustrate its capabilities.

jsoup simplifies working with real-world HTML and XML. It offers an easy-to-use API for fetching URLs and for parsing, extracting, and manipulating data using DOM API methods, CSS selectors, and XPath selectors.

The jsoup website states that the library implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do.

With jsoup, you can:

scrape and parse HTML from a URL, file, or string
find and extract data, using DOM traversal or CSS selectors
manipulate the HTML elements, attributes, and text
clean user-submitted content against a safelist, to prevent XSS attacks
output tidy HTML

jsoup is designed to deal with all varieties of HTML found in the wild: from pristine and validating to invalid tag soup, jsoup will create a sensible parse tree.
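
To make the tag-soup handling and safelist cleaning described above more concrete, here is a minimal sketch. The class name, method names, and sample markup are made up for illustration. It assumes jsoup 1.14.2 or later, where the class is named Safelist (earlier releases call it Whitelist).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

class TagSoupExample {
  // Parse markup with unclosed tags and count the paragraphs jsoup recovers
  static int countParagraphs(String html) {
    Document document = Jsoup.parse(html);
    return document.select("p").size();
  }

  // Keep only the tags and attributes that Safelist.basic() allows
  static String sanitize(String untrustedHtml) {
    return Jsoup.clean(untrustedHtml, Safelist.basic());
  }

  public static void main(String[] args) {
    // Neither <p> is closed, yet the parser still builds two sibling paragraphs
    System.out.println(countParagraphs("<p>One<p>Two"));

    // The onclick handler and the <script> element are not on the basic safelist
    String untrusted = "<p><a href='http://example.com/' onclick='steal()'>Link</a></p>"
        + "<script>alert(1)</script>";
    System.out.println(sanitize(untrusted));
  }
}
```

The first call prints 2, because the HTML5 parsing rules close an open paragraph when the next one starts. The second prints the link without its onclick attribute and without the script. Which safelist is appropriate (none(), basic(), relaxed(), or a custom one) depends on how much markup you want to allow from users.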

Getting Started with jsoup

To begin using jsoup, you first need to add the library as a dependency in your project. If you are using Maven, include the following in your pom.xml file:

<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>x.xx.x</version>
</dependency>

Replace x.xx.x with the jsoup version you want to use; as of this writing, the current version is 1.15.3.

If you are using Gradle, add the following to the dependencies block of your build.gradle file instead:

implementation 'org.jsoup:jsoup:x.xx.x'

Parsing an HTML Document

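To fetch and parse an HTML document with jsoup, you can call Jsoup.connect() with the URL of the web page you want to work with, then call get() to download the page and parse it into a Document. Note that the class name is Jsoup, with a capital J. Here's a simple example: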

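import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupExample {
  public static void main(String[] args) throws Exception {
    // Fetch the page over HTTP and parse the response into a DOM tree
    Document document = Jsoup.connect("https://www.example.com").get();

    // Continue working with the parsed document, for example by reading its title
    System.out.println(document.title());
  }
}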
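
Fetching a URL is only one way to obtain a Document. As a hedged sketch, the example below parses the same markup from a string and from a file, and shows a fetch helper that sets a user agent and a timeout. The class name, helper names, sample markup, and user agent string are placeholders chosen for this illustration.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseSourcesExample {
  // Parse markup that is already in memory; no network access is needed
  static String titleFromString(String html) {
    return Jsoup.parse(html).title();
  }

  // Parse a file on disk, telling jsoup which charset the file uses
  static String titleFromFile(File file) throws IOException {
    return Jsoup.parse(file, "UTF-8").title();
  }

  // Fetch a page with an explicit user agent and a 10-second timeout.
  // The user agent string is a placeholder, not a recommended value.
  static Document fetch(String url) throws IOException {
    return Jsoup.connect(url)
        .userAgent("Mozilla/5.0 (compatible; ExampleBot/1.0)")
        .timeout(10_000)
        .get();
  }

  public static void main(String[] args) throws IOException {
    String html = "<html><head><title>Hello jsoup</title></head><body></body></html>";
    System.out.println(titleFromString(html));

    // Write the same markup to a temporary file and parse it from there
    Path tempFile = Files.createTempFile("jsoup-example", ".html");
    Files.write(tempFile, html.getBytes(StandardCharsets.UTF_8));
    System.out.println(titleFromFile(tempFile.toFile()));
    Files.delete(tempFile);
  }
}
```

Both calls in main print "Hello jsoup". The fetch helper is not invoked here so that the example runs offline; Jsoup.connect() only supports http and https URLs, so local files go through Jsoup.parse() instead.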

Selecting and Manipulating Elements

jsoup provides several methods to select and manipulate elements in an HTML or XML document. For example, you can pass a CSS selector to the select() method to pick elements by tag, attribute, class, or ID, like this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupExample {
  public static void main(String[] args) throws Exception {
    Document document = Jsoup.connect("https://www.example.com").get();

    // Select all 'h1' tags in the document
    Elements h1Elements = document.select("h1");

    // Update the text of every selected 'h1' tag.
    // replace() treats "old" as a literal string, not a regular expression.
    for (Element h1 : h1Elements) {
      h1.text(h1.text().replace("old", "new"));
    }
  }
}
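
Selectors can combine tags, classes, and attributes. The following sketch runs a CSS query and an XPath query against a small in-memory document and resolves relative links to absolute URLs. The sample markup, base URI, class name, and method names are invented for this example, and selectXpath() assumes jsoup 1.14.3 or later.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class SelectorExample {
  static final String HTML =
      "<div class='nav'><a href='/about'>About</a>"
      + "<a href='https://other.example/x'>Other</a></div>"
      + "<p class='intro'>Hello</p><a name='top'>No link</a>";

  // Parse with a base URI so relative links can be resolved later
  static Document sample() {
    return Jsoup.parse(HTML, "https://www.example.com/");
  }

  // CSS: anchors inside div.nav that carry an href attribute
  static List<String> navLinks(Document document) {
    Elements links = document.select("div.nav a[href]");
    // The abs: prefix resolves each href against the document's base URI
    return links.eachAttr("abs:href");
  }

  // XPath: count every anchor in the document that has an href attribute
  static int countAnchorsWithHref(Document document) {
    return document.selectXpath("//a[@href]").size();
  }

  public static void main(String[] args) {
    Document document = sample();
    System.out.println(navLinks(document));
    System.out.println(countAnchorsWithHref(document));
  }
}
```

With this sample markup, navLinks returns https://www.example.com/about and https://other.example/x, and the XPath query matches two anchors, because the third anchor has only a name attribute.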

Updating Content in HTML

In addition to selecting elements, you can update the content of an individual element, or replace everything inside the document body, using methods such as text() and html(). For example:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExample {
  public static void main(String[] args) throws Exception {
    Document document = Jsoup.connect("https://www.example.com").get();

    // Update the content of a specific element.
    // selectFirst() returns null when nothing matches, hence the check.
    Element header = document.selectFirst("h1");
    if (header != null) {
      header.text("New Header");
    }

    // Replace everything inside <body> with new content
    String newContent = "This is the updated content.";
    document.body().html(newContent);
  }
}
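
The difference between text() and html(), and how to get the modified markup back out, may be easier to see on a small in-memory document. This is only a sketch: the class name, helper names, sample markup, id, and CSS class are all made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class UpdateExample {
  // Build a small document in memory so the example needs no network access
  static Document sample() {
    return Jsoup.parse("<h1>Old Header</h1><p>Body text</p>");
  }

  // Apply a few common updates and return the modified document
  static Document applyUpdates(Document document) {
    Element header = document.selectFirst("h1");
    if (header != null) {
      // html() parses its argument as markup, so this creates a <b> child element
      header.html("<b>Bold</b> header");
      // Attributes and CSS classes can be changed on the same element
      header.attr("id", "main-title").addClass("highlight");
    }
    // append() parses the markup and adds it after the existing children of <body>
    document.body().append("<p>Added paragraph</p>");
    return document;
  }

  public static void main(String[] args) {
    Document document = applyUpdates(sample());

    // text() with no arguments returns the combined text without tags
    System.out.println(document.selectFirst("h1").text());

    // outerHtml() serializes the whole modified tree back to a string
    System.out.println(document.outerHtml());

    // By contrast, text(String) stores markup-looking input as literal text
    Element paragraph = document.selectFirst("p");
    paragraph.text("<i>not italic</i>");
    System.out.println(paragraph.select("i").size());
  }
}
```

The header text prints as "Bold header", and the last line prints 0 because text(String) never creates child elements. That makes text(String) the safer choice when the new value comes from user input.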

jsoup is a powerful Java library for working with HTML and XML documents, enabling developers to parse documents, extract data, and manipulate elements efficiently. By using jsoup's simple yet effective APIs, you can save time and effort while producing cleaner, more maintainable code. It can be used effectively for content scraping (without violating any policies or legal requirements, of course) or for editing and manipulating documents in a document store or archive. Happy coding!