How to Parse HTML With Java and Jsoup

With Jsoup, I can efficiently extract the information I need from web pages and process various tasks. The library’s simplicity and flexibility let me focus on the core of my task without getting bogged down in complex code. Whether I’m scraping data or running tests, Jsoup is a reliable choice that makes handling HTML straightforward.

Alternatives to Parsing HTML with Jsoup

In some cases, you can avoid scraping or parsing data yourself. Take a look at my list of the best web scraping tools; you may find one that suits you or your business.

I would recommend Bright Data or ScrapingBee for large-scale operations. I am not affiliated with either of them.

What is Jsoup?

Jsoup is a Java library designed specifically for working with real-world HTML. It provides a convenient API to fetch, parse, and manipulate HTML. One of its greatest strengths is that it handles invalid HTML as well as valid HTML, which matters because malformed markup is common on real web pages.
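To make the point about invalid HTML concrete, here is a minimal sketch that feeds Jsoup markup with no html or body wrapper and two unclosed paragraph tags. The class name LenientParsingSketch and the sample string are my own placeholders for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class LenientParsingSketch {
    // Parses markup that lacks <html>/<body> wrappers and never closes its <p> tags.
    static Document parseMessy() {
        String messy = "<p>First paragraph<p>Second paragraph";
        return Jsoup.parse(messy);
    }

    public static void main(String[] args) {
        Document document = parseMessy();
        // Jsoup builds a full document tree and treats each <p> as its own element.
        System.out.println("Paragraphs found: " + document.select("p").size());
        System.out.println(document.body().html());
    }
}
```

The second opening p tag implicitly closes the first one, so the selector finds two separate paragraphs even though the input never closed either.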

Key Features of Jsoup

  • Parse and clean HTML from URLs, files, or strings.
  • Extract data using DOM traversal or CSS-like selectors.
  • Manipulate the HTML content.
  • Automatically clean untrusted HTML.
  • Support cookies, POST requests, and session handling.
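The last feature is not shown elsewhere in this guide, so here is a rough sketch of how a POST request with form data and a cookie might be assembled. The login URL, form field names, and cookie value are hypothetical placeholders, and the request is only built here, not sent.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class PostRequestSketch {
    // Builds (but does not send) a POST request carrying form fields and a cookie.
    static Connection buildLogin(String url) {
        return Jsoup.connect(url)
                .data("username", "demo")        // hypothetical form field
                .data("password", "secret")      // hypothetical form field
                .cookie("session", "abc123")     // hypothetical cookie
                .method(Connection.Method.POST);
    }

    public static void main(String[] args) {
        // Hypothetical endpoint; replace it with a real form URL before sending.
        Connection connection = buildLogin("https://example.com/login");
        System.out.println("Method: " + connection.request().method());
        System.out.println("Cookie: " + connection.request().cookie("session"));
        // Calling connection.execute() would send the request and return a
        // Connection.Response whose cookies() can be reused on later requests.
    }
}
```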

Setting Up Jsoup

Before we dive into the code, you’ll need to add Jsoup to your project. If you’re using Maven, include the following dependency in your pom.xml:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.3</version> <!-- Check for the latest version -->
</dependency>

For Gradle, add this to your build.gradle:

implementation 'org.jsoup:jsoup:1.15.3'

Alternatively, you can manually download the jar from the Jsoup website and add it to your project.

Basic HTML Parsing with Jsoup

Let’s begin with a simple example of parsing an HTML document. Consider the following HTML snippet:

<html>
<head>
<title>Sample Page</title>
</head>
<body>
<h1>Hello, World!</h1>
<p>This is a sample paragraph.</p>
</body>
</html>

To parse this HTML in Java using Jsoup, you can write the following code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HTMLParserExample {
    public static void main(String[] args) {
        String html = "<html><head><title>Sample Page</title></head>"
                + "<body><h1>Hello, World!</h1><p>This is a sample paragraph.</p></body></html>";
        // Parse the string into a Document (a DOM tree)
        Document document = Jsoup.parse(html);
        System.out.println("Title: " + document.title());
        // select() takes a CSS selector; text() returns the combined text of the matches
        System.out.println("Heading: " + document.select("h1").text());
        System.out.println("Paragraph: " + document.select("p").text());
    }
}

Parsing an HTML File

Jsoup can also load and parse an HTML file from your local filesystem. For example:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.File;
import java.io.IOException;

public class HTMLFileParser {
    public static void main(String[] args) throws IOException {
        File input = new File("path/to/your/file.html");
        // The second argument is the character set used to decode the file
        Document document = Jsoup.parse(input, "UTF-8");
        System.out.println("Title: " + document.title());
    }
}
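If you want to try file parsing without preparing an HTML file first, the sketch below writes a small page to a temporary file and parses it. The class name, method name, and page content are made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TempFileParserSketch {
    // Writes a small HTML page to a temporary file, parses it, and returns its title.
    static String titleFromTempFile() {
        Path file = null;
        try {
            file = Files.createTempFile("sample", ".html");
            String html = "<html><head><title>Saved Page</title></head><body></body></html>";
            Files.write(file, html.getBytes(StandardCharsets.UTF_8));
            Document document = Jsoup.parse(file.toFile(), "UTF-8");
            return document.title();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            // Clean up the temporary file whether or not parsing succeeded.
            if (file != null) {
                file.toFile().delete();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("Title: " + titleFromTempFile());
    }
}
```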

Fetching and Parsing from a URL

A common use case is fetching and parsing an HTML document from a URL. This is often used in web scraping applications.

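import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.IOException;

public class URLParserExample {
    public static void main(String[] args) throws IOException {
        String url = "https://example.com";
        // Fetch the page over HTTP and parse the response body
        Document document = Jsoup.connect(url).get();
        System.out.println("Title: " + document.title());
        // selectFirst() returns null when nothing matches, so check before reading the text
        Element firstParagraph = document.selectFirst("p");
        if (firstParagraph != null) {
            System.out.println("First Paragraph: " + firstParagraph.text());
        }
    }
}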
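In a real scraper you will usually want a user agent, a timeout, and some error handling around the fetch. The following is one possible sketch, not the only way to do it; the user agent string and the 10-second timeout are arbitrary values I picked for illustration.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

class ConfiguredFetchSketch {
    // Prepares a connection with an explicit user agent and timeout; nothing is sent yet.
    static Connection configure(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; ExampleBot/1.0)") // hypothetical user agent
                .timeout(10_000)                                       // milliseconds
                .followRedirects(true);
    }

    public static void main(String[] args) {
        try {
            Document document = configure("https://example.com").get();
            System.out.println("Title: " + document.title());
        } catch (IOException e) {
            // Timeouts, HTTP error statuses, and unsupported content types
            // are all reported as IOException subclasses.
            System.err.println("Fetch failed: " + e.getMessage());
        }
    }
}
```

Catching IOException in one place keeps the example short; in a larger project you might handle HttpStatusException separately to react to specific status codes.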

Extracting Data with Selectors

One of Jsoup’s most powerful features is its ability to extract data using CSS selectors. This feature lets you retrieve specific elements or attributes from the HTML structure.

For example, to extract all links (<a> tags) and their href attributes from a webpage:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

public class LinkExtractor {
    public static void main(String[] args) throws IOException {
        String url = "https://example.com";
        Document document = Jsoup.connect(url).get();
        Elements links = document.select("a[href]"); // Select all <a> tags with href attribute
        for (Element link : links) {
            System.out.println("Link: " + link.attr("href"));
            System.out.println("Text: " + link.text());
        }
    }
}
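Links on real pages are often relative, and selectors can be narrower than a bare tag name. The sketch below combines both ideas: it selects only links that are direct children of a nav element with the class menu, then resolves each href against a base URI. The HTML snippet, the class name menu, and the base URI are assumptions made for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayList;
import java.util.List;

class AbsoluteLinkSketch {
    // Returns absolute URLs for links that are direct children of <nav class="menu">.
    static List<String> absoluteLinks(String html, String baseUri) {
        // The base URI tells Jsoup how to resolve relative href values.
        Document document = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : document.select("nav.menu > a[href]")) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<nav class=\"menu\">"
                + "<a href=\"/about\">About</a>"
                + "<a href=\"contact.html\">Contact</a>"
                + "</nav>"
                + "<a href=\"/ignored\">Footer link</a>";
        for (String url : absoluteLinks(html, "https://example.com/docs/")) {
            System.out.println(url);
        }
    }
}
```

When the document comes from Jsoup.connect(url).get(), the base URI is already set from the fetched URL, so absUrl works there without the extra argument.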

Manipulating HTML Elements

Jsoup also allows you to manipulate the HTML content, like adding or removing elements, changing attributes, or modifying text content. For instance, to modify the text inside a paragraph:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HTMLManipulationExample {
    public static void main(String[] args) {
        String html = "<html><body><p>Original Text</p></body></html>";
        Document document = Jsoup.parse(html);
        // text(String) replaces the text content of the first <p> element
        document.select("p").first().text("Updated Text");
        System.out.println(document.html());
    }
}
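The other manipulations mentioned above (adding or removing elements and changing attributes) follow the same pattern. Here is a small sketch; the element id box, the class name updated, and the link target are placeholders I chose for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ElementEditingSketch {
    // Removes <span> elements from the #box element, appends a link, and adds a class.
    static Element edit(String html) {
        Document document = Jsoup.parse(html);
        Element box = document.getElementById("box"); // assumes the input contains id="box"
        box.select("span").remove();                  // remove matching elements from the tree
        box.appendElement("a")                        // add a new child element
                .attr("href", "https://example.com")  // set an attribute on it
                .text("Read more");                   // set its text content
        box.addClass("updated");                      // change the class attribute
        return box;
    }

    public static void main(String[] args) {
        String html = "<div id=\"box\"><p>Keep me</p><span>Remove me</span></div>";
        System.out.println(edit(html).outerHtml());
    }
}
```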

Cleaning HTML Content

In cases where you want to sanitize or clean HTML, Jsoup provides methods to remove unwanted tags, attributes, or elements, making it useful for input validation or content filtering.

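import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class HTMLCleanerExample {
    public static void main(String[] args) {
        String dirtyHtml = "<p>This is <script>alert('unsafe')</script> content</p>";
        // Safelist replaces the older Whitelist class, which Jsoup 1.15.x no longer includes.
        // Safelist.basic() keeps simple text formatting tags and drops everything else.
        String cleanHtml = Jsoup.clean(dirtyHtml, Safelist.basic());
        System.out.println("Cleaned HTML: " + cleanHtml);
    }
}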
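Two related calls are useful for input validation: stripping every tag to get plain text, and checking whether input is already acceptable without modifying it. This sketch uses Safelist, the class that replaced Whitelist in recent Jsoup releases; the class and method names are my own.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class SafelistSketch {
    // Strips every tag and keeps only text; script contents are dropped as well.
    static String textOnly(String html) {
        return Jsoup.clean(html, Safelist.none());
    }

    // Reports whether the input would pass through Safelist.basic() with nothing removed.
    static boolean isAlreadySafe(String html) {
        return Jsoup.isValid(html, Safelist.basic());
    }

    public static void main(String[] args) {
        String input = "Hello <b>world</b><script>alert(1)</script>";
        System.out.println("Text only: " + textOnly(input));
        System.out.println("Safe as-is: " + isAlreadySafe(input));
        System.out.println("Safe as-is: " + isAlreadySafe("<p>Plain paragraph</p>"));
    }
}
```

Using isValid lets you reject suspicious input outright, while clean lets you accept it after removing the unsafe parts; which one fits depends on how strict the form or API should be.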

Conclusion

Jsoup is a versatile and efficient tool for parsing, manipulating, and extracting data from HTML documents in Java. With its simple API and powerful CSS-like selector capabilities, it’s widely used for web scraping, data extraction, and content manipulation tasks.

Following the steps and examples in this guide, you can start parsing and working with HTML content efficiently. Whether you need to handle complex web scraping projects or simple document manipulation tasks, Jsoup offers the flexibility and performance to get the job done.
