5 Best Rust HTML Parsers for Web Scraping
If you’re using Rust (like I am), you’ll want a parser that’s fast, safe, and easy to work with. Luckily, there are a few great libraries out there built just for that. In this guide, I’ll walk you through the 5 best Rust HTML parsers I’ve found for web scraping. I’ll share what each one does well, where it falls short, and when to use it.
Why You Need an HTML Parser in Rust
When you fetch a webpage using a Rust HTTP client like reqwest, you receive the raw HTML content. However, HTML is not easy to read or navigate using basic string methods. HTML parsers let you:
- Search for elements using tags or CSS classes
- Navigate parent-child relationships between elements
- Extract text or attribute values
- Handle broken or malformed HTML
This is especially helpful when scraping real-world websites that don’t always follow perfect HTML rules.
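To make the point concrete, here's a deliberately naive sketch (the naive_price helper is purely illustrative) that pulls a price with plain string methods. It breaks the moment attribute order, quoting style, or the class list changes, which is exactly what a real parser protects you from:

// Fragile by design: this only works if the markup contains exactly
// class="price"> with no extra classes, single quotes, or reordering.
fn naive_price(html: &str) -> Option<&str> {
    let marker = "class=\"price\">";
    let start = html.find(marker)? + marker.len();
    let end = html[start..].find('<')? + start;
    Some(html[start..end].trim())
}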
Setup for Testing
Before we dive into the individual libraries, here's the sample async Rust function we'll use to fetch the HTML from a test website:
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetch the page and read the response body into a String.
    let resp = reqwest::get("https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/")
        .await?
        .text()
        .await?;
    println!("{}", resp);
    Ok(())
}
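For these examples to compile, Cargo.toml needs the crates below. The version numbers are illustrative guesses, not pinned recommendations; check crates.io for current releases:

[dependencies]
# Versions shown are illustrative; pin to whatever is current.
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }
reqwest = "0.12"
scraper = "0.20"
select = "0.6"
kuchiki = "0.8"
html5ever = "0.25"
pulldown-cmark = "0.12"
html2md = "0.2"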
Top 5 Best Rust HTML Parsers for Web Scraping
Let’s take a closer look at each HTML parser on the list. I’ll show you how they work with real-world HTML and how well they perform.
Scraper — Easy and Reliable
Scraper is a high-level HTML parsing library in Rust. It’s built on top of html5ever and selectors, which means it’s fast and designed to handle messy, real-world HTML.
Key Features:
- High-level interface
- CSS selector support
- Based on real browser technology
- DOM-like HTML tree
Pros:
- Very easy to use
- Can handle broken or invalid HTML
- Active development and community support
- Good documentation
Cons:
- Consumes more memory with large pages
- Limited support for modifying the HTML
Example:
use scraper::{Html, Selector};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let resp = reqwest::get("https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/")
        .await?
        .text()
        .await?;

    // Parse the full page into a queryable document.
    let document = Html::parse_document(&resp);

    // CSS selector for the element holding the product price.
    let selector = Selector::parse(".price").unwrap();

    if let Some(element) = document.select(&selector).next() {
        // text() yields the element's text nodes; collect them into one String.
        let price = element.text().collect::<String>();
        println!("Price: {}", price);
    }
    Ok(())
}
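Since select() returns an iterator, grabbing every match instead of just the first is a one-line change. A quick sketch you could drop into the same main (the h1 selector is an assumption about the page's markup):

// Loop over all matching elements rather than taking only the first.
let heading_selector = Selector::parse("h1").unwrap();
for element in document.select(&heading_selector) {
    println!("Heading: {}", element.text().collect::<String>());
}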
Use Case: Best for most web scraping projects, especially when ease-of-use and real-world compatibility are key.
html5ever — Fast and Low-Level
html5ever is a fast and powerful HTML parser written in Rust, created as part of the Servo browser project. Used directly, it doesn't hand you a DOM tree; instead you drive its tokenizer, which emits tags, text, and comments as a stream of tokens.
Key Features:
- Complies with the HTML5 standard
- Very fast parsing
- Used in browser engines like Servo
Pros:
- Extremely fast and efficient
- Handles malformed HTML well
- Full control over HTML structure
- Low-level customization possible
Cons:
- Verbose and harder to use
- No DOM tree representation
- Requires more code to extract specific data
Example:
use html5ever::tendril::StrTendril;
use html5ever::tokenizer::{
    BufferQueue, EndTag, StartTag, Token, TokenSink, TokenSinkResult, Tokenizer, TokenizerOpts,
};

// A TokenSink that flips a flag when it enters <p class="price"> and
// collects the character tokens it sees until that tag closes.
struct PriceFinder {
    inside_price: bool,
    result: String,
}

impl TokenSink for PriceFinder {
    type Handle = ();

    fn process_token(&mut self, token: Token, _line: u64) -> TokenSinkResult<()> {
        match token {
            Token::TagToken(tag) => {
                if tag.kind == StartTag && tag.name.as_ref() == "p" {
                    // Enter "price mode" when the <p> carries class="price".
                    for attr in tag.attrs {
                        if attr.name.local.as_ref() == "class" && attr.value.as_ref() == "price" {
                            self.inside_price = true;
                        }
                    }
                } else if tag.kind == EndTag && tag.name.as_ref() == "p" {
                    self.inside_price = false;
                }
            }
            Token::CharacterTokens(t) => {
                // Accumulate text only while inside the price element.
                if self.inside_price {
                    self.result.push_str(&t);
                }
            }
            _ => {}
        }
        TokenSinkResult::Continue
    }
}
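The sink alone doesn't run anything; you still have to pump the HTML through a Tokenizer. Here's a minimal sketch of that driver, assuming the pre-0.26 html5ever tokenizer API (where feed takes &mut BufferQueue and process_token takes &mut self); newer releases moved to interior mutability, so adjust accordingly:

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let resp = reqwest::get("https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/")
        .await?
        .text()
        .await?;

    let sink = PriceFinder { inside_price: false, result: String::new() };
    let mut tokenizer = Tokenizer::new(sink, TokenizerOpts::default());

    // Hand the fetched HTML to the tokenizer as a tendril buffer.
    let mut input = BufferQueue::new();
    input.push_back(StrTendril::from_slice(&resp));
    let _ = tokenizer.feed(&mut input);
    tokenizer.end();

    println!("Price: {}", tokenizer.sink.result);
    Ok(())
}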
Use Case: Best for performance-critical scraping or when you want complete control.
Select.rs — jQuery-Style Syntax
Select.rs is another popular HTML parser for Rust. Its jQuery-inspired predicate API (Name, Class, Attr, and friends) makes queries easy to write and read.
Key Features:
- jQuery-inspired predicates such as Name, Class, and Attr
- Predicates compose with .and(), .or(), and .descendant()
- Straightforward node traversal and text/attribute extraction
Pros:
- jQuery-style syntax
- Easy for beginners
- Composable predicates read clearly
- Built-in predicates for finding nodes by name, class, or attribute
Cons:
- Uses more memory than other tools
- Slower than some alternatives for large documents
Example:
use select::document::Document;
use select::predicate::{Class, Name, Predicate};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let resp = reqwest::get("https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/")
        .await?
        .text()
        .await?;

    // Parse the body into a Document we can query with predicates.
    let document = Document::from(resp.as_str());

    // Find the first <p class="price"> (note: .and() needs the Predicate trait in scope).
    if let Some(price_node) = document.find(Name("p").and(Class("price"))).next() {
        println!("Price: {}", price_node.text());
    }
    Ok(())
}
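Predicates are composable, which is where select.rs shines. As an illustrative sketch (the .product class and listing-page markup are assumptions, not taken from the test page), here's how you might collect every product link using descendant():

use select::document::Document;
use select::predicate::{Class, Name, Predicate};

// Gather the href of every <a> nested inside an element with class
// "product"; filter_map skips anchors that have no href attribute.
fn product_links(html: &str) -> Vec<String> {
    Document::from(html)
        .find(Class("product").descendant(Name("a")))
        .filter_map(|node| node.attr("href"))
        .map(str::to_owned)
        .collect()
}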
Use Case: Great for users who prefer jQuery-style syntax and readable, composable queries.
Kuchiki — DOM-Like Tree
Kuchiki (Japanese for “rotten tree”) is a fun and intuitive HTML parser that creates a DOM tree, like a real browser. It’s built on html5ever, but provides a tree-based interface with CSS-style selectors.
Key Features:
- DOM tree support
- CSS selectors
- Allows traversal and modification
Pros:
- Very easy to understand
- Full tree structure
- Good selector support
- Great for traversing complex HTML
Cons:
- No longer actively maintained (archived in 2023)
- Larger binary size due to tree handling
Example:
use kuchiki::parse_html;
use kuchiki::traits::*;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let resp = reqwest::get("https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/")
        .await?
        .text()
        .await?;

    // Parse the HTML into a browser-style DOM tree.
    let document = parse_html().one(resp);

    // select_first returns the first node matching the CSS selector.
    if let Ok(price_node) = document.select_first("p.price") {
        println!("Price: {}", price_node.text_contents());
    }
    Ok(())
}
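Because Kuchiki hands you a real tree, iterating every match is just as easy: select() returns an iterator of matches, and as_node() exposes the underlying NodeRef for traversal. A small sketch continuing the example above:

// Print the text of every paragraph, not just the first match.
for css_match in document.select("p").unwrap() {
    let node = css_match.as_node();
    println!("{}", node.text_contents());
}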
Use Case: Best for tree traversal and simple HTML tasks. Avoid it for long-lived projects, since the crate is no longer maintained.
Pulldown-Cmark — Markdown First
Pulldown-cmark is not an HTML parser in the traditional sense. It’s designed to parse Markdown, but it can be adapted for some HTML parsing tasks. It works well when combined with tools like html2md that convert HTML to Markdown.
Key Features:
- Pull parser for Markdown
- Extremely low memory usage
- Works in an HTML → Markdown → plain-text pipeline (with a converter like html2md)
Pros:
- Super lightweight
- Efficient for large text files
- Useful when you only need plain text
- Simple to use
Cons:
- Not a true HTML parser
- Lacks CSS selector or DOM support
- Needs other crates for HTML handling
Example:
use pulldown_cmark::{Event, Parser};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let resp = reqwest::get("https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/")
        .await?
        .text()
        .await?;

    // Convert the HTML to Markdown first, then parse the Markdown.
    let markdown = html2md::parse_html(&resp);
    let parser = Parser::new(&markdown);

    // Scan the text events for anything that looks like a price.
    for event in parser {
        if let Event::Text(text) = event {
            if text.contains('$') {
                println!("Price: {}", text);
            }
        }
    }
    Ok(())
}
Use Case: Great for converting HTML content to Markdown for analysis or display.
Performance Benchmark
We tested each parser on the same webpage and extraction task (pulling the product price) using criterion.rs on a Ryzen 9 machine. Real-world results will vary with page size, selector complexity, and how much of the document you actually traverse.
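For reference, this is the shape of a criterion.rs harness for the Scraper case; the product.html fixture and bench name are assumptions for illustration, not the exact harness we ran. It lives in benches/ with harness = false set for it in Cargo.toml:

use criterion::{criterion_group, criterion_main, Criterion};
use scraper::{Html, Selector};

fn bench_scraper(c: &mut Criterion) {
    // Load the page once from disk so network time doesn't skew the numbers.
    let html = std::fs::read_to_string("product.html").expect("fixture missing");
    let selector = Selector::parse(".price").unwrap();

    c.bench_function("scraper_extract_price", |b| {
        b.iter(|| {
            let doc = Html::parse_document(&html);
            doc.select(&selector)
                .next()
                .map(|e| e.text().collect::<String>())
        })
    });
}

criterion_group!(benches, bench_scraper);
criterion_main!(benches);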
When to Use Each HTML Parser
- Scraper: the default pick for most scraping projects
- html5ever: performance-critical scraping or full low-level control
- Select.rs: jQuery-style predicates and quick, readable queries
- Kuchiki: DOM-style tree traversal on small, short-lived projects (the crate is archived)
- Pulldown-cmark: pipelines that convert HTML to Markdown for text analysis
Conclusion
Rust has several capable HTML parsers for web scraping. Whether you want high performance, easy syntax, or advanced control, there’s a crate for you. If you’re just getting started, Scraper is a great balance of power and simplicity. If you care about raw performance and have time to write more code, go for html5ever. For structured output and jQuery-style selection, try Select.rs.
And if you’re converting content to Markdown, Pulldown-cmark may surprise you with its speed. Whichever parser you choose, pair it with reliable proxies or a scraping API like Bright Data or ScrapingBee so your scrapers avoid IP blocks and scale effectively!