How to Parse HTML in Golang: A Guide
In this article, I’ll show you how to parse HTML in Go using the golang.org/x/net/html package, which is maintained by the Go team. We’ll go through the basics, and I’ll also introduce you to some other libraries that might be better suited for specific tasks. By the end, you’ll have the skills to parse HTML, extract structured data, and use Go to handle web scraping like a pro. Let’s get started!
Prerequisites
Before diving into parsing HTML in Go, make sure you have the following:
- Go installed: If you don’t have Go installed yet, visit the official Go website and follow the installation instructions.
- Basic understanding of Go syntax: If you’re new to Go, consider brushing up on its syntax, including how to work with packages, functions, and error handling.
- HTML knowledge: Understanding HTML structure is crucial, as you’ll need to identify the tags and attributes you want to extract.
- Go modules: Ensure that Go modules are initialized in your project by running go mod init <module-name> if you’re starting a new project.
Parsing HTML Using net/html
Go’s HTML parsing support lives in the net/html package. Despite the short import path, it is not part of the standard library: it’s maintained by the Go team under golang.org/x/net, so install it with go get golang.org/x/net/html. The package implements an HTML5-compliant tokenizer and parser, which means it handles malformed markup the same way browsers do. It offers two main approaches to parsing: the low-level tokenizer API and the node parsing API. In this tutorial, we’ll primarily focus on the node parsing API, as it provides a higher-level abstraction and is easier to work with. However, we’ll also briefly touch on the tokenizer API for completeness.
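For completeness, here’s what the tokenizer API looks like. This is a minimal sketch that parses an inline HTML string instead of a live page and prints each start tag as it’s encountered:

package main

import (
    "fmt"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    // An inline document so the example runs without a network request.
    const doc = `<html><body><h1>Hello</h1><p>World</p></body></html>`
    z := html.NewTokenizer(strings.NewReader(doc))
    for {
        switch z.Next() {
        case html.ErrorToken:
            // The tokenizer returns ErrorToken at io.EOF and on real errors.
            return
        case html.StartTagToken:
            name, _ := z.TagName()
            fmt.Println("Start tag:", string(name))
        }
    }
}

The tokenizer hands you a flat stream of tokens and leaves tree-building to you, which can be useful for very large documents when you only need a few values.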
Alternatives to Parsing With Go
The best alternative that exists right now is ready-made datasets. Visit my article about the best dataset websites. If you are in a hurry, here’s a TL;DR of the top 5 dataset providers:
- Bright Data — Customizable and pre-built datasets across industries.
- Statista — Extensive statistics and reports for business and research.
- Datarade — Marketplace for premium data products from various providers.
- AWS Data Exchange — Third-party datasets integrated with AWS services.
- Zyte — Web scraping and custom datasets tailored to business needs.
I am not affiliated with any of the providers mentioned.
Step 1: Fetching HTML from a Web Page
To parse HTML, you must first fetch the raw HTML content from a web page. This is done by making an HTTP request to the URL of the page you want to scrape. In Go, you can use the standard library’s net/http package to make HTTP requests.
Here’s an example of how to make a simple HTTP GET request:
package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    url := "https://example.com" // Replace with your target URL
    resp, err := http.Get(url)
    if err != nil {
        fmt.Println("Error fetching URL:", err)
        return
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        fmt.Println("Error reading response body:", err)
        return
    }

    fmt.Println("Fetched HTML:")
    fmt.Println(string(body))
}
In this code, we use http.Get() to send a GET request, then read the response body with io.ReadAll(). The page’s HTML content is printed as a string.
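One thing the example above skips is the HTTP status code: a server can return a 404 or 500 page whose body is still perfectly valid HTML. Here’s a sketch of a small wrapper (fetchHTML is a made-up name, not part of net/http) that treats anything other than 200 OK as an error:

package main

import (
    "fmt"
    "io"
    "net/http"
)

// fetchHTML is a hypothetical helper that wraps http.Get with a status
// check and returns the raw body bytes.
func fetchHTML(url string) ([]byte, error) {
    resp, err := http.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("unexpected status: %s", resp.Status)
    }
    return io.ReadAll(resp.Body)
}

func main() {
    body, err := fetchHTML("https://example.com")
    if err != nil {
        fmt.Println("Error fetching URL:", err)
        return
    }
    fmt.Println(string(body))
}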
Step 2: Parsing HTML with the Node Parsing API
Now that we have the raw HTML, let’s parse it. The net/html package provides a function called html.Parse(), which parses an HTML document and returns a tree of nodes. Each node in the tree represents an HTML element, attribute, or piece of text.
To parse the HTML, we pass the response body (which is an io.Reader) to html.Parse(). This returns a root node that you can traverse recursively to extract the information you need.
Here’s an example that parses the HTML document and prints the tag names:
package main

import (
    "fmt"
    "net/http"

    "golang.org/x/net/html"
)

func main() {
    url := "https://example.com"
    resp, err := http.Get(url)
    if err != nil {
        fmt.Println("Error fetching URL:", err)
        return
    }
    defer resp.Body.Close()

    doc, err := html.Parse(resp.Body)
    if err != nil {
        fmt.Println("Error parsing HTML:", err)
        return
    }
    traverse(doc)
}

// traverse recursively visits the HTML nodes and prints their tag names.
func traverse(n *html.Node) {
    if n.Type == html.ElementNode {
        fmt.Println("Tag:", n.Data)
    }
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        traverse(c)
    }
}
This code will output the tag names (e.g., html, body, div) found in the HTML document.
Step 3: Extracting Specific Data from the HTML
To extract specific data, you’ll need to inspect the HTML structure and identify the elements that contain the information you want. For example, suppose you want to extract all product names, prices, and image URLs from an e-commerce page. You would look for specific tags (such as <h2> for product names, <span> for prices, and <img> for images) and extract their content.
Here’s an example function that extracts product names, prices, and image URLs:
package main

import (
    "fmt"
    "net/http"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    url := "https://example.com/ecommerce"
    resp, err := http.Get(url)
    if err != nil {
        fmt.Println("Error fetching URL:", err)
        return
    }
    defer resp.Body.Close()

    doc, err := html.Parse(resp.Body)
    if err != nil {
        fmt.Println("Error parsing HTML:", err)
        return
    }
    processProducts(doc)
}

// processProducts recursively walks the HTML nodes and extracts product
// details from each <li> element.
func processProducts(n *html.Node) {
    if n.Type == html.ElementNode && n.Data == "li" {
        var name, price, imageURL string
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            // Skip the text and comment nodes between the elements.
            if c.Type != html.ElementNode {
                continue
            }
            switch c.Data {
            case "h2":
                if c.FirstChild != nil && c.FirstChild.Type == html.TextNode {
                    name = c.FirstChild.Data
                }
            case "span":
                for _, a := range c.Attr {
                    if a.Key == "class" && strings.Contains(a.Val, "price") {
                        if c.FirstChild != nil && c.FirstChild.Type == html.TextNode {
                            price = c.FirstChild.Data
                        }
                    }
                }
            case "img":
                for _, a := range c.Attr {
                    if a.Key == "src" {
                        imageURL = a.Val
                    }
                }
            }
        }
        fmt.Println("Product Name:", name)
        fmt.Println("Price:", price)
        fmt.Println("Image URL:", imageURL)
    }
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        processProducts(c)
    }
}
In this example, the processProducts function recursively searches for <li> elements. Inside each <li>, it scans the direct children for <h2> (product name), <span> with a price class (price), and <img> (image URL). It then extracts the relevant text or attributes and prints the information.
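If you’d rather collect the results than print them, you can gather them into a slice of structs instead. Here’s a sketch under the same markup assumptions as above (Product and collectProducts are made-up names; price handling is omitted for brevity):

package main

import (
    "fmt"
    "strings"

    "golang.org/x/net/html"
)

// Product is a hypothetical struct holding the fields extracted above.
type Product struct {
    Name, ImageURL string
}

// collectProducts walks the tree and appends one Product per <li>,
// reusing the same direct-child assumptions as processProducts.
func collectProducts(n *html.Node, out *[]Product) {
    if n.Type == html.ElementNode && n.Data == "li" {
        var p Product
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            if c.Type != html.ElementNode {
                continue
            }
            switch c.Data {
            case "h2":
                if c.FirstChild != nil && c.FirstChild.Type == html.TextNode {
                    p.Name = c.FirstChild.Data
                }
            case "img":
                for _, a := range c.Attr {
                    if a.Key == "src" {
                        p.ImageURL = a.Val
                    }
                }
            }
        }
        *out = append(*out, p)
    }
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        collectProducts(c, out)
    }
}

func main() {
    // An inline document so the sketch runs without a live server.
    const page = `<ul><li><h2>Widget</h2><img src="/widget.png"></li></ul>`
    doc, err := html.Parse(strings.NewReader(page))
    if err != nil {
        fmt.Println("Error parsing HTML:", err)
        return
    }
    var products []Product
    collectProducts(doc, &products)
    fmt.Printf("%+v\n", products)
}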
Step 4: Handling Errors and Edge Cases
When working with HTML, you’ll encounter edge cases such as missing tags, empty attributes, or malformed HTML. It’s important to handle errors gracefully and ensure your scraper can handle these scenarios.
For example, when extracting text from nodes, always check if the node has children and if the child is a text node. Similarly, when extracting attributes (like src for images), ensure that the attribute exists before using it.
if n.FirstChild != nil && n.FirstChild.Type == html.TextNode {
    fmt.Println("Text:", n.FirstChild.Data)
}
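A small helper makes attribute lookups just as safe. Here, getAttr is a made-up helper that reports whether the attribute was present at all, so a missing src is distinguishable from an empty one:

package main

import (
    "fmt"
    "strings"

    "golang.org/x/net/html"
)

// getAttr is a hypothetical helper: it looks up an attribute by key on a
// node and reports whether the attribute was present.
func getAttr(n *html.Node, key string) (string, bool) {
    for _, a := range n.Attr {
        if a.Key == key {
            return a.Val, true
        }
    }
    return "", false
}

func main() {
    doc, err := html.Parse(strings.NewReader(`<img src="/cat.png" alt="">`))
    if err != nil {
        fmt.Println("Error parsing HTML:", err)
        return
    }
    var walk func(*html.Node)
    walk = func(n *html.Node) {
        if n.Type == html.ElementNode && n.Data == "img" {
            if src, ok := getAttr(n, "src"); ok {
                fmt.Println("Image URL:", src)
            }
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            walk(c)
        }
    }
    walk(doc)
}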
Alternative Libraries for Parsing HTML in Go
While net/html is Go’s most commonly used HTML parsing package, other libraries offer additional features or alternative APIs that may be useful depending on your specific use case.
Goquery
Goquery is a popular Go library with a jQuery-like syntax for traversing and manipulating HTML documents. It builds on the net/html package but provides a more convenient API that lets you query elements using CSS selectors, making it easier to extract specific data without manually traversing the HTML tree. Install it with go get github.com/PuerkitoBio/goquery.
Here’s an example using Goquery to extract product names from a page:
package main

import (
    "fmt"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    url := "https://example.com/ecommerce"
    res, err := http.Get(url)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    defer res.Body.Close()

    doc, err := goquery.NewDocumentFromReader(res.Body)
    if err != nil {
        fmt.Println("Error parsing HTML:", err)
        return
    }
    doc.Find("li").Each(func(i int, s *goquery.Selection) {
        name := s.Find("h2").Text()
        fmt.Println("Product Name:", name)
    })
}
Goquery allows you to easily select elements using CSS selectors like doc.Find("li h2").Text(), making the code much cleaner and easier to read.
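Text isn’t the only thing you can pull out: Goquery’s Attr method returns an attribute’s value along with a boolean telling you whether it exists. Here’s a sketch that also extracts the price and image URL, using an inline document that mirrors the hypothetical product markup from Step 3:

package main

import (
    "fmt"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    // An inline document so the example runs without a live server.
    const page = `<ul>
        <li><h2>Widget</h2><span class="price">$9.99</span><img src="/widget.png"></li>
    </ul>`
    doc, err := goquery.NewDocumentFromReader(strings.NewReader(page))
    if err != nil {
        fmt.Println("Error parsing HTML:", err)
        return
    }
    doc.Find("li").Each(func(i int, s *goquery.Selection) {
        fmt.Println("Product Name:", s.Find("h2").Text())
        fmt.Println("Price:", s.Find("span.price").Text())
        // Attr returns the value and whether the attribute exists at all.
        if src, ok := s.Find("img").Attr("src"); ok {
            fmt.Println("Image URL:", src)
        }
    })
}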
Other Libraries
- Go-html-transform: This library provides an API for transforming HTML documents using CSS selectors, similar to Goquery but with a different approach. However, it is no longer actively maintained.
- Html2go: This tool mainly converts HTML files into Go code (structs). It can be useful if you need to define your HTML document structure statically.
Conclusion
Parsing HTML in Go is pretty simple with the net/html package. You can fetch raw HTML and use html.Parse() to go through the HTML tree and get the data you need. If you want something more convenient, libraries like Goquery offer jQuery-like features that make working with HTML even easier. Whether you’re building a quick web scraper or diving into a larger web crawling project, Go’s libraries and concurrency features make it a solid choice for handling HTML parsing efficiently.