How to Use GoSpider for Web Crawling
In this guide, I’ll walk you through setting up GoSpider, crawling websites, and scraping the information you need. By the end, you’ll be ready to start web crawling on your own. Let’s dive in!
What is GoSpider?
GoSpider is a command-line web crawling tool built with Go (Golang). It is designed to be lightweight and fast, making it ideal for simple web scraping tasks. GoSpider allows you to quickly collect links from websites and follow them to scrape relevant data. It is suitable for projects where you need to crawl multiple pages, such as e-commerce sites or blogs, and gather data like product details, prices, or other relevant information.
Read also how to scrape websites with Geziyor and Go.
Benefits of Using GoSpider
Before diving into the tutorial, let’s explore the key benefits of using GoSpider for web crawling:
- Fast and Efficient: GoSpider is built in Go, a compiled language with lightweight concurrency, which keeps crawls fast even when many pages are fetched at once.
- Easy to Use: With simple commands, GoSpider lets you crawl websites and scrape data with minimal setup.
- Customizable: You can easily customize the crawling depth, follow specific links, and adjust headers and user agents to mimic real browser traffic (see the example command after this list).
- Works on Multiple Platforms: GoSpider is cross-platform, which means it works on Windows, macOS, and Linux.
- Open Source: GoSpider is free and open-source, allowing you to contribute or modify the code as needed.
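For instance, once GoSpider is installed (covered in the next section), you can set a user agent and an extra request header with flags along these lines. Treat this as a sketch: the exact option names are listed in gospider -h, so check them against your installed version:
gospider -q -s "https://www.example.com" -u web -H "Accept-Language: en-US" -o output
Here -u web asks GoSpider to use a random web-browser user agent, and -H adds a custom header to each request.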
Getting Started with GoSpider
Before you can start crawling, you need to set up your environment and install GoSpider. The following steps will guide you through the installation process.
Step 1: Install Go
The first thing you need to do is make sure you have Go installed on your computer. If you don’t have Go, you can download it from the official Go website: https://golang.org/dl/.
Once you have installed Go, you can verify the installation by opening your terminal (Command Prompt on Windows or Terminal on macOS/Linux) and typing the following command:
go version
If Go is installed correctly, this command will output the installed version of Go, such as:
go version go1.23.4 windows/amd64
Step 2: Install GoSpider
After ensuring that Go is installed, you can install GoSpider using the following command:
GO111MODULE=on go install github.com/jaeles-project/gospider@latest
This command installs the latest version of GoSpider from GitHub and places the gospider executable in your Go binary directory ($GOPATH/bin, or $HOME/go/bin by default). Make sure that directory is on your PATH so you can run gospider from any terminal. (With Go 1.16 or later, module mode is the default, so the GO111MODULE=on prefix can be omitted.)
Once the installation is complete, you can check if GoSpider is installed correctly by typing the following command:
gospider -h
This will display the help menu with an overview of the commands and options available in GoSpider.
Basic Crawling with GoSpider
Now that you have GoSpider installed, let’s walk through a simple example of how to use it to crawl a website and collect links.
Step 3: Start Crawling
Let’s say you want to crawl a specific website, such as the homepage of an e-commerce store. You can do this by using the following GoSpider command:
gospider -q -s "https://www.example.com" -o output
Here’s a breakdown of the command:
- -q: This flag suppresses the verbose output, meaning it will only show the URLs found during crawling.
- -s: This flag specifies the URL of the site to crawl.
- -o: This flag specifies the output folder where GoSpider will save the crawled data. In this case, it will save the results in a folder named “output”.
After running this command, GoSpider will start crawling the website and saving the links it finds into text files inside the “output” folder. You should see an output similar to this:
https://www.example.com/products
https://www.example.com/contact
Step 4: Follow Links with GoSpider
By default, GoSpider only crawls the page you specify and finds the links present on that page. However, you may want to follow links to other pages on the website. For example, you might want to crawl all the product pages in an e-commerce store.
To follow links, you can use the -d flag, which sets the recursion depth for following links. The default depth is 1, meaning GoSpider will only crawl the links on the starting page. You can increase the depth value if you want to follow links to a deeper level.
For example, to crawl up to three levels of pages, use this command:
gospider -q -s "https://www.example.com/products" -d 3 -o output
This tells GoSpider to crawl the “products” page and follow the links it finds up to three levels deep (for example, product detail pages, reviews, and so on).
Step 5: Filter Results Using Regular Expressions
Sometimes, you may want to filter out specific types of links. For example, you may only be interested in crawling pages with a specific pattern in their URLs, such as product pages ending with /product/ in the URL.
GoSpider allows you to filter links using regular expressions. You can use the --whitelist and --blacklist options to specify patterns to include or exclude.
For example, to crawl only URLs that contain /product/, use the following command:
gospider -q -s "https://www.example.com" - whitelist "/product/" -o output
This will make GoSpider crawl only the links that match the /product/ pattern and ignore other pages.
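The --blacklist option works the other way around, excluding URLs that match the pattern. For example, to skip links to common static assets, you could run something like the following (the regular expression here is only an illustration, so adjust it to your target site):
gospider -q -s "https://www.example.com" --blacklist "\.(jpg|png|css|js)" -o output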
Scraping Data from Collected Links
Once you’ve crawled a website and gathered the links, the next step is to scrape valuable data from those pages. You may often want to extract specific information, such as product names, prices, or image URLs.
GoSpider does not provide built-in support for scraping data from HTML content, but you can process the data you collect using other Go libraries, such as Colly.
Step 6: Install Colly for Web Scraping
Colly is a powerful and fast web scraping library for Go that can extract structured data from websites. Inside your project folder, initialize a Go module if you haven’t already (for example, go mod init crawler), then install Colly with the following command:
go get github.com/gocolly/colly/v2
After installing Colly, you can scrape data from the links you collected with GoSpider.
Step 7: Scrape Product Data with Colly
Let’s assume you want to extract product names, prices, and image URLs from a list of product pages. Here’s how you can do it.
First, create a Go file named crawler.go and import the necessary libraries:
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"

	"github.com/gocolly/colly/v2"
)
Then, define a Product struct to store the scraped data:
type Product struct {
	Name     string
	Price    string
	ImageURL string
}

var products []Product
Now, set up Colly to visit the URLs and extract product details:
func main() {
	// Create a new Colly collector.
	c := colly.NewCollector()

	// Extract the product name, price, and image URL from each product element.
	c.OnHTML("li.product", func(e *colly.HTMLElement) {
		productName := e.ChildText(".product-name")
		productPrice := e.ChildText(".product-price")
		imageURL := e.ChildAttr(".product-image", "src")

		product := Product{
			Name:     productName,
			Price:    productPrice,
			ImageURL: imageURL,
		}
		products = append(products, product)

		fmt.Printf("Product Name: %s\nProduct Price: %s\nImage URL: %s\n", productName, productPrice, imageURL)
	})

	// Open the file of links collected by GoSpider.
	file, err := os.Open("pagination_links")
	if err != nil {
		log.Fatalf("Error opening file: %v", err)
	}
	defer file.Close()

	// Read the links from the file, one URL per line.
	var urls []string
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		if line := scanner.Text(); line != "" {
			urls = append(urls, line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatalf("Error reading file: %v", err)
	}

	// Visit each URL and scrape the product details.
	for _, url := range urls {
		if err := c.Visit(url); err != nil {
			log.Printf("Error visiting %s: %v", url, err)
		}
	}
}
This code will visit each link and extract the product name, price, and image URL. You can then use the data as needed, such as saving it to a CSV file.
Exporting Data to CSV
Once you’ve collected the data, you might want to save it to a CSV file for later analysis. Go provides a built-in encoding/csv package to make this easy.
Here’s how you can write the scraped product data to a CSV file. First, make sure the encoding/csv package is included in the import block of crawler.go:
import (
	"encoding/csv"
	"os"
)
func exportToCSV(filename string) {
	// Create the output CSV file.
	file, err := os.Create(filename)
	if err != nil {
		fmt.Println("Error creating CSV file:", err)
		return
	}
	defer file.Close()

	writer := csv.NewWriter(file)
	defer writer.Flush()

	// Write the header row, then one row per scraped product.
	writer.Write([]string{"Name", "Price", "Image URL"})
	for _, product := range products {
		writer.Write([]string{product.Name, product.Price, product.ImageURL})
	}

	fmt.Println("Product details exported to", filename)
}
Finally, register the export with Colly’s OnScraped callback inside main(), before the visit loop:
c.OnScraped(func(r *colly.Response) {
	exportToCSV("product_data.csv")
})
Note that OnScraped runs after each visited page finishes, so the CSV file is rewritten with all products collected so far after every page; once the last page has been scraped, it contains the complete data set.
This will create a CSV file with the product name, price, and image URL for each product.
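With the header row written by exportToCSV(), the resulting file looks roughly like this (the product row below is just a placeholder, not real scraped data):
Name,Price,Image URL
Example Product,19.99,https://www.example.com/images/example-product.jpg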
Conclusion
GoSpider is a fantastic tool for web crawling and scraping due to its simplicity and speed. It’s perfect for quickly crawling websites and gathering links. When you pair GoSpider with libraries like Colly, you can easily take your crawlers to the next level by extracting valuable data.
In this guide, we’ve covered how to set up GoSpider, crawl websites, scrape valuable data, then export that data to CSV. Whether you’re working on a small site or a large project, GoSpider makes it easy to gather the information you need. With some practice, you can create powerful crawlers that save you time and help you easily collect data from the web.