Building a Web Crawler in C#: Step-by-Step Tutorial

In this article, I’ll show you how to build a web crawler in C#. We’ll start from scratch and work step by step, and by the end you’ll have an efficient, scalable crawler ready to collect the data you need.

Let’s get started!

What Is a Web Crawler?

A web crawler, also known as a spider or bot, is an automated program that systematically navigates web pages, discovers links, and gathers data. Unlike web scraping, which focuses on extracting specific data, web crawling is about navigating websites and building a structural map of their content. A crawler can also integrate scraping functionality to extract relevant data as it follows links.

Compare web crawling with web scraping here.

Alternative to Building a Web Crawler

If building and maintaining a web crawler feels overwhelming, Bright Data offers powerful alternatives to simplify your workflow. Use the Web Scraper API for hassle-free, structured data extraction or access ready-to-use datasets tailored to your needs. These solutions save time, scale effortlessly, and include features like CAPTCHA solving, IP rotation, and compliance with privacy laws — letting you focus on analyzing data, not collecting it.

I am not affiliated with Bright Data; it’s just a suggestion.

Prerequisites for Building a Web Crawler in C#

Before starting, ensure you have the following tools and libraries:

  • .NET SDK (Version 8 or Later): Download and install the latest version from the official Microsoft .NET website.
  • IDE: Use Visual Studio 2022 or Visual Studio Code with the C# extension.
  • NuGet Package Manager: Included with Visual Studio and used to install dependencies like Html Agility Pack and CsvHelper.

Step 1: Setting Up the Environment

Start by creating a new console application:

mkdir web-crawler
cd web-crawler
dotnet new console --framework net8.0

Installing Dependencies

Add the following libraries using NuGet:

  • Html Agility Pack: For parsing HTML.
dotnet add package HtmlAgilityPack
  • Html Agility Pack CSS Selectors: Simplifies selecting elements using CSS selectors.
dotnet add package HtmlAgilityPack.CssSelectors
  • CsvHelper: For exporting data to CSV files.
dotnet add package CsvHelper

Step 2: Writing the Basic Crawler

Loading a Web Page

Set up the program to fetch and parse a webpage:

using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        var web = new HtmlWeb();
        var document = web.Load("https://example.com");

        Console.WriteLine("Page loaded successfully!");
    }
}

Run the application with:

dotnet run

Discovering Links

Expand the code to identify links on the page. Use HtmlAgilityPack to locate all <a> elements and extract their href attributes:

var links = document.DocumentNode.SelectNodes("//a[@href]");
if (links != null) // SelectNodes returns null when the page contains no matches
{
    foreach (var link in links)
    {
        var url = link.GetAttributeValue("href", string.Empty);
        Console.WriteLine($"Found URL: {url}");
    }
}
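
If you prefer CSS selectors over XPath, the HtmlAgilityPack.CssSelectors package installed earlier exposes QuerySelector/QuerySelectorAll extension methods. A minimal sketch of the same link discovery using those extensions:

// CSS-selector variant of the link discovery above (requires HtmlAgilityPack.CssSelectors).
var cssLinks = document.DocumentNode.QuerySelectorAll("a[href]");
foreach (var link in cssLinks)
{
    Console.WriteLine($"Found URL: {link.GetAttributeValue("href", string.Empty)}");
}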

Step 3: Managing the Crawling Process

To crawl multiple pages systematically, maintain a queue of URLs still to visit and a set of URLs that have already been visited, so the crawler doesn’t process the same page twice.

Implementing URL Queueing

Use a Queue for URLs to visit and a HashSet to track visited URLs:

var urlsToVisit = new Queue<string>();
var visitedUrls = new HashSet<string>();

urlsToVisit.Enqueue("https://example.com");

while (urlsToVisit.Count > 0)
{
    var currentUrl = urlsToVisit.Dequeue();
    if (visitedUrls.Contains(currentUrl)) continue;

    visitedUrls.Add(currentUrl);
    Console.WriteLine($"Crawling: {currentUrl}");

    var currentDocument = web.Load(currentUrl);
    var links = currentDocument.DocumentNode.SelectNodes("//a[@href]");
    if (links == null) continue;

    foreach (var link in links)
    {
        var url = link.GetAttributeValue("href", string.Empty);
        if (!visitedUrls.Contains(url))
        {
            urlsToVisit.Enqueue(url);
        }
    }
}
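
Note that href values are often relative (for example, /page/2), while HtmlWeb.Load expects an absolute URL. Below is a minimal sketch of how you might normalize links before enqueueing them; the ResolveUrl helper is illustrative and not part of any library:

// Illustrative helper: converts a possibly relative href into an absolute
// http(s) URL, or returns null for links that shouldn't be crawled.
static string? ResolveUrl(string currentPageUrl, string href)
{
    if (string.IsNullOrWhiteSpace(href) || href.StartsWith("#"))
        return null;

    // Uri.TryCreate handles hrefs that are already absolute as well as
    // ones that are relative to the current page.
    if (Uri.TryCreate(new Uri(currentPageUrl), href, out var absolute) &&
        (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
    {
        return absolute.ToString();
    }

    return null;
}

In the crawl loop, enqueue ResolveUrl(currentUrl, url) instead of the raw url and skip any null results.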

Step 4: Extracting Data from Pages

Structuring Data

Define a Product class to store the scraped data:

public class Product
{
    public string Name { get; set; }
    public string Price { get; set; }
    public string ImageUrl { get; set; }
}

Scraping Products

Update the crawler to find and process product elements on each page:

// Declare this once, before the crawl loop, so products from every page accumulate.
var products = new List<Product>();

// Inside the crawl loop: SelectNodes returns null on pages with no matching elements.
var productNodes = currentDocument.DocumentNode.SelectNodes("//li[@class='product']");
if (productNodes != null)
{
    foreach (var productNode in productNodes)
    {
        var name = productNode.SelectSingleNode(".//h2").InnerText.Trim();
        var price = productNode.SelectSingleNode(".//span[@class='price']").InnerText.Trim();
        var imageUrl = productNode.SelectSingleNode(".//img").GetAttributeValue("src", string.Empty);

        products.Add(new Product { Name = name, Price = price, ImageUrl = imageUrl });
        Console.WriteLine($"Found product: {name}");
    }
}

Step 5: Saving Data to a CSV File

Use CsvHelper to export the collected product data to a CSV file:

using CsvHelper;
using System.Globalization;
using System.IO;

using (var writer = new StreamWriter("products.csv"))
using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
{
    csv.WriteRecords(products);
}

Run the application to generate a products.csv file with all the scraped data.
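
With CsvHelper’s default configuration, the header row comes from the Product property names, so the file should start with:

Name,Price,ImageUrl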

Step 6: Optimizing the Crawler

  • Parallel Crawling: Crawl multiple pages concurrently using Task.Run or Parallel.ForEachAsync (see the sketch after this list).
  • Handling Dynamic Content: Use PuppeteerSharp for JavaScript-rendered pages.
  • Avoiding Blocks: Rotate user agents, respect robots.txt, and introduce delays.
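
Here is a minimal sketch combining the first and third points: bounded parallel crawling with a polite delay between requests. It uses Parallel.ForEachAsync (a .NET 6+ alternative to hand-rolling Task.Run calls) and assumes an async Main method or top-level statements; treat it as a starting point rather than a drop-in replacement for the crawler above.

using System.Collections.Concurrent;
using HtmlAgilityPack;

// Crawl one "frontier" of URLs at a time, fetching up to 4 pages in parallel.
var visited = new ConcurrentDictionary<string, bool>();
var frontier = new List<string> { "https://example.com" };

while (frontier.Count > 0)
{
    var discovered = new ConcurrentBag<string>();

    await Parallel.ForEachAsync(
        frontier,
        new ParallelOptions { MaxDegreeOfParallelism = 4 },
        async (url, ct) =>
        {
            if (!visited.TryAdd(url, true)) return;

            var document = new HtmlWeb().Load(url);
            Console.WriteLine($"Crawled: {url}");

            var links = document.DocumentNode.SelectNodes("//a[@href]");
            if (links != null)
            {
                foreach (var link in links)
                    discovered.Add(link.GetAttributeValue("href", string.Empty));
            }

            await Task.Delay(1000, ct); // polite delay to avoid hammering the server
        });

    // Next frontier: links not yet visited (resolve relative URLs here as well).
    frontier = discovered
        .Where(u => !string.IsNullOrEmpty(u) && !visited.ContainsKey(u))
        .Distinct()
        .ToList();
}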

Conclusion

Building a web crawler in C# comes down to exploring web pages, pulling out the data you need, and keeping the whole process running smoothly. With this guide, you’re ready to tackle your own web data projects. Good luck and happy crawling!
