Building a Web Crawler in C#: Step-by-Step Tutorial
In this article, I’ll show you how to build a web crawler in C#. We’ll start from scratch and go step by step. By the end, you’ll have an efficient, scalable crawler ready to collect the data you need.
Let’s get started!
What Is a Web Crawler?
A web crawler, also known as a spider or bot, is an automated program that systematically navigates web pages, discovers links, and gathers data. Unlike web scraping, which targets specific data for extraction, web crawling focuses on navigating websites and building a structural map of their content. Crawlers can also integrate scraping functionality to extract relevant data while exploring links.
Compare web crawling with web scraping here.
Alternative to Building a Web Crawler
If building and maintaining a web crawler feels overwhelming, Bright Data offers powerful alternatives to simplify your workflow. Use the Web Scraper API for hassle-free, structured data extraction or access ready-to-use datasets tailored to your needs. These solutions save time, scale effortlessly, and include features like CAPTCHA solving, IP rotation, and compliance with privacy laws — letting you focus on analyzing data, not collecting it.
I am not affiliated with Bright Data; it’s just a suggestion.
Prerequisites for Building a Web Crawler in C#
Before starting, ensure you have the following tools and libraries:
- .NET SDK (Version 8 or Later): Download and install the latest version from the official Microsoft .NET website.
- IDE: Use Visual Studio 2022 or Visual Studio Code with the C# extension.
- NuGet Package Manager: Included with Visual Studio and used to install dependencies like Html Agility Pack and CsvHelper.
Step 1: Setting Up the Environment
Start by creating a new console application:
mkdir web-crawler
cd web-crawler
dotnet new console --framework net8.0
Installing Dependencies
Add the following libraries using NuGet:
- Html Agility Pack: For parsing HTML.
dotnet add package HtmlAgilityPack
- Html Agility Pack CSS Selectors: Simplifies selecting elements using CSS selectors.
dotnet add package HtmlAgilityPack.CssSelectors
- CsvHelper: For exporting data to CSV files.
dotnet add package CsvHelper
Step 2: Writing the Basic Crawler
Loading a Web Page
Set up the program to fetch and parse a webpage:
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        var web = new HtmlWeb();
        var document = web.Load("https://example.com");
        Console.WriteLine("Page loaded successfully!");
    }
}
Run the application with:
dotnet run
Discovering Links
Expand the code to identify links on the page. Use HtmlAgilityPack to locate all <a> elements and extract their href attributes:
var links = document.DocumentNode.SelectNodes("//a[@href]");

// SelectNodes returns null when there are no matches, so guard before iterating
if (links != null)
{
    foreach (var link in links)
    {
        var url = link.GetAttributeValue("href", string.Empty);
        Console.WriteLine($"Found URL: {url}");
    }
}
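Keep in mind that many href values are relative (for example, /about), so they can’t be loaded directly. Below is a minimal sketch of how you might normalize them against the page’s base URL with System.Uri; the baseUri value is an assumption for illustration:

// Assumed base URI of the page that was just loaded
var baseUri = new Uri("https://example.com");

if (links != null)
{
    foreach (var link in links)
    {
        var href = link.GetAttributeValue("href", string.Empty);

        // Resolve relative paths (e.g. "/about") against the base URI;
        // TryCreate also skips values that cannot be parsed as URLs
        if (Uri.TryCreate(baseUri, href, out var absoluteUri))
        {
            Console.WriteLine($"Absolute URL: {absoluteUri}");
        }
    }
}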
Step 3: Managing the Crawling Process
To crawl multiple pages systematically, maintain a queue of URLs to visit and a list of discovered URLs to avoid duplication.
Implementing URL Queueing
Use a Queue<string> for URLs to visit and a HashSet<string> to track visited URLs:
var urlsToVisit = new Queue<string>();
var visitedUrls = new HashSet<string>();

urlsToVisit.Enqueue("https://example.com");

while (urlsToVisit.Count > 0)
{
    var currentUrl = urlsToVisit.Dequeue();
    if (visitedUrls.Contains(currentUrl)) continue;

    visitedUrls.Add(currentUrl);
    Console.WriteLine($"Crawling: {currentUrl}");

    var currentDocument = web.Load(currentUrl);
    var links = currentDocument.DocumentNode.SelectNodes("//a[@href]");
    if (links == null) continue;

    foreach (var link in links)
    {
        var url = link.GetAttributeValue("href", string.Empty);
        if (!visitedUrls.Contains(url))
        {
            urlsToVisit.Enqueue(url);
        }
    }
}
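As written, the loop follows every link it finds, including links to other sites, and relative URLs will fail to load. One way to keep the crawl bounded is to resolve each link against the current page and only enqueue absolute URLs on the same host, up to a maximum page count. Here is a minimal sketch; maxPages and the same-host check are assumptions you can adjust:

// Decide whether a discovered link should be queued.
// startUri and maxPages are illustrative assumptions, not fixed requirements.
static bool ShouldEnqueue(string href, Uri currentUri, Uri startUri,
                          HashSet<string> visited, int maxPages)
{
    if (visited.Count >= maxPages) return false;                      // cap the crawl size
    if (!Uri.TryCreate(currentUri, href, out var absolute)) return false;
    if (absolute.Host != startUri.Host) return false;                 // stay on the start site
    return !visited.Contains(absolute.AbsoluteUri);
}

You would call this where the loop currently enqueues url, passing new Uri(currentUrl) as currentUri and the start URL as startUri, and enqueue absolute.AbsoluteUri instead of the raw href.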
Step 4: Extracting Data from Pages
Structuring Data
Define a Product class to store the scraped data:
public class Product
{
    public string Name { get; set; }
    public string Price { get; set; }
    public string ImageUrl { get; set; }
}
Scraping Products
Update the crawler to find and process product elements on each page:
var products = new List<Product>();

// SelectNodes returns null on pages without matching products, so check first
var productNodes = currentDocument.DocumentNode.SelectNodes("//li[@class='product']");
if (productNodes != null)
{
    foreach (var productNode in productNodes)
    {
        var name = productNode.SelectSingleNode(".//h2").InnerText.Trim();
        var price = productNode.SelectSingleNode(".//span[@class='price']").InnerText.Trim();
        var imageUrl = productNode.SelectSingleNode(".//img").GetAttributeValue("src", string.Empty);

        products.Add(new Product { Name = name, Price = price, ImageUrl = imageUrl });
        Console.WriteLine($"Found product: {name}");
    }
}
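Since the HtmlAgilityPack.CssSelectors package was installed in Step 1, you can also express these lookups as CSS selectors instead of XPath. A rough sketch of the same extraction, assuming the package’s QuerySelector/QuerySelectorAll extension methods; the class names (product, price) are the same placeholders as above:

// Same extraction using CSS selectors (requires HtmlAgilityPack.CssSelectors)
foreach (var productNode in currentDocument.DocumentNode.QuerySelectorAll("li.product"))
{
    var name = productNode.QuerySelector("h2")?.InnerText.Trim();
    var price = productNode.QuerySelector("span.price")?.InnerText.Trim();
    var imageUrl = productNode.QuerySelector("img")?.GetAttributeValue("src", string.Empty);

    // Skip products that are missing any of the expected elements
    if (name != null && price != null && imageUrl != null)
    {
        products.Add(new Product { Name = name, Price = price, ImageUrl = imageUrl });
    }
}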
Step 5: Saving Data to a CSV File
Use CsvHelper to export the collected product data to a CSV file:
using CsvHelper;
using System.Globalization;
using System.IO;

using (var writer = new StreamWriter("products.csv"))
using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
{
    csv.WriteRecords(products);
}
Run the application to generate a products.csv file with all the scraped data.
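If you want control over the CSV column names or their order, CsvHelper supports class maps. A brief sketch, assuming the default CsvHelper configuration; ProductMap and the column names are introduced here purely for illustration:

using CsvHelper.Configuration;

// Maps Product properties to explicit CSV column names (illustrative names)
public sealed class ProductMap : ClassMap<Product>
{
    public ProductMap()
    {
        Map(p => p.Name).Name("product_name");
        Map(p => p.Price).Name("price");
        Map(p => p.ImageUrl).Name("image_url");
    }
}

Register the map before writing with csv.Context.RegisterClassMap<ProductMap>(); and WriteRecords will use those headers.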
Step 6: Optimizing the Crawler
- Parallel Crawling: Crawl multiple pages concurrently using tasks (see the sketch after this list).
- Handling Dynamic Content: Use PuppeteerSharp for JavaScript-rendered pages.
- Avoiding Blocks: Rotate user agents, respect robots.txt, and introduce delays.
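As a rough illustration of the first and last points, here is a sketch that fetches a small batch of URLs concurrently while capping concurrency and pausing between requests. The user agent string, the delay, the concurrency limit, and the example URLs are all assumptions for illustration, not values from this tutorial:

using HtmlAgilityPack;

var httpClient = new HttpClient();
// Assumed user agent; identify your crawler honestly
httpClient.DefaultRequestHeaders.UserAgent.ParseAdd("MyCrawler/1.0");

var semaphore = new SemaphoreSlim(3);   // at most 3 concurrent requests (assumed)
var urls = new[] { "https://example.com", "https://example.com" }; // placeholder URLs

var tasks = urls.Select(async url =>
{
    await semaphore.WaitAsync();
    try
    {
        var html = await httpClient.GetStringAsync(url);
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        Console.WriteLine($"Fetched {url}: {doc.DocumentNode.SelectNodes("//a[@href]")?.Count ?? 0} links");

        await Task.Delay(1000);          // 1-second politeness delay (assumed)
    }
    finally
    {
        semaphore.Release();
    }
});

await Task.WhenAll(tasks);

The SemaphoreSlim bounds how many requests run at once, while Task.Delay spaces them out so the target site isn’t hammered.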
Conclusion
Building a web crawler in C# is all about exploring web pages, pulling out the data you need, and ensuring it runs smoothly. With this guide, you’ll be ready to tackle any web data project. Good luck and happy crawling!