Web Scraping with C in 2025

In this guide, I’ll show you how to build a web scraper in C. I’ll walk you through fetching HTML from a webpage, picking out the information you want using XPath, and saving everything into a CSV file. It might sound tricky at first, but with the right tools and steps it’s very doable, even in 2025. Let’s dive in!

Why Use C for Web Scraping in 2025?

Despite being over five decades old, C remains a performance juggernaut. Web scraping with C might seem unconventional, but its advantages are clear:

  • Blazing fast execution: Ideal for scraping high-volume or heavy-content sites.
  • Precise resource control: Minimize memory and CPU usage.
  • Low-level debugging and optimization: Spot bottlenecks with ease.
  • Better integration: C can be used with embedded systems, microcontrollers, or high-performance clusters.

Considering Other Approaches?

While C gives you maximum control and performance, it can require significant setup and maintenance — especially as web technologies evolve. If you need to extract data from dynamic, protected, or rapidly changing sites, modern scraping platforms such as those from Bright Data can streamline the process.

Bright Data and other providers offer solutions like managed proxy networks, browser automation, and prebuilt APIs. These can help automate scraping tasks, handle complex websites, and save development time. Depending on your use case, it may be worth exploring these options alongside your C-based approach.

Step-by-Step: Building a Web Scraper in C

Tools You’ll Need:

  • libcurl — for making HTTP requests
  • libxml2 — for parsing HTML using XPath

Both are available through vcpkg or your system’s package manager; the exact install commands are covered in Step 1 below.

Step 1: Installing Dependencies

Before diving into code, set up your environment.

On Debian/Ubuntu Linux:

sudo apt install libcurl4-openssl-dev libxml2-dev

On macOS with Homebrew:

brew install curl libxml2

On Windows using vcpkg:

vcpkg install curl libxml2

Ensure your development environment is configured to link against these libraries.

Step 2: Setting Up the Basic C Scraper

Create a file called scraper.c. Start with the basic includes and a “Hello World” to verify your setup:

#include <stdio.h>
#include <stdlib.h>  /* malloc, realloc, free — used from Step 3 onwards */
#include <string.h>  /* memcpy, strdup — used from Step 3 onwards */
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

int main() {
    printf("Web Scraping in C - 2025!\n");
    return 0;
}

Compile with:

gcc scraper.c -o scraper -lcurl -lxml2

If everything works, you’re ready to go.

Step 3: Fetching Web Pages with libcurl

Next, write a function to perform HTTP GET requests and capture HTML:

// Growable buffer that accumulates the response body.
struct CURLResponse {
    char *html;
    size_t size;
};

// libcurl write callback: appends each received chunk to the buffer.
static size_t WriteHTMLCallback(void *contents, size_t size, size_t nmemb, void *userp) {
    size_t realsize = size * nmemb;
    struct CURLResponse *mem = (struct CURLResponse *)userp;
    char *ptr = realloc(mem->html, mem->size + realsize + 1);
    if (!ptr) return 0; // out of memory: abort the transfer
    mem->html = ptr;
    memcpy(&(mem->html[mem->size]), contents, realsize);
    mem->size += realsize;
    mem->html[mem->size] = 0; // keep the buffer NUL-terminated
    return realsize;
}

// Performs an HTTP GET and returns the response body (caller frees .html).
struct CURLResponse GetHTML(const char *url) {
    CURL *curl = curl_easy_init();
    struct CURLResponse res = {.html = malloc(1), .size = 0};
    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteHTMLCallback);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, (void *)&res);
    curl_easy_setopt(curl, CURLOPT_USERAGENT, "Mozilla/5.0");
    curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    return res;
}

Now you can call GetHTML("https://example.com") and retrieve the raw HTML.
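To verify the fetcher works, here’s a minimal temporary test harness (curl_global_init/curl_global_cleanup wrap all libcurl usage, as they will in the full program in Step 8):

int main() {
    curl_global_init(CURL_GLOBAL_ALL);
    struct CURLResponse res = GetHTML("https://example.com");
    printf("Fetched %zu bytes of HTML\n", res.size);
    free(res.html); // GetHTML's buffer is heap-allocated
    curl_global_cleanup();
    return 0;
}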

Step 4: Parsing HTML with libxml2

Let’s extract meaningful content from the fetched HTML using XPath queries.

// Parse raw HTML into a libxml2 document tree.
htmlDocPtr ParseHTML(const char *html, size_t size) {
    return htmlReadMemory(html, (int)size, NULL, NULL, HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
}

// Evaluate an XPath expression against the whole document.
// The caller frees the result with xmlXPathFreeObject().
xmlXPathObjectPtr GetNodesByXPath(htmlDocPtr doc, const char *xpathExpr) {
    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    xmlXPathObjectPtr result = xmlXPathEvalExpression((xmlChar *)xpathExpr, context);
    xmlXPathFreeContext(context); // the result's nodes still belong to doc
    return result;
}
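As a quick sanity check, here’s how these helpers combine with GetHTML() to count the links on a page. This is a minimal sketch; in a full program these lines live inside main() between curl_global_init() and curl_global_cleanup():

struct CURLResponse res = GetHTML("https://example.com");
htmlDocPtr doc = ParseHTML(res.html, res.size);
xmlXPathObjectPtr links = GetNodesByXPath(doc, "//a");
if (links && links->nodesetval)
    printf("Found %d links\n", links->nodesetval->nodeNr);
xmlXPathFreeObject(links);
xmlFreeDoc(doc);
free(res.html);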

Step 5: Define the Data Structure

Use a struct to store product data:

typedef struct {
    char *url;
    char *image;
    char *name;
    char *price;
} Product;

// Upper bound on products across all pages (adjust to your target site).
#define MAX_PRODUCTS 192

Step 6: Scraping Product Data

Suppose you’re scraping an e-commerce demo site with structured HTML. Here’s how to extract product information:

// Helper: evaluate an XPath relative to the context's current node and
// return the first match (or NULL). Frees the intermediate XPath result.
static xmlNodePtr FirstNode(xmlXPathContextPtr context, const char *expr) {
    xmlXPathObjectPtr obj = xmlXPathEvalExpression((xmlChar *)expr, context);
    xmlNodePtr node = NULL;
    if (obj && obj->nodesetval && obj->nodesetval->nodeNr > 0)
        node = obj->nodesetval->nodeTab[0];
    if (obj) xmlXPathFreeObject(obj);
    return node;
}

int ExtractProducts(htmlDocPtr doc, Product *products, int startIndex) {
    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    xmlXPathObjectPtr items = xmlXPathEvalExpression((xmlChar *)"//li[contains(@class,'product')]", context);
    int count = 0;
    if (items == NULL || items->nodesetval == NULL) {
        xmlXPathFreeContext(context);
        return 0;
    }
    for (int i = 0; i < items->nodesetval->nodeNr; ++i) {
        // Make the relative ".//..." queries below start from this product node.
        xmlXPathSetContextNode(items->nodesetval->nodeTab[i], context);
        char *url = (char *)xmlGetProp(FirstNode(context, ".//a"), (xmlChar *)"href");
        char *img = (char *)xmlGetProp(FirstNode(context, ".//a/img"), (xmlChar *)"src");
        char *name = (char *)xmlNodeGetContent(FirstNode(context, ".//a/h2"));
        char *price = (char *)xmlNodeGetContent(FirstNode(context, ".//a/span"));
        products[startIndex + count].url = strdup(url ? url : "");
        products[startIndex + count].image = strdup(img ? img : "");
        products[startIndex + count].name = strdup(name ? name : "");
        products[startIndex + count].price = strdup(price ? price : "");
        count++;
        // libxml2 allocations are released with xmlFree.
        xmlFree(url); xmlFree(img); xmlFree(name); xmlFree(price);
    }
    xmlXPathFreeContext(context);
    xmlXPathFreeObject(items);
    return count;
}

Step 7: Exporting Data to CSV

Once you’ve scraped the data, export it:

void ExportToCSV(Product *products, int count) {
    FILE *fp = fopen("products.csv", "w");
    if (!fp) return;
    fprintf(fp, "url,image,name,price\n");
    for (int i = 0; i < count; i++) {
        // Note: fields are quoted, but embedded double quotes are not escaped.
        fprintf(fp, "\"%s\",\"%s\",\"%s\",\"%s\"\n",
                products[i].url, products[i].image, products[i].name, products[i].price);
    }
    fclose(fp);
}
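After a run, products.csv will look something like this (the values here are illustrative, not from a real site):

url,image,name,price
"https://example.com/product/1","https://example.com/img/1.jpg","Sample Product","$9.99"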

Step 8: Crawling Multiple Pages

Many sites paginate products. For example:

https://example.com/products/page/1/

https://example.com/products/page/2/

Loop through these pages:

#define TOTAL_PAGES 5

int main() {
    curl_global_init(CURL_GLOBAL_ALL);
    Product products[MAX_PRODUCTS];
    int totalCount = 0;
    for (int page = 1; page <= TOTAL_PAGES; ++page) {
        char url[256];
        snprintf(url, sizeof(url), "https://example.com/products/page/%d/", page);
        struct CURLResponse res = GetHTML(url);
        htmlDocPtr doc = ParseHTML(res.html, res.size);
        int count = ExtractProducts(doc, products, totalCount);
        totalCount += count;
        xmlFreeDoc(doc);
        free(res.html);
    }
    ExportToCSV(products, totalCount);
    // Release the strings duplicated in ExtractProducts().
    for (int i = 0; i < totalCount; i++) {
        free(products[i].url);
        free(products[i].image);
        free(products[i].name);
        free(products[i].price);
    }
    curl_global_cleanup();
    return 0;
}
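Compile and run the complete scraper the same way as in Step 2:

gcc scraper.c -o scraper -lcurl -lxml2
./scraper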

Anti-Scraping Measures in 2025

Websites in 2025 are smarter. You must be mindful of:

  • Rate limits: Add delays between requests (see the sketch after this list). Alternatively, check my list of the best rotating proxies.
  • Headers: Use realistic User-Agent, Accept, and Referer values.
  • Session cookies: Maintain cookies across requests if necessary.
  • CAPTCHAs: These are harder to bypass in C without browser automation. Check my article on the best CAPTCHA solving tools.
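For rate limiting, even a small randomized pause between page fetches helps. A minimal sketch using POSIX sleep() (on Windows, use Sleep() from windows.h instead; the 2–5 second range is an arbitrary example):

#include <stdlib.h> /* rand() */
#include <unistd.h> /* sleep() */

// Call once per iteration of the pagination loop in main().
static void PoliteDelay(void) {
    sleep(2 + rand() % 4); // wait 2-5 seconds between requests
}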

Here’s how to send more realistic browser headers with libcurl: build a header list, attach it before the request, and free it afterwards:

struct curl_slist *headers = NULL;
headers = curl_slist_append(headers, "User-Agent: Mozilla/5.0");
headers = curl_slist_append(headers, "Accept: text/html");
curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
/* ... curl_easy_perform(curl) ... */
curl_slist_free_all(headers);

Future of Web Scraping in C

Although dynamic sites that rely on JavaScript are not natively supported in C, since there is no headless browser engine to drive, there are two ways around it:

  1. Use intermediate rendering services (HTML snapshot providers).
  2. Offload JavaScript rendering to other services and only process the resulting raw HTML in C (see the sketch below).

There are emerging projects exploring lightweight headless rendering in C or C++, but C is best used for static or semi-static sites where raw performance is required.
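As a sketch of that second option, you can point the GetHTML() fetcher from Step 3 at a rendering service’s endpoint instead of the target site. The endpoint URL below is hypothetical; real services differ and typically require an API key:

// Hypothetical rendering-service endpoint; real providers differ.
char api[512];
snprintf(api, sizeof(api),
         "https://render.example-service.com/html?url=%s",
         "https://example.com/products/page/1/");
struct CURLResponse res = GetHTML(api); // res.html is the fully rendered HTML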

Conclusion

Web scraping with C in 2025 is a compelling yet advanced practice. It grants unparalleled speed and control, making it ideal for specialized applications where every byte of memory and every millisecond counts. By leveraging powerful libraries like libcurl and libxml2, and adhering to responsible scraping principles, you can build robust and efficient scraping solutions in C.

Whether you’re scraping ecommerce pages, collecting data for research, or embedding scraping logic into a constrained environment, C remains a high-performance tool in your developer toolkit.
