Web Scraping with C
C is one of the most efficient programming languages on the planet, and its
performance makes it ideal for web scraping, which involves tons of pages or very
large ones! In this step-by-step tutorial, you'll learn how to do web scraping in C
with the libcurl and libxml2 libraries.
C is a viable option for web scraping especially when resource usage is critical: thanks to the low-level nature of the language, a scraping application written in C can achieve extreme performance.
Learn more about the best programming languages for web scraping.
The target site is the ScrapingCourse e-commerce demo, and the C scraper you're going to build will be able to retrieve all product data from each page of the site.
To get started, install the C package manager vcpkg and set it up in Visual Studio as explained in the official guide. This package manager allows you to install the dependencies required to build a web scraper in C:
libcurl helps you retrieve the HTML of the target pages that you can then parse
with libxml2 to extract the desired data.
To install libcurl and libxml2, run the command below in the root folder of the project (the port names assume vcpkg's classic mode):
Terminal
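vcpkg install curl libxml2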
Fantastic! You're now fully set up.
Time to initialize your web scraping C script. Create a scraper.c file in your project as follows. This is the simplest possible C program, but its main() function will soon contain some scraping logic.
scraper.c
#include <stdio.h>
#include <stdlib.h>
int main() {
printf("Hello, World!\n");
return 0;
}
Import the two libraries installed earlier by adding the three lines below at the top of the scraper.c file. The first import is for libcurl, while the other two come from libxml2. In detail, HTMLparser.h exposes functions to parse an HTML document, and xpath.h lets you select the desired elements from it.
scraper.c
#include <curl/curl.h>
#include "libxml/HTMLparser.h"
#include "libxml/xpath.h"
Great! You're now ready to learn the basics of web scraping with C!
Start by defining a struct to accumulate the HTML returned by the server, a libcurl write callback that appends each chunk of the response to that struct, and a GetRequest() helper that configures the curl handle and performs the GET request. This is how (the callback follows the standard libcurl pattern):
scraper.c
struct CURLResponse
{
    char *html;
    size_t size;
};

static size_t
WriteHTMLCallback(void *contents, size_t size, size_t nmemb, void *userp)
{
    size_t realsize = size * nmemb;
    struct CURLResponse *mem = (struct CURLResponse *)userp;
    // grow the buffer to fit the new chunk plus a null terminator
    char *ptr = realloc(mem->html, mem->size + realsize + 1);
    if (!ptr)
    {
        printf("Not enough memory available (realloc returned NULL)\n");
        return 0;
    }
    mem->html = ptr;
    memcpy(&(mem->html[mem->size]), contents, realsize);
    mem->size += realsize;
    mem->html[mem->size] = 0;
    return realsize;
}

struct CURLResponse
GetRequest(CURL *curl_handle, const char *url)
{
    CURLcode res;
    struct CURLResponse response;
    // initialize the response
    response.html = malloc(1);
    response.size = 0;
    // specify the URL to GET
    curl_easy_setopt(curl_handle, CURLOPT_URL, url);
    // send all data returned by the server to WriteHTMLCallback
    curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, WriteHTMLCallback);
    // pass the response struct to the callback
    curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, (void *)&response);
    // perform the GET request
    res = curl_easy_perform(curl_handle);
    // check for request errors
    if (res != CURLE_OK)
    {
        fprintf(stderr, "GET request failed: %s\n", curl_easy_strerror(res));
    }
    return response;
}
Now, use GetRequest() in the main() function of scraper.c to retrieve the target
HTML document as a char*:
scraper.c
#include <stdio.h>
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>
// ...
// struct CURLResponse GetRequest(CURL *curl_handle, const char *url)
// ...
int main(void)
{
    // initialize curl globally
    curl_global_init(CURL_GLOBAL_ALL);
    // initialize a curl instance
    CURL *curl_handle = curl_easy_init();
    // retrieve the HTML document of the target page
    struct CURLResponse response = GetRequest(curl_handle, "https://www.scrapingcourse.com/ecommerce/");
    // print the HTML
    printf("%s\n", response.html);
    // free up resources
    free(response.html);
    curl_easy_cleanup(curl_handle);
    curl_global_cleanup();
    return 0;
}
Compile and run the script, and it'll produce the following output in your terminal:
Output
<!DOCTYPE html>
<html lang="en-US">
<head>
<!--- ... --->
</ul>
</body>
</html>
After retrieving the HTML code, feed it to libxml2: htmlReadMemory() parses the HTML char * content and produces a tree you can explore via XPath expressions. A minimal parsing step, assuming the default libxml2 options, looks like this:
scraper.c
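// parse the HTML document returned by the server
htmlDocPtr doc = htmlReadMemory(response.html, (int)response.size, NULL, NULL, HTML_PARSE_NOERROR);
// initialize the XPath context for the document
xmlXPathContextPtr context = xmlXPathNewContext(doc);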
The next step is to define an effective selection strategy. To do so, you need to
inspect the target site and familiarize yourself with its structure.
Open the target site in the browser, right-click on a product HTML node, and choose the "Inspect" option to open the DevTools window.
Take a look at the DOM of the page and note that all products are <li> elements
with the product class. Thus, you can retrieve them all with the XPath query
below:
scraper.c
//li[contains(@class, 'product')]
scraper.c
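// get the product HTML elements on the page
xmlXPathObjectPtr productHTMLElements = xmlXPathEvalExpression((xmlChar *)"//li[contains(@class, 'product')]", context);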
Note
You can learn about XPath for web scraping in our tutorial.
To scrape that data and keep track of it, define a custom data structure on top of
scraper.c with typedef. C doesn't support classes but has structs, collections of
data fields grouped under the same name.
scraper.c
typedef struct
{
char *url;
char *image;
char *name;
char *price;
} Product;
There are many elements on a single pagination page, so you'll need an array of
Product:
scraper.c
Product products[MAX_PRODUCTS];
MAX_PRODUCTS is a macro that sets the maximum number of products to store. Define it at the top of scraper.c:
scraper.c
#define MAX_PRODUCTS 16
Time to iterate over the selected product nodes and extract the desired info from each of them. At the end of the for loop, products will contain all the product data of interest!
scraper.c
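// counter for the number of scraped products
int productCount = 0;
// iterate over the product elements and scrape them all
for (int i = 0; i < productHTMLElements->nodesetval->nodeNr; ++i)
{
    // get the current element of the loop
    xmlNodePtr productHTMLElement = productHTMLElements->nodesetval->nodeTab[i];
    // set the context to restrict XPath selectors
    // to the children of the current element
    xmlXPathSetContextNode(productHTMLElement, context);
    // extract the data of interest with relative XPath selectors
    xmlNodePtr urlHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a", context)->nodesetval->nodeTab[0];
    char *url = (char *)xmlGetProp(urlHTMLElement, (xmlChar *)"href");
    xmlNodePtr imageHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/img", context)->nodesetval->nodeTab[0];
    char *image = (char *)xmlGetProp(imageHTMLElement, (xmlChar *)"src");
    xmlNodePtr nameHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/h2", context)->nodesetval->nodeTab[0];
    char *name = (char *)xmlNodeGetContent(nameHTMLElement);
    xmlNodePtr priceHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/span", context)->nodesetval->nodeTab[0];
    char *price = (char *)xmlNodeGetContent(priceHTMLElement);
    // store the scraped data in a Product instance
    Product product;
    product.url = url;
    product.image = image;
    product.name = name;
    product.price = price;
    // add the product to the array and increment the counter
    products[productCount] = product;
    productCount++;
}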
After the loop, remember to free up the resources you allocated along the way:
scraper.c
free(response.html);
// free up libxml2 resources
xmlXPathFreeContext(context);
xmlFreeDoc(doc);
xmlCleanupParser();
Here's what scraper.c should look like so far:
scraper.c
#include <stdio.h>
#include <stdlib.h>
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>
#define MAX_PRODUCTS 16
// ...
// struct CURLResponse GetRequest(CURL *curl_handle, const char *url)
// and the Product typedef as defined above
// ...
int main(void)
{
    // initialize curl globally
    curl_global_init(CURL_GLOBAL_ALL);
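    // initialize a curl instance
    CURL *curl_handle = curl_easy_init();
    // retrieve the HTML document of the target page
    struct CURLResponse response = GetRequest(curl_handle, "https://www.scrapingcourse.com/ecommerce/");
    // parse the HTML returned by the server
    htmlDocPtr doc = htmlReadMemory(response.html, (int)response.size, NULL, NULL, HTML_PARSE_NOERROR);
    // initialize the XPath context
    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    // get the product HTML elements on the page
    xmlXPathObjectPtr productHTMLElements = xmlXPathEvalExpression((xmlChar *)"//li[contains(@class, 'product')]", context);
    // where to store the scraped data
    Product products[MAX_PRODUCTS];
    int productCount = 0;
    // iterate over the product elements and scrape them all
    for (int i = 0; i < productHTMLElements->nodesetval->nodeNr; ++i)
    {
        // get the current element of the loop
        xmlNodePtr productHTMLElement = productHTMLElements->nodesetval->nodeTab[i];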
        // set the context to restrict XPath selectors
        // to the children of the current element
        xmlXPathSetContextNode(productHTMLElement, context);
        xmlNodePtr urlHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a", context)->nodesetval->nodeTab[0];
        char *url = (char *)xmlGetProp(urlHTMLElement, (xmlChar *)"href");
        xmlNodePtr imageHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/img", context)->nodesetval->nodeTab[0];
        char *image = (char *)xmlGetProp(imageHTMLElement, (xmlChar *)"src");
        xmlNodePtr nameHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/h2", context)->nodesetval->nodeTab[0];
        char *name = (char *)xmlNodeGetContent(nameHTMLElement);
        xmlNodePtr priceHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/span", context)->nodesetval->nodeTab[0];
        char *price = (char *)xmlNodeGetContent(priceHTMLElement);
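        // store the scraped data in a Product instance
        Product product;
        product.url = url;
        product.image = image;
        product.name = name;
        product.price = price;
        // add the product to the array and increment the counter
        products[productCount] = product;
        productCount++;
    }
    // free up the allocated resources
    free(response.html);
    // free up libxml2 resources
    xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    xmlCleanupParser();
    // cleanup the curl instance and resources
    curl_easy_cleanup(curl_handle);
    curl_global_cleanup();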
    return 0;
}
Good job! Now that you know how to extract data from HTML in C, it only remains to export the scraped data. You'll see that in the next step, together with the final code.
Step 4: Export Data to CSV
Right now, the scraped data is stored in an array of C structs. That's not the best
format to share data with other users. Instead, export it to a more useful format,
such as CSV.
And you don't even need an extra library to achieve that. All you have to do is open
a .csv file, convert Product instances to CSV records, and append them to the
file:
scraper.c
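// open the CSV output file
FILE *csvFile = fopen("products.csv", "w");
if (csvFile == NULL)
{
    perror("Failed to open the CSV output file!");
    return 1;
}
// write the CSV header
fprintf(csvFile, "url,image,name,price\n");
// write each product's data to the CSV file
for (int i = 0; i < productCount; i++)
{
    fprintf(csvFile, "%s,%s,%s,%s\n", products[i].url, products[i].image, products[i].name, products[i].price);
}
// close the CSV file
fclose(csvFile);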
Put it all together, and you'll get the final code for your C web scraping script:
scraper.c
#include <stdio.h>
#include <stdlib.h>
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>
#define MAX_PRODUCTS 16
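// custom data structure for the scraped product data
typedef struct
{
    char *url;
    char *image;
    char *name;
    char *price;
} Product;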
struct CURLResponse
{
    char *html;
    size_t size;
};
static size_t
WriteHTMLCallback(void *contents, size_t size, size_t nmemb, void *userp)
{
    size_t realsize = size * nmemb;
    struct CURLResponse *mem = (struct CURLResponse *)userp;
    // grow the buffer to fit the new chunk plus a null terminator
    char *ptr = realloc(mem->html, mem->size + realsize + 1);
    if (!ptr)
    {
        printf("Not enough memory available (realloc returned NULL)\n");
        return 0;
    }
    mem->html = ptr;
    memcpy(&(mem->html[mem->size]), contents, realsize);
    mem->size += realsize;
    mem->html[mem->size] = 0;
    return realsize;
}
struct CURLResponse
GetRequest(CURL *curl_handle, const char *url)
{
    CURLcode res;
    struct CURLResponse response;
    // initialize the response
    response.html = malloc(1);
    response.size = 0;
    // specify the URL to GET
    curl_easy_setopt(curl_handle, CURLOPT_URL, url);
    // send all data returned by the server to WriteHTMLCallback
    curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, WriteHTMLCallback);
    // pass the response struct to the callback
    curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, (void *)&response);
    // perform the GET request
    res = curl_easy_perform(curl_handle);
    // check for request errors
    if (res != CURLE_OK)
    {
        fprintf(stderr, "GET request failed: %s\n", curl_easy_strerror(res));
    }
    return response;
}
int main(void)
{
    // initialize curl globally
    curl_global_init(CURL_GLOBAL_ALL);
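    // initialize a curl instance
    CURL *curl_handle = curl_easy_init();
    // retrieve the HTML document of the target page
    struct CURLResponse response = GetRequest(curl_handle, "https://www.scrapingcourse.com/ecommerce/");
    // parse the HTML returned by the server
    htmlDocPtr doc = htmlReadMemory(response.html, (int)response.size, NULL, NULL, HTML_PARSE_NOERROR);
    // initialize the XPath context
    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    // get the product HTML elements on the page
    xmlXPathObjectPtr productHTMLElements = xmlXPathEvalExpression((xmlChar *)"//li[contains(@class, 'product')]", context);
    // where to store the scraped data
    Product products[MAX_PRODUCTS];
    int productCount = 0;
    // iterate over the product elements and scrape them all
    for (int i = 0; i < productHTMLElements->nodesetval->nodeNr; ++i)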
    {
        // get the current element of the loop
        xmlNodePtr productHTMLElement = productHTMLElements->nodesetval->nodeTab[i];
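        // set the context to restrict XPath selectors
        // to the children of the current element
        xmlXPathSetContextNode(productHTMLElement, context);
        xmlNodePtr urlHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a", context)->nodesetval->nodeTab[0];
        char *url = (char *)xmlGetProp(urlHTMLElement, (xmlChar *)"href");
        xmlNodePtr imageHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/img", context)->nodesetval->nodeTab[0];
        char *image = (char *)xmlGetProp(imageHTMLElement, (xmlChar *)"src");
        xmlNodePtr nameHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/h2", context)->nodesetval->nodeTab[0];
        char *name = (char *)xmlNodeGetContent(nameHTMLElement);
        xmlNodePtr priceHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/span", context)->nodesetval->nodeTab[0];
        char *price = (char *)xmlNodeGetContent(priceHTMLElement);
        // store the scraped data in a Product instance
        Product product;
        product.url = url;
        product.image = image;
        product.name = name;
        product.price = price;
        // add the product to the array and increment the counter
        products[productCount] = product;
        productCount++;
    }
    // free up the allocated resources
    free(response.html);
    // free up libxml2 resources
    xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    xmlCleanupParser();
    // open the CSV output file
    FILE *csvFile = fopen("products.csv", "w");
    if (csvFile == NULL)
    {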
        perror("Failed to open the CSV output file!");
        return 1;
    }
    // write the CSV header
    fprintf(csvFile, "url,image,name,price\n");
    // write each product's data to the CSV file
    for (int i = 0; i < productCount; i++)
    {
        fprintf(csvFile, "%s,%s,%s,%s\n", products[i].url, products[i].image, products[i].name, products[i].price);
    }
    // close the CSV file
    fclose(csvFile);
    // cleanup curl resources
    curl_easy_cleanup(curl_handle);
    curl_global_cleanup();
    return 0;
}
Compile the scraper application and run it. A products.csv file will appear in your project's folder, containing the products from the first page of ScrapingCourse.
Amazing! You just learned the basics of web scraping with C, but there's still a lot to learn. For example, you still have to get the data from the remaining e-commerce pages. Keep reading to become a C data scraping expert.
Advanced Web Scraping in C
Scraping requires more than the basics. Time to dig into the advanced concepts of
web scraping in C!
The script built above retrieves products from a single page. However, the target site consists of several pages. To scrape them all, you need to go through each of them via web crawling. In other words, you have to discover all the links on the site and visit them automatically, which involves sets, support data structures, and custom logic to avoid visiting a page twice.
Implementing web crawling with automatic page discovery in C is possible but also
complex and error-prone. To avoid a headache, you should go for a smart
approach. Take a look at the URLs of the pagination pages. These all follow the
format below:
https://www.scrapingcourse.com/ecommerce/page/<page>/
As there are 12 pages on the site, scrape them all by applying the following
scraping logic to each pagination URL:
scraper.c
// iterate over the pages to scrape
for (int page = 1; page <= NUM_PAGES; ++page)
{
    // build the URL of the current pagination page
    // (the 256-byte buffer is an assumption large enough for these URLs)
    char url[256];
    snprintf(url, sizeof(url), "https://www.scrapingcourse.com/ecommerce/page/%d/", page);
    // retrieve the HTML of the current page
    struct CURLResponse response = GetRequest(curl_handle, url);
    // scraping logic...
}
NUM_PAGES is a macro that contains the number of pages the spider will visit. You'll
also need to adapt MAX_PRODUCTS accordingly:
scraper.c
#define NUM_PAGES 12
#define MAX_PRODUCTS (NUM_PAGES * 16)
Here's the complete code of the crawler-enabled scraper:
scraper.c
#include <stdio.h>
#include <stdlib.h>
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>
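#define NUM_PAGES 12
#define MAX_PRODUCTS (NUM_PAGES * 16)
// custom data structure for the scraped product data
typedef struct
{
    char *url;
    char *image;
    char *name;
    char *price;
} Product;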
struct CURLResponse
{
    char *html;
    size_t size;
};
static size_t
WriteHTMLCallback(void *contents, size_t size, size_t nmemb, void *userp)
{
    size_t realsize = size * nmemb;
    struct CURLResponse *mem = (struct CURLResponse *)userp;
    // grow the buffer to fit the new chunk plus a null terminator
    char *ptr = realloc(mem->html, mem->size + realsize + 1);
    if (!ptr)
    {
        printf("Not enough memory available (realloc returned NULL)\n");
        return 0;
    }
    mem->html = ptr;
    memcpy(&(mem->html[mem->size]), contents, realsize);
    mem->size += realsize;
    mem->html[mem->size] = 0;
    return realsize;
}
struct CURLResponse
GetRequest(CURL *curl_handle, const char *url)
{
    CURLcode res;
    struct CURLResponse response;
    // initialize the response
    response.html = malloc(1);
    response.size = 0;
    // specify the URL to GET
    curl_easy_setopt(curl_handle, CURLOPT_URL, url);
    // send all data returned by the server to WriteHTMLCallback
    curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, WriteHTMLCallback);
    // pass the response struct to the callback
    curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, (void *)&response);
    // perform the GET request
    res = curl_easy_perform(curl_handle);
    // check for request errors
    if (res != CURLE_OK)
    {
        fprintf(stderr, "GET request failed: %s\n", curl_easy_strerror(res));
    }
    return response;
}
int main(void)
{
    // initialize curl globally
    curl_global_init(CURL_GLOBAL_ALL);
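    // initialize a curl instance
    CURL *curl_handle = curl_easy_init();
    // where to store the scraped data
    Product products[MAX_PRODUCTS];
    int productCount = 0;
    // iterate over the pages to scrape
    for (int page = 1; page <= NUM_PAGES; ++page)
    {
        // build the URL of the current pagination page
        char pageUrl[256];
        snprintf(pageUrl, sizeof(pageUrl), "https://www.scrapingcourse.com/ecommerce/page/%d/", page);
        // retrieve the HTML of the current page
        struct CURLResponse response = GetRequest(curl_handle, pageUrl);
        // parse the HTML returned by the server
        htmlDocPtr doc = htmlReadMemory(response.html, (int)response.size, NULL, NULL, HTML_PARSE_NOERROR);
        // initialize the XPath context
        xmlXPathContextPtr context = xmlXPathNewContext(doc);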
        // get the product HTML elements on the page
        xmlXPathObjectPtr productHTMLElements = xmlXPathEvalExpression((xmlChar *)"//li[contains(@class, 'product')]", context);
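        // iterate over the product elements and scrape them all
        for (int i = 0; i < productHTMLElements->nodesetval->nodeNr; ++i)
        {
            // get the current element of the loop
            xmlNodePtr productHTMLElement = productHTMLElements->nodesetval->nodeTab[i];
            // set the context to restrict XPath selectors
            // to the children of the current element
            xmlXPathSetContextNode(productHTMLElement, context);
            xmlNodePtr urlHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a", context)->nodesetval->nodeTab[0];
            char *url = (char *)xmlGetProp(urlHTMLElement, (xmlChar *)"href");
            xmlNodePtr imageHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/img", context)->nodesetval->nodeTab[0];
            char *image = (char *)xmlGetProp(imageHTMLElement, (xmlChar *)"src");
            xmlNodePtr nameHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/h2", context)->nodesetval->nodeTab[0];
            char *name = (char *)xmlNodeGetContent(nameHTMLElement);
            xmlNodePtr priceHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/span", context)->nodesetval->nodeTab[0];
            char *price = (char *)xmlNodeGetContent(priceHTMLElement);
            // store the scraped data in a Product instance
            Product product;
            product.url = url;
            product.image = image;
            product.name = name;
            product.price = price;
            // add the product to the array and increment the counter
            products[productCount] = product;
            productCount++;
        }
        // free up the resources allocated for the current page
        free(response.html);
        xmlXPathFreeContext(context);
        xmlFreeDoc(doc);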
    }
    // free up libxml2 global resources
    xmlCleanupParser();
    // export the scraped data to CSV
    FILE *csvFile = fopen("products.csv", "w");
    if (csvFile == NULL)
    {
        perror("Failed to open the CSV output file!");
        return 1;
    }
    // write the CSV header
    fprintf(csvFile, "url,image,name,price\n");
    // write each product's data to the CSV file
    for (int i = 0; i < productCount; i++)
    {
        fprintf(csvFile, "%s,%s,%s,%s\n", products[i].url, products[i].image, products[i].name, products[i].price);
    }
    // close the CSV file
    fclose(csvFile);
    // cleanup curl resources
    curl_easy_cleanup(curl_handle);
    curl_global_cleanup();
    return 0;
}
This C web scraping script crawls the entire site, getting the data from each product on every pagination page. Run it, and the resulting CSV file will contain all the products from the newly visited pages, too.
Some sites adopt anti-bot measures that will stop your scraper in its tracks. Take, for example, the G2.com site, which uses the Cloudflare WAF to prevent bots from accessing its pages, and try to make a request to it:
scraper.c
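// request a Cloudflare-protected G2.com page (assumed here to be
// the Asana reviews page, matching the ZenRows output shown later)
struct CURLResponse response = GetRequest(curl_handle, "https://www.g2.com/products/asana/reviews");
printf("%s\n", response.html);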
Output
<!DOCTYPE html>
<html lang="en-US">
<head>
<title>Just a moment...</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
<meta name="robots" content="noindex,nofollow">
<meta name="viewport" content="width=device-width,initial-scale=1">
<link href="/cdn-cgi/styles/challenges.css" rel="stylesheet">
</head>
<!-- Omitted for brevity... -->
Anti-bot measures represent the biggest challenge when performing web scraping
in C. There are, of course, some solutions. Find out more in our in-depth guide on
how to do web scraping without getting blocked.
At the same time, most of those techniques are tricks that work only for a while or
not consistently. A better alternative to avoid any blocks is ZenRows, a full-featured
web scraping API that provides premium proxies, headless browser capabilities,
and a complete anti-bot toolkit.
Sign up to get your free 1,000 credits, and you'll get to the Request Builder page.
On the right of the screen, select the "cURL" option and then the "API" connection mode. Next, pass the generated URL to your GetRequest() method:
scraper.c
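// call the protected page through the ZenRows API
// (<YOUR_ZENROWS_API_KEY> is a placeholder for your key; the Request
// Builder may append extra parameters to the generated URL)
struct CURLResponse response = GetRequest(curl_handle, "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fasana%2Freviews");
printf("%s\n", response.html);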
Output
<!DOCTYPE html>
<head>
<meta charset="utf-8" />
<link href="https://www.g2.com/assets/favicon-fdacc4208a68e8ae57a80bf869d155829f2400fa7dd128b9c9e60f07795c4915.ico" rel="shortcut icon" type="image/x-icon" />
<title>Asana Reviews 2023: Details, Pricing, & Features | G2</title>
<!-- omitted for brevity ... -->
Wow! Bye-bye, anti-bot limitations!
What about sites that require JavaScript to render their content? C has no library for controlling a headless browser. The closest project you can find is webdriverxx, but it only works with C++. You can explore our C++ web scraping tutorial to learn more. But if you don't want to change programming language, the solution is to rely on ZenRows' JS rendering capabilities.
ZenRows works with C and any other programming language, and it can render JavaScript. It also offers JavaScript actions to interact with pages as a human user would. In other words, you don't need to adopt a different language to deal with dynamic content pages in C.
Conclusion
This step-by-step tutorial taught you how to build a C web scraping application.
You started from the basics and then dug into more complex topics. You have
become a web scraping C ninja!