
Web Scraping with C in 2024

zenrows.com/blog/web-scraping-c

June 1, 2024 · 7 min read

C is one of the most efficient programming languages around, and that performance makes it ideal for web scraping jobs involving tons of pages or very large ones. In this step-by-step tutorial, you'll learn how to do web scraping in C
with the libcurl and libxml2 libraries.

Let's dive in!

Can You Do Web Scraping with C?


When you think of the online world, C isn't the first language that comes to mind. For web scraping, most developers prefer Python for its extensive ecosystem of packages, or JavaScript on Node.js for its large community.

At the same time, C is a viable option for doing web scraping, especially when
resource use is critical. A web scraping application in C can achieve extreme
performance thanks to the low-level nature of the language.

Learn more about the best programming languages for web scraping.

How to Do Web Scraping in C


Web scraping using C involves three simple steps:

1. Download the HTML content of the target page with libcurl.
2. Parse the HTML and scrape data from it with the HTML parser libxml2.
3. Export the collected data to a file.

As a target site, we'll use ScrapingCourse.com, a demo website with e-commerce features.


The C scraper you're going to build will be able to retrieve all product data from
each page of the site.

Let's build a C web scraping program!

Step 1: Install the Necessary Tools

Coding in C requires an environment for compilation and execution. On Windows, rely on Visual Studio. On macOS or Linux, go for Visual Studio Code with the C/C++ extension. Open the IDE and follow the instructions to create a C project based on your local compiler.

Then, install the C package manager vcpkg and set it up in Visual Studio as explained in the official guide. This package manager lets you install the two dependencies required to build a web scraper in C: libcurl to retrieve the HTML of the target pages, and libxml2 to parse it and extract the desired data.

To install libcurl and libxml2, run the command below in the root folder of the
project:

Terminal

vcpkg install curl libxml2
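
If you're on macOS or Linux and prefer to skip vcpkg, both libraries are usually available from the system package manager. The package names below are an assumption based on common setups, so adjust them to your distribution:

Terminal

# Debian/Ubuntu
sudo apt-get install libcurl4-openssl-dev libxml2-dev

# macOS (Homebrew)
brew install curl libxml2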

Fantastic! You're now fully set up.

Time to initialize your web scraping C script. Create a scraper.c file in your project as follows. It's the simplest possible C program, but its main() function will soon contain some scraping logic.

scraper.c

#include <stdio.h>
#include <stdlib.h>

int main() {
    printf("Hello, World!\n");
    return 0;
}

Import the two libraries installed earlier by adding the three lines below at the top of the scraper.c file. The first import is for libcurl, while the other two come from libxml2. In detail, HTMLparser.h exposes functions to parse an HTML document, and xpath.h lets you select the desired elements from it.

scraper.c

#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

Great! You're now ready to learn the basics of web scraping with C!

Step 2: Get the HTML of Your Target Webpage


Requests with libcurl involve boilerplate operations you don't want to repeat every
time. Encapsulate them in a reusable function that:

1. Receives a cURL instance as a parameter.
2. Uses it to make an HTTP GET request to the URL passed as a parameter.
3. Returns the HTML document produced by the server as a special data structure.

This is how:

scraper.c

struct CURLResponse
{
    char *html;
    size_t size;
};

static size_t WriteHTMLCallback(void *contents, size_t size, size_t nmemb, void *userp)
{
    size_t realsize = size * nmemb;
    struct CURLResponse *mem = (struct CURLResponse *)userp;
    char *ptr = realloc(mem->html, mem->size + realsize + 1);

    if (!ptr)
    {
        printf("Not enough memory available (realloc returned NULL)\n");
        return 0;
    }

    mem->html = ptr;
    memcpy(&(mem->html[mem->size]), contents, realsize);
    mem->size += realsize;
    mem->html[mem->size] = 0;

    return realsize;
}

struct CURLResponse GetRequest(CURL *curl_handle, const char *url)
{
    CURLcode res;
    struct CURLResponse response;

    // initialize the response
    response.html = malloc(1);
    response.size = 0;

    // specify the URL to GET
    curl_easy_setopt(curl_handle, CURLOPT_URL, url);
    // send all data returned by the server to WriteHTMLCallback
    curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, WriteHTMLCallback);
    // pass "response" to the callback function
    curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, (void *)&response);
    // set a User-Agent header
    curl_easy_setopt(curl_handle, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36");
    // perform the GET request
    res = curl_easy_perform(curl_handle);

    // check for request errors
    if (res != CURLE_OK)
    {
        fprintf(stderr, "GET request failed: %s\n", curl_easy_strerror(res));
    }

    return response;
}
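
The snippet above keeps the happy path short. If you want the request to be a bit more robust, libcurl lets you follow redirects, set a timeout, and inspect the HTTP status code. The lines below are a minimal sketch of what you could add inside GetRequest(), not something the original logic requires:

scraper.c

// optional hardening for GetRequest():
// follow HTTP redirects (e.g., from http:// to https://)
curl_easy_setopt(curl_handle, CURLOPT_FOLLOWLOCATION, 1L);
// give up if the transfer takes longer than 30 seconds
curl_easy_setopt(curl_handle, CURLOPT_TIMEOUT, 30L);

res = curl_easy_perform(curl_handle);

if (res == CURLE_OK)
{
    // check the HTTP status code returned by the server
    long http_code = 0;
    curl_easy_getinfo(curl_handle, CURLINFO_RESPONSE_CODE, &http_code);
    if (http_code >= 400)
    {
        fprintf(stderr, "Request to %s returned HTTP %ld\n", url, http_code);
    }
}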

Now, use GetRequest() in the main() function of scraper.c to retrieve the target
HTML document as a char*:

scraper.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

// ...
// struct CURLResponse GetRequest(CURL *curl_handle, const char *url) ...

int main() {
    // initialize curl globally
    curl_global_init(CURL_GLOBAL_ALL);

    // initialize a CURL instance
    CURL *curl_handle = curl_easy_init();

    // retrieve the HTML document of the target page
    struct CURLResponse response = GetRequest(curl_handle, "https://www.scrapingcourse.com/ecommerce/");

    // print the HTML content
    printf("%s\n", response.html);

    // scraping logic...

    // cleanup the curl instance
    curl_easy_cleanup(curl_handle);
    // cleanup the curl resources
    curl_global_cleanup();

    return 0;
}

That's your initial script. Compile and run it, and you'll see the following output in your terminal:

Output

<!DOCTYPE html>
<html lang="en-US">
<head>
    <!-- ... -->
    <title>Ecommerce Test Site to Learn Web Scraping – ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body class="home archive ...">
    <p class="woocommerce-result-count">Showing 1–16 of 188 results</p>
    <ul class="products columns-4">
        <!-- ... -->
    </ul>
</body>
</html>

Wonderful! That's the HTML code of the target page!
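
If you're compiling from a terminal rather than through Visual Studio's build system, a command along these lines should work, assuming gcc and pkg-config can find the two libraries on your system (paths will differ if they came from vcpkg):

Terminal

gcc scraper.c -o scraper $(pkg-config --cflags --libs libcurl libxml-2.0)
./scraper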


Step 3: Extract Specific Data from the Page

After retrieving the HTML code, feed it to libxml2, where htmlReadMemory() parses
the HTML char * content and produces a tree you can explore via XPath
expressions.

scraper.c

htmlDocPtr doc = htmlReadMemory(response.html, (unsigned long)response.size, NULL, NULL, HTML_PARSE_NOERROR);

The next step is to define an effective selection strategy. To do so, you need to
inspect the target site and familiarize yourself with its structure.

Open the target site in the browser, right-click on a product HTML node, and choose the "Inspect" option to open the DevTools window.


Take a look at the DOM of the page and note that all products are <li> elements
with the product class. Thus, you can retrieve them all with the XPath query
below:

scraper.c

//li[contains(@class, 'product')]

Apply the XPath selector in libxml2 to retrieve all HTML products. xmlXPathNewContext() sets the XPath context to the entire document. Next, xmlXPathEvalExpression() applies the selector strategy defined above.

scraper.c

xmlXPathContextPtr context = xmlXPathNewContext(doc);
xmlXPathObjectPtr productHTMLElements = xmlXPathEvalExpression((xmlChar *)"//li[contains(@class, 'product')]", context);

Note
You can learn about XPath for web scraping in our tutorial.

Note that each product on the page contains this information:

A link to the detail page in an <a>.
An image in an <img>.
A name in an <h2>.
A price in a <span>.

To scrape that data and keep track of it, define a custom data structure at the top of scraper.c with typedef. C doesn't support classes, but it has structs: collections of data fields grouped under the same name.

scraper.c

typedef struct
{
    char *url;
    char *image;
    char *name;
    char *price;
} Product;

There are many elements on a single pagination page, so you'll need an array of
Product:

scraper.c

Product products[MAX_PRODUCTS];

MAX_PRODUCTS is a macro storing the number of products on a page:

scraper.c

#define MAX_PRODUCTS 16

Time to iterate over the selected product nodes and extract the desired info from each of them. At the end of the for loop, products will contain all the product data of interest!

scraper.c

for (int i = 0; i < productHTMLElements->nodesetval->nodeNr; ++i)
{
    // get the current element of the loop
    xmlNodePtr productHTMLElement = productHTMLElements->nodesetval->nodeTab[i];

    // set the context to restrict XPath selectors
    // to the children of the current element
    xmlXPathSetContextNode(productHTMLElement, context);
    xmlNodePtr urlHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a", context)->nodesetval->nodeTab[0];
    char *url = (char *)xmlGetProp(urlHTMLElement, (xmlChar *)"href");
    xmlNodePtr imageHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/img", context)->nodesetval->nodeTab[0];
    char *image = (char *)xmlGetProp(imageHTMLElement, (xmlChar *)"src");
    xmlNodePtr nameHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/h2", context)->nodesetval->nodeTab[0];
    char *name = (char *)xmlNodeGetContent(nameHTMLElement);
    xmlNodePtr priceHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/span", context)->nodesetval->nodeTab[0];
    char *price = (char *)xmlNodeGetContent(priceHTMLElement);

    // store the scraped data in a Product instance
    Product product;
    product.url = strdup(url);
    product.image = strdup(image);
    product.name = strdup(name);
    product.price = strdup(price);

    // free up the resources you no longer need
    free(url);
    free(image);
    free(name);
    free(price);

    // add a new product to the array
    products[productCount] = product;
    productCount++;
}
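
One detail the loop above glosses over: every call to xmlXPathEvalExpression() allocates an xmlXPathObjectPtr that is never released, so the scraper leaks a little memory on each iteration. If that matters for your use case, a possible fix is to keep the returned object and free it with xmlXPathFreeObject() once you've read the node. This is a sketch, not the tutorial's original code:

scraper.c

// evaluate the XPath expression, read the node, then free the result object
xmlXPathObjectPtr urlResult = xmlXPathEvalExpression((xmlChar *)".//a", context);
xmlNodePtr urlHTMLElement = urlResult->nodesetval->nodeTab[0];
char *url = (char *)xmlGetProp(urlHTMLElement, (xmlChar *)"href");
xmlXPathFreeObject(urlResult);
// repeat the same pattern for the image, name, and price selectors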

After the loop, remember to free up the resources you allocated to achieve the
goal:

scraper.c

free(response.html);
// free up libxml2 resources
xmlXPathFreeContext(context);
xmlFreeDoc(doc);
xmlCleanupParser();

The current scraper.c file for C web scraping contains:

scraper.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

#define MAX_PRODUCTS 16

// initialize a data structure to
// store the scraped data
typedef struct
{
    char *url;
    char *image;
    char *name;
    char *price;
} Product;

// ...
// struct CURLResponse GetRequest(CURL *curl_handle, const char *url) ...

int main(void)
{
    // initialize curl globally
    curl_global_init(CURL_GLOBAL_ALL);

    // initialize a CURL instance
    CURL *curl_handle = curl_easy_init();

    // initialize the array that will contain
    // the scraped data
    Product products[MAX_PRODUCTS];
    int productCount = 0;

    // get the HTML document associated with the page
    struct CURLResponse response = GetRequest(curl_handle, "https://www.scrapingcourse.com/ecommerce/");

    // parse the HTML document returned by the server
    htmlDocPtr doc = htmlReadMemory(response.html, (unsigned long)response.size, NULL, NULL, HTML_PARSE_NOERROR);
    xmlXPathContextPtr context = xmlXPathNewContext(doc);

    // get the product HTML elements on the page
    xmlXPathObjectPtr productHTMLElements = xmlXPathEvalExpression((xmlChar *)"//li[contains(@class, 'product')]", context);

    // iterate over them and scrape data from each of them
    for (int i = 0; i < productHTMLElements->nodesetval->nodeNr; ++i)
    {
        // get the current element of the loop
        xmlNodePtr productHTMLElement = productHTMLElements->nodesetval->nodeTab[i];

        // set the context to restrict XPath selectors
        // to the children of the current element
        xmlXPathSetContextNode(productHTMLElement, context);
        xmlNodePtr urlHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a", context)->nodesetval->nodeTab[0];
        char *url = (char *)xmlGetProp(urlHTMLElement, (xmlChar *)"href");
        xmlNodePtr imageHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/img", context)->nodesetval->nodeTab[0];
        char *image = (char *)xmlGetProp(imageHTMLElement, (xmlChar *)"src");
        xmlNodePtr nameHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/h2", context)->nodesetval->nodeTab[0];
        char *name = (char *)xmlNodeGetContent(nameHTMLElement);
        xmlNodePtr priceHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/span", context)->nodesetval->nodeTab[0];
        char *price = (char *)xmlNodeGetContent(priceHTMLElement);

        // store the scraped data in a Product instance
        Product product;
        product.url = strdup(url);
        product.image = strdup(image);
        product.name = strdup(name);
        product.price = strdup(price);

        // free up the resources you no longer need
        free(url);
        free(image);
        free(name);
        free(price);

        // add a new product to the array
        products[productCount] = product;
        productCount++;
    }

    // free up the allocated resources
    free(response.html);
    xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    xmlCleanupParser();

    // cleanup the curl instance
    curl_easy_cleanup(curl_handle);
    // cleanup the curl resources
    curl_global_cleanup();

    return 0;
}

Good job! Now that you know how to extract data from HTML in C, all that's left is to export the output. You'll see how in the next step, together with the final code.

Step 4: Export Data to CSV

Right now, the scraped data is stored in an array of C structs. That's not the best
format to share data with other users. Instead, export it to a more useful format,
such as CSV.

And you don't even need an extra library to achieve that. All you have to do is open
a .csv file, convert Product instances to CSV records, and append them to the
file:

scraper.c

// open a CSV file for writing
FILE *csvFile = fopen("products.csv", "w");
if (csvFile == NULL)
{
    perror("Failed to open the CSV output file!");
    return 1;
}

// write the CSV header
fprintf(csvFile, "url,image,name,price\n");

// write each product's data to the CSV file
for (int i = 0; i < productCount; i++)
{
    fprintf(csvFile, "%s,%s,%s,%s\n", products[i].url, products[i].image, products[i].name, products[i].price);
}

// close the CSV file
fclose(csvFile);
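
A small caveat: the fprintf() call above writes fields as-is, so a product name containing a comma or a double quote would break the CSV layout. If you expect such values, you could wrap each field in quotes with a tiny helper like the hypothetical csv_write_field() below, which is not part of the original script:

scraper.c

// write a single CSV field, wrapping it in quotes and
// doubling any embedded double quotes (RFC 4180 style)
// usage: csv_write_field(csvFile, products[i].name);
void csv_write_field(FILE *file, const char *value)
{
    fputc('"', file);
    for (const char *p = value; *p != '\0'; p++)
    {
        if (*p == '"')
        {
            fputc('"', file); // escape the quote by doubling it
        }
        fputc(*p, file);
    }
    fputc('"', file);
}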

Remember to free up the memory allocated for the struct fields:

scraper.c

for (int i = 0; i < productCount; i++)
{
    free(products[i].url);
    free(products[i].image);
    free(products[i].name);
    free(products[i].price);
}

Put it all together, and you'll get the final code for your C web scraping script:

scraper.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

#define MAX_PRODUCTS 16

// initialize a data structure to
// store the scraped data
typedef struct
{
    char *url;
    char *image;
    char *name;
    char *price;
} Product;

struct CURLResponse
{
    char *html;
    size_t size;
};

static size_t WriteHTMLCallback(void *contents, size_t size, size_t nmemb, void *userp)
{
    size_t realsize = size * nmemb;
    struct CURLResponse *mem = (struct CURLResponse *)userp;

    char *ptr = realloc(mem->html, mem->size + realsize + 1);
    if (!ptr)
    {
        printf("Not enough memory available (realloc returned NULL)\n");
        return 0;
    }

    mem->html = ptr;
    memcpy(&(mem->html[mem->size]), contents, realsize);
    mem->size += realsize;
    mem->html[mem->size] = 0;

    return realsize;
}

struct CURLResponse GetRequest(CURL *curl_handle, const char *url)
{
    CURLcode res;
    struct CURLResponse response;

    // initialize the response
    response.html = malloc(1);
    response.size = 0;

    // specify the URL to GET
    curl_easy_setopt(curl_handle, CURLOPT_URL, url);

    // send all data returned by the server to WriteHTMLCallback
    curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, WriteHTMLCallback);

    // pass "response" to the callback function
    curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, (void *)&response);

    // set a User-Agent header
    curl_easy_setopt(curl_handle, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36");

    // perform the GET request
    res = curl_easy_perform(curl_handle);

    // check for request errors
    if (res != CURLE_OK)
    {
        fprintf(stderr, "GET request failed: %s\n", curl_easy_strerror(res));
    }

    return response;
}

int main(void)
{
    // initialize curl globally
    curl_global_init(CURL_GLOBAL_ALL);

    // initialize a CURL instance
    CURL *curl_handle = curl_easy_init();

    // initialize the array that will contain
    // the scraped data
    Product products[MAX_PRODUCTS];
    int productCount = 0;

    // get the HTML document associated with the page
    struct CURLResponse response = GetRequest(curl_handle, "https://www.scrapingcourse.com/ecommerce/");

    // parse the HTML document returned by the server
    htmlDocPtr doc = htmlReadMemory(response.html, (unsigned long)response.size, NULL, NULL, HTML_PARSE_NOERROR);
    xmlXPathContextPtr context = xmlXPathNewContext(doc);

    // get the product HTML elements on the page
    xmlXPathObjectPtr productHTMLElements = xmlXPathEvalExpression((xmlChar *)"//li[contains(@class, 'product')]", context);

    // iterate over them and scrape data from each of them
    for (int i = 0; i < productHTMLElements->nodesetval->nodeNr; ++i)
    {
        // get the current element of the loop
        xmlNodePtr productHTMLElement = productHTMLElements->nodesetval->nodeTab[i];

        // set the context to restrict XPath selectors
        // to the children of the current element
        xmlXPathSetContextNode(productHTMLElement, context);
        xmlNodePtr urlHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a", context)->nodesetval->nodeTab[0];
        char *url = (char *)xmlGetProp(urlHTMLElement, (xmlChar *)"href");
        xmlNodePtr imageHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/img", context)->nodesetval->nodeTab[0];
        char *image = (char *)xmlGetProp(imageHTMLElement, (xmlChar *)"src");
        xmlNodePtr nameHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/h2", context)->nodesetval->nodeTab[0];
        char *name = (char *)xmlNodeGetContent(nameHTMLElement);
        xmlNodePtr priceHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/span", context)->nodesetval->nodeTab[0];
        char *price = (char *)xmlNodeGetContent(priceHTMLElement);

        // store the scraped data in a Product instance
        Product product;
        product.url = strdup(url);
        product.image = strdup(image);
        product.name = strdup(name);
        product.price = strdup(price);

        // free up the resources you no longer need
        free(url);
        free(image);
        free(name);
        free(price);

        // add a new product to the array
        products[productCount] = product;
        productCount++;
    }

    // free up the allocated resources
    free(response.html);
    xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    xmlCleanupParser();

    // cleanup the curl instance
    curl_easy_cleanup(curl_handle);
    // cleanup the curl resources
    curl_global_cleanup();

    // open a CSV file for writing
    FILE *csvFile = fopen("products.csv", "w");
    if (csvFile == NULL)
    {
        perror("Failed to open the CSV output file!");
        return 1;
    }

    // write the CSV header
    fprintf(csvFile, "url,image,name,price\n");

    // write each product's data to the CSV file
    for (int i = 0; i < productCount; i++)
    {
        fprintf(csvFile, "%s,%s,%s,%s\n", products[i].url, products[i].image, products[i].name, products[i].price);
    }

    // close the CSV file
    fclose(csvFile);

    // free the resources associated with each product
    for (int i = 0; i < productCount; i++)
    {
        free(products[i].url);
        free(products[i].image);
        free(products[i].name);
        free(products[i].price);
    }

    return 0;
}

Compile the scraper application and run it. A products.csv file will appear in your project's folder, containing the products from the first page of ScrapingCourse.com.
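
The exact rows depend on the live catalog, but the file's layout will look roughly like this (placeholder values, just to illustrate the structure):

Output

url,image,name,price
<product-page-url>,<product-image-url>,<product-name>,<product-price>
...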


Amazing! You just learned the basics of web scraping with C, but there's still a lot to cover. For example, you still need to get the data from the remaining e-commerce pages. Keep reading to become a C data scraping expert.

Advanced Web Scraping in C
Scraping requires more than the basics. Time to dig into the advanced concepts of
web scraping in C!

Scrape Multiple Pages: Web Crawling with C

The script built above retrieves products from a single page. However, the target site consists of several pages. To scrape them all, you need to go through each of them with web crawling. In other words, you have to discover all the links on the site and visit them automatically. This involves sets, support data structures, and custom logic to avoid visiting a page twice.
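
To give you an idea of what that bookkeeping looks like, a naive visited-list check in C might be sketched as follows. This is an illustration only; the tutorial deliberately avoids this approach:

scraper.c

#define MAX_VISITED 256

// hypothetical helper (requires <string.h> for strcmp()/strncpy()):
// returns 1 if the URL was already crawled, otherwise records it and returns 0
int already_visited(char visited[][256], int *visitedCount, const char *url)
{
    for (int i = 0; i < *visitedCount; i++)
    {
        if (strcmp(visited[i], url) == 0)
        {
            return 1; // skip: page already scraped
        }
    }
    if (*visitedCount < MAX_VISITED)
    {
        strncpy(visited[*visitedCount], url, 255);
        visited[*visitedCount][255] = '\0';
        (*visitedCount)++;
    }
    return 0;
}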

Implementing web crawling with automatic page discovery in C is possible but also
complex and error-prone. To avoid a headache, you should go for a smart
approach. Take a look at the URLs of the pagination pages. These all follow the
format below:

scraper.c

https://www.scrapingcourse.com/ecommerce/page/<page>/


As there are 12 pages on the site, scrape them all by applying the following
scraping logic to each pagination URL:

scraper.c

for (int page = 1; page <= NUM_PAGES; ++page)
{
    // build the URL of the target page
    char url[256];
    snprintf(url, sizeof(url), "https://www.scrapingcourse.com/ecommerce/page/%d/", page);

    // get the HTML document associated with the current page
    struct CURLResponse response = GetRequest(curl_handle, url);

    // scraping logic...
}

NUM_PAGES is a macro that contains the number of pages the spider will visit. You'll
also need to adapt MAX_PRODUCTS accordingly:

scraper.c

#define NUM_PAGES 12
#define MAX_PRODUCTS (NUM_PAGES * 16)

The scraper.c file will now contain:

scraper.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

#define NUM_PAGES 5 // limit to 5 to avoid crawling the entire site
#define MAX_PRODUCTS (NUM_PAGES * 16)

// initialize a data structure to
// store the scraped data
typedef struct
{
    char *url;
    char *image;
    char *name;
    char *price;
} Product;

struct CURLResponse
{
    char *html;
    size_t size;
};

static size_t WriteHTMLCallback(void *contents, size_t size, size_t nmemb, void *userp)
{
    size_t realsize = size * nmemb;
    struct CURLResponse *mem = (struct CURLResponse *)userp;

    char *ptr = realloc(mem->html, mem->size + realsize + 1);
    if (!ptr)
    {
        printf("Not enough memory available (realloc returned NULL)\n");
        return 0;
    }

    mem->html = ptr;
    memcpy(&(mem->html[mem->size]), contents, realsize);
    mem->size += realsize;
    mem->html[mem->size] = 0;

    return realsize;
}

struct CURLResponse GetRequest(CURL *curl_handle, const char *url)
{
    CURLcode res;
    struct CURLResponse response;

    // initialize the response
    response.html = malloc(1);
    response.size = 0;

    // specify the URL to GET
    curl_easy_setopt(curl_handle, CURLOPT_URL, url);

    // send all data returned by the server to WriteHTMLCallback
    curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, WriteHTMLCallback);

    // pass "response" to the callback function
    curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, (void *)&response);

    // set a User-Agent header
    curl_easy_setopt(curl_handle, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36");

    // perform the GET request
    res = curl_easy_perform(curl_handle);

    // check for request errors
    if (res != CURLE_OK)
    {
        fprintf(stderr, "GET request failed: %s\n", curl_easy_strerror(res));
    }

    return response;
}

int main(void)
{
    // initialize curl globally
    curl_global_init(CURL_GLOBAL_ALL);

    // initialize a CURL instance
    CURL *curl_handle = curl_easy_init();

    // initialize the array that will contain
    // the scraped data
    Product products[MAX_PRODUCTS];
    int productCount = 0;

    // iterate over the pages to scrape
    for (int page = 1; page <= NUM_PAGES; ++page)
    {
        // build the URL of the target page
        char url[256];
        snprintf(url, sizeof(url), "https://www.scrapingcourse.com/ecommerce/page/%d/", page);

        // get the HTML document associated with the current page
        struct CURLResponse response = GetRequest(curl_handle, url);

        // parse the HTML document returned by the server
        htmlDocPtr doc = htmlReadMemory(response.html, (unsigned long)response.size, NULL, NULL, HTML_PARSE_NOERROR);
        xmlXPathContextPtr context = xmlXPathNewContext(doc);

        // get the product HTML elements on the page
        xmlXPathObjectPtr productHTMLElements = xmlXPathEvalExpression((xmlChar *)"//li[contains(@class, 'product')]", context);

        // iterate over them and scrape data from each of them
        for (int i = 0; i < productHTMLElements->nodesetval->nodeNr; ++i)
        {
            // get the current element of the loop
            xmlNodePtr productHTMLElement = productHTMLElements->nodesetval->nodeTab[i];

            // set the context to restrict XPath selectors
            // to the children of the current element
            xmlXPathSetContextNode(productHTMLElement, context);
            xmlNodePtr urlHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a", context)->nodesetval->nodeTab[0];
            char *url = (char *)xmlGetProp(urlHTMLElement, (xmlChar *)"href");
            xmlNodePtr imageHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/img", context)->nodesetval->nodeTab[0];
            char *image = (char *)xmlGetProp(imageHTMLElement, (xmlChar *)"src");
            xmlNodePtr nameHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/h2", context)->nodesetval->nodeTab[0];
            char *name = (char *)xmlNodeGetContent(nameHTMLElement);
            xmlNodePtr priceHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/span", context)->nodesetval->nodeTab[0];
            char *price = (char *)xmlNodeGetContent(priceHTMLElement);

            // store the scraped data in a Product instance
            Product product;
            product.url = strdup(url);
            product.image = strdup(image);
            product.name = strdup(name);
            product.price = strdup(price);

            // free up the resources you no longer need
            free(url);
            free(image);
            free(name);
            free(price);

            // add a new product to the array
            products[productCount] = product;
            productCount++;
        }

        // free up the allocated resources
        free(response.html);
        xmlXPathFreeContext(context);
        xmlFreeDoc(doc);
        xmlCleanupParser();
    }

    // cleanup the curl instance
    curl_easy_cleanup(curl_handle);
    // cleanup the curl resources
    curl_global_cleanup();

    // open a CSV file for writing
    FILE *csvFile = fopen("products.csv", "w");
    if (csvFile == NULL)
    {
        perror("Failed to open the CSV output file!");
        return 1;
    }

    // write the CSV header
    fprintf(csvFile, "url,image,name,price\n");

    // write each product's data to the CSV file
    for (int i = 0; i < productCount; i++)
    {
        fprintf(csvFile, "%s,%s,%s,%s\n", products[i].url, products[i].image, products[i].name, products[i].price);
    }

    // close the CSV file
    fclose(csvFile);

    // free the resources associated with each product
    for (int i = 0; i < productCount; i++)
    {
        free(products[i].url);
        free(products[i].image);
        free(products[i].name);
        free(products[i].price);
    }

    return 0;
}

This C web scraping script crawls the site's pagination pages, getting the data from each product on every page it visits. Run it, and the resulting CSV file will also contain the products discovered on the newly visited pages.

Congrats, you just reached your data extraction goal!

Avoid Getting Blocked


Data is the new oil, and companies know that. That's why many websites protect
their data with anti-scraping measures, which can block requests coming from
automated software like your C scraper.

Take, for example, G2.com, which uses the Cloudflare WAF to keep bots from accessing its pages, and try to make a request to it:

scraper.c

struct CURLResponse response = GetRequest(curl_handle, "https://www.g2.com/products/asana/reviews");
printf("%s\n", response.html);

That'll print the following anti-bot page:

Terminal

<!DOCTYPE html>
<html lang="en-US">
<head>
<title>Just a moment...</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
<meta name="robots" content="noindex,nofollow">
<meta name="viewport" content="width=device-width,initial-scale=1">
<link href="/cdn-cgi/styles/challenges.css" rel="stylesheet">
</head>
<!-- Omitted for brevity... -->

Anti-bot measures represent the biggest challenge when performing web scraping
in C. There are, of course, some solutions. Find out more in our in-depth guide on
how to do web scraping without getting blocked.

At the same time, most of those techniques are tricks that work only for a while or
not consistently. A better alternative to avoid any blocks is ZenRows, a full-featured
web scraping API that provides premium proxies, headless browser capabilities,
and a complete anti-bot toolkit.

Follow the steps below to get started with ZenRows:

Sign up for free to get your free 1,000 credits, and you'll get to the Request Builder
page.


Paste your target URL (https://www.g2.com/products/asana/reviews). Then, activate "Premium Proxies" and enable the "JS Rendering" boost mode.

On the right of the screen, select the "cURL" option, and then the "API" connection mode. Next, pass the generated URL to your GetRequest() function:

scraper.c

struct CURLResponse response = GetRequest(curl_handle, "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fasana%2Freviews&js_render=true&premium_proxy=true");
printf("%s\n", response.html);

That snippet will result in the following output:

Output

<!DOCTYPE html>
<head>
    <meta charset="utf-8" />
    <link href="https://www.g2.com/assets/favicon-fdacc4208a68e8ae57a80bf869d155829f2400fa7dd128b9c9e60f07795c4915.ico" rel="shortcut icon" type="image/x-icon" />
    <title>Asana Reviews 2023: Details, Pricing, &amp; Features | G2</title>
    <!-- omitted for brevity ... -->

Wow! Bye-bye, anti-bot limitations!
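
If you'd rather not hand-encode the target URL inside the ZenRows endpoint, libcurl's curl_easy_escape() can do the percent-encoding for you. Here's a minimal sketch; the API key is a placeholder:

scraper.c

// URL-encode the target page and build the ZenRows request URL
char *encoded_target = curl_easy_escape(curl_handle, "https://www.g2.com/products/asana/reviews", 0);
char api_url[1024];
snprintf(api_url, sizeof(api_url),
         "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=%s&js_render=true&premium_proxy=true",
         encoded_target);
curl_free(encoded_target);

struct CURLResponse response = GetRequest(curl_handle, api_url);
printf("%s\n", response.html);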

Render JavaScript: Headless Browser Scraping in C


Most pages use JavaScript for rendering or data retrieval. To scrape them, you
need a tool that can execute JS: a headless browser. The problem is that, as of
this writing, there's no headless browser library for C.

The closest project you can find is webdriverxx, but it only works with C++. You
can explore our C++ web scraping tutorial to learn more. But if you don't want to
change the programming language, the solution is to rely on ZenRows' JS
rendering capabilities.

ZenRows works with C and any other programming language, and it can render JavaScript. It also offers JavaScript actions to interact with pages as a human user would. You don't need to adopt a different language to deal with dynamic content pages in C.

Conclusion
This step-by-step tutorial taught you how to build a C web scraping application.
You started from the basics and then dug into more complex topics. You have
become a web scraping C ninja!

Now, you know:

Why C is great for efficient scraping.
The basics of scraping with C.
How to do web crawling in C.
How to use C to deal with JavaScript-rendered sites.

However, no matter how sophisticated your script is, anti-scraping technologies can still block it. Bypass them all with ZenRows, a scraping tool with the best built-in anti-bot bypass features on the market. A single API call allows you to get your desired data.
