
Web Scraping with C in 2024

zenrows.com/blog/web-scraping-c

June 1, 2024 · 7 min read

C is one of the most efficient programming languages around, and that performance makes it ideal for web scraping jobs involving tons of pages or very large ones. In this step-by-step tutorial, you'll learn how to do web scraping in C
with the libcurl and libxml2 libraries.

Let's dive in!

Can You Do Web Scraping with C?


When you think of the online world, C isn't the first language that comes to mind. For web scraping, most developers prefer Python for its extensive ecosystem of packages, or JavaScript on Node.js for its large community.

At the same time, C is a viable option for doing web scraping, especially when
resource use is critical. A web scraping application in C can achieve extreme
performance thanks to the low-level nature of the language.

Learn more about the best programming languages for web scraping.

How to Do Web Scraping in C


Web scraping using C involves three simple steps:

1. Download the HTML content of the target page with libcurl.
2. Parse the HTML and scrape data from it with the HTML parser libxml2.
3. Export the collected data to a file.

As a target site, we'll use ScrapingCourse.com, a demo website with e-commerce features.


The C scraper you're going to build will be able to retrieve all product data from
each page of the site.

Let's build a C web scraping program!

Step 1: Install the Necessary Tools

Coding in C requires an environment for compilation and execution. On Windows, rely on Visual Studio. On macOS or Linux, go for Visual Studio Code with the C/C++ extension. Open the IDE and follow the instructions to create a C project based on your local compiler.

Then, install the C package manager vcpkg and set it up in Visual Studio as explained in the official guide. This package manager lets you install the two dependencies required to build a web scraper in C: libcurl to retrieve the HTML of the target pages, and libxml2 to parse it and extract the desired data.

To install libcurl and libxml2, run the command below in the root folder of the
project:

Terminal

vcpkg install curl libxml2
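
If you're on macOS or Linux and prefer to skip vcpkg, both libraries are usually available from the system package manager. The package names below are an assumption based on common setups, so adjust them to your distribution:

Terminal

# Debian/Ubuntu
sudo apt-get install libcurl4-openssl-dev libxml2-dev

# macOS (Homebrew)
brew install curl libxml2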

Fantastic! You're now fully set up.

Time to initialize your web scraping C script. Create a scraper.c file in your project as follows. It's the simplest possible C program, but its main() function will soon contain some scraping logic.

scraper.c

#include <stdio.h>
#include <stdlib.h>

int main() {
    printf("Hello, World!\n");
    return 0;
}

Import the two libraries installed earlier by adding the three lines below at the top of the scraper.c file. The first import is for libcurl, while the other two come from libxml2. In detail, HTMLparser.h exposes functions to parse an HTML document, and xpath.h lets you select the desired elements from it.

scraper.c

#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

Great! You're now ready to learn the basics of web scraping with C!

Step 2: Get the HTML of Your Target Webpage


Requests with libcurl involve boilerplate operations you don't want to repeat every
time. Encapsulate them in a reusable function that:

1. Receives a cURL instance as a parameter.
2. Uses it to make an HTTP GET request to the URL passed as a parameter.
3. Returns the HTML document produced by the server as a special data structure.

This is how:

scraper.c

struct CURLResponse
{
    char *html;
    size_t size;
};

static size_t WriteHTMLCallback(void *contents, size_t size, size_t nmemb, void *userp)
{
    size_t realsize = size * nmemb;
    struct CURLResponse *mem = (struct CURLResponse *)userp;
    char *ptr = realloc(mem->html, mem->size + realsize + 1);

    if (!ptr)
    {
        printf("Not enough memory available (realloc returned NULL)\n");
        return 0;
    }

    mem->html = ptr;
    memcpy(&(mem->html[mem->size]), contents, realsize);
    mem->size += realsize;
    mem->html[mem->size] = 0;

    return realsize;
}

struct CURLResponse GetRequest(CURL *curl_handle, const char *url)
{
    CURLcode res;
    struct CURLResponse response;

    // initialize the response
    response.html = malloc(1);
    response.size = 0;

    // specify the URL to GET
    curl_easy_setopt(curl_handle, CURLOPT_URL, url);
    // send all data returned by the server to WriteHTMLCallback
    curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, WriteHTMLCallback);
    // pass "response" to the callback function
    curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, (void *)&response);
    // set a User-Agent header
    curl_easy_setopt(curl_handle, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36");
    // perform the GET request
    res = curl_easy_perform(curl_handle);

    // check for request errors
    if (res != CURLE_OK)
    {
        fprintf(stderr, "GET request failed: %s\n", curl_easy_strerror(res));
    }

    return response;
}
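
The snippet above keeps the happy path short. If you want the request to be a bit more robust, libcurl lets you follow redirects, set a timeout, and inspect the HTTP status code. The lines below are a minimal sketch of what you could add inside GetRequest(), not something the original logic requires:

scraper.c

// optional hardening for GetRequest():
// follow HTTP redirects (e.g., from http:// to https://)
curl_easy_setopt(curl_handle, CURLOPT_FOLLOWLOCATION, 1L);
// give up if the transfer takes longer than 30 seconds
curl_easy_setopt(curl_handle, CURLOPT_TIMEOUT, 30L);

res = curl_easy_perform(curl_handle);

if (res == CURLE_OK)
{
    // check the HTTP status code returned by the server
    long http_code = 0;
    curl_easy_getinfo(curl_handle, CURLINFO_RESPONSE_CODE, &http_code);
    if (http_code >= 400)
    {
        fprintf(stderr, "Request to %s returned HTTP %ld\n", url, http_code);
    }
}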

Now, use GetRequest() in the main() function of scraper.c to retrieve the target
HTML document as a char*:

scraper.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

// ...
// struct CURLResponse GetRequest(CURL *curl_handle, const char *url) ...

int main() {
    // initialize curl globally
    curl_global_init(CURL_GLOBAL_ALL);

    // initialize a CURL instance
    CURL *curl_handle = curl_easy_init();

    // retrieve the HTML document of the target page
    struct CURLResponse response = GetRequest(curl_handle, "https://www.scrapingcourse.com/ecommerce/");

    // print the HTML content
    printf("%s\n", response.html);

    // scraping logic...

    // cleanup the curl instance
    curl_easy_cleanup(curl_handle);
    // cleanup the curl resources
    curl_global_cleanup();

    return 0;
}

That's your initial script. Compile and run it, and you'll see the following output in your terminal:

Output

<!DOCTYPE html>
<html lang="en-US">
<head>
    <!-- ... -->
    <title>Ecommerce Test Site to Learn Web Scraping – ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body class="home archive ...">
    <p class="woocommerce-result-count">Showing 1–16 of 188 results</p>
    <ul class="products columns-4">
        <!-- ... -->
    </ul>
</body>
</html>

Wonderful! That's the HTML code of the target page!
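
If you're compiling from a terminal rather than through Visual Studio's build system, a command along these lines should work, assuming gcc and pkg-config can find the two libraries on your system (paths will differ if they came from vcpkg):

Terminal

gcc scraper.c -o scraper $(pkg-config --cflags --libs libcurl libxml-2.0)
./scraper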


Step 3: Extract Specific Data from the Page

After retrieving the HTML code, feed it to libxml2, where htmlReadMemory() parses
the HTML char * content and produces a tree you can explore via XPath
expressions.

scraper.c

htmlDocPtr doc = htmlReadMemory(response.html, (unsigned long)response.size, NULL, NULL, HTML_PARSE_NOERROR);

The next step is to define an effective selection strategy. To do so, you need to
inspect the target site and familiarize yourself with its structure.

Open the target site in the browser, right-click on a product HTML node, and choose the "Inspect" option to open the DevTools window.


Take a look at the DOM of the page and note that all products are <li> elements
with the product class. Thus, you can retrieve them all with the XPath query
below:

scraper.c

//li[contains(@class, 'product')]

Apply the XPath selector in libxml2 to retrieve all HTML products. xmlXPathNewContext() sets the XPath context to the entire document. Next, xmlXPathEvalExpression() applies the selector strategy defined above.

scraper.c

xmlXPathContextPtr context = xmlXPathNewContext(doc);
xmlXPathObjectPtr productHTMLElements = xmlXPathEvalExpression((xmlChar *)"//li[contains(@class, 'product')]", context);

Note
You can learn about XPath for web scraping in our tutorial.

Note that each product on the page contains this information:

A link to the detail page in an <a>.
An image in an <img>.
A name in an <h2>.
A price in a <span>.

To scrape that data and keep track of it, define a custom data structure at the top of scraper.c with typedef. C doesn't support classes, but it has structs: collections of data fields grouped under the same name.

scraper.c

typedef struct
{
    char *url;
    char *image;
    char *name;
    char *price;
} Product;

There are many elements on a single pagination page, so you'll need an array of
Product:

scraper.c

Product products[MAX_PRODUCTS];

MAX_PRODUCTS is a macro storing the number of products on a page:

scraper.c

#define MAX_PRODUCTS 16

Time to iterate over the selected product nodes and extract the desired info from each of them. At the end of the for loop, products will contain all the product data of interest!

scraper.c

for (int i = 0; i < productHTMLElements->nodesetval->nodeNr; ++i)
{
    // get the current element of the loop
    xmlNodePtr productHTMLElement = productHTMLElements->nodesetval->nodeTab[i];

    // set the context to restrict XPath selectors
    // to the children of the current element
    xmlXPathSetContextNode(productHTMLElement, context);
    xmlNodePtr urlHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a", context)->nodesetval->nodeTab[0];
    char *url = (char *)xmlGetProp(urlHTMLElement, (xmlChar *)"href");
    xmlNodePtr imageHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/img", context)->nodesetval->nodeTab[0];
    char *image = (char *)xmlGetProp(imageHTMLElement, (xmlChar *)"src");
    xmlNodePtr nameHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/h2", context)->nodesetval->nodeTab[0];
    char *name = (char *)xmlNodeGetContent(nameHTMLElement);
    xmlNodePtr priceHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/span", context)->nodesetval->nodeTab[0];
    char *price = (char *)xmlNodeGetContent(priceHTMLElement);

    // store the scraped data in a Product instance
    Product product;
    product.url = strdup(url);
    product.image = strdup(image);
    product.name = strdup(name);
    product.price = strdup(price);

    // free up the resources you no longer need
    free(url);
    free(image);
    free(name);
    free(price);

    // add a new product to the array
    products[productCount] = product;
    productCount++;
}
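
One detail the loop above glosses over: every call to xmlXPathEvalExpression() allocates an xmlXPathObjectPtr that is never released, so the scraper leaks a little memory on each iteration. If that matters for your use case, a possible fix is to keep the returned object and free it with xmlXPathFreeObject() once you've read the node. This is a sketch, not the tutorial's original code:

scraper.c

// evaluate the XPath expression, read the node, then free the result object
xmlXPathObjectPtr urlResult = xmlXPathEvalExpression((xmlChar *)".//a", context);
xmlNodePtr urlHTMLElement = urlResult->nodesetval->nodeTab[0];
char *url = (char *)xmlGetProp(urlHTMLElement, (xmlChar *)"href");
xmlXPathFreeObject(urlResult);
// repeat the same pattern for the image, name, and price selectors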

After the loop, remember to free up the resources you allocated to achieve the
goal:

scraper.c

free(response.html);
// free up libxml2 resources
xmlXPathFreeContext(context);
xmlFreeDoc(doc);
xmlCleanupParser();

The current scraper.c file for C web scraping contains:

scraper.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

#define MAX_PRODUCTS 16

// initialize a data structure to
// store the scraped data
typedef struct
{
    char *url;
    char *image;
    char *name;
    char *price;
} Product;

// ...
// struct CURLResponse GetRequest(CURL *curl_handle, const char *url) ...

int main(void)
{
    // initialize curl globally
    curl_global_init(CURL_GLOBAL_ALL);

    // initialize a CURL instance
    CURL *curl_handle = curl_easy_init();

    // initialize the array that will contain
    // the scraped data
    Product products[MAX_PRODUCTS];
    int productCount = 0;

    // get the HTML document associated with the page
    struct CURLResponse response = GetRequest(curl_handle, "https://www.scrapingcourse.com/ecommerce/");

    // parse the HTML document returned by the server
    htmlDocPtr doc = htmlReadMemory(response.html, (unsigned long)response.size, NULL, NULL, HTML_PARSE_NOERROR);
    xmlXPathContextPtr context = xmlXPathNewContext(doc);

    // get the product HTML elements on the page
    xmlXPathObjectPtr productHTMLElements = xmlXPathEvalExpression((xmlChar *)"//li[contains(@class, 'product')]", context);

    // iterate over them and scrape data from each of them
    for (int i = 0; i < productHTMLElements->nodesetval->nodeNr; ++i)
    {
        // get the current element of the loop
        xmlNodePtr productHTMLElement = productHTMLElements->nodesetval->nodeTab[i];

        // set the context to restrict XPath selectors
        // to the children of the current element
        xmlXPathSetContextNode(productHTMLElement, context);
        xmlNodePtr urlHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a", context)->nodesetval->nodeTab[0];
        char *url = (char *)xmlGetProp(urlHTMLElement, (xmlChar *)"href");
        xmlNodePtr imageHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/img", context)->nodesetval->nodeTab[0];
        char *image = (char *)xmlGetProp(imageHTMLElement, (xmlChar *)"src");
        xmlNodePtr nameHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/h2", context)->nodesetval->nodeTab[0];
        char *name = (char *)xmlNodeGetContent(nameHTMLElement);
        xmlNodePtr priceHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/span", context)->nodesetval->nodeTab[0];
        char *price = (char *)xmlNodeGetContent(priceHTMLElement);

        // store the scraped data in a Product instance
        Product product;
        product.url = strdup(url);
        product.image = strdup(image);
        product.name = strdup(name);
        product.price = strdup(price);

        // free up the resources you no longer need
        free(url);
        free(image);
        free(name);
        free(price);

        // add a new product to the array
        products[productCount] = product;
        productCount++;
    }

    // free up the allocated resources
    free(response.html);
    xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    xmlCleanupParser();

    // cleanup the curl instance
    curl_easy_cleanup(curl_handle);
    // cleanup the curl resources
    curl_global_cleanup();

    return 0;
}

Good job! Now that you know how to extract data from HTML in C, all that's left is to export the output. You'll see how in the next step, together with the final code.

Step 4: Export Data to CSV

Right now, the scraped data is stored in an array of C structs. That's not the best
format to share data with other users. Instead, export it to a more useful format,
such as CSV.

And you don't even need an extra library to achieve that. All you have to do is open
a .csv file, convert Product instances to CSV records, and append them to the
file:

scraper.c

// open a CSV file for writing
FILE *csvFile = fopen("products.csv", "w");
if (csvFile == NULL)
{
    perror("Failed to open the CSV output file!");
    return 1;
}

// write the CSV header
fprintf(csvFile, "url,image,name,price\n");

// write each product's data to the CSV file
for (int i = 0; i < productCount; i++)
{
    fprintf(csvFile, "%s,%s,%s,%s\n", products[i].url, products[i].image, products[i].name, products[i].price);
}

// close the CSV file
fclose(csvFile);
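
A small caveat: the fprintf() call above writes fields as-is, so a product name containing a comma or a double quote would break the CSV layout. If you expect such values, you could wrap each field in quotes with a tiny helper like the hypothetical csv_write_field() below, which is not part of the original script:

scraper.c

// write a single CSV field, wrapping it in quotes and
// doubling any embedded double quotes (RFC 4180 style)
// usage: csv_write_field(csvFile, products[i].name);
void csv_write_field(FILE *file, const char *value)
{
    fputc('"', file);
    for (const char *p = value; *p != '\0'; p++)
    {
        if (*p == '"')
        {
            fputc('"', file); // escape the quote by doubling it
        }
        fputc(*p, file);
    }
    fputc('"', file);
}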

Remember to free up the memory allocated for the struct fields:

scraper.c

for (int i = 0; i < productCount; i++)
{
    free(products[i].url);
    free(products[i].image);
    free(products[i].name);
    free(products[i].price);
}

Put it all together, and you'll get the final code for your C web scraping script:

scraper.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

#define MAX_PRODUCTS 16

// initialize a data structure to
// store the scraped data
typedef struct
{
    char *url;
    char *image;
    char *name;
    char *price;
} Product;

struct CURLResponse
{
    char *html;
    size_t size;
};

static size_t WriteHTMLCallback(void *contents, size_t size, size_t nmemb, void *userp)
{
    size_t realsize = size * nmemb;
    struct CURLResponse *mem = (struct CURLResponse *)userp;

    char *ptr = realloc(mem->html, mem->size + realsize + 1);
    if (!ptr)
    {
        printf("Not enough memory available (realloc returned NULL)\n");
        return 0;
    }

    mem->html = ptr;
    memcpy(&(mem->html[mem->size]), contents, realsize);
    mem->size += realsize;
    mem->html[mem->size] = 0;

    return realsize;
}

struct CURLResponse GetRequest(CURL *curl_handle, const char *url)
{
    CURLcode res;
    struct CURLResponse response;

    // initialize the response
    response.html = malloc(1);
    response.size = 0;

    // specify the URL to GET
    curl_easy_setopt(curl_handle, CURLOPT_URL, url);

    // send all data returned by the server to WriteHTMLCallback
    curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, WriteHTMLCallback);

    // pass "response" to the callback function
    curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, (void *)&response);

    // set a User-Agent header
    curl_easy_setopt(curl_handle, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36");

    // perform the GET request
    res = curl_easy_perform(curl_handle);

    // check for request errors
    if (res != CURLE_OK)
    {
        fprintf(stderr, "GET request failed: %s\n", curl_easy_strerror(res));
    }

    return response;
}

int main(void)
{
    // initialize curl globally
    curl_global_init(CURL_GLOBAL_ALL);

    // initialize a CURL instance
    CURL *curl_handle = curl_easy_init();

    // initialize the array that will contain
    // the scraped data
    Product products[MAX_PRODUCTS];
    int productCount = 0;

    // get the HTML document associated with the page
    struct CURLResponse response = GetRequest(curl_handle, "https://www.scrapingcourse.com/ecommerce/");

    // parse the HTML document returned by the server
    htmlDocPtr doc = htmlReadMemory(response.html, (unsigned long)response.size, NULL, NULL, HTML_PARSE_NOERROR);
    xmlXPathContextPtr context = xmlXPathNewContext(doc);

    // get the product HTML elements on the page
    xmlXPathObjectPtr productHTMLElements = xmlXPathEvalExpression((xmlChar *)"//li[contains(@class, 'product')]", context);

    // iterate over them and scrape data from each of them
    for (int i = 0; i < productHTMLElements->nodesetval->nodeNr; ++i)
    {
        // get the current element of the loop
        xmlNodePtr productHTMLElement = productHTMLElements->nodesetval->nodeTab[i];

        // set the context to restrict XPath selectors
        // to the children of the current element
        xmlXPathSetContextNode(productHTMLElement, context);
        xmlNodePtr urlHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a", context)->nodesetval->nodeTab[0];
        char *url = (char *)xmlGetProp(urlHTMLElement, (xmlChar *)"href");
        xmlNodePtr imageHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/img", context)->nodesetval->nodeTab[0];
        char *image = (char *)xmlGetProp(imageHTMLElement, (xmlChar *)"src");
        xmlNodePtr nameHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/h2", context)->nodesetval->nodeTab[0];
        char *name = (char *)xmlNodeGetContent(nameHTMLElement);
        xmlNodePtr priceHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/span", context)->nodesetval->nodeTab[0];
        char *price = (char *)xmlNodeGetContent(priceHTMLElement);

        // store the scraped data in a Product instance
        Product product;
        product.url = strdup(url);
        product.image = strdup(image);
        product.name = strdup(name);
        product.price = strdup(price);

        // free up the resources you no longer need
        free(url);
        free(image);
        free(name);
        free(price);

        // add a new product to the array
        products[productCount] = product;
        productCount++;
    }

    // free up the allocated resources
    free(response.html);
    xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    xmlCleanupParser();

    // cleanup the curl instance
    curl_easy_cleanup(curl_handle);
    // cleanup the curl resources
    curl_global_cleanup();

    // open a CSV file for writing
    FILE *csvFile = fopen("products.csv", "w");
    if (csvFile == NULL)
    {
        perror("Failed to open the CSV output file!");
        return 1;
    }

    // write the CSV header
    fprintf(csvFile, "url,image,name,price\n");

    // write each product's data to the CSV file
    for (int i = 0; i < productCount; i++)
    {
        fprintf(csvFile, "%s,%s,%s,%s\n", products[i].url, products[i].image, products[i].name, products[i].price);
    }

    // close the CSV file
    fclose(csvFile);

    // free the resources associated with each product
    for (int i = 0; i < productCount; i++)
    {
        free(products[i].url);
        free(products[i].image);
        free(products[i].name);
        free(products[i].price);
    }

    return 0;
}

Compile the scraper application and run it. A products.csv file will appear in your project's folder, containing the products from the first page of ScrapingCourse.com.
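
The exact rows depend on the live catalog, but the file's layout will look roughly like this (placeholder values, just to illustrate the structure):

Output

url,image,name,price
<product-page-url>,<product-image-url>,<product-name>,<product-price>
...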


Amazing! You just learned the basics of web scraping with C, but there's still a lot to cover. For example, you still need to get the data from the remaining e-commerce pages. Keep reading to become a C data scraping expert.

Advanced Web Scraping in C
Scraping requires more than the basics. Time to dig into the advanced concepts of
web scraping in C!

Scrape Multiple Pages: Web Crawling with C

The script built above retrieves products from a single page. However, the target site consists of several pages. To scrape them all, you need to go through each of them with web crawling. In other words, you have to discover all the links on the site and visit them automatically. This involves sets, support data structures, and custom logic to avoid visiting a page twice.
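
To give you an idea of what that bookkeeping looks like, a naive visited-list check in C might be sketched as follows. This is an illustration only; the tutorial deliberately avoids this approach:

scraper.c

#define MAX_VISITED 256

// hypothetical helper (requires <string.h> for strcmp()/strncpy()):
// returns 1 if the URL was already crawled, otherwise records it and returns 0
int already_visited(char visited[][256], int *visitedCount, const char *url)
{
    for (int i = 0; i < *visitedCount; i++)
    {
        if (strcmp(visited[i], url) == 0)
        {
            return 1; // skip: page already scraped
        }
    }
    if (*visitedCount < MAX_VISITED)
    {
        strncpy(visited[*visitedCount], url, 255);
        visited[*visitedCount][255] = '\0';
        (*visitedCount)++;
    }
    return 0;
}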

Implementing web crawling with automatic page discovery in C is possible but also
complex and error-prone. To avoid a headache, you should go for a smart
approach. Take a look at the URLs of the pagination pages. These all follow the
format below:

scraper.c

https://www.scrapingcourse.com/ecommerce/page/<page>/


As there are 12 pages on the site, scrape them all by applying the following
scraping logic to each pagination URL:

scraper.c

for (int page = 1; page <= NUM_PAGES; ++page)
{
    // build the URL of the target page
    char url[256];
    snprintf(url, sizeof(url), "https://www.scrapingcourse.com/ecommerce/page/%d/", page);

    // get the HTML document associated with the current page
    struct CURLResponse response = GetRequest(curl_handle, url);

    // scraping logic...
}

NUM_PAGES is a macro that contains the number of pages the spider will visit. You'll
also need to adapt MAX_PRODUCTS accordingly:

scraper.c

#define NUM_PAGES 12
#define MAX_PRODUCTS (NUM_PAGES * 16)

The scraper.c file will now contain:

scraper.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

#define NUM_PAGES 5 // limit to 5 to avoid crawling the entire site
#define MAX_PRODUCTS (NUM_PAGES * 16)

// initialize a data structure to
// store the scraped data
typedef struct
{
    char *url;
    char *image;
    char *name;
    char *price;
} Product;

struct CURLResponse
{
    char *html;
    size_t size;
};

static size_t WriteHTMLCallback(void *contents, size_t size, size_t nmemb, void *userp)
{
    size_t realsize = size * nmemb;
    struct CURLResponse *mem = (struct CURLResponse *)userp;

    char *ptr = realloc(mem->html, mem->size + realsize + 1);
    if (!ptr)
    {
        printf("Not enough memory available (realloc returned NULL)\n");
        return 0;
    }

    mem->html = ptr;
    memcpy(&(mem->html[mem->size]), contents, realsize);
    mem->size += realsize;
    mem->html[mem->size] = 0;

    return realsize;
}

struct CURLResponse GetRequest(CURL *curl_handle, const char *url)
{
    CURLcode res;
    struct CURLResponse response;

    // initialize the response
    response.html = malloc(1);
    response.size = 0;

    // specify the URL to GET
    curl_easy_setopt(curl_handle, CURLOPT_URL, url);

    // send all data returned by the server to WriteHTMLCallback
    curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, WriteHTMLCallback);

    // pass "response" to the callback function
    curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, (void *)&response);

    // set a User-Agent header
    curl_easy_setopt(curl_handle, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36");

    // perform the GET request
    res = curl_easy_perform(curl_handle);

    // check for request errors
    if (res != CURLE_OK)
    {
        fprintf(stderr, "GET request failed: %s\n", curl_easy_strerror(res));
    }

    return response;
}

int main(void)
{
    // initialize curl globally
    curl_global_init(CURL_GLOBAL_ALL);

    // initialize a CURL instance
    CURL *curl_handle = curl_easy_init();

    // initialize the array that will contain
    // the scraped data
    Product products[MAX_PRODUCTS];
    int productCount = 0;

    // iterate over the pages to scrape
    for (int page = 1; page <= NUM_PAGES; ++page)
    {
        // build the URL of the target page
        char url[256];
        snprintf(url, sizeof(url), "https://www.scrapingcourse.com/ecommerce/page/%d/", page);

        // get the HTML document associated with the current page
        struct CURLResponse response = GetRequest(curl_handle, url);

        // parse the HTML document returned by the server
        htmlDocPtr doc = htmlReadMemory(response.html, (unsigned long)response.size, NULL, NULL, HTML_PARSE_NOERROR);
        xmlXPathContextPtr context = xmlXPathNewContext(doc);

        // get the product HTML elements on the page
        xmlXPathObjectPtr productHTMLElements = xmlXPathEvalExpression((xmlChar *)"//li[contains(@class, 'product')]", context);

        // iterate over them and scrape data from each of them
        for (int i = 0; i < productHTMLElements->nodesetval->nodeNr; ++i)
        {
            // get the current element of the loop
            xmlNodePtr productHTMLElement = productHTMLElements->nodesetval->nodeTab[i];

            // set the context to restrict XPath selectors
            // to the children of the current element
            xmlXPathSetContextNode(productHTMLElement, context);
            xmlNodePtr urlHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a", context)->nodesetval->nodeTab[0];
            char *url = (char *)xmlGetProp(urlHTMLElement, (xmlChar *)"href");
            xmlNodePtr imageHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/img", context)->nodesetval->nodeTab[0];
            char *image = (char *)xmlGetProp(imageHTMLElement, (xmlChar *)"src");
            xmlNodePtr nameHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/h2", context)->nodesetval->nodeTab[0];
            char *name = (char *)xmlNodeGetContent(nameHTMLElement);
            xmlNodePtr priceHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/span", context)->nodesetval->nodeTab[0];
            char *price = (char *)xmlNodeGetContent(priceHTMLElement);

            // store the scraped data in a Product instance
            Product product;
            product.url = strdup(url);
            product.image = strdup(image);
            product.name = strdup(name);
            product.price = strdup(price);

            // free up the resources you no longer need
            free(url);
            free(image);
            free(name);
            free(price);

            // add a new product to the array
            products[productCount] = product;
            productCount++;
        }

        // free up the allocated resources
        free(response.html);
        xmlXPathFreeContext(context);
        xmlFreeDoc(doc);
        xmlCleanupParser();
    }

    // cleanup the curl instance
    curl_easy_cleanup(curl_handle);
    // cleanup the curl resources
    curl_global_cleanup();

    // open a CSV file for writing
    FILE *csvFile = fopen("products.csv", "w");
    if (csvFile == NULL)
    {
        perror("Failed to open the CSV output file!");
        return 1;
    }

    // write the CSV header
    fprintf(csvFile, "url,image,name,price\n");

    // write each product's data to the CSV file
    for (int i = 0; i < productCount; i++)
    {
        fprintf(csvFile, "%s,%s,%s,%s\n", products[i].url, products[i].image, products[i].name, products[i].price);
    }

    // close the CSV file
    fclose(csvFile);

    // free the resources associated with each product
    for (int i = 0; i < productCount; i++)
    {
        free(products[i].url);
        free(products[i].image);
        free(products[i].name);
        free(products[i].price);
    }

    return 0;
}

This C web scraping script crawls the site's pagination pages, getting the data from each product on every page it visits. Run it, and the resulting CSV file will also contain the products discovered on the newly visited pages.

Congrats, you just reached your data extraction goal!

Avoid Getting Blocked


Data is the new oil, and companies know that. That's why many websites protect
their data with anti-scraping measures, which can block requests coming from
automated software like your C scraper.

Take, for example, G2.com, which uses the Cloudflare WAF to keep bots from accessing its pages, and try to make a request to it:

scraper.c

struct CURLResponse response = GetRequest(curl_handle, "https://www.g2.com/products/asana/reviews");
printf("%s\n", response.html);

That'll print the following anti-bot page:

Terminal

<!DOCTYPE html>
<html lang="en-US">
<head>
<title>Just a moment...</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
<meta name="robots" content="noindex,nofollow">
<meta name="viewport" content="width=device-width,initial-scale=1">
<link href="/cdn-cgi/styles/challenges.css" rel="stylesheet">
</head>
<!-- Omitted for brevity... -->

Anti-bot measures represent the biggest challenge when performing web scraping
in C. There are, of course, some solutions. Find out more in our in-depth guide on
how to do web scraping without getting blocked.

At the same time, most of those techniques are tricks that work only for a while or
not consistently. A better alternative to avoid any blocks is ZenRows, a full-featured
web scraping API that provides premium proxies, headless browser capabilities,
and a complete anti-bot toolkit.

Follow the steps below to get started with ZenRows:

Sign up for free to get your free 1,000 credits, and you'll get to the Request Builder
page.


Paste your target URL (https://www.g2.com/products/asana/reviews). Then, activate "Premium Proxies" and enable the "JS Rendering" boost mode.

On the right of the screen, select the "cURL" option, and then the "API" connection mode. Next, pass the generated URL to your GetRequest() function:

scraper.c

struct CURLResponse response = GetRequest(curl_handle, "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fasana%2Freviews&js_render=true&premium_proxy=true");
printf("%s\n", response.html);

That snippet will result in the following output:

Output

<!DOCTYPE html>
<head>
    <meta charset="utf-8" />
    <link href="https://www.g2.com/assets/favicon-fdacc4208a68e8ae57a80bf869d155829f2400fa7dd128b9c9e60f07795c4915.ico" rel="shortcut icon" type="image/x-icon" />
    <title>Asana Reviews 2023: Details, Pricing, &amp; Features | G2</title>
    <!-- omitted for brevity ... -->

Wow! Bye-bye, anti-bot limitations!
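
If you'd rather not hand-encode the target URL inside the ZenRows endpoint, libcurl's curl_easy_escape() can do the percent-encoding for you. Here's a minimal sketch; the API key is a placeholder:

scraper.c

// URL-encode the target page and build the ZenRows request URL
char *encoded_target = curl_easy_escape(curl_handle, "https://www.g2.com/products/asana/reviews", 0);
char api_url[1024];
snprintf(api_url, sizeof(api_url),
         "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=%s&js_render=true&premium_proxy=true",
         encoded_target);
curl_free(encoded_target);

struct CURLResponse response = GetRequest(curl_handle, api_url);
printf("%s\n", response.html);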

Render JavaScript: Headless Browser Scraping in C


Most pages use JavaScript for rendering or data retrieval. To scrape them, you
need a tool that can execute JS: a headless browser. The problem is that, as of
this writing, there's no headless browser library for C.

The closest project you can find is webdriverxx, but it only works with C++. You
can explore our C++ web scraping tutorial to learn more. But if you don't want to
change the programming language, the solution is to rely on ZenRows' JS
rendering capabilities.

ZenRows works with C and any other programming language, and it can render JavaScript. It also offers JavaScript actions to interact with pages as a human user would. You don't need to adopt a different language to deal with dynamic content pages in C.

Conclusion
This step-by-step tutorial taught you how to build a C web scraping application.
You started from the basics and then dug into more complex topics. You have
become a web scraping C ninja!

Now, you know:

Why C is great for efficient scraping.
The basics of scraping with C.
How to do web crawling in C.
How to use C to deal with JavaScript-rendered sites.

However, no matter how sophisticated your script is, anti-scraping technologies can still block it. Bypass them all with ZenRows, a scraping tool with the best built-in anti-bot bypass features on the market. A single API call allows you to get your desired data.
