
Blocking AI web crawlers

(It’s not déjà vu: this post is an amalgamation of two earlier posts on this subject.)

AI companies crawl websites to train their large language models (LLMs). Aside from using copyrighted material that is not theirs to copy, there is the huge environmental cost¹ of training and using LLMs.

So a growing number of owners of personal websites, including me, do not want to be party to this activity. We want to stop AI web crawlers from accessing our websites. But how can we do that?

The obvious first answer is to install a robots.txt file on the website. This lists the “user agents” and the URL paths that each agent is or is not allowed to access. Each web crawler has a user agent. For example, OpenAI’s crawler uses the user agent GPTBot.
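For instance, a minimal robots.txt that asks OpenAI’s crawler not to access any part of the site looks something like this (the same pattern works for any other user agent):

User-agent: GPTBot
Disallow: /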

The rest of this post questions whether robots.txt is sufficient and suggests options for actually blocking AI web crawlers.

Is robots.txt sufficient?

Background

Three weeks ago I added a robots.txt to this site, intending to dissuade AI web crawlers from scanning my posts. I am grateful to Cory Dransfeldt for his post Go ahead and block AI web crawlers and for his subsequent creation of the ai.robots.txt git repository.

There was a reasonable response on Hacker News. Many wondered if AI web crawlers would respect robots.txt.

Will AI web crawlers respect robots.txt?

Perhaps not. But larger and/or reputable companies developing AI models probably wouldn’t want to damage their reputation by ignoring robots.txt. Given the contentious nature of AI and the possibility of legislation limiting its development, companies developing AI models will probably want to be seen to be behaving ethically.

robots.txt and its limitations

According to Google’s Introduction to robots.txt, the purpose of the file is to prevent a website from being overwhelmed by (search engine) crawler traffic and to prevent unimportant pages from being indexed. It won’t guarantee that pages do not appear in a search index, since a page may be indexed when it is referenced from another site. (Thus it seems Google doesn’t check robots.txt before indexing such a page.)

What’s the difference between search engine crawlers and AI crawlers?

AI crawlers won’t necessarily follow links, whereas search engine crawlers typically will. The purpose of AI crawlers is to gather suitable training data for AI models, so they are likely to be more selective about the kinds of sites they crawl.

What about ai.txt?

The Spawning company is proposing an ai.txt file as a way of communicating the media types that a site does and does not offer to AI crawlers. At the time of writing, this file is Spawning-specific and would benefit from standardisation and adoption by the larger AI companies. (Another option would be to add support for AI crawling restrictions to the robots.txt format.)

So is it worth trying to block AI web crawlers?

I think so. If we don’t make any attempt to block them, we’re effectively inviting them in.

What to put in robots.txt?

There is a growing number of sources of suitable robots.txt files, including ai.robots.txt and Dark Visitors. All you need to do is arrange for your website to serve your chosen robots.txt file from the path /robots.txt.
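For illustration, an excerpt in the spirit of those lists might name a handful of well-known AI user agents and disallow everything (the agents below are only examples; take the full, up-to-date set from one of the sources above):

User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /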

Job done?

Well not quite. Unfortunately, respecting robots.txt requires effort on the part of the developers of web crawlers and not all of them may have implemented this support. (If you have evidence of AI web crawlers ignoring robots.txt, please get in touch! This is the closest I’ve found to a smoking gun.)

So if we want more certainty that AI web crawlers are not scraping our websites, we need to do better. Websites need to block AI web crawlers.

Implementing blocking

I’ve seen a number of approaches such as Apache or NGINX configurations which reject requests from certain user agents. Someone has used Cloudflare to block AI crawlers. There’s even an NGINX module that rejects web crawlers which don’t implement certain browser features properly.
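As a rough sketch of the user agent approach in NGINX (the list of agents below is illustrative; a real configuration would take it from one of the sources mentioned earlier), a map in the http context can flag known AI crawlers and each server block can then reject them:

# Flag requests whose user agent matches a known AI crawler.
map $http_user_agent $ai_crawler {
    default            0;
    ~*GPTBot           1;
    ~*CCBot            1;
    ~*Google-Extended  1;
}

server {
    # ... existing server configuration ...

    # Reject flagged requests with 403 Forbidden.
    if ($ai_crawler) {
        return 403;
    }
}

Of course this only blocks crawlers which identify themselves honestly, which is the same limitation as robots.txt.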

Since my site uses NGINX, I wrote the nginx_robot_access module to enforce robots.txt. If you’d like to use it, see the instructions in the README. If you want to read about the development of the module, see #BlockingAI.

If you’d like to use the module but are not currently using NGINX, you may be able to add NGINX as a reverse proxy (“in front of” your existing web server) – see the links in Abstracting cloud storage.

[Diagram: NGINX acting as a reverse proxy in front of the existing web server]
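As a rough sketch, assuming the existing web server listens on localhost port 8080 (the port and domain below are placeholders), the NGINX side of such a setup amounts to little more than:

server {
    listen 80;
    server_name example.org;  # placeholder for your domain

    location / {
        # Pass all requests through to the existing web server.
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Blocking rules (or a module such as nginx_robot_access) can then be applied in the NGINX layer before requests reach the existing server.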

Seeing nginx_robot_access in action

If you want to see the module in action, issue the following:

curl -A "CCBot" https://underlap.org/banana.html

but please make sure to include banana.html so I don’t mistake this for a smoking gun if I check the site’s access log!

It should produce the following output:

<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx/1.22.1</center>
</body>
</html>

Further reading

Comments

If crawlers/bots will not respect robots.txt, will they provide an honest user agent?

I think they will. Not to do so would be downright malicious or dishonest and that would detract from their reputation (see my earlier comments on AI web crawlers respecting robots.txt).

Footnote:

¹ The carbon cost of training GPT-4 has been estimated at 6912 metric tonnes of CO₂ equivalent emissions. Even if this is an over-estimate, it shows the scale of the problem.