NGINX robot access module

This NGINX module enforces the rules in robots.txt for web crawlers that choose to disregard those rules.

Regardless of the rules in robots.txt, the module always allows the path /robots.txt to be accessed. This gives web crawlers the option of obeying robots.txt. If any other paths should always be accessible, they should be explicitly allowed in robots.txt.
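As an illustration (the user agent and path here are hypothetical, not taken from this repository), a robots.txt like the following denies one crawler everything except a public area, which therefore remains reachable through this module:

# deny GPTBot everything except /public/, which stays accessible
User-agent: GPTBot
Allow: /public/
Disallow: /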

See the following instructions for how to build and configure the module.

Building

This module is written in Rust. After installing Rust, the module may be built using cargo, but it must be built against the version of NGINX that is in use.

For example, to build the module for NGINX version 1.22.1, issue the following command in the root directory of a clone of this repository:

NGX_VERSION=1.22.1 cargo build --release

This will build a shared library in target/release.
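Putting the steps together, a minimal build-and-install sequence might look like this (the destination directory /var/lib/ matches the load_module example in the next section; adjust it to your installation):

# check the running NGINX version so NGX_VERSION matches it
nginx -v
# build the module against that version
NGX_VERSION=1.22.1 cargo build --release
# copy the shared library somewhere NGINX can load it from
cp target/release/libnginx_robot_access.so /var/lib/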

Configuring

To enable this module, it must be loaded in the NGINX configuration, e.g.:

load_module /var/lib/libnginx_robot_access.so;

For this module to work correctly, the location of robots.txt must be set in the NGINX configuration using the robots_txt_path directive. The directive takes a single argument: the absolute file path of robots.txt, e.g.:

robots_txt_path /etc/robots.txt;

The directive may be specified in any of the http, server, or location configuration blocks. A setting in a location block overrides any setting in the enclosing server block, which in turn overrides any setting in the http block.
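For instance, in the following sketch (the file paths are illustrative), requests handled by location / use the http-level file, while /blog/ uses its own:

http {
    robots_txt_path /etc/robots.txt;              # default for all servers

    server {
        location / {
            # inherits /etc/robots.txt from the http block
        }

        location /blog/ {
            robots_txt_path /etc/robots-blog.txt; # overrides the http-level setting
        }
    }
}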

For example, here's a simple configuration that enables the module and sets the path to /etc/robots.txt:

load_module /var/lib/libnginx_robot_access.so;
...
http {
    ...
    server {
        ...
        location / {
            ...
            robots_txt_path /etc/robots.txt;
        }
        ...
    }
}

Validating

To make sure the module is working correctly, use curl to access your site and specify a user agent that your robots.txt file denies access to, e.g.:

curl -A "GPTBot" https://example.org
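For a clearer signal, compare a denied request with an ordinary one; assuming your robots.txt denies GPTBot, the first request should be rejected by the module while the second returns the page as normal (-i prints the response status line):

# request as a crawler that robots.txt denies; the module should reject this
curl -i -A "GPTBot" https://example.org
# request as an ordinary client for comparison; expect a normal response
curl -i https://example.org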

Debugging

Some debug logging is included in the module. To use this, enable debug logging in the NGINX configuration, e.g.:

error_log  logs/error.log debug;
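Note that NGINX only logs at the debug level if the binary was built with debug support; this can be checked as follows:

# prints "with-debug" if the running binary supports debug logging
nginx -V 2>&1 | grep -o with-debug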

Contributing

See the Contributor Guide if you'd like to submit changes.

Acknowledgements

Alternatives