Introduction
============

Crawler Rate Limit allows you to limit requests performed by web crawlers, bots,
and spiders. It can also rate limit regular traffic, and block requests based on
autonomous system number (ASN).

**Features**
- rate limits web crawlers, bots and spiders
- rate limits regular traffic (human visitors and bots not openly
  identifying as bots) at the visitor level (IP address + User-Agent string)
  and / or [autonomous system](https://en.wikipedia.org/wiki/Autonomous_system_(Internet))
  level
- blocks traffic at the ASN-level
- number of allowed requests in a given time interval (limit) is configurable
- limits bot requests based on the User-Agent string - if the same crawler uses
  multiple IP addresses, all those requests will count towards the single limit
- each type of rate limiting can be configured independently from the other
  types
- minimal effect on performance
  - rate limiting and blocking is performed in a very early phase of request
    handling
  - uses APCu, Redis or Memcached as rate limiter backend

There are a number of crawlers, bots, and spiders that are known for excessive
crawling of websites. Such requests increase the server load and often affect
performance of the website for regular visitors. In extreme cases they can even
bring the site down.

This module detects if the request is made by a crawler/bot/spider by
inspecting the User-Agent HTTP header and then limits the number of requests
the crawler is allowed to perform in a given time interval. Once the limit is
reached, the server will respond with HTTP code 429 (Too Many Requests). The
crawler is unblocked at the end of the time interval and a new cycle begins.
If the crawler exceeds the rate limit again, it will again be blocked for the
duration of the same time interval.

Other types of rate limiting operate on the same principle.
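
The cycle described above can be sketched as a fixed-window counter keyed by
User-Agent string. This is only an illustration of the documented behavior, not
the module's actual code (the module is written in PHP and keeps its counters in
the configured backend). The function and variable names are hypothetical, and
the script assumes Bash 4 or newer for associative arrays.

```bash
#!/usr/bin/env bash
# Illustrative fixed-window limiter. All names here are hypothetical.
declare -A window_start  # key -> timestamp at which the current window began
declare -A hits          # key -> requests counted in the current window

# check_request KEY NOW LIMIT INTERVAL
# Stores the resulting HTTP status (200 or 429) in the global status_code.
check_request() {
  local key=$1 now=$2 limit=$3 interval=$4
  # Open a new window on the first request or once the old window has expired.
  if [[ -z ${window_start[$key]:-} ]] || (( now - ${window_start[$key]} >= interval )); then
    window_start[$key]=$now
    hits[$key]=0
  fi
  hits[$key]=$(( ${hits[$key]} + 1 ))
  if (( ${hits[$key]} > limit )); then
    status_code=429
  else
    status_code=200
  fi
}

# Example: a limit of 2 requests per 600 seconds for one crawler.
for now in 0 1 2 600; do
  check_request "Bytespider" "$now" 2 600
  echo "t=$now -> $status_code"
done
```

With these sample values the third request is rejected with 429, and the request
at `t=600` is served again because a new window has started.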

Limitations
===========

This module can't protect against DDoS attacks. Blocking and rate limiting
are effective only if your web server can actually handle the volume of traffic
it receives. Once the server is overloaded with requests, it will start failing
(dropping requests) and Drupal will never get a chance to process requests and
perform rate limiting or blocking.

However, rate limiting and blocking may help your server handle a much larger
number of requests by significantly reducing the time required to process a
single request. A request that is either blocked (403) or rate limited (429) by
the Crawler Rate Limit module can be processed up to 10 times faster than a
regular Drupal request that returns HTTP code 200.

Requirements
============

Dependencies installed along with the module:
- jaybizzle/crawler-detect - https://github.com/jaybizzle/crawler-detect
- nikolaposa/rate-limit - https://github.com/nikolaposa/rate-limit

Each backend has additional dependencies which must be installed in addition to
the module in order to use that particular backend.

**Redis backend**
- Either Redis PECL extension (https://pecl.php.net/package/redis) or Predis PHP
  package (https://github.com/predis/predis)
- Redis Drupal module - https://www.drupal.org/project/redis

**APCu backend**
- APCu PECL extension - https://pecl.php.net/package/APCu

**Memcached backend**
- Memcached PECL extension - https://pecl.php.net/package/memcached
- Memcache Drupal module - https://www.drupal.org/project/memcache

**Important:** Crawler Rate Limit does not support the Memcache PECL extension.
Only the Memcached PECL extension is supported.

**Autonomous system (ASN)-level limiting (optional)**
- geoip2/geoip2 composer package - https://packagist.org/packages/geoip2/geoip2
- GeoLite2/GeoIP2 binary ASN Database - https://dev.maxmind.com/geoip/docs/databases/asn#binary-database
- Cron task or GeoIP Update to keep the ASN database up-to-date - https://dev.maxmind.com/geoip/updating-databases

Installation and configuration
==============================

First install dependencies for your backend. Follow the steps covering your
backend of choice.

APCu backend
------------

1. Install APCu PECL extension by following instructions provided for your
   operating system.

Redis backend
-------------

1. Install Redis PECL extension by following instructions provided for your
   operating system or install Predis PHP package via composer. It's sufficient
   to install one or the other.

2. Install Drupal Redis module

3. Configure Redis module. Crawler Rate Limit requires only minimum Redis
   configuration (host and port). It's not necessary to configure cache, locking
   or any other Redis backend.

    ```php
    // Example Redis configuration.
    $settings['redis.connection']['host'] = '127.0.0.1';
    $settings['redis.connection']['port'] = '6379';
    ```

Memcached backend
-----------------

1. Install Memcached PECL extension by following instructions provided for your
   operating system.

2. Install Drupal Memcache module

3. Configure Memcache module. Crawler Rate Limit requires only minimum Memcache
   configuration (servers and bins). It's not necessary to configure cache,
   locking or any other Memcache backend.

    ```php
    // Example Memcache configuration.
    $settings['memcache']['servers'] = ['127.0.0.1:11211' => 'default'];
    $settings['memcache']['bins'] = ['default' => 'default'];
    $settings['memcache']['key_prefix'] = '';
    ```

Autonomous system (ASN)-level limiting
--------------------------------------

_Proceed with this section only if you intend to enforce ASN-based regular
traffic limits or ASN-based request blocking._

1. Install `geoip2/geoip2`

   ```sh
   composer require geoip2/geoip2
   ```

2. Place a copy of the GeoLite2/GeoIP2 binary ASN Database into a location
   on the server that is accessible by Drupal
   1. Register for a free or paid MaxMind.com account
   2. Download the GeoLite2 (free) or GeoIP2 (paid) binary ASN Database
   3. Upload to server.

3. Schedule a cron task or GeoIP Update to keep the ASN database up-to-date
   (IP address <-> ASN data changes over time)
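
As a rough sketch of step 3, the snippet below shows one possible crontab entry
and a small freshness check you could run from monitoring. It assumes MaxMind's
`geoipupdate` tool is installed and already configured with your account
details. The schedule, the paths, the 14-day threshold and the
`asn_db_is_stale` helper are all assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Example crontab entry (assumes geoipupdate is installed and its
# configuration file already holds your MaxMind account details):
#   30 3 * * 3 /usr/bin/geoipupdate -d /var/www/example.com/private/geoip2

# asn_db_is_stale FILE MAX_AGE_DAYS
# Succeeds (exit 0) if FILE is missing or its age, as measured by
# `find -mtime`, exceeds MAX_AGE_DAYS. Hypothetical helper, not part of
# the module.
asn_db_is_stale() {
  local file=$1 max_age_days=$2
  [ ! -f "$file" ] && return 0
  [ -n "$(find "$file" -mtime +"$max_age_days")" ]
}

db=/var/www/example.com/private/geoip2/GeoLite2-ASN.mmdb
if asn_db_is_stale "$db" 14; then
  echo "ASN database is missing or older than 14 days: $db"
fi
```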


Once the backend (and ASN, if using) dependencies are in place, you can install
Crawler Rate Limit.

Install Crawler Rate Limit module
---------------------------------

1. Crawler Rate Limit manages its dependencies via composer. Just copying the
   module into the modules folder won't work.

    ```sh
    cd /root/of/your/drupal/project
    composer require drupal/crawler_rate_limit
    drush en crawler_rate_limit
    ```

2. Configure Crawler Rate Limit. Add the following snippet to your
   `settings.php` file. Make sure to adjust all the values to match your
   website's traffic. Feel free to omit/delete all optional sections that you
   don't intend to use.

    ```php
    /**
     * The configuration below uses the Redis backend and will limit each
     * crawler / bot (identified by User-Agent string) to a maximum of 100
     * requests every 600 seconds.
     *
     * Regular traffic (human visitors and bots not openly identifying as bots)
     * will be limited to a maximum of 300 requests per visitor
     * (identified by IP address + User-Agent string) every 600 seconds.
     *
     * Regular traffic will additionally be limited at the ASN-level to a
     * maximum of 600 requests per ASN every 600 seconds.
     *
     * @see https://en.wikipedia.org/wiki/Autonomous_system_(Internet)
     */

    /**
     * Enable or disable rate limiting. Required.
     *
     * If set to FALSE, all the module's functionality will be entirely disabled
     * regardless of all the other settings below.
     */
    $settings['crawler_rate_limit.settings']['enabled'] = TRUE;

    /**
     * Define which backend to use. Required.
     *
     * A supported and properly configured backend is necessary for normal
     * operation of the module. If the backend is not set, all the module's
     * functionality will be disabled.
     *
     * Supported backends: redis, memcached, apcu.
     */
    $settings['crawler_rate_limit.settings']['backend'] = 'redis';


    /**
     * Limit for crawler / bot traffic (visitors that openly identify as
     * crawlers / bots). Optional. Omit to disable.
     *
     * Note: If this section is omitted (undefined), bot traffic will be treated
     * in the same way as regular traffic.
     */
    $settings['crawler_rate_limit.settings']['bot_traffic'] = [
      // Time interval in seconds. Must be whole number greater than zero.
      'interval' => 600,
      // Number of requests allowed in the given time interval per crawler or
      // bot (identified by User-Agent string). Must be a whole number greater
      // than zero.
      'requests' => 100,
    ];

    /**
     * Limits for regular website traffic (visitors that don't openly identify
     * as crawlers / bots). Optional. Omit to disable.
     *
     * Visitor-level (IP address + User-Agent string) regular traffic rate
     * limit.
     */
    $settings['crawler_rate_limit.settings']['regular_traffic'] = [
      // Time interval in seconds. Must be whole number greater than zero.
      'interval' => 600,
      // Number of requests allowed in the given time interval per regular
      // visitor (identified by combination of IP address + User-Agent string).
      'requests' => 300,
    ];

    /**
     * Autonomous system-level (ASN) regular traffic rate limit. Optional. Omit
     * to disable.
     *
     * Useful if the following two conditions are met:
     *   1. Unwanted traffic is coming from a large number of different IP
     *      addresses and rate limiting based on the IP address and User Agent
     *      is not effective.
     *   2. Small number of distinct ASNs are identified as origin of this
     *      traffic (by obtaining ASN numbers for each of the unwanted IP
     *      addresses).
     *
     * Requires geoip2/geoip2 package and associated ASN Database.
     *
     * @see https://github.com/maxmind/GeoIP2-php
     * @see https://dev.maxmind.com/geoip/docs/databases/asn#binary-database
     */
    $settings['crawler_rate_limit.settings']['regular_traffic_asn'] = [
      // Time interval in seconds. Must be whole number greater than zero.
      'interval' => 600,
      // Number of requests allowed in the given time interval per autonomous
      // system number (ASN).
      'requests' => 600,
      // Path to the local ASN Database file. Must be an up-to-date,
      // GeoLite2/GeoIP2 binary ASN Database. Consider updating automatically
      // via GeoIP Update or cron.
      // @see https://dev.maxmind.com/geoip/updating-databases
      // Note that the database path is also required by ASN blocking feature.
      'database' => '/var/www/example.com/private/geoip2/GeoLite2-ASN.mmdb',
    ];

    /**
     * Allow specified IP addresses to bypass rate limiting. Optional.
     *
     * Useful if your website is maintained by a number of users all accessing
     * the site from the same location, using the same browsers, and at the same
     * time.
     *
     * Allowlist can contain:
     *   - IPv4 addresses or subnets in CIDR notation
     *   - IPv6 addresses or subnets in CIDR notation
     *
     * Default value: empty array.
     *
     * Sample configuration to allow all the traffic on the local network.
     *
     * @code
     * $settings['crawler_rate_limit.settings']['ip_address_allowlist'] = [
     *   '127.0.0.1',
     *   '10.0.0.0/8',
     *   '192.168.1.0/24',
     * ];
     * @endcode
     */
    $settings['crawler_rate_limit.settings']['ip_address_allowlist'] = [];

    /**
     * List of ASNs that should be blocked. Optional.
     *
     * All requests coming from IP addresses belonging to the ASNs on this list
     * will be blocked. Server will respond with HTTP code 403. Useful as a
     * drastic, and ideally temporary measure if ASN can be identified as origin
     * of exclusively unwanted traffic.
     *
     * Requires geoip2/geoip2 package and associated ASN Database. Make sure
     * that ASN database path is configured correctly if you want to use ASN
     * blocking.
     *
     * Note that blocking takes precedence over rate limiting and allowlist. If
     * request comes from the IP address belonging to the ASN found on the
     * blocklist, server will immediately return 403 response. Rate limiting
     * settings and allowlist will not be considered.
     *
     * Caution: Autonomous Systems are large networks. Carefully analyze and
     * understand your website traffic in order to make sure that blocking an
     * ASN won't block genuine visitor traffic.
     *
     * Sample list blocking 3 ASNs taken from the Spamhaus DROP list.
     * @see: https://www.spamhaus.org/blocklists/do-not-route-or-peer/
     *
     * @code
     * $settings['crawler_rate_limit.settings']['asn_blocklist'] = [
     *   24567,
     *   202469,
     *   401616,
     * ];
     * @endcode
     */
    $settings['crawler_rate_limit.settings']['asn_blocklist'] = [];
    ```

    - Make sure to adjust the value for `backend` to match the one you decided
      to use.
    - Make sure to adjust the values for `interval` and `requests` according
      to the capabilities of your server and level of the traffic your website
      receives. You may need to review your server logs in order to determine
      optimal values for your server.
    - If planning to use ASN-based rate limiting or blocking, make sure to
      adjust the path to ASN database to match the one on your server.

3. Visit the Status report page (/admin/reports/status) and confirm that Crawler
   Rate Limit is enabled and doesn't report any errors.

Test your installation
======================

To validate that rate limiting is working, even on setups with edge
implementations, you can use curl to quickly hit the site multiple times. Keep
in mind that the speed of each request depends on your connectivity to the site,
but in most cases you can expect about 1 request per second.

To test bot rate limiting, run the following:

```bash
for i in $(seq 1 101); do curl -A "Bytespider" -skLI "https://example.com/?i=$i" | head -1; done
```

Change 101 to one more than your rate limit, and https://example.com/ to the URL
of your website where Crawler Rate Limit is installed. If your first requests
are not `HTTP/2 200`, ensure you don't have "Bytespider" blocked by other means,
or try a different bot User-Agent. Your last response should be `HTTP/2 429`.

If you want to validate that the functionality is working across IP addresses,
rerun the code above from another machine with a different IP address. You
should receive `HTTP/2 429` from the other machine as long as you do so within
the configured time interval and use the same User-Agent.

To validate rate limiting is working for visitor-level regular traffic, do the
same as above, but change your User-Agent from `"Bytespider"` to
`"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"`.
The difference here is if you try this same code from another machine with a
different IP address, each machine will be rate-limited independently.

To validate rate limiting is working for ASN-level regular traffic (across
multiple IPs and user agent strings), run the code from two or more machines
under a single ASN, each with a different IP address and non-bot user agent
string like `"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"`.
Once the ASN-based regular traffic limit is reached, all machines under that ASN
should begin receiving `HTTP/2 429` responses until the configured time limit
expires.
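
For longer test runs it can be easier to tally the status codes than to read
them line by line. The sketch below uses curl's `-w '%{http_code}'` write-out
variable, which reports the status of the final response when redirects are
followed. The `probe` and `summarize_codes` helpers are hypothetical names
introduced here for illustration.

```bash
#!/usr/bin/env bash
# probe URL COUNT USER_AGENT
# Sends COUNT HEAD requests and prints one HTTP status code per line.
probe() {
  local url=$1 count=$2 agent=$3 i
  for i in $(seq 1 "$count"); do
    curl -A "$agent" -skLI -o /dev/null -w '%{http_code}\n' "${url}?i=${i}"
  done
}

# summarize_codes
# Reads status codes on stdin and prints "CODE COUNT" pairs, one per line.
summarize_codes() {
  sort | uniq -c | awk '{print $2, $1}'
}

# Usage (replace the URL and count with your own values):
#   probe "https://example.com/" 101 "Bytespider" | summarize_codes
```

With a bot limit of 100 and nothing else interfering, you would expect the
summary to show 100 responses with code 200 and a single 429.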

Logging of rate limited requests
================================

Crawler Rate Limit does not implement database logging in the interest of
performance. Instead, rate limited and blocked requests can be reviewed in your
web server log (usually `access.log`).

Considering that HTTP code 429 is most likely not used by any other services on
your server, rate limited requests can easily be identified by searching for
response code 429.

Blocked requests return HTTP code 403, which is also used by Drupal. To
distinguish requests blocked by Crawler Rate Limit, you will also need to look
at the response size. The Crawler Rate Limit response contains only the single
word "Blocked.", so its size will be very small (likely 28 bytes), while
Drupal's regular 403 page will be at least several kilobytes.

Here are some bash one-liners you can use as a starting point for making your
own. Make sure to adjust the details to match the format of your `access.log`
file (date format, order of fields).

```bash
# Total number of rate-limited requests on 15/Jun/2025.
grep --text "15/Jun/2025" /var/log/access.log | awk '$9 == 429' | wc -l
grep --text "15/Jun/2025" /var/log/access.log | grep -c "\" 429 "

# Rate-limited requests by IP address on 15/Jun/2025.
grep --text "15/Jun/2025" /var/log/access.log | awk '$9 == 429 {print $1}' | sort | uniq -c | sort -rn
grep --text "15/Jun/2025" /var/log/access.log | grep "\" 429 " | cut -d " " -f 1 | sort | uniq -c | sort -rn

# List of rate-limited IP addresses that can be pasted into your ASN lookup tool
# of choice. E.g. https://hackertarget.com/as-ip-lookup/
grep --text "15/Jun/2025" /var/log/access.log | awk '$9 == 429 {print $1}' | sort | uniq

# Rate-limited requests by User-Agent string on 15/Jun/2025.
grep --text "15/Jun/2025" /var/log/access.log | awk '$9 == 429' | cut -d '"' -f 6 | sort | uniq -c | sort -rn
grep --text "15/Jun/2025" /var/log/access.log | grep "\" 429 " | cut -d '"' -f 6 | sort | uniq -c | sort -rn

# Number of requests per HTTP response code.
# Columns: "Number of requests", "HTTP response code".
grep "15/Jun/2025" /var/log/access.log | awk '{print $9}' | sort | uniq -c | sort -rn

# Number of requests per hour on a given date.
for i in $(seq -f "%02g" 0 23); do grep "15/Jun/2025:$i:" /var/log/access.log | wc -l; done

# Number of rate-limited requests per hour on a given date.
# Columns: "Hour in a day (24-hour format)", "Number of rate-limited requests".
for i in $(seq -f "%02g" 0 23); do grep "15/Jun/2025:$i:" /var/log/access.log | awk '$9 == 429' | printf "%02d %5d\n" ${i#0} $(wc -l); done
for i in $(seq -f "%02g" 0 23); do printf "%02d" ${i#0}; grep "15/Jun/2025:$i:" /var/log/access.log | awk '{if ($9 == 429) {limited++} } END { printf("%5d\n", limited) } '; done

# Number of requests per minute within given hour.
# This example uses 17th hour (5 pm) on 15/Jun/2025. Period from 17:00 to 17:59.
# Columns: "Minute in an hour", "Number of requests".
for i in $(seq -f "%02g" 0 59); do count=$(grep "15/Jun/2025:17:$i:" /var/log/access.log | wc -l); printf "%02d % 5d\n" ${i#0} $count; done

# Print total, served and rate-limited number of requests per hour on a given
# date (15/Jun/2025).
# Columns:
#   - Hour in a day (in 24-hour format)
#   - Total number of requests
#   - Number of allowed/served requests
#   - Number of rate-limited requests.
for i in $(seq -f "%02g" 0 23); do printf "%02d" ${i#0}; grep "15/Jun/2025:$i:" /var/log/access.log | awk '{if ($9 == 429) {limited++} else {served++}} END { printf("%5d %5d %5d\n", NR, served, limited) }'; done

# Print total, served and rate-limited number of requests for each minute in a
# given hour.
# This example uses 17th hour (5 pm) on 15/Jun/2025. Period from 17:00 to 17:59.
# Columns:
#   - Minute in an hour
#   - Total number of requests
#   - Number of allowed/served requests
#   - Number of rate-limited requests.
for i in $(seq -f "%02g" 0 59); do printf "%02d" ${i#0}; grep "15/Jun/2025:17:$i:" /var/log/access.log | awk '{if ($9 == 429) {limited++} else {served++}} END { printf("%5d %5d %5d\n", NR, served, limited) }'; done
```
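
Blocked requests (403), as noted earlier in this section, have to be identified
by status code and response size together. The sketch below assumes the common
"combined" log format, where field 9 is the status code and field 10 is the
response size in bytes. The `blocked_by_ip` helper and its 100-byte default
cut-off are assumptions, so adjust both to match your own logs.

```bash
#!/usr/bin/env bash
# blocked_by_ip LOGFILE DATE [MAX_BYTES]
# Prints "COUNT IP" for 403 responses on DATE whose size is at most MAX_BYTES
# (default 100, an assumed cut-off well above the tiny "Blocked." response).
blocked_by_ip() {
  local log=$1 date=$2 max_bytes=${3:-100}
  grep --text "$date" "$log" \
    | awk -v max="$max_bytes" \
        '$9 == 403 && $10 ~ /^[0-9]+$/ && $10 + 0 <= max + 0 {print $1}' \
    | sort | uniq -c | sort -rn
}

# Usage:
#   blocked_by_ip /var/log/access.log "15/Jun/2025"
```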

How to update to version 3
==========================

Updating to version 3 from previous versions of the module requires some changes
to the module's settings.

1. Define `['backend']` key and set its value to "redis".
2. Move `['interval']` to `['bot_traffic']['interval']`.
3. Move/rename `['operations']` to `['bot_traffic']['requests']`.

settings.php for version 1 or 2:
```php
$settings['crawler_rate_limit.settings']['enabled'] = TRUE;
$settings['crawler_rate_limit.settings']['operations'] = 100;
$settings['crawler_rate_limit.settings']['interval'] = 600;
```

Equivalent settings.php for version 3:
```php
$settings['crawler_rate_limit.settings']['enabled'] = TRUE;
$settings['crawler_rate_limit.settings']['backend'] = 'redis';
$settings['crawler_rate_limit.settings']['bot_traffic'] = [
  'requests' => 100,
  'interval' => 600,
];
```
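
To check whether a settings file still contains the old top-level keys, a simple
search like the one below may help. The `find_legacy_keys` helper is a
hypothetical name, and the path in the usage comment is an assumption about your
project layout.

```bash
#!/usr/bin/env bash
# find_legacy_keys FILE
# Prints (with line numbers) any top-level 'operations' or 'interval' keys
# left over from version 1 or 2. Exits non-zero when none are found.
find_legacy_keys() {
  grep -nE "\['crawler_rate_limit\.settings'\]\['(operations|interval)'\]" "$1"
}

# Usage (the settings.php path is an assumption; adjust to your project):
#   find_legacy_keys web/sites/default/settings.php
```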

Maintainers
===========

- Vojislav Jovanović (vaish) - https://www.drupal.org/u/vaish


Supporting organizations:
- Greentech Renewables - https://www.drupal.org/greentech-renewables
