Lulu

Solving Bandwidth Issues Caused by Web Crawlers with SafeLine WAF

1. Background

Automated bots and malicious web crawlers can consume a significant amount of network bandwidth by repeatedly accessing your site over extended periods. When you check your cloud server's management dashboard, you might notice that most of the traffic is concentrated on a few IP addresses. A straightforward solution to this problem is to limit the frequency of requests from these IP addresses.

However, rate-limiting IP addresses is typically not related to the business logic of your application, and developers are often reluctant to maintain an IP request frequency table themselves. Additionally, manually managing visitor information in a distributed or concurrent environment can be quite costly in terms of development effort.

This is where SafeLine WAF by Chaitin comes in. SafeLine offers a suite of features including rate limiting, port forwarding, and manual IP blacklisting/whitelisting, alongside its core functionality of defending against web attacks.

2. Installing SafeLine

bash -c "$(curl -fsSLk https://waf.chaitin.com/release/latest/setup.sh)"

For detailed instructions, refer to: https://docs.waf.chaitin.com/en/tutorials/install

3. Logging into SafeLine

Open the web console page https://<safeline-ip>:9443/ in your browser to reach the login screen.
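If you want to confirm from a script that the console is reachable before opening the browser, here is a minimal sketch; replace <safeline-ip> with your server's address, and note that verify=False is needed because the console ships with a self-signed certificate:

import requests
import urllib3

# Skip certificate verification (and silence the resulting warning),
# since the console uses a self-signed certificate out of the box.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

resp = requests.get("https://<safeline-ip>:9443/", verify=False)
print(resp.status_code)  # expect 200 once the console is up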


Get Administrator Account

docker exec safeline-mgt resetadmin

After the command executes successfully, you will see the following output:

[SafeLine] Initial username:admin
[SafeLine] Initial password:**********
[SafeLine] Done

Enter the username and password from the previous step, and you will be logged into SafeLine.


4. Configuring Your Site and Rate Limiting

4.1 SafeLine Site Configuration

SafeLine provides comprehensive site configuration options, including uploading a TLS certificate and private key and specifying multiple forwarding ports. This eliminates the need for developers to configure Nginx manually.


4.2 Configuring Rate Limiting

You can customize the blocking strategy according to your needs. A common recommendation is to set a limit of 100 requests per 10 seconds, with a block duration of 10 minutes.
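To sanity-check the threshold once it is configured, you can count how many requests get through before blocking kicks in. A rough sketch, assuming your protected site answers at http://a.com (the test domain used in section 5.1 below):

import requests

# Count successful responses until SafeLine starts blocking.
# http://a.com stands in for the protected site from section 4.1.
headers = {"User-Agent": "Mozilla/5.0"}
for count in range(150):
    resp = requests.get("http://a.com/hello?a=x", headers=headers)
    if resp.status_code != 200:
        print(f"blocked after {count} successful requests (HTTP {resp.status_code})")
        break
else:
    print("no block triggered within 150 requests")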


Note: If you're testing or encounter false positives, you can manually lift the block.

5. Testing and Additional Considerations

5.1 Testing

For testing, we set up a simple server that exposes a /hello endpoint which echoes the "a" query parameter back as JSON.
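The server code isn't part of the original setup notes, but a minimal sketch with Flask (the /hello route and JSON echo match the output shown below; the framework and port are assumptions) could look like this:

from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/hello")
def hello():
    # Echo the "a" query parameter back as JSON, e.g. {"a": "u"}
    return jsonify(a=request.args.get("a"))


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)  # port 8000 is an assumption

With the server running behind SafeLine, here's a basic Python script for simulating a web crawler against the endpoint: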

import random

import requests


def send_request(url, request_method="GET", header=None, data=None):
    """Send a single HTTP request, returning None on failure."""
    try:
        if header is None:
            header = {"User-Agent": "Mozilla/5.0"}
        return requests.request(request_method, url, headers=header, data=data)
    except Exception as err:
        print(err)
    return None


if __name__ == '__main__':
    for i in range(100):
        # Pick a random lowercase letter as the value of the "a" parameter.
        letter = random.choice('abcdefghijklmnopqrstuvwxyz')
        resp = send_request("http://a.com/hello?a=" + letter)
        if resp is not None:
            print(resp.content)

Output Example:

b'{"a":"u"}'
b'{"a":"m"}'
b'{"a":"y"}'
b'{"a":"o"}'
b'<!DOCTYPE html>\n\n<html lang="zh">\n  <head>\n .... # followed by a long HTML text

At this point, if you try to access the page again, you'll find that it has been blocked.


5.2 What if the Crawler Fakes the X-Forwarded-For Header?

Some crawlers are sneaky and may forge the X-Forwarded-For header to disguise their source IP. To counter this, SafeLine lets you configure how the source IP is determined: go to 'Proxy Setting' -> 'Get Attack IP From' and select 'Socket Connection'. The IP is then taken directly from the TCP connection rather than from request headers.
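To verify the setting, you can replay the crawler traffic with a forged header; a short sketch (the spoofed address 1.2.3.4 is arbitrary):

import requests

# Forge X-Forwarded-For. With "Socket Connection" selected, SafeLine
# ignores this header and rate-limits the real TCP source IP, so the
# block still triggers.
headers = {
    "User-Agent": "Mozilla/5.0",
    "X-Forwarded-For": "1.2.3.4",  # arbitrary spoofed address
}
for i in range(100):
    resp = requests.get("http://a.com/hello?a=x", headers=headers)
    print(resp.status_code)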


5.3 What if the Crawler Fakes the TCP Source IP, Too?

If a crawler spoofs the source IP in the TCP header itself, the TCP handshake cannot complete, because the server's SYN-ACK is sent to the spoofed address. No HTTP request can be carried over the broken connection, so the crawler loses its ability to scrape data and the connection is simply dropped by Nginx.

