Lulu

Posted on Aug 16

Solving Bandwidth Issues Caused by Web Crawlers with SafeLine WAF

#devops #opensource #cybersecurity #webdev

1. Background

Automated bots and malicious web crawlers can consume a significant amount of network bandwidth by repeatedly accessing your site over extended periods. When you check your cloud server's management dashboard, you might notice that most of the traffic is concentrated on a few IP addresses. A straightforward solution to this problem is to limit the frequency of requests from these IP addresses.
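To confirm which addresses are responsible before you configure anything, you can count requests per client IP in your access log. The sketch below is only an illustration: the function name `top_talkers` and the sample lines are hypothetical, and it assumes each log line starts with the client IP, as in the common/combined access log format.

```python
from collections import Counter


def top_talkers(log_lines, n=3):
    """Return the n client IPs with the most requests as (ip, count) pairs.

    Assumption: every non-empty line begins with the client IP followed
    by a space (common/combined access log format).
    """
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        ip = line.split(" ", 1)[0]
        counts[ip] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    # Hypothetical sample data, not taken from a real server.
    sample = [
        '203.0.113.7 - - [01/Jan/2024:10:00:00 +0000] "GET /hello?a=u HTTP/1.1" 200 9',
        '203.0.113.7 - - [01/Jan/2024:10:00:01 +0000] "GET /hello?a=m HTTP/1.1" 200 9',
        '198.51.100.2 - - [01/Jan/2024:10:00:02 +0000] "GET / HTTP/1.1" 200 512',
    ]
    for ip, hits in top_talkers(sample):
        print(ip, hits)
```

This counts requests rather than bytes, so it is a rough proxy for bandwidth; it is usually enough to spot a handful of addresses that dominate the log.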

However, rate-limiting IP addresses is typically not related to the business logic of your application, and developers are often reluctant to maintain an IP request frequency table themselves. Additionally, manually managing visitor information in a distributed or concurrent environment can be quite costly in terms of development effort.

This is where SafeLine WAF by Chaitin comes in. SafeLine offers a suite of features including rate limiting, port forwarding, and manual IP blacklisting/whitelisting, alongside its core functionality of defending against web attacks.

2. Installing SafeLine

bash -c "$(curl -fsSLk https://waf.chaitin.com/release/latest/setup.sh)"

For detailed instructions, refer to: https://docs.waf.chaitin.com/en/tutorials/install

3. Logging into SafeLine

Open the web console at https://<safeline-ip>:9443/ in your browser.

Get Administrator Account

docker exec safeline-mgt resetadmin

After the command executes successfully, you will see the following output:

[SafeLine] Initial username：admin
[SafeLine] Initial password：**********
[SafeLine] Done

Enter the password from the previous step to log in to SafeLine.

4. Configuring Your Site and Rate Limiting

4.1 SafeLine Site Configuration

SafeLine provides comprehensive site configuration options, including uploading TLS certificates and private keys and specifying multiple forwarding ports. This eliminates the need for developers to configure Nginx manually.

4.2 Configuring Rate Limiting

You can customize the blocking strategy according to your needs. A common recommendation is to set a limit of 100 requests per 10 seconds, with a block duration of 10 minutes.
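To make the semantics of such a policy concrete, here is a minimal in-memory sketch of "at most N requests per window, then block for a fixed duration". It is not SafeLine's implementation; the class name `RateLimiter` and its parameters are hypothetical, and it only works within a single process, which is exactly the limitation that makes a dedicated WAF attractive.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Sliding-window limiter with a temporary block (illustrative only)."""

    def __init__(self, max_requests=100, window_seconds=10,
                 block_seconds=600, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.block_seconds = block_seconds
        self.clock = clock                   # injectable for testing
        self._hits = defaultdict(deque)      # ip -> timestamps in the window
        self._blocked_until = {}             # ip -> time the block expires

    def allow(self, ip):
        """Return True if the request is allowed, False if it is blocked."""
        now = self.clock()
        until = self._blocked_until.get(ip)
        if until is not None:
            if now < until:
                return False
            del self._blocked_until[ip]      # block has expired
        hits = self._hits[ip]
        # Discard timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        hits.append(now)
        if len(hits) > self.max_requests:
            # Limit exceeded: start a block and reset the window.
            self._blocked_until[ip] = now + self.block_seconds
            hits.clear()
            return False
        return True

    def unblock(self, ip):
        """Manually lift a block, e.g. after a false positive."""
        self._blocked_until.pop(ip, None)
```

With the defaults above, the 101st request from one IP inside 10 seconds is rejected, and so is every request from that IP for the next 10 minutes unless `unblock` is called.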

Note: If you're testing or encounter false positives, you can manually lift the block.

5. Testing and Additional Considerations

5.1 Testing

For testing, we set up a simple server with a /hello endpoint that echoes its "a" query parameter back as JSON. Here's a basic Python script that simulates a crawler hitting that endpoint:

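import random
import string

import requests


def send_request(url, request_method="GET", header=None, data=None):
    """Send one HTTP request; return the response, or None on failure."""
    if header is None:
        header = {"User-Agent": "Mozilla/5.0"}
    try:
        return requests.request(request_method, url, headers=header,
                                data=data, timeout=10)
    except requests.RequestException as err:
        print(err)
        return None


if __name__ == '__main__':
    # Hit the test endpoint 100 times with a random single-letter parameter.
    for _ in range(100):
        letter = random.choice(string.ascii_lowercase)
        resp = send_request("http://a.com/hello?a=" + letter)
        if resp is not None:
            print(resp.content)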

Output Example:

b'{"a":"u"}'
b'{"a":"m"}'
b'{"a":"y"}'
b'{"a":"o"}'
b'<!DOCTYPE html>\n\n<html lang="zh">\n  <head>\n .... # followed by a long HTML text

Once the limit is exceeded, the JSON responses give way to an HTML page. At this point, if you try to access the site again, you'll find that it has been blocked.
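If you want the test to report when the block kicks in, one option is to check whether each response body still parses as JSON. The sketch below is a heuristic based only on the output shown above (JSON for normal responses, HTML afterwards); the helper names `looks_blocked` and `first_blocked_index` are hypothetical, and it makes no assumption about the status code SafeLine returns.

```python
import json


def looks_blocked(body: bytes) -> bool:
    """Heuristic: the test endpoint returns JSON, so anything that does
    not parse as JSON is treated as a block page."""
    try:
        json.loads(body.decode("utf-8"))
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return True
    return False


def first_blocked_index(bodies):
    """Return the index of the first blocked response, or None."""
    for i, body in enumerate(bodies):
        if looks_blocked(body):
            return i
    return None


if __name__ == "__main__":
    bodies = [b'{"a":"u"}', b'{"a":"m"}', b'<!DOCTYPE html>\n<html lang="zh">']
    print(first_blocked_index(bodies))
```

In the crawler script, you could collect `resp.content` for each request and pass the list to `first_blocked_index` to see how many requests got through.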

5.2 What if the Crawler Fakes the X-Forwarded-For Header?

Some crawlers are sneaky and might fake the X-Forwarded-For header. To counter this, SafeLine lets you configure how the source IP is determined. Go to 'Proxy Setting' -> 'Get Attack IP From' and select 'Socket Connection'. This ensures that the IP is taken directly from the TCP connection rather than from a header the client controls.
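The difference between the two approaches can be shown in a few lines. This is a sketch of the general idea, not SafeLine's code: `resolve_client_ip` and its parameters are hypothetical, and `headers` is assumed to be a plain dict with the exact key "X-Forwarded-For".

```python
def resolve_client_ip(headers, peer_ip, trust_forwarded_for=False):
    """Pick the client IP used for rate limiting.

    headers: dict of request headers (hypothetical, exact-case keys).
    peer_ip: address of the TCP peer, as reported by the socket.
    trust_forwarded_for: if True, prefer the first X-Forwarded-For entry,
        which the client can set to any value it likes.
    """
    if trust_forwarded_for:
        forwarded = headers.get("X-Forwarded-For", "")
        first = forwarded.split(",")[0].strip()
        if first:
            return first
    return peer_ip


if __name__ == "__main__":
    spoofed = {"X-Forwarded-For": "198.51.100.1"}
    # Trusting the header: the crawler chooses its own identity.
    print(resolve_client_ip(spoofed, "203.0.113.7", trust_forwarded_for=True))
    # Using the socket address: the header is ignored.
    print(resolve_client_ip(spoofed, "203.0.113.7"))
```

A crawler that rotates the header value on every request looks like a new visitor each time when the header is trusted, so per-IP limits never trigger. Note that if the WAF itself sits behind another proxy or CDN, the socket address is that proxy's address, so the right setting depends on your deployment.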

What if the crawler fakes the TCP Source IP field too?

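If a crawler forges the source IP in its TCP packets, the TCP three-way handshake that HTTP relies on cannot complete, because the server's replies go to the forged address instead of back to the crawler. Without an established connection, no HTTP request is ever delivered, so the crawler loses its ability to scrape data.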
