Advanced Bot Protection You Must Know
This article is written by one of the developers of SafeLine WAF.
Recently, while chatting in the SafeLine Discord community, I noticed that many of our users, both domestic and international, have enabled SafeLine WAF's anti-bot feature on their websites.
I am very happy to see this, as it indicates that anti-bot protection is becoming a more sought-after feature. Our guess is that the rise of AI has driven a significant increase in web bots, scrapers, and crawlers.
AI chatbots like ChatGPT have made things convenient for us; I personally often use these tools to ask technical questions. But although the AI answers well, most of its answers are derived from various technical forums, embodying the idea of "taking your data and stealing your traffic."
After discussing this with friends who run technical communities on their web applications, we confirmed our suspicions.
"We were being crawled before, but with the rise of AI, bots and crawlers have flooded our website," said the CEO of a large forum.
Traditional Bot Protection
Typically, placing a robots.txt file in the root directory of the website can instruct crawlers which links can be crawled and which cannot. However, 99% of crawlers do not adhere to the robots protocol.
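To make this concrete, here is a minimal sketch using Python's standard-library urllib.robotparser. The rules, paths, and bot names are hypothetical; the point is that robots.txt is purely advisory, and only crawlers that choose to check it are bound by it.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt; the paths and bot names are for illustration only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler checks every URL before fetching it...
print(parser.can_fetch("GPTBot", "https://example.com/posts/1"))   # False
print(parser.can_fetch("SomeBot", "https://example.com/posts/1"))  # True
print(parser.can_fetch("SomeBot", "https://example.com/admin/x"))  # False
# ...but nothing technically stops a rude crawler from ignoring the file.
```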
We're not going to talk about legal or compliance measures for bot protection; we only talk about technical anti-bot methods. There are five traditional anti-bot methods generally used to prevent website crawling (a minimal server-side sketch follows the list):
- Check User-Agent: Block common crawlers based on the User-Agent header.
- Check Referer: Block requests with an invalid Referer header.
- Limit Access Frequency: Block IPs that exceed access limits.
- Check Cookies: Issue cookies to authenticated legitimate users, blocking users with invalid cookies.
- JS Dynamic Rendering: Encrypt the JS code that renders key content and generate that content dynamically on the client.
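As a reference point, here is a minimal sketch of the first three checks in Python. Everything here is illustrative: the blocked keywords, the allowed Referer prefix, and the rate-limit thresholds are made-up values, not recommendations.

```python
import time
from collections import defaultdict, deque

# Illustrative values only.
BLOCKED_UA_KEYWORDS = ("python-requests", "scrapy", "curl", "wget")
ALLOWED_REFERER_PREFIX = "https://example.com/"
MAX_REQUESTS = 20       # max requests per window per IP
WINDOW_SECONDS = 10

_recent_hits = defaultdict(deque)  # source IP -> timestamps of recent requests

def is_allowed(ip, headers):
    # 1. Check User-Agent: block empty or known-crawler UA strings.
    ua = headers.get("User-Agent", "").lower()
    if not ua or any(k in ua for k in BLOCKED_UA_KEYWORDS):
        return False

    # 2. Check Referer: block requests claiming to come from elsewhere.
    referer = headers.get("Referer", "")
    if referer and not referer.startswith(ALLOWED_REFERER_PREFIX):
        return False

    # 3. Limit access frequency: sliding window per source IP.
    now = time.time()
    hits = _recent_hits[ip]
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    hits.append(now)
    return len(hits) <= MAX_REQUESTS
```

As the next section shows, none of these checks survives a motivated attacker.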
Drawbacks of Traditional Bot Protection
Each of the traditional anti-bot methods above can be bypassed easily (the sketch after this list demonstrates the first three):
- Check User-Agent: This can be bypassed by forging HTTP request headers on the client side.
- Check Referer: This can also be bypassed by forging HTTP request headers on the client side.
- Limit Access Frequency: This can be bypassed using a proxy pool to obtain numerous source IPs, with each IP having a low actual access volume.
- Check Cookies: This can be bypassed by manually obtaining a cookie and then providing it to the crawler.
- JS Dynamic Rendering: This can be bypassed by using a headless browser.
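To see how little effort the first bypasses take, here is a sketch using the third-party requests library; the URL, header values, and proxy address are placeholders.

```python
import requests  # third-party: pip install requests

# Forging the headers defeats the User-Agent and Referer checks outright.
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36"),  # look like Chrome
    "Referer": "https://example.com/",                  # look like an internal click
}

# Rotating requests through a proxy pool defeats per-IP frequency limits;
# the address below is a placeholder for one proxy drawn from such a pool.
proxies = {"https": "http://proxy-from-pool:8080"}

resp = requests.get("https://example.com/posts/1",
                    headers=headers, proxies=proxies, timeout=10)
print(resp.status_code)
```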
Advanced Bot Protection
- Request Signature: Bind the client to its SESSION; if the IP, User-Agent, browser fingerprint, etc. are modified, the SESSION is automatically revoked (see the sketch after this list).
- Behavior Recognition: Detect mouse and keyboard usage habits and browser window visit positions to comprehensively determine if it is a human.
- Headless Browser Detection: Detect client characteristics of local browsers and prohibit access from headless browsers.
- Automation Detection: Detect if the local browser is controlled by automation programs and prohibit access from automated browsers.
- Interactive Recognition: Involve users in interactive captchas like sliding verification, recognizing images, recognizing text, etc.
- Computational Verification: Inject computational verification scripts that consume CPU resources, increasing the client's cost per request. A device that could otherwise make 1,000 requests per second can manage only about one per second under SafeLine's protection.
- Prevent Request Replay: Add one-time verification to prevent HTTP requests from being replayed once they leave the browser, making copied cookies invalid.
- Disrupt HTML Structure: Dynamically disrupt the HTML code structure, making it difficult for crawlers to recognize webpage features.
- Obfuscate JS Code: Dynamically obfuscate JS code, making it difficult for attackers to understand the effective webpage logic.
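To illustrate the first item, here is a minimal sketch of binding a session to client attributes with an HMAC. The scheme and all names are my own illustration of the general idea, not SafeLine's actual implementation.

```python
import hashlib
import hmac
import os

SERVER_SECRET = os.urandom(32)  # per-deployment secret, kept server-side

def session_signature(session_id, ip, user_agent, fingerprint):
    """Bind a session to the client attributes observed when it was issued."""
    material = "|".join((session_id, ip, user_agent, fingerprint))
    return hmac.new(SERVER_SECRET, material.encode(), hashlib.sha256).hexdigest()

def verify_request(session_id, stored_sig, ip, user_agent, fingerprint):
    """If any bound attribute has changed, the signature no longer matches
    and the session is treated as revoked."""
    expected = session_signature(session_id, ip, user_agent, fingerprint)
    return hmac.compare_digest(stored_sig, expected)

# Issued at login:
sig = session_signature("sess-123", "203.0.113.7", "Mozilla/5.0 ...", "fp-abc")

# A crawler that copies the cookie but runs from a different IP fails the check:
print(verify_request("sess-123", sig, "198.51.100.9",
                     "Mozilla/5.0 ...", "fp-abc"))  # False
```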
How to Use SafeLine for Bot Protection
SafeLine WAF includes most of the anti-bot technologies on the market and can be used for free!
For installation instructions, refer to the official documentation: SafeLine WAF Documentation
After installing SafeLine WAF, enable the anti-bot features; the configuration takes less than one minute.
Once configured, anyone accessing a website protected by SafeLine WAF will first see a page showing that SafeLine is checking the security of the client's environment.
Legitimate users will see the actual webpage content load automatically after waiting a few seconds, while malicious users will be blocked.
If a local client is detected to be controlled by an automation program, access will still be blocked.
After verification, if you look through the webpage source code, you will find that the HTML and JS code are dynamically encrypted. Although the webpage looks the same, the HTML code structure changes with each refresh.
For example, the server sends out a plain, static HTML file, but after SafeLine's dynamic protection, the HTML seen in the browser is encrypted and restructured, and it differs on every request. A toy illustration of the idea follows.
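In this sketch, the rendered page stays the same, but class names are randomized and text nodes are split into throwaway fragments on every request, so selectors and regexes that worked on one response break on the next. This is a drastic simplification of the idea, not SafeLine's actual algorithm.

```python
import random
import re
import secrets

ORIGINAL_HTML = '<div class="price">Contact: admin@example.com</div>'

def disrupt(html):
    """Toy version of dynamic HTML structure disruption."""
    # 1. Replace stable class names with per-response random ones.
    html = re.sub(r'class="[^"]*"',
                  lambda _: f'class="c{secrets.token_hex(4)}"', html)

    # 2. Split each text node into randomly sized <span> fragments.
    def split_text(match):
        text = match.group(1)
        pieces, i = [], 0
        while i < len(text):
            step = random.randint(1, 4)
            pieces.append(f"<span>{text[i:i + step]}</span>")
            i += step
        return ">" + "".join(pieces) + "<"

    return re.sub(r">([^<>]+)<", split_text, html)

print(disrupt(ORIGINAL_HTML))  # different markup on every call, same rendering
```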
It is worth mentioning that SafeLine's human-bot verification utilizes cloud verification, invoking our company Chaitin's cloud API for each check. Combining Chaitin's IP threat intelligence data and browser fingerprint data, the bot detection rate exceeds 99.9%. Meanwhile, the cloud algorithms and JS logic are continuously updated automatically, ensuring we always stay ahead of attackers.
If anyone can bypass SafeLine's human-bot verification feature, you are welcome to visit my office, and I'll treat you to a month of KFC!
With such high detection rates, website owners may worry about SEO impact and whether search engine indexing will be affected.
The answer is "no." SafeLine thoughtfully provides IP whitelists for major search engine spiders. If SEO is a concern, simply whitelist these IPs.
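SafeLine maintains these whitelists for you, but if you build your own, remember that a spider's User-Agent is trivially forged; the standard way to verify, say, a claimed Googlebot is forward-confirmed reverse DNS. A sketch with Python's standard socket module (the crawler domains follow Google's published guidance; treat the example IP as illustrative):

```python
import socket

def is_verified_googlebot(ip):
    """Forward-confirmed reverse DNS: the PTR record must end in a Google
    crawler domain, and resolving that hostname must return the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        return socket.gethostbyname(hostname) == ip  # forward confirmation
    except OSError:  # covers herror/gaierror lookup failures
        return False

# Requires network access; 66.249.66.1 sits in a published Googlebot range.
# print(is_verified_googlebot("66.249.66.1"))
```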
Wrap-up
Finally, if you are interested in anti-bot technology or WAF technology, welcome to join the SafeLine Discord Community and discuss with us:
https://discord.gg/dy3JT7dkmY