Datahut Blogs (9 Part Series)
In this era of tremendous competition, enterprises use every method within their power to get ahead. For businesses, web scraping is a uniquely powerful tool for achieving that edge. But this field isn’t without hurdles: websites employ various anti-scraping techniques to block you from scraping them. There is, however, always a way around.
WHAT WE KNOW ABOUT WEB SCRAPING
The WWW harbours more websites than you can imagine. Some of them operate in the same domain as yours. For example, both Amazon and Flipkart are e-commerce websites. Such websites become your rivals, even without trying. So when it comes to tasting success, you need to identify your competitors and outdo them.
So what methods can help you get that edge over a million others working in the same domain?
The answer is web scraping.
Web scraping is nothing but collecting data from various websites. You can extract information such as product pricing and discounts. The data you acquire can help enhance the user experience, and that in turn ensures customers prefer you over your competitors.
For instance, suppose your e-commerce company sells software and you want to understand how to improve your product. You would visit websites that sell similar software and study their products. While doing so, you can also check your competitors’ prices. Eventually, you can decide at what price to place your software and which features need improvement. This process applies to almost any product.
WHAT ARE ANTI-SCRAPING TOOLS AND HOW TO DEAL WITH THEM
As a growing business, you will have to target popular and well-established websites. But the task of web scraping becomes difficult in such cases. Why? Because these websites employ various anti-scraping techniques to block your way.
WHAT DO THESE ANTI-SCRAPING TOOLS DO?
Websites host a wealth of information. Genuine visitors might use it to learn something or to choose a product to buy. But non-genuine visitors, such as competitor websites, can use the same information to get ahead in the game.
Anyone would like to keep competition at bay. That is why websites use anti-scraping tools: they identify non-genuine visitors and prevent you from acquiring data for your own use.
- KEEP ROTATING YOUR IP ADDRESS This is the easiest way to deceive an anti-scraping tool. An IP address is a numerical identifier assigned to a device, and a website can easily log it each time you visit to perform web scraping.
Most websites keep track of the IP addresses their visitors use. So, when taking on the enormous task of scraping a large site, you should keep several IP addresses handy. Think of it as wearing a different face mask each time you go out of your house. By rotating through a number of them, no single IP address will get blocked.
This method comes in handy with most websites. But a few high-profile sites use advanced proxy blacklists, and that is where you need to act smarter. Residential or mobile proxies are reliable alternatives here. Just in case you are wondering, there are several kinds of proxies.
The world has a fixed number of IP addresses. Yet, if you somehow manage to obtain, say, 100 of them, you can spread your visits across all 100 without arousing any suspicion. So the most crucial step is finding yourself the right proxy service provider.
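As a minimal sketch of the rotation idea, the snippet below cycles each request through the next proxy in a pool. The proxy addresses are placeholders; you would substitute the endpoints your proxy provider gives you.

```python
import itertools
import requests

# Hypothetical proxy pool -- replace with addresses from your proxy provider.
PROXIES = [
    "http://192.0.2.10:8080",
    "http://192.0.2.11:8080",
    "http://192.0.2.12:8080",
]

# itertools.cycle loops over the pool forever, one proxy per request.
proxy_cycle = itertools.cycle(PROXIES)

def fetch_with_rotating_proxy(url):
    """Send each request through the next proxy in the pool."""
    proxy = next(proxy_cycle)
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
```

A simple round-robin like this is enough for many sites; more careful setups also drop proxies from the pool once they start returning errors.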
- USE A REAL USER AGENT A user agent is a type of HTTP header whose primary function is to tell the website which browser you are using to visit it. Websites can easily block you if your user agent doesn’t belong to a major browser, such as Chrome or Mozilla Firefox. Most scrapers ignore this point. You can cut down your chances of getting blacklisted by setting a user agent that seems genuine and well-known.
You can easily pick one from the many published lists of user agents. In the case of an advanced website, the Googlebot user agent can help you out: sites generally let Googlebot’s requests through because they want to be listed on Google.
A user agent works best when it is up to date. Every browser release uses a different user-agent string, and if you fail to stay current, you will arouse the suspicion you are trying to avoid. Rotating between a few user agents can give you an upper hand too.
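A sketch of such rotation: each request picks a random string from a small pool of common user agents. The strings below are examples of real browser formats; keep whatever pool you use in sync with current browser releases.

```python
import random
import requests

# Example user-agent strings for major browsers (keep these up to date).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

def fetch(url):
    """Fetch a page while presenting a randomly chosen, real-looking browser."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)
```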
- KEEP RANDOM INTERVALS BETWEEN EACH REQUEST A web scraper is like a robot: scraping tools send requests at regular intervals of time. Your goal should be to appear as human as possible, and since humans don’t follow a strict routine, it is better to space out your requests at random intervals. This way, you can easily dodge the target website’s anti-scraping tools.
Make sure that your requests are polite. If you send requests too frequently, you can crash the website for everyone. The goal is never to overload the site. Frameworks such as Scrapy, for example, provide settings like DOWNLOAD_DELAY and AutoThrottle to slow requests down.
As an additional courtesy, you can refer to the website’s robots.txt. This file may contain a Crawl-delay line, which tells you how many seconds to wait between requests to avoid generating high server traffic.
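The two ideas combine naturally: read the site’s Crawl-delay if it declares one, fall back to a default otherwise, and add random jitter so requests don’t arrive on a fixed beat. This sketch uses only the standard library; the default of 3 seconds is an arbitrary assumption.

```python
import random
import urllib.robotparser

def polite_delay(robots_txt, user_agent="*", default=3.0):
    """Return a randomized wait (seconds) based on the site's Crawl-delay."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    base = rp.crawl_delay(user_agent) or default
    # Jitter: wait between base and 2*base so the interval never repeats.
    return base + random.uniform(0, base)

# Between requests (robots_txt fetched from https://<site>/robots.txt):
# time.sleep(polite_delay(robots_txt))
```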
- A REFERER ALWAYS HELPS The Referer is an HTTP request header that specifies which site you arrived from. It can be your lifesaver during any web scraping operation. Your goal should be to appear as if you are coming directly from Google.
Setting the header “Referer”: “https://www.google.com/” can help you do this. You can even change it as you change countries; in the UK, for example, use “https://www.google.co.uk/”. Many sites have particular referrers that habitually redirect traffic to them. You can use a tool like https://www.similarweb.com to find the common referrer for a website.
These referrers are usually social media sites like YouTube or Facebook.
Knowing the referrer makes you appear more authentic. The target site will think its usual referrer redirected you to its pages, classify you as a genuine visitor, and won’t think of blocking you.
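A minimal helper for this, with country-specific Google front pages (the mapping below covers two examples; extend it as needed, and the target URL in the usage comment is hypothetical):

```python
# Google front pages by country code (examples only; extend as needed).
GOOGLE_HOMEPAGES = {
    "us": "https://www.google.com/",
    "uk": "https://www.google.co.uk/",
}

def google_referer(country="us"):
    """Build headers that make the visit look like it came from Google."""
    homepage = GOOGLE_HOMEPAGES.get(country, GOOGLE_HOMEPAGES["us"])
    return {"Referer": homepage}

# Usage with requests (hypothetical target URL):
# requests.get("https://example.com/products", headers=google_referer("uk"))
```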
- AVOID ANY HONEYPOT TRAPS As robots got smarter, so did webmasters. Many websites place invisible links that only a scraping robot would follow. By intercepting these robots, websites can easily block your web scraping operation.
To safeguard yourself, look for the “display: none” or “visibility: hidden” CSS properties on a link. If you detect these properties, it is time to backtrack: websites use such links to identify and trap programmed scrapers, and they can fingerprint your requests and then block them permanently.
This is a method that masters of web security use against web crawlers, so try to check each page for such properties. Webmasters also use tricks like setting a link’s colour to match the background. In such cases, for additional safety, look for properties like “color: #fff” or “color: #ffffff” as well. This way, you can save yourself even from links that have been rendered invisible.
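Both checks above can be sketched with the standard library’s HTML parser: collect hrefs from anchor tags, but skip any link whose inline style marks it as hidden or background-coloured. This only inspects inline styles; a real crawler would also need to account for styles applied via external stylesheets.

```python
from html.parser import HTMLParser

# Inline-style fragments that suggest a honeypot link (spaces stripped).
HIDDEN_MARKERS = ("display:none", "visibility:hidden", "color:#fff")

class HoneypotFilter(HTMLParser):
    """Collect hrefs from <a> tags, skipping links styled to be invisible."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        style = attrs.get("style", "").replace(" ", "").lower()
        if any(marker in style for marker in HIDDEN_MARKERS):
            return  # likely a honeypot -- do not follow
        if "href" in attrs:
            self.links.append(attrs["href"])

def safe_links(html):
    parser = HoneypotFilter()
    parser.feed(html)
    return parser.links
```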
Many tools are available that can help you run a browser environment identical to the one a real user would have. This can help you avoid detection entirely. The only hurdle in this method is that setting up such browsers takes more caution and time. In return, it is the most effective way to go undetected while scraping a website.
The drawback of such smart tools is that they are memory- and CPU-intensive. Resort to them only when you can find no other means of avoiding a blacklist.
- KEEP WEBSITE CHANGES IN CHECK Websites change layouts for various reasons, and often they do so specifically to block scrapers. Sites can place design elements in random, unexpected places. Even the biggest websites use this method.
So the crawler you are using should handle these changes well: it needs to detect them as they happen and continue performing web scraping regardless. Monitoring the number of successful requests per crawl can help you do this easily.
Another way to ensure ongoing monitoring is to write a unit test for a specific URL on the target site, using one URL from each section of the website. This will help you detect any layout changes. A few such requests every 24 hours are enough, and they won’t pause the scraping procedure.
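One way to frame that monitoring, sketched below: define the fields each scraped record must contain, compute the share of records where extraction succeeded, and alert when the rate drops. The URLs, field names, and 90% threshold are all illustrative assumptions, not fixed rules.

```python
# One representative URL per section of the target site (hypothetical paths).
SECTION_URLS = [
    "https://example.com/electronics/sample-product",
    "https://example.com/books/sample-product",
]

# Fields a healthy extraction must return (site-specific assumption).
REQUIRED_FIELDS = ("title", "price")

def extraction_ok(record):
    """A crawl result is healthy if every required field came back non-empty."""
    return all(record.get(field) for field in REQUIRED_FIELDS)

def success_rate(records):
    """Fraction of records whose extraction succeeded."""
    if not records:
        return 0.0
    return sum(extraction_ok(r) for r in records) / len(records)

# Run against SECTION_URLS every 24 hours; a sudden drop suggests a
# layout change broke the extraction rules:
# if success_rate(batch) < 0.9: alert_the_team()
```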
- EMPLOY A CAPTCHA SOLVING SERVICE Captchas are among the most widely used anti-scraping tools, and most of the time crawlers cannot bypass them. But as a recourse, many services have been designed to help you carry out web scraping. A few of these are captcha-solving solutions like AntiCAPTCHA.
Websites that require CAPTCHA make it mandatory for crawlers to use these tools. Some of these services can be very slow and expensive, so you will have to choose wisely to ensure the service isn’t too extravagant for you.
- GOOGLE CACHE CAN BE A SOURCE TOO Throughout the WWW, there is a large amount of stationary data (data that doesn’t change much with time). In such instances, Google’s cached copy can be your last resort for web scraping. By using cached copies, you can acquire the data directly, which is at times easier than scraping the website itself.
Just add “http://webcache.googleusercontent.com/search?q=cache:” as a prefix to your URL, and you are ready to go. This option comes in handy for websites that are pretty hard to scrape and yet are mostly constant over time.
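The prefix trick is a one-liner; the helper below simply builds the cached address for any page (the target URL in the usage comment is hypothetical):

```python
# Prefix that points at Google's cached copy of a page.
CACHE_PREFIX = "http://webcache.googleusercontent.com/search?q=cache:"

def cached_url(url):
    """Return the Google-cache address for a page."""
    return CACHE_PREFIX + url

# Usage with requests (hypothetical target URL):
# requests.get(cached_url("https://example.com/docs/page"))
```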
This option is mostly hassle-free, as nobody is trying to block your way at every turn.
But it isn’t an entirely reliable option either. For example, LinkedIn denies Google permission to cache its data, so for such websites it is best to opt for some other method.