Ander Rodriguez

Posted on • Originally published at zenrows.com

DOs and DON'Ts of Web Scraping

For those of you new to web scraping, regular users, or just curious: these tips are golden. Scraping might seem an easy-to-enter activity, and it is. But it will take you down a rabbit hole. Before you realize it, you've been blocked from a website, your code is 110% spaghetti, and there's no way you can scale it to another four sites.

Ever been there? ✋ I was there 10 years ago — no shame (well, just a bit). Continue with us for a few minutes, and we'll help you navigate through the rabbit hole. 🕳️

DO Rotate IPs

The simplest and most common anti-scraping technique is to ban by IP. The server will show you the first pages, but it will detect too much traffic from the same IP and block it after some time. Then your scraper will be unusable. And you won't even be able to access the webpage from a real browser. The first lesson on web scraping is never to use your actual IP.

Every request leaves a trace, even if you try to avoid it from your code. There are some parts of networking that you cannot control. But you can use a proxy to change your IP: the server will see an IP, just not yours. The next step is to rotate the IPs, or use a service that will do it for you. What does that even mean?

You can use a different IP every few seconds or per request. The target server can't tie your requests together and won't block those IPs. You can build a massive list of proxies and take one randomly for every request, or use a Rotating Proxy which will do that for you. Either way works. The chances of your scraper working correctly skyrocket with just this change.

import requests
import random

urls = ["http://ident.me"] # ... more URLs
proxy_list = [
    "54.37.160.88:1080",
    "18.222.22.12:3128",
    # ... more proxy IPs
]

for url in urls:
    proxy = random.choice(proxy_list)
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    response = requests.get(url, proxies=proxies)
    print(response.text)
    # prints 54.37.160.88 (or any other proxy IP)

Note that these free proxies might not work for you. They are usually short-lived.

DO Use Custom User-Agent

The second-most-common anti-scraping mechanism is the User-Agent. UA is a header that browsers send in requests to identify themselves. It is usually a long string declaring the browser's name, version, platform, and more. An example from an iPhone running iOS 13:

"Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1"

There is nothing wrong with sending a User-Agent, and it is actually recommended to do so. The problem is which one to send. Many HTTP clients send their own (cURL, requests in Python, or Axios in Javascript), which might be suspicious. Can you imagine your server getting hundreds of requests with a "curl/7.74.0" UA? You'd be skeptical at the very least.

The solution is usually to find valid UAs, like the one from the iPhone above, and use them. But that might also turn against you: thousands of requests with exactly the same version in a short period?

So the next step is to have several valid and modern User-Agents and use those. And to keep the list updated. As with the IPs, rotate the UA in every request in your code.

# ... same as above
user_agents = [
    "Mozilla/5.0 (iPhone ...",
    "Mozilla/5.0 (Windows ...",
    # ... more User-Agents
]

for url in urls:
    proxy = random.choice(proxy_list)
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    headers = {"User-Agent": random.choice(user_agents)}
    response = requests.get(url, proxies=proxies, headers=headers)
    print(response.text)

DO Research Target Content

Take a look at the source code before starting development. Many websites offer more manageable ways to scrape data than CSS selectors. A standard method of exposing data is through rich snippets, for example, via Schema.org JSON-LD or itemprop data attributes. Others use hidden inputs for internal purposes (e.g., IDs, categories, product codes), and you can take advantage of that. There's more than meets the eye.

Hidden Inputs on Amazon Products
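To make that concrete, here is a minimal sketch with requests and BeautifulSoup that pulls Schema.org JSON-LD blocks and hidden inputs from a page; the URL is a placeholder, and what you actually find varies per site.

import json

import requests
from bs4 import BeautifulSoup

url = "https://example.com/product/123"  # placeholder URL
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Rich snippets usually ship as JSON-LD: structured data, no CSS selectors needed
for script in soup.select('script[type="application/ld+json"]'):
    print(json.loads(script.string))

# Hidden inputs sometimes carry internal IDs, categories, or product codes
for hidden in soup.select('input[type="hidden"]'):
    print(hidden.get("name"), hidden.get("value"))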

Some other sites rely on XHR requests after the first load to get the data. And it comes structured! For us, the easiest way is to browse the site with DevTools open and check both the HTML and the Network tab. You will get a clear picture and be able to decide how to extract the data in a few minutes. These tricks are not always available, but you can save yourself a headache by using them. Metadata, for example, tends to change less than HTML or CSS classes, making it more reliable and maintainable long-term.

Auction.com XHR Requests

We wrote about exploring before coding, with examples and code in Python; check it out for more info.

DO Parallelize Requests

After switching gears and scaling up, the old one-file sequential script will not be enough. You probably need to "professionalize" it. For a tiny target and a few URLs, getting them one by one might be enough. But then scale it to thousands of pages across different domains, and it won't work correctly.

One of the first steps of that scaling would be to get several URLs simultaneously and not stop the whole scraping for a slow response. Going from a 50-line script to Google scale is a giant leap, but the first steps are achievable. These are the main things you'll need: concurrency and a queue.

Concurrency

The main idea is to send multiple requests simultaneously but with a limit. And then, as soon as a response arrives, send a new one. Let's say the limit is ten. That would mean that ten URLs would always be running at any given time until there are no more, which brings us to the next step.

We wrote a guide on using concurrency (examples in Python and Javascript).
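As a rough sketch of the idea, a thread pool in Python caps the number of requests in flight; here the limit is ten, and the urls list (plus the proxy and header rotation from the earlier snippets) is assumed to exist.

from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

def scrape(url):
    # plug in the rotating proxies and headers from the snippets above
    return requests.get(url, timeout=10).text

# At most ten requests run at once; a new one starts as soon as one finishes
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(scrape, url): url for url in urls}
    for future in as_completed(futures):
        print(futures[future], len(future.result()))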

Queue

A queue is a data structure that allows adding items to be processed later. You can start the crawling with a single URL, get the HTML, and extract the links you want. Add those to the queue, and they will start running. Keep doing the same, and you've built a scalable crawler. Some points are missing, like deduplicating URLs (not crawling the same one twice) or avoiding infinite loops. But an easy way to solve that would be to set a maximum number of pages crawled and stop once you get there.

We have an article with an example in Python scraping from a seed URL.
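For a self-contained taste of it, here is a minimal sketch: a queue, a seen set for deduplication, and a page limit; the seed URL is a placeholder and the extraction step is left out.

from queue import Queue
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

seed = "https://example.com"  # placeholder seed URL
max_pages = 100

to_crawl = Queue()
to_crawl.put(seed)
seen = {seed}  # deduplication: never queue the same URL twice
crawled = 0

while not to_crawl.empty() and crawled < max_pages:
    url = to_crawl.get()
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    crawled += 1
    # ... extract and store the content you care about here
    for link in soup.select("a[href]"):
        absolute = urljoin(url, link["href"])
        if absolute not in seen:
            seen.add(absolute)
            to_crawl.put(absolute)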

Still far from Google scale (obviously), but you can go to thousands of pages with this approach. For more fine-grained control, you can have different settings per domain to avoid overloading a single target. We'll leave that up to you 😉

DON'T Use Headless Browsers for Everything

Selenium, Puppeteer, and Playwright are great, no doubt, but not a silver bullet. They bring a resource overhead and slow down the scraping process. So why use them? They are 100% needed for Javascript-rendered content and helpful in many circumstances. But ask yourself if that's your case.

Most of the sites serve the data, one way or another, on the first HTML request. Because of that, we advocate going the other way around. Test plain HTML first by using your favorite tool and language (cURL, requests in Python, Axios in Javascript, whatever). Check for the content you need: text, IDs, prices. Be careful here since sometimes the data you see in the browser is encoded in the raw HTML (i.e., special characters shown as HTML entities). Copy & paste might not work. 😅

In some cases, you won't find the info because it is not there on the first load, for example, in Angular.io. No problem, headless browsers come in handy for those cases. Or XHR scraping, as shown above for Auction.com.
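Here is a sketch of that "plain HTML first" approach with requests and Playwright; the URL and the "price" check are placeholders for whatever content you actually need, and it assumes Playwright and its browsers are installed.

import requests
from playwright.sync_api import sync_playwright

url = "https://example.com/product/123"  # placeholder URL

# Cheap attempt first: plain HTML, no browser
html = requests.get(url).text

if "price" not in html:  # placeholder check for the content you need
    # Only pay the headless-browser cost when the data is rendered by Javascript
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        html = page.content()
        browser.close()

# ... parse html with your usual extractors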

If you find the info, try to write the extractors. A quick hack might be good enough for a test. Once you have identified all the content you want, the following point is to separate generic crawling code from the custom one for the target site.

To give you an idea of the difference, here are rough timings for fetching the same small batch of URLs three ways:

  1. Using Python's "requests": 2.41 seconds
  2. Playwright with Chromium, opening a new browser per request: 11.33 seconds
  3. Playwright with Chromium, sharing browser and context for all the URLs: 7.13 seconds

It is not 100% conclusive nor statistically accurate, but it shows the difference. In the best case, we are talking about 3x slower using Playwright, and sharing context is not always a good idea. And we are not even talking about CPU and memory consumption.

DON'T Couple Code to Target

Some actions are independent of the website you are scraping: get HTML, parse it, queue new links to crawl, store content, and more. In an ideal scenario, we would separate those from the ones that depend on the target site: CSS selectors, URL structure, database structure.

The first script is usually entangled; no problem there. But as it grows and new pages are added, separating responsibilities is crucial. We know, easier said than done. But pausing to think matters when developing a maintainable and scalable scraper.

We published a repository and blog post about distributed crawling in Python. It is a bit more complicated than what we've seen so far. It uses external software (Celery for asynchronous task queue and Redis as the database).

For each target site, these are the parts you would adapt:

  1. How to get the HTML (requests vs. a headless browser)
  2. Which URLs to filter and queue for crawling
  3. What content to extract (CSS selectors)
  4. Where to store the data (a list in Redis)
# ...
def extract_content(url, soup):
    pass  # CSS selectors for this specific site

def store_content(url, content):
    pass  # e.g., push to a list in Redis

def allow_url_filter(url):
    pass  # decide which discovered URLs get queued

def get_html(url):
    return headless_chromium.get_html(url, headers=random_headers(), proxies=random_proxies())

It is still far from massive-scale, production-ready code. But code reuse is easy, as is adding new domains. And when adding updated browsers or headers, it would be easy to modify the old scrapers to use them.

DON'T Take Down your Target Site

Your extra load might be a drop in the ocean for Amazon but a burden for a small independent store. Be mindful of the scale of your scraping and the size of your targets.

You can probably crawl hundreds of pages at Amazon concurrently, and they won't even notice (careful nonetheless). But many websites run on a single shared machine with poor specs, and they deserve our understanding. Tune down your script's capabilities for those sites. It might complicate the code, but backing off when response times increase would be a nice touch.
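A rough way to do that, assuming the loop and imports from the earlier snippets; both the two-second threshold and the ten-second pause are arbitrary numbers you'd tune per site.

import time

for url in urls:
    response = requests.get(url)
    # If the server starts answering slowly, it may be struggling: slow down
    if response.elapsed.total_seconds() > 2:
        time.sleep(10)
    # ... parse response.text as usual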

Another point is to inspect and comply with their robots.txt. Mainly two rules: do not scrape disallowed pages and obey Crawl-Delay. That directive is not common, but when present, it represents the number of seconds crawlers should wait between requests. There is a Python module that can help us comply with robots.txt.
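The standard library's urllib.robotparser is one such module. A minimal sketch, with example.com and the UA string as placeholders:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder domain
rp.read()

user_agent = "my-crawler"  # whatever UA your requests actually send
if rp.can_fetch(user_agent, "https://example.com/some/page"):
    delay = rp.crawl_delay(user_agent) or 1  # fall back to a polite default
    # ... wait `delay` seconds between requests, then fetch the page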

We will not go into details but do not perform malicious activities (there should be no need to say it, just in case). We are always talking about extracting data without breaking the law or causing damage to the target site.

DON'T Mix Headers from Different Browsers

This last technique is for higher-level anti-bot solutions. Browsers send several headers with a set format that varies from version to version, and advanced solutions check those and compare them to a database of real-world header sets. That means you will raise red flags by sending the wrong ones, or, even more difficult to notice, by not sending the right ones! Visit httpbin to see the headers your browser sends. Probably more than you imagine, and some you haven't even heard of! "Sec-Ch-Ua"? 😕

There is no easy way out of this but to have an actual full set of headers. And to have plenty of them, one for each User-Agent you use. Not one for Chrome and another for iPhone, nope. One. Per. User-Agent. 🤯

Some people try to avoid this by using headless browsers, but we already saw why it is better to avoid them. And anyway, you are not in the clear with them. They send the whole set of headers that works for that browser on that version. If you modify any of that, the rest might not be valid. If you use Chrome with Puppeteer and overwrite the UA with the iPhone one... you might be in for a surprise. A real iPhone does not send "Sec-Ch-Ua", but Puppeteer will, since you overwrote the UA but didn't delete that header.

Some sites offer a list of User-Agents. But it is hard to get complete header sets for hundreds of them, which is the scale needed when scraping complex sites.

# ... 

header_sets = [ 
    { 
        "Accept-Encoding": "gzip, deflate, br", 
        "Cache-Control": "no-cache", 
        "User-Agent": "Mozilla/5.0 (iPhone ...", 
        # ... 
    }, { 
        "User-Agent": "Mozilla/5.0 (Windows ...", 
        # ... 
    }, 
    # ... more header sets 
] 

for url in urls: 
    # ... 
    headers = random.choice(header_sets) 
    response = requests.get(url, proxies=proxies, headers=headers) 
    print(response.text)

We know this last one was a bit picky. But some anti-scraping solutions can be super-picky and check even more than headers. Some might check browser or even connection fingerprinting: high-level stuff.

Conclusion

Rotating IPs and having good headers will allow you to crawl and scrape most websites. Use headless browsers only when necessary and apply good software engineering practices.

Build small and grow from there, add functionalities and use cases. But always try to keep scale and maintainability in mind while keeping success rates high. Don't despair if you get blocked from time to time, and learn from every case.

Web scraping at scale is a challenging and long journey, but you might not need the best system ever, nor 100% accuracy. If it works on the domains you want, good enough! Do not freeze trying to reach perfection since you probably don't need it.

In case of doubts, questions, or suggestions, do not hesitate to contact us.

Thanks for reading! Did you find the content helpful? Please, spread the word and share it. 👈


Originally published at https://www.zenrows.com

Top comments (4)

Bauke Regnerus

Great article, thanks.

I'm working on a scraper that gathers news articles from my favorite sites and presents them in a dashboard hosted on a local server. The idea is to update every 5 or 10 minutes, so I guess rotating IPs and user agents is overkill. But I'll implement it anyway as good practice.

Btw, I notice user_agents isn't used anywhere in the example.

Ander Rodriguez

Hi, thanks!!

It's overkill if you are sure that they won't block you. But depending on the sites, you cannot be 100% sure.

By user_agents, do you mean the library? As far as I know, it detects capabilities based on a User-Agent string. For our case, we would need a generator.

Shravan

Wow, this was cool. Thank you Ander for sharing your knowledge.

Ander Rodriguez

Thanks, Shravan! 🙏