How to reduce chance of being blocked while web scraping search engines

This blog post covers different ways to reduce the chance of being blocked while web scraping search engines or other websites, with Python and Ruby code examples.

Contents:

Methods

  • Check the network tab first to make a direct request to API/Server.
  • Add delays.
  • Pass user-agent into request headers.
  • Pass additional HTTP request headers (cookies, auth, authority, etc.).
  • Add proxies.
  • Become whitelisted.
  • SerpApi.

Check Network Tab First

Before you try to build the stealthiest bypass system possible, take a look at the Network tab in dev tools first and see if the data you want can be extracted via a direct API/server request. That way you don't need to make things complicated.

Note: API calls can also be protected. For example, Home Depot and Walmart block API requests that are missing the proper headers.

To check it, go to Dev Tools -> Network -> Fetch/XHR. On the left side you'll see a bunch of requests sent from/to the server; when you click on one of those requests, you'll see its response on the right side via the Preview tab.


If one of those requests has the data you want, click on it, go to the Headers tab on the right, and copy the URL to make a request using Python requests.get() or Ruby HTTParty.get().
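If the endpoint returns JSON, you can often skip HTML parsing entirely. Here's a minimal sketch of that idea, assuming a hypothetical endpoint and parameters copied from the Headers tab (not a real API):

import requests

# hypothetical endpoint URL copied from DevTools -> Network -> Fetch/XHR -> Headers
url = "https://www.example.com/api/v1/products"

# query parameters and headers copied from the same request in DevTools
params = {"query": "chair", "page": 1}
headers = {"User-Agent": "Mozilla/5.0 ..."}

response = requests.get(url, params=params, headers=headers)
print(response.json())  # many API/server responses are JSON, so no HTML parsing is needed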


Delays

Delays can sometimes do the trick, but whether you should use them depends heavily on the use case.

In Python you can use the built-in time.sleep() method:

from time import sleep

sleep(0.05)  # 50 milliseconds of sleep
sleep(0.5)   # half a second of sleep
sleep(3)     # 3 seconds of sleep 

In Ruby the process is identical, using the sleep method as well:

# Called without an argument, sleep will sleep forever
sleep(0.5) # half a second

# the duration helpers below (minutes, hours, days) require ActiveSupport (Rails)
sleep(4.minutes)

# or longer..
sleep(2.hours)
sleep(3.days)
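A small randomized delay between requests often looks more natural than a fixed one. Here's a minimal Python sketch (the URLs are just httpbin.org test endpoints):

import random
import time
import requests

urls = ["https://httpbin.org/get", "https://httpbin.org/headers"]

for url in urls:
    response = requests.get(url)
    print(response.status_code)

    # wait somewhere between 2 and 5 seconds before the next request
    time.sleep(random.uniform(2, 5))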

User-Agent

This is the most basic technique, and for most websites it will usually be enough, but a user-agent alone does not guarantee that your request won't be declined or blocked.

In basic terms, a user-agent is needed to make the request look like it comes from a "real" user. This is also known as user-agent spoofing: a bot or browser sends a fake user-agent string to announce itself as a different client.

The reason a request might be blocked is that, for example, the default user-agent in the Python requests library is python-requests. Websites understand that it's a bot and might block the request in order to protect themselves from overload if a lot of requests are being sent.

User-agent syntax looks like this:

User-Agent: <product> / <product-version> <comment>

Check what your user-agent is.

In Python requests library, you can pass user-agent into request headers as a dict() like so:

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# add request headers to request
requests.get("YOUR_URL", headers=headers)

In Ruby with the HTTParty gem, the process is identical:

headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# add request headers to request
HTTParty.get("YOUR_URL", headers:headers)

Code and response examples with and without user-agent

The examples below use Python and the requests library. This problem is very common on StackOverflow.

Let's try to get data from Google Search with and without a user-agent passed into the request headers. The example below will try to get the Nasdaq Composite price.

Making a request without passing a user-agent into the request headers:
import requests, lxml
from bs4 import BeautifulSoup

params = {
  "q": "Nasdaq composite",
  "hl": "en",
}

soup = BeautifulSoup(requests.get('https://www.google.com/search', params=params).text, 'lxml')
print(soup.select_one('[jsname=vWLAgc]').text)

Firstly, it will throw an AttributeError because the response contains different HTML with different selectors:

print(soup.select_one('[jsname=vWLAgc]').text)
AttributeError: 'NoneType' object has no attribute 'text'

Secondly, if you try to print the soup object or the response from requests.get(), you'll see that it's HTML full of <script> tags, or HTML that contains some sort of error.

Making a request with a user-agent:
import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "Nasdaq composite",
  "hl": "en",
}

soup = BeautifulSoup(requests.get('https://www.google.com/search', headers=headers, params=params).text, 'lxml')
print(soup.select_one('[jsname=vWLAgc]').text)

# 15,363.52

Rotate User-Agents

If you are making a large number of requests while web scraping a website, it's a good idea to randomize each request by sending a different set of HTTP headers, so the requests look like they are coming from different computers/browsers.

The process:

  1. Collect a list of User-Agent strings of some recent real browsers from WhatIsMyBrowser.com.
  2. Put them in a Python list() or a .txt file.
  3. Make each request pick a random string from this list() using random.choice().
import requests, random

user_agent_list = [
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
]

url = 'https://httpbin.org/headers'

for i in range(1, 4):
  # pick a random user-agent
  user_agent = random.choice(user_agent_list)

  # set the headers
  headers = {'User-Agent': user_agent}

  # make the request and print which user-agent was actually sent
  response = requests.get(url, headers=headers)
  print(response.json()['headers']['User-Agent'])

Learn more at ScrapeHero about how to fake and rotate User Agents using Python.


Additional Headers

Sometimes passing only a user-agent isn't enough. You can pass additional headers as well. For example:

  • Accept: <MIME_type>/<MIME_subtype>; <MIME_type>/*; */*
  • Accept-Language: <language>; *
  • Content-Type: <MIME_type>/<MIME_subtype>, e.g. text/html or image/png

See more HTTP request headers that you can send while making a request.
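For example, here's a minimal sketch that passes a few of these headers along with a user-agent (the values are illustrative, and httpbin.org simply echoes back the headers it receives):

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}

response = requests.get("https://httpbin.org/headers", headers=headers)
print(response.text)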

Additionally, if you need to send authentication data, you can use requests.Session():

session = requests.Session()
session.auth = ('user', 'pass')
session.headers.update({'x-test': 'true'})

# both 'x-test' and 'x-test2' are sent
session.get('https://httpbin.org/headers', headers={'x-test2': 'true'})

Or if you need to send cookies:

session = requests.Session()

response = session.get('https://httpbin.org/cookies', cookies={'from-my': 'browser'})
print(response.text)
# '{"cookies": {"from-my": "browser"}}'

response = session.get('https://httpbin.org/cookies')
print(response.text)
# '{"cookies": {}}'

You can view all request/response headers under DevTools -> Network -> Click on the URL -> Headers.

In Insomnia (right click on the URL -> Copy as cURL (Bash)) you can see what HTTP request headers are being sent and play around with them dynamically.

It can also generate code for you (though not always perfectly).


Ordered Headers

In unusual circumstances, you may want to provide headers in an ordered manner.

You can do it like so:

from collections import OrderedDict
import requests

session = requests.Session()
session.headers = OrderedDict([
    ('Connection', 'keep-alive'), 
    ('Accept-Encoding', 'gzip,deflate'),
    ('Origin', 'example.com'),
    ('User-Agent', 'Mozilla/5.0 ...'),
])

# other code ...

custom_headers = OrderedDict([('One', '1'), ('Two', '2')])
req = requests.Request('GET', 'https://httpbin.org/get', headers=custom_headers)
prep = session.prepare_request(req)
print(*prep.headers.items(), sep='\n')

# prints:
'''
('Connection', 'keep-alive')
('Accept-Encoding', 'gzip,deflate')
('Origin', 'example.com')
('User-Agent', 'Mozilla/5.0 ...')
('One', '1')
('Two', '2')
'''

The code was taken from a StackOverflow answer by jfs. Please read his answer to get more out of it (note: it's written in Russian). Learn more about Requests Header Ordering.


Proxies

Sometimes passing request headers isn't enough. That's when you can try to use proxies in combination with request headers.

Why proxies in the first place?

  1. If you want to scrape at scale. Scraping at scale generates a lot of traffic, and proxies are used to make that traffic look like regular user traffic and keep things balanced.
  2. If the destination website you want to scrape is only available in some countries, a proxy lets you make the request from a specific geographical region or device.
  3. If you want the ability to make concurrent sessions to the same or different websites, which reduces the chances of getting banned or blocked by the website(s).

Using Python requests to pass proxies into a request (same as passing a user-agent):

proxies = {
  'http': 'http://10.10.1.10:3128',
  'https': 'http://10.10.1.10:1080',
}

requests.get('http://example.org', proxies=proxies)

Using the HTTParty gem, you can add proxies like in the code snippet shown below:

http_proxy = {
  http_proxyaddr: "PROXY_ADDRESS",
  http_proxyport: "PROXY_PORT"
}

HTTParty.get("YOUR_URL", http_proxy:http_proxy)

Or using the http.rb gem to add proxies:

HTTP.via("proxy-hostname.local", 8080)
  .get("http://example.com/resource")

HTTP.via("proxy-hostname.local", 8080, "username", "password")
  .get("http://example.com/resource")

Non-overused proxies

To keep things short, if possible, do not use overused proxies because:

  • Public proxies are the least safe and the most unreliable proxies.
  • Shared proxies are usually the cheapest proxies, because many clients split the cost and get to use more proxies for the same price.

You can scrape a lot of public proxies and store them in a list(), or save them to a .txt file to save memory, then iterate over them while making requests to see what the results are, and move on to a different type of proxy if the results are not what you were looking for.
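A minimal sketch of that idea (the proxy addresses below are placeholders, and httpbin.org/ip is only used to show which IP the request came from):

import random
import requests

# placeholder proxies; in practice these would come from a scraped list or a .txt file
proxy_list = [
    "http://10.10.1.10:3128",
    "http://10.10.1.11:3128",
    "http://10.10.1.12:3128",
]

proxy = random.choice(proxy_list)

try:
    response = requests.get(
        "https://httpbin.org/ip",
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
    print(response.json())
except requests.exceptions.RequestException as e:
    # this proxy didn't work, move on to the next one
    print(f"Proxy {proxy} failed: {e}")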

Learn more about other types of proxies and which of them is best for your use case.

Become Whitelisted

Getting whitelisted means having your IP address added to a website's allow list, which explicitly allows certain identified entities to access a particular privilege, i.e. a list of things that are allowed when everything is denied by default.

One of the ways to become whitelisted is to regularly do something useful for "them" based on the scraped data, which could lead to some insights.


Using SerpApi

You can avoid all of these problems by using SerpApi. It's a paid API with a free plan.

The biggest difference is that everything is already done for the end user, except for the authentication part, and you don't have to think about it or maintain it.
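For example, here's a minimal sketch using SerpApi's google-search-results Python package (you would substitute your own API key, and the exact result fields depend on the query):

from serpapi import GoogleSearch

params = {
    "api_key": "YOUR_API_KEY",    # your SerpApi key
    "engine": "google",
    "q": "Nasdaq composite",
    "hl": "en",
}

search = GoogleSearch(params)     # proxies, CAPTCHAs, and parsing are handled on the backend
results = search.get_dict()       # JSON response -> Python dict

# print the title of the first organic result (structure varies by query)
print(results["organic_results"][0]["title"])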


Links:

  • User-Agent
  • Request Headers
  • Response Headers
  • List of HTTP Headers
  • Types of proxies
  • Python Requests
  • Ruby HTTParty
  • API

Outro

If you have any questions or suggestions, feel free to drop a comment in the comment section or reach out via Twitter at @serp_api.

Yours,
Dimitry, and the rest of the SerpApi Team.
