The Importance of Using a User Agent When Scraping Data

Hugo Sandoval ・6 min read

Setting a User Agent isn't common practice among developers of scrapers and crawlers. But using the right User Agent can make scraping many websites much easier.

What is a User Agent?

The User Agent is a text string that the client sends in the headers of a request. It identifies the type of device, operating system and browser we are using. For example, it tells the server that we are using Google Chrome 80 on a Windows 10 computer, so the server can prepare a response intended for that type of device.
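
To make the idea concrete, here is a minimal sketch of how the pieces of such a string can be pulled apart, using only the standard library. The helper name split_user_agent and the two regular expressions are assumptions made for illustration; real servers rely on far more thorough parsers.

```python
import re

# Chrome 80 on Windows 10, the combination mentioned above.
user_agent = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/80.0.3987.149 Safari/537.36')


def split_user_agent(ua):
    """Hypothetical helper: return (platform comment, product versions)."""
    # The first parenthesised group usually describes the platform.
    platform_match = re.search(r'\(([^)]*)\)', ua)
    platform = platform_match.group(1) if platform_match else ''
    # "Name/version" tokens describe the browser and its engine.
    products = dict(re.findall(r'([^\s/()]+)/([^\s()]+)', ua))
    return platform, products


platform, products = split_user_agent(user_agent)
print(platform)            # operating system details
print(products['Chrome'])  # browser version
```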

Facebook, Twitter or Google don't send the same response when we visit from an Android or iOS smartphone as when we visit from a Windows, macOS or Linux computer. Their servers know the difference through the User Agent.

Because the User Agent is a plain-text string, it is easy to manipulate and thus trick the web server into believing that we are visiting from a different device.

Why is it recommended to set a User Agent?

If we don't set a User Agent in our requests, our tools fall back to a default one that in many cases announces our presence as a bot (Python Requests, for example, sends a string of the form python-requests/ followed by its version). Many websites don't allow bots, so they can easily ban us.
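
As a rough illustration, the sketch below uses the standard library's urllib (rather than Requests, so it runs without extra installs) to show the default User Agent a Python tool announces and how a custom header replaces it. Nothing is sent over the network, and https://example.com/ is only a placeholder URL.

```python
import urllib.request

# Headers that urllib attaches when we don't set our own.
opener = urllib.request.build_opener()
default_user_agent = dict(opener.addheaders)['User-agent']
print(default_user_agent)  # starts with "Python-urllib/", an obvious bot marker

# A browser-style string to send instead.
browser_user_agent = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/80.0.3987.149 Safari/537.36')

# Build (but do not send) a request that carries the custom header.
request = urllib.request.Request(
    'https://example.com/',  # placeholder URL, assumed for illustration
    headers={'User-Agent': browser_user_agent},
)

# urllib stores header names capitalised as "User-agent".
print(request.get_header('User-agent'))
```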

It is recommended to always use a popular User Agent so that our requests go unnoticed. The following website contains a huge User Agent database, but in my opinion it is easier to use the User Agent of our own browser. For a Windows 10 PC running Google Chrome version 80, it looks something like this:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36

Example: testing different User Agents

We will be using Python 3 for this example; you can download it here if you don't already have it.

Necessary libraries:

  • requests
  • BeautifulSoup4
  • lxml

Install them with this command in a terminal:

pip install requests BeautifulSoup4 lxml

First we will classify User Agents by the kind of content a website could serve us when we access it with that User Agent.

  • For desktops or laptops: computers in general.
  • For smartphones: Android, iOS, Windows Phone.
  • For featurephones: Nokia 5310 XpressMusic, Sony Ericsson, etc. (the phones of our childhood).
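
The three categories above can be sketched as a small classifier. This is only a rough heuristic under stated assumptions: the function name classify_user_agent and the keyword checks are hypothetical, and real websites use much more detailed detection.

```python
def classify_user_agent(user_agent):
    """Hypothetical heuristic: map a User Agent to one of three categories."""
    ua = user_agent.lower()
    # Check featurephone markers first: such strings may also contain "Mobile".
    if 'midp' in ua or 'cldc' in ua:
        return 'featurephone'
    if 'android' in ua or 'iphone' in ua or 'windows phone' in ua:
        return 'smartphone'
    # Anything else is treated as a desktop or laptop browser.
    return 'desktop'


samples = {
    'desktop': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
               'AppleWebKit/537.36 (KHTML, like Gecko) '
               'Chrome/80.0.3987.149 Safari/537.36',
    'smartphone': 'Mozilla/5.0 (Linux; Android 9; SM-G960F '
                  'Build/PPR1.180610.011; wv) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Version/4.0 Chrome/74.0.3729.157 '
                  'Mobile Safari/537.36',
    'featurephone': 'Nokia5310XpressMusic_CMCC/2.0 (10.10) Profile/MIDP-2.1 '
                    'Configuration/CLDC-1.1 UCWEB/2.0 (Java; U; MIDP-2.0; '
                    'en-US; Nokia5310XpressMusic) U2/1.0.0 '
                    'UCBrowser/9.5.0.449 U2/1.0.0 Mobile',
}

for expected, ua in samples.items():
    print(expected, '->', classify_user_agent(ua))
```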

In the following tests I will trim the server responses to keep the post short.

Desktop computers:

Let's take a desktop user agent, Windows 10 and Google Chrome, then run the request:

import requests # Import requests
from bs4 import BeautifulSoup # Import BeautifulSoup4

# Windows 10 with Google Chrome
user_agent_desktop = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '\
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 '\
'Safari/537.36'

headers = {'User-Agent': user_agent_desktop}

url_twitter = 'https://twitter.com/billgates'
resp = requests.get(url_twitter, headers=headers)  # Send request

code = resp.status_code  # HTTP response code
if code == 200:
    soup = BeautifulSoup(resp.text, 'lxml')  # Parsing the HTML
    print(soup.prettify())
else:
    print(f'Error loading Twitter: {code}')

What does the response of the Twitter server look like if we send a request with this User Agent?

...
<body>
  <noscript>
   <form action="https://mobile.twitter.com/i/nojs_router?path=%2Fbillgates" method="POST" style="background-color: #fff; position: fixed; top: 0; left: 0; right: 0; bottom: 0; z-index: 9999;">
    <div style="font-size: 18px; font-family: Helvetica,sans-serif; line-height: 24px; margin: 10%; width: 80%;">
     <p>
      We've detected that JavaScript is disabled in your browser. Would you like to proceed to legacy Twitter?
     </p>
...

As we can see in the result, the server returns a page that is loaded dynamically with JavaScript and that by default shows nothing if the client doesn't have JavaScript enabled. Python Requests doesn't execute JavaScript, so we can't see the information that interests us. Let's try another User Agent.
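
When testing several User Agents, it can help to detect this kind of fallback page automatically. The following is a minimal sketch using the standard library's html.parser; the names NoScriptText and looks_javascript_only are hypothetical, and the phrase it searches for is simply the one in the trimmed response above, so other sites would need a different check.

```python
from html.parser import HTMLParser


class NoScriptText(HTMLParser):
    """Collect the text found inside <noscript> elements."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'noscript':
            self._inside = True

    def handle_endtag(self, tag):
        if tag == 'noscript':
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.text.append(data)


def looks_javascript_only(html):
    """Hypothetical check: does the page only show a 'JavaScript disabled' notice?"""
    parser = NoScriptText()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so line breaks don't split the phrase.
    message = ' '.join(' '.join(parser.text).split()).lower()
    return 'javascript is disabled' in message


# Trimmed version of the response shown above.
sample = """
<body>
  <noscript>
    <form action="https://mobile.twitter.com/i/nojs_router?path=%2Fbillgates" method="POST">
      <p>We've detected that JavaScript is disabled in your browser.</p>
    </form>
  </noscript>
</body>
"""

print(looks_javascript_only(sample))
```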

Smartphones:

Now let's try an Android 9 phone and Google Chrome.

import requests # Import requests
from bs4 import BeautifulSoup # Import BeautifulSoup4

# Android 9 with Google Chrome
user_agent_smartphone = 'Mozilla/5.0 (Linux; Android 9; SM-G960F '\
'Build/PPR1.180610.011; wv) AppleWebKit/537.36 (KHTML, like Gecko) '\
'Version/4.0 Chrome/74.0.3729.157 Mobile Safari/537.36'

headers = {'User-Agent': user_agent_smartphone}

url_twitter = 'https://twitter.com/billgates'
resp = requests.get(url_twitter, headers=headers)  # Send request

code = resp.status_code  # HTTP response code
if code == 200:
    soup = BeautifulSoup(resp.text, 'lxml')  # Parsing the HTML
    print(soup.prettify())
else:
    print(f'Error loading Twitter: {code}')

The response is quite similar to the desktop one, and for the same reason: the server expects a smartphone to run JavaScript to display the page content.

...
<body>
  <noscript>
   <form action="https://mobile.twitter.com/i/nojs_router?path=%2Fbillgates" method="POST" style="background-color: #fff; position: fixed; top: 0; left: 0; right: 0; bottom: 0; z-index: 9999;">
    <div style="font-size: 18px; font-family: Helvetica,sans-serif; line-height: 24px; margin: 10%; width: 80%;">
     <p>
      We've detected that JavaScript is disabled in your browser. Would you like to proceed to legacy Twitter?
     </p>
...

Featurephone:

Finally, let's see the response when we send the request as an old mobile phone:

import requests # Import requests
from bs4 import BeautifulSoup # Import BeautifulSoup4

# Nokia 5310 with UC Browser
user_agent_old_phone = 'Nokia5310XpressMusic_CMCC/2.0 (10.10) Profile/MIDP-2.1 '\
'Configuration/CLDC-1.1 UCWEB/2.0 (Java; U; MIDP-2.0; en-US; '\
'Nokia5310XpressMusic) U2/1.0.0 UCBrowser/9.5.0.449 U2/1.0.0 Mobile'

headers = {'User-Agent': user_agent_old_phone}

url_twitter = 'https://twitter.com/billgates'
resp = requests.get(url_twitter, headers=headers)  # Send request

code = resp.status_code  # HTTP response code
if code == 200:
    soup = BeautifulSoup(resp.text, 'lxml')  # Parsing the HTML
    print(soup.prettify())
else:
    print(f'Error loading Twitter: {code}')

Let's see the response:

...
<table class="tweet" href="/BillGates/status/1249497817900433408?p=v">
  <tr class="tweet-header">
    <td class="avatar" rowspan="3">
      <a href="/BillGates?p=i"><img alt="Bill Gates" src="https://pbs.twimg.com/profile_images/988775660163252226/XpgonN0X_normal.jpg" /></a>
    </td>
    <td class="user-info">
      <a href="/BillGates?p=s">
        <strong class="fullname">Bill Gates</strong>
        <div class="username"> <span>@</span>BillGates</div>
      </a>
    </td>
    <td class="timestamp">...</td>
  </tr>
  <tr class="tweet-container">
    <td class="tweet-content" colspan="2">
      <div class="tweet-text" data-id="1249497817900433408">
        <div class="dir-ltr" dir="ltr">
          .
          <a class="twitter-atreply dir-ltr" data-mentioned-user-id="17004618" data-screenname="NickKristof" dir="ltr"
            href="/NickKristof">
            @NickKristof
          </a>
          does an amazing job capturing the heroism of the health care workers on the front lines of the
          coronavirus fight.
          <a class="twitter_external_link dir-ltr tco-link"
            data-expanded-url="https://twitter.com/NickKristof/status/1248996159491919873"
            data-url="https://twitter.com/NickKristof/status/1248996159491919873" dir="ltr"
            href="https://t.co/x1TgE2oNXE" rel="nofollow noopener" target="_blank"
            title="https://twitter.com/NickKristof/status/1248996159491919873">
            twitter.com/NickKristof/st…
          </a>
        </div>
      </div>
    </td>
  </tr>
  <tr>
    <td class="meta-and-actions" colspan="2">...</span>
      <span class="tweet-actions">...</span>
    </td>
  </tr>
</table>
...

This time we get a useful response. Popular sites like Twitter, Facebook or Google keep versions for all these devices because they started out on them, they expect users from every kind of device, and they want their services to always be available.
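
Once the server returns plain HTML like this, the tweet text can be extracted without running any JavaScript. Below is a minimal sketch using only the standard library's html.parser (BeautifulSoup would do the same job with less code). The class name TweetTextParser is hypothetical, the sample markup is a shortened copy of the response above, and the tweet-text class name is taken from that response and may change at any time.

```python
from html.parser import HTMLParser


class TweetTextParser(HTMLParser):
    """Collect the visible text of every <div class="tweet-text"> element."""

    def __init__(self):
        super().__init__()
        self.tweets = []   # finished tweet texts
        self._depth = 0    # <div> nesting depth inside the current tweet
        self._chunks = []  # text fragments of the current tweet

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        classes = (dict(attrs).get('class') or '').split()
        if self._depth == 0 and 'tweet-text' in classes:
            self._depth = 1
            self._chunks = []
        elif self._depth > 0:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == 'div' and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Join the fragments and collapse runs of whitespace.
                self.tweets.append(' '.join(' '.join(self._chunks).split()))

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)


# Shortened copy of the featurephone response shown above.
sample_html = """
<div class="tweet-text" data-id="1249497817900433408">
  <div class="dir-ltr" dir="ltr">
    <a class="twitter-atreply dir-ltr" href="/NickKristof">@NickKristof</a>
    does an amazing job capturing the heroism of the health care workers.
  </div>
</div>
"""

parser = TweetTextParser()
parser.feed(sample_html)
print(parser.tweets)
```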

Final thoughts

User Agents can be used to request a page from the server in a particular predefined format. In the examples we saw how popular websites respond differently depending on the device that visits them, and we can use this to our advantage when scraping.

In practice, it is usually not necessary to use mobile User Agents. If all we want is to avoid being detected during our scraping tasks, it is enough to rotate between common desktop User Agents.
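
One simple way to rotate is sketched below, assuming a small hand-picked list. The first string is the Chrome one used earlier in this post; the other two are illustrative examples of the usual Firefox and macOS Chrome formats, and the helper name next_headers is hypothetical.

```python
from itertools import cycle

# Illustrative desktop User Agents; replace them with current, popular ones.
DESKTOP_USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) '
    'Gecko/20100101 Firefox/74.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
]

# cycle() walks the list forever, starting over after the last entry.
_rotation = cycle(DESKTOP_USER_AGENTS)


def next_headers():
    """Hypothetical helper: return headers carrying the next User Agent."""
    return {'User-Agent': next(_rotation)}


for _ in range(4):
    print(next_headers()['User-Agent'])
```

The returned dictionary can be passed as the headers argument of requests.get, exactly like the headers variable in the examples above.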

I hope this guide has helped you understand the differences between User Agents.
