DEV Community

Timothy Huang

Get top 50 web traffic sites with Python

We can get the top 50 web traffic sites from two site traffic monitoring services, Alexa and SimilarWeb. Their top site lists are published at https://www.alexa.com/topsites/ and https://www.similarweb.com/top-websites/.

With the Python requests and BeautifulSoup modules, we can automatically list the top 50 web sites from these two monitoring services.

First, we create a dict that stores the URL and the selector (for BeautifulSoup select) of each monitoring service:

webRankSites = {
  "Alexa": {
    "url": "https://www.alexa.com/topsites/",
    "selector": "div.DescriptionCell"
  },
  "SimilarWeb": {
    "url": "https://www.similarweb.com/top-websites/",
    "selector": "td.topRankingGrid-cell.topWebsitesGrid-cellWebsite.showInMobile"
  }
}

How do we define the selectors? We need to inspect the page content of each service and find the elements that hold the site list:

  1. Alexa:
    (Screenshot: Alexa selector)
    As the developer tools show, each web site name is in a div element with the class DescriptionCell, so the selector is "div.DescriptionCell".

  2. SimilarWeb:
    (Screenshot: SimilarWeb selector)
    Each web site name is in a td element with 3 classes: topRankingGrid-cell, topWebsitesGrid-cellWebsite and showInMobile. The selector is "td.topRankingGrid-cell.topWebsitesGrid-cellWebsite.showInMobile".
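
To see how these two selectors behave, here is a minimal sketch that runs them against a small inline HTML string instead of the live pages. SAMPLE_HTML and select_text are made up for illustration; the markup only imitates the structure described above, and the real pages are larger and may change at any time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure described above.
SAMPLE_HTML = """
<table>
  <tr>
    <td class="topRankingGrid-cell topWebsitesGrid-cellWebsite showInMobile"> google.com </td>
    <td class="topRankingGrid-cell showInMobile">Search</td>
  </tr>
  <tr>
    <td class="topRankingGrid-cell topWebsitesGrid-cellWebsite showInMobile"> youtube.com </td>
    <td class="topRankingGrid-cell showInMobile">Video</td>
  </tr>
</table>
<div class="DescriptionCell"><p><a href="#">Google.com</a></p></div>
"""


def select_text(html, selector):
    """Return the stripped text of every element matching a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [element.text.strip() for element in soup.select(selector)]


# "td.a.b.c" only matches td elements that carry all three classes,
# so the cells without topWebsitesGrid-cellWebsite are skipped.
similarweb_sites = select_text(
    SAMPLE_HTML,
    "td.topRankingGrid-cell.topWebsitesGrid-cellWebsite.showInMobile",
)
alexa_sites = select_text(SAMPLE_HTML, "div.DescriptionCell")

print(similarweb_sites)
print(alexa_sites)
```

Chaining the class names without spaces means "all of these classes on the same element", which is why the category cells in the sample are not selected.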

Second, we fetch each URL with requests.get and use the BeautifulSoup selector to extract the web site list. The myheaders dict is needed for the SimilarWeb service, since a request without a user-agent header gets response status code 403:


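import requests
from bs4 import BeautifulSoup

# SimilarWeb answers with status code 403 when no user-agent is sent
myheaders = {"user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0(HTTP_USER_AGENT)"}

for site in webRankSites:
  print("site: " + site)
  resp = requests.get(webRankSites[site]["url"], headers=myheaders)
  soup = BeautifulSoup(resp.text, 'html.parser')
  # each selected element holds one web site name
  items = soup.select(webRankSites[site]["selector"])
  # number the sites starting from 1
  for i, item in enumerate(items, start=1):
    print(str(i) + ". " + item.text.strip())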

Then we can get the result:

site: Alexa
1. Google.com
2. Youtube.com
3. Tmall.com
...
site: SimilarWeb
1. google.com
2. youtube.com
3. facebook.com
...

Wow, the list can be fetched automatically and it looks great. Wanna try? Check this demo:

https://repl.it/@timhuangt/GlobalTopSite

And enjoy it! Happy coding!!

Discussion (2)

Timothy Huang Author

SimilarWeb now answers the request from this code with a page that does not contain the ranking list, so the SimilarWeb list does not appear. I've checked resp.text:

<html>
   <head>
      <title>Pardon Our Interruption</title>
....
              <div class="Title">Pardon Our Interruption...</div>
               <div class="Paragraph">As you were browsing similarweb.com something about your browser made us think you were a bot. There are a few reasons this might happen:</div>
               <ul class="ListContainer">
                  <li class="ListItem">You're a power user moving through this website with super-human speed.</li>
                  <li class="ListItem">You've disabled JavaScript in your web browser.</li>
               </ul>

I have no idea how to get around this interruption. Any ideas?
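
This does not solve the interruption, but the script can at least notice it instead of silently printing an empty list. Below is a rough sketch; looks_like_block_page is a hypothetical helper, and the marker strings are copied from the response quoted above, so the check is only a heuristic that stops working if the wording changes.

```python
def looks_like_block_page(html):
    """Heuristic: True when the HTML resembles the interruption page quoted above."""
    markers = (
        "Pardon Our Interruption",
        "made us think you were a bot",
    )
    return any(marker in html for marker in markers)


# Shortened stand-ins for resp.text, for illustration only.
blocked_html = "<html><head><title>Pardon Our Interruption</title></head></html>"
normal_html = "<html><body><td class='showInMobile'>google.com</td></body></html>"

for name, html in (("blocked", blocked_html), ("normal", normal_html)):
    if looks_like_block_page(html):
        print(name + ": blocked, no ranking list in this response")
    else:
        print(name + ": looks like a regular page")
```

In the main loop, the same check could be applied to resp.text before calling soup.select.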

Timothy Huang Author (edited)

Someone suggested adding a header
"authority": "similarweb.com"
to the request, which helps to get the result from similarweb.com. (The code at repl.it/@timhuangt/GlobalTopSite is already updated.)
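
One possible way to apply this suggestion is sketched below. It is not necessarily how the demo code was updated: similarweb_headers and fetch_similarweb are hypothetical names, the timeout value is an assumption, and whether the extra header still gets past the bot check is not guaranteed.

```python
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0(HTTP_USER_AGENT)",
}


def similarweb_headers():
    """Return a copy of the base headers plus the suggested "authority" header."""
    headers = dict(BASE_HEADERS)  # copy, so BASE_HEADERS stays unchanged
    headers["authority"] = "similarweb.com"
    return headers


def fetch_similarweb(url):
    """Request the URL with the extended headers (needs network access)."""
    # Imported here so the header helper above can be used without requests.
    import requests

    return requests.get(url, headers=similarweb_headers(), timeout=10)


print(similarweb_headers())
```

Calling fetch_similarweb("https://www.similarweb.com/top-websites/") would then return the response object, whose text can be parsed with BeautifulSoup as before.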