Anurag Rana

Posted on • Originally published at pythoncircle.com

Collecting one million website links

I needed a collection of different website links to experiment with a Docker cluster, so I wrote this small script to collect one million website URLs.

The code is available on GitHub too.

Running script:

Either create a new virtual environment using Python 3 or use an existing one on your system.
Install the dependencies.

pip install requests beautifulsoup4 lxml

Activate the virtual environment and run the code.

python one_million_websites.py

Complete Code:

import requests
from bs4 import BeautifulSoup
import time


headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/64.0.3282.167 Chrome/64.0.3282.167 Safari/537.36"
}

site_link_count = 0

for i in range(1, 201):
    url = "http://websitelists.in/website-list-" + str(i) + ".html"
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        # Log the failing page with its status code and move on to the next page
        print(url + " " + str(response.status_code))
        continue

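    # Each site link is an <a> tag inside a <td class="web_width"> cell
    soup = BeautifulSoup(response.text, 'lxml')
    sites = soup.find_all("td", {"class": "web_width"})

    links = ""
    for site in sites:
        anchor = site.find("a")
        # Skip cells without a link instead of failing with a TypeError
        if anchor is None or not anchor.get("href"):
            continue
        links += anchor["href"] + "\n"
        site_link_count += 1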

    with open("one_million_websites.txt", "a") as f:
        f.write(links)

    print(str(site_link_count) + " links found")

    time.sleep(1)

We are scraping links from the site http://www.websitelists.in/. If you inspect a page, you can see that each link is an anchor tag inside a td tag with the class web_width.

We convert the page response into a BeautifulSoup object, find all such td elements, and extract the href value of the anchor inside each one.
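The extraction step can be tried offline on a small HTML fragment. The sketch below is only an illustration: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages, and the sample markup, the WebWidthLinkParser class, and the extract_links helper are made up for this example rather than taken from the script or the real site.

```python
from html.parser import HTMLParser


class WebWidthLinkParser(HTMLParser):
    """Collect href values of <a> tags found inside <td class="web_width">."""

    def __init__(self):
        super().__init__()
        self.inside_cell = False  # True while we are inside a matching <td>
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td" and "web_width" in (attrs.get("class") or "").split():
            self.inside_cell = True
        elif tag == "a" and self.inside_cell and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td":
            self.inside_cell = False


# Made-up sample markup (an assumption), not copied from the real site.
SAMPLE_HTML = """
<table>
  <tr><td class="web_width"><a href="http://example.com">example.com</a></td></tr>
  <tr><td class="other"><a href="http://ignored.example">skip me</a></td></tr>
  <tr><td class="web_width"><a href="http://example.org">example.org</a></td></tr>
</table>
"""


def extract_links(html):
    """Return the hrefs found in web_width cells, in document order."""
    parser = WebWidthLinkParser()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    for link in extract_links(SAMPLE_HTML):
        print(link)
```

Only the two anchors inside web_width cells are collected; the link in the other cell is ignored, which mirrors what the find_all call in the script selects.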

Each request already takes more than one second to complete, which is slow but gentle on the server. I still added a one-second delay between requests to avoid HTTP 429 (Too Many Requests) responses.
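If the server does answer with 429 despite the delay, one possible approach is to wait longer and retry instead of skipping the page. The following is a minimal sketch of that idea and is not part of the original script: fetch_politely, its parameters, and the fake server used for the demo are hypothetical names, and fetch is assumed to be any callable returning a (status_code, body) pair, for example a thin wrapper around requests.get.

```python
import time


def fetch_politely(fetch, url, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) and back off while the server answers 429.

    fetch is any callable returning (status_code, body).
    Returns the last (status_code, body) pair that was seen.
    """
    status, body = fetch(url)
    for attempt in range(max_retries):
        if status != 429:
            break
        # Wait base_delay, then twice that, then four times that, and so on.
        sleep(base_delay * (2 ** attempt))
        status, body = fetch(url)
    return status, body


def make_fake_fetch(statuses):
    """Hypothetical stand-in for a server: returns the given statuses in order."""
    remaining = list(statuses)

    def fake_fetch(url):
        return remaining.pop(0), "body of " + url

    return fake_fetch


if __name__ == "__main__":
    waits = []  # record the delays instead of actually sleeping
    result = fetch_politely(
        make_fake_fetch([429, 429, 200]), "http://example.com", sleep=waits.append
    )
    print(result, waits)
```

Passing sleep as a parameter keeps the demo fast; in a real run the default time.sleep would be used.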

The scraped links are appended to the text file one_million_websites.txt in the directory the script is run from.
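Because the file is opened in append mode, running the script a second time adds every link again. A small helper like the one below could be used to read the file back without blank lines or repeats. It is a sketch under that assumption, and unique_links is a hypothetical name that does not appear in the original code.

```python
import tempfile
from pathlib import Path


def unique_links(path):
    """Return the links in the file, without blanks or repeats, in first-seen order."""
    seen = set()
    result = []
    for line in Path(path).read_text().splitlines():
        link = line.strip()
        if link and link not in seen:
            seen.add(link)
            result.append(link)
    return result


if __name__ == "__main__":
    # Demo on a temporary file so the real output file is left untouched.
    with tempfile.TemporaryDirectory() as tmp_dir:
        sample = Path(tmp_dir) / "links.txt"
        sample.write_text("http://a.example\nhttp://b.example\n\nhttp://a.example\n")
        print(unique_links(sample))
```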



Top comments (3)

Dinny Paul

You could use the fake_useragent Python library to change the user agent with every request so that you don't get blocked by the website. You could also use free proxies to change your IP address with every request :)

Anurag Rana

Great suggestions, Dinny. However, I feel we should be gentle on sites and not send too many requests per second. That is why I didn't feel the need to use either of them.

I have written another article where I used a Docker cluster to scrape data at very high speed, although I was not able to achieve the desired results.

pythoncircle.com/post/518/scraping...

Thien Nguyen

Good writing, thanks for that!