Prayson Wilfred Daniel

Web Scraping’s Evolution: From Vintage Vibes to Futuristic Feels

Hold onto your keyboards and brew that extra cup of coffee, because we're about to dive headfirst into the rollercoaster ride that is the evolution of web scraping in Python. From the primitive age of sockets to the futuristic promise of httpx, it's a tale of code, grit, and bytes!

The Stone Age: Direct Sockets

Imagine a time when there were no libraries, just you, raw bytes, and a connection to the server.

import socket

HOST = "data.pr4e.org"
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.connect((HOST, 80))
    cmd = f"GET /romeo.txt HTTP/1.0\r\nHost: {HOST}\r\n\r\n".encode()
    s.sendall(cmd)  # sendall keeps sending until the whole request is out

    data = b''
    while True:
        chunk = s.recv(512)
        if not chunk:  # empty bytes means the server closed the connection
            break

        data += chunk

print(data.decode())

With sockets, you did everything by hand: open the connection, spell out the HTTP request byte by byte (it felt like tapping Morse code), and read the reply in 512-byte chunks until the server hung up. It was raw, it was real, but boy, was it cumbersome.
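To get a feel for how cumbersome, here is a rough sketch of the next chore the socket approach leaves you with: splitting the raw bytes into status line, headers, and body. The split_response helper and the sample bytes are made up for illustration, and the sketch assumes a simple HTTP/1.x response with no chunked encoding.

```python
def split_response(raw: bytes):
    """Split a raw HTTP/1.x response into (status line, headers, body)."""
    # A blank line (CRLF CRLF) separates the header section from the body
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")

    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Repeated header names overwrite each other in this simple sketch
        headers[name.strip()] = value.strip()
    return status_line, headers, body


# Hand-made sample standing in for the bytes collected from the socket
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)

status, headers, body = split_response(sample)
print(status)                   # HTTP/1.1 200 OK
print(headers["Content-Type"])  # text/plain
print(body.decode())            # hello
```

Every library in the rest of this tale does this bookkeeping (and a great deal more) for you.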

The Wild West: urllib

Fast forward, and we reach the Wild West of Python web scraping: the urllib era. Now, artists and coders alike didn't need to wrangle with sockets directly. A new dawn, a new way to fetch, and the world of web data was at our fingertips!

from urllib import request

URL = 'http://data.pr4e.org/romeo.txt'

# The context manager closes the connection once we are done reading
with request.urlopen(URL) as response:
    headers = response.headers.as_string()
    data = response.read()

print(headers)
print(data.decode())

No more Morse code – just the pure, distilled essence of web content, served straight up at the saloon bar. But wait, the saga continues...
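urllib can do a little more than a bare urlopen. As a hedged sketch, the snippet below builds a Request with a custom header and reads the body using the charset the server declares. The build_request and fetch_text helpers and the "romeo-reader/0.1" user agent are invented for this example; only the urllib calls themselves come from the standard library.

```python
from urllib import request

URL = "http://data.pr4e.org/romeo.txt"


def build_request(url: str, user_agent: str = "romeo-reader/0.1") -> request.Request:
    """Describe the request up front; nothing is sent until urlopen is called."""
    return request.Request(url, headers={"User-Agent": user_agent})


def fetch_text(req: request.Request, timeout: float = 10.0) -> str:
    """Send the request and decode the body with the declared charset."""
    with request.urlopen(req, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


req = build_request(URL)
print(req.get_method(), req.full_url)
```

Calling fetch_text(req) performs the actual network round trip; building the Request first keeps the setup separate from the fetch.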

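Before we spotlight the upcoming third-party divas, let's craft our backstage hero, the as_string function. The header objects returned by requests and httpx are dict-like and have no as_string() method of their own, so this helper formats them the way urllib did, ensuring each starlet delivers a consistent, classic encore.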

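from typing import Mapping

def as_string(headers: Mapping[str, str]) -> str:
    """
    Format a header mapping as 'key: value' lines
    """
    text = '\n'.join(f'{key}: {value}'
                     for key, value
                     in headers.items())

    # Blank line at the end, like the one separating headers from the body
    return f"{text}\n\n"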


The Rockstar: requests

Enter the rockstar of web scraping: requests. With its leather jacket and cool demeanour, requests was like the guitarist who just knew how to shred:

import requests

URL = 'http://data.pr4e.org/romeo.txt'
response = requests.get(URL)

print(as_string(response.headers))
print(response.text)

No more manual labor. Everything from session management to cookie handling was as effortless as playing an air guitar. Life was good, but the stage was set for the next sensation.
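The session management mentioned above might look roughly like this. It is a sketch, not the one true way: fetch_text and main are hypothetical helpers, and the "romeo-reader/0.1" user agent is invented. A requests.Session reuses the underlying connection and carries cookies and default headers from one call to the next.

```python
URL = "http://data.pr4e.org/romeo.txt"


def fetch_text(session, url: str, timeout: float = 10.0) -> str:
    """Fetch a URL through a requests-style session and return the body text."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx answers into exceptions
    return response.text


def main() -> None:
    # Imported here so fetch_text can be exercised without the third-party package
    import requests

    with requests.Session() as session:
        # Headers set on the session apply to every request made through it
        session.headers.update({"User-Agent": "romeo-reader/0.1"})
        print(fetch_text(session, URL))
```

Calling main() runs the whole show; any cookies the server sets would be stored on the session and sent back automatically on later requests.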

The Futurist: httpx

In the neon glow of the future, where the world buzzes with asynchronous operations, httpx made its entrance, floating on the cloud of efficiency:

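import asyncio
import httpx

URL = 'http://data.pr4e.org/romeo.txt'

# response = httpx.get(URL)  # synchronous

# asynchronous magic
async def main():
    async with httpx.AsyncClient() as client:
        return await client.get(URL)

# In a script; inside a notebook use `response = await main()` instead
response = asyncio.run(main())

print(as_string(response.headers))
print(response.text)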

With both synchronous rhythms and asynchronous beats, httpx danced its way into the heart of developers, promising a brighter, faster web scraping future.
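Where the asynchronous beat really pays off is in fetching several pages at once. Here is a minimal sketch under stated assumptions: fetch_all and main are hypothetical helpers, and the same romeo.txt URL is requested three times purely for demonstration. asyncio.gather runs the requests concurrently and returns the results in the order the URLs were given.

```python
import asyncio

URL = "http://data.pr4e.org/romeo.txt"


async def fetch_all(client, urls):
    """Fetch every URL concurrently; results come back in input order."""
    return await asyncio.gather(*(client.get(url) for url in urls))


async def main() -> None:
    # Imported here so fetch_all can be exercised without the third-party package
    import httpx

    async with httpx.AsyncClient() as client:
        responses = await fetch_all(client, [URL] * 3)
        for response in responses:
            print(response.status_code, len(response.text))
```

In a script, asyncio.run(main()) kicks things off. With a single URL the gain is nil, but with dozens of pages the waiting overlaps instead of adding up.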

Epilogue of a Made-up Story

So, there you have it: a tale of trials, triumphs, and transformations. From raw connections to refined libraries, web scraping in Python has been a journey worth every byte. As we look to the future, one thing's for sure: the adventure is far from over.

Output: stdout.flush()

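Each snippet above prints output along these lines (only the raw socket version includes the HTTP/1.1 status line, and the exact headers vary slightly by library):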


HTTP/1.1 200 OK
Date: Fri, 20 Oct 2023 08:20:33 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

Until then, keep on scraping and may your scrapes always return 200 OK!
