Hold onto your keyboards and brew that extra cup of coffee, because we're about to dive headfirst into the rollercoaster ride that is the evolution of web scraping in Python. From the primitive age of sockets to the futuristic promise of httpx, it's a tale of code, grit, and snake bytes!
The Stone Age: Direct Sockets
Imagine a time when there were no libraries, just you, raw bytes, and a connection to the server.
import socket

HOST = "data.pr4e.org"

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.connect((HOST, 80))
    # Hand-craft the HTTP request, one byte string at a time
    cmd = f"GET /romeo.txt HTTP/1.0\r\nHost: {HOST}\r\n\r\n".encode()
    s.sendall(cmd)  # sendall(), unlike send(), guarantees every byte goes out
    data = b''
    while True:
        response = s.recv(512)
        if len(response) < 1:
            break
        data += response

print(data.decode())
With sockets, you'd ride the raw frontier, establishing a connection straight from the saloon, sending messages to the server in Morse code (or what felt like it). It was raw, it was real, but boy, was it cumbersome.
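Just how cumbersome? The raw bytes above arrive with the status line and headers glued to the body, and splitting them apart was entirely your job. Here's a minimal sketch of that chore, assuming the data variable from the example above:

# HTTP separates headers from the body with one blank line (\r\n\r\n),
# so partition the raw bytes on that boundary ourselves.
header_bytes, _, body = data.partition(b'\r\n\r\n')

status_line, *header_lines = header_bytes.decode().split('\r\n')
headers = dict(line.split(': ', 1) for line in header_lines)

print(status_line)                  # e.g. HTTP/1.1 200 OK
print(headers.get('Content-Type'))  # e.g. text/plain
print(body.decode())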
The Wild West: urllib
Fast forward, and we reach the Wild West of Python web scraping: the urllib era. Now, artists and coders alike didn't need to wrangle with sockets directly. A new dawn, a new way to fetch, and the world of web data was at our fingertips!
from urllib import request

URL = 'http://data.pr4e.org/romeo.txt'

response = request.urlopen(URL)
data = b''
for line in response:
    data += line

print(response.headers.as_string())
print(data.decode())
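And urllib could do more than a bare fetch. One everyday scraping chore, attaching a custom User-Agent, went through a Request object; a minimal sketch (the header value here is purely illustrative):

from urllib import request

URL = 'http://data.pr4e.org/romeo.txt'

# Build a Request object so we can attach headers before fetching.
req = request.Request(URL, headers={'User-Agent': 'my-scraper/0.1'})
with request.urlopen(req) as response:
    print(response.status)  # 200 on success
    print(response.read().decode())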
No more Morse code – just the pure, distilled essence of web content, ready for digital Wild West saloon consumption. But wait, the saga continues...
Before we spotlight the upcoming third-party divas, let's craft our backstage hero, the as_string function. It'll elegantly choreograph those headers to ensure each starlet delivers a consistent, classic encore.
def as_string(headers: dict) -> str:
    """Pretty-print a mapping of HTTP headers, one per line."""
    text = '\n'.join(f'{key}: {value}'
                     for key, value in headers.items())
    return f"{text}\n\n"
The Rockstar: requests
Enter the rockstar of web scraping: requests. With its leather jacket and cool demeanour, requests was like the guitarist who just knew how to shred:
import requests
URL = 'http://data.pr4e.org/romeo.txt'
response = requests.get(URL)
print(as_string(response.headers))
print(response.text)
No more manual labor. Everything from session management to cookie handling was as effortless as playing an air guitar. Life was good, but the stage was set for the next sensation.
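A Session, for instance, carries cookies and a connection pool across calls. Here's a minimal sketch, assuming the same demo URL (the User-Agent string is illustrative, and a real scraping target would actually set cookies):

import requests

URL = 'http://data.pr4e.org/romeo.txt'

# A Session reuses the underlying TCP connection and carries
# cookies from one request to the next automatically.
with requests.Session() as session:
    session.headers.update({'User-Agent': 'my-scraper/0.1'})
    first = session.get(URL)
    second = session.get(URL)  # reuses the pooled connection
    print(first.status_code, second.status_code)
    print(session.cookies.get_dict())  # any cookies the server set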
The Futurist: httpx
In the neon glow of the future, where the world buzzes with asynchronous operations, httpx made its entrance, floating on the cloud of efficiency:
import asyncio
import httpx

URL = 'http://data.pr4e.org/romeo.txt'

# response = httpx.get(URL)  # the synchronous API mirrors requests
async def main():
    # asynchronous magic
    async with httpx.AsyncClient() as client:
        response = await client.get(URL)
    print(as_string(response.headers))
    print(response.text)

asyncio.run(main())
With both synchronous rhythms and asynchronous beats, httpx danced its way into the hearts of developers, promising a brighter, faster web scraping future.
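The real payoff of those asynchronous beats is concurrency: many pages fetched in roughly the time of one. A minimal sketch, assuming a hypothetical list of targets (here, the same demo file repeated):

import asyncio
import httpx

# Illustrative list of pages to fetch; swap in your own targets.
URLS = ['http://data.pr4e.org/romeo.txt'] * 3

async def fetch_all(urls):
    async with httpx.AsyncClient() as client:
        # gather() runs all the GETs concurrently on one connection pool
        return await asyncio.gather(*(client.get(u) for u in urls))

for response in asyncio.run(fetch_all(URLS)):
    print(response.status_code, len(response.text))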
Epilogue of a Made-up Story
So, there you have it: a tale of trials, triumphs, and transformations. From raw connections to refined libraries, web scraping in Python has been a journey worth every byte. As we look to the future, one thing's for sure: the adventure is far from over.
Output: stdout.flush()
The printed output of the examples above looks roughly like this (only the raw socket version includes the HTTP status line; the libraries hand you the headers already parsed):
HTTP/1.1 200 OK
Date: Fri, 20 Oct 2023 08:20:33 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
Until then, keep on scraping, and may your scrapes always return 200 OK!