Hold onto your keyboards and brew that extra cup of coffee, because we're about to dive headfirst into the rollercoaster ride that is the evolution of web scraping in Python. From the primitive age of sockets to the futuristic promise of httpx, it's a tale of code, grit, and snake bytes!
The Stone Age: Direct Sockets
Imagine a time when there were no libraries, just you, raw bytes, and a connection to the server.
import socket

HOST = "data.pr4e.org"

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.connect((HOST, 80))
    # Hand-craft the HTTP request, one byte string at a time
    cmd = f"GET /romeo.txt HTTP/1.0\r\nHost: {HOST}\r\n\r\n".encode()
    s.sendall(cmd)  # sendall(), unlike send(), guarantees every byte goes out
    data = b''
    while True:
        response = s.recv(512)
        if len(response) < 1:
            break
        data += response

print(data.decode())
With sockets, you'd ride the raw frontier, establishing a connection straight from the saloon, sending messages to the server in Morse code (or what felt like it). It was raw, it was real, but boy, was it cumbersome.
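Just how cumbersome? The raw bytes above arrive with the status line and headers glued to the body, and splitting them apart was entirely your job. Here's a minimal sketch of that chore, assuming the data variable from the example above:

# HTTP separates headers from the body with one blank line (\r\n\r\n),
# so partition the raw bytes on that boundary ourselves.
header_bytes, _, body = data.partition(b'\r\n\r\n')

status_line, *header_lines = header_bytes.decode().split('\r\n')
headers = dict(line.split(': ', 1) for line in header_lines)

print(status_line)                  # e.g. HTTP/1.1 200 OK
print(headers.get('Content-Type'))  # e.g. text/plain
print(body.decode())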
The Wild West: urllib
Fast forward, and we reach the Wild West of Python web scraping: the urllib era. Now, artists and coders alike didn't need to wrangle with sockets directly. A new dawn, a new way to fetch, and the world of web data was at our fingertips!
from urllib import request

URL = 'http://data.pr4e.org/romeo.txt'

response = request.urlopen(URL)
data = b''
for line in response:
    data += line

print(response.headers.as_string())
print(data.decode())
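And urllib could do more than a bare fetch. One everyday scraping chore, attaching a custom User-Agent, went through a Request object; a minimal sketch (the header value here is purely illustrative):

from urllib import request

URL = 'http://data.pr4e.org/romeo.txt'

# Build a Request object so we can attach headers before fetching.
req = request.Request(URL, headers={'User-Agent': 'my-scraper/0.1'})
with request.urlopen(req) as response:
    print(response.status)  # 200 on success
    print(response.read().decode())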
No more Morse code – just the pure, distilled essence of web content, ready for digital Wild West saloon consumption. But wait, the saga continues...
Before we spotlight the upcoming third-party divas, let's craft our backstage hero, the as_string function. It'll elegantly choreograph those headers to ensure each starlet delivers a consistent, classic encore.
def as_string(headers: dict) -> str:
    """Pretty-print a mapping of HTTP headers, one per line."""
    text = '\n'.join(f'{key}: {value}'
                     for key, value in headers.items())
    return f"{text}\n\n"
The Rockstar: requests
Enter the rockstar of web scraping: requests. With its leather jacket and cool demeanour, requests was like the guitarist who just knew how to shred:
import requests
URL = 'http://data.pr4e.org/romeo.txt'
response = requests.get(URL)
print(as_string(response.headers))
print(response.text)
No more manual labor. Everything from session management to cookie handling was as effortless as playing an air guitar. Life was good, but the stage was set for the next sensation.
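A Session, for instance, carries cookies and a connection pool across calls. Here's a minimal sketch, assuming the same demo URL (the User-Agent string is illustrative, and a real scraping target would actually set cookies):

import requests

URL = 'http://data.pr4e.org/romeo.txt'

# A Session reuses the underlying TCP connection and carries
# cookies from one request to the next automatically.
with requests.Session() as session:
    session.headers.update({'User-Agent': 'my-scraper/0.1'})
    first = session.get(URL)
    second = session.get(URL)  # reuses the pooled connection
    print(first.status_code, second.status_code)
    print(session.cookies.get_dict())  # any cookies the server set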
The Futurist: httpx
In the neon glow of the future, where the world buzzes with asynchronous operations, httpx made its entrance, floating on the cloud of efficiency:
import asyncio
import httpx

URL = 'http://data.pr4e.org/romeo.txt'

# response = httpx.get(URL)  # the synchronous API mirrors requests
async def main():
    # asynchronous magic
    async with httpx.AsyncClient() as client:
        response = await client.get(URL)
    print(as_string(response.headers))
    print(response.text)

asyncio.run(main())
With both synchronous rhythms and asynchronous beats, httpx danced its way into the hearts of developers, promising a brighter, faster web scraping future.
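The real payoff of those asynchronous beats is concurrency: many pages fetched in roughly the time of one. A minimal sketch, assuming a hypothetical list of targets (here, the same demo file repeated):

import asyncio
import httpx

# Illustrative list of pages to fetch; swap in your own targets.
URLS = ['http://data.pr4e.org/romeo.txt'] * 3

async def fetch_all(urls):
    async with httpx.AsyncClient() as client:
        # gather() runs all the GETs concurrently on one connection pool
        return await asyncio.gather(*(client.get(u) for u in urls))

for response in asyncio.run(fetch_all(URLS)):
    print(response.status_code, len(response.text))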
Epilogue of a Made-up Story
So, there you have it: a tale of trials, triumphs, and transformations. From raw connections to refined libraries, web scraping in Python has been a journey worth every byte. As we look to the future, one thing's for sure: the adventure is far from over.
Output: stdout.flush()
The printed output of the examples above looks roughly like this (only the raw socket version includes the HTTP status line; the libraries hand you the headers already parsed):
HTTP/1.1 200 OK
Date: Fri, 20 Oct 2023 08:20:33 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
Until then, keep on scraping, and may your scrapes always return 200 OK!