So far we've seen what takes place behind servers and networking. The modern web, of course, is more than just a network of echo servers: much of it is powered by HTTP (HyperText Transfer Protocol). This article will discuss some of the inner workings of HTTP using various Python code and modules. For those looking for more resources, I highly recommend the Mozilla Developer Network documentation on anything web related.
- Security Notes
- HTTP Versions
- A URL
- A Basic Request
- Response
- A Better Server
- Headers
- Cookies
- Request Types
- Status Codes
- Conclusion
Security Notes
The code presented here is for learning purposes. Given the complexity of modern-day web services, I highly discourage trying to roll your own web server outside of learning purposes on an isolated network. You should instead evaluate a secure and well maintained web server that meets your needs. Traffic here is also unencrypted, meaning anyone could snoop on the data. So to summarize:
- Don't use this code in production
- Always make sure your network communication is encrypted and that the encryption method is not outdated / insecure
HTTP Versions
The HTTP protocol has seen a number of revisions over the years. Version 1.0 was released in 1996 as RFC 1945. It was followed in 1999 by HTTP/1.1, which added a number of features that are widely used on the modern web.
Currently HTTP/2 is considered the modern standard. Many of its features helped work out performance issues with the way modern web applications behave. HTTP/3 is the newest standard, built on QUIC, a UDP-based transport protocol. In particular it looks to reduce the round trips spent negotiating secure connections.
Taking support into consideration, this article will cover standards set by HTTP/1.1.
A URL
URL stands for Uniform Resource Locator and is a subset of URI, or Uniform Resource Identifier. The specifics of URLs are defined in RFC 1738. Despite how it may seem, URLs are not only for reaching HTTP servers, though that's certainly one of the more popular use cases. The scheme section allows them to work with other services as well, such as FTP and Gopher. The schemes supported at the time can be found in the RFC; the IANA keeps a more up-to-date and extensive list. Python offers the urllib module which can be used to work with URLs:
from urllib.parse import urlparse
URL = 'https://datatracker.ietf.org/doc/html/rfc1738#section-3'
print(urlparse(URL))
This gives the output:
ParseResult(scheme='https', netloc='datatracker.ietf.org', path='/doc/html/rfc1738', params='', query='', fragment='section-3')
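Since ParseResult is a named tuple, components can also be swapped out and the URL rebuilt, which is handy for constructing variations of a URL. A quick sketch:

from urllib.parse import urlparse

URL = 'https://datatracker.ietf.org/doc/html/rfc1738#section-3'
parsed_url = urlparse(URL)
# _replace returns a copy with the given field changed; geturl() reassembles it
print(parsed_url._replace(fragment='section-5').geturl())
# Output:
# https://datatracker.ietf.org/doc/html/rfc1738#section-5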
With a more complex example:
from urllib.parse import urlparse
URL = 'https://user:password@domain.com:7777/'
parsed_url = urlparse(URL)
print(parsed_url.hostname)
print(parsed_url.username)
print(parsed_url.password)
print(parsed_url.port)
# Output:
# domain.com
# user
# password
# 7777
There are some cases where a URL path contains characters outside the safe set, such as a space character. To deal with such cases the values can be URL encoded. This is done by taking the hex value of the character's byte (including the extended ASCII table) and adding a % in front of it. urllib.parse.quote is able to handle such encoding:
from urllib.parse import quote
print(quote('/path with spaces'))
# Output:
# /path%20with%20spaces
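The reverse operation is handled by urllib.parse.unquote:

from urllib.parse import unquote

print(unquote('/path%20with%20spaces'))
# Output:
# /path with spaces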
A Basic Request
To look at the request process I'll use a basic socket server on port 80 and the requests library, which can be installed with pip install requests:
import socket, os, pwd, grp
import socketserver

# https://stackoverflow.com/a/2699996
def drop_privileges(uid_name='nobody', gid_name='nogroup'):
    if os.getuid() != 0:
        # We're not root so, like, whatever dude
        return
    # Get the uid/gid from the name
    running_uid = pwd.getpwnam(uid_name).pw_uid
    running_gid = grp.getgrnam(gid_name).gr_gid
    # Remove group privileges
    os.setgroups([])
    # Try setting the new uid/gid
    os.setgid(running_gid)
    os.setuid(running_uid)
    # owner/group r+w+x
    old_umask = os.umask(0o007)

class MyTCPHandler(socketserver.StreamRequestHandler):
    """
    The request handler class for our server.

    It is instantiated once per connection to the server, and must
    override the handle() method to implement communication to the
    client.
    """
    def handle(self):
        self.data = self.rfile.readline()
        print(self.data)

if __name__ == "__main__":
    HOST, PORT = "localhost", 80

    # Create the server, binding to localhost on port 80
    with socketserver.TCPServer((HOST, PORT), MyTCPHandler) as server:
        # Activate the server; this will keep running until you
        # interrupt the program with Ctrl-C
        print(f'Server bound to port {PORT}')
        drop_privileges()
        server.serve_forever()
and the client:
import requests
requests.get('http://localhost/')
Now this is an incomplete exchange on purpose so I can show things line by line. This means simple_client.py will error out with "connection reset by peer" as it's expecting an HTTP response. On the server side we see:
Server bound to port 80
b'GET / HTTP/1.1\r\n'
The first line is what the HTTP RFC calls the Request Line. First comes the method, followed by the request target, the HTTP version to use, and finally a CRLF (Carriage Return '\r' Line Feed '\n'). So the GET method is being used on the path / while requesting HTTP/1.1 as the version.
Now you'll notice the request didn't have to declare that it was using port 80. That's because IANA defines port 80 as the HTTP service port, so implementations know to use it by default (or 443 for HTTPS).
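As a quick sanity check, Python can look up these registered service ports through the OS service database (assuming your system's /etc/services lists them, as most do):

import socket

# Look up the registered default ports for HTTP and HTTPS
print(socket.getservbyname('http'))
print(socket.getservbyname('https'))
# Output:
# 80
# 443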
Response
Next I'll read in the rest of the lines for the HTTP request:
    def handle(self):
        self.data = self.rfile.readlines()
        print(self.data)
Server bound to port 80
[b'GET / HTTP/1.1\r\n', b'Host: localhost\r\n', b'User-Agent: python-requests/2.22.0\r\n', b'Accept-Encoding: gzip, deflate\r\n', b'Accept: */*\r\n', b'Connection: keep-alive\r\n', b'\r\n']
Now to clean up the output a bit:
GET / HTTP/1.1
Host: localhost
User-Agent: python-requests/2.22.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
After the request line we see a number of key/value pairs separated by a colon. This is something we'll get to shortly, but for now I'll pass back data to finish the connection. While I'm here I'll also re-organize the handler class to make it easier to follow:
class MyTCPHandler(socketserver.BaseRequestHandler):
    """
    The request handler class for our server.

    It is instantiated once per connection to the server, and must
    override the handle() method to implement communication to the
    client.
    """
    def read_http_request(self):
        print("reading request")
        self.data = self.request.recv(8192)
        print(self.data)

    def write_http_response(self):
        print("writing response")
        response_lines = [
            b'HTTP/1.1 200\r\n',
            b'Content-Type: text/plain\r\n',
            b'Content-Length: 12\r\n',
            b'Location: http://localhost/\r\n',
            b'\r\n',
            b'Hello World\n'
        ]
        for response_line in response_lines:
            self.request.send(response_line)
        print("response sent")

    def handle(self):
        self.read_http_request()
        self.write_http_response()
        self.request.close()
        print("connection closed")
and the client is also slightly modified:
import requests
r = requests.get('http://localhost/')
print(r.headers)
print(r.text)
So the handler class has been changed back to socketserver.BaseRequestHandler as I don't need single-line reads anymore. I'm also writing back a static response for now. Finally, the handle() method gives a nice overview of the different steps. Now as an example:
Server:
Server bound to port 80
reading request
b'GET / HTTP/1.1\r\nHost: localhost\r\nUser-Agent: python-requests/2.22.0\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
writing response
response sent
connection closed
Client:
{'Content-Type': 'text/plain', 'Content-Length': '12', 'Location': 'http://localhost/'}
Hello World
As with requests, responses have their own status line, which in this case is:
HTTP/1.1 200\r\n
First is the HTTP version, confirming the server can communicate in that version. Next is a status code indicating the nature of the response; I'll touch on some of the status codes later in the article. In this case, 200 confirms the request was valid and everything went okay. With an initial response working, it's time to look at things in a bit more depth.
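As an aside, here's a minimal sketch of how a client might split a status line apart. Note the reason phrase (such as OK) is optional, which is why our bare HTTP/1.1 200 is still valid:

# A minimal sketch of splitting a response status line
status_line = b'HTTP/1.1 200 OK\r\n'
parts = status_line.rstrip(b'\r\n').split(b' ', 2)
version, status_code = parts[0], int(parts[1])
# The reason phrase may be absent entirely
reason = parts[2] if len(parts) > 2 else b''
print(version, status_code, reason)
# Output:
# b'HTTP/1.1' 200 b'OK'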
A Better Server
Now that we've seen the raw elements of an HTTP request, it's time to abstract out a bit. Python has an http.server module built on top of socketserver, with various components to facilitate serving HTTP traffic. So now our server looks something like this:
import os, pwd, grp
from http.server import ThreadingHTTPServer, BaseHTTPRequestHandler

# https://stackoverflow.com/a/2699996
def drop_privileges(uid_name='nobody', gid_name='nogroup'):
    if os.getuid() != 0:
        return
    running_uid = pwd.getpwnam(uid_name).pw_uid
    running_gid = grp.getgrnam(gid_name).gr_gid
    os.setgroups([])
    os.setgid(running_gid)
    os.setuid(running_uid)
    old_umask = os.umask(0o007)

class MyHTTPHandler(BaseHTTPRequestHandler):
    def read_http_request(self):
        self.log_message(f"Reading request from {self.client_address}")
        print(dict(self.headers.items()))

    def write_http_response(self):
        self.log_message(f"Writing response to {self.client_address}")
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'Hello World\n')

    def do_GET(self):
        self.read_http_request()
        self.write_http_response()
        self.request.close()

if __name__ == "__main__":
    HOST, PORT = "localhost", 80
    with ThreadingHTTPServer((HOST, PORT), MyHTTPHandler) as server:
        print(f'Server bound to port {PORT}')
        drop_privileges()
        server.serve_forever()
Now the server is doing some of the heavy lifting for us. It has information about the request line and sends sensible default headers in the response. Speaking of which, let's look at headers.
Headers
So after doing another run of the new server:
{'Host': 'localhost', 'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
We see the headers sent in the request to us. Now the thing with headers is that, with the exception of Host (required by the HTTP/1.1 standard), the others are not something the server strictly has to concern itself with. Host is required because a single IP can host multiple domains (in fact this becomes more common with CDNs). So if I add an /etc/hosts entry like so:
127.0.0.1 webserver
Then I could make the following change:
    def write_http_response(self):
        self.log_message(f"Writing response to {self.client_address}")
        self.send_response(200)
        self.end_headers()
        self.wfile.write(bytes(f'Hello {self.headers["Host"]}\n', 'utf-8'))
And as an example:
{'Server': 'BaseHTTP/0.6 Python/3.10.6', 'Date': 'Tue, 25 Jul 2023 02:57:29 GMT'}
Hello localhost
{'Server': 'BaseHTTP/0.6 Python/3.10.6', 'Date': 'Tue, 25 Jul 2023 02:57:29 GMT'}
Hello webserver
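For reference, a client along these lines produces that output, assuming the /etc/hosts entry above is in place:

import requests

# Both names resolve to 127.0.0.1; only the Host header differs
for host in ('localhost', 'webserver'):
    r = requests.get(f'http://{host}/')
    print(r.headers)
    print(r.text)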
Despite connecting via the same IP address I'm getting different content back. As for the rest of the headers, it's a pretty long list, so I'll go over some of the fundamental ones (some of the more specific ones come up in later sections).
Type Headers
These headers have to do with the type of content that's being delivered. Request versions ask for certain types of content, and response versions give metadata about the content. Accept is one of the more important ones and is related to the type of content indicated by MIME (Multipurpose Internet Mail Extensions). MIME indicates the type of a file and was originally created to describe non-textual content within the normally textual format of email. This helps in differentiating between, say, parsing HTML and showing an image. Not surprisingly, IANA manages the official list of MIME types. Python has a mimetypes module which maps file extensions to the system's MIME type database:
import mimetypes
mimetypes.init()
print(mimetypes.guess_all_extensions('text/plain'))
print(mimetypes.types_map['.html'])
# Output
# ['.txt', '.bat', '.c', '.h', '.ksh', '.pl', '.csh', '.jsx', '.sln', '.slnf', '.srf', '.ts']
# text/html
Now this of course assumes that a file with a certain extension actually is that type of file. Realistically a malicious actor could simply rename their malware to .jpg or similar, so it's not a very good source of truth if you can't completely trust the intentions of your users. Instead we can use python-magic, which identifies files by their contents. So after doing a pip install python-magic (Windows users will need to install python-magic-bin instead, which includes the required DLLs):
import magic
f = magic.Magic(mime=True)
print(f.from_file('/bin/bash'))
# Output application/x-sharedlib
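python-magic can also inspect bytes already in memory via from_buffer, which is handy for checking uploaded data before it ever touches disk (the exact output may vary with your libmagic version):

import magic

f = magic.Magic(mime=True)
# from_buffer inspects a byte string instead of a file on disk
print(f.from_buffer(b'<html><body>hi</body></html>'))
# Output (on my system):
# text/html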
Some content types you'll likely deal with:
- text/plain: Plain text
- text/html: HTML
- application/json: JSON
Mozilla also has a more extensive list. Now, looking at the request headers, we see:
Accept: */*
Accept-Encoding: gzip, deflate
For basic transactions, Accept: */* is fairly standard if the client isn't sure what the server will respond with. It essentially says the client has no preference on the MIME type the server returns. A more complex example would be:
Accept: text/html, application/xhtml+xml, application/xml;q=0.9, image/webp, */*;q=0.8
This will accept most HTML formats. There's also a ;q=[number] which indicates the preference for that MIME type. If there's no preference indicated then everything is weighted the same and the most specific type will be selected.
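To make the ranking concrete, here's a rough sketch of how a server might order these preferences; a real parser also needs specificity tie-breaking and guards against malformed input:

# Rank an Accept header's MIME types by their q values (default 1.0)
accept = 'text/html, application/xhtml+xml, application/xml;q=0.9, image/webp, */*;q=0.8'

def parse_accept(header):
    ranked = []
    for entry in header.split(','):
        mime, _, param = entry.strip().partition(';')
        q = float(param.partition('=')[2]) if param.startswith('q=') else 1.0
        ranked.append((q, mime))
    return sorted(ranked, reverse=True)

print(parse_accept(accept))
# Output:
# [(1.0, 'text/html'), (1.0, 'image/webp'), (1.0, 'application/xhtml+xml'), (0.9, 'application/xml'), (0.8, '*/*')]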
The server version of this is Content-Type, which indicates the type of content the server will send. Now if you decide to mangle the type to something it's not:
    def write_http_response(self):
        self.log_message(f"Writing response to {self.client_address}")
        self.send_response(200)
        self.send_header('Content-Type', 'image/jpeg')
        self.end_headers()
        self.wfile.write(bytes(f'Hello {self.headers["Host"]}\n', 'utf-8'))
The requests-based client (or really things like curl and wget, which only download) won't care, as it doesn't render the image. Actual browsers on the other hand will throw an error or show a placeholder broken image.
Accept-Encoding indicates that the client supports compressed data being returned. The server specs recommend compression when possible to reduce the amount of data transferred. As it's not uncommon for transfer volume to be a pricing metric, it can also help reduce cost. A server can send Content-Encoding back to indicate it's sending compressed data:
    def write_http_response(self):
        self.log_message(f"Writing response to {self.client_address}")
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.send_header('Content-Encoding', 'gzip')
        self.end_headers()
        return_data = gzip.compress(bytes(f'Hello {self.headers["Host"]}\n', encoding='utf-8'))
        self.wfile.write(return_data)
Requests handles compressed data out of the box, so no changes are needed there, and a run shows that the compression works:
{'Server': 'BaseHTTP/0.6 Python/3.10.6', 'Date': 'Wed, 26 Jul 2023 23:27:53 GMT', 'Content-Type': 'text/plain', 'Content-Encoding': 'gzip'}
Hello localhost
Enhancing The Server Even More
Now to look at some of the other header options I'll update the HTTP server:
import datetime
import grp
import gzip
import hashlib
import os
import pwd
from email.utils import parsedate_to_datetime
from http.server import ThreadingHTTPServer, BaseHTTPRequestHandler

# https://stackoverflow.com/a/2699996
def drop_privileges(uid_name='nobody', gid_name='nogroup'):
    if os.getuid() != 0:
        return
    running_uid = pwd.getpwnam(uid_name).pw_uid
    running_gid = grp.getgrnam(gid_name).gr_gid
    os.setgroups([])
    os.setgid(running_gid)
    os.setuid(running_uid)
    old_umask = os.umask(0o007)

class MyHTTPHandler(BaseHTTPRequestHandler):
    ROUTES = {
        '/': 'serve_front_page',
        '/index.html': 'serve_html',
        '/python-logo/': 'serve_python_logo',
        '/js/myjs.js': 'serve_js',
        '/favicon.ico': 'serve_favicon'
    }
    HTTP_DT_FORMAT = '%a, %d %b %Y %H:%M:%S GMT'

    def read_http_request(self):
        self.log_message(f"Reading request from {self.client_address}")
        print(dict(self.headers.items()))

    def serve_front_page(self):
        self.log_message(f"Writing response to {self.client_address}")
        self.send_response(307)
        self.send_header('Location', '/index.html')
        return b''

    def serve_python_logo(self):
        return self.serve_file_with_caching('python-logo-only.png', 'image/png')

    def serve_favicon(self):
        return self.serve_file_with_caching('favicon.ico', 'image/x-icon')

    def serve_html(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        return b'<html><head><title>Old Website</title><script type="text/javascript" src="/js/myjs.js"></script></head><body><img src="/python-logo/" /></body></html>'

    def serve_js(self):
        js_code = b'const a = Math.random();'
        etag = hashlib.md5(js_code).hexdigest()
        if 'If-None-Match' in self.headers and self.headers['If-None-Match'] == etag:
            self.send_response(304)
            return b''
        else:
            self.send_response(200)
            self.send_header('Etag', etag)
            self.send_header('Content-Type', 'text/javascript')
            self.send_header('Cache-Control', 'public, max-age=10')
            return js_code

    def write_data(self, bytes_data):
        self.send_header('Content-Encoding', 'gzip')
        return_data = gzip.compress(bytes_data)
        self.send_header('Content-Length', len(return_data))
        self.end_headers()
        self.wfile.write(return_data)

    def check_cache(self, filename):
        if 'If-Modified-Since' in self.headers:
            cache_date = parsedate_to_datetime(self.headers['If-Modified-Since'])
            filename_date = datetime.datetime.fromtimestamp(os.path.getmtime(filename), tz=datetime.timezone.utc).replace(microsecond=0)
            return filename_date <= cache_date
        return False

    def serve_file_with_caching(self, filename, file_type):
        self.log_message(f"Writing response to {self.client_address}")
        if self.check_cache(filename):
            self.send_response(304)
            return b''
        else:
            self.send_response(200)
            self.send_header('Content-Type', file_type)
            file_date = datetime.datetime.fromtimestamp(os.path.getmtime(filename), tz=datetime.timezone.utc).replace(microsecond=0)
            self.send_header('Last-Modified', file_date.strftime(self.HTTP_DT_FORMAT))
            self.send_header('Cache-Control', 'public, max-age=10')
            self.send_header('Expires', (datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(0, 10)).strftime(self.HTTP_DT_FORMAT))
            with open(filename, 'rb') as file_fp:
                file_data = file_fp.read()
            return file_data

    def do_GET(self):
        self.read_http_request()
        bytes_data = self.__getattribute__(self.ROUTES[self.path])()
        self.write_data(bytes_data)
        self.request.close()

if __name__ == "__main__":
    HOST, PORT = "localhost", 80
    with ThreadingHTTPServer((HOST, PORT), MyHTTPHandler) as server:
        print(f'Server bound to port {PORT}')
        drop_privileges()
        server.serve_forever()
This will require two files in the same directory as the server:
- https://www.python.org/favicon.ico
- https://www.python.org/community/logos/ (the 269 × 326 PNG of just the "two snakes" logo)
ROUTES allows the handler to act as a simple router, mapping paths to methods in the handler class. The data write method is also abstracted out to gzip compress the data every time. Another method deals with caching logic. I'll go over the parts with respect to header logic.
Redirection
The "Location" header can be utilized with a few status codes to indicate the location of a file that needs to be redirected to. Looking here:
    def serve_front_page(self):
        self.log_message(f"Writing response to {self.client_address}")
        self.send_response(307)
        self.send_header('Location', '/index.html')
        return b''
This will redirect the user to the /index.html page. Note that some CLI-based HTTP clients require an additional option to actually follow the redirect (curl, for example, needs -L). Web browsers on the other hand handle this seamlessly.
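With requests you can observe the redirect both ways; it follows redirects by default, and allow_redirects=False exposes the 307 itself:

import requests

# Stop requests from following the redirect so we can inspect it
r = requests.get('http://localhost/', allow_redirects=False)
print(r.status_code, r.headers['Location'])
# 307 /index.html

# Followed automatically; history shows the intermediate 307 hop
r = requests.get('http://localhost/')
print(r.history)
# [<Response [307]>]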
Caching
The first useful caching header is Cache-Control. Its main usage is to indicate how long a file can stay in a client's local cache. This is then supplemented with Last-Modified and/or Etag. So here:
self.send_header('Cache-Control', 'public, max-age=10')
We're telling the client it can cache the file locally without re-verification for 10 seconds. For Last-Modified I set it to the modification time of the file in UTC:
file_date = datetime.datetime.fromtimestamp(os.path.getmtime(filename), tz=datetime.timezone.utc).replace(microsecond=0)
self.send_header('Last-Modified', file_date.strftime(self.HTTP_DT_FORMAT))
The microsecond replacement is needed because HTTP dates only have second granularity, so leftover microseconds cause comparison issues. getmtime returns the modification time of the file as seconds since the epoch. Setting tz= to UTC makes the resulting datetime timezone-aware as a UTC date/time. Now for a standard web browser, once the file has been locally cached for more than 10 seconds it will query the server with If-Modified-Since:
    def check_cache(self, filename):
        if 'If-Modified-Since' in self.headers:
            cache_date = parsedate_to_datetime(self.headers['If-Modified-Since'])
            filename_date = datetime.datetime.fromtimestamp(os.path.getmtime(filename), tz=datetime.timezone.utc).replace(microsecond=0)
            return filename_date <= cache_date
        return False
Now the server will check the value provided and compare it against the modified time of the file. If the file is newer than If-Modified-Since then the server returns the file as usual with a new Last-Modified value:
if self.check_cache(filename):
    self.send_response(304)
    return b''
else:
    self.send_response(200)
Otherwise the server sends a 304 to indicate the file hasn't changed. The Cache-Control max-age timer resets to 0 and the cycle continues. Now the problem is situations where content is dynamically generated. In this case Etag can be used. This value has no exact generation method; as MDN web docs states: "typically, the ETag value is a hash of the content, a hash of the last modification timestamp, or just a revision number":
js_code = b'const a = Math.random();'
etag = hashlib.md5(js_code).hexdigest()
In this case I use the md5 hash. This is sent to the client when it first requests the resource, and the client attaches the etag value to its cache entry. When max-age is up, instead of sending If-Modified-Since the client sends If-None-Match:
if 'If-None-Match' in self.headers and self.headers['If-None-Match'] == etag:
    self.send_response(304)
    return b''
else:
Note that Firefox implements a feature called Race Cache With Network (RCWN). Firefox calculates whether the network is faster than pulling from disk, and if the network is faster it pulls the content anyway regardless of cache settings. This will most likely trigger if you're doing a localhost -> localhost connection or are on a very high speed network. There is currently no server-side way to disable this; it must be done on the browser side instead.
User Agent
This is a rather interesting header that's supposed to indicate what client is currently requesting the content. For example, Chrome may show:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36
The problem is that it's very easy to spoof because at the end of the day it's just a normal header. As an example:
import requests
r = requests.get('http://localhost/js/myjs.js', headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'})
print(r.headers)
print(r.text)
Looking at the server side request log:
127.0.0.1 - - [27/Jul/2023] Reading request from ('127.0.0.1', 44110)
{'Host': 'localhost', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
So even though I'm actually using requests, I can spoof myself as Chrome using the user-agent string from earlier. If you do a lot of JS development, this is why the best practice for browser detection is feature detection rather than the user-agent string.
Cookies
Security Note: As this is an HTTP server used for learning purposes, encryption is not set up. In a real-world setup, cookies should always be sent over HTTPS with strong encryption to help prevent things like session hijacking.
While cookies are technically just another header, I find their functionality is enough to warrant a dedicated section. Cookies are essentially a method for keeping state between HTTP calls. In general operation each HTTP call is distinct from the others. Without cookies to bridge this gap, it would be difficult to maintain state such as a user being authenticated to a service. Cookies start with the server sending one or more Set-Cookie headers. So I'll add another route here:
from http.cookies import SimpleCookie
# <snip>
    ROUTES = {
        '/': 'serve_front_page',
        '/index.html': 'serve_html',
        '/python-logo/': 'serve_python_logo',
        '/js/myjs.js': 'serve_js',
        '/favicon.ico': 'serve_favicon',
        '/cookie-test/': 'serve_cookies'
    }
# <snip>
    def serve_cookies(self):
        self.send_response(200)
        cookies_list = SimpleCookie()
        cookies_list['var1'] = 'test'
        cookies_list['var2'] = 'test2'
        cookies_list['var2']['path'] = '/'
        for morsel in cookies_list.values():
            self.send_header("Set-Cookie", morsel.OutputString())
        return self.serve_html()
This uses the SimpleCookie class to set up our headers. Requests puts these cookies into their own dedicated property:
import gzip
import requests
r = requests.get('http://localhost/cookie-test/')
print(r.headers)
print(dict(r.cookies))
print(gzip.decompress(r.content))
# Output:
# {'Server': 'BaseHTTP/0.6 Python/3.10.6', 'Date': 'Thu, 27 Jul 2023 23:30:07 GMT', 'Set-Cookie': 'var1=test, var2=test2; Path=/'}
# {'var2': 'test2', 'var1': 'test'}
# b'<html><head><title>Old Website</title><script type="text/javascript" src="/js/myjs.js"></script></head><body><img src="/python-logo/" /></body></html>'
Now adding some more routes and adjusting the cookie logic:
    ROUTES = {
        '/': 'serve_front_page',
        '/index.html': 'serve_html',
        '/python-logo/': 'serve_python_logo',
        '/js/myjs.js': 'serve_js',
        '/favicon.ico': 'serve_favicon',
        '/cookie-test/': 'serve_cookies',
        '/cookie-test2/': 'serve_cookies'
    }
# <snip>
    def serve_cookies(self):
        self.send_response(200)
        cookies_list = SimpleCookie()
        cookies_list['var1'] = 'test'
        cookies_list['path_specific'] = 'test2'
        cookies_list['path_specific']['path'] = '/cookie-test/'
        cookies_list['shady_cookie'] = 'test3'
        cookies_list['shady_cookie']['domain'] = 'shadysite.com'
        for morsel in cookies_list.values():
            self.send_header("Set-Cookie", morsel.OutputString())
        return self.serve_html()
When visiting in a browser, as long as I'm in /cookie-test/ and its sub-paths the path_specific cookie will show up. However, if I browse to /cookie-test2/ it won't, as the paths don't match. If we also take a look at the shady_cookie: Chrome refuses to register it, as it's not for the same domain as the host. This is generally known as a third-party cookie. While there are ways to use them properly, in general expect that third-party cookies will be denied by most browsers. This is mostly because third-party cookies often carry advertising/tracking related content. Now once cookies have been set, the browser uses its own logic to figure out which cookies are valid for a request and sends them back in a Cookie header. This can then be used by the server to keep some kind of state:
    def parse_cookies(self):
        if 'Cookie' in self.headers:
            raw_cookies = self.headers['Cookie']
            self.cookies = SimpleCookie()
            self.cookies.load(raw_cookies)
        else:
            self.cookies = None

    def get_cookie(self, key, default=None):
        if not self.cookies:
            return default
        elif key not in self.cookies:
            return default
        else:
            return self.cookies[key].value

    def serve_html(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        title_cookie = self.get_cookie('path_specific', 'Old Website')
        return bytes(f'<html><head><title>{title_cookie}</title><script type="text/javascript" src="/js/myjs.js"></script></head><body><img src="/python-logo/" /></body></html>', encoding='utf-8')
So here we've modified serve_html to use the cookie value as the title. If it doesn't exist then we use the "Old Website" value instead. SimpleCookie also doubles as a cookie parser, letting us reuse it here.
Security Note: Having cookie values inserted directly into HTML is a terrible idea. This was done for simple illustration purposes only.
Now on the client side:
import gzip
import requests
r = requests.get('http://localhost/cookie-test/')
print(r.headers)
print(dict(r.cookies))
print(gzip.decompress(r.content))
r2 = requests.get('http://localhost/cookie-test/', cookies=r.cookies)
print(r2.headers)
print(dict(r2.cookies))
print(gzip.decompress(r2.content))
Which will output:
{'Server': 'BaseHTTP/0.6 Python/3.10.6', 'Date': 'Fri, 28 Jul 2023 02:46:15 GMT', 'Set-Cookie': 'var1=test, path_specific=test2; Path=/cookie-test/, shady_cookie=test3; Domain=shadysite.com'}
{'var1': 'test', 'path_specific': 'test2'}
b'<html><head><title>Old Website</title><script type="text/javascript" src="/js/myjs.js"></script></head><body><img src="/python-logo/" /></body></html>'
{'Server': 'BaseHTTP/0.6 Python/3.10.6', 'Date': 'Fri, 28 Jul 2023 02:46:15 GMT', 'Set-Cookie': 'var1=test, path_specific=test2; Path=/cookie-test/, shady_cookie=test3; Domain=shadysite.com'}
{'var1': 'test', 'path_specific': 'test2'}
b'<html><head><title>test2</title><script type="text/javascript" src="/js/myjs.js"></script></head><body><img src="/python-logo/" /></body></html>'
I'll also note that even requests dropped the third-party shady site cookie:
127.0.0.1 - - [27/Jul/2023] Reading request from ('127.0.0.1', 50316)
{'Host': 'localhost', 'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Cookie': 'path_specific=test2; var1=test'}
Cookies can be removed by setting their expiration in the past, or by setting an expiration in the future and letting that time pass. Here's an example of a cookie that will be removed immediately:
import time
HTTP_DT_FORMAT = '%a, %d %b %Y %H:%M:%S GMT'
INSTANT_EXPIRE = time.strftime(HTTP_DT_FORMAT, time.gmtime(0))
cookies_list['var1'] = 'test'
cookies_list['var1']['expires'] = self.INSTANT_EXPIRE
This sets the expiration to the start of the epoch, or Thu, 01 Jan 1970 00:00:00 GMT. One important thing to note is that this only holds when an implementation handles things to spec. A rogue client could simply choose to send the expired cookies regardless.
Request Types
Since we're done working with headers, I'll take the server code back to a simplified form again:
import datetime
import grp
import gzip
import hashlib
import os
import pwd
from email.utils import parsedate_to_datetime
from http.server import ThreadingHTTPServer, BaseHTTPRequestHandler

# https://stackoverflow.com/a/2699996
def drop_privileges(uid_name='nobody', gid_name='nogroup'):
    if os.getuid() != 0:
        return
    running_uid = pwd.getpwnam(uid_name).pw_uid
    running_gid = grp.getgrnam(gid_name).gr_gid
    os.setgroups([])
    os.setgid(running_gid)
    os.setuid(running_uid)
    old_umask = os.umask(0o007)

class MyHTTPHandler(BaseHTTPRequestHandler):
    ROUTES = {
        '/': 'serve_front_page',
        '/index.html': 'serve_html',
        '/python-logo/': 'serve_python_logo',
        '/js/myjs.js': 'serve_js',
        '/favicon.ico': 'serve_favicon',
    }
    HTTP_DT_FORMAT = '%a, %d %b %Y %H:%M:%S GMT'

    def read_http_request(self):
        self.log_message(f"Reading request from {self.client_address}")
        print(dict(self.headers.items()))

    def serve_front_page(self):
        self.log_message(f"Writing response to {self.client_address}")
        self.send_response(307)
        self.send_header('Location', '/index.html')
        return b''

    def serve_python_logo(self):
        return self.serve_file_with_caching('python-logo-only.png', 'image/png')

    def serve_favicon(self):
        return self.serve_file_with_caching('favicon.ico', 'image/x-icon')

    def serve_html(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        return bytes(f'<html><head><title>Old Website</title><script type="text/javascript" src="/js/myjs.js"></script></head><body><img src="/python-logo/" /></body></html>', encoding='utf-8')

    def serve_js(self):
        js_code = b'const a = Math.random();'
        etag = hashlib.md5(js_code).hexdigest()
        if 'If-None-Match' in self.headers and self.headers['If-None-Match'] == etag:
            self.send_response(304)
            return b''
        else:
            self.send_response(200)
            self.send_header('Etag', etag)
            self.send_header('Content-Type', 'text/javascript')
            self.send_header('Cache-Control', 'public, max-age=10')
            return js_code

    def write_data(self, bytes_data):
        self.send_header('Content-Encoding', 'gzip')
        return_data = gzip.compress(bytes_data)
        self.send_header('Content-Length', len(return_data))
        self.end_headers()
        self.wfile.write(return_data)

    def check_cache(self, filename):
        if 'If-Modified-Since' in self.headers:
            cache_date = parsedate_to_datetime(self.headers['If-Modified-Since'])
            filename_date = datetime.datetime.fromtimestamp(os.path.getmtime(filename), tz=datetime.timezone.utc).replace(microsecond=0)
            return filename_date <= cache_date
        return False

    def serve_file_with_caching(self, filename, file_type):
        self.log_message(f"Writing response to {self.client_address}")
        if self.check_cache(filename):
            self.send_response(304)
            return b''
        else:
            self.send_response(200)
            self.send_header('Content-Type', file_type)
            file_date = datetime.datetime.fromtimestamp(os.path.getmtime(filename), tz=datetime.timezone.utc).replace(microsecond=0)
            self.send_header('Last-Modified', file_date.strftime(self.HTTP_DT_FORMAT))
            self.send_header('Cache-Control', 'public, max-age=10')
            self.send_header('Expires', (datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(0, 10)).strftime(self.HTTP_DT_FORMAT))
            with open(filename, 'rb') as file_fp:
                file_data = file_fp.read()
            return file_data

    def do_GET(self):
        self.read_http_request()
        bytes_data = self.__getattribute__(self.ROUTES[self.path])()
        self.write_data(bytes_data)
        self.request.close()

if __name__ == "__main__":
    HOST, PORT = "localhost", 80
    with ThreadingHTTPServer((HOST, PORT), MyHTTPHandler) as server:
        print(f'Server bound to port {PORT}')
        drop_privileges()
        server.serve_forever()
In the HTTP standard there are various request methods that can be used. I'll be going over three that I would consider the core ones. If you're developing a REST API it's likely you'll utilize more of them.
GET
This is the standard method you will see for a majority of web interaction. It indicates a read-only action to obtain some form of content and should not change the content used by the server. Due to the read-only nature of GET, the contents of the request body are ignored. In order to pass in any kind of parameters, a query string can be used after the path. As the HTTP server pulls in query strings as part of the path, we'll need to parse them out before using the routing dictionary:
from urllib.parse import urlparse, parse_qs
# <snip>
    ROUTES = {
        '/': 'serve_front_page',
        '/index.html': 'serve_html',
        '/python-logo/': 'serve_python_logo',
        '/js/myjs.js': 'serve_js',
        '/favicon.ico': 'serve_favicon',
        '/query_test': 'serve_html'
    }
# <snip>
    def do_GET(self):
        self.read_http_request()
        segments = urlparse(self.path)
        self.query = parse_qs(segments.query)
        self.log_message(f'{self.query}')
        bytes_data = self.__getattribute__(self.ROUTES[segments.path])()
        self.write_data(bytes_data)
        self.request.close()
urlparse allows us to break up the path and query string components. parse_qs will then parse the query string to give us a dictionary value. Note that both of these examples are valid:
# Handled by the code
http://website/query-test?test1=test2&test3=test4
# Valid, but not handled by our code
http://website/query-test/?test1=test2&test3=test4
But I'm only handling the first case on purpose to keep things simple (feature rich web servers can deal with this). We'll update our client to pass in some parameters and see the result:
import requests
r = requests.get('http://localhost/query_test?test1=foo&test2=bar&test3=hello%20world')
print(r.headers)
print(r.content)
Which will give the following output from the server:
127.0.0.1 - - [29/Jul/2023] {'test1': ['foo'], 'test2': ['bar'], 'test3': ['hello world']}
The values are lists because using the same key multiple times in a query string allows for multiple values:
r = requests.get('http://localhost/query_test?test1=foo&test2=bar&test3=hello%20world&test2=baz&test2=nothing')
# 127.0.0.1 - - [29/Jul/2023] {'test1': ['foo'], 'test2': ['bar', 'baz', 'nothing'], 'test3': ['hello world']}
If you only wish to support single values with unique keys, parse_qsl can be used instead:
segments = urlparse(self.path)
# Returns key value pair tuple
self.query = dict(parse_qsl(segments.query))
self.log_message(f'{self.query}')
bytes_data = self.__getattribute__(self.ROUTES[segments.path])()
r = requests.get('http://localhost/query_test?test1=foo&test2=bar&test3=hello%20world')
# 127.0.0.1 - - [29/Jul/2023] {'test1': 'foo', 'test2': 'bar', 'test3': 'hello world'}
r = requests.get('http://localhost/query_test?test1=foo&test2=bar&test3=hello%20world&test2=baz&test2=nothing')
# 127.0.0.1 - - [29/Jul/2023] {'test1': 'foo', 'test2': 'nothing', 'test3': 'hello world'}
As you can see the multiple values version still works but it only takes in the last defined value. Again, another good reason to go with a feature rich web server for practical use.
HEAD
This is essentially the same as a GET request except it returns only the headers. It's useful for things like figuring out if a file exists without downloading the entire thing. That said, even though the response body is blank, the headers still have to be calculated exactly the same as if the file were being downloaded. Server side this isn't too bad for static files, but having to dynamically generate a large amount of data just to push back an empty body is not ideal, so it's something to consider in your method logic. With the base HTTP server, do_HEAD will need to be added with the logic, and the write_data method will need another version to handle headers properly (I'll ignore query string parsing for simplicity here):
    def write_head_data(self, bytes_data):
        self.send_header('Content-Encoding', 'gzip')
        return_data = gzip.compress(bytes_data)
        self.send_header('Content-Length', len(return_data))
        self.end_headers()
        self.wfile.write(b'')

    def do_HEAD(self):
        self.read_http_request()
        bytes_data = self.__getattribute__(self.ROUTES[self.path])()
        self.write_head_data(bytes_data)
        self.request.close()
Now requests will need to call head() instead of get():
import requests

r = requests.head('http://localhost/index.html')
print(r.headers)
print(r.content)
# {'Server': 'BaseHTTP/0.6 Python/3.10.6', 'Date': 'Sat, 29 Jul 2023 18:32:14 GMT', 'Content-Type': 'text/html', 'Content-Encoding': 'gzip', 'Content-Length': '129'}
# b''
# Server Log: 127.0.0.1 - - [29/Jul/2023] "HEAD /index.html HTTP/1.1" 200 -
So Content-Length properly shows the number of bytes that would have come from the compressed HTML, but the response body is empty.
POST
POSTs are meant for cases where data is to be changed on the server side. It's important to note that the presence of an HTML form doesn't guarantee a POST: search functionality may present a form for search parameters while the results come back as a GET query with the parameters in a query string. Because POST lets you declare data in the body, query strings in the URL have little practical use here and should be avoided. The first type of POST is a key/value post encoded as application/x-www-form-urlencoded in the body. First we'll just print out the headers and body to see what it looks like:
    def read_post_request(self):
        self.log_message(f"Reading request from {self.client_address}")
        print(dict(self.headers.items()))
        content_length = int(self.headers['Content-Length'])
        data = self.rfile.read(content_length)
        print(data)

    def serve_post_response(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        return bytes(f'<html><head><title>Old Website</title><script type="text/javascript" src="/js/myjs.js"></script></head><body><img src="/python-logo/" /></body></html>', encoding='utf-8')

    def do_POST(self):
        self.read_post_request()
        bytes_data = self.serve_post_response()
        self.write_data(bytes_data)
        self.request.close()
And the client:
import requests
r = requests.post('http://localhost/', data={'var1': 'test', 'var2': 'test2'})
print(r.headers)
print(r.content)
After running the client we see this on the server side:
127.0.0.1 - - [29/Jul/2023] Reading request from ('127.0.0.1', 35888)
{'Host': 'localhost', 'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '20', 'Content-Type': 'application/x-www-form-urlencoded'}
b'var1=test&var2=test2'
Because the client is sending a body, the Content-Type and Content-Length headers are included. The body can now be parsed on the server side using parse_qsl:
    def read_post_request(self):
        self.log_message(f"Reading request from {self.client_address}")
        print(dict(self.headers.items()))
        content_length = int(self.headers['Content-Length'])
        data = self.rfile.read(content_length)
        self.data = dict(parse_qsl(data.decode('utf-8')))
        print(self.data)
        # Output: {'var1': 'test', 'var2': 'test2'}
As data read from a connection comes in as bytes, it can be turned into a string using decode(). Content-Length is also an interesting predicament security-wise. When doing a read() on a socket, if you attempt to read() more than the client sent, the server can block indefinitely: it assumes more packets may still arrive and the network is simply slow. A malicious attacker could set Content-Length to more bytes than are actually sent, causing a server-side read() to hang. It's important to ensure your connections have timeouts for this case.
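With the http.server classes, one mitigation is the handler's timeout attribute: StreamRequestHandler (which BaseHTTPRequestHandler builds on) applies it to the connection socket during setup, so a stalled read() raises an exception instead of hanging forever. A minimal sketch:

class MyHTTPHandler(BaseHTTPRequestHandler):
    # Applied via self.connection.settimeout() in setup(), so a read()
    # waiting on bytes that never arrive times out after 5 seconds
    timeout = 5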
Now another option is to simply post a format such as JSON. This is so popular with REST APIs that requests even has an option for it:
import requests
r = requests.post('http://localhost/', json={'var1': 'test', 'var2': 'test2'})
print(r.headers)
print(r.content)
Which can then be decoded as JSON on the server side:
import json
# <snip>
    def read_post_request(self):
        self.log_message(f"Reading request from {self.client_address}")
        print(dict(self.headers.items()))
        content_length = int(self.headers['Content-Length'])
        data = self.rfile.read(content_length)
        self.data = json.loads(data)
        print(self.data)
In this case json.loads accepts bytes so we don't need to decode it ourselves. Output-wise it's the same, but the content type has changed to JSON:
{'Host': 'localhost', 'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '33', 'Content-Type': 'application/json'}
{'var1': 'test', 'var2': 'test2'}
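Sending JSON back follows the same pattern as the other handlers. Here's a sketch that swaps serve_post_response to echo the parsed data (write_data still gzips it on the way out):

import json
# <snip>
    def serve_post_response(self):
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        # Echo back what read_post_request parsed, encoded as JSON bytes
        return json.dumps({'received': self.data}).encode('utf-8')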
Now another method is one called a multipart post. This is mainly used for cases where you might be dealing with binary input along with other form fields (generally a file selection input in an HTML form). So to see what this looks like I'll update our client:
import requests

multipart_data = {
    'image_data': ('python_logo.png', open('python-logo-only.png', 'rb'), 'image/png'),
    'field1': (None, 'value1'),
    'field2': (None, 'value2')
}
r = requests.post('http://localhost/', files=multipart_data)
print(r.headers)
print(r.content)
So each multipart_data entry has the field name as its key and a tuple as its value. Actual files have a filename as the first part, a file pointer as the second, and an optional MIME type for the contents. Regular fields simply have None as the filename and the string value as the second part. This all gets passed in via the files= keyword argument of the requests post. Now to check what the server receives out of this:
    def read_post_request(self):
        self.log_message(f"Reading request from {self.client_address}")
        print(dict(self.headers.items()))
        content_length = int(self.headers['Content-Length'])
        self.data = self.rfile.read(content_length)
        print(self.data)
Quite a lot of data comes back from this:
{'Host': 'localhost', 'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '21005', 'Content-Type': 'multipart/form-data; boundary=0cfc2d1479f926612dde676e228fc12c'}
b'--0cfc2d1479f926612dde676e228fc12c\r\nContent-Disposition: form-data; name="image_data"; filename="python_logo.png"\r\nContent-Type: image/png\r\n\r\n\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\r\x00\x00\x01F\x08\x06\x00\x00\x00p\x8d\xca\xa7\x00\x00\x00\tpHYs\x00\x00#\xbf\x00\x00#
<snip lots of binary here>
\r\n--0cfc2d1479f926612dde676e228fc12c\r\nContent-Disposition: form-data; name="field1"\r\n\r\nvalue1\r\n--0cfc2d1479f926612dde676e228fc12c\r\nContent-Disposition: form-data; name="field2"\r\n\r\nvalue2\r\n--0cfc2d1479f926612dde676e228fc12c--\r\n'
So what's happening here is we have something called a boundary, which marks the separation between each field. I cleaned up the output for the last part and it ends up looking like this:
--0cfc2d1479f926612dde676e228fc12c
Content-Disposition: form-data; name="field1"
value1
--0cfc2d1479f926612dde676e228fc12c
Content-Disposition: form-data; name="field2"
value2
--0cfc2d1479f926612dde676e228fc12c--
So as you can see, the boundary= value from the Content-Type header appears with -- before it to indicate a new field on its own line. The very last one has an additional -- at the end to mark the completion of all the fields. Much of this comes from email standards, which used multiparts as a way of indicating file attachments. All of this looks quite tedious to deal with, but thankfully there's a package we can install via pip install multipart which makes it easier to work with:
from multipart import MultipartParser
# <snip>
    def read_post_request(self):
        self.log_message(f"Reading request from {self.client_address}")
        print(dict(self.headers.items()))
        content_length = int(self.headers['Content-Length'])
        content_boundary = self.headers['Content-Type'].split('=')[1]
        self.data = MultipartParser(self.rfile, content_boundary, content_length)
        print(self.data.get('field1').value)
        print(self.data.get('field2').value)
Now after starting the server and running the client again:
{'Host': 'localhost', 'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '21005', 'Content-Type': 'multipart/form-data; boundary=708b331135e8d587fd9a1cced157cf79'}
value1
value2
127.0.0.1 - - [29/Jul/2023] "POST / HTTP/1.1" 200 -
The data is being shown. multipart also gives a handy save_as method for writing the uploaded file to disk:
    def read_post_request(self):
        self.log_message(f"Reading request from {self.client_address}")
        print(dict(self.headers.items()))
        content_length = int(self.headers['Content-Length'])
        content_boundary = self.headers['Content-Type'].split('=')[1]
        self.data = MultipartParser(self.rfile, content_boundary, content_length)
        image_entry = self.data.get('image_data')
        image_entry.save_as(image_entry.filename)
This will write the image to the current directory with the python_logo.png name we gave it in the requests data.
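One caution worth adding: that filename comes straight from the client, so it shouldn't be trusted as-is. A minimal sketch of stripping directory components before saving:

import os
# <snip>
    image_entry = self.data.get('image_data')
    # basename drops any path components, so a name like
    # '../../etc/passwd' can't escape the target directory
    image_entry.save_as(os.path.basename(image_entry.filename))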
Status Codes
Now we look at some of the HTTP status codes. Instead of going through every one, I'll simply cover what the different categories entail.
2xx
These indicate success. Out of all of them, 200 is the one you'll see in the majority of cases.
3xx
These generally deal with redirections. 304 is a bit of an odd one, indicating the contents have not been modified; it's used in coordination with the caching system. 307 can be used to indicate a redirection to another location.
4xx
This is mostly around showing something bad with the request. A few notable codes:
- 400 - Your client request is completely wrong (missing/malformed headers)
- 403 - You're not allowed to view the page
- 404 - It's difficult to find someone who hasn't hit this before. Used to indicate a page doesn't exist (see the sketch after this list for returning one from our server)
- 418 - I'm a teapot. Based on an April Fools' RFC defining the Hyper Text Coffee Pot Control Protocol (RFC 2324)
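As it stands, our demo server raises an unhandled KeyError for unknown paths; http.server provides send_error for returning a proper 404 instead. A sketch of the guard:

    def do_GET(self):
        self.read_http_request()
        if self.path not in self.ROUTES:
            # send_error writes the status line, default headers,
            # and a small HTML error body for us
            self.send_error(404)
            return
        bytes_data = self.__getattribute__(self.ROUTES[self.path])()
        self.write_data(bytes_data)
        self.request.close()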
5xx
These codes all relate to the server being broken. 500 is the generic "this server is broken" response. The other codes provide more specifics about the exact nature of what went wrong.
Conclusion
This concludes our look at the HTTP protocol using Python. It will also be the final installment of this series. I believe HTTP is a sufficient level at which to stop deep diving, as modern abstractions such as user sessions can be reasoned about more quickly by understanding the concepts presented up to now. The networking parts of this guide can also help those in a DevOps role who need to troubleshoot more unusual situations.
If there's one thing I hope you get out of this, it's that despite all the code shown, it's not even a complete HTTP server implementation that properly handles all use cases. Security-wise, communication isn't encrypted, there's no timeout handling, and the header parsing in general could use work. So trying to do it yourself, where you have to keep several use cases in mind and deal with potential malicious actors, is not worth it. Work with your security needs, threat model, and use cases to find a comprehensive server that fits your needs.
Thank you to all the new folks who have followed me over the last few weeks. Look forward to more articles ahead!