So far we've seen what takes place behind servers and networking. The modern web, of course, is more than just a network of echo servers: much of it is powered by HTTP (HyperText Transfer Protocol). This article will discuss some of the inner workings of HTTP using various Python code and modules. For those looking for more resources, I highly recommend the Mozilla Developer Network documentation on anything web related.
- Security Notes
- HTTP Versions
- A URL
- A Basic Request
- Response
- A Better Server
- Headers
- Cookies
- Request Types
- Status Codes
- Conclusion
Security Notes
The code presented here is for learning purposes. Given the complexity of modern-day web services, I highly discourage trying to roll your own web server outside of learning purposes on an isolated network. You should instead evaluate a secure and well maintained web server that meets your needs. Traffic here is also unencrypted, meaning anyone could snoop on the data. So to summarize:
- Don't use this code in production
- Always make sure your network communication is encrypted and that the encryption method is not outdated / insecure
HTTP Versions
The HTTP protocol has seen a number of revisions over the years. Version 1.0 was released in 1996 as RFC 1945. It was followed in 1999 by HTTP/1.1, which added a number of features that are widely used on the modern web.
Currently HTTP/2 is considered the modern standard. Many of its features helped work out performance issues with the way modern web applications behave. HTTP/3 is the newest standard, built on QUIC, a UDP-based transport protocol. In particular it looks to reduce the round trips spent negotiating secure connections.
Taking support into consideration, this article will cover standards set by HTTP/1.1.
A URL
URL stands for Uniform Resource Locator and is a subset of URI, or Uniform Resource Identifier. The specifics of URLs are defined in RFC 1738. Despite how it may seem, URLs are not only for reaching HTTP servers, though that's certainly one of the more popular use cases. The scheme section allows them to work with other services as well, such as FTP and Gopher. The schemes supported at the time can be found in the RFC; the IANA keeps a more up-to-date and extensive list. Python offers the urllib module which can be used to work with URLs:
from urllib.parse import urlparse
URL = 'https://datatracker.ietf.org/doc/html/rfc1738#section-3'
print(urlparse(URL))
This gives the output:
ParseResult(scheme='https', netloc='datatracker.ietf.org', path='/doc/html/rfc1738', params='', query='', fragment='section-3')
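Since ParseResult is a named tuple, components can also be swapped out and the URL rebuilt, which is handy for constructing variations of a URL. A quick sketch:

from urllib.parse import urlparse

URL = 'https://datatracker.ietf.org/doc/html/rfc1738#section-3'
parsed_url = urlparse(URL)
# _replace returns a copy with the given field changed; geturl() reassembles it
print(parsed_url._replace(fragment='section-5').geturl())
# Output:
# https://datatracker.ietf.org/doc/html/rfc1738#section-5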
With a more complex example:
from urllib.parse import urlparse
URL = 'https://user:password@domain.com:7777/'
parsed_url = urlparse(URL)
print(parsed_url.hostname)
print(parsed_url.username)
print(parsed_url.password)
print(parsed_url.port)
# Output:
# domain.com
# user
# password
# 7777
There are some cases where a URL path contains characters outside the safe set, such as a space character. To deal with such cases the values can be URL encoded. This is done by taking the hex value of the character's byte (including the extended ASCII table) and adding a % in front of it. urllib.parse.quote is able to handle such encoding:
from urllib.parse import quote
print(quote('/path with spaces'))
# Output:
# /path%20with%20spaces
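The reverse operation is handled by urllib.parse.unquote:

from urllib.parse import unquote

print(unquote('/path%20with%20spaces'))
# Output:
# /path with spaces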
A Basic Request
To look at the request process I'll use a basic socket server on port 80 and the requests library, which can be installed with pip install requests:
import socket, os, pwd, grp
import socketserver

# https://stackoverflow.com/a/2699996
def drop_privileges(uid_name='nobody', gid_name='nogroup'):
    if os.getuid() != 0:
        # We're not root so, like, whatever dude
        return
    # Get the uid/gid from the name
    running_uid = pwd.getpwnam(uid_name).pw_uid
    running_gid = grp.getgrnam(gid_name).gr_gid
    # Remove group privileges
    os.setgroups([])
    # Try setting the new uid/gid
    os.setgid(running_gid)
    os.setuid(running_uid)
    # owner/group r+w+x
    old_umask = os.umask(0o007)

class MyTCPHandler(socketserver.StreamRequestHandler):
    """
    The request handler class for our server.

    It is instantiated once per connection to the server, and must
    override the handle() method to implement communication to the
    client.
    """
    def handle(self):
        self.data = self.rfile.readline()
        print(self.data)

if __name__ == "__main__":
    HOST, PORT = "localhost", 80

    # Create the server, binding to localhost on port 80
    with socketserver.TCPServer((HOST, PORT), MyTCPHandler) as server:
        # Activate the server; this will keep running until you
        # interrupt the program with Ctrl-C
        print(f'Server bound to port {PORT}')
        drop_privileges()
        server.serve_forever()
and the client:
import requests
requests.get('http://localhost/')
Now this is an incomplete exchange on purpose so I can show things line by line. This means simple_client.py will error out with "connection reset by peer" as it's expecting an HTTP response. On the server side we see:
Server bound to port 80
b'GET / HTTP/1.1\r\n'
The first line is what the HTTP RFC calls the Request Line. First comes the method, followed by the request target, the HTTP version to use, and finally a CRLF (Carriage Return '\r' Line Feed '\n'). So the GET method is being used on the path / while requesting HTTP/1.1 as the version.
Now you'll notice the request didn't have to declare that it was using port 80. That's because IANA defines port 80 as the HTTP service port, so implementations know to use it by default (or 443 for HTTPS).
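As a quick sanity check, Python can look up these registered service ports through the OS service database (assuming your system's /etc/services lists them, as most do):

import socket

# Look up the registered default ports for HTTP and HTTPS
print(socket.getservbyname('http'))
print(socket.getservbyname('https'))
# Output:
# 80
# 443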
Response
Next I'll read in the rest of the lines for the HTTP request:
    def handle(self):
        self.data = self.rfile.readlines()
        print(self.data)
Server bound to port 80
[b'GET / HTTP/1.1\r\n', b'Host: localhost\r\n', b'User-Agent: python-requests/2.22.0\r\n', b'Accept-Encoding: gzip, deflate\r\n', b'Accept: */*\r\n', b'Connection: keep-alive\r\n', b'\r\n']
Now to clean up the output a bit:
GET / HTTP/1.1
Host: localhost
User-Agent: python-requests/2.22.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
After the request line we see a number of key/value pairs separated by a colon. This is something we'll get to shortly, but for now I'll pass back data to finish the connection. While I'm here I'll also re-organize the handler class to make it easier to follow:
class MyTCPHandler(socketserver.BaseRequestHandler):
    """
    The request handler class for our server.

    It is instantiated once per connection to the server, and must
    override the handle() method to implement communication to the
    client.
    """
    def read_http_request(self):
        print("reading request")
        self.data = self.request.recv(8192)
        print(self.data)

    def write_http_response(self):
        print("writing response")
        response_lines = [
            b'HTTP/1.1 200\r\n',
            b'Content-Type: text/plain\r\n',
            b'Content-Length: 12\r\n',
            b'Location: http://localhost/\r\n',
            b'\r\n',
            b'Hello World\n'
        ]
        for response_line in response_lines:
            self.request.send(response_line)
        print("response sent")

    def handle(self):
        self.read_http_request()
        self.write_http_response()
        self.request.close()
        print("connection closed")
and the client is also slightly modified:
import requests
r = requests.get('http://localhost/')
print(r.headers)
print(r.text)
So the handler class has been changed back to socketserver.BaseRequestHandler as I don't need single-line reads anymore. I'm also writing back a static response for now. Finally, the handle() method gives a nice overview of the different steps. Now as an example:
Server:
Server bound to port 80
reading request
b'GET / HTTP/1.1\r\nHost: localhost\r\nUser-Agent: python-requests/2.22.0\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
writing response
response sent
connection closed
Client:
{'Content-Type': 'text/plain', 'Content-Length': '12', 'Location': 'http://localhost/'}
Hello World
As with requests, responses have their own status line, which in this case is:
HTTP/1.1 200\r\n
First is the HTTP version, confirming the server can communicate in that version. Next is a status code indicating the nature of the response; I'll touch on some of the status codes later in the article. In this case, 200 confirms the request was valid and everything went okay. With an initial response working, it's time to look at things in a bit more depth.
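As an aside, here's a minimal sketch of how a client might split a status line apart. Note the reason phrase (such as OK) is optional, which is why our bare HTTP/1.1 200 is still valid:

# A minimal sketch of splitting a response status line
status_line = b'HTTP/1.1 200 OK\r\n'
parts = status_line.rstrip(b'\r\n').split(b' ', 2)
version, status_code = parts[0], int(parts[1])
# The reason phrase may be absent entirely
reason = parts[2] if len(parts) > 2 else b''
print(version, status_code, reason)
# Output:
# b'HTTP/1.1' 200 b'OK'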
A Better Server
Now that we've seen the raw elements of an HTTP request, it's time to abstract out a bit. Python has an http.server module built on top of socketserver, with various components to facilitate serving HTTP traffic. So now our server looks something like this:
import os, pwd, grp
from http.server import ThreadingHTTPServer, BaseHTTPRequestHandler

# https://stackoverflow.com/a/2699996
def drop_privileges(uid_name='nobody', gid_name='nogroup'):
    if os.getuid() != 0:
        return
    running_uid = pwd.getpwnam(uid_name).pw_uid
    running_gid = grp.getgrnam(gid_name).gr_gid
    os.setgroups([])
    os.setgid(running_gid)
    os.setuid(running_uid)
    old_umask = os.umask(0o007)

class MyHTTPHandler(BaseHTTPRequestHandler):
    def read_http_request(self):
        self.log_message(f"Reading request from {self.client_address}")
        print(dict(self.headers.items()))

    def write_http_response(self):
        self.log_message(f"Writing response to {self.client_address}")
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'Hello World\n')

    def do_GET(self):
        self.read_http_request()
        self.write_http_response()
        self.request.close()

if __name__ == "__main__":
    HOST, PORT = "localhost", 80
    with ThreadingHTTPServer((HOST, PORT), MyHTTPHandler) as server:
        print(f'Server bound to port {PORT}')
        drop_privileges()
        server.serve_forever()
Now the server is doing some of the heavy lifting for us. It has information about the request line and sends sensible default headers in the response. Speaking of which, let's look at headers.
Headers
So after doing another run of the new server:
{'Host': 'localhost', 'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
We see the headers sent in the request to us. Now the thing with headers is that, with the exception of Host (required by the HTTP/1.1 standard), the others are not something the server strictly has to concern itself with. Host is required because a single IP can host multiple domains (in fact this becomes more common with CDNs). So if I add an /etc/hosts entry like so:
127.0.0.1 webserver
Then I could make the following change:
    def write_http_response(self):
        self.log_message(f"Writing response to {self.client_address}")
        self.send_response(200)
        self.end_headers()
        self.wfile.write(bytes(f'Hello {self.headers["Host"]}\n', 'utf-8'))
And as an example:
{'Server': 'BaseHTTP/0.6 Python/3.10.6', 'Date': 'Tue, 25 Jul 2023 02:57:29 GMT'}
Hello localhost
{'Server': 'BaseHTTP/0.6 Python/3.10.6', 'Date': 'Tue, 25 Jul 2023 02:57:29 GMT'}
Hello webserver
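For reference, a client along these lines produces that output, assuming the /etc/hosts entry above is in place:

import requests

# Both names resolve to 127.0.0.1; only the Host header differs
for host in ('localhost', 'webserver'):
    r = requests.get(f'http://{host}/')
    print(r.headers)
    print(r.text)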
Despite connecting via the same IP address I'm getting different content back. As for the rest of the headers, it's a pretty long list, so I'll go over some of the fundamental ones (some of the more specific ones come up in later sections).
Type Headers
These headers have to do with the type of content that's being delivered. Request versions ask for certain types of content, and response versions give metadata about the content. Accept is one of the more important ones and is related to the type of content indicated by MIME (Multipurpose Internet Mail Extensions). MIME indicates the type of a file and was originally created to describe non-textual content within the normally textual format of email. This helps in differentiating between, say, parsing HTML and showing an image. Not surprisingly, IANA manages the official list of MIME types. Python has a mimetypes module which maps file extensions to the system's MIME type database:
import mimetypes
mimetypes.init()
print(mimetypes.guess_all_extensions('text/plain'))
print(mimetypes.types_map['.html'])
# Output
# ['.txt', '.bat', '.c', '.h', '.ksh', '.pl', '.csh', '.jsx', '.sln', '.slnf', '.srf', '.ts']
# text/html
Now this of course assumes that a file with a certain extension actually is that type of file. Realistically a malicious actor could simply rename their malware to .jpg or similar, so it's not a very good source of truth if you can't completely trust the intentions of your users. Instead we can use python-magic, which identifies files by their contents. So after doing a pip install python-magic (Windows users will need to install python-magic-bin instead, which includes the required DLLs):
import magic
f = magic.Magic(mime=True)
print(f.from_file('/bin/bash'))
# Output application/x-sharedlib
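python-magic can also inspect bytes already in memory via from_buffer, which is handy for checking uploaded data before it ever touches disk (the exact output may vary with your libmagic version):

import magic

f = magic.Magic(mime=True)
# from_buffer inspects a byte string instead of a file on disk
print(f.from_buffer(b'<html><body>hi</body></html>'))
# Output (on my system):
# text/html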
Some content types you'll likely deal with:
- text/plain: Plain text
- text/html: HTML
- application/json: JSON
Mozilla also has a more extensive list. Now, looking at the request headers, we see:
Accept: */*
Accept-Encoding: gzip, deflate
For basic transactions, Accept: */* is fairly standard if the client isn't sure what the server will respond with. It essentially says the client has no preference on the MIME type the server returns. A more complex example would be:
Accept: text/html, application/xhtml+xml, application/xml;q=0.9, image/webp, */*;q=0.8
This will accept most HTML formats. There's also a ;q=[number] which indicates the preference for that MIME type. If there's no preference indicated then everything is weighted the same and the most specific type will be selected.
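To make the ranking concrete, here's a rough sketch of how a server might order these preferences; a real parser also needs specificity tie-breaking and guards against malformed input:

# Rank an Accept header's MIME types by their q values (default 1.0)
accept = 'text/html, application/xhtml+xml, application/xml;q=0.9, image/webp, */*;q=0.8'

def parse_accept(header):
    ranked = []
    for entry in header.split(','):
        mime, _, param = entry.strip().partition(';')
        q = float(param.partition('=')[2]) if param.startswith('q=') else 1.0
        ranked.append((q, mime))
    return sorted(ranked, reverse=True)

print(parse_accept(accept))
# Output:
# [(1.0, 'text/html'), (1.0, 'image/webp'), (1.0, 'application/xhtml+xml'), (0.9, 'application/xml'), (0.8, '*/*')]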
The server version of this is Content-Type, which indicates the type of content the server will send. Now if you decide to mangle the type to something it's not:
    def write_http_response(self):
        self.log_message(f"Writing response to {self.client_address}")
        self.send_response(200)
        self.send_header('Content-Type', 'image/jpeg')
        self.end_headers()
        self.wfile.write(bytes(f'Hello {self.headers["Host"]}\n', 'utf-8'))
The requests-based client (or really things like curl and wget, which only download) won't care, as it doesn't render the image. Actual browsers on the other hand will throw an error or show a placeholder broken image.
Accept-Encoding indicates that the client supports compressed data being returned. The server specs recommend compression when possible to reduce the amount of data transferred. As it's not uncommon for transfer volume to be a pricing metric, it can also help reduce cost. A server can send Content-Encoding back to indicate it's sending compressed data:
    def write_http_response(self):
        self.log_message(f"Writing response to {self.client_address}")
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.send_header('Content-Encoding', 'gzip')
        self.end_headers()
        return_data = gzip.compress(bytes(f'Hello {self.headers["Host"]}\n', encoding='utf-8'))
        self.wfile.write(return_data)
Requests handles compressed data out of the box, so no changes are needed there, and a run shows that the compression works:
{'Server': 'BaseHTTP/0.6 Python/3.10.6', 'Date': 'Wed, 26 Jul 2023 23:27:53 GMT', 'Content-Type': 'text/plain', 'Content-Encoding': 'gzip'}
Hello localhost
Enhancing The Server Even More
Now to look at some of the other header options I'll update the HTTP server:
import datetime
import grp
import gzip
import hashlib
import os
import pwd
from email.utils import parsedate_to_datetime
from http.server import ThreadingHTTPServer, BaseHTTPRequestHandler

# https://stackoverflow.com/a/2699996
def drop_privileges(uid_name='nobody', gid_name='nogroup'):
    if os.getuid() != 0:
        return
    running_uid = pwd.getpwnam(uid_name).pw_uid
    running_gid = grp.getgrnam(gid_name).gr_gid
    os.setgroups([])
    os.setgid(running_gid)
    os.setuid(running_uid)
    old_umask = os.umask(0o007)

class MyHTTPHandler(BaseHTTPRequestHandler):
    ROUTES = {
        '/': 'serve_front_page',
        '/index.html': 'serve_html',
        '/python-logo/': 'serve_python_logo',
        '/js/myjs.js': 'serve_js',
        '/favicon.ico': 'serve_favicon'
    }
    HTTP_DT_FORMAT = '%a, %d %b %Y %H:%M:%S GMT'

    def read_http_request(self):
        self.log_message(f"Reading request from {self.client_address}")
        print(dict(self.headers.items()))

    def serve_front_page(self):
        self.log_message(f"Writing response to {self.client_address}")
        self.send_response(307)
        self.send_header('Location', '/index.html')
        return b''

    def serve_python_logo(self):
        return self.serve_file_with_caching('python-logo-only.png', 'image/png')

    def serve_favicon(self):
        return self.serve_file_with_caching('favicon.ico', 'image/x-icon')

    def serve_html(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        return b'<html><head><title>Old Website</title><script type="text/javascript" src="/js/myjs.js"></script></head><body><img src="/python-logo/" /></body></html>'

    def serve_js(self):
        js_code = b'const a = Math.random();'
        etag = hashlib.md5(js_code).hexdigest()
        if 'If-None-Match' in self.headers and self.headers['If-None-Match'] == etag:
            self.send_response(304)
            return b''
        else:
            self.send_response(200)
            self.send_header('Etag', etag)
            self.send_header('Content-Type', 'text/javascript')
            self.send_header('Cache-Control', 'public, max-age=10')
            return js_code

    def write_data(self, bytes_data):
        self.send_header('Content-Encoding', 'gzip')
        return_data = gzip.compress(bytes_data)
        self.send_header('Content-Length', len(return_data))
        self.end_headers()
        self.wfile.write(return_data)

    def check_cache(self, filename):
        if 'If-Modified-Since' in self.headers:
            cache_date = parsedate_to_datetime(self.headers['If-Modified-Since'])
            filename_date = datetime.datetime.fromtimestamp(os.path.getmtime(filename), tz=datetime.timezone.utc).replace(microsecond=0)
            return filename_date <= cache_date
        return False

    def serve_file_with_caching(self, filename, file_type):
        self.log_message(f"Writing response to {self.client_address}")
        if self.check_cache(filename):
            self.send_response(304)
            return b''
        else:
            self.send_response(200)
            self.send_header('Content-Type', file_type)
            file_date = datetime.datetime.fromtimestamp(os.path.getmtime(filename), tz=datetime.timezone.utc).replace(microsecond=0)
            self.send_header('Last-Modified', file_date.strftime(self.HTTP_DT_FORMAT))
            self.send_header('Cache-Control', 'public, max-age=10')
            self.send_header('Expires', (datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(0, 10)).strftime(self.HTTP_DT_FORMAT))
            with open(filename, 'rb') as file_fp:
                file_data = file_fp.read()
            return file_data

    def do_GET(self):
        self.read_http_request()
        bytes_data = self.__getattribute__(self.ROUTES[self.path])()
        self.write_data(bytes_data)
        self.request.close()

if __name__ == "__main__":
    HOST, PORT = "localhost", 80
    with ThreadingHTTPServer((HOST, PORT), MyHTTPHandler) as server:
        print(f'Server bound to port {PORT}')
        drop_privileges()
        server.serve_forever()
This will require two files in the same directory as the server:
- https://www.python.org/favicon.ico
- https://www.python.org/community/logos/ (the 269 × 326 PNG of just the "two snakes" logo)
ROUTES allows the handler to act as a simple router, mapping paths to methods in the handler class. The data write method is also abstracted out to gzip compress the data every time. Another method deals with caching logic. I'll go over the parts with respect to header logic.
Redirection
The "Location" header can be utilized with a few status codes to indicate the location of a file that needs to be redirected to. Looking here:
    def serve_front_page(self):
        self.log_message(f"Writing response to {self.client_address}")
        self.send_response(307)
        self.send_header('Location', '/index.html')
        return b''
This will redirect the user to the /index.html page. Note that some CLI-based HTTP clients require an additional option to actually follow the redirect (curl, for example, needs -L). Web browsers on the other hand handle this seamlessly.
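With requests you can observe the redirect both ways; it follows redirects by default, and allow_redirects=False exposes the 307 itself:

import requests

# Stop requests from following the redirect so we can inspect it
r = requests.get('http://localhost/', allow_redirects=False)
print(r.status_code, r.headers['Location'])
# 307 /index.html

# Followed automatically; history shows the intermediate 307 hop
r = requests.get('http://localhost/')
print(r.history)
# [<Response [307]>]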
Caching
The first useful caching header is Cache-Control. Its main usage is to indicate how long a file can stay in a client's local cache. This is then supplemented with Last-Modified and/or Etag. So here:
self.send_header('Cache-Control', 'public, max-age=10')
We're telling the client it can cache the file locally without re-verification for 10 seconds. For Last-Modified I set it to the modification time of the file in UTC:
file_date = datetime.datetime.fromtimestamp(os.path.getmtime(filename), tz=datetime.timezone.utc).replace(microsecond=0)
self.send_header('Last-Modified', file_date.strftime(self.HTTP_DT_FORMAT))
The microsecond replacement is needed because HTTP dates only have second granularity, so leftover microseconds cause comparison issues. getmtime returns the modification time of the file as seconds since the epoch. Setting tz= to UTC makes the resulting datetime timezone-aware as a UTC date/time. Now for a standard web browser, once the file has been locally cached for more than 10 seconds it will query the server with If-Modified-Since:
    def check_cache(self, filename):
        if 'If-Modified-Since' in self.headers:
            cache_date = parsedate_to_datetime(self.headers['If-Modified-Since'])
            filename_date = datetime.datetime.fromtimestamp(os.path.getmtime(filename), tz=datetime.timezone.utc).replace(microsecond=0)
            return filename_date <= cache_date
        return False
Now the server will check the value provided and compare it against the modified time of the file. If the file is newer than If-Modified-Since then the server returns the file as usual with a new Last-Modified value:
if self.check_cache(filename):
    self.send_response(304)
    return b''
else:
    self.send_response(200)
Otherwise the server sends a 304 to indicate the file hasn't changed. The Cache-Control max-age timer resets to 0 and the cycle continues. Now the problem is situations where content is dynamically generated. In this case Etag can be used. This value has no exact generation method; as MDN web docs states: "typically, the ETag value is a hash of the content, a hash of the last modification timestamp, or just a revision number":
js_code = b'const a = Math.random();'
etag = hashlib.md5(js_code).hexdigest()
In this case I use the md5 hash. This is sent to the client when it first requests the resource, and the client attaches the etag value to its cache entry. When max-age is up, instead of sending If-Modified-Since the client sends If-None-Match:
if 'If-None-Match' in self.headers and self.headers['If-None-Match'] == etag:
    self.send_response(304)
    return b''
else:
Note that Firefox implements a feature called Race Cache With Network (RCWN). Firefox calculates whether the network is faster than pulling from disk, and if the network is faster it pulls the content anyway regardless of cache settings. This will most likely trigger if you're doing a localhost -> localhost connection or are on a very high speed network. There is currently no server-side way to disable this; it must be done on the browser side instead.
User Agent
This is a rather interesting header that's supposed to indicate what client is currently requesting the content. For example, Chrome may show:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36
The problem is that it's very easy to spoof because at the end of the day it's just a normal header. As an example:
import requests
r = requests.get('http://localhost/js/myjs.js', headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'})
print(r.headers)
print(r.text)
Looking at the server side request log:
127.0.0.1 - - [27/Jul/2023] Reading request from ('127.0.0.1', 44110)
{'Host': 'localhost', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
So even though I'm actually using requests, I can spoof myself as Chrome using the user-agent string from earlier. If you do a lot of JS development, this is why the best practice for browser detection is feature detection rather than the user-agent string.
Cookies
Security Note: As this is an HTTP server used for learning purposes, encryption is not set up. In a real-world setup, cookies should always be sent over HTTPS with strong encryption to help prevent things like session hijacking.
While cookies are technically just another header, I find their functionality is enough to warrant a dedicated section. Cookies are essentially a method for keeping state between HTTP calls. In general operation each HTTP call is distinct from the others. Without cookies to bridge this gap, it would be difficult to maintain state such as a user being authenticated to a service. Cookies start with the server sending one or more Set-Cookie headers. So I'll add another route here:
from http.cookies import SimpleCookie
# <snip>
    ROUTES = {
        '/': 'serve_front_page',
        '/index.html': 'serve_html',
        '/python-logo/': 'serve_python_logo',
        '/js/myjs.js': 'serve_js',
        '/favicon.ico': 'serve_favicon',
        '/cookie-test/': 'serve_cookies'
    }
# <snip>
    def serve_cookies(self):
        self.send_response(200)
        cookies_list = SimpleCookie()
        cookies_list['var1'] = 'test'
        cookies_list['var2'] = 'test2'
        cookies_list['var2']['path'] = '/'
        for morsel in cookies_list.values():
            self.send_header("Set-Cookie", morsel.OutputString())
        return self.serve_html()
This uses the SimpleCookie class to set up our headers. Requests puts these cookies into their own dedicated property:
import gzip
import requests
r = requests.get('http://localhost/cookie-test/')
print(r.headers)
print(dict(r.cookies))
print(gzip.decompress(r.content))
# Output:
# {'Server': 'BaseHTTP/0.6 Python/3.10.6', 'Date': 'Thu, 27 Jul 2023 23:30:07 GMT', 'Set-Cookie': 'var1=test, var2=test2; Path=/'}
# {'var2': 'test2', 'var1': 'test'}
# b'<html><head><title>Old Website</title><script type="text/javascript" src="/js/myjs.js"></script></head><body><img src="/python-logo/" /></body></html>'
Now adding some more routes and adjusting the cookie logic:
    ROUTES = {
        '/': 'serve_front_page',
        '/index.html': 'serve_html',
        '/python-logo/': 'serve_python_logo',
        '/js/myjs.js': 'serve_js',
        '/favicon.ico': 'serve_favicon',
        '/cookie-test/': 'serve_cookies',
        '/cookie-test2/': 'serve_cookies'
    }
# <snip>
    def serve_cookies(self):
        self.send_response(200)
        cookies_list = SimpleCookie()
        cookies_list['var1'] = 'test'
        cookies_list['path_specific'] = 'test2'
        cookies_list['path_specific']['path'] = '/cookie-test/'
        cookies_list['shady_cookie'] = 'test3'
        cookies_list['shady_cookie']['domain'] = 'shadysite.com'
        for morsel in cookies_list.values():
            self.send_header("Set-Cookie", morsel.OutputString())
        return self.serve_html()
When visiting in a browser, as long as I'm in /cookie-test/ and its sub-paths the path_specific cookie will show up. However, if I browse to /cookie-test2/ it won't, as the paths don't match. If we also take a look at the shady_cookie: Chrome refuses to register it, as it's not for the same domain as the host. This is generally known as a third-party cookie. While there are ways to use them properly, in general expect that third-party cookies will be denied by most browsers. This is mostly because third-party cookies often carry advertising/tracking related content. Now once cookies have been set, the browser uses its own logic to figure out which cookies are valid for a request and sends them back in a Cookie header. This can then be used by the server to keep some kind of state:
    def parse_cookies(self):
        if 'Cookie' in self.headers:
            raw_cookies = self.headers['Cookie']
            self.cookies = SimpleCookie()
            self.cookies.load(raw_cookies)
        else:
            self.cookies = None

    def get_cookie(self, key, default=None):
        if not self.cookies:
            return default
        elif key not in self.cookies:
            return default
        else:
            return self.cookies[key].value

    def serve_html(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        title_cookie = self.get_cookie('path_specific', 'Old Website')
        return bytes(f'<html><head><title>{title_cookie}</title><script type="text/javascript" src="/js/myjs.js"></script></head><body><img src="/python-logo/" /></body></html>', encoding='utf-8')
So here we've modified serve_html to use the cookie value as the title. If it doesn't exist then we use the "Old Website" value instead. SimpleCookie also doubles as a cookie parser, letting us reuse it here.
Security Note: Having cookie values inserted directly into HTML is a terrible idea. This was done for simple illustration purposes only.
Now on the client side:
import gzip
import requests
r = requests.get('http://localhost/cookie-test/')
print(r.headers)
print(dict(r.cookies))
print(gzip.decompress(r.content))
r2 = requests.get('http://localhost/cookie-test/', cookies=r.cookies)
print(r2.headers)
print(dict(r2.cookies))
print(gzip.decompress(r2.content))
Which will output:
{'Server': 'BaseHTTP/0.6 Python/3.10.6', 'Date': 'Fri, 28 Jul 2023 02:46:15 GMT', 'Set-Cookie': 'var1=test, path_specific=test2; Path=/cookie-test/, shady_cookie=test3; Domain=shadysite.com'}
{'var1': 'test', 'path_specific': 'test2'}
b'<html><head><title>Old Website</title><script type="text/javascript" src="/js/myjs.js"></script></head><body><img src="/python-logo/" /></body></html>'
{'Server': 'BaseHTTP/0.6 Python/3.10.6', 'Date': 'Fri, 28 Jul 2023 02:46:15 GMT', 'Set-Cookie': 'var1=test, path_specific=test2; Path=/cookie-test/, shady_cookie=test3; Domain=shadysite.com'}
{'var1': 'test', 'path_specific': 'test2'}
b'<html><head><title>test2</title><script type="text/javascript" src="/js/myjs.js"></script></head><body><img src="/python-logo/" /></body></html>'
I'll also note that even requests dropped the third-party shady site cookie:
127.0.0.1 - - [27/Jul/2023] Reading request from ('127.0.0.1', 50316)
{'Host': 'localhost', 'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Cookie': 'path_specific=test2; var1=test'}
Cookies can be removed by setting their expiration in the past, or by setting an expiration in the future and letting that time pass. Here's an example of a cookie that will be removed immediately:
import time
HTTP_DT_FORMAT = '%a, %d %b %Y %H:%M:%S GMT'
INSTANT_EXPIRE = time.strftime(HTTP_DT_FORMAT, time.gmtime(0))
cookies_list['var1'] = 'test'
cookies_list['var1']['expires'] = self.INSTANT_EXPIRE
This sets the expiration to the start of the epoch, or Thu, 01 Jan 1970 00:00:00 GMT. One important thing to note is that this only holds when an implementation handles things to spec. A rogue client could simply choose to send the expired cookies regardless.
Request Types
Since we're done working with headers, I'll take the server code back to a simplified form again:
import datetime
import grp
import gzip
import hashlib
import os
import pwd
from email.utils import parsedate_to_datetime
from http.server import ThreadingHTTPServer, BaseHTTPRequestHandler

# https://stackoverflow.com/a/2699996
def drop_privileges(uid_name='nobody', gid_name='nogroup'):
    if os.getuid() != 0:
        return
    running_uid = pwd.getpwnam(uid_name).pw_uid
    running_gid = grp.getgrnam(gid_name).gr_gid
    os.setgroups([])
    os.setgid(running_gid)
    os.setuid(running_uid)
    old_umask = os.umask(0o007)

class MyHTTPHandler(BaseHTTPRequestHandler):
    ROUTES = {
        '/': 'serve_front_page',
        '/index.html': 'serve_html',
        '/python-logo/': 'serve_python_logo',
        '/js/myjs.js': 'serve_js',
        '/favicon.ico': 'serve_favicon',
    }
    HTTP_DT_FORMAT = '%a, %d %b %Y %H:%M:%S GMT'

    def read_http_request(self):
        self.log_message(f"Reading request from {self.client_address}")
        print(dict(self.headers.items()))

    def serve_front_page(self):
        self.log_message(f"Writing response to {self.client_address}")
        self.send_response(307)
        self.send_header('Location', '/index.html')
        return b''

    def serve_python_logo(self):
        return self.serve_file_with_caching('python-logo-only.png', 'image/png')

    def serve_favicon(self):
        return self.serve_file_with_caching('favicon.ico', 'image/x-icon')

    def serve_html(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        return bytes(f'<html><head><title>Old Website</title><script type="text/javascript" src="/js/myjs.js"></script></head><body><img src="/python-logo/" /></body></html>', encoding='utf-8')

    def serve_js(self):
        js_code = b'const a = Math.random();'
        etag = hashlib.md5(js_code).hexdigest()
        if 'If-None-Match' in self.headers and self.headers['If-None-Match'] == etag:
            self.send_response(304)
            return b''
        else:
            self.send_response(200)
            self.send_header('Etag', etag)
            self.send_header('Content-Type', 'text/javascript')
            self.send_header('Cache-Control', 'public, max-age=10')
            return js_code

    def write_data(self, bytes_data):
        self.send_header('Content-Encoding', 'gzip')
        return_data = gzip.compress(bytes_data)
        self.send_header('Content-Length', len(return_data))
        self.end_headers()
        self.wfile.write(return_data)

    def check_cache(self, filename):
        if 'If-Modified-Since' in self.headers:
            cache_date = parsedate_to_datetime(self.headers['If-Modified-Since'])
            filename_date = datetime.datetime.fromtimestamp(os.path.getmtime(filename), tz=datetime.timezone.utc).replace(microsecond=0)
            return filename_date <= cache_date
        return False

    def serve_file_with_caching(self, filename, file_type):
        self.log_message(f"Writing response to {self.client_address}")
        if self.check_cache(filename):
            self.send_response(304)
            return b''
        else:
            self.send_response(200)
            self.send_header('Content-Type', file_type)
            file_date = datetime.datetime.fromtimestamp(os.path.getmtime(filename), tz=datetime.timezone.utc).replace(microsecond=0)
            self.send_header('Last-Modified', file_date.strftime(self.HTTP_DT_FORMAT))
            self.send_header('Cache-Control', 'public, max-age=10')
            self.send_header('Expires', (datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(0, 10)).strftime(self.HTTP_DT_FORMAT))
            with open(filename, 'rb') as file_fp:
                file_data = file_fp.read()
            return file_data

    def do_GET(self):
        self.read_http_request()
        bytes_data = self.__getattribute__(self.ROUTES[self.path])()
        self.write_data(bytes_data)
        self.request.close()

if __name__ == "__main__":
    HOST, PORT = "localhost", 80
    with ThreadingHTTPServer((HOST, PORT), MyHTTPHandler) as server:
        print(f'Server bound to port {PORT}')
        drop_privileges()
        server.serve_forever()
In the HTTP standard there are various request methods that can be used. I'll be going over three that I would consider the core ones. If you're developing a REST API it's likely you'll utilize more of them.
GET
This is the standard method you will see for a majority of web interaction. It indicates a read-only action to obtain some form of content and should not change the content used by the server. Due to the read-only nature of GET, the contents of the request body are ignored. In order to pass in any kind of parameters, a query string can be used after the path. As the HTTP server pulls in query strings as part of the path, we'll need to parse them out before using the routing dictionary:
from urllib.parse import urlparse, parse_qs
# <snip>
    ROUTES = {
        '/': 'serve_front_page',
        '/index.html': 'serve_html',
        '/python-logo/': 'serve_python_logo',
        '/js/myjs.js': 'serve_js',
        '/favicon.ico': 'serve_favicon',
        '/query_test': 'serve_html'
    }
# <snip>
    def do_GET(self):
        self.read_http_request()
        segments = urlparse(self.path)
        self.query = parse_qs(segments.query)
        self.log_message(f'{self.query}')
        bytes_data = self.__getattribute__(self.ROUTES[segments.path])()
        self.write_data(bytes_data)
        self.request.close()
urlparse allows us to break up the path and query string components. parse_qs will then parse the query string to give us a dictionary value. Note that both of these examples are valid:
# Handled by the code
http://website/query-test?test1=test2&test3=test4
# Valid, but not handled by our code
http://website/query-test/?test1=test2&test3=test4
But I'm only handling the first case on purpose to keep things simple (feature rich web servers can deal with this). We'll update our client to pass in some parameters and see the result:
import requests
r = requests.get('http://localhost/query_test?test1=foo&test2=bar&test3=hello%20world')
print(r.headers)
print(r.content)
Which will give the following output from the server:
127.0.0.1 - - [29/Jul/2023] {'test1': ['foo'], 'test2': ['bar'], 'test3': ['hello world']}
The values are lists because using the same key multiple times in a query string allows for multiple values:
r = requests.get('http://localhost/query_test?test1=foo&test2=bar&test3=hello%20world&test2=baz&test2=nothing')
# 127.0.0.1 - - [29/Jul/2023] {'test1': ['foo'], 'test2': ['bar', 'baz', 'nothing'], 'test3': ['hello world']}
If you only wish to support single values with unique keys, parse_qsl can be used instead:
segments = urlparse(self.path)
# Returns key value pair tuple
self.query = dict(parse_qsl(segments.query))
self.log_message(f'{self.query}')
bytes_data = self.__getattribute__(self.ROUTES[segments.path])()
r = requests.get('http://localhost/query_test?test1=foo&test2=bar&test3=hello%20world')
# 127.0.0.1 - - [29/Jul/2023] {'test1': 'foo', 'test2': 'bar', 'test3': 'hello world'}
r = requests.get('http://localhost/query_test?test1=foo&test2=bar&test3=hello%20world&test2=baz&test2=nothing')
# 127.0.0.1 - - [29/Jul/2023] {'test1': 'foo', 'test2': 'nothing', 'test3': 'hello world'}
As you can see the multiple values version still works but it only takes in the last defined value. Again, another good reason to go with a feature rich web server for practical use.
HEAD
This is essentially the same as a GET request except it returns only the headers. It's useful for things like figuring out if a file exists without downloading the entire thing. That said, even though the response body is blank, the headers still have to be calculated exactly the same as if the file were being downloaded. Server side this isn't too bad for static files, but having to dynamically generate a large amount of data just to push back an empty body is not ideal, so it's something to consider in your method logic. With the base HTTP server, do_HEAD will need to be added with the logic, and the write_data method will need another version to handle headers properly (I'll ignore query string parsing for simplicity here):
    def write_head_data(self, bytes_data):
        self.send_header('Content-Encoding', 'gzip')
        return_data = gzip.compress(bytes_data)
        self.send_header('Content-Length', len(return_data))
        self.end_headers()
        self.wfile.write(b'')

    def do_HEAD(self):
        self.read_http_request()
        bytes_data = self.__getattribute__(self.ROUTES[self.path])()
        self.write_head_data(bytes_data)
        self.request.close()
Now requests will need to call head() instead of get():
import requests

r = requests.head('http://localhost/index.html')
print(r.headers)
print(r.content)
# {'Server': 'BaseHTTP/0.6 Python/3.10.6', 'Date': 'Sat, 29 Jul 2023 18:32:14 GMT', 'Content-Type': 'text/html', 'Content-Encoding': 'gzip', 'Content-Length': '129'}
# b''
# Server Log: 127.0.0.1 - - [29/Jul/2023] "HEAD /index.html HTTP/1.1" 200 -
So Content-Length properly shows the number of bytes that would have come from the compressed HTML, but the response body is empty.
POST
POSTs are meant for cases where data is to be changed on the server side. It's important to note that the presence of an HTML form doesn't guarantee a POST: search functionality may present a form for search parameters while the results come back as a GET query with the parameters in a query string. Because POST lets you declare data in the body, query strings in the URL have little practical use here and should be avoided. The first type of POST is a key/value post encoded as application/x-www-form-urlencoded in the body. First we'll just print out the headers and body to see what it looks like:
    def read_post_request(self):
        self.log_message(f"Reading request from {self.client_address}")
        print(dict(self.headers.items()))
        content_length = int(self.headers['Content-Length'])
        data = self.rfile.read(content_length)
        print(data)

    def serve_post_response(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        return bytes(f'<html><head><title>Old Website</title><script type="text/javascript" src="/js/myjs.js"></script></head><body><img src="/python-logo/" /></body></html>', encoding='utf-8')

    def do_POST(self):
        self.read_post_request()
        bytes_data = self.serve_post_response()
        self.write_data(bytes_data)
        self.request.close()
And the client:
import requests
r = requests.post('http://localhost/', data={'var1': 'test', 'var2': 'test2'})
print(r.headers)
print(r.content)
After running the client we see this on the server side:
127.0.0.1 - - [29/Jul/2023] Reading request from ('127.0.0.1', 35888)
{'Host': 'localhost', 'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '20', 'Content-Type': 'application/x-www-form-urlencoded'}
b'var1=test&var2=test2'
Because the client is sending a body, the Content-Type and Content-Length headers are included. The body can now be parsed on the server side using parse_qsl:
    def read_post_request(self):
        self.log_message(f"Reading request from {self.client_address}")
        print(dict(self.headers.items()))
        content_length = int(self.headers['Content-Length'])
        data = self.rfile.read(content_length)
        self.data = dict(parse_qsl(data.decode('utf-8')))
        print(self.data)
        # Output: {'var1': 'test', 'var2': 'test2'}
As data read from a connection comes in as bytes, it can be turned into a string using decode(). Content-Length is also an interesting predicament security-wise. When doing a read() on a socket, if you attempt to read() more than the client sent, the server can block indefinitely: it assumes more packets may still arrive and the network is simply slow. A malicious attacker could set Content-Length to more bytes than are actually sent, causing a server-side read() to hang. It's important to ensure your connections have timeouts for this case.
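With the http.server classes, one mitigation is the handler's timeout attribute: StreamRequestHandler (which BaseHTTPRequestHandler builds on) applies it to the connection socket during setup, so a stalled read() raises an exception instead of hanging forever. A minimal sketch:

class MyHTTPHandler(BaseHTTPRequestHandler):
    # Applied via self.connection.settimeout() in setup(), so a read()
    # waiting on bytes that never arrive times out after 5 seconds
    timeout = 5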
Now another option is to simply post a format such as JSON. This is so popular with REST APIs that requests even has an option for it:
import requests
r = requests.post('http://localhost/', json={'var1': 'test', 'var2': 'test2'})
print(r.headers)
print(r.content)
Which can then be decoded as JSON on the server side:
import json
# <snip>
    def read_post_request(self):
        self.log_message(f"Reading request from {self.client_address}")
        print(dict(self.headers.items()))
        content_length = int(self.headers['Content-Length'])
        data = self.rfile.read(content_length)
        self.data = json.loads(data)
        print(self.data)
In this case json.loads accepts bytes so we don't need to decode it ourselves. Output-wise it's the same, but the content type has changed to JSON:
{'Host': 'localhost', 'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '33', 'Content-Type': 'application/json'}
{'var1': 'test', 'var2': 'test2'}
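Sending JSON back follows the same pattern as the other handlers. Here's a sketch that swaps serve_post_response to echo the parsed data (write_data still gzips it on the way out):

import json
# <snip>
    def serve_post_response(self):
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        # Echo back what read_post_request parsed, encoded as JSON bytes
        return json.dumps({'received': self.data}).encode('utf-8')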
Now another method is one called a multipart post. This is mainly used for cases where you might be dealing with binary input along with other form fields (generally a file selection input in an HTML form). So to see what this looks like I'll update our client:
import requests

multipart_data = {
    'image_data': ('python_logo.png', open('python-logo-only.png', 'rb'), 'image/png'),
    'field1': (None, 'value1'),
    'field2': (None, 'value2')
}
r = requests.post('http://localhost/', files=multipart_data)
print(r.headers)
print(r.content)
So each multipart_data entry has the field name as its key and a tuple as its value. Actual files have a filename as the first part, a file pointer as the second, and an optional MIME type for the contents. Regular fields simply have None as the filename and the string value as the second part. This all gets passed in via the files= keyword argument of the requests post. Now to check what the server receives out of this:
    def read_post_request(self):
        self.log_message(f"Reading request from {self.client_address}")
        print(dict(self.headers.items()))
        content_length = int(self.headers['Content-Length'])
        self.data = self.rfile.read(content_length)
        print(self.data)
Quite a lot of data comes back from this:
{'Host': 'localhost', 'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '21005', 'Content-Type': 'multipart/form-data; boundary=0cfc2d1479f926612dde676e228fc12c'}
b'--0cfc2d1479f926612dde676e228fc12c\r\nContent-Disposition: form-data; name="image_data"; filename="python_logo.png"\r\nContent-Type: image/png\r\n\r\n\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\r\x00\x00\x01F\x08\x06\x00\x00\x00p\x8d\xca\xa7\x00\x00\x00\tpHYs\x00\x00#\xbf\x00\x00#
<snip lots of binary here>
\r\n--0cfc2d1479f926612dde676e228fc12c\r\nContent-Disposition: form-data; name="field1"\r\n\r\nvalue1\r\n--0cfc2d1479f926612dde676e228fc12c\r\nContent-Disposition: form-data; name="field2"\r\n\r\nvalue2\r\n--0cfc2d1479f926612dde676e228fc12c--\r\n'
So what's happening here is we have something called a boundary, which marks the separation between each field. I cleaned up the output for the last part and it ends up looking like this:
--0cfc2d1479f926612dde676e228fc12c
Content-Disposition: form-data; name="field1"
value1
--0cfc2d1479f926612dde676e228fc12c
Content-Disposition: form-data; name="field2"
value2
--0cfc2d1479f926612dde676e228fc12c--
So as you can see, the boundary= value from the Content-Type header appears with -- before it to indicate a new field on its own line. The very last one has an additional -- at the end to mark the completion of all the fields. Much of this comes from email standards, which used multiparts as a way of indicating file attachments. All of this looks quite tedious to deal with, but thankfully there's a package we can install via pip install multipart which makes it easier to work with:
from multipart import MultipartParser
# <snip>
    def read_post_request(self):
        self.log_message(f"Reading request from {self.client_address}")
        print(dict(self.headers.items()))
        content_length = int(self.headers['Content-Length'])
        content_boundary = self.headers['Content-Type'].split('=')[1]
        self.data = MultipartParser(self.rfile, content_boundary, content_length)
        print(self.data.get('field1').value)
        print(self.data.get('field2').value)
Now after starting the server and running the client again:
{'Host': 'localhost', 'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '21005', 'Content-Type': 'multipart/form-data; boundary=708b331135e8d587fd9a1cced157cf79'}
value1
value2
127.0.0.1 - - [29/Jul/2023] "POST / HTTP/1.1" 200 -
The data is being shown. multipart also gives a handy save_as method for writing the uploaded file to disk:
    def read_post_request(self):
        self.log_message(f"Reading request from {self.client_address}")
        print(dict(self.headers.items()))
        content_length = int(self.headers['Content-Length'])
        content_boundary = self.headers['Content-Type'].split('=')[1]
        self.data = MultipartParser(self.rfile, content_boundary, content_length)
        image_entry = self.data.get('image_data')
        image_entry.save_as(image_entry.filename)
This will write the image to the current directory with the python_logo.png name we gave it in the requests data.
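One caution worth adding: that filename comes straight from the client, so it shouldn't be trusted as-is. A minimal sketch of stripping directory components before saving:

import os
# <snip>
    image_entry = self.data.get('image_data')
    # basename drops any path components, so a name like
    # '../../etc/passwd' can't escape the target directory
    image_entry.save_as(os.path.basename(image_entry.filename))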
Status Codes
Now we look at some of the HTTP status codes. Instead of going through every one, I'll simply cover what the different categories entail.
2xx
These indicate success. Out of all of them, 200 is the one you'll see in the majority of cases.
3xx
These generally deal with redirections. 304 is a bit of an odd one, indicating the contents have not been modified; it's used in coordination with the caching system. 307 can be used to indicate a redirection to another location.
4xx
This is mostly around showing something bad with the request. A few notable codes:
- 400 - Your client request is completely wrong (missing/malformed headers)
- 403 - You're not allowed to view the page
- 404 - It's difficult to find someone who hasn't hit this before. Used to indicate a page doesn't exist (see the sketch after this list for returning one from our server)
- 418 - I'm a teapot. Based on an April Fools' RFC defining the Hyper Text Coffee Pot Control Protocol (RFC 2324)
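As it stands, our demo server raises an unhandled KeyError for unknown paths; http.server provides send_error for returning a proper 404 instead. A sketch of the guard:

    def do_GET(self):
        self.read_http_request()
        if self.path not in self.ROUTES:
            # send_error writes the status line, default headers,
            # and a small HTML error body for us
            self.send_error(404)
            return
        bytes_data = self.__getattribute__(self.ROUTES[self.path])()
        self.write_data(bytes_data)
        self.request.close()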
5xx
These codes all relate to the server being broken. 500 is the generic "this server is broken" response. The other codes provide more specifics about the exact nature of what went wrong.
Conclusion
This concludes our look at the HTTP protocol using Python. It will also be the final installment of this series. I believe HTTP is a sufficient level at which to stop deep diving, as modern abstractions such as user sessions can be reasoned about more quickly by understanding the concepts presented up to now. The networking parts of this guide can also help those in a DevOps role who need to troubleshoot more unusual situations.
If there's one thing I hope you get out of this, it's that despite all the code shown, it's not even a complete HTTP server implementation that properly handles all use cases. Security-wise, communication isn't encrypted, there's no timeout handling, and the header parsing in general could use work. So trying to do it yourself, where you have to keep several use cases in mind and deal with potential malicious actors, is not worth it. Work with your security needs, threat model, and use cases to find a comprehensive server that fits your needs.
Thank you to all the new folks who have followed me over the last few weeks. Look forward to more articles ahead!