When you access a site like this one there's quite a lot going on under the hood. In today's cloud-centric world much of this low-level communication is abstracted away. In this series we'll look at some of the foundations of network communication, starting with how basic connections work.
Client and Server
To start out we'll take a basic server that returns whatever is sent to it, upper-cased. It uses Python's built-in socketserver module to handle the details; this example comes from the module's documentation:
import socketserver

class MyTCPHandler(socketserver.StreamRequestHandler):
    """
    The request handler class for our server.

    It is instantiated once per connection to the server, and must
    override the handle() method to implement communication to the
    client.
    """

    def handle(self):
        # self.rfile is a file-like object wrapping the TCP socket
        # connected to the client
        self.data = self.rfile.readline().strip()
        print("{} wrote:".format(self.client_address))
        print(self.data)
        # just send back the same data, but upper-cased
        self.request.sendall(self.data.upper())

if __name__ == "__main__":
    HOST, PORT = "localhost", 5555

    # Create the server, binding to localhost on port 5555
    with socketserver.TCPServer((HOST, PORT), MyTCPHandler) as server:
        # Activate the server; this will keep running until you
        # interrupt the program with Ctrl-C
        server.serve_forever()
The first important part here is the server binding to an address:
with socketserver.TCPServer((HOST, PORT), MyTCPHandler) as server:
This registers that the program wants to use a specific port (5555), so the operating system attempts to reserve it until the program shuts down. A handler is registered as well; it is executed when a client connects to decide what will be done with the request. In this case a StreamRequestHandler is being used, which exposes the client's connection as a file-like object:
self.rfile.readline().strip()
The handler reads in a single line. Note that in the real world, where you don't know who is sending data, someone could simply never send a newline and the connection would be stuck open. Multiply this across many clients and you soon have a server with so many connections tied up that resources are exhausted, effectively a Denial of Service (DoS) attack.
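One common mitigation is a per-connection timeout. As a sketch, StreamRequestHandler honors a timeout class attribute, applying it to the socket during setup, so a client that stalls gets disconnected rather than holding a connection open forever (the handler name and the 5 second value here are illustrative, not from the original example):

```python
import socket
import socketserver

class TimeoutTCPHandler(socketserver.StreamRequestHandler):
    # StreamRequestHandler applies this value to the socket via
    # settimeout() during setup()
    timeout = 5  # seconds; an illustrative value, tune for your workload

    def handle(self):
        try:
            data = self.rfile.readline().strip()
        except socket.timeout:
            # The client never sent a newline in time; drop them
            print("{} timed out".format(self.client_address))
            return
        self.request.sendall(data.upper())
```

Well-behaved clients are unaffected; only a connection that sits idle past the timeout gets closed.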
While readline is shown here, behind the scenes it's actually making a series of recv calls, each reading up to a certain number of bytes, until a full line has arrived. In a similar vein, sendall will keep sending data until there is none left. Now on the client side:
import socket
MSG = bytearray("Hello World", 'utf-8')
connection = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
connection.connect(("127.0.0.1", 5555))
print(connection.getsockname())
connection.send(MSG)
result = connection.recv(len(MSG.upper()))
print(result)
One thing to note here is that networking deals with bytes at a low level. bytearray is a special Python type which turns a string (a sequence of characters) into a series of bytes according to the given character encoding (UTF-8).
socket.socket(socket.AF_INET, socket.SOCK_STREAM)
This creates a socket. AF_INET indicates we'll be dealing with connections via IPv4 (Internet Protocol version 4) addresses. SOCK_STREAM is a fancy way of indicating a TCP (Transmission Control Protocol) connection. This means we're connecting via TCP/IP.
connection.send(MSG)
result = connection.recv(len(MSG.upper()))
Here the message is sent, then we switch to receiving to get the data back from the server. Since we know what will come back (the message upper-cased), we can ask recv for the exact number of bytes of the upper-cased version. Now after running everything together:
> python .\server.py
('127.0.0.1', 57990) wrote:
b'Hello World'
> python client.py
('127.0.0.1', 57990)
b'HELLO WORLD'
The two way connection is complete.
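The recv behavior described above can also be sketched by hand. Here's a minimal version of what a readline-style loop does underneath, assuming a newline-delimited protocol (the function name and buffer sizes are made up for illustration):

```python
import socket

def recv_line(sock, max_bytes=4096):
    """Collect recv() chunks until a newline arrives or the peer closes.

    A sketch of what a file-like readline() does under the hood; real
    code would handle partial lines and protocol errors more carefully.
    """
    chunks = []
    total = 0
    while total < max_bytes:
        chunk = sock.recv(1024)  # each call returns up to 1024 bytes
        if not chunk:            # empty bytes means the peer closed
            break
        chunks.append(chunk)
        total += len(chunk)
        if b"\n" in chunk:       # got the line terminator, stop reading
            break
    return b"".join(chunks)
```

Note the max_bytes cap: it's exactly the kind of guard that prevents the resource-exhaustion problem mentioned earlier, since a peer can't force the buffer to grow without bound.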
Ports
Ports on operating systems actually have a specification from the Internet Engineering Task Force (IETF) in Request For Comments (RFC) 6335, which describes how ports and service names work. Service names are special labels for specific ports and are managed by IANA (the Internet Assigned Numbers Authority). The IANA website holds the current mapping of service names to ports, and operating systems often use it to provide user-friendly names for these ports. Linux, for example, often stores this listing in /etc/services:
tcpmux 1/tcp # TCP port service multiplexer
echo 7/tcp
echo 7/udp
discard 9/tcp sink null
discard 9/udp sink null
systat 11/tcp users
daytime 13/tcp
daytime 13/udp
netstat 15/tcp
qotd 17/tcp quote
chargen 19/tcp ttytst source
chargen 19/udp ttytst source
ftp-data 20/tcp
ftp 21/tcp
fsp 21/udp fspd
ssh 22/tcp # SSH Remote Login Protocol
telnet 23/tcp
smtp 25/tcp mail
time 37/tcp timserver
time 37/udp timserver
whois 43/tcp nicname
tacacs 49/tcp # Login Host Protocol (TACACS)
The built-in socket Python module even has getservbyname and getservbyport functions to work with this information:
>>> import socket
>>> print(socket.getservbyname('http'))
80
>>> print(socket.getservbyport(80))
http
There are also designations for port ranges. Ports 0-1023 are system ports. Running a server on one of them requires administrative access on Windows or root/privileged access on *NIX systems; without those privileges an access-denied error appears. This is done because you wouldn't want, say, a random user putting up their own SSH server.
Ports 1024-49151 are meant for non-admin users, allowing them to run services. This is why binding to 5555 doesn't require administrative access. It's also the reason many web applications installed in a local environment tend to use ports such as 8080 or 8888, so users don't have to worry about admin permissions. Finally there are "dynamic ports", which can be seen in the output:
('127.0.0.1', 57990) wrote:
These ports are reserved by the operating system for client communication. Without such a port the server would have no way to send data back to the client; in essence, a dynamic port lets the client also act as a "server" of sorts for the duration of the connection. While the RFC lists these ports as 49152-65535, the actual range is OS specific and in some cases configurable. Later versions of Windows use the IANA recommendation, while my Ubuntu instance has:
# sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768 60999
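You can watch the OS hand out one of these ports by binding to port 0, which asks the operating system to pick a free port itself; a quick sketch:

```python
import socket

# Binding to port 0 asks the OS to choose an unused port on its own,
# typically drawn from the dynamic/ephemeral range discussed above
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(("127.0.0.1", 0))
host, port = sock.getsockname()
print(port)  # an OS-assigned port, always above the system range (0-1023)
sock.close()
```

Running this repeatedly shows different port numbers, which is the same mechanism that produced the 57990 seen in the client output earlier.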
IP Address
As mentioned previously, network communication works in bytes. So what about IP addresses? Are we simply turning the string "127.0.0.1" into a series of bytes? It turns out that an IP address is a special way of writing a 32-bit number. Each .-separated value is an 8-bit/1-byte segment:
[8 bits].[8 bits].[8 bits].[8 bits]
Where each 1-byte segment is the binary version of the decimal number shown in that segment. socket.inet_aton can be used to showcase this:
>>> import socket
>>> ip_binary = socket.inet_aton('127.0.0.1')
>>> import struct
>>> struct.unpack('BBBB', ip_binary)
(127, 0, 0, 1)
>>> ip_binary
b'\x7f\x00\x00\x01'
ip_binary is the sequence of bytes, and struct.unpack is set to four unsigned chars of 1 byte each (what B represents), which hold the values 0-255, matching the allowed range of each IPv4 segment. IPv6 is a bit more complicated, and a full example looks something like this:
0123:4567:89ab:cdef:0123:4567:89ab:cdef
In this case segments are separated by :. Each segment holds four base-16 digits from 0 (0000 in binary) to f (1111 in binary), each taking up 4 bits. That gives 16 bits per segment across 8 segments, for a total of 128 bits, making an IPv6 address 4 times the size of an IPv4 address. Since socket.inet_aton only handles IPv4 addresses, socket.inet_pton is used instead, which allows us to designate IPv6 addresses:
>>> import socket
>>> socket.inet_pton(socket.AF_INET6, '0123:4567:89ab:cdef:0123:4567:89ab:cdef')
b'\x01#Eg\x89\xab\xcd\xef\x01#Eg\x89\xab\xcd\xef'
An IP address in binary form can be passed to the constructor of one of the ipaddress module's classes to get back an object:
>>> import ipaddress
>>> import socket
>>> ipv6_bytes = socket.inet_pton(socket.AF_INET6, '0123:4567:89ab:cdef:0123:4567:89ab:cdef')
>>> ipv4_bytes = socket.inet_aton('127.0.0.1')
>>> ipaddress.IPv4Address(ipv4_bytes)
IPv4Address('127.0.0.1')
>>> ipaddress.IPv6Address(ipv6_bytes)
IPv6Address('123:4567:89ab:cdef:123:4567:89ab:cdef')
>>> ipaddress.IPv6Address(ipv6_bytes).exploded
'0123:4567:89ab:cdef:0123:4567:89ab:cdef'
For IPv6 in particular, leading 0s are not shown in the output by default; the lack of a digit simply implies 0. Using the exploded property will show the full address with the 0s included. There are also a few helper properties with useful information:
>>> ipaddress.IPv4Address(ipv4_bytes).is_loopback
True
>>> ipaddress.IPv4Address(ipv4_bytes).is_global
False
>>> ipaddress.IPv4Address(ipv4_bytes).is_private
True
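Since an IPv4 address is just a 32-bit number, the ipaddress classes also convert to and from a plain integer, which makes the underlying representation easy to see:

```python
import ipaddress

addr = ipaddress.IPv4Address("127.0.0.1")
# int() exposes the underlying 32-bit value: 127*2**24 + 0 + 0 + 1
print(int(addr))                          # 2130706433
# and the constructor accepts that integer straight back
print(ipaddress.IPv4Address(2130706433))  # 127.0.0.1
```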
So what do is_private and is_global actually mean, and how can we tell? It turns out that IANA handles IP addresses as well, though in this case they mostly handle the allocation of the first 8-bit value of an IPv4 address. For example, if I look up one of the Google DNS IP addresses, 8.8.8.8:
008/8 Administered by ARIN 1992-12 whois.arin.net https://rdap.arin.net/registry
http://rdap.arin.net/registry LEGACY
I'm told that it's administered by ARIN, the American Registry for Internet Numbers, which handles IP address allocation for most of North America. This means IANA acts as the allocation authority deciding which regional registry a prefix goes to. The regional registries are:
- AFRINIC: Africa Region
- APNIC: Asia/Pacific Region
- ARIN: Canada, USA, and some Caribbean Islands
- LACNIC: Latin America and some Caribbean Islands
- RIPE NCC: Europe, the Middle East, and Central Asia
Now it's important to note that you generally get the best information by searching for an IP address with its regional whois service. For example, if I try to use ARIN's whois to search for a Japanese IP address, it will tell me I should use APNIC instead. Using these services I can get more information about IP address ownership; in Google DNS's case, the record shows the address is owned by Google. You can also see which IP blocks an organization owns, for example the list of Google-owned IP blocks. One thing to note is that this is also an unfortunate tool for malicious actors: they find IP blocks belonging to an organization and initiate mass scans against them. This is why EC2 instances on AWS are continually being scanned en masse. Another rather awkward situation with ownership is that some Early Registration Transfers moved certain IP blocks from one regional registry to another (RIPE to ARIN).
DNS
Domain Name System (DNS) allows the resolution of a name to an IP address. It's powered by a global network of servers along with local overrides: the /etc/hosts file on *NIX systems and the C:\Windows\System32\drivers\etc\hosts file on Windows allow manually setting a local hostname -> IP address mapping. As an example, on my Ubuntu instance:
127.0.0.1 localhost
127.0.0.1 gitserver
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
This maps localhost
to 127.0.0.1
and another entry does the same for a gitolite server to be accessible locally with a more human friendly name. Note that because DNS lookups happen so much they are generally cached for performance purposes as network traffic relying on DNS lookups cannot continue without them resolving the IP address. As an example, here is one of the entries in my Windows DNS cache:
example.org
----------------------------------------
Record Name . . . . . : example.org
Record Type . . . . . : 1
Time To Live . . . . : 61854
Data Length . . . . . : 4
Section . . . . . . . : Answer
A (Host) Record . . . : 93.184.216.34
This tells me example.org resolves to 93.184.216.34
and the time to live (in seconds) indicates this entry should be cached for around 17 hours. Note this value fluctuates depending what's backing a DNS entry. Once that's done the lookups keep going up a chain of servers to find out what the IP is. This can be one of:
- Manually set servers, such as someone setting up Google DNS
- The router, which generally forwards to your ISP
- Your ISP
- A server part of the global DNS network
- A server specific to a domain name / organization
It's worth noting that Python does provide a way to get IPs from hostnames:
>>> import socket
>>> dns_result = socket.getaddrinfo('google.com', 80)
>>> dns_result
[(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('142.250.191.238', 80)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_DGRAM: 2>, 17, '', ('142.250.191.238', 80)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_RAW: 3>, 0, '', ('142.250.191.238', 80)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('2607:f8b0:4009:819::200e', 80, 0, 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_DGRAM: 2>, 17, '', ('2607:f8b0:4009:819::200e', 80, 0, 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_RAW: 3>, 0, '', ('2607:f8b0:4009:819::200e', 80, 0, 0))]
But the returned values don't quite map to how a user would typically expect to work with DNS (not to mention a port is required). Thankfully the dnspython package presents DNS results in a friendlier layout. It needs to be installed via pip: pip install dnspython:
from dns.resolver import resolve
from dns.rdatatype import RdataType

for query_type in RdataType:
    try:
        answers = resolve('dev.to', query_type)
        for rdata in answers:
            print(f"{RdataType.to_text(query_type)}:{rdata.to_text()}")
    except Exception:
        # most record types simply won't exist for a given domain
        continue
Which will output:
> python .\dns_list.py
A:151.101.194.217
A:151.101.66.217
A:151.101.130.217
A:151.101.2.217
NS:josh.ns.cloudflare.com.
NS:jill.ns.cloudflare.com.
SOA:jill.ns.cloudflare.com. dns.cloudflare.com. 2309864129 10000 2400 604800 3600
MX:10 alt4.aspmx.l.google.com.
MX:5 alt1.aspmx.l.google.com.
MX:5 alt2.aspmx.l.google.com.
MX:10 alt3.aspmx.l.google.com.
MX:1 aspmx.l.google.com.
TXT:"v=spf1 a mx include:_spf.google.com include:sendgrid.net include:servers.mcsv.net include:shops.shopify.com ~all"
TXT:"facebook-domain-verification=1xzy1qk89qs7ngxdt5e4s0kvvqw701"
TXT:"_globalsign-domain-verification=VzRovTWhxjedMqXFfoiZ-UNRnlnuTXYHgjKemPNt33"
TXT:"google-site-verification=oTtYzW83zP_41DlUrb_VXtAjLTW1p71RBmWR2g5ctrk"
There are a number of interesting record types you'll find in many queries:
- A records: Mapping of hosts -> IP addresses
- AAAA records: Same but for IPv6
- CNAME records: Used for aliases
- TXT records: Simply text information, but commonly used for domain ownership verification purposes
- NS records: Nameservers for the domain
- MX records: Used to indicate email servers
- SOA records: Required record that indicates the start of authority, with general ownership and administrative contact information
These are the ones most commonly used or modified. That said, some services such as Route 53 in AWS dynamically return IP addresses for A and AAAA records based on certain conditions, anything from server load to geographic location. In the case of dev.to's DNS answers we can reason a few things:
- They are using Cloudflare for DNS.
- Fastly is providing CDN (Content Delivery Network) services.
- Google mail services are being used for email.
- SPF (Sender Policy Framework) allows SendGrid, Mailchimp, and Shopify to send email on behalf of dev.to.
- Verification records were added to prove domain ownership to Facebook, GlobalSign, and Google.
Given that AAAA records are not present, you wouldn't be able to connect directly over IPv6. The IP addresses in the A records are also interesting, as they should technically fall under RIPE administration but look to be part of the Early Registration Transfer program (enough that a bug was filed about it with Red Hat). Also, if you try to visit one of the A record IP addresses directly, you won't get the site. This is because a single IP can host websites for multiple domains; the IP alone may not be enough, and the actual domain name must be included with the request so the server knows which site it's actually supposed to serve.
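This name-based virtual hosting can be sketched with the standard library alone. The two domains below are made up for illustration, and the handler simply switches on the Host header (a real server like nginx does essentially this, with far more care):

```python
import threading
import http.client
from http.server import BaseHTTPRequestHandler, HTTPServer

# Two hypothetical domains sharing one IP; the Host header picks the site
SITES = {
    "site-a.example": b"Welcome to A",
    "site-b.example": b"Welcome to B",
}

class VHostHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The Host header may carry a port, e.g. "site-a.example:8080"
        host = (self.headers.get("Host") or "").split(":")[0]
        body = SITES.get(host, b"Unknown site")
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

# Bind to port 0 so the OS picks a free port, then serve in the background
server = HTTPServer(("127.0.0.1", 0), VHostHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Same IP and port, but the Host header decides which site is served
conn = http.client.HTTPConnection(*server.server_address)
conn.request("GET", "/", headers={"Host": "site-b.example"})
print(conn.getresponse().read())  # b'Welcome to B'
```

Swap the Host header for "site-a.example" and the same IP:port returns the other site's content, which is exactly why visiting a bare A record IP often fails.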
Conclusion
This concludes our look into networking fundamentals in Python. While much of this information is abstracted away in modern cloud computing, it's still interesting to know what goes on behind the scenes. It might even help solve a challenging debugging session one day!