Josh Carvel

Posted on Dec 5, 2020 • Updated on Apr 28, 2022 • Originally published at joshcarvel.com

Networks, the Internet and the Web Demystified

#beginners #computerscience #codenewbie #webdev

Intro 💡

Computer networking is a confusing topic. So much jargon. So many complicated technologies. As a new developer, it's easy to get overwhelmed by it.

In this article I'll demystify the fundamentals by tackling them one at a time, in simple language. You'll understand the different types of computer networks, and exactly how the internet and the web work (and if you don't know the difference between those two, this article is definitely for you!).

We'll start from the beginning of computer networking. As a developer in the 2020s you might not care about computing in the 1960s, but it's going to help you build up concepts one at a time and not get overwhelmed. It's also a good reminder of the incredible progress we've made, from networking a few huge mainframe computers for government research, to you reading this article on a web browser from anywhere in the world!

Prerequisites

Basic knowledge of computing concepts such as bits and operating systems - see my How Computers Work series.

The ARPANET 📦

Computer networking began with the Advanced Research Projects Agency (ARPA), an agency of the US Department of Defense, established as part of its efforts in the space race in the 60s. ARPA funded computers at various universities, at a time when computers were rare, large and expensive.

Within those institutions, there was a system of timesharing, where multiple users could connect to the computer with separate physical terminals containing a monitor, and the computer was powerful enough to serve all the users. Users could share files and even send what could be described as email as long as they were connected to the same computer.

But the US government wanted ARPA researchers to be able to connect to the computers and allow results to be shared easily. Initially, an ARPA director called Bob Taylor was connected to three separate university computers via three separate terminals. He wanted to access them all from a single terminal, and secured the funding for what became the precursor to the internet, the ARPANET.

Data would be sent over phone lines, since the infrastructure already existed. However, in a phone call, you intend to speak to each other continuously for a set period, during which a physical line of communication is reserved, and then stop, at which point the line is freed up. Computers, on the other hand, should be able to send messages to any other computer at any time. A Welsh computer scientist called Donald Davies had already considered this and the ARPANET used a system he had devised: packet-switching.

The user-facing computers on the network, known as hosts, would communicate in short chunks of information known as packets. A long message could be split into multiple packets which could travel down whichever phone line was available. So a single line could be handling packets from various different hosts, one after another. In between all the host computers were reliable, single-purpose computers which directed the traffic. They read the destination address from the packet and passed it on until it got to the destination host. Today we use routers for this.
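To make the idea of splitting a message into packets concrete, here is a minimal Python sketch. The `Packet` fields and the `split_into_packets` helper are my own illustrative names, not the ARPANET's actual packet format, and the `sequence` field is an assumption added so the chunks can be put back in order.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    destination: str  # address of the destination host (hypothetical format)
    sequence: int     # position of this chunk in the original message (assumed field)
    payload: bytes    # the chunk of data being carried


def split_into_packets(message: bytes, destination: str, size: int = 8) -> list[Packet]:
    """Cut a message into chunks of at most `size` bytes, each labelled with its destination."""
    return [
        Packet(destination, index, message[start:start + size])
        for index, start in enumerate(range(0, len(message), size))
    ]


packets = split_into_packets(b"Hello from the ARPANET!", destination="host-b")
for packet in packets:
    print(packet)
```

Each packet carries everything an intermediate computer needs to pass it along, so the chunks don't have to travel down the same line.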

The other advantage of this system was that it provided protection from lines going down - packets had many, many ways of getting to their destination. This was important for the US military in the 60s, and the implementation of packet-switching was strongly influenced by the work of Paul Baran, who described a communication system that could survive a nuclear strike!

The first ARPANET message was sent in 1969 and the network developed throughout the 70s.

The Internet 🌐

Packet-switching established the system for sending messages, but ARPANET also needed a set of rules about how to use that system: a protocol. The protocol included that packets would conform to a standard format, with the data to be transmitted, called the payload, plus a header, containing the address it was going to.

However, from the early 1970s onwards, other computer networks began to emerge that used different protocols. You couldn't send a message from one network to another - the other network wouldn't understand it! This problem eventually led to the development of a way of inter-networking, or 'internetting'. All these networks working together became what we call the internet.

The internet started to really get going in the late 1980s. Cisco Systems began popularising the modern router - a device that could connect your network to other networks. Each network just had to know how to talk to the router, not the computers on the other network.

Two important protocols allow this to happen.

Internet Protocol (IP)

This is a fairly straightforward protocol derived from the original ARPANET protocols. It requires that a packet has a header with an IP address, which indicates where it's going. An IP address is a unique sequence of binary digits. Addresses are managed globally by the Internet Assigned Numbers Authority (IANA) and typically assigned to your device by your internet service provider. An address is usually dynamic, i.e. not permanent, so addresses can be reused when possible. Despite this, in the 1990s the world began to realise it would run out of IP addresses.

The old system, IPv4, used a 32-bit number, separated by dots into four 8-bit numbers displayed in their decimal equivalent, e.g. 172.217.7.238. IANA handed out its last unallocated blocks of IPv4 addresses in 2011, and the European regional registry announced it had run out in 2019, so this system is very slowly being phased out.
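As a quick illustration of the 'one 32-bit number shown as four 8-bit numbers' idea, here is a small sketch using Python's standard `ipaddress` module. The variable names are just for illustration.

```python
import ipaddress

address = ipaddress.IPv4Address("172.217.7.238")

# The same address as a single 32-bit integer.
as_integer = int(address)

# Rebuild the dotted form by hand: four 8-bit groups, most significant first.
octets = [(as_integer >> shift) & 0xFF for shift in (24, 16, 8, 0)]

print(as_integer)                                # the address as one number
print(f"{as_integer:032b}")                      # the same number as 32 binary digits
print(".".join(str(octet) for octet in octets))  # back to dotted decimal
```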

The new system, IPv6, uses a 128-bit number, separated by colons into eight 16-bit numbers, displayed in their hexadecimal equivalent (the base 16 number system, which uses the digits 0-9 plus the characters a-f, so it's shorter to write), e.g. 2001:cdba:0000:0000:0000:0000:3257:9652. This system was chosen for various reasons, but we can also safely say we won't run out, given that it can generate over 340 trillion trillion trillion unique addresses!
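The same standard `ipaddress` module can be used to poke at the IPv6 example. This sketch also shows the common shorthand where a run of all-zero groups is collapsed to `::`, which isn't covered above but which you will see in practice.

```python
import ipaddress

address = ipaddress.IPv6Address("2001:cdba:0000:0000:0000:0000:3257:9652")

print(address.exploded)    # all eight 16-bit groups written out in full
print(address.compressed)  # the run of zero groups collapsed to '::'
print(2 ** 128)            # how many distinct 128-bit values exist
```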

Just type 'my IP address' into Google to see your current IP address.

Transmission Control Protocol (TCP)

While IP is used by routers, TCP is used by the host computers on either end of the transmission of data - the routers in between don't read the TCP information on the packet.

A TCP header contains a lot of useful information. It contains a sequence number which is used to determine the position of the data in the completed message, so if packets arrive out of order, the part of the destination host's operating system that knows how to deal with TCP will reassemble them.
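Here is a rough sketch of reassembly by sequence number. It is a toy model, not how an operating system actually implements TCP: the `reassemble` function is hypothetical, and I'm assuming each sequence number is simply the byte offset of that chunk in the message.

```python
def reassemble(segments: list[tuple[int, bytes]]) -> bytes:
    """Rebuild a message from (sequence_number, data) pairs that arrived in any order."""
    message = b""
    seen = set()
    for sequence_number, data in sorted(segments):
        if sequence_number in seen:  # a duplicate of a chunk we already have
            continue
        seen.add(sequence_number)
        message += data
    return message


# Chunks arrive out of order, and one of them arrives twice.
arrived = [(6, b"world"), (0, b"hello "), (6, b"world")]
print(reassemble(arrived))
```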

It also has a system to ensure data is not lost, built on acknowledgements, which begins with the TCP handshake or three-way handshake. It uses flags, i.e. a bit in a certain position in the header which is allocated a specific meaning and set to 1 or 0. In the first transmitted message, a flag referred to as SYN (synchronisation) is set to 1. The receiving host responds with a message with both the SYN and ACK (acknowledgement) flags set to 1. Then the first host also sends a message with ACK set to 1.

After three messages, the hosts are confident of a good connection. Ideally, the fourth packet sent contains some actual data. However, if acknowledgements were not received, attempts will be made to retransmit the data (subject to a maximum number of attempts). Any duplicates received at the destination will be discarded.
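To see what 'flags' look like as bits, here is a small sketch of the three handshake messages. The bit values for SYN and ACK are the ones used in the standard TCP header, where the second message carries both flags. The `describe` helper and the `handshake` list are just illustrative, and nothing is actually sent over a network.

```python
SYN = 0x02  # synchronise flag bit in the TCP header
ACK = 0x10  # acknowledgement flag bit in the TCP header


def describe(flags: int) -> str:
    """Name whichever of the two flags are set to 1."""
    names = [name for name, bit in (("SYN", SYN), ("ACK", ACK)) if flags & bit]
    return "+".join(names) or "none"


handshake = [
    ("client -> server", SYN),        # 1. request to open a connection
    ("server -> client", SYN | ACK),  # 2. acknowledge, and synchronise back
    ("client -> server", ACK),        # 3. acknowledge the server's reply
]

for direction, flags in handshake:
    print(f"{direction}: {describe(flags)} (flag bits {flags:08b})")
```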

TCP also enables a receiving host's operating system to know what program the data is meant for. Like a ship arriving at a harbour, TCP packets arrive at ports. A port is simply an identifying number, and as with IP addresses, IANA maintains an official list of ports for specific uses. A TCP packet specifies the number of the destination port as well as the source port it has come from. Note that the combination of an IP address and port is generally referred to as a socket or endpoint and represents one end of a TCP connection.

Internet protocol suite

TCP and IP, designed by Vint Cerf and Bob Kahn and finalised in 1978, form the basis of what we today call the Internet protocol suite, often referred to simply as TCP/IP, though it also contains many other protocols.

One such protocol which is commonly used is User Datagram Protocol (UDP). Invented in 1980, it allows messages to be exchanged without the initial handshake or acknowledgements, meaning messages are transmitted more quickly, but less reliably. Today it's used by applications such as video chat apps, where speed of transmission is key.

Types of network 🏢

Let's now zoom in to the kinds of networks that were being linked together by the internet.

There are many different classifications of network, mostly based on their geographical extent. For example, ARPANET was a Wide Area Network (WAN). A small, local network on the other hand, is known as a Local Area Network (LAN).

One of the first LANs was established in 1973 at Xerox's research centre, the famous Xerox PARC. At Xerox PARC they used personal computers before most of the rest of the world had even seen one, and a fast, scalable way of connecting them was invented, called ethernet.

Every computer on the network had a unique media access control (MAC) address. Data was sent along a cable in what's called an ethernet frame, which includes a header with the source and destination MAC addresses, and a data payload. The frame was passed to all connected computers, but only the intended recipient processed it.
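Here is a sketch of reading the header of a frame, assuming the common Ethernet II layout: 6 bytes of destination MAC, 6 bytes of source MAC, then a 2-byte type field that isn't mentioned above. Real frames also have a preamble and a checksum, which I've left out. The MAC addresses and helper names are made up for illustration.

```python
import struct


def format_mac(raw: bytes) -> str:
    """Render six raw bytes in the usual colon-separated hex form."""
    return ":".join(f"{byte:02x}" for byte in raw)


def parse_frame(frame: bytes) -> dict:
    # Header layout: 6-byte destination MAC, 6-byte source MAC, 2-byte type field.
    destination, source, ether_type = struct.unpack("!6s6sH", frame[:14])
    return {
        "destination": format_mac(destination),
        "source": format_mac(source),
        "ether_type": ether_type,
        "payload": frame[14:],
    }


MY_MAC = "aa:bb:cc:dd:ee:ff"  # hypothetical address of this machine
frame = bytes.fromhex("aabbccddeeff" "112233445566" "0800") + b"hello"

parsed = parse_frame(frame)
if parsed["destination"] == MY_MAC:
    print("for me:", parsed["payload"])
else:
    print("not mine, ignoring")
```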

Only one computer could send a message at one time, otherwise there would be a collision. If the computers detected the network was busy, they waited for a period of time before attempting retransmission of that data - the period of time was increased exponentially each time the network was busy, which is called exponential backoff.
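Exponential backoff can be sketched in a few lines. In this version the maximum wait doubles with each failed attempt and the actual wait is picked at random within that range, so two computers are unlikely to retry at the same moment. The randomisation and the cap on the exponent are assumptions based on how classic Ethernet is usually described, and `backoff_slots` is a hypothetical name.

```python
import random


def backoff_slots(attempt: int, max_exponent: int = 10) -> int:
    """Pick a random wait, in time slots, after the given failed attempt (counted from 1)."""
    exponent = min(attempt, max_exponent)  # stop the range growing forever (assumed cap)
    return random.randint(0, 2 ** exponent - 1)


for attempt in range(1, 6):
    longest = 2 ** attempt - 1
    print(f"attempt {attempt}: wait between 0 and {longest} slots, chose {backoff_slots(attempt)}")
```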

The configuration of the connections between computers is known as the network topology. The ethernet at Xerox PARC connected computers with a single cable, which is called the bus topology. Later, it became possible to plug several cables into a device called a hub, which could relay the signals as if the computers were all connected, but one cable going down wouldn't crash the whole network. Nowadays it is more common to use a switch in place of a hub. A switch learns the MAC addresses of the devices on the network so it only sends data to the computers that need it, allowing multiple communications between different computers to happen at once.
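The way a switch 'learns' can be modelled with a simple lookup table. This is a toy sketch, not real switch firmware: the `LearningSwitch` class, the numbered ports and the shortened MAC addresses are all made up for illustration.

```python
class LearningSwitch:
    def __init__(self, port_count: int) -> None:
        self.port_count = port_count
        self.mac_table: dict[str, int] = {}  # MAC address -> port it was last seen on

    def receive(self, in_port: int, source: str, destination: str) -> list[int]:
        """Return the ports the frame should be sent out of."""
        self.mac_table[source] = in_port  # learn which port the sender is on
        if destination in self.mac_table:
            return [self.mac_table[destination]]  # known: send to that port only
        # Unknown destination: behave like a hub and send to every other port.
        return [port for port in range(self.port_count) if port != in_port]


switch = LearningSwitch(port_count=4)
print(switch.receive(0, "aa:aa", "bb:bb"))  # bb:bb not seen yet, so every other port
print(switch.receive(1, "bb:bb", "aa:aa"))  # aa:aa was learned on port 0
print(switch.receive(0, "aa:aa", "bb:bb"))  # bb:bb is now known too
```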

Using a hub or switch in this way is an example of the star topology (the hub or switch is the centre and the other cables feed out from it). There are many types of topologies, each with their own pros and cons, but star is the most common these days. For example, your home network probably uses a star topology, with the router as the central point and your devices the points on the star.

Ethernet spread to many offices in the 80s so users could do things like share files and connect to printers on the network. For this to work, the computers needed a network interface controller (NIC, also known as a network adapter). These days they are built into motherboards. The operating system of the computer also needed facilities for networking built in - again, this is now standard.

At this time servers, such as file servers, mail servers and print servers, also came into widespread use in offices. The term server usually refers to a computer that has the dedicated purpose of providing some service to another computer, known as the client, although it may also refer to server software, which today can run on pretty much any computer. Most servers are dedicated machines that have no monitor, are in constant use and have a specific operating system and hardware optimised for their serving function.

Having a server-based network allows a network administrator to manage things like data storage, security and access to files, though it comes with its fair share of expenses. The opposite of a server-based network is a peer-to-peer network. This means the computers are connected together on an equal footing - there is no administrative control. We could say it's more 'democratic'.

The ARPANET and early internet were all about peer-to-peer networking, where files could be shared freely. But as businesses adopted networking and security concerns became more apparent, things became more siloed. Files could be restricted and you could set up a firewall to block certain traffic to your network. Later on, when sites like Napster emerged, peer-to-peer file sharing went mainstream again, although the market for sharing copyrighted material was quickly shot down by legal challenges. Nowadays, we see new types of peer-to-peer networks facilitating cryptocurrencies like Bitcoin.

Modern networking

Internet connections originally used phone lines in the traditional way, with a dial-up connection that hogged the phone line and was pretty slow for data transfer. They were used with a modem that modulated digital data to analogue for the phone line, and demodulated it back to digital for the computer (hence, mo-dem).

In the early 2000s, broadband came into widespread use. It divided the phone line into several channels, allowing more data to travel down the lines (greater bandwidth) and enabling multiple connections at once, specifically using digital subscriber line (DSL) technology. This was usually asymmetric (ADSL), which is optimised for downloading information more than uploading, which suits how most people use it and reduces the scope for interference on the line.

However, most of the internet is now connected with fiber-optic cables. They go all over the world and under the oceans, carrying digital information at incredible speeds using light. ADSL is still often used, but only at the last leg of connection to the user (the 'last mile'), where it is less cost-effective to install fiber-optic cables. In either case, some kind of modem is used to translate the signal, but this is often built into the same device as the router.

The router allows access for multiple devices on the local network, using the MAC addresses of the devices - assigned by the device manufacturer - to identify where to send the data. Most routers can send data wirelessly over short distances using radio waves, i.e. Wi-Fi. The other devices have a wireless NIC built in so they can understand the signals. Ethernet can still be used and is often used by businesses for a fast, reliable connection.

Outside of local networks, we can also get wireless internet access from our mobile network operator. The same towers that connect our devices for calls are used to send and receive data to and from our devices using radio waves, and are connected to the internet via cables. This is quite cost-effective to implement and has been improved over a number of 'generations' of technology, the latest being 5G (fifth generation).

But there's one development that has impacted us more than any other, and it's what you're reading this article on right now.

The World Wide Web 🕸️

The World Wide Web and internet are often conflated, but as we know, the internet came first.

The idea for the web originated with the idea of hypertext. Inspirations for it date back to the 40s and the term was coined in the 60s. 'Hyper' comes from the Greek for 'beyond', and the idea was having text with hyperlinks that take you beyond that text, to related information. Some systems were developed to implement this on computers, including one written by scientist Tim Berners-Lee while working at CERN in 1980. It was a web of sorts, but certainly not world-wide: the information was accessible to just him.

Over the course of the 1980s, CERN became the largest node (point of connection) on the internet. Berners-Lee came up with the big idea that a hypertext system could exist on the internet, so everyone could access the same information from their own computer, without having to log into another system or ask someone else for it.

In 1991, Berners-Lee wrote the world's first web server. A web server is a server that serves files known as webpages.

Let's look at the technologies that make browsing webpages possible.

Domains

Since the birth of the internet, there have been domain names. A domain represents a computer or group of computers on the internet, and domain names provide a more user-friendly description of a location on the internet. Since 1985 there has been a dedicated system to match domain names to IP addresses, called the Domain Name System (DNS).

This means that before your browser can request 'google.com', your computer asks a DNS server for the address. That server may pass the query on to other DNS servers until one that actually knows the address for that domain name is found, and the IP address is returned to your computer, which then sends its request to the Google server at that address. In this example, there are many servers across the world that can serve you 'google.com', even though the domain name is the same. Once your computer has the IP address, it will remember (cache) it for a while, so it doesn't have to ask again.
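The 'remember it for a while' part can be sketched as a small cache with an expiry time. Everything here is a stand-in: the `UPSTREAM` dictionary plays the role of the wider DNS, `CachingResolver` is a hypothetical class, and the address comes from a range reserved for documentation rather than being any site's real address.

```python
import time

# Hypothetical records standing in for the wider DNS (192.0.2.x is documentation-only).
UPSTREAM = {"example.com": "192.0.2.10"}


class CachingResolver:
    def __init__(self, ttl_seconds: float = 300.0) -> None:
        self.ttl_seconds = ttl_seconds
        self.cache: dict[str, tuple[str, float]] = {}  # name -> (address, expiry time)
        self.upstream_queries = 0

    def resolve(self, name: str) -> str:
        cached = self.cache.get(name)
        if cached and cached[1] > time.monotonic():
            return cached[0]  # still fresh, so no need to ask again
        self.upstream_queries += 1
        address = UPSTREAM[name]  # stand-in for asking a DNS server
        self.cache[name] = (address, time.monotonic() + self.ttl_seconds)
        return address


resolver = CachingResolver()
print(resolver.resolve("example.com"))
print(resolver.resolve("example.com"))
print("upstream queries:", resolver.upstream_queries)
```

For a real lookup, Python's standard library offers `socket.getaddrinfo`, which asks your operating system's resolver.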

The top-level domain is at the end of the domain name, for example .com (commercial), or .org (organisation). Management of these is overseen by IANA. Any name preceding this last dot is a subdomain. So 'google' is a subdomain of '.com', and in 'play.google.com' (Google play store), 'play' is a subdomain of google.
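Reading a domain name from right to left gives you the hierarchy. A tiny sketch, with `domain_hierarchy` being a hypothetical helper name:

```python
def domain_hierarchy(name: str) -> list[str]:
    """List each level of a domain name, starting from the top-level domain."""
    labels = name.split(".")
    return [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]


print(domain_hierarchy("play.google.com"))
```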

'www' is in fact just another subdomain, originally just added for extra clarity. It became a standard by accident. You don't have to include that subdomain when you register your domain name, and you can redirect requests using 'www' to the domain without 'www'. Because the web is so dominant now, people generally leave off 'www' when typing an address, and the site redirects to whichever form it prefers.

As you know, we refer to the collection of webpages at a particular domain name as a website. A website is just a hierarchical collection of files (the webpages) on the web server. The server serves the homepage when no specific page is requested. To access a different page, the path identifying that file on the server is added after the domain name.

HTTP

Domains provided the infrastructure to make the web possible, but to specify the rules of requesting webpages, Berners-Lee needed a new protocol. He and his team came up with Hypertext Transfer Protocol (HTTP). Today, it works as follows.

First, the client and server establish a reliable connection using TCP. Then, the client makes a specific type of HTTP request, for example a GET request, which requests a page (originally this was the only request available). The client can specify headers with additional information, one of which must be the Host header, naming the domain the request is for, which looks like this:

Host: google.com
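To show where that header sits, here is a sketch that builds the text of a simple HTTP/1.1 GET request. It only builds the bytes and doesn't send anything. `build_get_request` is a hypothetical helper, and the `Connection: close` header is an optional extra I've added for illustration.

```python
def build_get_request(host: str, path: str = "/") -> bytes:
    lines = [
        f"GET {path} HTTP/1.1",  # request line: method, path, protocol version
        f"Host: {host}",         # which site on this server we want
        "Connection: close",     # ask the server to close the connection afterwards
        "",                      # a blank line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


print(build_get_request("google.com").decode("ascii"))
```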

The server then sends a response which includes a status code, a number from one of the following groupings (only a handful of numbers in each group are used as status codes):

Informational responses (100–199)
Successful responses (200–299)
Redirects (300–399)
Client errors (400–499)
Server errors (500–599)

The most common status codes you will have seen as a user are 404: Not Found (the resource just isn't on the server) and 500: Internal Server Error (an error occurred on the server). You are probably less familiar with 200: OK, because this means the request was successful, and in the browser you will see the webpage that was returned in the body of the server's response, rather than a status code.
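Python's standard library knows the official status codes, so the groupings can be sketched like this. The `GROUPS` dictionary and `describe_status` function are my own names, and the group is simply the first digit of the code.

```python
from http import HTTPStatus

GROUPS = {
    1: "Informational",
    2: "Successful",
    3: "Redirect",
    4: "Client error",
    5: "Server error",
}


def describe_status(code: int) -> str:
    """Give the official phrase for a status code and the group it belongs to."""
    status = HTTPStatus(code)
    return f"{code} {status.phrase} ({GROUPS[code // 100]})"


for code in (200, 404, 500):
    print(describe_status(code))
```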

There are also other types of request available, such as POST, so the client can send data the user has inputted in a form to the server. For example, if you request to log into a website, the browser sends a POST request to the web server with your credentials. If your credentials are valid and the web server uses cookies as authentication, its response would contain the Set-Cookie header with an access token. The browser stores the token and sends it in the Cookie header on future requests so the server knows you are logged in.
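Here is a rough sketch of the cookie round trip from the browser's point of view, using Python's standard `http.cookies` module. The cookie name `session` and its value are made up, and real browsers do far more checking than this.

```python
from http.cookies import SimpleCookie

# Hypothetical value of a Set-Cookie header in the server's response.
set_cookie_value = "session=abc123; Path=/; HttpOnly"

jar = SimpleCookie()
jar.load(set_cookie_value)  # the 'browser' stores the token

# On a later request, only the name and value are sent back in the Cookie header.
cookie_header = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
print("Cookie:", cookie_header)
```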

HTTPS

As the web grew it quickly (within a few years) became apparent that greater security was needed. Simply requesting webpages isn't much of a problem, but sending user data is. One such problem is known as a man-in-the-middle attack - in other words, someone could intercept communication between you and the web server without you knowing.

So HTTPS (Hypertext Transfer Protocol Secure) was developed to accommodate the use of encryption. Encryption is where data is encoded into a scrambled form known as a ciphertext. An encryption key is a string of digits computers can apply in a mathematical way to decipher the message again.

The client and server cannot make the encryption key public, as it would defeat the point. However, they do need to know what key the other is using. This is achieved with mathematics, by combining a public key with a private key in such a way that both client and server generate the same number to use as a key, and use it privately to decipher the encrypted messages they send to each other. Anyone trying to intercept the messages can't reverse-engineer the process using the publicly available variables, because they are missing one variable: a private key. (For further explanation see the Computerphile video in Sources below).
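One well-known scheme of this kind is Diffie-Hellman key exchange. The sketch below uses tiny numbers so you can follow the arithmetic. It illustrates the general idea only, real connections use enormous numbers, and I'm not claiming this is exactly what any particular HTTPS connection does. All the values are made up.

```python
# Public values, agreed in the open. Tiny numbers for readability only.
p = 23  # a prime modulus
g = 5   # a base that both sides use

# Each side picks a private number and never sends it.
client_private = 6
server_private = 15

# Each side sends the other a public value derived from its private number.
client_public = pow(g, client_private, p)
server_public = pow(g, server_private, p)

# Each side combines the other's public value with its own private number.
client_shared = pow(server_public, client_private, p)
server_shared = pow(client_public, server_private, p)

print(client_shared == server_shared)  # both arrive at the same key
```

An eavesdropper sees `p`, `g` and both public values, but without one of the private numbers, working back to the shared key is impractical at real-world sizes.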

The encryption protocol used by HTTPS is now Transport Layer Security (TLS) but you have more likely heard of its predecessor Secure Sockets Layer (SSL). A TLS certificate, still widely referred to as an SSL certificate, certifies the ownership of a public key, indicating to the client that HTTPS can be used (browsers usually display a padlock icon in the address bar to indicate this). All websites these days are advised to have one, and Google ranks sites that do more favourably in its search algorithms.

URLs

We haven't yet mentioned the Uniform Resource Locator (URL). This brings together a domain name with a scheme - a lowercase value indicating what protocol is being used, of which http and https are two possibilities.

What other types of scheme are there? Well, a common one is ftp, for File Transfer Protocol. FTP has been around since the early days of the internet and requires you to log in to a server to upload or remove files. HTTP was a kind of adaptation of FTP to optimise it for web requests. FTP is still used in the context of the web - the author of a website would use it to upload a website to a server, or remove it from one.

Other schemes you may have seen are mailto, which allows you to send email from a webpage, or file, which allows you to open a local file in a browser.

The scheme is followed by a colon to separate it, and the domain name is prefaced with two forward slashes, which indicates a path to a computer, as opposed to a path to a file on a particular computer (usually a single forward slash).
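You can see these parts by splitting a URL with Python's standard `urllib.parse` module. The URLs below are just examples.

```python
from urllib.parse import urlsplit

web = urlsplit("https://www.google.com/search?q=networking")
print(web.scheme)  # the protocol, before the colon
print(web.netloc)  # the domain name, after the two forward slashes
print(web.path)    # the path to a resource on that computer
print(web.query)   # extra information after the question mark

# A scheme without '//' has no computer to point at, just a path.
mail = urlsplit("mailto:someone@example.com")
print(mail.scheme, repr(mail.netloc), mail.path)
```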

HTML

To enable a webpage to be created, Berners-Lee needed a markup language. A markup language allows the user to provide the computer with text along with annotations that indicate the display and function of that text. When the text is then displayed to the intended user, the annotations are hidden. Berners-Lee wrote his own markup language, based heavily on one in use at CERN at the time, but tailored for the web. It's called Hypertext Markup Language (HTML).

HTML provides elements that go on the webpage - headings, paragraph elements, lists and so on. An element is produced by using tags, in most cases by writing an opening tag, followed by the text to display, followed by the closing tag. If I wanted a main heading that read 'My webpage', I would write that element like this:

<h1>My webpage</h1>

But back in 1991, the star of the show was the anchor tag, <a>, which creates a hyperlink and makes hypertext possible. To tell the page where you want it to go, you have to add an attribute. Each element has a valid set of attributes you can use for more control over how that element is used. The attribute to add a hyperlink is href (hypertext reference) and is used with the relevant URL as the value, like so:

<a href="https://www.google.com">Go to google</a>

If the link is to another page on the same site, a relative file path could be used, e.g.

<a href="/shop.html">Shop</a>
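To see how a program might follow these links, here is a sketch that pulls the href values out of some HTML and resolves the relative one against the page's own address. It uses Python's standard `html.parser` and `urllib.parse` modules. The `LinkCollector` class and the example.com address are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url  # address of the page the HTML came from
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Relative paths are resolved against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


page = (
    "<h1>My webpage</h1>"
    '<a href="https://www.google.com">Go to google</a>'
    '<a href="/shop.html">Shop</a>'
)

collector = LinkCollector("https://example.com/blog/post.html")
collector.feed(page)
print(collector.links)
```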

Later, form elements would be added which allow other kinds of interactivity.

Each element has a very plain default display style, and the elements follow each other vertically down the page in the order they were written in the HTML. Originally, these default display styles couldn't be changed, but now they can. However, you can still often see them on older websites or even notice them for a split second on some modern websites before the stylesheets have loaded! 😉

Browsers

The final piece of the puzzle was the web browser - a program designed for requesting files from web servers (principally HTML files) and displaying them. Berners-Lee wrote the first browser, but it didn't take long for things to get pretty complicated.

The technologies that underpin the web are open - that's the whole idea, and the main reason that it went global. In 1994, Berners-Lee founded the World Wide Web Consortium (W3C) to establish some standards for web technologies, such as the HTML specification.

The problem is, it's up to the companies and organisations that produce browsers to actually implement the standards, and there's nothing to really stop them from doing their own thing for any technical or business reason. In the early days of the web, websites tended to be designed specifically for one type of browser (and browser version!) because the browsers differed from each other so much. This competition did at least establish the precedent of browsers being free, however.

By the end of the first 'browser war' in 2001, Microsoft had triumphed over a company called Netscape by establishing Internet Explorer as the dominant browser. Netscape later became the non-profit Mozilla foundation, and they made their browser source code open-source and released it as Firefox in 2004. Firefox is still used today, with a usage share of around 5-10% on desktop browsers.

Also in 2004, individuals from Mozilla, Opera (another browser provider) and Apple formed, wait for it... the Web Hypertext Application Technology Working Group (WHATWG). WHATWG was born from dissatisfaction with W3C's approach, and it is now instrumental in driving web standards forward. Mozilla also began what is now the Mozilla Developer Network (MDN) web docs, a hugely valuable resource for web developers which provides a reliable picture of the various web standards and their actual implementation by browsers, which these days is, thankfully, a lot more consistent.

The period since then has seen Google's browser, Chrome (first released in 2008), win the second 'browser war' and rise to huge prominence. Not only does it have around 70% of usage share, almost all other notable browsers besides Firefox and Safari (Apple's browser) use Chromium, the core part of the browser, which is open-source. This includes Microsoft Edge, the successor to Internet Explorer, as of version 79. Google (and Microsoft) are also now part of the WHATWG's Steering Group.

Layers 🍰

Before we finish, let's take a step back again.

Most explanations of computer networking inevitably reference the Open Systems Interconnection (OSI) model. This is a general-purpose concept that defines the different layers of features or services in a network, and it will help us recap.

It defines 7 layers, each of which works independently of the others. This is helpful because when something goes wrong, you can first narrow the problem down to a specific layer of the network.

1. Physical Layer

This one is simple: networks rely on electricity, radio waves and the physical components that carry them to function. All of this makes up the physical layer.

2. Data Link Layer

This layer covers the connection between two nodes on a network, such as the transmission of data frames between two computers linked via ethernet. Error detection, and in some technologies correction, is done at this layer to reduce corruption of data caused by things like interference on the line.

Technologies at the physical and data link layer are standardised by the Institute of Electrical and Electronics Engineers (IEEE), in the IEEE 802 standards. This is a grouping of standards for things like Ethernet, WiFi, and so on.

3. Network Layer

The network layer refers to the forwarding of packets between networks. This is the layer that routers and the Internet Protocol operate at.

4. Transport Layer

This is about the delivery of the data between computers at either end of the connection, and getting data to the right software application. TCP falls under this layer.

Layers 5 and 6 are the Session Layer and the Presentation Layer, but we won't discuss them because they are not that relevant to the Internet Protocol Suite, which predates the OSI model.

7. Application Layer

This refers to the data the user sees and interacts with in an application such as a web browser, and the protocols that are closest to it like HTTP. If you're writing a web application and send an incorrect HTTP request, then the network problem is at the application layer.

Conclusion

Let's recap the key takeaways.

Networks come in many forms and sizes, and are built on layers of technology working together.
The internet brings different computer networks together and allows a variety of services such as email, file transfer, and of course, the web, to exist.
The internet and the web were created with the advantages in mind, and the disadvantages were discovered later. We're only now just starting to get a grip on the monumental ways they can affect our lives. That the web will be free, secure and a force for good is far from a given, and achieving it takes a lot of work from people that believe in those ideals.

What is made of the Web is up to us. You, me, and everyone else — Tim Berners-Lee

Sources

I cross-reference my sources as much as possible. If you think some information in this article is incorrect, please leave a polite comment or message me with supporting evidence 🙂.