Farhan

Posted on Sep 11, 2021 • Edited on Sep 13, 2021

Deep dive into Linux Networking and Docker - Bridge, vETH and IPTables

#kubernetes #docker #linux #computerscience

This was originally published here: https://aly.arriqaaq.com/linux-networking-bridge-iptables-and-docker/

This series of articles is my log of learning about various networking concepts related to container orchestration platforms (Docker, Kubernetes, etc.).

Linux Networking is a very interesting topic. In this series, my aim is to dig deep to understand the various ways in which these container orchestration platforms implement network internals underneath.

Getting Started

A few questions before getting started.

1) What are namespaces?

TL;DR: a Linux namespace is an abstraction over resources in the operating system. Namespaces are like separate houses with their own sets of isolated resources. On kernels older than 5.6, such as the one used in this article, there are 7 types of namespaces: Cgroup, IPC, Network, Mount, PID, User and UTS (Linux 5.6 added an eighth, Time).

Network isolation is what we are interested in, so we will be discussing in depth about network namespaces.

2) How to follow along?

All the examples in this article have been made on a fresh vagrant Ubuntu Bionic virtual machine.

vagrant init ubuntu/bionic64
vagrant up
vagrant ssh

Exploring Network Namespaces

How do platforms virtualise network resources to isolate containers, assigning each one a dedicated network stack and making sure it does not interfere with the host (or neighbouring containers)? With a network namespace. A network namespace isolates network-related resources: a process running in a distinct network namespace has its own networking devices, routing tables, firewall rules, etc. Let’s create one quickly.

ip netns add ns1

And voila! You have your isolated network namespace (ns1) created just like that. Now you can go ahead and run any process inside this namespace.

ip netns exec ns1 python3 -m http.server 8000

This was pretty neat! ip netns exec $namespace $command executes $command in the named network namespace $namespace. This means that the process runs within its own network stack, separate from the host, and can communicate only through the interfaces defined in that network namespace.

Host Namespace

Before you read ahead, I’d like to draw your attention to the default namespace for the **host** network. Let’s list all the namespaces:

ip netns

# all namespaces
ns1
default

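Notice the **default** namespace in the list. This is the **host** namespace, which means that any service you run directly on your VM or machine runs in this namespace. This will be important to note moving forward. (ip netns only lists namespaces registered by name under /var/run/netns, so the host namespace may not show up in this list on your machine; it exists either way.)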
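
As far as I understand it, ip netns add registers a named namespace by creating an entry under /var/run/netns, and ip netns essentially lists that directory. Here is a minimal sketch that mimics the listing. The function name list_named_netns and its optional directory argument are made up for illustration, they are not part of iproute2.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list named network namespaces by reading the
# directory where iproute2 registers them (assumed: /var/run/netns).
list_named_netns() {
    local dir entry
    dir="${1:-/var/run/netns}"
    # No directory means no named namespace has been created yet.
    [ -d "$dir" ] || return 0
    for entry in "$dir"/*; do
        [ -e "$entry" ] || continue   # skip the unexpanded glob when empty
        basename "$entry"
    done
}

# On the VM this should print ns1 once "ip netns add ns1" has been run.
list_named_netns
```

The host namespace has no entry in that directory, which is why a process can live in it without it ever being "added".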

Creating a Network Namespace

So with that said, let’s quickly move forward and create two isolated network namespaces (similar to two containers):

#!/usr/bin/env bash

NS1="ns1"
NS2="ns2"

# create namespace
ip netns add $NS1
ip netns add $NS2

Connecting the cables

We need to go ahead and connect these namespaces to our host network. The vETH (virtual Ethernet) device helps in making this connection. A vETH is a local Ethernet tunnel, and the devices are created in pairs. Packets transmitted on one device in the pair are immediately received on the other device. When either device is down, the link state of the pair is down.

#!/usr/bin/env bash

NS1="ns1"
VETH1="veth1"
VPEER1="vpeer1"

NS2="ns2"
VETH2="veth2"
VPEER2="vpeer2"

# create namespace
ip netns add $NS1
ip netns add $NS2

# create veth link
ip link add ${VETH1} type veth peer name ${VPEER1}
ip link add ${VETH2} type veth peer name ${VPEER2}

Think of VETH like a network cable. One end is attached to the host network, and the other end to the network namespace created. Let’s go ahead and connect the cable, and bring these interfaces up.

# setup veth link
ip link set ${VETH1} up
ip link set ${VETH2} up

# add peers to ns
ip link set ${VPEER1} netns ${NS1}
ip link set ${VPEER2} netns ${NS2}

Localhost

Ever wondered how localhost works? Well, the loopback interface directs the traffic to remain within the local system. So when you run something on localhost (127.0.0.1), you are essentially using the loopback interface to route the traffic through. Let’s bring the loopback interface up in case we’d want to run a service locally, and also bring up the peer interfaces inside our network namespace to start accepting traffic.

# setup loopback interface
ip netns exec ${NS1} ip link set lo up
ip netns exec ${NS2} ip link set lo up

# setup peer ns interface
ip netns exec ${NS1} ip link set ${VPEER1} up
ip netns exec ${NS2} ip link set ${VPEER2} up

In order to connect to a network, a computer must have at least one network interface. Typically each network interface has its own unique IP address, and the IP address that you give to a host is assigned to its network interface. But does every network interface require an IP address? Well, not really. We’ll see that in the coming steps.

# assign ip address to ns interfaces
VPEER_ADDR1="10.10.0.10"
VPEER_ADDR2="10.10.0.20"

ip netns exec ${NS1} ip addr add ${VPEER_ADDR1}/16 dev ${VPEER1}
ip netns exec ${NS2} ip addr add ${VPEER_ADDR2}/16 dev ${VPEER2}

Remember, here we’ve only assigned network addresses to the interfaces inside the network namespaces (vpeer1 in ns1, vpeer2 in ns2). The host-side interfaces (veth1, veth2) do not have an IP assigned. Why? Do we need one? Well, not really: they will only act as ports on a bridge, which forwards traffic at Layer 2.
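
To see why the /16 matters, here is a small sketch that computes the network address for each of the addresses used in this article. The helper names (ip_to_int, int_to_ip, network_of) are my own, made up for this illustration. Both peers should land in the same network, which is what lets them reach each other without a router.

```shell
#!/usr/bin/env bash
# Illustrative helpers; the names are made up for this sketch.

# ip_to_int A.B.C.D: print the address as a 32-bit integer.
ip_to_int() {
    local IFS=.
    set -- $1                      # split on dots into $1..$4
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# int_to_ip N: print a 32-bit integer in dotted-quad form.
int_to_ip() {
    echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# network_of ADDRESS PREFIX_LENGTH: print the network address.
network_of() {
    local addr mask
    addr=$(ip_to_int "$1")
    mask=$(( (4294967295 << (32 - $2)) & 4294967295 ))
    int_to_ip $(( $addr & $mask ))
}

for addr in 10.10.0.10 10.10.0.20; do
    echo "$addr/16 -> network $(network_of "$addr" 16)"
done
```

On the VM, ip netns exec ns1 ip route should show a matching connected route for that network on vpeer1.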

Build Bridges, Not Walls

Men build too many walls and not enough bridges

Remember that when you have multiple containers running and want to send traffic to these containers, we need a bridge to connect them. A network bridge creates a single, aggregate network from multiple communication networks or network segments. A bridge is a way to connect two Ethernet segments together in a protocol-independent way. **Packets are forwarded based on Ethernet address**, rather than IP address (like a router). Since forwarding is done at Layer 2, all protocols can go transparently through a bridge.

Bridging is distinct from routing. Routing allows multiple networks to communicate independently and yet remain separate, whereas bridging connects two separate networks as if they were a single network.
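
Forwarding by Ethernet address is easier to picture with a toy model. This is a deliberately simplified sketch of a learning bridge, not how the kernel implements it: the bridge remembers which port each source MAC was seen on, forwards to that port when the destination is known, and floods otherwise. The MAC addresses and function names are hypothetical.

```shell
#!/usr/bin/env bash
# Toy learning bridge. FDB is the forwarding database: one "MAC PORT" per line.
FDB=""

# fdb_lookup MAC: print the port learned for MAC, if any.
fdb_lookup() {
    printf '%s\n' "$FDB" | awk -v mac="$1" '$1 == mac { print $2; exit }'
}

# fdb_learn MAC PORT: remember (or update) the port a MAC was seen on.
fdb_learn() {
    FDB=$( { printf '%s\n' "$FDB" | awk -v mac="$1" 'NF && $1 != mac'
             printf '%s %s\n' "$1" "$2"; } )
}

# bridge_forward SRC_MAC DST_MAC IN_PORT: decide what happens to one frame.
bridge_forward() {
    local out
    fdb_learn "$1" "$3"
    out=$(fdb_lookup "$2")
    if [ -n "$out" ]; then
        echo "forward to $out"
    else
        echo "flood to all ports except $3"
    fi
}

# Hypothetical MACs for the two namespace peers.
bridge_forward aa:aa:aa:aa:aa:01 aa:aa:aa:aa:aa:02 veth1   # destination unknown: flood
bridge_forward aa:aa:aa:aa:aa:02 aa:aa:aa:aa:aa:01 veth2   # reply: forward to veth1
bridge_forward aa:aa:aa:aa:aa:01 aa:aa:aa:aa:aa:02 veth1   # now known: forward to veth2
```

Notice that no IP address appears anywhere in the decision. On a real bridge, bridge fdb show br br0 should list the learned entries.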

Docker has a **docker0** bridge underneath to direct traffic. When the Docker service starts, a Linux bridge is created on the host machine. The various interfaces on the containers talk to the bridge, and the bridge proxies to the external world. Multiple containers on the same host can talk to each other through the Linux bridge.

So let’s go ahead and create a bridge.

BR_ADDR="10.10.0.1"
BR_DEV="br0"

# setup bridge
ip link add ${BR_DEV} type bridge
ip link set ${BR_DEV} up

# assign veth pairs to bridge
ip link set ${VETH1} master ${BR_DEV}
ip link set ${VETH2} master ${BR_DEV}

# setup bridge ip
ip addr add ${BR_ADDR}/16 dev ${BR_DEV}

Now that we have our network interfaces connected to the bridge, how do these interfaces know how to direct traffic to the host? The route tables in both network namespaces only have route entries for their respective subnet IP range.

Since we have the VETH pairs connected to the bridge, the bridge network address is available to these network namespaces. Let’s add a default route to direct the traffic to the bridge.

# add default routes for ns
ip netns exec ${NS1} ip route add default via ${BR_ADDR}
ip netns exec ${NS2} ip route add default via ${BR_ADDR}
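
To make the effect of the default route concrete, here is a rough sketch of the decision a namespace now makes for each destination. It is a simplification (the kernel does a longest-prefix match over the whole routing table), and next_hop and ip_to_int are made-up names for this illustration.

```shell
#!/usr/bin/env bash
# ip_to_int A.B.C.D: print the address as a 32-bit integer.
ip_to_int() {
    local IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# next_hop DEST LOCAL_IP PREFIX_LENGTH GATEWAY
# Simplified model: one connected subnet plus one default route.
next_hop() {
    local mask dest_net local_net
    mask=$(( (4294967295 << (32 - $3)) & 4294967295 ))
    dest_net=$(( $(ip_to_int "$1") & $mask ))
    local_net=$(( $(ip_to_int "$2") & $mask ))
    if [ "$dest_net" -eq "$local_net" ]; then
        echo "direct"       # same subnet: deliver straight to the neighbour
    else
        echo "via $4"       # anything else: hand it to the default gateway
    fi
}

# From ns1 (10.10.0.10/16) with the bridge (10.10.0.1) as default gateway:
echo "10.10.0.20 -> $(next_hop 10.10.0.20 10.10.0.10 16 10.10.0.1)"
echo "8.8.8.8    -> $(next_hop 8.8.8.8 10.10.0.10 16 10.10.0.1)"
```

On the VM you can ask the kernel the same question with ip netns exec ns1 ip route get 8.8.8.8.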

Done. Sweet! We have a proper setup to test if our containers can talk to each other. Let’s finally interact with the namespaces.

# check connectivity between the namespaces
ip netns exec ${NS1} ping ${VPEER_ADDR2}

PING 10.10.0.20 (10.10.0.20) 56(84) bytes of data.
64 bytes from 10.10.0.20: icmp_seq=1 ttl=64 time=0.045 ms
64 bytes from 10.10.0.20: icmp_seq=2 ttl=64 time=0.039 ms

ip netns exec ${NS2} ping ${VPEER_ADDR1}
PING 10.10.0.10 (10.10.0.10) 56(84) bytes of data.
64 bytes from 10.10.0.10: icmp_seq=1 ttl=64 time=0.045 ms
64 bytes from 10.10.0.10: icmp_seq=2 ttl=64 time=0.039 ms

MASQUERADE

We are able to send traffic between the namespaces, but we haven’t tested sending traffic outside the container. And for that, we’d need to use IPTables to masquerade the outgoing traffic from our namespace.

# enable ip forwarding
bash -c 'echo 1 > /proc/sys/net/ipv4/ip_forward'

iptables -t nat -A POSTROUTING -s ${BR_ADDR}/16 ! -o ${BR_DEV} -j MASQUERADE

MASQUERADE modifies the source address of the packet, replacing it with the address of the interface the packet leaves through. This is similar to SNAT, except that it does not require the machine’s IP address to be known in advance.

Basically, what we are doing here is adding an entry to the NAT table that masquerades outgoing traffic from the bridge subnet, except for traffic that leaves through the bridge itself.

With this, we are done with a basic setup of how Docker uses the Linux network stack to isolate containers. You can find the entire script here.
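
The two match conditions in that rule (-s for the source subnet, ! -o for "not leaving via the bridge") are easy to misread, so here is a small sketch that only models which packets the rule would match. It does not touch iptables at all, the function names are made up, and the values 10.10.0.1/16 and br0 are hard-coded from the script above.

```shell
#!/usr/bin/env bash
# ip_to_int A.B.C.D: print the address as a 32-bit integer.
ip_to_int() {
    local IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# would_masquerade SRC_IP OUT_IFACE
# Succeeds when a packet would match: -s 10.10.0.1/16 ! -o br0
would_masquerade() {
    local mask src_net rule_net
    mask=$(( (4294967295 << 16) & 4294967295 ))        # /16
    src_net=$(( $(ip_to_int "$1") & $mask ))
    rule_net=$(( $(ip_to_int 10.10.0.1) & $mask ))
    # source must be in the bridge subnet AND the packet must not exit via br0
    [ "$src_net" -eq "$rule_net" ] && [ "$2" != "br0" ]
}

for sample in "10.10.0.10 enp0s3" "10.10.0.10 br0" "10.0.2.15 enp0s3"; do
    set -- $sample
    if would_masquerade "$1" "$2"; then
        echo "src $1 out $2: MASQUERADE"
    else
        echo "src $1 out $2: left alone"
    fi
done
```

So traffic between ns1 and ns2 stays untouched, and only traffic heading out of the host gets its source rewritten. To inspect the real rule on the VM, iptables -t nat -S POSTROUTING should list it.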

Now let’s dive deep into how docker works with various networking setups.

How does Docker work?

Each Docker container has its own network stack, where a new network namespace is created for each container, isolated from other containers. When a Docker container launches, the Docker engine assigns it a network interface with an IP address, a default gateway, and other components, such as a routing table and DNS services.

Docker offers five network modes. The mode is chosen per container with the --net flag (also spelled --network); only the default bridge mode uses docker0.

1. Host Networking (--net=host): The container shares the network namespace of the host.

You can verify this easily.

# check the network interfaces on the host
ip addr

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 02:98:b0:9b:6c:78 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic enp0s3
       valid_lft 85994sec preferred_lft 85994sec
    inet6 fe80::98:b0ff:fe9b:6c78/64 scope link
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:e5:72:10:c0 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever

Run docker in host mode, and you will see it lists out the same set of interfaces.

# check the network interfaces in the container
docker run --net=host -it --rm alpine ip addr

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 02:98:b0:9b:6c:78 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic enp0s3
       valid_lft 85994sec preferred_lft 85994sec
    inet6 fe80::98:b0ff:fe9b:6c78/64 scope link
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:e5:72:10:c0 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever

2. Bridge Networking (--net=bridge, the default): In this mode, the default bridge is used as the bridge for containers to connect to each other. The container runs in an isolated network namespace. Communication is open to other containers in the same network. Communication with services outside of the host goes through network address translation (NAT) before exiting the host. We’ve already seen the creation of a bridge network above.

# check the network interfaces in the container
docker run --net=bridge -it --rm alpine ip addr

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
16: eth0@if17: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever

Notice that eth0 is one end of a veth pair created for this container; the other end should exist on the host machine:

# check the network interfaces on the host
ip addr

21: veth8a812a3@if20: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
    link/ether d2:c4:4e:d4:08:ad brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::d0c4:4eff:fed4:8ad/64 scope link
       valid_lft forever preferred_lft forever
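
A note on reading these names: in eth0@if17, the @if17 suffix is the interface index of the other end of the veth pair. Here is a small sketch that pulls both numbers out of ip addr or ip link output; peer_index is a made-up name. (The two sample outputs in this section seem to come from different container runs, so their indices do not pair up with each other.)

```shell
#!/usr/bin/env bash
# peer_index: read `ip addr` / `ip link` output on stdin and print
# "<name> index=<own index> peer=<peer index>" for each name@ifN interface.
peer_index() {
    awk '/^[0-9]+: [^ ]+@if[0-9]+:/ {
        own = $1; sub(/:$/, "", own)          # "16:" -> "16"
        split($2, parts, "@if")               # "eth0@if17:" -> "eth0", "17:"
        peer = parts[2]; sub(/:$/, "", peer)
        print parts[1], "index=" own, "peer=" peer
    }'
}

# Header lines copied from the outputs in this article.
printf '%s\n' \
    '1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000' \
    '16: eth0@if17: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP' \
    | peer_index
```

On a live system you would pipe the real output in, e.g. ip -o link | peer_index on the host, and match the peer number against the index seen inside the container.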

3. Custom bridge network (--network=xxx): This is the same as Bridge Networking but uses a custom bridge explicitly created for containers.

# create custom bridge
docker network create foo
2b25342b1d883dd134ed8a36e3371ef9c3ec77cdb9e24a0365165232e31b17b6

# check the bridge interface on the host
ip addr

22: br-2b25342b1d88: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:49:79:07:30 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.1/16 brd 172.18.255.255 scope global br-2b25342b1d88
       valid_lft forever preferred_lft forever
    inet6 fe80::42:49ff:fe79:730/64 scope link
       valid_lft forever preferred_lft forever

You can see that creating a custom network adds a bridge interface to the host. All containers on a custom bridge can communicate with the ports of the other containers on that same bridge. This provides better isolation and security.
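
Judging from the output above, the bridge interface is named br- followed by the first 12 characters of the network ID that docker network create printed. A tiny sketch of that mapping, where bridge_name_for is a made-up helper and the naming rule is inferred from this output rather than taken from Docker's documentation:

```shell
#!/usr/bin/env bash
# bridge_name_for NETWORK_ID: guess the host bridge interface name
# (assumption: "br-" + first 12 characters of the network ID).
bridge_name_for() {
    printf 'br-%s\n' "$(printf '%s' "$1" | cut -c1-12)"
}

bridge_name_for 2b25342b1d883dd134ed8a36e3371ef9c3ec77cdb9e24a0365165232e31b17b6
# prints: br-2b25342b1d88
```

If you have lost the ID, docker network inspect -f '{{.Id}}' foo should print it again.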

Now let’s run two containers in different terminals:

# terminal 1
docker run -it --rm --name=container1 --network=foo alpine sh

# terminal 2
docker run -it --rm --name=container2 --network=foo alpine sh

# check the network interfaces on the host
ip addr

22: br-2b25342b1d88: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:49:79:07:30 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.1/16 brd 172.18.255.255 scope global br-2b25342b1d88
       valid_lft forever preferred_lft forever
    inet6 fe80::42:49ff:fe79:730/64 scope link
       valid_lft forever preferred_lft forever
30: veth86ca323@if29: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-2b25342b1d88 state UP group default
    link/ether 1e:5e:66:ea:47:1e brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::1c5e:66ff:feea:471e/64 scope link
       valid_lft forever preferred_lft forever
32: vethdf5e755@if31: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-2b25342b1d88 state UP group default
    link/ether ba:2b:25:23:a3:40 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::b82b:25ff:fe23:a340/64 scope link
       valid_lft forever preferred_lft forever

As expected, with bridge networking, both containers (container1, container2) have got their respective veth (veth86ca323, vethdf5e755) cable attached. You can verify this bridge simply by running:

# you can notice both the containers are connected via the same bridge
brctl show br-2b25342b1d88

bridge name         bridge id           STP enabled   interfaces
br-2b25342b1d88     8000.024249790730   no            veth86ca323
                                                      vethdf5e755
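
brctl comes from the bridge-utils package and may not be installed everywhere. The same information is visible in the master field of ip link output, so here is a sketch that extracts the members of a bridge from it. bridge_members is a made-up name, and the sample lines are copied from the ip addr output above.

```shell
#!/usr/bin/env bash
# bridge_members BRIDGE: read `ip link` style lines on stdin and print the
# names of interfaces whose "master" is BRIDGE.
bridge_members() {
    awk -v br="$1" '{
        for (i = 1; i < NF; i++) {
            if ($i == "master" && $(i + 1) == br) {
                name = $2
                sub(/:$/, "", name)      # drop the trailing colon
                sub(/@.*$/, "", name)    # drop the @ifN peer suffix
                print name
            }
        }
    }'
}

bridge_members br-2b25342b1d88 <<'EOF'
22: br-2b25342b1d88: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
30: veth86ca323@if29: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-2b25342b1d88 state UP group default
32: vethdf5e755@if31: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-2b25342b1d88 state UP group default
EOF
```

On the host you could feed it live data with ip -o link | bridge_members br-2b25342b1d88, or skip the parsing entirely and run ip link show master br-2b25342b1d88.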

4. Container-defined Networking (--net=container:$container2): With this enabled, the container created shares its network namespace with the container called $container2.

# create a container in terminal 1
docker run -it --rm --name=container1  alpine sh

# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
33: eth0@if34: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever

# create a second container in terminal 2, sharing container1's network namespace
docker run -it --rm --name=container2 --network=container:container1 alpine ip addr

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
33: eth0@if34: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever

You can see that both containers share the same network interface.
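
Another way to check this, sketched under the assumption that you are on Linux with /proc mounted: every process exposes its network namespace as the symlink /proc/<pid>/ns/net, and two processes are in the same namespace when those links resolve to the same value. The function names netns_id and same_netns are made up for this sketch.

```shell
#!/usr/bin/env bash
# netns_id PID: print the network namespace identifier of a process,
# a string of the form net:[<inode number>].
netns_id() {
    readlink "/proc/$1/ns/net"
}

# same_netns PID1 PID2: succeed when both processes share a network namespace.
# Returns 2 when either link cannot be read (e.g. missing permissions).
same_netns() {
    local a b
    a=$(netns_id "$1") || return 2
    b=$(netns_id "$2") || return 2
    [ "$a" = "$b" ]
}

# Trivial self-check: this shell shares a namespace with itself.
if same_netns "$$" "$$"; then
    echo "same network namespace"
fi
```

For the two containers above you would compare their main PIDs, which docker inspect -f '{{.State.Pid}}' container1 should give you. Reading another user's /proc entries usually needs root.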

5. No networking (--net=none): This option disables all networking for the container.

docker run --net=none alpine ip addr

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever

You can notice that only the lo (loopback) interface is enabled, nothing else is configured in this container.

In the next article, we will dive deeper (inshaAllah) into how docker manipulates iptables rules to provide network isolation.