Alan P John

Posted on Mar 5, 2022 • Edited on Mar 14, 2022 • Originally published at alanjohn.dev

Deep-dive into Containerization: Creating containers from scratch

#docker #linux #os #bash

Containers have changed the landscape for how we design, develop, and deploy your applications. Today, cloud native technologies are transforming IT ecosystems largely thanks to containerization. In this article, I’ll be creating a container runtime using shell commands. Ideally it’s not recommended to implement your own container runtime. This article is just to get a better understanding of lower level linux functionalities which will help in

Designing more secure images.
Using images more efficiently.
Debugging while using higher level tools

Linux Fundamentals

Before I start the deep dive, it is important to be familiar with certain linux concepts.

Cgroups

Cgroups limit the resources, such as memory, CPU, and network input/output, that a group of processes can use. There is a hierarchy of control groups for each type of resource being managed, and each hierarchy is managed by a cgroup controller. Any Linux process is a member of one cgroup of each type, and when it is first created, a process inherits the cgroups of its parent.

The Linux kernel communicates information about cgroups through a set of pseudo-filesystems that typically reside at /sys/fs/cgroup. You can see the different types of cgroups on your system by listing the contents of that directory.

user@myPChostname:~$ ls /sys/fs/cgroup/
blkio  cpuacct      cpuset   freezer  memory  net_cls           net_prio    pids  systemd
cpu    cpu,cpuacct  devices  hugetlb  misc    net_cls,net_prio  perf_event  rdma  unified

If you have docker installed on the system and look inside the /sys/fs/cgroup/memory directory, you’ll find a directory for docker. All the files in this directory define different kinds of memory limits on your docker containers. You’ll find a similar directory in /sys/fs/cgroup/cpu where the cpu limits for your docker containers are defined.

Namespaces

By putting a process in a namespace, you can restrict the resources that are visible to that process.

Linux kernel 5.6 currently provides 8 namespaces:

pid : provides a process with its own set of process IDs
net : allows processes to have their own network stack
mnt : abstracts filesystem view and manages mount points
ipc : provides separation of named shared memory segments
user : provides processes with their own set of user IDs and group IDs
uts : allows processes to have own domain name and hostname
cgroup : allows a process to have its own set of cgroup root directories
time : virtualize the clock of the system

A process is always in exactly one namespace of each type. When you start a Linux system it has a single namespace of each type. You can easily see the namespaces on your machine using the lsns command.

The unshare command allows us to create sub processes that don't share namespaces with their parent process. You can also use the nsenter command to specify namespaces for a process. In this article, I’ll stick to using unshare.

Creating a Containerized Process

Containers seem to be very similar to virtual machines, but it’s crucial to understand that they are very different. While virtual machines emulate a complete machine, including the operating system and a kernel, containers share the kernel of the host machine and, as explained, are only isolated processes.

Hostname

Let’s start by isolating the hostname. If you run the hostname command from within a docker container, you can see that it’s a different hostname than your host.

user@myPChostname:~$ hostname
myPChostname
user@myPChostname:~$ docker run --rm -it --name hello centos bash
[root@f1e54241a12b /]$ hostname
f1e54241a12b

To achieve a similar isolation, we need to give its own UTS namespace using the unshare command.

💡 I am running these bash commands on an ubuntu VM created using multipass.

To create a new UTS namespace, we can use the --uts flag with unshare.

ubuntu@host:/$ hostname
host
ubuntu@host:/$ sudo unshare --uts bash
root@host:/$ hostname child
root@host:/$ hostname
child

If you were to open another terminal window to the same host before exit, you can confirm that the hostname hasn’t changed for the whole (virtual) machine.

ubuntu@host:/$ hostname
host

Filesystem

Next, we need to give our containerized process it’s own root filesystem so that it does not access the host root. We’ll be using the --root option to do that. This will help us assign a directory as the new root. But before we do that, for any directory to be a root directory, it requires a root filesystem which includes directories such as /bin, /proc etc. So I am going to download the alpine minirootfs to quickly create a minimal root filesystem in my new directory. You can also export root filesystems from existing docker containers if you want.

ubuntu@host:~$ mkdir container_root
ubuntu@host:~$ cd container_root/
ubuntu@host:~/container_root$ curl -o alpine.tar.gz https://dl-cdn.alpinelinux.org/alpine/latest-stable/releases/x86_64/alpine-minirootfs-3.15.0-x86_64.tar.gz
ubuntu@host:~/container_root$ tar xvf alpine.tar.gz
ubuntu@host:~/container_root$ rm alpine.tar.gz
ubuntu@host:~/container_root$ ls
bin  etc   lib    mnt  proc  run   srv  tmp  var
dev  home  media  opt  root  sbin  sys  usr

So now to use the --root option with the unshare command

ubuntu@host:/$ sudo unshare --uts \
--root=/home/ubuntu/container_root \
sh
/$ pwd
/
/$ ls /
bin    etc    lib    mnt    proc   run    srv    tmp    var
dev    home   media  opt    root   sbin   sys    usr

The root directory of the containerized process is no longer the root directory of our host system. Which also means we can’t use the commands like bash as the /bin in our new root filesystem doesn’t have bash even though the host machine /bin did.

ubuntu@host:/$ sudo unshare --uts \
--root=/home/ubuntu/container_root \
bash
chroot: failed to run command ‘bash’: No such file or directory

Processes

Now if we run the ps command, as you can see we can’t see any processes at all. That’s because the ps command runs by listing the /proc psuedo-filesystem. All processes have their own directory within the /proc . You can run ls /proc on your linux system to see how it looks like. You can read more about the /proc filesystem here.

You can mount /proc using the mount command or by using the --mount-proc flag.

ubuntu@host:/$ sudo unshare --uts  \
--mount-proc=proc \
--root=/home/ubuntu/container_root \
sh
/$ ps
PID   USER     TIME  COMMAND
    1 root      0:01 {systemd} /sbin/init
    2 root      0:00 [kthreadd]
    3 root      0:00 [rcu_gp]
    4 root      0:00 [rcu_par_gp]
    6 root      0:00 [kworker/0:0H-kb]
        ... <truncated>

Now, we can see all the processes running on host which is not right. Containers should not be able to access the processes of the host machine. To isolate the host processes, We use the --pid flag with unshare to get a new PID namespace. Along with that we also need to use the --fork flag. This is useful when creating a new pid namespace as --fork runs the specified program as a child process of unshare rather than running it directly.

ubuntu@host:/$ sudo unshare --uts \
--pid --fork \
--mount-proc=proc \
--root=/home/ubuntu/container_root sh
/$ ps
PID   USER     TIME  COMMAND
    1 root      0:00 sh
    2 root      0:00 ps

Mounts

Now we have our processes isolated. The next namespace we need to look into is the mount namespace. We can do that using the --mount flag in the unshare command. This isolation comes handy in keeping in making sure host directories mounted into container are not visible from other containers.

ubuntu@host:~$ sudo unshare --uts \
--pid --fork \
--mount \
--mount-proc=proc \
--root=/home/ubuntu/container_root \
sh

Networking interfaces

Containers have their own networking interface and routing tables. This requires the process to have a separate network namespace which can be set using --net flag.

ubuntu@host:~$ sudo unshare --uts \
--net \
--pid --fork \
--mount \
--mount-proc=proc \
--root=/home/ubuntu/container_root \
sh
/$ ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

When we create our network namespace, we only have a loopback interface. The container will be unable to communicate if it only has a loopback interface. We need to estabilish a virtual Ethernet interface connecting the container network namespace to the default one.

While keeping the containerized process running in one terminal. Open another with root priveleges. Create a virtual ethernet interface on your host machine. You’ll need to know your container’s pid for that. We can use the lsns command to find that.

ubuntu@host:/$ sudo lsns -t net
        NS TYPE NPROCS   PID USER    NETNSID NSFS COMMAND
4026531992 net      93     1 root unassigned      /sbin/init
4026532193 net       2  2241 root unassigned      unshare --uts --net --pid --fork --mount --mo
ubuntu@host:/$ sudo ip link add ve1 netns 2241 type veth peer name ve2 netns 1

Then we need to get the connection up. On the host machine

ubuntu@host:/$ sudo ip link set ve2 up

In the container process

/$ sudo ip link set ve1 up

Now that the connection is up, we assign IP. on the host machine run

ubuntu@host:/$ sudo ip addr add 192.168.1.200/24 dev ve2

on the container process run

/$ ip addr add 192.168.1.100/24 dev ve1

now you should be able to ping the host from the container and vice versa, allowing your container to communicate with other processes.

Inter process communication

Different processes communicate with each other with the help of a shared range of memory. For that, they need to part of the same IPC namespace.We generally wouldn’t want our containers to be able to access one another’s shared memory. In which case we can use the --ipc flag.

ubuntu@host:~$ sudo unshare --uts \
--net --ipc \
--pid --fork \
--mount \
--mount-proc=proc \
--root=/home/ubuntu/container_root \
sh

Cgroups

You can use the --cgroup flag to create a new cgroup namespace which makes sure that your container process cannot see any higher cgroup configuration.

ubuntu@host:~$ sudo unshare --uts \
--net --ipc --cgroup \
--pid --fork \
--mount \
--mount-proc=proc \
--root=/home/ubuntu/container_root \
sh

Users

Currently, the user in the containerized process is the root user because we use sudo.

ubuntu@host:~$ sudo unshare --uts \
--net --ipc \
--pid --fork \
--cgroup --mount \
--mount-proc=proc \
--root=/home/ubuntu/container_root \
sh
/ $ id
uid=0(root) gid=0(root) groups=0(root)

To prevent this, we create a separate user namespace for the container process with the help of --user flag.

ubuntu@host:~$ sudo unshare --user \
--uts --net --ipc \
--pid --fork \
--cgroup --mount \
--mount-proc=proc \
--root=/home/ubuntu/container_root \
sh
~ $ id
uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)

Now the user assigned to the container process is "nobody". We can map this uid to a non-root user on the host machine by making a change in the /proc/<pid>/uid_map where <pid> is your container process pid. The user namespace is created first when you run unshare with the --user flag and you are automatically root in the container user namespace. This means you can create namespaces inside the containerized process while running unshare without sudo allowing us to run containers without any root privileges. (rootless containers)

ubuntu@host:~$ unshare --uts \
--net --ipc \
--pid --fork \
--cgroup --mount \
--mount-proc=proc \
--root=/home/ubuntu/container_root \
sh
unshare: unshare failed: Operation not permitted
ubuntu@host:~$ unshare --user \
--uts --net --ipc \
--pid --fork \
--cgroup --mount \
--mount-proc=proc \
--root=/home/ubuntu/container_root \
sh
~ $

And here we have our container ready!!

All your container runtime tools are wrappers around these in-built features which provide you more ease and flexibility of configuration.

DEV Community

Deep-dive into Containerization: Creating containers from scratch

Linux Fundamentals

Cgroups

Namespaces

Creating a Containerized Process

Hostname

Filesystem

Processes

Mounts

Networking interfaces

Inter process communication

Cgroups

Users

More resources

Top comments (0)

Read next

Accessing Remote Databases Without VPN Using SSH Tunnels

Database Management Tool

Choosing a VPS/VDS Server in 2024: My TOP-4 Picks

Docker | Docker Compose | Kubernetes