DEV Community

Cover image for What's inside a container?
Gaurav
Gaurav

Posted on

What's inside a container?

Hi People
This is my first post on Dev.to
Nowadays everyone is using containerized solution for CI/CD.
I always wonder what makes a container to run in a host machine as a separate process/CPU/memory all along.

So here are some of my findings:

Processes

Containers are just normal Linux Processes with additional configuration applied. Launch the following Redis container so we can see what is happening under the covers.

 docker run -d --name=db redis:alpine 
Enter fullscreen mode Exit fullscreen mode

Alt Text

The Docker container launches a process called redis-server. From the host, we can view all the processes running, including those started by Docker.

ps aux | grep redis-server
Enter fullscreen mode Exit fullscreen mode

Alt Text

Docker can help us identify information about the process including the PID (Process ID) and PPID (Parent Process ID) via

docker top db
Enter fullscreen mode Exit fullscreen mode

Alt Text
Who is the PPID? Use

ps aux | grep <ppid>
Enter fullscreen mode Exit fullscreen mode

to find the parent process. Likely to be Containerd.

The command pstree will list all of the sub processes. See the Docker process tree using

pstree -c -p -A $(pgrep dockerd)
Enter fullscreen mode Exit fullscreen mode

Alt Text

As you can see, from the viewpoint of Linux, these are standard processes and have the same properties as other processes on our system.

Process Directory

Linux is just a series of magic files and contents, this makes it fun to explore and navigate to see what is happening under the covers, and in some cases, change the contents to see the results.

The configuration for each process is defined within the /proc directory. If you know the process ID, then you can identify the configuration directory.

The command below will list all the contents of /proc, and store the Redis PID for future use.

DBPID=$(pgrep redis-server)
echo Redis is $DBPID
ls /proc
Enter fullscreen mode Exit fullscreen mode

Each process has it's own configuration and security settings defined within different files.

ls /proc/$DBPID
Enter fullscreen mode Exit fullscreen mode

Alt Text
For example, you can view and update the environment variables defined to that process.

cat /proc/$DBPID/environ
Enter fullscreen mode Exit fullscreen mode

Alt Text

docker exec -it db env
Enter fullscreen mode Exit fullscreen mode

Namespaces

One of the fundamental parts of a container is namespaces. The concept of namespaces is to limit what processes can see and access certain parts of the system, such as other network interfaces or processes.

When a container is started, the container runtime, such as Docker, will create new namespaces to sandbox the process. By running a process in it's own Pid namespace, it will look like it's the only process on the system.

The available namespaces are:

Mount (mnt)
Process ID (pid)
Network (net)
Interprocess Communication (ipc)
UTS (hostnames)
User ID (user)
Control group (cgroup)

More information at https://en.wikipedia.org/wiki/Linux_namespaces

Unshare can launch "contained" processes.
Without using a runtime such as Docker, a process can still operate within it's own namespace. One tool to help is unshare.

unshare --help

With unshare it's possible to launch a process and have it create a new namespace, such as Pid. By unsharing the Pid namespace from the host, it looks like the bash prompt is the only process running on the machine.

sudo unshare --fork --pid --mount-proc bash
ps
exit
Enter fullscreen mode Exit fullscreen mode

Alt Text
What happens when we share a namespace?
Under the covers, Namespaces are inode locations on disk. This allows for processes to shared/reused the same namespace, allowing them to view and interact.

List all the namespaces with

ls -lha /proc/$DBPID/ns/
Enter fullscreen mode Exit fullscreen mode

Alt Text
Another tool, NSEnter is used to attach processes to existing Namespaces. Useful for debugging purposes.

nsenter --help
Enter fullscreen mode Exit fullscreen mode
nsenter --target $DBPID --mount --uts --ipc --net --pid ps aux
Enter fullscreen mode Exit fullscreen mode

Alt Text
With Docker, these namespaces can be shared using the syntax container:. For example, the command below will connect nginx to the DB namespace.

docker run -d --name=web --net=container:db nginx:alpine
WEBPID=$(pgrep nginx | tail -n1)
echo nginx is $WEBPID
cat /proc/$WEBPID/cgroup
Enter fullscreen mode Exit fullscreen mode

Alt Text
While the net has been shared, it will still be listed as a namespace.

ls -lha /proc/$WEBPID/ns/
Enter fullscreen mode Exit fullscreen mode

However, the net namespace for both processes points to the same location.

ls -lha /proc/$WEBPID/ns/ | grep net 
Enter fullscreen mode Exit fullscreen mode
ls -lha /proc/$DBPID/ns/ | grep net
Enter fullscreen mode Exit fullscreen mode

Alt Text

Chroot

An important part of a container process is the ability to have different files that are independent of the host. This is how we can have different Docker Images based on different operating systems running on our system.

Chroot provides the ability for a process to start with a different root directory to the parent OS. This allows different files to appear in the root.

Cgroups (Control Groups)

CGroups limit the amount of resources a process can consume. These cgroups are values defined in particular files within the /proc directory.

To see the mappings, run the command:

cat /proc/$DBPID/cgroup
Enter fullscreen mode Exit fullscreen mode

Alt Text
These are mapped to other cgroup directories on disk at:

ls /sys/fs/cgroup/
Enter fullscreen mode Exit fullscreen mode

Alt Text
What are the CPU stats for a process?
The CPU stats and usage is stored within a file too!

cat /sys/fs/cgroup/cpu,cpuacct/docker/$DBID/cpuacct.stat
Enter fullscreen mode Exit fullscreen mode

The CPU shares limit is also defined here.

cat /sys/fs/cgroup/cpu,cpuacct/docker/$DBID/cpu.shares
Enter fullscreen mode Exit fullscreen mode

Alt Text
All the Docker cgroups for the container's memory configuration are stored within:

ls /sys/fs/cgroup/memory/docker/
Enter fullscreen mode Exit fullscreen mode

Alt Text
Each of the directory is grouped based on the container ID assigned by Docker.

DBID=$(docker ps --no-trunc | grep 'db' | awk '{print $1}')
WEBID=$(docker ps --no-trunc | grep 'nginx' | awk '{print $1}')
ls /sys/fs/cgroup/memory/docker/$DBID
Enter fullscreen mode Exit fullscreen mode

Alt Text
How to configure cgroups?
One of the properties of Docker is the ability to control memory limits. This is done via a cgroup setting.

By default, containers have no limit on the memory. We can view this via docker stats command.

docker stats db --no-stream
Enter fullscreen mode Exit fullscreen mode

Alt Text
The memory quotes are stored in a file called memory.limit_in_bytes.

By writing to the file, we can change the limit limits of a process.

echo 8000000 >/sys/fs/cgroup/memory/docker/$DBID/memory.limit_in_bytes
Enter fullscreen mode Exit fullscreen mode

If you read the file back, you'll notice it's been converted to 7999488.

cat /sys/fs/cgroup/memory/docker/$DBID/memory.limit_in_bytes
Enter fullscreen mode Exit fullscreen mode

Alt Text
When checking Docker Stats again, the memory limit of the process is now 7.629M -

docker stats db --no-stream
Enter fullscreen mode Exit fullscreen mode

Alt Text

Seccomp / AppArmor

All actions with Linux is done via syscalls. The kernel has 330 system calls that perform operations such as read files, close handles and check access rights. All applications use a combination of these system calls to perform the required operations.

AppArmor is a application defined profile that describes which parts of the system a process can access.

It's possible to view the current AppArmor profile assigned to a process via

cat /proc/$DBPID/attr/current
Enter fullscreen mode Exit fullscreen mode

Alt Text
The default AppArmor profile for Docker is docker-default (enforce).

Prior to Docker 1.13, it stored the AppArmor Profile in /etc/apparmor.d/docker-default (which was overwritten when Docker started, so users couldn't modify it. After v1.13, Docker now generates docker-default in tmpfs, uses apparmor_parser to load it into kernel, then deletes the file

The template can be found at https://github.com/moby/moby/blob/a575b0b1384b2ba89b79cbd7e770fbeb616758b3/profiles/apparmor/template.go

Seccomp provides the ability to limit which system calls can be made, blocking aspects such as installing Kernel Modules or changing the file permissions.

The default allowed calls with Docker can be found at https://github.com/moby/moby/blob/a575b0b1384b2ba89b79cbd7e770fbeb616758b3/profiles/seccomp/default.json

When assigned to a process it means the process will be limited to a subset of the ability system calls. If it attempts to call a blocked system call is will recieve the error "Operation Not Allowed".

The status of SecComp is also defined within a file.

cat /proc/$DBPID/status
Enter fullscreen mode Exit fullscreen mode
cat /proc/$DBPID/status | grep Seccomp
Enter fullscreen mode Exit fullscreen mode

Alt Text
The flag meaning are: 0: disabled 1: strict 2: filtering

Capabilities

Capabilities are groupings about what a process or user has permission to do. These Capabilities might cover multiple system calls or actions, such as changing the system time or hostname.

The status file also containers the Capabilities flag. A process can drop as many Capabilities as possible to ensure it's secure.

cat /proc/$DBPID/status | grep ^Cap
Enter fullscreen mode Exit fullscreen mode

Alt Text
The flags are stored as a bitmask that can be decoded with capsh

capsh --decode=00000000a80425fb
Enter fullscreen mode Exit fullscreen mode

Alt Text

Hope you found this article worth reading!

Top comments (3)

Collapse
 
thisdotmedia_staff profile image
This Dot Media • Edited

Good job on your first post Gaurav! 😊

Collapse
 
gauravratnawat profile image
Gaurav

Thanks, :-) Hope this post well serving its purpose.

Collapse
 
thisdotmedia_staff profile image
This Dot Media

Np! We're sure it will 🤗