Satoru Takeuchi

Posted on Mar 25, 2023 • Edited on Oct 4, 2023

How Linux Works: Chapter1 Linux Overview (Part1)

#linux

In this chapter, we will discuss what Linux and its core component, the kernel, are, and how Linux differs from other systems, all within the context of the entire system. We will also explain the meanings of terms such as "program" and "process," which tend to be used interchangeably.

Programs and Processes

Various programs run on Linux. A program is a set of instructions and data that work together on a computer. In compiled languages like Go, the executable file produced by building the source code is the program. In scripting languages like Python, the source code itself is the program. The kernel is also a type of program.

When you turn on the machine, the kernel starts first¹. All other programs start after the kernel.

There are various types of programs running on Linux, such as:

Web browsers: Chrome, Firefox, etc.
Office suites: LibreOffice, etc.
Web servers: Apache, Nginx, etc.
Text editors: Vim, Emacs, etc.
Programming language processors: C compiler, Go compiler, Python interpreter, etc.
Shells: bash, zsh, etc.
System-wide management software: systemd, etc.

A program that is running after startup is called a process. Since the term "program" is sometimes used to refer to a running process, it can be said that "program" has a broader meaning than "process."
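The distinction can be observed from Python. The following is a minimal sketch, not part of the original text: the helper name run_child_and_get_pid is made up for illustration. It starts the Python interpreter (a program) a second time and shows that the running instance is a separate process with its own process ID.

```python
import os
import subprocess
import sys


def run_child_and_get_pid() -> int:
    """Start the Python interpreter as a new process and return the PID it reports."""
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.getpid())"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# The same program (the Python interpreter) is running as two different processes.
print("this process: ", os.getpid())
print("child process:", run_child_and_get_pid())
```

The two process IDs differ even though both processes were started from the same program file.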

Kernel

In this section, we will explain what a kernel is and why it is necessary, using the example of accessing storage devices like HDDs and SSDs connected to the system.

First, let's consider a system where processes can directly access storage devices.

In this case, problems arise when multiple processes try to operate on the device simultaneously.

To read and write data from storage devices, suppose you need to issue the following two commands:

1. Specify the location to read or write data.
2. Read or write data at the location specified in command 1.

If process A writes data while process B reads data from a different location at the same time, the commands might be issued in the following order:

1. Process A specifies the location to write data.
2. Process B specifies the location to read data.
3. Process A writes data.

In command 3, the data is written not to the location specified in command 1 but to the one specified in command 2, so the data at the location specified in command 2 is corrupted. As you can see, accessing storage devices is very dangerous if the order of command execution is not properly controlled².
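The interleaving described above can be simulated with a toy model. This is only a sketch under stated assumptions: ToyDisk is a hypothetical in-memory stand-in for a device with a single shared location register, not how any real storage device is driven.

```python
class ToyDisk:
    """A hypothetical device that needs two commands per access: seek, then read/write."""

    def __init__(self, size: int) -> None:
        self.blocks = ["-"] * size
        self.position = 0  # one location register shared by every caller

    def seek(self, location: int) -> None:
        self.position = location

    def write(self, data: str) -> None:
        self.blocks[self.position] = data

    def read(self) -> str:
        return self.blocks[self.position]


def interleaved_access() -> list:
    disk = ToyDisk(4)
    disk.blocks[2] = "B-data"  # the data process B intends to read
    disk.seek(0)               # 1. process A specifies location 0 for its write
    disk.seek(2)               # 2. process B specifies location 2 for its read
    disk.write("A-data")       # 3. process A writes, but the location is now 2
    return disk.blocks


print(interleaved_access())  # → ['-', '-', 'A-data', '-']
```

Location 0, where process A meant to write, is untouched, and the data process B wanted to read at location 2 has been overwritten.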

Another problem is that programs that should not be allowed to access the device would still be able to do so.

To avoid these problems, the kernel, with the help of hardware, prevents processes from directly accessing devices. Specifically, it uses a feature called mode built into the CPU.

General-purpose CPUs used in personal computers and servers have two modes: kernel mode and user mode. More precisely, some CPU architectures have three or more modes, but we will omit the details here³. When a process is running in user mode, it is said to be running in userland (or user space).

While there are no restrictions in kernel mode, certain instructions cannot be executed when running in user mode.

In Linux, only the kernel operates in kernel mode and can access devices. In contrast, processes operate in user mode and cannot access devices. Therefore, processes access devices indirectly through the kernel.
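As a small illustration of indirect access, the sketch below reads from the device file /dev/zero. Note the assumptions: /dev/zero is a virtual device rather than a physical one, it is chosen here only because its contents are predictable, and read_from_device is a made-up helper name. Every step is a request to the kernel; the process never touches a device itself.

```python
import os
import stat


def read_from_device(path: str = "/dev/zero", count: int = 4) -> bytes:
    """Read bytes from a device file; each step below is a request to the kernel."""
    fd = os.open(path, os.O_RDONLY)  # open() (or openat()) system call
    try:
        # fstat() system call: confirm that the file is a character device
        if not stat.S_ISCHR(os.fstat(fd).st_mode):
            raise ValueError(f"{path} is not a character device")
        return os.read(fd, count)    # read() system call
    finally:
        os.close(fd)                 # close() system call


print(read_from_device())  # → b'\x00\x00\x00\x00'
```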

The functionality of accessing devices, including storage devices, through the kernel is described in detail in Chapter 6.

In addition to the device control mentioned above, the kernel centrally manages resources shared by all processes in the system and allocates them to processes running on the system. The program that operates in kernel mode for this purpose is the kernel itself.

System Calls

System calls are a method for processes to request the kernel to perform tasks. They are used when the kernel's help is needed, such as creating new processes or operating hardware.

Examples of system calls include:

Creating and deleting processes
Allocating and deallocating memory
Communication processing
File system operations
Device operations
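Several of these categories can be touched from a few lines of Python. The following is a sketch only: syscall_tour is a hypothetical name, and the comments name the system calls that these standard library functions are generally understood to wrap on Linux.

```python
import mmap
import os


def syscall_tour() -> bytes:
    # Allocating memory: an anonymous mapping is requested from the kernel (mmap()).
    region = mmap.mmap(-1, mmap.PAGESIZE)
    # Communication: pipe() creates a one-way channel managed by the kernel.
    read_fd, write_fd = os.pipe()
    try:
        os.write(write_fd, b"ping")       # write() system call
        region[:4] = os.read(read_fd, 4)  # read() system call
        return bytes(region[:4])
    finally:
        os.close(read_fd)                 # close() system call
        os.close(write_fd)
        region.close()                    # deallocating memory (munmap())


print(syscall_tour())  # → b'ping'
```

Running this script through strace, as shown later in this chapter, is a good way to check which system calls are actually issued in your environment.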

System calls are implemented by executing special instructions on the CPU. As previously mentioned, processes run in user mode, but when a system call is issued to request the kernel for processing, an event called an exception occurs on the CPU (exceptions are explained in the "Page Tables" section of Chapter 4). Triggered by this event, the CPU transitions from user mode to kernel mode, and the kernel processing corresponding to the request begins. Once the system call processing within the kernel is complete, the CPU returns to user mode and the process continues its operation.

At the beginning of system call processing, the kernel checks whether the request from the process is legitimate (e.g., whether it is requesting more memory than exists in the system). If the request is illegitimate, the system call fails.
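A failing system call is easy to trigger. The sketch below uses a different kind of illegitimate request than the memory example above: it asks the kernel to write to a file descriptor that has already been closed. The function name write_to_closed_fd is made up for illustration.

```python
import errno
import os


def write_to_closed_fd() -> int:
    """Ask the kernel to write to a file descriptor that is no longer open."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)
    os.close(write_fd)
    try:
        os.write(write_fd, b"x")  # the kernel checks the descriptor and rejects the request
    except OSError as err:
        return err.errno          # the error number reported by the kernel
    return 0


print(errno.errorcode[write_to_closed_fd()])  # → EBADF
```

Python turns the failed system call into an OSError exception; EBADF means "bad file descriptor".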

There is no way for a process to change the CPU mode directly without going through a system call. If there were, there would be no point in having a kernel. For example, if a malicious user were to change the CPU to kernel mode from a process and directly operate a device, they could eavesdrop on or destroy other users' data.

Visualizing System Call Invocations

You can check what system calls a process issues using the strace command. Let's try running a simple hello program that only outputs the string "hello world" through strace.



package main

import (
    "fmt"
)

func main() {
    fmt.Println("hello world")
}

First, build and run the program without strace.



$ go build hello.go
$ ./hello
hello world

As expected, it displayed "hello world". Now let's see what system calls this program issues using strace. You can specify the output destination of strace with the "-o" option.



$ strace -o hello.log ./hello
hello world

The program terminated with the same output as before. Now let's take a look at the contents of "hello.log", which contains the output of strace.



$ cat hello.log
...
write(1, "hello world\n", 12)           = 12 ... (1)
...

Each line of the strace output corresponds to one system call. You can ignore the detailed numerical values and just look at the string at the beginning of each line. From line (1), you can see that the program is using the write() system call to output the string "hello world\n" (where \n represents a newline character) to the screen or a file.
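The three arguments and the return value in line (1) can be reproduced from Python with os.write(), a thin wrapper around the write() system call. This is a sketch; write_hello is a hypothetical helper name.

```python
import os


def write_hello(fd: int = 1) -> int:
    """Issue write() on the given file descriptor; 1 is standard output."""
    return os.write(fd, b"hello world\n")


# The return value is the number of bytes written, matching "= 12" in the trace.
written = write_hello()
print(written)
```

The first argument in the strace line, 1, is the file descriptor for standard output, and the third argument, 12, is the length of the string in bytes.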

In my environment, a total of 150 system calls were issued. Most of these were issued by the program's startup and shutdown processing (which is also provided by the operating system) and are not something you need to worry about.
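If you want to count the system calls in a log yourself, a short script is enough. The sketch below assumes that each system call line starts with the call's name followed by "(", as in the output above; the sample lines are illustrative and are not taken from the real hello.log.

```python
import re
from collections import Counter

# A system call line starts with the call name followed by "(".
SYSCALL_LINE = re.compile(r"^(\w+)\(")


def count_syscalls(lines) -> Counter:
    """Count how many times each system call name appears in strace output."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative sample lines in the style of strace output.
sample = [
    'write(1, "hello world\\n", 12)           = 12',
    "exit_group(0)                           = ?",
    "+++ exited with 0 +++",
]
print(count_syscalls(sample).most_common())
```

To analyze a real log, pass an open file instead of the sample list, for example count_syscalls(open("hello.log")).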

Regardless of the programming language used, when a program requests processing from the kernel, it issues a system call. Let's confirm this by examining the hello.py program shown below, a Python program that does the same thing as the hello program.



#!/usr/bin/python3

print("hello world")

Let's run this hello.py program through strace.



$ strace -o hello.py.log ./hello.py
hello world

Let's take a look at the trace information.



$ cat hello.py.log
...
write(1, "hello world\n", 12)           = 12   ... (2)
...

Looking at (2), you can see that, just like the hello program, the write() system call is issued. Try writing an equivalent hello program in your favorite language and experiment with it. It might also be interesting to run more complex programs through strace. However, be aware that the output of strace tends to be large, so be careful not to exhaust your file system capacity.

Proportion of time spent processing system calls

You can use the sar command to find out what proportion of time the logical CPUs⁴ installed in the system spend on each kind of processing. First, let's collect information on what logical CPU 0 is executing using the sar -P 0 1 1 command. The "-P 0" option selects logical CPU 0, the next "1" sets the collection interval to one second, and the last "1" means that data is collected only once.



$ sar -P 0 1 1
Linux 5.4.0-66-generic (coffee)         3/26/23  _x86_64_        (8 CPU)

09:51:03     CPU     %user     %nice   %system   %iowait    %steal     %idle ... (1)
09:51:04       0      0.00      0.00      0.00      0.00      0.00    100.00
Average:       0      0.00      0.00      0.00      0.00      0.00    100.00

Here is how to read this output. (1) is the header line. The line after it shows how the logical CPU given in the second field was used during the period from the time in the first field of the header line ("09:51:03") to the time in the first field of that line ("09:51:04").

CPU usage is divided into six types, shown from the third field ("%user") to the eighth field ("%idle"). Each is expressed as a percentage, and they add up to 100. The proportion of time spent executing processes in user mode is the sum of "%user" and "%nice" (the difference between "%user" and "%nice" is described in the "Time slice mechanism" column in Chapter 3). "%system" is the proportion of time the kernel spent processing system calls, and "%idle" is the proportion of time spent in the idle state, when nothing is being done. We will omit the others here.
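To see where such percentages can come from, the sketch below computes them from two snapshots of a per-CPU line in the format of /proc/stat, where the Linux kernel exposes cumulative CPU time counters. The assumptions here are mine: the field order follows the proc(5) manual page, the two sample lines are made up, and the arithmetic is simplified (sar itself groups some counters differently, so its figures may not match exactly).

```python
# Field order of a "cpuN" line in /proc/stat, per the proc(5) manual page.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> dict:
    """Turn a line such as 'cpu0 1000 0 500 ...' into a {field: ticks} mapping."""
    parts = line.split()
    return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))


def cpu_percentages(before: str, after: str) -> dict:
    """Percentage of time spent in each state between two snapshots."""
    old, new = parse_cpu_line(before), parse_cpu_line(after)
    delta = {name: new[name] - old[name] for name in FIELDS}
    total = sum(delta.values())
    if total == 0:
        raise ValueError("the two snapshots are identical")
    return {name: 100.0 * value / total for name, value in delta.items()}


# Made-up snapshots taken some time apart (values are cumulative clock ticks).
before = "cpu0 1000 0 500 8000 0 0 0 0 0 0"
after = "cpu0 1035 0 565 8000 0 0 0 0 0 0"
usage = cpu_percentages(before, after)
print(usage["user"], usage["system"], usage["idle"])  # → 35.0 65.0 0.0
```

Because every tick of CPU time is counted in exactly one field, the percentages always add up to 100.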

In the output above, "%idle" was "100.00". This means that the CPU was doing almost nothing.

Now, let's look at the output of sar while running the inf-loop.py program, which only does an infinite loop, in the background.



#!/usr/bin/python3

while True:
    pass

We will use the taskset command provided by the OS to run the inf-loop.py program on logical CPU0. By executing taskset -c <logical CPU number> <command>, you can run the command on the logical CPU specified by the "-c" argument. While running this command in the background, let's collect statistical information using the sar -P 0 1 1 command.



$ taskset -c 0 ./inf-loop.py &
[1] 1911
$ sar -P 0 1 1
Linux 5.4.0-66-generic (coffee)         02/27/2021  _x86_64_        (8 CPU)

09:59:57     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:59:58       0    100.00      0.00      0.00      0.00      0.00      0.00  ... (1)
Average:       0    100.00      0.00      0.00      0.00      0.00      0.00

From (1), we can see that "%user" was "100.00" because the inf-loop.py program was constantly running on logical CPU0. The state of logical CPU0 at this time is shown below.

When the experiment is over, terminate the inf-loop.py program with kill.



$ kill 1911
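The pinning done with taskset above can also be requested from inside a Python process using os.sched_setaffinity(), which is available on Linux. This is a sketch: pin_to_cpu is a hypothetical helper, and it picks the lowest-numbered logical CPU the process is allowed to use rather than assuming that CPU0 is available.

```python
import os


def pin_to_cpu(cpu: int) -> set:
    """Restrict the calling process to one logical CPU and return the new affinity."""
    os.sched_setaffinity(0, {cpu})  # 0 means "the calling process"
    return os.sched_getaffinity(0)


allowed = os.sched_getaffinity(0)  # logical CPUs this process may currently run on
target = min(allowed)              # usually 0, but not guaranteed (e.g. in containers)
print(pin_to_cpu(target))
os.sched_setaffinity(0, allowed)   # restore the original setting
```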

Next, let's do the same thing with the syscall-inf-loop.py program, which continuously issues the simple system call getppid() to get the parent process's process ID.



#!/usr/bin/python3

import os

while True:
    os.getppid()



$ taskset -c 0 ./syscall-inf-loop.py &
[1] 2005
$ sar -P 0 1 1
Linux 5.4.0-66-generic (coffee)         02/27/2021  _x86_64_        (8 CPU)

10:03:58     CPU     %user     %nice   %system   %iowait    %steal     %idle
10:03:59       0     35.00      0.00     65.00      0.00      0.00      0.00  ... (1)
Average:       0     35.00      0.00     65.00      0.00      0.00      0.00

This time, because the system call is constantly being issued, "%system" has increased. The state of the CPU at this time is as follows.

Now that the experiment is over, terminate the syscall-inf-loop.py program with kill.
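A process can also ask how its own CPU time was split between user mode and kernel mode. The sketch below repeats the getppid() loop a fixed number of times and reads the counters with os.times(); the function name and the iteration count are arbitrary choices for illustration, and the actual numbers depend on your machine.

```python
import os


def measure_cpu_time(iterations: int = 200_000) -> tuple:
    """Return (user seconds, system seconds) consumed by a loop of getppid() calls."""
    start = os.times()
    for _ in range(iterations):
        os.getppid()  # one system call per iteration
    end = os.times()
    return end.user - start.user, end.system - start.system


user_s, system_s = measure_cpu_time()
print(f"user: {user_s:.2f}s  system: {system_s:.2f}s")
```

The user portion comes from the Python interpreter executing the loop itself, and the system portion comes from the kernel processing getppid().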

Column: Monitoring, Alerting, and Dashboards

As mentioned earlier, collecting system statistical information using tools like the sar command is crucial for ensuring that the system is functioning as expected. In business systems, it is common to continuously collect such statistical information. This kind of mechanism is called monitoring. Nowadays, Prometheus, for example, is one popular choice of monitoring tool.

It's difficult for humans to visually monitor statistical information, so it's common to use an alerting function along with monitoring tools. This function lets humans define in advance what constitutes a normal state, and it notifies administrators or operators when an anomaly occurs. Alerting tools may be integrated with monitoring tools, but they can also be standalone software, such as Alertmanager.

Ultimately, humans will troubleshoot when the system enters an abnormal state, but examining a list of numbers alone is inefficient for investigation. Therefore, a dashboard feature that visualizes the collected data is also commonly used. This feature can also be integrated with monitoring or alerting tools or used as standalone software, such as Grafana Dashboards.

Duration of System Calls

By adding the "-T" option to strace, you can see how long each system call took, with microsecond precision. This feature is useful for determining which system calls are taking time when "%system" is high. The following is the result of running strace -T on the hello program.



$ strace -T -o hello.log ./hello
hello world
$ cat hello.log
...
write(1, "hello world\n", 12)           = 12 <0.000017>
...

In this case, for example, it took 17 microseconds for the process to output the string "hello world\n".
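When a log is long, it helps to extract the slowest calls automatically. The sketch below assumes that each line ends with the elapsed time in angle brackets, as in the output above. The first sample line is the one shown above; the second is made up for illustration.

```python
import re

# Call name at the start of the line, elapsed seconds in <...> at the end.
TIMED_LINE = re.compile(r"^(\w+)\(.*<(\d+\.\d+)>\s*$")


def slowest_syscalls(lines, top: int = 3) -> list:
    """Return up to `top` (seconds, name) pairs, slowest first."""
    timed = []
    for line in lines:
        match = TIMED_LINE.match(line)
        if match:
            timed.append((float(match.group(2)), match.group(1)))
    return sorted(timed, reverse=True)[:top]


sample = [
    'write(1, "hello world\\n", 12)           = 12 <0.000017>',
    'read(3, "", 4096)                       = 0 <0.000005>',  # made-up line
]
for seconds, name in slowest_syscalls(sample):
    print(f"{name}: {seconds * 1_000_000:.0f} microseconds")
```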

strace also has other options, such as the "-tt" option, which displays the time at which each system call was issued with microsecond precision. Use them as needed.


NOTE

This article is based on my book written in Japanese. Please contact me via satoru.takeuchi@gmail.com if you're interested in publishing this book's English version.

¹ To be precise, programs like firmware and boot loaders run before that. This will be explained in Chapter 2's "Parent-Child Relationship of Processes" section.
² In the worst case, the device may break and become unusable. Such a device is commonly said to be "bricked."
³ For example, there are four CPU modes in the x86_64 architecture, but the Linux kernel only uses two of them.
⁴ A logical CPU is the unit that the kernel recognizes as a CPU. On a single-core CPU, it corresponds to the CPU itself; on a multicore CPU, it corresponds to one core; and on a system with SMT enabled (refer to the "Simultaneous Multi-Threading (SMT)" section in Chapter 8), it corresponds to one thread within a CPU core. For simplicity, this book uses the term "logical CPU" throughout.
