DEV Community


Posted on Mar 25

How Linux Works:
Chapter1 Linux Overview (Part1)
In this chapter, we will discuss what Linux is, what its core component (the kernel) is, and how Linux differs from other systems, all within the context of the entire system. We will also explain the meanings of terms such as program and process, which are often used in similar contexts.

Programs and Processes
Various programs run on Linux. A program is a set of instructions and data that work together on a computer. In compiled languages like Go, the executable file produced by building the source code is the program. In scripting languages like Python, the source code itself is the program. The kernel is also a type of program.

When you turn on the machine, the kernel starts first. All other programs start after the kernel.

There are various types of programs running on Linux, such as:

Web browsers: Chrome, Firefox, etc.
Office suites: LibreOffice, etc.
Web servers: Apache, Nginx, etc.
Text editors: Vim, Emacs, etc.
Programming language processors: C compiler, Go compiler, Python interpreter, etc.
Shells: bash, zsh, etc.
System-wide management software: systemd, etc.
A program that is running after startup is called a process. Since the term "program" is sometimes used to refer to a running process, it can be said that "program" has a broader meaning than "process."
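To make the distinction concrete, here is a minimal sketch that starts the same program (the Python interpreter) twice and prints the process ID of each run. The helper name `run_child` is made up for this illustration and is not part of the original text.

```python
import os
import subprocess
import sys


def run_child() -> int:
    """Start the interpreter as a child process and return the PID it reports."""
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.getpid())"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# One program file on disk, started twice: each run is a separate process.
first_pid = run_child()
second_pid = run_child()

print("program file:", sys.executable)
print("this process:", os.getpid())
print("child processes:", first_pid, second_pid)
```

The program file stays the same, while every run gets a process of its own that the kernel tracks by its process ID.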

Kernel
In this section, we will explain what a kernel is and why it is necessary, using the example of accessing storage devices like HDDs and SSDs connected to the system.

First, let's consider a system where processes can directly access storage devices.

[Figure: processes accessing a storage device directly]

In this case, problems arise when multiple processes try to operate on the device simultaneously.

Suppose that reading or writing data on a storage device requires issuing the following two commands:

1. Specify the location to read or write data.
2. Read or write data at the location specified in command 1.
If process A writes data while process B simultaneously reads data from a different location, the commands might be issued in the following order:

1. Process A specifies the location to write data.
2. Process B specifies the location to read data.
3. Process A writes data.
In command 3, the data is written not to the location specified in command 1 but to the one specified in command 2, which corrupts the data stored there. As you can see, accessing storage devices is very dangerous if the order of command execution is not properly controlled.
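The interleaving above can be replayed with a toy model. This is only a sketch: `ToyStorageDevice` and its methods are hypothetical names invented for illustration, and real devices do not expose a Python interface like this.

```python
class ToyStorageDevice:
    """Hypothetical device that follows the two-command protocol described above."""

    def __init__(self, size: int) -> None:
        self.blocks = ["-"] * size
        self.position = 0  # location set by the most recent "specify" command

    def specify_location(self, location: int) -> None:
        self.position = location

    def write(self, value: str) -> None:
        self.blocks[self.position] = value

    def read(self) -> str:
        return self.blocks[self.position]


device = ToyStorageDevice(size=4)
device.blocks[2] = "B-data"  # the data process B intends to read

device.specify_location(0)  # 1. process A specifies where to write
device.specify_location(2)  # 2. process B specifies where to read
device.write("A-data")      # 3. process A writes, but the position is now 2

print(device.blocks)
```

Location 0, where process A meant to write, is untouched, while the data at location 2 has been overwritten.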

Another problem is that programs that should not have access to the device would be able to access it.

To avoid these problems, the kernel, with the help of hardware, prevents processes from directly accessing devices. Specifically, it uses a feature called mode built into the CPU.

General-purpose CPUs used in personal computers and servers have two modes: kernel mode and user mode. More precisely, some CPU architectures have three or more modes, but we will not cover them here. When a process is running in user mode, it is said to be running in userland (or user space).

While there are no restrictions in kernel mode, certain instructions cannot be executed when running in user mode.

In Linux, only the kernel operates in kernel mode and can access devices. In contrast, processes operate in user mode and cannot access devices. Therefore, processes access devices indirectly through the kernel.

[Figure: processes accessing a device indirectly through the kernel]

The functionality of accessing devices, including storage devices, through the kernel is described in detail in Chapter 6.

In addition to the device control mentioned above, the kernel centrally manages resources shared by all processes in the system and allocates them to processes running on the system. The program that operates in kernel mode for this purpose is the kernel itself.

System Calls
System calls are a method for processes to request the kernel to perform tasks. They are used when the kernel's help is needed, such as creating new processes or operating hardware.

Examples of system calls include:

Creating and deleting processes
Allocating and deallocating memory
Communication processing
File system operations
Device operations
System calls are implemented by executing special instructions on the CPU. As previously mentioned, processes run in user mode, but when a system call is issued to request the kernel for processing, an event called an exception occurs on the CPU (exceptions are explained in the "Page Tables" section of Chapter 4). Triggered by this event, the CPU transitions from user mode to kernel mode, and the kernel processing corresponding to the request begins. Once the system call processing within the kernel is complete, the CPU returns to user mode and the process continues its operation.
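As a small sketch of what requesting work from the kernel looks like from a program, the functions in Python's `os` module used below are thin wrappers around system calls. The exact system calls issued underneath can vary by platform and Python version, so treat the comments as a typical mapping rather than a guarantee.

```python
import os

pid = os.getpid()               # ask the kernel for this process's ID
read_fd, write_fd = os.pipe()   # communication: ask the kernel for a pipe
os.write(write_fd, b"ping")     # write(): send bytes into the pipe
data = os.read(read_fd, 4)      # read(): receive them from the other end
os.close(read_fd)               # close(): release the file descriptors
os.close(write_fd)

message = b"hello via write()\n"
written = os.write(1, message)  # file descriptor 1 is standard output

print("pid:", pid, "pipe data:", data, "bytes written:", written)
```

Each of these calls switches the CPU into kernel mode, lets the kernel do the work, and then returns to user mode with the result.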

[Figure: CPU mode transitions during a system call]

At the beginning of system call processing, the kernel checks whether the request from the process is legitimate (for example, that it is not requesting more memory than the system has). If the request is illegitimate, the system call fails.
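A failing system call is easy to provoke. The sketch below uses a different kind of illegitimate request than the memory example: it asks the kernel to write to a file descriptor that has already been closed, and the kernel rejects the request with the error EBADF ("bad file descriptor").

```python
import errno
import os

# Create a pipe, then close both ends so the descriptors become invalid.
read_fd, write_fd = os.pipe()
os.close(read_fd)
os.close(write_fd)

try:
    os.write(write_fd, b"data")  # write_fd no longer refers to an open file
    error_code = None
except OSError as exc:
    error_code = exc.errno
    print("write() failed:", errno.errorcode[exc.errno])
```

Python reports the failed system call as an `OSError` whose `errno` attribute carries the error code set by the kernel.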

There is no way for a process to change the CPU mode directly without going through a system call. If there were, there would be no point in having a kernel. For example, if a malicious user were to change the CPU to kernel mode from a process and directly operate a device, they could eavesdrop on or destroy other users' data.

Visualizing System Call Invocations
You can check what system calls a process issues using the strace command. Let's try running a simple hello program that only outputs the string "hello world" through strace.
package main

import (
	"fmt"
)

func main() {
	fmt.Println("hello world")
}
First, build and run the program without strace.
$ go build hello.go
$ ./hello
hello world
As expected, it displayed "hello world". Now let's see what system calls this program issues using strace. You can specify the output destination of strace with the "-o" option.
$ strace -o hello.log ./hello
hello world
The program terminated with the same output as before. Now let's take a look at the contents of "hello.log", which contains the output of strace.
$ cat hello.log
write(1, "hello world\n", 12) = 12 ... (1)
The output of strace corresponds to one system call per line. You can ignore the detailed numerical values and just look at the string at the beginning of each line. From line (1), you can see that the program is using the write() system call to output the string "hello world\n" (where \n represents a newline character) to the screen or a file.

In my environment, a total of 150 system calls were issued. Most of these were issued by the program's startup and shutdown processing (which is also provided by the operating system) and are not something you need to worry about.

Regardless of the programming language used, a program issues a system call whenever it requests processing from the kernel. Let's confirm this with the program shown below, which does the same thing as the hello program but is written in Python.


print("hello world")
Let's run this program through strace.
$ strace -o ./
hello world
Let's take a look at the trace information.
$ cat
write(1, "hello world\n", 12) = 12 ... (2)
Looking at (2), you can see that the write() system call is issued, just as in the hello program. Try writing an equivalent hello program in your favorite language and experimenting with it. It might also be interesting to run more complex programs through strace. However, the output of strace tends to be large, so be careful not to fill up your file system.
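Since strace logs get long, a short script can summarize them. The following is a rough sketch that counts how often each system call name appears in a log. The function name `count_syscalls` is made up for this example, and most of the sample lines are illustrative stand-ins rather than the author's actual log. strace itself also offers a `-c` option that prints a summary.

```python
import re
from collections import Counter

# A system call line starts with the call name followed by "(".
SYSCALL_LINE = re.compile(r"^(\w+)\(")


def count_syscalls(log_text: str) -> Counter:
    """Count system calls by name in the text of an strace log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = SYSCALL_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative sample; a real log would be read from a file such as hello.log.
sample_log = r'''
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
close(3) = 0
write(1, "hello world\n", 12) = 12
exit_group(0) = ?
+++ exited with 0 +++
'''

counts = count_syscalls(sample_log)
for name, count in counts.most_common():
    print(name, count)
```

Lines that do not start with a system call name, such as the final exit status line, are skipped.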

Proportion of Time Spent Processing System Calls
You can use the sar command to find out what kinds of processing the logical CPUs installed in the system spend their time on, and in what proportion. First, let's collect information on what logical CPU 0 is executing by using the sar -P 0 1 1 command. The "-P 0" option means to collect data for logical CPU 0, the next "1" means to collect data every second, and the last "1" means to collect data only once.

$ sar -P 0 1 1
Linux 5.4.0-66-generic (coffee) 3/26/23 x86_64 (8 CPU)

09:51:03 CPU %user %nice %system %iowait %steal %idle ... (1)
09:51:04 0 0.00 0.00 0.00 0.00 0.00 100.00
Average: 0 0.00 0.00 0.00 0.00 0.00 100.00
Let me explain how to read this output. (1) is the header line. The next line shows how the logical CPU indicated in its second field was used during the one second between the time in the first field of the header line ("09:51:03") and the time in the first field of that line ("09:51:04").

CPU usage is broken down into six categories, from the third field ("%user") to the eighth field ("%idle"). Each is expressed as a percentage, and together they add up to 100. The proportion of time spent executing processes in user mode is the sum of "%user" and "%nice" (the difference between "%user" and "%nice" is described in the "Time slice mechanism" column in Chapter 3). "%system" is the proportion of time the kernel spent processing system calls, and "%idle" is the proportion of time spent in the idle state, when nothing is being done. We will omit the others here.
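The arithmetic described above can be checked with a few lines of code. This is a minimal sketch that assumes the six percentage fields are the last six columns of a sar data line; `parse_sar_line` is a name invented for this example.

```python
FIELDS = ["%user", "%nice", "%system", "%iowait", "%steal", "%idle"]


def parse_sar_line(line: str) -> dict:
    """Map the last six columns of a sar data line to their field names."""
    parts = line.split()
    values = [float(value) for value in parts[-len(FIELDS):]]
    return dict(zip(FIELDS, values))


# The data line from the sar output shown above.
usage = parse_sar_line("09:51:04 0 0.00 0.00 0.00 0.00 0.00 100.00")

user_mode = usage["%user"] + usage["%nice"]  # time spent in user mode
total = sum(usage.values())                  # the six fields add up to 100

print("user mode:", user_mode, "system:", usage["%system"], "total:", total)
```

Taking the columns from the end of the line keeps the sketch working even when the time is printed with an extra AM/PM column.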

In the output above, "%idle" was "100.00". This means that the CPU was doing almost nothing.

Now, let's look at the output of sar while the program below, which does nothing but loop forever, runs in the background.


while True:
    pass
We will use the taskset command provided by the OS to run the program on CPU0. By executing taskset -c , you can run the command on a specific CPU specified by the "-c" argument. While running this command in the background, let's collect statistical information using the sar -P 0 1 1 command.
$ taskset -c 0 ./ &
[1] 1911
$ sar -P 0 1 1
Linux 5.4.0-66-generic (coffee) x86_64 (8 CPU)

CPU %user %nice %system %iowait %steal %idle
09 0 100.00 0.00 0.00 0.00 0.00 0.00 ... (1)
Average: 0 100.00 0.00 0.00 0.00 0.00 0.00
From (1), we can see that "%user" was "100.00" because the program was constantly running on logical CPU0. The state of logical CPU0 at this time is shown below.

[Figure: state of logical CPU0 while the infinite loop runs]

When the experiment is over, terminate the program with kill.
$ kill 1911
Next, let's do the same thing with the program below, which continuously issues the simple system call getppid() to get the parent process's process ID.


import os

while True:
    os.getppid()
$ taskset -c 0 ./ &
[1] 2005
$ sar -P 0 1 1
Linux 5.4.0-66-generic (coffee) x86_64 (8 CPU)

CPU %user %nice %system %iowait %steal %idle
0 35.00 0.00 65.00 0.00 0.00 0.00 ... (1)
Average: 0 35.00 0.00 65.00 0.00 0.00 0.00
This time, because the system call is constantly being issued, "%system" has increased. The state of the CPU at this time is as follows.

[Figure: state of the CPU while system calls are issued continuously]

Once the experiment is over, terminate the syscall-inf-loop program with kill, as before.
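You can also observe the same split between user mode and kernel mode from inside a single process, without sar. The sketch below uses `os.times()`, which reports the user and system CPU time consumed by the current process. The helper names and the iteration count are arbitrary choices for this illustration, and the measured values will differ from machine to machine.

```python
import os


def cpu_time_spent(work, iterations: int):
    """Run work() repeatedly; return the (user, system) CPU seconds consumed."""
    before = os.times()
    for _ in range(iterations):
        work()
    after = os.times()
    return after.user - before.user, after.system - before.system


def no_syscall():
    """Work that stays entirely in user mode."""
    pass


user_a, system_a = cpu_time_spent(no_syscall, 200_000)
user_b, system_b = cpu_time_spent(os.getppid, 200_000)  # one system call per iteration

print(f"loop only:     user={user_a:.2f}s system={system_a:.2f}s")
print(f"getppid loop:  user={user_b:.2f}s system={system_b:.2f}s")
```

On a typical Linux machine, the system time of the second loop is noticeably larger than that of the first, which mirrors the rise in "%system" seen with sar.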

Column: Monitoring, Alerting, and Dashboards
As mentioned earlier, collecting system statistical information using tools like the sar command is crucial for ensuring that the system is functioning as expected. In business systems, it is common to continuously collect such statistical information. This kind of mechanism is called monitoring. Nowadays, Prometheus, for example, is one attractive option for a monitoring tool.

It's difficult for humans to visually monitor statistical information, so it's common to use an alerting function along with monitoring tools. This function lets humans define in advance what constitutes a normal state and notifies administrators or operators when an anomaly occurs. Alerting tools may be integrated with monitoring tools, but they can also be standalone software, such as Alertmanager.

Ultimately, humans will troubleshoot when the system enters an abnormal state, but examining a list of numbers alone is inefficient for investigation. Therefore, a dashboard feature that visualizes the collected data is also commonly used. This feature can also be integrated with monitoring or alerting tools or used as standalone software, such as Grafana Dashboards.

Duration of System Calls
By adding the "-T" option to strace, you can see how long each system call took, with microsecond precision. This is useful for determining which system calls are taking time when "%system" is high. The following is the result of running strace -T on the hello program.
$ strace -T -o hello.log ./hello
hello world
$ cat hello.log
write(1, "hello world\n", 12) = 12 <0.000017>
In this case, for example, it took 17 microseconds for the process to output the string "hello world\n".
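When a log contains many lines, sorting by the time in angle brackets quickly shows where the time goes. Here is a rough sketch that does so. The name `slowest_syscalls` is hypothetical, and the mmap line in the sample is an illustrative stand-in, not taken from the author's log.

```python
import re

# Matches lines such as: write(1, "hello world\n", 12) = 12 <0.000017>
TIMED_LINE = re.compile(r"^(\w+)\(.*<(\d+\.\d+)>$")


def slowest_syscalls(log_text: str, top: int = 3):
    """Return up to `top` (microseconds, name) pairs, slowest first."""
    timed = []
    for line in log_text.splitlines():
        match = TIMED_LINE.match(line.strip())
        if match:
            microseconds = round(float(match.group(2)) * 1_000_000)
            timed.append((microseconds, match.group(1)))
    return sorted(timed, reverse=True)[:top]


sample_log = r'''
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0000000000 <0.000009>
write(1, "hello world\n", 12) = 12 <0.000017>
'''

slowest = slowest_syscalls(sample_log)
for microseconds, name in slowest:
    print(f"{name}: {microseconds} microseconds")
```

In this sample, the write() call from the log above comes out on top at 17 microseconds.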

strace also has other options, such as "-tt", which displays the time at which each system call was issued with microsecond precision. Use them as needed.
