DEV Community

Satoru Takeuchi

"Command name" from the perspective of the Linux kernel.

Introduction

If you use Linux, you probably run various commands every day. You might use the term "command name" to refer to them, but its meaning varies with context. This article explains what the Linux kernel considers a command name.

First, I will present a brief conclusion, followed by a detailed explanation, and finally, I will describe the motivation for this investigation and the subsequent research process.

TL;DR

From the Linux kernel's perspective, the command name is the first 15 bytes of the basename of the executable file name (the file name without the directory part).
It is stored as a NUL-terminated string in a 16-byte field called comm within a structure called task_struct, which exists in kernel memory for each process (more precisely, for each kernel-level thread).
This lets the kernel identify a process cheaply and in a form that is easier to read than a pid.
This command name is used in kernel logs and by commands such as ps and pgrep from the procps package. Longer command names are truncated due to the 15-byte limit mentioned above.
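
The rule above can be approximated in a few lines of shell. This is only a sketch of the naming rule, not kernel code: the function name kernel_comm is hypothetical, and basename and cut stand in for what the kernel does internally.

```shell
#!/bin/sh
# TASK_COMM_LEN is 16 in the kernel; one byte is reserved for the terminating NUL.
TASK_COMM_LEN=16

# kernel_comm (hypothetical helper): approximate the name the kernel would keep
# for an executable path, i.e. the basename cut to TASK_COMM_LEN - 1 bytes.
kernel_comm() {
    name=$(basename "$1")
    printf '%s\n' "$name" | cut -b "1-$((TASK_COMM_LEN - 1))"
}

kernel_comm ./foo.sh                    # prints foo.sh
kernel_comm ./foo-bar-baz-hoge-huga.sh  # prints foo-bar-baz-hog
```

The second call shows why a pattern containing the full script name can never match: only the first 15 bytes survive.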

Investigation Process

Software versions used for the investigation

  • Linux kernel: v5.15
  • Procps: 3.3.17

Motivation

The investigation summarized in the "TL;DR" section started when the pgrep command I used in my own program did not work as expected. pgrep treats its argument as a regular expression and prints the pids of the running processes whose names match it. For example, the following runs a script called "foo.sh" that sleeps forever and then uses pgrep to display its pid.

$ cat foo.sh
#!/bin/bash

sleep infinity
$ ./foo.sh &
[2] 1086408
$ pgrep "foo\.sh"
1086408

However, when I tried the same thing with a script called "foo-bar-baz-hoge-huga.sh" that does exactly the same thing as "foo.sh", pgrep did not display anything.

$ cat foo-bar-baz-hoge-huga.sh
#!/bin/bash

sleep infinity
$ ./foo-bar-baz-hoge-huga.sh &
[2] 1086868
$ pgrep "foo-bar-baz-hoge-huga\.sh"
$ 

I thought it was odd, so I looked at man pgrep and found the following description.

The process name used for matching is limited to the 15 characters present in the output of /proc/pid/stat.

In fact, when I looked at the /proc/pid/stat file for "foo-bar-baz-hoge-huga.sh", I got the following output.

$ cat /proc/601235/stat
601235 (foo-bar-baz-hog) S 593786 601235 593786 34817 601419 4194304 224 0 0 0 0 0 0 0 20 0 1 0 5735606 8617984 900 18446744073709551615 94266299658240 94266300571405 140732967030208 0 0 0 65536 4 65538 1 0 0 17 1 0 0 0 0 0 94266300816048 94266300864080 94266304847872 140732967036675 140732967036712 140732967036712 140732967038941 0

The string displayed inside the parentheses in the second field, which shows the command name, did indeed match only the first 15 characters of the script name, not the entire name.
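
Pulling that second field out of a stat line takes a little care, because the name itself may contain spaces or parentheses. The following is a minimal sketch, assuming the line has the layout shown above; the function name stat_line_comm is hypothetical. It takes everything between the first "(" and the last ")".

```shell
#!/bin/sh
# stat_line_comm (hypothetical helper): extract the command name from one line
# in /proc/<pid>/stat format, given as the first argument.
stat_line_comm() {
    rest=${1#*\(}                # drop everything up to and including the first "("
    printf '%s\n' "${rest%\)*}"  # drop everything from the last ")" onward
}

stat_line_comm '601235 (foo-bar-baz-hog) S 593786 601235 593786'  # prints foo-bar-baz-hog
```

Splitting the line on spaces and taking the second word would work for this example, but it would break for a name that contains a space.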

Although I understood the specification itself and realized that my usage of pgrep was incorrect, I decided to verify where this 15-character limit came from.

Reading the procfs Manual

The files under the /proc/ directory are provided by a file system called procfs. Unlike file systems such as ext4 or XFS that manage data on disk, procfs exists for users to obtain kernel information and modify the kernel state through files. We will not go into the details of procfs here.

First, let's check the specifications of the /proc/pid/stat file. The specifications of files under procfs are described in man procfs. The following is an excerpt of the relevant part:

/proc/[pid]/stat
Status information about the process. This is used by ps(1). It is defined in the kernel source file fs/proc/array.c.
...
(2) comm %s
The filename of the executable, in parentheses. Strings longer than TASK_COMM_LEN (16) characters (including the terminating null byte) are silently truncated. This is visible
whether or not the executable is swapped out.

We can see that the second field of the /proc/pid/stat file contains the name of the executable file in parentheses, and that any part exceeding 16 bytes, including the terminating NUL byte, is truncated. Subtracting 1 byte for the NUL byte from 16 bytes gives 15 bytes, which matches the information written in the pgrep manual.

Identifying the handler for the /proc/pid/stat file

Next, I looked at the kernel source to see where this string is actually being output and where the data is stored. The procfs manual states that the /proc/pid/stat file is defined in the fs/proc/array.c file in the kernel source, so I first looked at this file.

The relevant code seems to be in the following part of the do_task_stat() function:

https://github.com/torvalds/linux/blob/v5.15/fs/proc/array.c#L562-L564

When the seq_puts() function is called, it outputs the specified string to a file. In the code above, lines 562 and 564 output "(" and ")", and it can be inferred that the command name is probably being output to a file by the proc_task_name() function on line 563.

Before looking at the contents of proc_task_name(), I decided to first check whether the do_task_stat() function is actually called when the /proc/pid/stat file is read. I traced the callers of do_task_stat() and found that it is called from two functions, proc_tid_stat() and proc_tgid_stat().

https://github.com/torvalds/linux/blob/v5.15/fs/proc/array.c#L646-L656

In the kernel, tid refers to the thread ID and tgid refers to the thread group ID, which is what user space calls the process ID (pid). We can therefore guess that proc_tgid_stat() is the caller we are looking for. procfs also exposes per-thread state under the /proc/pid/task directory, so proc_tid_stat() is probably the handler for the /proc/pid/task/tid/stat file.

Tracing the callers of these functions further back, I found that the fs/proc/base.c file, which registers the handlers called when users read and write files in procfs, registers proc_tgid_stat() as the handler for the /proc/tgid/stat file, or in other words, the /proc/<pid>/stat file.

https://github.com/torvalds/linux/blob/v5.15/fs/proc/base.c#L3168-L3202

In summary, I found the following:

  • The user reads the /proc/pid/stat file
  • The proc_tgid_stat() function is called
  • The do_task_stat() function is called
  • The proc_task_name() function is called to output the command name to the file
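
Both entry points can be exercised from user space by reading the process-level and thread-level stat files of the current shell. This is a rough sketch, assuming a Linux system with procfs mounted at /proc; the function name stat_comm is hypothetical. For the main thread, the tid equals the pid, so both files should report the same name.

```shell
#!/bin/sh
# stat_comm (hypothetical helper): print the second field of a stat file,
# i.e. the text between the first "(" and the last ") ".
stat_comm() {
    sed 's/^[0-9]* (\(.*\)) .*$/\1/' "$1"
}

pid=$$
if [ -r "/proc/$pid/stat" ]; then
    # Process-level file: handled by proc_tgid_stat() according to the trace above.
    stat_comm "/proc/$pid/stat"
    # Thread-level file of the main thread: presumably handled by proc_tid_stat().
    stat_comm "/proc/$pid/task/$pid/stat"
fi
```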

Identifying the source of the command name information

Upon examining the implementation of the proc_task_name() function, it looks like this:

https://github.com/torvalds/linux/blob/v5.15/fs/proc/array.c#L99-L112

I will omit the details, but when the process indicated by the pid is a regular program, the evaluation result of the if statement on line 103 is false. This evaluation result is true only in the case of special processes created within the kernel.

Furthermore, since the escape argument of the proc_task_name() function is true when called via the proc_tgid_stat() function, the evaluation result of the if statement on line 108 is true. Therefore, we can see that the data obtained by the __get_task_comm() function (probably a NULL-terminated string) is being used as the output for the /proc/pid/stat file on line 109 within the proc_task_name() function. The seq_escape_str() function on line 109 escapes special characters and spaces, but I will not explain the details here as it is not important for this article.

Now, let's look at the contents of the __get_task_comm() function.

https://github.com/torvalds/linux/blob/v5.15/fs/exec.c#L1209-L1215

We can see that the value of tsk->comm, or more precisely, the value of the comm field of a structure named task_struct, is the source of the command name information. The task_struct structure exists for each thread. Let's take a look at the definition of the task_struct structure.

https://github.com/torvalds/linux/blob/master/include/linux/sched.h#L727-L1063

https://github.com/torvalds/linux/blob/master/include/linux/sched.h#L276-L282

We can see that the comm field is an array of char with a length of 16. The procfs manual also mentioned that the length of TASK_COMM_LEN is 16 bytes.

Confirming where the value of task_struct->comm is set

The __set_task_comm() function sets the value of task_struct->comm:

https://github.com/torvalds/linux/blob/v5.15/fs/exec.c#L1223-L1230

The caller of the __set_task_comm() function is the begin_new_exec() function:

https://github.com/torvalds/linux/blob/v5.15/fs/exec.c#L1238-L1357

This function is called during the execve() system call, which replaces the program running in the calling process with a new one. The bprm->filename contains the name of the executable file corresponding to the process as a NUL-terminated string. Here, we can see that the name of the executable file is processed using the kbasename() function and then saved in task_struct->comm. The kbasename() function, similar to the basename() function in the standard C library, returns a string with the directory part of the file name removed. Therefore, if the executable file name is "./foo.sh", "foo.sh" will be stored in task_struct->comm, and if it's "./foo-bar-baz-hoge-huga.sh", "foo-bar-baz-hog" will be stored. Finally, I understood the definition of the "command name" in the /proc/pid/stat file, or, in other words, as referred to by the Linux kernel.
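
The effect of this code path can be observed from user space by letting a long-named script report its own name. This is a sketch under a few assumptions: procfs is mounted at /proc, the kernel provides the /proc/<pid>/comm file (which exposes the same comm field), /bin/sh exists, and the temporary directory allows execution. The variable names are arbitrary.

```shell
#!/bin/sh
# Create a script with a long name that prints the comm value of its own shell.
dir=$(mktemp -d)
script="$dir/foo-bar-baz-hoge-huga.sh"
printf '#!/bin/sh\ncat "/proc/$$/comm"\n' > "$script"
chmod +x "$script"

# Run it via execve(); the kernel stores the truncated basename in comm.
# If the script cannot run (for example on a noexec mount), observed stays empty.
observed=$("$script" 2>/dev/null) || observed=""
rm -rf "$dir"

echo "$observed"
```

Based on the explanation above, this should print the first 15 bytes of the script's basename, "foo-bar-baz-hog", regardless of the directory the script lives in.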

Examining the procps source code

Lastly, by reading the procps source code, I confirmed that the string pgrep matches against is, as described in the man page, the content of the second field of the /proc/pid/stat file without the surrounding "(" and ")", which is at most 15 characters long.

There is nothing particularly interesting going on there, so I will omit the details.

Column: Considering the Definition of Command Names

We now understand that the command name, as referred to by the Linux kernel, is the first 15 bytes of the basename of the executable file. However, why is it processed with the basename, and why is it truncated to a maximum of 15 bytes? The reasons are probably as follows:

To identify processes through kernel logs and other means, it is convenient to have easily accessible information in the form of a string, separate from the process ID (pid). The name of the executable file can be used for this purpose. However, storing the full executable file name in the task_struct structure may consume a large amount of kernel memory and could potentially create a security vulnerability if a malicious user executed a program with an excessively long file name. Therefore, storing the entire file name is not feasible.

One might think that it would be sufficient to look at the value of the executable file name stored in the process memory. However, this is not necessarily true. When accessing the process memory from the kernel, if the relevant memory might be swapped out, it is necessary to swap it back in before reading, which can be cumbersome. Moreover, this approach cannot be used in situations where the system is running out of memory, for instance, when the kernel needs to log the lack of memory. It is not possible to increase memory usage when there is already a shortage.

The reason for using the basename, such as "foo.sh" instead of the file name or full path specified at runtime like "./foo.sh", is likely due to the decision that the basename still provides sufficiently high visibility. In most cases, the basename is enough to recognize and identify the process without using the full path.

Conclusion

In this article, I described why the command name specification in the Linux kernel is as it is. Additionally, I wrote about the process of finding answers to small questions that arise while using a computer by reading source code, allowing readers to relive the experience of source code reading. Neither of these provides immediately useful knowledge, but I hope they can serve as tidbits of information.
