"Hello World" is the first program of many. Regardless of the programming language we are learning, it is the canonical example of a program that simply prints "Hello, world!" to the screen.
One might then ask: how complex is it, really? After all, it is just a single write(2) syscall, right?
NOTE: This post refers specifically to Linux, as I will use some Linux-only tools.
The basics
For the purpose of my "Hello, world" I want to use Rust. It is a modern language suitable for low-level programming, so it surely will have much less overhead than many others. Besides that, it is just a good language.
Let's get our "Hello, world" going:
fn main() {
    println!("Hello, world!");
}
We can now run it:
cargo run
Hello, world!
Everything works as expected! Well, that is not a big achievement, but hey, we need to be happy with small things.
To see our write(2) syscall and get it over with, we will use strace, a system call tracing tool for Linux:
The -c option prints a summary of the system calls at the end of the trace. If you also want to see individual system calls with their arguments as they occur, use -C instead.
strace -c ./target/debug/hello-world
Hello, world!
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
  0.00    0.000000           0         5           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         4           close
  0.00    0.000000           0         1           poll
  0.00    0.000000           0        13           mmap
  0.00    0.000000           0         5           mprotect
  0.00    0.000000           0         2           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         5           rt_sigaction
  0.00    0.000000           0         2           pread64
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         3           sigaltstack
  0.00    0.000000           0         2         1 arch_prctl
  0.00    0.000000           0         1           sched_getaffinity
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         4           openat
  0.00    0.000000           0         4           newfstatat
  0.00    0.000000           0         1           set_robust_list
  0.00    0.000000           0         2           prlimit64
  0.00    0.000000           0         1           getrandom
  0.00    0.000000           0         1           rseq
------ ----------- ----------- --------- --------- ------------------
100.00    0.000000           0        63         2 total
That is quite a bit more stuff than one might have expected... This raises the question: does a simple "Hello, world" really need to do all of this? We should certainly do something about it.
Tracing complexity
Let's start by looking at those syscalls a bit more closely to see if we can get an idea of what is going on:
strace ./target/debug/hello-world
The output is pretty verbose, so I have chopped it down to the relevant pieces:
...
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=44627, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 44627, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f9c5c9ae000
close(3) = 0
openat(AT_FDCWD, "/usr/lib/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 832) = 832
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=571848, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9c5c9ac000
mmap(NULL, 127304, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f9c5c98c000
mmap(0x7f9c5c98f000, 94208, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x7f9c5c98f000
mmap(0x7f9c5c9a6000, 16384, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a000) = 0x7f9c5c9a6000
mmap(0x7f9c5c9aa000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1d000) = 0x7f9c5c9aa000
close(3) = 0
...
openat(AT_FDCWD, "/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P4\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
newfstatat(3, "", {st_mode=S_IFREG|0755, st_size=1953472, ...}, AT_EMPTY_PATH) = 0
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
mmap(NULL, 1994384, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f199ee8c000
mmap(0x7f199eeae000, 1421312, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x22000) = 0x7f199eeae000
mmap(0x7f199f009000, 356352, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17d000) = 0x7f199f009000
mmap(0x7f199f060000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1d4000) = 0x7f199f060000
mmap(0x7f199f066000, 52880, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f199f066000
close(3)
...
We can see a lot of openats, newfstatats, mmaps, reads, and closes. Most of them refer to some dynamic shared object. In the above we can see: ld.so.cache, libgcc_s.so.1, libc.so.6.
libgcc_s.so.1 and libc.so.6 are standard shared libraries, and ld.so.cache is basically a cache built by ldconfig. I was not really familiar with ld.so.preload, though, which, judging by our system calls, was not loaded successfully:
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
After a quick search, it turns out it works much like the LD_PRELOAD environment variable. It allows the user to specify ELF shared objects that are loaded before all others. And indeed we can see it was accessed first, but since I do not have this file on my system, the result was ... = -1 ENOENT (No such file or directory).
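ENOENT is simply the errno reported for a missing file. As a minimal sketch (not how the dynamic linker itself is implemented), the same error can be observed from Rust by probing a path. The helper name probe is made up for illustration, and fs::metadata goes through a stat-family syscall rather than access(2):

```rust
use std::fs;

/// Probe a path and report the raw errno if the lookup fails.
/// Hypothetical helper: the dynamic linker uses access(2), while
/// fs::metadata uses a stat-family syscall instead.
fn probe(path: &str) -> Result<(), Option<i32>> {
    match fs::metadata(path) {
        Ok(_) => Ok(()),
        Err(err) => Err(err.raw_os_error()),
    }
}

fn main() {
    for path in ["/etc/ld.so.preload", "/etc/ld.so.cache"] {
        match probe(path) {
            Ok(()) => println!("{path}: present"),
            Err(Some(errno)) => println!("{path}: lookup failed, errno {errno}"),
            Err(None) => println!("{path}: failed without an OS error code"),
        }
    }
}
```

On Linux, ENOENT has the numeric value 2, so a missing /etc/ld.so.preload should be reported with errno 2.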
We can tell that many of those syscalls refer to the same file by looking at the file descriptor, which is the return value of the openat(2) syscall:
openat(AT_FDCWD, "/usr/lib/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 3
In this case, the file descriptor is 3. We can see 3 being passed to the syscalls that follow, and if we consult the man pages for those, we can verify that this argument is indeed expected to be a file descriptor (fd):
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 832) = 832
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=571848, ...}, AT_EMPTY_PATH) = 0
...
mmap(NULL, 127304, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f9c5c98c000
...
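The same open, read, close pattern on a single descriptor can be sketched in Rust. This is only an illustration of how a descriptor ties the calls together, not what the dynamic linker runs; the function name fd_and_magic is hypothetical, and /proc/self/exe is a Linux-specific path that points at the running binary:

```rust
use std::fs::File;
use std::io::Read;
use std::os::unix::io::AsRawFd;

/// Open a file and return its descriptor number plus its first four bytes.
fn fd_and_magic(path: &str) -> std::io::Result<(i32, [u8; 4])> {
    let mut file = File::open(path)?; // an open-family syscall returns the fd
    let fd = file.as_raw_fd(); // the number strace prints after "="
    let mut magic = [0u8; 4];
    file.read_exact(&mut magic)?; // read(2) on that same descriptor
    Ok((fd, magic)) // `file` is dropped here, which closes the fd
}

fn main() -> std::io::Result<()> {
    let (fd, magic) = fd_and_magic("/proc/self/exe")?;
    println!("fd = {fd}, magic = {magic:?}");
    Ok(())
}
```

The four bytes are the ELF magic, the same "\177ELF" prefix visible in the read calls of the trace.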
Since those are dynamic libraries, we do not explicitly touch their files in our code (we only call their functions); it is the job of the dynamic linker to make them available. This makes sense, since most Rust targets are linked dynamically by default.
If we inspect our binary with file, we can see it for ourselves:
file ./target/debug/hello-world
./target/debug/hello-world: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=54d56ea3e059ced4d3b8cc088c409da6411264af, for GNU/Linux 4.4.0, with debug_info, not stripped
In simple terms, "dynamically linked" means that shared libraries are loaded into memory and their sections are mapped at run time, after the process has started.
Running ldd on our binary shows us some of the same files we have seen in the strace output:
ldd ./target/debug/hello-world
linux-vdso.so.1 (0x00007ffc75f26000)
libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007fdc33996000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007fdc337af000)
/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007fdc33a13000)
- We have seen libgcc_s.so.1 and libc.so.6 being loaded in our syscall trace.
- vDSO in linux-vdso.so.1 stands for virtual dynamic shared object; it is used to optimize some syscalls.
- The last one remaining, /usr/lib64/ld-linux-x86-64.so.2, is the dynamic linker itself. You can see it for yourself by trying to run it:
/usr/lib64/ld-linux-x86-64.so.2 --help | head -n 4
Usage: /usr/lib64/ld-linux-x86-64.so.2 [OPTION]... EXECUTABLE-FILE [ARGS-FOR-PROGRAM...]
You have invoked 'ld.so', the program interpreter for dynamically-linked
ELF programs. Usually, the program interpreter is invoked automatically
when a dynamically-linked executable is started.
So what all of this means for our problem is that before our code that simply prints "Hello, world!" actually runs, the dynamic linker does all this magic: it opens and memory-maps every dependency, and so on.
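One way to see the result of that work from inside a process is to read /proc/self/maps, which lists every mapping the process has. The following is a rough sketch, assuming the usual maps layout where the file path is the sixth whitespace-separated field (paths containing spaces would be cut short); the function name shared_objects is made up for illustration:

```rust
use std::collections::BTreeSet;
use std::fs;

/// Collect the distinct paths containing ".so" from text in
/// /proc/<pid>/maps format (the path is the sixth field of each line).
fn shared_objects(maps: &str) -> BTreeSet<String> {
    maps.lines()
        .filter_map(|line| line.split_whitespace().nth(5))
        .filter(|path| path.contains(".so"))
        .map(str::to_owned)
        .collect()
}

fn main() -> std::io::Result<()> {
    let maps = fs::read_to_string("/proc/self/maps")?;
    for object in shared_objects(&maps) {
        println!("{object}");
    }
    Ok(())
}
```

For a dynamically linked build this should print the same libraries that ldd reported; for a statically linked build the list is expected to be empty.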
While dynamic linking is great, it sounds like way too much work for a simple "Hello, world!". Let's try to cut it out...
Eliminating linker
Since we have identified the first suspect bloating the output of our strace, we can now eliminate it.
From the same Rust docs linked above we can read that it is possible to link the C runtime (crt) statically using the crt-static target feature. We can pass it to the compiler using RUSTFLAGS:
RUSTFLAGS="-C target-feature=+crt-static" cargo build
Let's check our improvements in action:
strace -c ./target/debug/hello-world
Hello, world!
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
  0.00    0.000000           0         2           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         1           close
  0.00    0.000000           0         1           poll
  0.00    0.000000           0         1           mmap
  0.00    0.000000           0         2           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         5           brk
  0.00    0.000000           0         5           rt_sigaction
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           readlink
  0.00    0.000000           0         3           sigaltstack
  0.00    0.000000           0         2         1 arch_prctl
  0.00    0.000000           0         1           sched_getaffinity
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           openat
  0.00    0.000000           0         1           newfstatat
  0.00    0.000000           0         1           set_robust_list
  0.00    0.000000           0         2           prlimit64
  0.00    0.000000           0         1           getrandom
  0.00    0.000000           0         1           rseq
------ ----------- ----------- --------- --------- ------------------
100.00    0.000000           0        35         1 total
This is significantly better, as we dropped from 63 to 35 syscalls, but that is still way more than we need. We can, however, confirm that our binary is now linked statically:
ldd ./target/debug/hello-world
statically linked
An alternative way of building a statically linked binary is to use musl libc instead of glibc. musl was designed with static linking in mind, so it is worth giving it a shot. We can do that by specifying the x86_64-unknown-linux-musl target. We no longer need to pass RUSTFLAGS, as static linking is the default behavior for the musl target:
cargo build --target x86_64-unknown-linux-musl && strace -c ./target/x86_64-unknown-linux-musl/debug/hello-world
Hello, world!
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         1           write
  0.00    0.000000           0         1           poll
  0.00    0.000000           0         1           mmap
  0.00    0.000000           0         1           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         2           brk
  0.00    0.000000           0         5           rt_sigaction
  0.00    0.000000           0         3           rt_sigprocmask
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         3           sigaltstack
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           set_tid_address
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000           0        21           total
Mind the different binary path in the target directory!
We dropped another few syscalls. It is pretty hard to tell "why" without diving into the actual source code of both glibc and musl. They are completely different implementations of libc, so as long as the function interfaces are preserved, each implementation is free to handle things differently.
Coming back to our task, however, we are still quite far from the goal. Perhaps it is Rust that is at fault here? Maybe it was not a good choice after all...
Descending into C
There sometimes comes a time when you have to abandon your ideals and just get the job done. This time is now. To verify whether it is the Rust runtime causing all those syscalls, we can try to write the same program in good old C:
#include <stdio.h>
int main() {
    printf("Hello, world!\n");
}
Wasn't too bad... Since we already identified musl as a good candidate for static linking, we can build it with musl-gcc (a wrapper for gcc that links against musl):
musl-gcc -static main.c && ./a.out
Hello, world!
Let's see how it does:
strace -c ./a.out
Hello, world!
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         1           ioctl
  0.00    0.000000           0         1           writev
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           set_tid_address
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000           0         5           total
Now, that gets us much closer to what we want.
You may have noticed that the write syscall was replaced with writev(2). writev is simply a different version of write that allows writing multiple buffers at once (known as vectored I/O).
If we check the actual arguments passed to the syscall:
strace -e trace=writev ./a.out
The -e option lets us specify an expression that selects which events to trace and how to trace them. In our case, we want to trace only the writev syscall.
writev(1, [{iov_base="Hello, world!", iov_len=13}, {iov_base="\n", iov_len=1}], 2Hello, world!
) = 14
+++ exited with 0 +++
We can see that our string was split into two buffers, one for "Hello, world!" and another for the new line "\n".
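To get a feel for what a vectored write looks like from the caller's side, here is a small Rust sketch that hands the same two buffers to a single write_vectored call. It writes into an in-memory Vec rather than stdout, so it only illustrates the shape of the API; it is not a claim about which syscall any particular writer ends up issuing. The function name hello_vectored is made up for illustration:

```rust
use std::io::{IoSlice, Write};

/// Write the message and the newline as two separate buffers in one call.
fn hello_vectored<W: Write>(out: &mut W) -> std::io::Result<usize> {
    let buffers = [IoSlice::new(b"Hello, world!"), IoSlice::new(b"\n")];
    out.write_vectored(&buffers)
}

fn main() -> std::io::Result<()> {
    let mut sink: Vec<u8> = Vec::new();
    let written = hello_vectored(&mut sink)?;
    println!("{written} bytes: {:?}", String::from_utf8_lossy(&sink));
    Ok(())
}
```

The returned count is the total across both buffers, matching the 14 that writev returned in the trace (13 bytes of text plus the newline).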
Why? Well, it is complicated... The syscall itself comes from somewhere in here. If we are adventurous enough and go up the stack, we can find printf_core, which is called by vfprintf, which takes us back to printf itself...
There really seems to be a lot of code before we get to the actual syscall... I am sure it is all justified and so on, but for us, it sounds like a lot of unnecessary complexity.
Fortunately, we can just use the syscall directly, bypassing all that magic:
#include <unistd.h>
#include <sys/syscall.h>
int main(void) {
    syscall(SYS_write, 1, "Hello, world!\n", 14);
}
We pass SYS_write as the first argument to syscall; it is nothing more than a constant that represents the syscall number (1 in the case of write on x86-64). The rest of the arguments are syscall-specific, and as described in the man page for write, those are:
- file descriptor (1 for stdout)
- buffer
- number of bytes to write
Let's run it:
musl-gcc -static main.c && strace -c ./a.out
Hello, world!
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         1           write
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           set_tid_address
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000           0         4           total
printf hid another syscall (ioctl) from us, and we are back to a simple write.
This brings us down to four system calls remaining. It still sounds like more than necessary. As there are no more obvious things to chop off, it might be time to put down our axe and approach it with a bit more precision.
Last syscalls standing
Let's start with an easy one. We cannot really get rid of execve, as something (in this case strace) needs to actually execute our program. So even though we see it in the strace output, it is "not really" our "Hello, world!" program that calls it.
When running strace, the process will fork and start tracing the child process. The child, therefore, needs to later execute the desired program that we pass as an argument (the a.out binary in the case of our C program); to do that, it calls the execve syscall.
We can take a peek at that by stracing the strace:
strace strace -c ./a.out
The output is a bit messy, but if we zoom in on the important parts, we can see the clone syscall, which is used to create a new process, followed by ptrace with PTRACE_SEIZE as the __ptrace_request argument, which attaches to the process with the pid that we got as a result of clone:
...
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fb11c716550) = 4031
...
ptrace(PTRACE_SEIZE, 4031, NULL, PTRACE_O_TRACESYSGOOD|PTRACE_O_TRACEEXEC|PTRACE_O_TRACEEXIT) = 0
...
Only after that will the child process run our program with the execve syscall, which we can see by stracing our binary:
strace -e trace=execve ./a.out
execve("./a.out", ["./a.out"], 0x7ffd96e60d30 /* 31 vars */) = 0
Hello, world!
+++ exited with 0 +++
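The "create a child, then have the child execute the target program" pattern is what any process launcher does, minus the ptrace part. As a rough sketch, Rust's std::process::Command follows the same idea; which syscall the standard library uses to create the child is an implementation detail and may vary. The helper name run_child is made up, and echo is assumed to be available on PATH:

```rust
use std::process::Command;

/// Spawn a child process that runs `program` with `args` and capture its stdout.
fn run_child(program: &str, args: &[&str]) -> std::io::Result<String> {
    // The parent creates a child; the child then executes `program`.
    let output = Command::new(program).args(args).output()?;
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}

fn main() -> std::io::Result<()> {
    let out = run_child("echo", &["Hello, world!"])?;
    print!("{out}");
    Ok(())
}
```

Running this sketch under strace -f should show an execve for echo issued from the child, just like the one strace issues for a.out.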
We now know we cannot live without execve, but what about arch_prctl and set_tid_address, then?
To the best of what I have found, those are responsible for setting up thread local storage (TLS).
As we can read in the man page for arch_prctl:
arch_prctl - set architecture-specific thread state
Digging a bit more, this means working with the FS (and GS) segment base registers (FS in particular is used for TLS). Traditionally they could not be set directly from user space, and they are used to locate per-thread context.
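To make "per-thread context" a bit more concrete, here is a small Rust sketch using thread_local!. It does not call arch_prctl itself; it only shows the behavior that thread-local storage provides, namely that each thread sees its own copy of a variable. The names COUNTER, bump and bump_on_new_thread are made up for illustration:

```rust
use std::cell::Cell;
use std::thread;

thread_local! {
    // Every thread gets its own independent copy of COUNTER.
    static COUNTER: Cell<u32> = Cell::new(0);
}

/// Bump the calling thread's copy and return the new value.
fn bump() -> u32 {
    COUNTER.with(|c| {
        c.set(c.get() + 1);
        c.get()
    })
}

/// Bump COUNTER three times on a fresh thread and report what that thread saw.
fn bump_on_new_thread() -> u32 {
    thread::spawn(|| {
        bump();
        bump();
        bump()
    })
    .join()
    .expect("worker thread panicked")
}

fn main() {
    let main_value = bump();
    let worker_value = bump_on_new_thread();
    println!("main saw {main_value}, worker saw {worker_value}, main again {}", bump());
}
```

The worker always starts from zero and ends at 3, while the main thread's copy is unaffected by whatever the worker did.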
Another syscall related to threading is set_tid_address ("set pointer to thread ID"). I did not find great sources on this one, but from reading the man page we can try to reason about it.
set_tid_address will set the clear_child_tid attribute of the calling thread to the address passed to the system call. And as the name (clear_child_tid) suggests, when the thread terminates, the value at that address will be set to 0, or in other words, it will be cleared.
Why is it useful? Again, as per the man page, if applicable the kernel will then perform:
futex(clear_child_tid, FUTEX_WAKE, 1, NULL, NULL, 0);
which can be thought of as releasing the lock of a given memory location and waking up a single thread that is waiting on it. This does not happen for our program since we only have a single thread, so there is nothing to wake up.
If you are familiar with Go, this sounds similar to sync.Cond.
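The "wake up a single waiting thread" idea can be sketched with Rust's Condvar, which plays a role similar to Go's sync.Cond. This is only an analogy for the FUTEX_WAKE call above, not the kernel mechanism itself, and the function name wake_one_waiter is made up for illustration:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Park a waiter thread until a flag is set, then wake it with notify_one.
/// Returns the flag value the waiter observed after waking up.
fn wake_one_waiter() -> bool {
    let pair = Arc::new((Mutex::new(false), Condvar::new()));
    let waiter_pair = Arc::clone(&pair);

    let waiter = thread::spawn(move || {
        let (lock, cvar) = &*waiter_pair;
        let mut done = lock.lock().unwrap();
        // Loop to guard against spurious wakeups.
        while !*done {
            done = cvar.wait(done).unwrap();
        }
        *done
    });

    let (lock, cvar) = &*pair;
    *lock.lock().unwrap() = true; // set the flag while holding the lock
    cvar.notify_one(); // wake a single waiter, loosely like FUTEX_WAKE with a count of 1

    waiter.join().expect("waiter thread panicked")
}

fn main() {
    println!("waiter woke up and saw flag = {}", wake_one_waiter());
}
```

Because the flag is checked under the lock, the waiter finishes correctly whether it starts waiting before or after the notification is sent.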
Okay, we have a better idea of what those system calls do, and we can reasonably suspect that they come from somewhere in libc (musl). At the same time, neither of them is necessary for simply printing Hello, world!. If only we could get rid of libc...
Look! There is one more door in this dark basement, and it leads to an even darker place...
Assembly
There is one language that we can "easily" reach for to write the "Hello, world!" in without all that overhead -- Assembly. We will use 64-bit x86 assembly as this is the machine I am running on.
So... brace yourself and create hello.asm:
section .text
    global _start

_start:
    mov rax, 1          ; write syscall number to rax - 1 is write
    mov rdi, 1          ; 1 is stdout file descriptor
    mov rsi, msg        ; msg is our "Hello, world!\n" defined in .rodata section
    mov rdx, msglen     ; specify message length in rdx - 14 bytes for "Hello, world!\n"
    syscall             ; execute syscall
    mov rax, 60         ; write syscall number to rax - 60 is exit
    mov rdi, 0          ; write program exit code to rdi
    syscall             ; execute syscall

section .rodata
    msg: db "Hello, world!", 10     ; 10 is the ASCII code of the newline
    msglen: equ $ - msg             ; current position minus start of msg
Well, we got through it. The code is pretty simple, and if you do not speak assembly (do not worry, me neither), the comments on the right side should give you an idea of what is going on.
You may have noticed that we actually call syscall twice, which is not exactly what we wanted. However, the second call is just an exit syscall. Technically we could get rid of it and we would still get our Hello, world! printed on the screen.
The catch here is that the CPU would not know that our program is finished, and would try to run the next instruction, which is not there. So this would cause the CPU to try to read some memory that is not accessible by our program, and result in an error beloved by all C programmers:
Hello, world!
[1] 2529 segmentation fault (core dumped) ./hello
Let's be nice to our CPU, accept the exit syscall as necessary, and not count it toward our "one syscall" goal. As we have seen before, strace -c will not show it in the summary anyway.
In most cases one would likely prefer to use the exit_group(2) syscall instead of exit(2), as it exits all threads in the process. That is what most (if not all) exit functions in different standard libraries do. You can see that is what our previous programs (both Rust and C) did by running strace. For this case, exit is completely sufficient.
With that in mind, we can assemble the program with the nasm assembler and link it using ld:
nasm -f elf64 hello.asm
ld -static -o hello hello.o
and feed it to strace:
strace -c ./hello
Hello, world!
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         1           write
  0.00    0.000000           0         1           execve
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000           0         2           total
And there we have it, "Hello, world!" stripped down to a single syscall! Doesn't victory taste sweet? If only not for this smell of Assembly everywhere, and a touch of C flashbacks... And yeah, I know, I know, it was supposed to be in Rust...
Okay, fine, let's look at the positives... At least now we have a chance to rewrite it in Rust...
We are going to embark on that journey in the next part.