The Walking Dead (but with processes)

#linux #systems #init #kernel

What's a zombie?

When interviewing systems engineers, a common answer to "what is a zombie process?" is "a process which is dead and doesn't have a parent", or something like "you can kill it with kill -9 <pid>". Both answers reflect a misunderstanding of zombie processes. Let's dig into what exactly a zombie is, and how Linux is supposed to handle them.

A zombie process is a process which has exited and whose parent has not yet called the wait()/waitpid() syscall on its process ID (PID). In other words, the process exited and left a status code for the parent to read, but the parent has yet to read it from the process table. Every process that terminates is briefly a zombie; zombies only become an issue when they stick around too long.

Some key points:

They cannot be "killed": SIGKILL and SIGTERM are ineffective, as the process is already dead.
They are not orphan processes.
They can be removed by orphaning them: kill the parent, and init adopts the zombie and reaps it.
They use almost no resources; they simply occupy a process table entry (and its PID).
Every process, upon termination, is a zombie until its parent process calls wait().

This little snippet demonstrates how a zombie can be created:

#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid;
    int status;

    if ((pid = fork()) < 0) {
        // Failed to fork
        exit(1);
    }
    // Child process starts execution from here, child pid == 0
    if (pid == 0)
        // All the child does is exit
        exit(0);

    // Parent continues from here. A zombie roams for 100 seconds..brains..
    sleep(100);

    // After 100 seconds, the parent removes the zombie by calling wait()
    pid = wait(&status);

    return 0;
}

When running the above snippet, the zombie shows up as <defunct> in the output of ps:

129274  \_ ./zombie 
129275    \_ [zombie] <defunct>

This means the parent (PID 129274) hasn't called wait()/waitpid() on PID 129275. Until it does, 129275 will occupy a process table entry. Since zombies take up no other resources, they typically aren't an issue.

The problem emerges when so many exist that you run out of PIDs, meaning no new processes can be created (e.g. you can't ssh to a server that can't spawn you a shell).

You can see the PID limit with the following command:

# cat /proc/sys/kernel/pid_max
131072

It's a good idea to monitor process count on your machines. If a long-running process is leaking zombies, it will eventually fill the process table, which could take down your service, prevent remote access, etc.

One way to remove a zombie is to kill its parent. The zombie is then re-parented to init (PID 1), which should call wait() on its child processes.

Zombies usually appear when a program has a flaw where it doesn't call wait() on its children. But what if the process got stuck in some infinite loop and never called wait()? What if that process was named init?

When init doesn't do its job...

A while ago I came across a number of servers that had thousands of zombie processes. My team had simply been rebooting these boxes to clear them up, but I became curious after noticing something odd:

All the zombie processes had init as a parent!

This means that processes were exiting, and init (as their parent) was never calling wait(). It also means that if you were to kill a zombie's parent to make init the parent, init would do nothing to help you.

init, as one of its core functions, should be calling wait() on its child processes to clear out any zombies, so what was happening?

% sudo strace -p 1 -r -s 500
Process 1 attached - interrupt to quit
     0.000000 write(8, "init: serial-ttyS main process ended, respawning\r\n", 50) = ? ERESTARTSYS (To be restarted)
   273.565117 --- SIGCHLD (Child exited) @ 0 (0) ---
     0.000062 write(4, "\0", 1)         = 1
     0.000066 rt_sigreturn(0x4)         = 1
     0.000046 write(8, "init: serial-ttyS main process ended, respawning\r\n", 50) = ? ERESTARTSYS (To be restarted)
     0.003033 --- SIGCHLD (Child exited) @ 0 (0) ---
     0.000026 write(4, "\0", 1)         = 1
     0.000048 rt_sigreturn(0x4)         = 1
     0.000039 write(8, "init: serial-ttyS main process ended, respawning\r\n", 50) = ? ERESTARTSYS (To be restarted)
     7.825845 --- SIGCHLD (Child exited) @ 0 (0) ---

strace showed init stuck in a loop trying to write to a tty device, with each write getting ERESTARTSYS back (meaning: please re-attempt that write). Init had no mechanism to handle that error, so it got stuck in an infinite loop.

As for why it was getting ERESTARTSYS: the tty it was writing to in this case was a serial tty. Grepping the serial tty driver code for ERESTARTSYS turned up:

/**
 *  tty_send_xchar  -   send priority character
 *
 *  Send a high priority character to the tty even if stopped
 *
 *  Locking: none for xchar method, write ordering for write method.
 */

int tty_send_xchar(struct tty_struct *tty, char ch)
{
    int was_stopped = tty->stopped;

    if (tty->ops->send_xchar) {
        tty->ops->send_xchar(tty, ch);
        return 0;
    }

    if (tty_write_lock(tty, 0) < 0)
        return -ERESTARTSYS;

    if (was_stopped)
        start_tty(tty);
    tty->ops->write(tty, &ch, 1);
    if (was_stopped)
        stop_tty(tty);
    tty_write_unlock(tty);
    return 0;
}

Writing to a serial tty takes a lock in the form of atomic_write_lock. Searching for info regarding this lock I found a bug:

>>  Possible unsafe locking scenario:
>>
>>        CPU0                    CPU1
>>        ----                    ----
>>   lock(&tty->termios_rwsem);
>>                                lock(&tty->atomic_write_lock);
>>                                lock(&tty->termios_rwsem);
>>   lock(&tty->atomic_write_lock);

Putting it all together: one bug exposed another.

If the serial tty gets into a deadlock, any logging that init tries to perform against it will leave init stuck in a loop of write -> ERESTARTSYS -> write ...

If init stays in this state long enough, without the tty resetting, then zombies will pile up and you'll run out of PIDs.

How should init handle child processes?

Looking at the sysvinit source in src/init.c, we see a couple of different ways init handles its child processes. Here we see init establishing a signal handler for SIGCHLD:

SETSIG(sa, SIGCHLD,  chld_handler, SA_RESTART);

void chld_handler(int sig)
{
        CHILD           *ch;
        int             pid, st;
        int             saved_errno = errno;

        /*
         *      Find out which process(es) this was (were)
         */
        while((pid = waitpid(-1, &st, WNOHANG)) != 0) {
                if (errno == ECHILD) break;
                for( ch = family; ch; ch = ch->next )
                        if ( ch->pid == pid && (ch->flags & RUNNING) ) {
                                INITDBG(L_VB,
                                        "chld_handler: marked %d as zombie",
                                        ch->pid);
                                ADDSET(got_signals, SIGCHLD);
                                ch->exstat = st;
                                ch->flags |= ZOMBIE;
                                if (ch->new) {
                                        ch->new->exstat = st;
                                        ch->new->flags |= ZOMBIE;
                                }
                                break;
                        }
                if (ch == NULL) {
                        INITDBG(L_VB, "chld_handler: unknown child %d exited.",
                                pid);
                }
        }
        errno = saved_errno;
}

chld_handler runs whenever init receives SIGCHLD, the signal sent to a parent when one of its children dies. Init handles zombies in a push model here: it calls waitpid() to collect the exit status, then flags its own record of the process as a zombie for later cleanup:

ch->flags |= ZOMBIE;

Earlier, I said that when init gets stuck in this state it cannot reap any of its own children that exit. But it has a signal handler for exactly this, so what's going on?

The initlog() function blocks all signals while logging:

/*
 *      Re-establish connection with syslogd every time.
 *      Block signals while talking to syslog.
 */
sigfillset(&nmask);
sigprocmask(SIG_BLOCK, &nmask, &omask);
openlog("init", 0, LOG_DAEMON);
syslog(LOG_INFO, "%s", buf);
closelog();
sigprocmask(SIG_SETMASK, &omask, NULL);

So whenever initlog() is called anywhere in the main loop and reaches syslog(LOG_INFO, "%s", buf), we hit our earlier bug. syslog() respects ERESTARTSYS from write(), so we get stuck in here while all signals (including SIGCHLD) are blocked.

Outside of chld_handler's mechanism for reaping zombies, the rest of the main loop works with a variable called family, which stores all the child processes of init. It loops over these looking for processes to kill and/or reap from the process table if they have died.