Ensuring healthy Node.js program using watchdog timer

Gajus Kuizinas

If you have a Node.js program that is designed to pull tasks and process them asynchronously, then you should watch out for hanging processes.

Consider an example of how such a program could look:

import delay from 'delay';

const getNextJob = async () => { /* ... */ };
const doJob = async () => { /* ... */ };

const main = async () => {
  while (true) {
    const maybeNextJob = await getNextJob();

    if (maybeNextJob) {
      await doJob(maybeNextJob);
    } else {
      await delay(1000);
    }
  }
};

main();

getNextJob is used to pull task instructions from some arbitrary database, and doJob is used to execute those tasks.

The risk here is that any asynchronous task might hang indefinitely, e.g. if getNextJob is pulling data from a remote database, the database socket can hang indefinitely. This is almost always a bug.

In my specific case, I ran into a bug in node-postgres that caused a connection to hang in the ClientRead state. A connection is in this state when the server has seen a protocol message that begins a query but has not yet returned to the idle state, which happens when the server sends the ReadyForQuery response at the end of the query. PostgreSQL does not have a timeout for ClientRead, so the equivalent of my getNextJob was hanging indefinitely.

The best way to protect against such a risk is to add a timeout to the loop that pulls and executes tasks. The timeout should be refreshed on every iteration; if it is not refreshed in time, terminate the process and log enough detail to identify what caused it to hang. This pattern is called a watchdog timer.

Here is what an example implementation of a watchdog timer looks like:

import delay from 'delay';

const getNextJob = async () => { /* ... */ };
const doJob = async () => { /* ... */ };

const main = async () => {
  const timeoutId = setTimeout(() => {
    console.error('watchdog timer timeout; forcing program termination');

    process.exit(1);
  }, 30 * 1000);

  timeoutId.unref();

  while (true) {
    timeoutId.refresh();

    const maybeNextJob = await getNextJob();

    if (maybeNextJob) {
      await doJob(maybeNextJob);
    } else {
      await delay(1000);
    }
  }
};

main();

This creates a timer that is refreshed at the beginning of every loop iteration that checks for new tasks. Calling unref() ensures that the timer alone does not keep the process alive. The 30-second timeout covers the entire cycle (i.e. getNextJob and doJob) and, because you are forcing sudden termination, it should be well above whatever the internal task limits are.
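One way to pick a watchdog timeout that sits well above the internal task limits is to derive it from those limits rather than guess it. The sketch below is only an illustration: computeWatchdogTimeout, the individual limits and the safety factor are all hypothetical and do not come from the original program.

```javascript
// All limits below are hypothetical values chosen for illustration.
const computeWatchdogTimeout = ({
  getNextJobLimit,
  doJobLimit,
  idleDelay,
  safetyFactor,
}) => {
  // One cycle is getNextJob followed by either doJob or the idle delay,
  // so the slower of the two branches bounds the cycle.
  const worstCaseCycle = getNextJobLimit + Math.max(doJobLimit, idleDelay);

  // Leave headroom, because hitting the watchdog kills the process.
  return worstCaseCycle * safetyFactor;
};

const watchdogTimeout = computeWatchdogTimeout({
  getNextJobLimit: 5 * 1000,
  doJobLimit: 10 * 1000,
  idleDelay: 1000,
  safetyFactor: 2,
});

console.log(watchdogTimeout); // 30000
```

With these assumed limits, the result matches the 30 seconds used in the example above; if your tasks have different internal limits, the watchdog timeout should move with them.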

I had to implement the above pattern in several of my applications to prevent these ghost processes from hanging in what is otherwise a large-scale deployment of many processes orchestrated using Kubernetes. As such, I have abstracted the above logic, plus some sugar, into a module, watchdog-timer. For the most part, it can be used exactly like the earlier example using setTimeout:

import {
  createWatchdogTimer,
} from 'watchdog-timer';
import delay from 'delay';

const getNextJob = async () => { /* ... */ };
const doJob = async () => { /* ... */ };

const main = async () => {
  const watchdogTimer = createWatchdogTimer({
    onTimeout: () => {
      console.error('watchdog timer timeout; forcing program termination');

      process.exit(1);
    },
    timeout: 30 * 1000,
  });

  while (true) {
    watchdogTimer.reset();

    const maybeNextJob = await getNextJob();

    if (maybeNextJob) {
      await doJob(maybeNextJob);
    } else {
      await delay(1000);
    }
  }
};

main();
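The timeout handlers shown so far log only a generic message, whereas the goal stated earlier is to log enough detail to identify what caused the hang. A minimal sketch of one way to do that with plain setTimeout is to record which step the loop is currently in and include it in the log. createStageWatchdog and its methods are hypothetical names used for illustration; they are not part of watchdog-timer.

```javascript
// `createStageWatchdog` is a hypothetical helper, not part of watchdog-timer.
const createStageWatchdog = ({ timeout, onTimeout }) => {
  let stage = 'not started';
  let stageEnteredAt = Date.now();

  // Snapshot of what the loop was doing and for how long.
  const describe = () => ({
    stage,
    msInStage: Date.now() - stageEnteredAt,
  });

  const timer = setTimeout(() => {
    onTimeout(describe());
  }, timeout);

  // Do not let the watchdog alone keep the process alive.
  timer.unref();

  return {
    // Record which step is about to run; this does not touch the timer.
    enter: (nextStage) => {
      stage = nextStage;
      stageEnteredAt = Date.now();
    },
    // Restart the countdown; call this once per loop iteration.
    refresh: () => {
      timer.refresh();
    },
    describe,
    stop: () => {
      clearTimeout(timer);
    },
  };
};

const watchdog = createStageWatchdog({
  timeout: 30 * 1000,
  onTimeout: (details) => {
    console.error('watchdog timer timeout; forcing program termination', details);

    process.exit(1);
  },
});

// Inside the loop you would call refresh() once per iteration and
// enter('getNextJob') / enter('doJob') right before each awaited step.
watchdog.refresh();
watchdog.enter('getNextJob');

console.log(watchdog.describe().stage); // getNextJob
```

If the watchdog fires, the logged details show whether the process was stuck pulling a job or executing one, which narrows down where to look.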

It is important to emphasize that a watchdog timer is an in-process guard, i.e. if something is blocking the event loop, then the timeout callback is never going to run. To protect yourself against that, you also need an external service that checks the liveness of your application. If you are using Kubernetes, then this functionality is served by the livenessProbe, and it can be implemented using the lightship NPM module.
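To see why an in-process timer cannot guard against a blocked event loop, consider the small demonstration below. It is a sketch written for illustration: the busy-wait loop stands in for any synchronous code that never yields, and blockEventLoop is a hypothetical helper.

```javascript
const startedAt = Date.now();
let firedAfterMs = null;

// A stand-in for a watchdog: it is due after 10 ms.
setTimeout(() => {
  firedAfterMs = Date.now() - startedAt;

  console.log(`timer due after 10 ms ran after ${firedAfterMs} ms`);
}, 10);

// Simulate a bug that blocks the event loop with synchronous work.
const blockEventLoop = (durationMs) => {
  const blockUntil = Date.now() + durationMs;

  while (Date.now() < blockUntil) {
    // Busy wait; no timers or I/O callbacks can run here.
  }
};

blockEventLoop(200);

// The timer is long overdue, yet its callback has not run.
console.log(firedAfterMs === null); // true
```

The callback only runs once the synchronous work ends. If that work never ends, the timer never fires, which is why the check has to come from outside the process.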

watchdog-timer nicely integrates with Lightship:

import {
  createWatchdogTimer,
} from 'watchdog-timer';
import {
  createLightship,
} from 'lightship';

const main = async () => {
  const lightship = createLightship({
    timeout: 5 * 1000,
  });

  lightship.signalReady();

  lightship.registerShutdownHandler(async () => {
    console.log('shutting down');
  });

  const watchdogTimer = createWatchdogTimer({
    onTimeout: () => {
      // If you do not call `destroy()`, then
      // `onTimeout` is going to be called again on the next timeout.
      watchdogTimer.destroy();

      lightship.shutdown();
    },
    timeout: 1000,
  });

  while (true) {
    if (lightship.isServerShuttingDown()) {
      console.log('detected that the service is shutting down; terminating the event loop');

      break;
    }

    // Reset watchdog-timer on each loop.
    watchdogTimer.reset();

    // `foo` is an arbitrary routine that might hang indefinitely,
    // e.g. due to a hanging database connection socket.
    await foo();
  }

  watchdogTimer.destroy();
};

main();

To sum up, in order to avoid hanging processes, you must have an in-process watchdog to detect when your application is sitting idle or not performing the expected steps, and you must use an out-of-process watchdog to ensure that the application is not stuck with a blocked event loop.
