
Linux: systemd-unit files edit, restart on failure and email notifications

Arseny Zinchenko. Originally published at rtfm.co.ua ・8 min read

We have a RabbitMQ service that sometimes goes down.

So we need to:

  1. restart it if it exits with a failure
  2. send an email notification

Let’s do it via RabbitMQ’s systemd service (there are other options, e.g. monit: see the post Monit: monitoring and restarting NGINX).

We will use two options here:

  • RestartSec=: the delay before a restart, to give any pending disk I/O a chance to finish, just in case
  • Restart=: the condition under which the service is restarted

The available values for Restart= are:

Table 2. Exit causes and the effect of the Restart= settings on them

Restart settings/Exit causes   no   always   on-success   on-failure   on-abnormal   on-abort   on-watchdog
Clean exit code or signal      -    X        X            -            -             -          -
Unclean exit code              -    X        -            X            -             -          -
Unclean signal                 -    X        -            X            X             X          -
Timeout                        -    X        -            X            X             -          -
Watchdog                       -    X        -            X            X             -          X
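
To make the table easier to read, here is a minimal sketch that encodes it as a lookup. The exit-cause labels (clean, unclean-exit, unclean-signal, timeout, watchdog) are names made up for this example, not systemd identifiers.

```shell
#!/bin/sh
# Sketch: look up the Restart= table for one setting and one exit cause.
# The cause labels are hypothetical names used only in this example.
would_restart() {
    setting="$1"
    cause="$2"
    case "$setting:$cause" in
        no:*)                        echo no ;;
        always:*)                    echo yes ;;
        on-success:clean)            echo yes ;;
        on-failure:unclean-exit)     echo yes ;;
        on-failure:unclean-signal)   echo yes ;;
        on-failure:timeout)          echo yes ;;
        on-failure:watchdog)         echo yes ;;
        on-abnormal:unclean-signal)  echo yes ;;
        on-abnormal:timeout)         echo yes ;;
        on-abnormal:watchdog)        echo yes ;;
        on-abort:unclean-signal)     echo yes ;;
        on-watchdog:watchdog)        echo yes ;;
        *)                           echo no ;;
    esac
}

would_restart on-failure unclean-signal   # e.g. killed with SIGKILL: prints yes
would_restart on-success unclean-signal   # prints no
```

So a process killed by a signal is restarted with on-failure, but not with on-success.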

systemd-unit files edit

The default RabbitMQ unit file is /lib/systemd/system/rabbitmq-server.service.

You can observe it using systemctl cat:

$ admin@bttrm-production-console:~$ systemctl cat rabbitmq-server.service
/lib/systemd/system/rabbitmq-server.service

[Unit]
Description=RabbitMQ Messaging Server
After=network.target

[Service]
Type=simple
User=rabbitmq
SyslogIdentifier=rabbitmq
LimitNOFILE=65536
ExecStart=/usr/sbin/rabbitmq-server
ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmq-server-wait
ExecStop=/usr/sbin/rabbitmqctl stop

[Install]
WantedBy=multi-user.target

Do not edit it (or any other file) directly in /lib/systemd/system/, as it will be overwritten by the next rabbitmq-server package upgrade.

When you need to change a service’s default behavior, put your files in the /etc/systemd/system directory instead.
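
This works because a unit file in /etc/systemd/system takes precedence over one with the same name in /lib/systemd/system. A rough sketch of that lookup order follows. It is simplified: the real search path has more directories, and the optional second argument (a root prefix) is a hypothetical helper that only exists so the function can be tried against a scratch directory.

```shell
#!/bin/sh
# Sketch: print the unit file systemd would prefer for a given unit name.
# Only three directories are checked, highest priority first.
effective_unit_path() {
    unit="$1"
    root="${2:-}"   # hypothetical root prefix, for trying this out safely
    for dir in /etc/systemd/system /run/systemd/system /lib/systemd/system; do
        if [ -f "$root$dir/$unit" ]; then
            printf '%s\n' "$dir/$unit"
            return 0
        fi
    done
    return 1
}

effective_unit_path rabbitmq-server.service || echo "unit file not found"
```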

To edit an existing service, use systemctl edit foo.service with the --full option:

# root@bttrm-dev-console:/home/admin# systemctl edit --full rabbitmq-server.service

This opens a temporary file (something like /etc/systemd/system/rabbitmq-server.service.d/.#override.conf6a0bfbaa5ed8b8d8) with the current /lib/systemd/system/rabbitmq-server.service content, which you can now update.
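
As an alternative (not what this post does), systemctl edit without --full keeps the packaged unit untouched and stores only your changes in a drop-in file, rabbitmq-server.service.d/override.conf. A minimal sketch of what such a drop-in could contain is below. UNIT_DIR is a variable invented for this example. It defaults to a scratch directory so the snippet is safe to run, and on a real host it would be /etc/systemd/system.

```shell
#!/bin/sh
# Sketch: write a drop-in that only overrides the restart behavior.
# UNIT_DIR is a hypothetical variable; it defaults to a scratch directory.
UNIT_DIR="${UNIT_DIR:-$(mktemp -d)}"
dropin_dir="$UNIT_DIR/rabbitmq-server.service.d"

mkdir -p "$dropin_dir"
cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
Restart=on-failure
RestartSec=60s
EOF

cat "$dropin_dir/override.conf"
# On a real host, follow with: systemctl daemon-reload
```

The drop-in survives package upgrades just like a full copy does, and it also picks up any changes a new package version makes to the rest of the unit.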

Restart on failure

Add both options here, Restart=on-failure and RestartSec=60s:

[Unit]
Description=RabbitMQ Messaging Server
After=network.target

[Service]
Type=simple
User=rabbitmq
SyslogIdentifier=rabbitmq
LimitNOFILE=65536
ExecStart=/usr/sbin/rabbitmq-server
ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmq-server-wait
ExecStop=/usr/sbin/rabbitmqctl stop

Restart=on-failure
RestartSec=60s

[Install]
WantedBy=multi-user.target

Re-read systemd’s config files:

# root@bttrm-dev-console:/home/admin# systemctl daemon-reload

The new content is saved as /etc/systemd/system/rabbitmq-server.service, which overrides the packaged unit file.
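
Before killing anything, it may be worth checking that systemd really sees the new settings. The sketch below assumes the KEY=VALUE format printed by systemctl show. The helper name restart_policy_ok is made up for this example, and the exact RestartUSec value in the sample input is an assumption.

```shell
#!/bin/sh
# Sketch: read KEY=VALUE lines (as printed by `systemctl show -p ...`)
# from stdin and succeed only if Restart=on-failure is among them.
restart_policy_ok() {
    grep -qx 'Restart=on-failure'
}

# Real usage needs systemd, so it is left as a comment:
#   systemctl show -p Restart -p RestartUSec rabbitmq-server.service | restart_policy_ok

# Demo with sample input (the RestartUSec value format is an assumption)
if printf 'Restart=on-failure\nRestartUSec=1min\n' | restart_policy_ok; then
    echo "restart policy applied"
fi
```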

Now get RabbitMQ’s server PID:

# root@bttrm-dev-console:/home/admin# systemctl status rabbitmq-server.service | grep PID
Main PID: 14668 (rabbitmq-server)
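
If you want the PID in a variable rather than reading it off the screen, something like the following sketch could work. It parses the "Main PID:" line in the form shown above. The function name is made up for this example.

```shell
#!/bin/sh
# Sketch: extract the main PID from `systemctl status` output on stdin.
main_pid_from_status() {
    awk '/Main PID:/ { print $3; exit }'
}

# Sample line copied from the status output above
pid=$(printf ' Main PID: 14668 (rabbitmq-server)\n' | main_pid_from_status)
echo "$pid"   # prints 14668
```

On a real host the input would come from systemctl status rabbitmq-server.service instead of printf.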

Kill it with SIGKILL (see the post Linux & FreeBSD: the kill and nohup commands, signals and process management) so that the on-failure condition is triggered:

# root@bttrm-dev-console:/home/admin# kill -9 14668

Check its status now:

# root@bttrm-dev-console:/home/admin# systemctl status rabbitmq-server.service
● rabbitmq-server.service - RabbitMQ Messaging Server
Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
Active: activating (auto-restart) (Result: signal) since Thu 2019-02-28 12:08:32 EET; 4s ago
Process: 7093 ExecStop=/usr/sbin/rabbitmqctl stop (code=exited, status=0/SUCCESS)
Main PID: 14668 (code=killed, signal=KILL)

Logs:

...
Mar 01 13:26:00 bttrm-dev-console systemd[1]: rabbitmq-server.service: Main process exited, code=killed, status=9/KILL
Mar 01 13:26:00 bttrm-dev-console rabbitmq[27392]: Stopping and halting node 'rabbit@bttrm-dev-console'
...
Mar 01 13:26:00 bttrm-dev-console systemd[1]: rabbitmq-server.service: Unit entered failed state.
Mar 01 13:26:00 bttrm-dev-console systemd[1]: rabbitmq-server.service: Failed with result 'signal'.
...

And after one minute:

# root@bttrm-dev-console:/home/admin# systemctl status rabbitmq-server.service
● rabbitmq-server.service - RabbitMQ Messaging Server
Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
Active: activating (start-post) since Thu 2019-02-28 12:09:33 EET; 2s ago
...
Feb 28 12:09:33 bttrm-stage-console systemd[1]: rabbitmq-server.service: Service hold-off time over, scheduling restart.
Feb 28 12:09:33 bttrm-stage-console systemd[1]: Stopped RabbitMQ Messaging Server.
Feb 28 12:09:33 bttrm-stage-console systemd[1]: Starting RabbitMQ Messaging Server
...

Logs again:

Mar 01 13:27:01 bttrm-dev-console systemd[1]: rabbitmq-server.service: Service hold-off time over, scheduling restart.
Mar 01 13:27:01 bttrm-dev-console systemd[1]: Stopped RabbitMQ Messaging Server.
Mar 01 13:27:01 bttrm-dev-console systemd[1]: Starting RabbitMQ Messaging Server
...
Mar 01 13:27:01 bttrm-dev-console rabbitmq[27526]: Waiting for 'rabbit@bttrm-dev-console' 
...
Mar 01 13:27:01 bttrm-dev-console rabbitmq[27526]: pid is 27533 ...
Mar 01 13:27:04 bttrm-dev-console systemd[1]: Started RabbitMQ Messaging Server.
...

“Service hold-off time over, scheduling restart” – this is our 60-second delay.
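
A quick sanity check of the delay, using the two journal timestamps from the logs above (13:26:00 for the failure, 13:27:01 for the scheduled restart). This is just arithmetic, and the helper name is made up.

```shell
#!/bin/sh
# Sketch: convert HH:MM:SS to seconds and compute the gap between two events.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

failed_at=$(to_seconds 13:26:00)    # "Main process exited"
restart_at=$(to_seconds 13:27:01)   # "Service hold-off time over"
echo $((restart_at - failed_at))    # prints 61
```

That is the 60 seconds from RestartSec plus about a second of overhead.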

email notification

Now let’s add an email notification to be sent if RabbitMQ goes down with an error.

Send a test email first:

# root@bttrm-dev-console:/home/admin# echo "Stage RabbitMQ restarted on failure!" | mailx -s "RabbitMQ failure notice" admin@example.com

Now you can use ExecStopPost= or OnFailure=. OnFailure looks better – let’s use it.
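
For comparison, here is a rough sketch of what the ExecStopPost= route could look like. ExecStopPost= runs on every stop, clean or not, so the script has to decide for itself whether to send anything. This sketch assumes that systemd passes a $SERVICE_RESULT variable to ExecStopPost= commands (as far as I know, newer systemd versions do; check systemd.exec(5) for yours). The function name is hypothetical.

```shell
#!/bin/sh
# Sketch of a script that could be called from ExecStopPost=.
# Assumption: systemd sets SERVICE_RESULT to "success" on a clean stop.
notify_needed() {
    [ "${SERVICE_RESULT:-success}" != "success" ]
}

if notify_needed; then
    # A real script would pipe a message to mailx here
    echo "would send mail: result=${SERVICE_RESULT:-}"
else
    echo "clean stop, no mail"
fi
```

With OnFailure= this check is not needed at all, because systemd only triggers it when the unit enters the failed state.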

Create the /etc/systemd/system/rabbitmq-notify-email@.service file:

[Unit]
Description=%i failure email notification 

[Service]
Type=oneshot
ExecStart=/bin/bash -c '/bin/systemctl status %i | /usr/bin/mailx -s "[%i] failure notification" admin@example.com'
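
A note on %i: the "@" in the file name makes this a template unit, and %i is replaced with the instance name, i.e. the part between "@" and ".service" in the name the unit is started with. The sketch below mimics that substitution with plain shell. The function name is made up for this example.

```shell
#!/bin/sh
# Sketch: show what %i expands to for a given instantiated unit name.
instance_of() {
    name="${1#*@}"                      # drop everything up to the first "@"
    printf '%s\n' "${name%.service}"    # drop the ".service" suffix
}

instance_of rabbitmq-notify-email@rabbitmq-server.service   # prints rabbitmq-server
```

So when the unit is started as rabbitmq-notify-email@rabbitmq-server.service, the command above runs systemctl status rabbitmq-server and the subject becomes "[rabbitmq-server] failure notification".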

Add the OnFailure option to the [Unit] block of rabbitmq-server.service using systemctl edit:

[Unit]
Description=RabbitMQ Messaging Server
After=network.target
OnFailure=rabbitmq-notify-email@%i.service
...

Do not forget to reload systemd files:

# root@bttrm-dev-console:/home/admin# systemctl daemon-reload

Kill RabbitMQ again:

# root@bttrm-dev-console:/home/admin# kill -9 29970

Check logs:

...
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Main process exited, code=killed, status=9/KILL
Feb 28 13:55:33 bttrm-dev-console rabbitmq[30476]: Stopping and halting node 'rabbit@bttrm-dev-console' ...
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Unit entered failed state.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Triggering OnFailure= dependencies.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Failed with result 'signal'.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Starting rabbitmq-server failure email notification...
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Started rabbitmq-server failure email notification.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Service hold-off time over, scheduling restart.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Stopped RabbitMQ Messaging Server.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Starting RabbitMQ Messaging Server
...
Feb 28 13:55:34 bttrm-dev-console rabbitmq[30619]: Waiting for 'rabbit@bttrm-dev-console'
...
Feb 28 13:55:34 bttrm-dev-console rabbitmq[30619]: pid is 30625 ...
Feb 28 13:55:37 bttrm-dev-console systemd[1]: Started RabbitMQ Messaging Server.
...

The two lines to look for here:

  1. Triggering OnFailure= dependencies.
  2. Started rabbitmq-server failure email notification.

Okay – all works.

Mail logs:

# root@bttrm-dev-console:/home/admin# tail /var/log/exim4/mainlog
2019-02-28 13:48:58 1gzK7S-0007Td-Bt H=alt2.aspmx.l.google.com [2a00:1450:400b:c01::1b] Network is unreachable
2019-02-28 13:51:09 1gzK7S-0007Td-Bt H=alt1.aspmx.l.google.com [172.217.192.27] Connection timed out
2019-02-28 13:51:42 1gzK7S-0007Td-Bt => admin@example.com R=dnslookup T=remote_smtp H=alt2.aspmx.l.google.com [74.125.193.27] X=TLS1.2:ECDHE_RSA_CHACHA20_POLY1305:256 CV=yes DN="C=US,ST=California,L=Mountain View,O=Google LLC,CN=mx.google.com" C="250 2.0.0 OK  1551354702 x34si4667116edb.147 - gsmtp"
2019-02-28 13:51:42 1gzK7S-0007Td-Bt Completed
2019-02-28 13:53:53 1gzK16-0006pp-NU H=alt2.aspmx.l.google.com [74.125.193.27] Connection timed out
2019-02-28 13:53:53 1gzK16-0006pp-NU H=aspmx2.googlemail.com [2800:3f0:4003:c02::1a] Network is unreachable
2019-02-28 13:54:59 1gzK16-0006pp-NU => admin@example.com R=dnslookup T=remote_smtp H=aspmx3.googlemail.com [74.125.193.26] X=TLS1.2:ECDHE_RSA_CHACHA20_POLY1305:256 CV=yes DN="C=US,ST=California,L=Mountain View,O=Google LLC,CN=mx.google.com" C="250 2.0.0 OK  1551354899 s45si1200185edm.357 - gsmtp"
2019-02-28 13:54:59 1gzK16-0006pp-NU Completed
2019-02-28 13:54:59 End queue run: pid=29201
2019-02-28 13:55:33 1gzKHl-0007xl-Lm <= root@dev.backend-console-internal.example.com U=root P=local S=1331

If you didn’t get an email, check exim’s queue:

# root@bttrm-dev-console:/home/admin# exim -bp
0m  1.2K 1gzL3R-0000dn-5h 
<root@dev.backend-console-internal.example.com>
admin@example.com

The message is stuck in the queue.

Run the queue manually:

# root@bttrm-dev-console:/home/admin# runq

Check logs again:

# root@bttrm-dev-console:/home/admin# cat /var/log/exim4/mainlog | grep 1gzL3R-0000dn-5h
2019-02-28 14:44:49 1gzL3R-0000dn-5h <= root@dev.backend-console-internal.example.com U=root P=local S=1241
2019-02-28 14:46:48 1gzL3R-0000dn-5h H=aspmx.l.google.com [2607:f8b0:400d:c0f::1a] Network is unreachable
2019-02-28 14:46:49 1gzL3R-0000dn-5h => admin@example.com R=dnslookup T=remote_smtp H=aspmx.l.google.com [173.194.68.26] X=TLS1.2:ECDHE_RSA_CHACHA20_POLY1305:256 CV=yes DN="C=US,ST=California,L=Mountain View,O=Google LLC,CN=mx.google.com" C="250 2.0.0 OK  1551358009 w11si208223qvc.68 - gsmtp"
2019-02-28 14:46:49 1gzL3R-0000dn-5h Completed

And the email is delivered.

To work around the delivery issue (not sure why exim won’t send the messages right away), add a dirty “hack” to /etc/systemd/system/rabbitmq-notify-email@.service: an ExecStartPost option that runs the queue:

...
ExecStart=/bin/bash -c '/bin/systemctl status %i | /usr/bin/mailx -s "[%i] failure notification" admin@example.com'
ExecStartPost=runq
...

To remove old messages from the queue, use their IDs:

# root@bttrm-dev-console:/home/admin# exim -Mrm 1gzVar-0003oO-Rf
Message 1gzVar-0003oO-Rf has been removed
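
If there are many stuck messages, the IDs can be pulled out of the exim -bp listing instead of copying them by hand. A small sketch, assuming the ID is the third field of each entry as in the listing above. The function name is made up for this example.

```shell
#!/bin/sh
# Sketch: print the message IDs found in `exim -bp` output on stdin.
# Assumption: the ID is the third whitespace-separated field of an entry.
queue_ids() {
    awk '$3 ~ /^[0-9A-Za-z]+-[0-9A-Za-z]+-[0-9A-Za-z]+$/ { print $3 }'
}

# Sample input based on the queue listing above
printf '0m  1.2K 1gzL3R-0000dn-5h <root@dev.backend-console-internal.example.com>\n          admin@example.com\n' | queue_ids
# prints 1gzL3R-0000dn-5h
```

On a real host, exim -bp | queue_ids would list every queued ID, and each one can then be passed to exim -Mrm. Be careful: that removes everything in the queue, not only the test messages.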

Done.


Discussion

 

I just implemented something very similar and struggled with the email not being sent.
After some investigation, I found out that the cause was that mailx is sending messages asynchronously by default, which is not compatible with the way systemd works (see the last post here)

Therefore the following option to mail is necessary: sendwait, i.e. your full mail command would be:
/usr/bin/mailx -Ssendwait -s "[%i] failure notification" admin@example.com

 

Thanks, @jmon, it's interesting.
Although I'm using mailx via local ssmtp at my Arch Linux workstations (both at work and at home) for this - didn't notice such issues.

 

I'm not at all an expert in that matter, and won't investigate further. It can probably help others in my situation.
However if you're interested, I can provide you some configuration information if you guide me a bit.
Anyway thank you for your very nice work!

 
 

What is the meaning of "%i"?