Madhuri Malviya for CloudRaft

Posted on Nov 15, 2023 • Originally published at cloudraft.io on Nov 10, 2023

Linux Troubleshooting For SREs

Introduction

As a Linux user or administrator, mastering the art of troubleshooting is crucial. Regardless of how well-designed and optimized your systems may be, issues are bound to arise from time to time. These can range from minor hiccups to critical problems that hinder the performance and availability of your Linux machines or containers. In this article, we will explore real-life examples of performance issues and provide you with a collection of useful Linux commands to troubleshoot everything from CPU and I/O to network and errors.

Common causes of performance issues

Performance issues can be caused by a variety of factors. Some common causes include insufficient memory or CPU resources, disk I/O bottlenecks, network congestion, inefficient code and bugs. In addition, misconfigurations, outdated software, and runaway or zombie processes can also impact performance.

The importance of Linux troubleshooting

By diligently troubleshooting and resolving the issues, you not only ensure the smooth operation of your systems but also minimize the MTTR (Mean Time to Repair) – the average time it takes to fix a problem.

Mastering Linux troubleshooting allows you to swiftly diagnose and resolve performance bottlenecks, errors, and other issues that can potentially disrupt your operations.

To efficiently diagnose and resolve problems on your Linux machines, there are two widely used approaches: the RED and USE methodologies.

RED Methodology

The RED methodology focuses on three indicators, Rate, Errors, and Duration, and is aimed at request-driven systems such as modern web applications. The idea is to spot performance issues early and keep the application running smoothly. Let us understand it with an example. Suppose your service is becoming unresponsive. To resolve the issue, you first look into:

Rate: It measures the number of requests that the service receives per unit of time. An unexpectedly high request rate could be indicative of an increased load on the server, causing performance issues.

Errors: This metric tracks the number of requests that fail during processing. When dealing with a slow server, monitoring errors is crucial for identifying issues or bugs within the server's processing logic. A rising error rate narrows down where to look for the root cause of the inefficiency.

Duration: It measures the time taken by the server to process each request. For a slow service, analyzing the duration helps identify the specific requests or processes that are taking longer than usual to complete. By identifying the slow-performing components, you can focus on optimizing those areas.
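As a rough illustration of the three RED indicators, here is a minimal sketch that computes them from request records. The input format (one line per request with an HTTP status and a duration in milliseconds) and the function name red_summary are assumptions made for this example, not the format of any particular web server.

```shell
# red_summary WINDOW_SECONDS
# Reads lines of "<http_status> <duration_ms>" on stdin (a hypothetical,
# simplified log format) and prints the request rate, the share of failed
# requests, and the mean duration for that time window.
red_summary() {
    awk -v window="$1" '
        NF >= 2 {
            total++
            if ($1 + 0 >= 500) errors++   # treat 5xx responses as errors
            sum_ms += $2
        }
        END {
            if (total == 0) { print "no requests"; exit }
            printf "rate=%.2f/s errors=%.1f%% avg_duration=%.1fms\n",
                   total / window, 100 * errors / total, sum_ms / total
        }'
}
```

For example, four requests seen in a 2-second window, one of which returned 500, give a rate of 2 requests per second and an error share of 25%.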

USE Methodology

The USE methodology focuses on identifying problems with system resources using three criteria: Utilization, Saturation, and Errors.

Utilization: This measures how busy a resource is, as a share of time or of capacity. High utilization leads to slower performance because a resource that is close to fully used cannot accept much more work.

Saturation: This measures how much extra work is queued, waiting for a resource that is already busy. When dealing with a slow server, monitoring for saturation helps identify any backlogs or queues that are causing delays in processing requests.

Errors: This counts the error events reported for a resource, such as failed disk operations or dropped network packets, which can cause your system to hang, slow down, or crash. By examining error rates and types, you can pinpoint specific areas where errors are prevalent, helping you identify the root causes of the slowdown.

In the next section, we will explore real-life examples of common problems in Linux systems and walk through step-by-step solutions using powerful commands and techniques.

Essential troubleshooting commands and techniques

Let's dive into the different scenarios in which we can use these troubleshooting mechanisms. In this article, we are using Ubuntu on an Intel processor. If you are on a different system or architecture, the output will be slightly different.
Suppose you are on call and an incident requires you to troubleshoot a performance issue on a Linux machine or container. Don't worry, we've got you covered. Here are some commands to come to your rescue.

top

This command provides real-time information about system resource usage, including CPU, memory, and running processes. You might get overwhelmed as to what to look for in this output.

$ top
top - 20:31:34 up 1 day,  6:05,  1 user,  load average: 0.50, 0.65, 0.57
Tasks:  88 total,   1 running,  87 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni, 99.6 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
MiB Mem :    941.6 total,    214.6 free,    197.8 used,    529.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    577.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2180 ubuntu    20   0   17208   5580   3144 S   0.3   0.6   0:01.75 sshd
   6387 ubuntu    20   0   10776   3860   3288 R   0.3   0.4   0:00.01 top
      1 root      20   0  102004  10812   6096 S   0.0   1.1   0:05.79 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.03 kthreadd
      3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp
      4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp
      5 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 slub_flushwq
      6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 netns

Load Average- indicates the average system load over the past 1, 5, and 15 minutes, respectively. A load average of 1 represents a fully utilized single-core CPU, so compare the value with the number of cores. Values above the core count indicate an increasingly overloaded system.
Zombie Processes- processes that have finished executing but whose parent has not yet collected their exit status. A zombie holds only an entry in the process table, but a growing count points to a process-management problem in the parent.
%Cpu(s)- shows how CPU time is split between user code (us), kernel code (sy), idle time (id), I/O wait (wa), and so on. A healthy system usually has some idle time left, so id should not sit near 0 for long.
S- shows the process state. If your web server is not responding at all, look for the "D" (uninterruptible sleep) state, which usually means the process is stuck waiting on I/O.
RES (resident memory usage)- indicates the amount of physical memory being used by the process. If the RES values of your processes add up to nearly all of the physical memory, the system may start swapping pages between RAM and the swap space, causing a performance bottleneck.
SHR (shared memory)- indicates how much of the resident memory could be shared with other processes, for example through shared libraries. A high value is not a problem on its own, but it is worth a look when it is unexpected for the application.
VIRT- indicates the total virtual memory the process has mapped. If your system is becoming slow because one process is using a very large amount of virtual memory, this column shows you which process it is.
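Because a load average only means something relative to the number of cores, it can help to normalize it. The sketch below is one way to do that. The function name load_per_core and the cut-off of 1.0 per core are assumptions for illustration, not a rule built into top.

```shell
# load_per_core LOAD_AVERAGE CORES
# Divides a load average by the core count and labels the result.
# A ratio above 1 means more work is queued than there are cores.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        ratio = load / cores
        status = (ratio > 1) ? "saturated" : "ok"
        printf "%.2f %s\n", ratio, status
    }'
}

# Possible usage on a live Linux host (not executed here):
#   load_per_core "$(cut -d ' ' -f1 /proc/loadavg)" "$(nproc)"
```

With the 1-minute load of 0.50 from the output above and a single core, this prints 0.50 ok.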

sar

This command collects, reports, and saves system activity information over time.

$ sar -n TCP,ETCP 1
Linux 5.15.0-88-generic (top-gerbil)       11/07/23        _x86_64_        (1 CPU)

21:29:02     active/s  passive/s  iseg/s    oseg/s
21:29:03         0.00      0.00       0.00         0.00

21:29:02     atmptf/s  estres/s  retrans/s   isegerr/s       orsts/s
21:29:03         0.00      0.00      0.00         0.00              0.00

21:29:03     active/s  passive/s    iseg/s    oseg/s
21:29:04         0.00      0.00       11.00        11.00

21:29:03     atmptf/s  estres/s   retrans/s isegerr/s       orsts/s
21:29:04         0.00      0.00      0.00         0.00             0.00

The example above reports TCP statistics. You can also check the network interface statistics (sar -n DEV), which give metrics such as bytes transmitted and received and packet counts, while sar -n EDEV reports errors and drops. These are useful for monitoring network performance.
You can also check process queue and scheduling activity through the queue statistics (sar -q), which can help resolve issues related to process management and scheduling.
active/s and passive/s- the number of TCP connections opened per second by this host (active) and by remote clients (passive). If your web server is not responsive, a sudden jump in passive/s can indicate a traffic spike or a DoS attack, while a high active/s means the server itself is opening many outbound connections, for example to a backend.
retrans/s- the number of TCP segments retransmitted per second. It helps identify network congestion or unreliable networks. For example, if file transfers are slow because the network is suffering from packet loss, fixing the loss reduces retransmissions and increases transfer speed.
estres/s- the number of established connections that were reset per second. A high value means connections are being dropped abruptly instead of closing cleanly.
orsts/s- the number of TCP segments sent per second with the RST flag set. A high value means this host is rejecting or aborting many connections.
iseg/s- the number of TCP segments received per second. A high value means your server is seeing a surge in traffic, which can put pressure on your network infrastructure.
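When sar runs for a while, the samples with retransmissions are the interesting ones. The following sketch filters them out of sar -n TCP,ETCP style text. It assumes 24-hour timestamps, as in the output above, and the function name sar_retrans is made up for this example.

```shell
# sar_retrans [LIMIT]
# Reads `sar -n TCP,ETCP` style output on stdin and prints
# "<time> <retrans/s>" for every sample whose retrans/s exceeds LIMIT
# (default 0). Assumes the timestamp is a single field (24-hour format).
sar_retrans() {
    awk -v limit="${1:-0}" '
        /\/s/ {                          # any header row: find the column again
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "retrans/s") col = i
            next
        }
        col > 0 && NF >= col && ($col + 0) > (limit + 0) { print $1, $col }
    '
}

# Possible usage on a live host (not executed here):
#   sar -n TCP,ETCP 1 10 | sar_retrans 0
```

The column position is looked up from the header row each time, because the TCP and ETCP blocks use different columns.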

free

The free command is used to display the amount of free and used memory in the system, including both physical and swap memory.

~$ free -m
               total        used        free      shared  buff/cache   available
Mem:            7828        1896        2996        1010        2935        4382
Swap:           16023           0       16023

shared- shows how much memory is used by shared memory segments and tmpfs filesystems. It does not count shared libraries. If it's high, check what is stored in tmpfs mounts such as /dev/shm, because that data lives in RAM.
buff/cache- shows memory used by kernel buffers and by the page cache, which keeps frequently accessed files in memory to speed up I/O. The kernel can reclaim most of it when applications need memory, so read the available column to judge how much memory can still be used.
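The available column is usually the number to watch, so here is a small sketch that turns it into a percentage of total memory. It assumes the column layout shown above, where available is the last column of the Mem: row. The function name mem_available_pct is hypothetical.

```shell
# mem_available_pct
# Reads `free -m` style output on stdin and prints the available memory
# as a whole-number percentage of total memory.
# Assumes the layout: Mem: total used free shared buff/cache available
mem_available_pct() {
    awk '$1 == "Mem:" && $2 > 0 { printf "%.0f\n", 100 * $7 / $2 }'
}

# Possible usage on a live host (not executed here):
#   free -m | mem_available_pct
```

For the sample output above (4382 MiB available out of 7828 MiB), this prints 56.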

vmstat

This command is used to report virtual memory statistics.

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 219264  20164 524172    0    0    17    28  352   55  0  0 100  0  0
 0  0      0 219264  20164 524176    0    0     0     0  346   94  0  1 99  0  0
 0  0      0 219264  20164 524176    0    0     0     0  296   82  0  0 100  0  0
 0  0      0 219264  20164 524176    0    0     0     0  321   71  0  0 100  0  0
 0  0      0 219264  20164 524176    0    0     0     0  348   83  0  1 99  0  0
 0  0      0 219264  20164 524176    0    0     0     0  324   75  0  0 100  0  0

r- indicates the number of runnable processes, meaning those running on a CPU or waiting for one. If your web server is slow and the value of r is consistently higher than the number of CPUs, processes are competing for CPU time.
swpd, free, buff, cache, si, so- describe memory: how much is swapped out, free, or used for buffers and cache, and how much is being swapped in from and out to disk per second.
in and cs- indicate the number of interrupts and context switches per second. A high cs value means frequent switching between processes, which can reduce CPU performance.
id and wa- indicate the percentage of time the CPU is idle and the percentage of time spent waiting for I/O operations. A high wa value means processes are stalled on I/O, and a high id value means the CPU has spare capacity for more work.
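To avoid reading every row by eye, the checks described above can be applied to each vmstat sample. This is a minimal sketch that assumes the 17-column layout shown in the output above. The function name vmstat_flags and the thresholds passed to it are assumptions for illustration.

```shell
# vmstat_flags CORES WA_LIMIT
# Reads `vmstat 1` style output on stdin and prints one line for each
# sample that shows CPU contention, I/O wait, or swapping.
# Assumes columns: r b swpd free buff cache si so bi bo in cs us sy id wa st
vmstat_flags() {
    awk -v cores="$1" -v wa_limit="$2" '
        $1 ~ /^[0-9]+$/ && NF >= 17 {
            n++
            msg = ""
            if ($1 + 0 > cores + 0)     msg = msg " run-queue(r=" $1 ")"
            if ($16 + 0 > wa_limit + 0) msg = msg " iowait(wa=" $16 ")"
            if ($7 + 0 > 0 || $8 + 0 > 0) msg = msg " swapping(si=" $7 ",so=" $8 ")"
            if (msg != "") print "sample " n ":" msg
        }'
}

# Possible usage on a live host (not executed here):
#   vmstat 1 10 | vmstat_flags "$(nproc)" 20
```

Keep in mind that the first row vmstat prints is an average since boot, so it is often ignored when looking for a current problem.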

mpstat

You are working on a server running multiple applications that heavily rely on CPU resources. You noticed that some services are not responding as quickly as they should, and that there are occasional service disruptions.

Use the mpstat command, which displays CPU usage statistics for all available processors.

$ mpstat -P ALL
Linux 5.15.0-88-generic (top-gerbil)    11/07/23        _x86_64_        (1 CPU)

23:44:28     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
23:44:28     all    0.04    0.01    0.09    0.04    0.00    0.19    0.00    0.00    0.00   99.63
23:44:28       0    0.04    0.01    0.09    0.04    0.00    0.19    0.00    0.00    0.00   99.63

%usr- percentage of CPU time spent on user-level processes. If this is unusually high, it would indicate that certain user applications or processes are consuming excessive CPU resources.
%sys- percentage of CPU time spent executing at the kernel (system) level. If this parameter is high, it suggests that the kernel or system services are utilizing a substantial amount of CPU time, which might point to a system-level issue.
%iowait- percentage of time CPU spends waiting for I/O operations. An increased value in this might imply that the system is experiencing I/O bottlenecks or storage-related problems, resulting in the CPU waiting for I/O operations to complete.

iostat

You are experiencing slow disk performance, resulting in delayed read/write operations and increased latency for applications reliant on disk access.

Use iostat to monitor the I/O performance of the system's storage devices.

$ iostat -dx 5
Linux 5.15.0-88-generic (top-gerbil)    11/08/23        _x86_64_        (1 CPU)

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
loop0            0.00      0.02     0.00   0.00    0.65     8.97    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop1            0.00      0.01     0.00   0.00    1.95    16.89    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop2            0.01      0.47     0.00   0.00    0.83    43.30    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop3            0.00      0.00     0.00   0.00    0.00     1.27    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
sda              0.25     15.21     0.04  14.47    1.67    60.24    0.31     25.99     0.47  60.09    5.01    83.61    0.00      0.00     0.00   0.00    0.00     0.00    0.07    3.38    0.00   0.12
sr0              0.00      0.00     0.00   0.00    0.70     2.92    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00

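r/s, w/s, rkB/s, wkB/s- record the number of read and write requests completed per second and the kilobytes read and written per second for each device.
r_await and w_await- show the average time in milliseconds that read and write requests take, including the time spent waiting in the queue.
aqu-sz (avgqu-sz in older sysstat versions)- indicates the average queue length of the requests issued to the device. If it stays greater than 1, the device may be saturated.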
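With many devices, the wide iostat -dx table is hard to scan. The sketch below picks out the devices whose %util is above a limit. It finds the column from the header row because its position varies between sysstat versions. The function name busy_disks and the limit are assumptions for illustration.

```shell
# busy_disks UTIL_LIMIT
# Reads `iostat -dx` style output on stdin and prints "<device> <%util>"
# for every device whose %util is above UTIL_LIMIT.
busy_disks() {
    awk -v limit="$1" '
        $1 ~ /^Device/ {                 # header row: locate the %util column
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        col > 0 && NF >= col && ($col + 0) > (limit + 0) { print $1, $col }
    '
}

# Possible usage on a live host (not executed here):
#   iostat -dx 5 2 | busy_disks 80
```

A high %util is a hint and not proof of saturation, so it is best read together with the queue length and await columns.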

df

You received an alert that your disk partition is full and the system is becoming unresponsive.

Use the df command to display information about the disk space usage of file systems. It provides an overview of available, used, and total disk space, as well as the mount point of each file system.

$ df -kh
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            97M  1.2M   96M   2% /run
/dev/sda1       4.7G  2.3G  2.4G  50% /
tmpfs           482M     0  482M   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15       98M  5.1M   93M   6% /boot/efi
tmpfs            97M  4.0K   97M   1% /run/user/1000

Filesystem- identifies the device or source behind each mount, such as a disk partition, a tmpfs, or a network share. If a user is unable to save a file on a network share, this column helps you find which filesystem to check.
Used and Use%- indicate how much space is currently in use. If it's close to the storage capacity, you need to free up space.
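The same check can be scripted so that nearly full filesystems stand out. This sketch reads df output and prints the mount points at or above a usage limit. The function name full_filesystems and the limit are assumptions, and mount points containing spaces are not handled.

```shell
# full_filesystems LIMIT_PERCENT
# Reads `df -h` (or `df -P`) style output on stdin and prints
# "<mount point> <Use%>" for filesystems at or above LIMIT_PERCENT.
full_filesystems() {
    awk -v limit="$1" '
        NR > 1 {                         # skip the header row
            pct = $5
            sub(/%/, "", pct)            # "96%" -> "96"
            if (pct + 0 >= limit + 0) print $6, $5
        }'
}

# Possible usage on a live host (not executed here):
#   df -P | full_filesystems 90
```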

ifconfig

You need to troubleshoot network connectivity issues on a Linux server.

The ifconfig command shows all the configured network interfaces and their IP addresses.

$ ifconfig
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        inet6 fe80::42:96ff:fed3:d4d2  prefixlen 64  scopeid 0x20<link>
        ether 02:42:96:d3:d4:d2  txqueuelen 0  (Ethernet)
        RX packets 4128  bytes 183296 (183.2 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 6243  bytes 85480503 (85.4 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

inet and inet6 show the IPv4 and IPv6 addresses that are attached to the interface.
broadcast shows the broadcast address. A wrong broadcast address can cause problems with features that depend on broadcasts, such as network discovery.
netmask shows the subnet mask. A wrong netmask can cause communication issues between devices on different subnets.
mtu shows the maximum transmission unit, i.e. the maximum packet size that can be sent on the interface. If the MTU does not match the rest of the network path, packets are fragmented or dropped, which ultimately affects performance.

dmesg

A Linux server is experiencing hardware issues, such as disk errors or network interface failures. You must analyze the system logs to identify any potential hardware-related errors or warnings.

The dmesg command provides information about hardware devices, system events, and potential issues encountered during system operation.

$ sudo dmesg
[    0.000000] Linux version 5.15.0-88-generic (buildd@lcy02-amd64-058) (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 (Ubuntu 5.15.0-88.98-generic 5.15.126)
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-88-generic root=UUID=5a569d86-b935-46dd-ae79-7a72a25b6a4c ro console=tty1 console=ttyS0
[    0.000000] KERNEL supported cpus:
[    0.000000]   Intel GenuineIntel
[    0.000000]   AMD AuthenticAMD
[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000003e1b6fff] usable
[    0.000000] BIOS-e820: [mem 0x000000003e1b7000-0x000000003e1fffff] reserved
[    0.000000] BIOS-e820: [mem 0x000000003e200000-0x000000003eceefff]
[    0.000000] BIOS-e820: [mem 0x000000003f36b000-0x000000003ffeffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000ffc00000-0x00000000ffffffff] reserved
[    0.000000] NX (Execute Disable) protection: active

By analyzing the logs, you can identify any hardware issues or error messages that could be affecting the server's performance and stability.

journalctl

This command displays the logs collected in the systemd journal, including kernel, boot, and service messages.

$ journalctl
Nov 01 17:15:42 ubuntu kernel: Linux version 5.15.0-87-generic (buildd@lcy02-amd64-011) (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils>
Nov 01 17:15:42 ubuntu kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-87-generic root=LABEL=cloudimg-rootfs ro console=tty1 console=ttyS0
Nov 01 17:15:42 ubuntu kernel: KERNEL supported cpus:
Nov 01 17:15:42 ubuntu kernel:   Intel GenuineIntel
Nov 01 17:15:42 ubuntu kernel:   AMD AuthenticAMD
Nov 01 17:15:42 ubuntu kernel:   Hygon HygonGenuine
Nov 01 17:15:42 ubuntu kernel: secureboot: Secure boot disabled
Nov 01 17:15:42 ubuntu kernel: SMBIOS 2.5 present.
Nov 01 17:15:42 ubuntu kernel: DMI: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Nov 01 17:15:42 ubuntu kernel: Hypervisor detected: KVM

Suppose a service is failing to start on your Linux system. By using the journalctl command, for example journalctl -u followed by the unit name, you can get detailed log information about the service's attempts to start and any associated error messages.

nicstat

It offers comprehensive statistics on network interfaces, including data on failures, packets, and bandwidth usage.

$ nicstat
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
01:52:51       lo    0.00    0.00    0.01    0.01   93.01   93.01  0.00   0.00
01:52:51   enp0s3    4.72    0.05    3.33    0.29  1451.0   175.4  0.00   0.00
01:52:51  docker0    0.00    0.68    0.03    0.05   44.40 13692.2  0.00   0.00

Time- the timestamp of each sample. If a network issue appears suddenly, the timestamps help you identify patterns or potential triggers that led to the problem.
Int- the name of each network interface. In an environment with multiple networks, it lets you pinpoint the interface that is causing the problem and resolve it.
rKB/s, wKB/s, rPk/s, wPk/s, rAvs, wAvs- the kilobytes and packets read (received) and written (transmitted) per second, and the average size in bytes of the packets read and written.
%Util and Sat- the utilization of the interface and its saturation, which nicstat derives from the errors per second seen on the interface. If these numbers are higher than usual, expect latency, or lag, in the network.

lsof

A file is continuously growing in size which was not expected. You need to identify which process is writing into the file.

The lsof command lists the files that are currently open and the processes that opened them.

ubuntu@top-gerbil:/$ sudo lsof -R
COMMAND    PID  TID TASKCMD   PPID       USER   FD      TYPE     DEVICE SIZE/OFF   NODE NAME
systemd      1                  0       root  cwd       DIR      8,1     4096       2    /
systemd      1                  0       root  rtd       DIR      8,1     4096       2    /
systemd      1                  0      root  txt       REG      8,1    1849992    3335 /usr/lib/systemd/system
container 3243                  1       root  txt       REG      8,1    52632728   39545 /usr/bin/containerd
container 3243                  1       root  mem-W     REG      8,1    32768      73792 /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db

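PID- the ID of the process that has the file open. Passing the file's path to lsof limits the output to that file, so you can find the process that is writing to it and then investigate that process.
FD- the file descriptor and its access mode. A suffix of w or u means the file is open for writing or for both reading and writing.
USER- the user who owns the process that opened the file.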
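For the growing-file scenario, the processes of interest are those that hold the file open for writing. The sketch below filters lsof output for them. It assumes the default nine-column layout that lsof prints when given a file path (COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME), and the function name file_writers is made up for this example.

```shell
# file_writers
# Reads the default output of `lsof <file>` on stdin and prints
# "<pid> <command> <fd>" for processes that have the file open for
# writing (FD mode w) or for reading and writing (FD mode u).
file_writers() {
    awk 'NR > 1 && $4 ~ /^[0-9]+[wu]/ { print $2, $1, $4 }'
}

# Possible usage on a live host (not executed here; the path is a placeholder):
#   sudo lsof /var/log/app.log | file_writers
```

Options that add columns, such as -R in the example above, shift the field positions, so this filter would need adjusting for that output.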

pstack

You have an application running on your Linux system that suddenly becomes unresponsive or experiences a segmentation fault. The issue might be related to the application’s call stack.

Use the pstack command along with the process ID of the running process. On Linux, pstack usually accepts only a process ID, so a core dump file generated during a crash is examined with a debugger such as gdb instead.

$ pstack 432

Thread 1 (Thread 0x7f7f03600700 (LWP 6516)):
#0  0x00007f7f0576b9d5 in poll () from /lib64/libc.so.6
#1  0x00007f7f06f47b36 in ?? () from /usr/lib64/libglib-2.0.so.0
#2  0x00007f7f06f47c1a in g_main_context_iteration () from /usr/lib64/libglib-2.0.so.0
#3  0x00007f7f073d587d in ?? () from /usr/lib64/libgio-2.0.so.0
#4  0x00007f7f06f6f16d in g_main_loop_run () from /usr/lib64/libglib-2.0.so.0
#5  0x00007f7f07471d7a in ?? () from /usr/lib64/libgio-2.0.so.0
#6  0x00007f7f06f1e82f in ?? () from /usr/lib64/libglib-2.0.so.0
#7  0x00007f7f0692fdd5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f7f0577703d in clone () from /lib64/libc.so.6

It provides you with the stack trace of the application, displaying the function calls and corresponding memory addresses at the moment the command ran. By analyzing the stack trace, you can identify the specific function or module causing the issue and gain insights into the application's behavior leading up to the hang or crash.

strace

It collects all system calls made by a process and the signals received by the process.

$ strace -p 5647

openat(AT_FDCWD, "/proc/self/mountinfo", O_RDONLY) = 3
newfstatat(3, "", {st_mode=S_IFREG|0444, st_size=0, ...}, AT_EMPTY_PATH) = 0
read(3, "23 29 0:21 / /sys rw,nosuid,node"..., 1024) = 1024
read(3, "rmware/efi/efivars rw,nosuid,nod"..., 1024) = 1024
read(3, "re20/2015 ro,nodev,relatime shar"..., 1024) = 973
read(3, "", 1024)                       = 0
lseek(3, 0, SEEK_CUR)                   = 3021
close(3)                                = 0
ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0
newfstatat(AT_FDCWD, "/run", {st_mode=S_IFDIR|0755, st_size=980, ...}, 0) = 0
newfstatat(AT_FDCWD, "/", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
newfstatat(AT_FDCWD, "/sys/kernel/security", {st_mode=S_IFDIR|0755, st_size=0, ...}, 0)= 0

open and openat system calls- identify problems related to the accessibility of a file. By examining the file paths referenced in these calls, you can determine if there are any file-related issues contributing to the application's failure.
read and write system calls- offer information about data read from or written to specific files or resources. If there are any issues related to reading or writing data, these system calls can pinpoint where the problem lies, such as incorrect data handling or file manipulation.
errno- when a system call fails, strace prints its return value of -1 followed by the error name and description. Analyze the type of error encountered, such as ENOENT (file not found), EACCES (permission denied), or EINVAL (invalid argument).
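A long trace is easier to read once the failing calls are grouped by error. The sketch below counts the error names in strace output. It assumes failed calls appear in the usual form, ending in = -1 followed by the error name, and the function name errno_counts is hypothetical.

```shell
# errno_counts
# Reads strace output on stdin and prints "<count> <errno name>" for
# every error seen on a failed system call, most frequent first.
errno_counts() {
    awk '
        / = -1 E[A-Z0-9]+/ {
            for (i = 1; i < NF; i++)
                if ($i == "-1" && $(i + 1) ~ /^E[A-Z0-9]+$/) count[$(i + 1)]++
        }
        END { for (e in count) print count[e], e }
    ' | sort -rn
}

# Possible usage on a live host (not executed here). strace writes to
# stderr, so save the trace to a file with -o first:
#   sudo strace -o /tmp/trace.txt -p 5647
#   errno_counts < /tmp/trace.txt
```

strace can also produce its own per-call summary with the -c option, which is often enough when you only need totals.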

eBPF Performance tools

There's so much buzz in the market about BPF. Big companies like Meta and Amazon are using it. But what exactly is BPF? It stands for Berkeley Packet Filter, originally used for filtering specific network traffic. But it has evolved into extended BPF (eBPF), which is like a magic wand for Linux, enabling tracing of complex issues.

Brendan Gregg, formerly of Netflix, shares his troubleshooting expertise in his book BPF Performance Tools, which is a must-read. He breaks down performance into different domains, offering practical examples of BPF in action using BCC and bpftrace.

Conclusion

In this article, we saw that it is imperative to troubleshoot using a methodological approach as it offers an organized and systematic means of locating issues and resolving them quickly and effectively. Without a systematic method, troubleshooting can become disorganized and time-consuming, which frequently results in frustration, trial-and-error fixes, and, in certain situations, worsening the problems.

Tired of sifting through convoluted outputs? I highly recommend exploring the alternative tools suggested by Julia Evans in her article A list of new(ish) command line tools to optimize your workflow. For instance, angle-grinder can be a more convenient alternative to traditional methods such as grep for analyzing log data.

Check out the official documentary on the groundbreaking eBPF technology, highlighting its impact on the Linux Kernel and its journey of development with key industry players, including Meta, Intel, Isovalent, Google, Red Hat, and Netflix.

If you are stuck on a Linux issue and are looking for SREs to troubleshoot it, contact us for quick support.
