As a DevOps engineer, you know that ensuring system stability and performance is crucial. Occasionally, processes can go rogue, causing resource bottlenecks or unexpected behavior. Identifying these problematic processes requires effective monitoring techniques. Here's where Bash scripting and process monitoring tools like ps and lsof come in handy!
Understanding Linux Processes:
In Linux, everything that you run or use is represented as a type of file. As soon as you start your system, the kernel creates an init process, which every other user-space process descends from.
New processes are created with the fork system call, and each one starts out as a child of an existing process.
When a child process finishes, it lets the kernel know by setting its termination status (0 on a clean exit). The parent then needs to acknowledge the termination of the child using something called a wait call. After all, we need the parent process to know that a child process has died. This exact operation can lead us to 2 possible issues:
1. Orphan process:
What happens if the parent process exits or is terminated before the child process? The child is left an orphan and is handed over to the init process, which becomes its new parent. The orphan keeps running, and when it eventually exits, init makes the wait call that cleans it up.
2. Zombie process: 🧟
What is a zombie? Someone who is mindlessly running around trying to eat human brains!! A MENACE. A zombie process is something similar: a process has gone zombie when the child process has terminated but the parent process has not made the wait call yet.
A zombie process's resources are freed up. However, the process still appears in the process table (and so in the output of ps). You can't manually kill a zombie process, since it is already dead; its entry is cleared when the parent process eventually makes the wait call. (Just like a real-world zombie, you need to shoot it twice; in our case the second bullet needs to come from the parent process.)
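If you want to see a zombie for yourself, here is a minimal sketch, assuming a Linux box with /proc mounted and Bash available. The temp file and variable names are my own, purely for illustration. The trick is that an inner shell starts a child and then uses exec to turn itself into a sleep that never makes the wait call, so the finished child lingers as a zombie until that parent exits.

```shell
#!/bin/bash
# Sketch: create a short-lived zombie and read its state from /proc.
# Assumes Linux with /proc mounted; all names here are illustrative.

pidfile=$(mktemp)

# The inner shell starts a child (sleep 1), records the child's PID, then
# exec replaces the shell with "sleep 3", a parent that never calls wait.
bash -c 'sleep 1 & echo "$!" > "$1"; exec sleep 3' _ "$pidfile" &
parent=$!

sleep 2   # by now the child has exited, but its parent is still alive

child=$(cat "$pidfile")
# The third field of /proc/<pid>/stat is the state letter (Z means zombie).
state=$(awk '{print $3}' "/proc/$child/stat")
echo "child $child (parent $parent) is in state: $state"

wait "$parent"   # once the parent exits, the zombie entry gets cleaned up
rm -f "$pidfile"
```

While the script sits in its two-second pause, running ps in another terminal should show the child marked as defunct.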
Command use and output:
Most Linux distributions ship a package called procps, which contains useful tools such as ps, top, vmstat, etc. to help us with performance and process monitoring. I will discuss a couple of them here.
1. ps
ps aux
- a - list all processes, even those owned by other users
- u - list more details about each process
- x - list processes that don't have a TTY associated with them
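ps can also do the sorting for us: the procps version accepts --sort, for example ps aux --sort=-%cpu. Where that option is not available, the same idea can be sketched with standard tools. The helper below is my own illustration (the name top_cpu_rows is made up), and it assumes the usual ps aux layout where %CPU is the third column.

```shell
#!/bin/bash
# Sketch: print the N most CPU-hungry rows of `ps aux` output from stdin.
# Assumes the usual `ps aux` layout where column 3 is %CPU.
# top_cpu_rows is an illustrative name, not a standard tool.
top_cpu_rows() {
    local n="${1:-5}"
    # Drop the header, sort numerically on column 3 (descending), keep N rows.
    tail -n +2 | LC_ALL=C sort -k3,3nr | head -n "$n"
}

# Usage: ps aux | top_cpu_rows 3
```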
2. top
Do you want to know which task is consuming how much CPU and memory? Which user or group started it?
The top command provides a dynamic real-time view of a running system. It can display various types of information, such as the list of tasks currently being managed and the memory and CPU used by each of them.
The top part of the top command's output displays a summary of the system and the tasks currently running on it. Below that, the process table lists the following columns:
- PID (Process ID) - The task's unique process ID.
- USER - Task owner's username.
- PR - Scheduling priority of the task.
- NI - Nice value - A bit confusing but it simply shows the priority of the task within the same autogroup. A positive value means lower priority and a negative value means higher priority.
- VIRT - Virtual memory size - The total amount of virtual memory used by the task.
- RES - Resident memory size - Non-swapped physical memory a task is currently using.
- SHR - Shared memory size - Part of RES which can be utilized by other processes.
- S - Process status:
  - D - Uninterruptible sleep
  - I - Idle
  - R - Running
  - S - Sleeping
  - T - Stopped by job control signal
  - t - Stopped by debugger during trace
  - Z - Zombie
- %CPU - CPU usage.
- %MEM - Memory usage.
- TIME - CPU time.
- COMMAND - Command name or command line.
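The status column is also how you spot zombies in practice. As a rough sketch, assuming ps is asked for the pid, ppid, stat, and comm columns in that order, a small awk filter can pull out the zombie rows together with the parent PID that still owes them a wait call. The function name list_zombies is my own.

```shell
#!/bin/bash
# Sketch: filter `ps -eo pid,ppid,stat,comm` output (read from stdin) down to
# zombie processes. list_zombies is an illustrative name.
list_zombies() {
    # Skip the header row; a STAT value starting with Z marks a zombie.
    awk 'NR > 1 && $3 ~ /^Z/ { print "zombie pid=" $1, "parent=" $2, "cmd=" $4 }'
}

# Usage: ps -eo pid,ppid,stat,comm | list_zombies
```

The parent PID is the useful part: since the zombie itself cannot be killed, the parent is the process to investigate.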
Everything is a file
One of the core principles of Linux is that everything is treated as a file. This means that not only traditional files like documents and images but also devices, network connections, and even processes are represented as files within the operating system. This unified approach simplifies system management and provides a consistent interface for interacting with various system components. By treating everything as a file, Linux allows for a more flexible and efficient way to handle and manipulate different types of resources.
lsof helps us narrow down the files currently held open by running processes, and we can even filter its output by port, user, IPv4, IPv6, etc. The best way to learn these tools is to use them. Here's the best resource I found: lsof cheatsheet
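Since everything is a file, the kernel also exposes each process's open files under /proc/<pid>/fd, which is a handy way to see roughly what lsof -p <pid> reports without installing anything. The sketch below assumes Linux with /proc mounted; list_fds is a name I made up for illustration.

```shell
#!/bin/bash
# Sketch: list a process's open file descriptors by reading /proc directly.
# Assumes Linux with /proc mounted; list_fds is an illustrative name.
list_fds() {
    local pid="$1" fd target
    for fd in /proc/"$pid"/fd/*; do
        # Each entry is a symlink to the file, socket, or pipe behind it.
        target=$(readlink "$fd" 2>/dev/null) || continue
        printf '%s -> %s\n' "${fd##*/}" "$target"
    done
}

# Usage: list_fds "$$"      # descriptors held by the current shell
```

With lsof installed, lsof -p <pid> gives the same information in more detail, and lsof -i :<port> or lsof -u <user> narrow the list by port or by user.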
Real world example:
Problem statement:
Imagine your server experiences performance issues. Users complain about slow applications, and resource usage is through the roof. You suspect a single process is to blame, but pinpointing it can feel like searching for a needle in a haystack.
Assumptions & lab setup
Before I go over the solution, a note on the setup. In the lab environment, I have created a Python script that runs forever generating random numbers, using the code below. This is just to create a process that actually eats up a bunch of resources:
import time
import random

def calculate_something_complex():
    # Simulate a computationally intensive task
    result = 0
    for i in range(10000000):
        result += random.random()
    return result

def long_running_task():
    while True:
        result = calculate_something_complex()
        print(result)
        time.sleep(1)

if __name__ == "__main__":
    long_running_task()
Before going any further we must define a basic flow of the solution:
- Determine which process is eating up most of the resources. Using top is a very straightforward approach, as it is purpose-built to list the processes that use the most resources.
- We need not only the process ID but also the list of files the process has open, to determine what the root cause might be. lsof is the answer here.
- We also need to be able to perform this task repeatedly, probably automated through a cron job. Writing a Bash script that can be executed easily is a good option.
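For the automation part, here is one way the cron side could look. This is a sketch under my own assumptions: the script and log paths are hypothetical, and run_and_log is a helper name I made up. It runs a command and appends the output to a log under a timestamp header, so repeated runs are easy to tell apart.

```shell
#!/bin/bash
# Sketch: run a command and append its output to a log under a timestamp
# header. run_and_log and every path mentioned below are hypothetical.
run_and_log() {
    local logfile="$1"
    shift
    {
        printf '=== %s ===\n' "$(date '+%Y-%m-%d %H:%M:%S')"
        "$@" 2>&1
    } >> "$logfile"
}

# Usage: run_and_log /var/log/cpu_hog.log /usr/local/bin/find_cpu_hog.sh
#
# Example crontab entry (every 5 minutes), assuming the call above is saved
# in a wrapper script at this hypothetical path:
# */5 * * * * /usr/local/bin/log_cpu_hog.sh
```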
Solution & Explanation:
Here's the code I came up with to get the CPU-intensive task and list the open files under that PID.
#!/bin/bash

# Print the PID of the first row in top's process table (sorted by CPU usage).
find_cpu_hog() {
    top -b -n 1 | grep -A99 PID | grep -v COMMAND | head -n 1 | awk '{print $1}'
}

# List the files held open by the given PID.
list_open_files() {
    pid="$1"
    lsof -p "$pid"
}

main() {
    cpu_hog_pid=$(find_cpu_hog)
    echo "$cpu_hog_pid"
    if [ -z "$cpu_hog_pid" ]; then
        echo "No high CPU processes found."
        exit 0
    fi
    list_open_files "$cpu_hog_pid"
}

main
I believe the most important line above is
top -b -n 1 | grep -A99 PID | grep -v COMMAND | head -n 1 | awk '{print $1}'
Basically, what I am trying to do is take the output of top and manipulate it to get the process ID. Here's how, in detail:
- top -b -n 1: runs top in batch mode for a single iteration, so the output is printed once as plain text and can be piped. Since top sorts by CPU usage by default, the most CPU-intensive process comes first in the process table.
- grep -A99 PID: if you have run top before, you know that it prints very useful summary info such as uptime, tasks, %CPU usage, etc. (read more here: What the first five lines of Linux’s top command tell you). To omit that data and keep only the process table, I grep for PID, and -A99 keeps the 99 lines that follow the match.
- grep -v COMMAND: omits the heading line of the process table.
- head -n 1: keeps only the first line, the most CPU-intensive process.
- awk '{print $1}': prints the first field of that line, the PID.
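The same extraction can be folded into a single awk program, which avoids depending on the -A99 window. This is only a sketch of an alternative, not the script I ran above; first_pid_after_header is a made-up name, and it assumes top's batch output, where the process table starts right after the header row whose first column is PID.

```shell
#!/bin/bash
# Sketch: read `top -b -n 1` output on stdin and print the PID from the first
# row after the column header. Assumes the header's first field is "PID".
# first_pid_after_header is an illustrative name.
first_pid_after_header() {
    awk '
        seen        { print $1; exit }   # first row after the header: done
        $1 == "PID" { seen = 1 }         # header row found, next row is data
    '
}

# Usage: top -b -n 1 | first_pid_after_header
```

If top prints no process table at all, the function prints nothing, so the empty-PID check in main still applies.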
Running the script above while the Python script runs in the background, the output lets us determine that a CPU-intensive Python program is running, and we can also pinpoint the user who started it.
Corrections and comments are welcome.