LEO Qin

Posted on Apr 6, 2023

Troubleshooting a JVM GC Long Pause

#jvm #java #springcloud

In production, one application showed abnormal garbage collection behavior: some of its instances experienced extremely long Full GC pauses, lasting about 15 to 30 seconds. On average, this happened about once every two weeks.

JVM parameter configuration:

-Xms2048M -Xmx2048M -Xmn1024M -XX:MaxPermSize=512M

Analyze GC logs. The GC log records the execution time and result of each GC. By analyzing the GC logs, you can optimize heap and GC settings, or improve the object allocation pattern of the application.

In this case, the GC log reports Ergonomics as the cause of the Full GC: UseAdaptiveSizePolicy is enabled, so the JVM adjusts itself adaptively, and that adjustment triggered the Full GC.

The log mainly shows the state of the heap before and after each GC; on its own, it does not explain what is causing the issue.

To enable GC logs, the following JVM startup parameters need to be added:

-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/export/log/risk_pillar/gc.log
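Once the log is being written, the long pauses can be pulled out with a small filter. The following is a minimal sketch, assuming the JDK 7/8 log format produced by the flags above, in which each collection line ends with a "[Times: user=... sys=..., real=N secs]" block. The function name long_pauses and the 1-second threshold in the usage line are illustrative choices, not part of the original setup.

```shell
# long_pauses THRESHOLD_SECONDS
# Reads GC log lines on stdin and prints those whose wall-clock pause
# ("real=" in the [Times: ...] block) exceeds the threshold.
long_pauses() {
  awk -v t="$1" '
    match($0, /real=[0-9.]+/) {
      # RSTART/RLENGTH cover "real=<number>"; skip the 5-character "real=" prefix.
      secs = substr($0, RSTART + 5, RLENGTH - 5) + 0
      if (secs > t + 0) print
    }
  '
}

# Usage (path taken from the -Xloggc flag above):
#   long_pauses 1 < /export/log/risk_pillar/gc.log
```

Filtering on the "real" time rather than the "user" or "sys" time is deliberate: a pause dominated by waiting on disk shows up as wall-clock time, not as CPU time.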

The meanings of common Young GC and Full GC logs are as follows:

Further investigate server performance metrics. With the GC timestamps in hand, we looked on the monitoring platform for metrics with abnormal values at those times. We found that around 5:06 (the time of the GC), CPU usage rose sharply, swap usage dropped, and the growth of memory usage hit a turning point.

Did the JVM use swap?
Was the sudden increase in CPU usage and the release of swap space to memory caused by GC?
To verify whether the JVM used swap, we checked per-process memory usage under the /proc directory.

# Sum the Swap: entries of every process with PID > 100 and list the top 10.
for i in $(cd /proc; ls | grep "^[0-9]" | awk '$0 > 100'); do
  awk '/^Swap:/{a=a+$2} END{print '"$i"', a/1024 "M"}' /proc/$i/smaps 2>/dev/null
done | sort -k2nr | head -10

"head -10" keeps the 10 processes with the highest swap usage. The first column of the output is the process ID, and the second column is the amount of swap the process is using. We can see that one process is indeed using 305 MB of swap space.
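For a single process, the same sum can be computed in isolation. This is a minimal sketch, assuming the smaps layout in which each mapping reports a "Swap:" line in kB; newer kernels may also print a separate "SwapPss:" line, which must not be counted. The helper name swap_kb and the PID in the usage line are hypothetical.

```shell
# swap_kb: reads the contents of /proc/<pid>/smaps on stdin and prints the
# total swapped-out size of that process in kB.
swap_kb() {
  # Match the exact field name so "SwapPss:" lines are not double-counted.
  awk '$1 == "Swap:" { total += $2 } END { print total + 0 }'
}

# Usage (1234 is a placeholder PID):
#   swap_kb < /proc/1234/smaps
```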

Here's a brief introduction to what swap is:

Swap is a partition or file on disk that the kernel uses when memory reclaim is triggered under memory pressure. Some of the data in memory may then be moved to the swap space so that the system does not run out of memory and hit OOM or another fatal condition.

When a process requests memory from the OS and finds that there is not enough, the OS will swap out temporarily unused data from memory and place it in the swap partition, a process called "swap out". When the process needs this data again and the OS finds that there is free physical memory, it will swap the data back into physical memory from the swap partition, a process called "swap in".

To verify that GC duration and swap activity are really related, I surveyed more than a dozen machines, focusing on GC logs with long pauses, and confirmed that the timestamps of the long GCs and of the swap activity do coincide.
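Swap activity around a GC timestamp can also be observed directly on a machine. Here is a rough sketch, assuming the default column layout of procps vmstat, where the 7th and 8th columns are si (swapped in) and so (swapped out). The function name swap_activity is illustrative.

```shell
# swap_activity: reads vmstat output on stdin and prints only the samples
# in which data was swapped in (si, column 7) or swapped out (so, column 8).
swap_activity() {
  # Header lines start with non-numeric text, so require a numeric first field.
  awk '$1 ~ /^[0-9]+$/ && ($7 > 0 || $8 > 0)'
}

# Usage: sample once per second and show only the seconds with swap traffic.
#   vmstat 1 | swap_activity
```

Lines printed during a long Full GC with a large si value would be consistent with the heap being paged back in while the collector traverses it.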

Furthermore, checking the swappiness parameter on each virtual machine instance revealed a common pattern: instances with longer Full GCs were configured with vm.swappiness = 30 (a larger value means a greater tendency to use swap), while instances with relatively normal GC times were configured with vm.swappiness = 0 (use swap as little as possible).

Swappiness is a Linux kernel parameter that can be set to a value between 0 and 100. It controls how strongly the kernel prefers swapping out process memory over reclaiming file cache when it needs to free memory.

swappiness=0: use physical memory as much as possible and fall back to swap only as a last resort
swappiness=100: use the swap partition actively and move data from memory to swap early
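The current value can be read from /proc/sys/vm/swappiness. Below is a small sketch that flags a value above a chosen limit. The function name, the optional file argument (handy for testing), and the default limit of 10 are assumptions for illustration, not recommendations from this investigation.

```shell
# check_swappiness [LIMIT] [FILE]
# Prints "ok <value>" when swappiness is at or below LIMIT, else "high <value>".
check_swappiness() {
  limit="${1:-10}"
  file="${2:-/proc/sys/vm/swappiness}"
  value=$(cat "$file") || return 1
  if [ "$value" -le "$limit" ]; then
    echo "ok $value"
  else
    echo "high $value"
  fi
}

# Usage on a live machine:
#   check_swappiness
```

To change the value at runtime, one would typically run "sysctl -w vm.swappiness=0" as root, and add "vm.swappiness = 0" to /etc/sysctl.conf to keep it across reboots.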

The corresponding physical memory usage rate and swap usage are shown below.

Problem analysis:
When memory comes under pressure (how readily the kernel then swaps is governed by vm.swappiness), Linux moves some temporarily unused memory pages to the swap area on disk to free up memory. When data in the swap area is needed again, it is moved back into memory. When the JVM performs garbage collection, it has to traverse the used memory in the relevant heap regions. If part of the heap has been swapped out, it must be swapped back in as it is traversed. Because that requires disk access, it is far slower than reading physical memory, so the GC pause becomes very long. The slow pause can in turn delay Linux's reclamation of the swap area (moving data between memory and disk is very CPU and I/O intensive). In high-concurrency, high-QPS services, such a stall can be fatal, because the application is stopped for the whole pause (STW, stop-the-world).
Questions:
Will the JVM with swap enabled always take longer to perform GC?
If the JVM dislikes swap so much, why doesn't it prohibit its use?
What is the working mechanism of swap? This server has 8 GB of physical memory and is using swap, which would suggest that physical memory is insufficient. Yet according to the free command, actual physical memory usage does not seem that high, while swap usage is close to 1 GB.

free: The amount of memory that is completely unused (not counting buff/cache).

shared: Shared memory.

buff/cache: The amount of memory used for buffers and cache (often large when programs access files frequently).

available: An estimate of how much memory is actually available to applications.
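The numbers printed by free are derived from /proc/meminfo. As a sketch, assuming a kernel that exposes the MemAvailable field (Linux 3.14 or later), available memory and swap in use can be computed as follows. The helper name mem_summary is hypothetical.

```shell
# mem_summary: reads /proc/meminfo on stdin and prints available memory and
# swap in use, both in MB (values in /proc/meminfo are in kB).
mem_summary() {
  awk '
    $1 == "MemAvailable:" { avail = $2 }
    $1 == "SwapTotal:"    { stotal = $2 }
    $1 == "SwapFree:"     { sfree = $2 }
    END {
      printf "available_mb=%d swap_used_mb=%d\n", avail / 1024, (stotal - sfree) / 1024
    }
  '
}

# Usage:
#   mem_summary < /proc/meminfo
```

Seeing a large available_mb next to a non-trivial swap_used_mb matches the puzzle raised in the questions above: pages swapped out during an earlier period of pressure can stay in swap until they are touched again, even after memory has freed up.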

5. Further thoughts
One might consider simply disabling swap altogether.

In fact, it is not necessary to be so radical. It is important to note that the world is never simply binary, and everyone tends to choose somewhere in between. Some lean towards 0, while others towards 1.

Clearly, with regard to the swap issue, the JVM can choose to minimize its usage to reduce its impact. It is important to understand how Linux memory recovery works in order to reduce any possible concerns.

Let's first take a look at how swap is triggered.

Linux triggers memory reclaim in two scenarios: directly, when there is not enough free memory to satisfy an allocation, and in the background, when the kswapd daemon, which periodically checks the system's memory, finds that available memory has dropped below a specific threshold.

6. Speculation
Because GC runs at short intervals in real-time services, objects in memory are collected before they ever get a chance to be swapped out. When GC runs, nothing has to be swapped back from the swap partition into physical memory; the collection works entirely on in-memory data, which makes it much faster. The strategy for choosing which memory to move to the swap partition is probably similar to an LRU (least recently used) algorithm.

Lowering the heap size appropriately can also solve the problem.

This also suggests that when deploying Java services on Linux, memory should not simply be allocated as generously as possible. The allocation should account for what the JVM needs in each scenario: the permanent generation, the Java heap (young and old generations), thread stacks, and Java NIO.
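As a back-of-the-envelope illustration, those components can be added up and compared with physical memory. The heap and permanent generation sizes below come from the JVM flags at the top of this post. The thread count, per-thread stack size, and direct (NIO) buffer size are made-up example values, and the helper name jvm_footprint_mb is hypothetical.

```shell
# jvm_footprint_mb HEAP_MB PERM_MB THREADS STACK_KB DIRECT_MB
# Prints a rough estimate of JVM memory use in MB:
# heap + permanent generation + thread stacks + direct (NIO) buffers.
# It ignores other native memory such as the code cache and GC bookkeeping.
jvm_footprint_mb() {
  heap_mb=$1; perm_mb=$2; threads=$3; stack_kb=$4; direct_mb=$5
  stacks_mb=$(( threads * stack_kb / 1024 ))
  echo $(( heap_mb + perm_mb + stacks_mb + direct_mb ))
}

# Heap 2048 MB and PermGen 512 MB as configured in this post; 500 threads
# with 1024 kB stacks and 256 MB of direct buffers are assumed values.
jvm_footprint_mb 2048 512 500 1024 256
```

With these inputs the estimate is 3316 MB. The point of the exercise is that the JVM's real footprint is noticeably larger than -Xmx alone, which matters when several processes share one machine.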

7. Conclusion
In conclusion, when swapping and GC happen at the same time, the GC takes a very long time, causing the JVM to stall badly and, in extreme cases, the service to crash.

The main reason is that when the JVM performs GC, it has to traverse the used memory of the relevant heap regions. If part of the heap has been swapped out at GC time, that part must be swapped back into memory as it is traversed. In more extreme cases, if memory is so short that another part of the heap has to be swapped out to make room, the entire heap ends up being cycled through swap during the traversal, resulting in an excessively long GC. In production, the size of the swap area should be limited, and a high swap usage ratio should be investigated and resolved. Where appropriate, the heap size can be reduced or physical memory added.

Therefore, when deploying Java services on Linux systems, it is important to be cautious about memory allocation.
