Yousef Zook

Java Performance - 4 - Working with the JIT Compiler

Recap

This article is part 5 of the series Java Performance, which summarizes the Java performance book by Scott Oaks.

In the previous chapter we discussed the performance toolbox in Java. We mentioned JVM commands to monitor CPU, network, and disk usage, and we also talked about JFR (Java Flight Recorder).

In this chapter we are going to talk about how Java code runs on a computer and how it's converted into binary. We will also describe the difference between JIT and AOT compilers, along with some details about GraalVM.

Great, let's start the fourth chapter...
Intro

Chapter Title:

Working with the JIT Compiler

The just-in-time (JIT) compiler is the heart of the Java Virtual Machine; nothing controls the performance of your application more than the JIT compiler.

1) Just-in-Time Compilers: An Overview

Computer CPUs can execute only a relatively small set of specific instructions, called machine code.
There are two types of programming languages:

  • Compiled languages, like C++ and Fortran: their programs are delivered as binary (machine) code, ready to run on the CPU.
  • Interpreted languages, like PHP and Perl: they are interpreted, which means that the same program source code can be run on any CPU as long as the machine has the correct interpreter (that is, the program called php or perl). The interpreter translates each line of the program into binary code as that line is executed.

Each approach has advantages and disadvantages. Programs written in interpreted languages are portable; however, they run more slowly than compiled ones.

A- HotSpot Compilation

As discussed in Chapter 1, the Java implementation discussed in this book is Oracle’s HotSpot JVM. This name (HotSpot) comes from the approach it takes toward compiling the code.

javac compiles the Java source code into Java bytecodes; then, at runtime, the JVM starts by interpreting those bytecodes and compiles the frequently executed parts (the "hot spots") into native machine code.

When the JVM executes code, it does not begin compiling the code immediately. There are two basic reasons for this:

  • First: if the code is going to be executed only once, compiling it is essentially a wasted effort; it is faster to interpret the Java bytecodes than to compile them and execute the compiled code only once.
  • Second: optimization. The more times the JVM executes a particular method or loop, the more information it has about that code, which allows the JVM to make numerous optimizations when it compiles it.

Example: consider the equals() method. This method exists in every Java object (because it is inherited from the Object class) and is often overridden. When the interpreter encounters the statement b = obj1.equals(obj2), it must look up the type (class) of obj1 in order to know which equals() method to execute. This dynamic lookup can be somewhat time-consuming.

Over time, say, the JVM notices that each time this statement is executed, obj1 is of type java.lang.String. The JVM can then produce compiled code that directly calls the String.equals() method. Now the code is faster not only because it is compiled but also because it can skip the lookup of which method to call.
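As a minimal sketch (the class and variable names here are hypothetical, not from the book), this is the kind of hot, monomorphic call site that the optimization targets:

public class EqualsDemo {
    public static void main(String[] args) {
        Object obj1 = "hello";
        Object obj2 = "hello";
        boolean b = false;
        // After many iterations, HotSpot can observe that obj1 is always a
        // java.lang.String here and compile a direct (devirtualized) call to
        // String.equals(), guarded by a cheap type check in case the
        // assumption is ever violated.
        for (int i = 0; i < 1_000_000; i++) {
            b = obj1.equals(obj2);
        }
        System.out.println(b);
    }
}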


2) Tiered Compilation

Once upon a time, the JIT compiler came in two flavors, and you had to install different versions of the JDK depending on which compiler you wanted to use. These compilers are known as the client (now called C1) and server (now called C2) compilers. Today, all shipping JVMs include both compilers (though in common usage, they are usually referred to as server JVMs).

  • C1: begins compiling sooner; it is less aggressive in its optimizations but faster.
  • C2: begins compiling later, after collecting optimization information while the code is running.

That technique is known as tiered compilation, and it is the technique all JVMs now use. It can be explicitly disabled with the -XX:-TieredCompilation flag (the default value of the TieredCompilation flag is true).


3) Common Compiler Flags

Two commonly used flags affect the JIT compiler; we’ll look at them in this section.
1- Code cache
2- Inspection flag

A- Tuning the Code Cache

When the JVM compiles code, it holds the set of assembly-language instructions in the code cache. The code cache has a fixed size, and once it has filled up, the JVM is not able to compile any additional code.
When the code cache fills up, the JVM spits out this warning:

    Java HotSpot(TM) 64-Bit Server VM warning: CodeCache is full. Compiler has been disabled.
    Java HotSpot(TM) 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=

To solve the problem, a typical option is to simply double or quadruple the default.

  • The maximum size of the code cache is set via the -XX:ReservedCodeCacheSize=N flag (where N is the default maximum just mentioned).
  • There is also an initial size, specified by -XX:InitialCodeCacheSize=N.
  • The initial size of the code cache is 2,496 KB, and the default maximum size is 240 MB.

Resizing the cache happens in the background and doesn't really affect performance, so setting ReservedCodeCacheSize (i.e., the maximum code cache size) is all that is generally needed.
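For example, a minimal sketch of raising the maximum code cache size to double the default at launch (the jar name here is purely illustrative):

    java -XX:ReservedCodeCacheSize=480m -jar myapp.jar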

In Java 11, the code cache is segmented into three parts:

  • Nonmethod code
  • Profiled code
  • Nonprofiled code

By default, the segmented code cache is sized the same way (up to 240 MB), and you can still adjust the total size of the code cache by using the ReservedCodeCacheSize flag.

You’ll rarely need to tune these segments individually, but if so, the flags are as follows:

  • -XX:NonNMethodCodeHeapSize=N for the nonmethod code
  • -XX:ProfiledCodeHeapSize=N for the profiled code
  • -XX:NonProfiledCodeHeapSize=N for the nonprofiled code

B- Inspecting the Compilation Process

The second flag isn’t a tuning per se: it will not improve the performance of an application. Rather, the -XX:+PrintCompilation flag (which by default is false) gives us visibility into the workings of the compiler (though we’ll also look at tools that provide similar information).

If PrintCompilation is enabled, every time a method (or loop) is compiled, the JVM prints out a line with information about what has just been compiled, with the following format:

timestamp  compilation_id  attributes  (tiered_level)  method_name  size  deopt
  • timestamp here is the time after the compilation has finished (relative to 0, which is when the JVM started).
  • compilation_id is an internal task ID. Sometimes you may see an out-of-order compilation ID; this happens most frequently when there are multiple compilation threads.
  • attributes is a series of five characters that indicates the state of the code being compiled:
    • % the compilation is OSR (on-stack replacement): JIT compilation is an asynchronous process: when the JVM decides that a certain method should be compiled, that method is placed in a queue. Rather than wait for the compilation, the JVM then continues interpreting the method, and the next time the method is called, the JVM will execute the compiled version of the method (assuming the compilation has finished, of course).
    • s The method is synchronized
    • ! The method has an exception handler.
    • b Compilation occurred in blocking mode: will never be printed by default in current versions of Java; it indicates that compilation did not occur in the background.
    • n Compilation occurred for a wrapper to a native method: indicates that the JVM generated compiled code to facilitate the call into a native method.
  • tiered_level indicates which compiler tier was used (C1 levels vs. C2 levels). If tiered compilation has been disabled, this field will be blank; otherwise, it is a number indicating which tier completed the compilation.
  • method_name the name of the compiled method
  • size the size (in bytes) of the code being compiled.
  • deopt in some cases appears and indicates that some sort of deoptimization has occurred.
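To try this yourself, the flag is simply added to the launch command; each compiled method then produces a line in the format described above (the jar name is illustrative):

    java -XX:+PrintCompilation -jar myapp.jar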

The compilation log may also include a line that looks like this:

    timestamp compile_id COMPILE SKIPPED: reason

The possible reasons are:

  • Code cache filled: The size of the code cache needs to be increased using the ReservedCodeCacheSize flag.
  • Concurrent classloading: The class was modified as it was being compiled. The JVM will compile it again later; you should expect to see the method recompiled later in the log.

Here are a few lines of output from enabling PrintCompilation on the stock REST application:

[Image: sample PrintCompilation output from the stock REST application]

  • The server took about 2 seconds to start; the remaining 26 seconds before anything else was compiled were essentially idle as the application server waited for requests.
  • The process() method is synchronized, so the attributes include an s.
  • Inner classes are compiled just like any other class and appear in the output with the usual Java nomenclature: outer-classname$inner-classname.
  • The processRequest() method shows up with the exception handler as expected.

C- Tiered Compilation Levels

The compilation log for a program using tiered compilation prints the tier level at which each method is compiled.

So the levels of compilation are as follows:

  • 0: Interpreted code
  • 1: Simple C1 compiled code
  • 2: Limited C1 compiled code
  • 3: Full C1 compiled code
  • 4: C2 compiled code

A typical compilation log shows that most methods are first compiled at level 3: full C1 compilation. (All methods start at level 0, of course, but that doesn’t appear in the log.) If a method runs often enough, it will get compiled at level 4 (and the level 3 code will be made not entrant). This is the most frequent path: the C1 compiler waits to compile something until it has information about how the code is used that it can leverage to perform optimizations.

  • If the C2 compiler queue is full, methods will be pulled from the C2 queue and compiled at level 2, which is the level at which the C1 compiler uses the invocation and back-edge counters (but doesn't require profile feedback).

  • On the other hand, if the C1 compiler queue is full, a method that is scheduled for compilation at level 3 may become eligible for level 4 compilation while still waiting to be compiled at level 3. In that case, it is quickly compiled to level 2 and then transitioned to level 4.

  • Trivial methods may start in either level 2 or 3 but then go to level 1 because of their trivial nature. If the C2 compiler for some reason cannot compile the code, it will also go to level 1. And, of course, when code is deoptimized, it goes to level 0.

D- Deoptimization

Deoptimization means that the compiler has to “undo” a previous compilation. The effect is that the performance of the application will be reduced, at least until the compiler can recompile the code in question.

Deoptimization occurs in two cases:

  • When code is made not entrant. Two things cause code to be made not entrant:
    • One is due to the way classes and interfaces work. Example: suppose an interface has two implementations, and the compiler observes that only one of them is ever called, so it inlines the code of that implementation as an optimization. If the second implementation is later called, the compiler's assumption is no longer correct, so it must deoptimize (see the sketch after this list).
    • The other is an implementation detail of tiered compilation: when code is compiled by the C2 compiler, the JVM must replace the code already compiled by the C1 compiler.
  • When code is made zombie. When the compilation log reports that it has made zombie code, it is saying that it has reclaimed previous code that was made not entrant. For performance, this is a good thing. Recall that the compiled code is held in a fixed-size code cache; when zombie methods are identified, the code in question can be removed from the code cache, making room for other classes to be compiled (or limiting the amount of memory the JVM will need to allocate later).
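Here is a minimal sketch of that interface scenario (all class names are hypothetical): while only FastImpl instances flow through process(), the JIT may inline FastImpl.work() at that call site; the first SlowImpl invalidates the assumption, and the optimized code is made not entrant.

import java.util.ArrayList;
import java.util.List;

public class DeoptDemo {
    interface Work { int work(); }
    static class FastImpl implements Work { public int work() { return 1; } }
    static class SlowImpl implements Work { public int work() { return 2; } }

    // While every caller passes a FastImpl, the JIT can treat this call
    // site as monomorphic and inline FastImpl.work().
    static int process(Work w) {
        return w.work();
    }

    public static void main(String[] args) {
        List<Work> items = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            items.add(new FastImpl());
        }
        items.add(new SlowImpl()); // violates the compiler's assumption

        int sum = 0;
        for (Work w : items) {
            sum += process(w);
        }
        System.out.println(sum);
    }
}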

4) Advanced Compiler Flags

A- Compilation Thresholds

This chapter has been somewhat vague in defining just what triggers the compilation of code. The major factor is how often the code is executed; once it is executed a certain number of times, its compilation threshold is reached, and the compiler deems that it has enough information to compile the code.

Compilation is based on two counters in the JVM:

  • the number of times the method has been called
  • the number of times any loops in the method have branched back. Branching back can effectively be thought of as the number of times a loop has completed execution, either because it reached the end of the loop itself or because it executed a branching statement like continue.

When the JVM executes a Java method, it checks the sum of those two counters and decides whether the method is eligible for compilation.

Tunings affect these thresholds. When tiered compilation is disabled, standard compilation is triggered by the value of the -XX:CompileThreshold=N flag. The default value of N is 10,000. Changing the value of the CompileThreshold flag will cause the compiler to choose to compile the code sooner (or later) than it normally would have. Note, however, that although there is one flag here, the threshold is calculated by adding the sum of the back-edge loop counter plus the method entry counter.

You can change the flags -XX:Tier3InvocationThreshold=N (default 200) to get C1 to compile a method more quickly, and -XX:Tier4InvocationThreshold=N (default 5000) to get C2 to compile a method more quickly. Similar flags are available for the back-edge thresholds.
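As a sketch, lowering both invocation thresholds so methods compile sooner (the values and jar name are illustrative, not recommendations):

    java -XX:Tier3InvocationThreshold=100 -XX:Tier4InvocationThreshold=1000 -jar myapp.jar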

B- Compilation Threads

When a method (or loop) becomes eligible for compilation, it is queued for compilation. That queue is processed by one or more background threads.

  • These queues are not strictly first in, first out; methods whose invocation counters are higher have priority.
  • The C1 and C2 compilers have different queues, each of which is processed by (potentially multiple) different threads.

The following table shows default number of C1 and C2 compiler threads for tiered compilation.

[Table: default number of C1 and C2 compiler threads for tiered compilation, by CPU count]

If tiered compilation is disabled, only the given number of C2 compiler threads are started.
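The total number of compiler threads can be adjusted with the -XX:CICompilerCount=N flag. A sketch (the value and jar name are illustrative):

    java -XX:CICompilerCount=4 -jar myapp.jar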

C- Inlining

One of the most important optimizations the compiler makes is to inline methods. Code that follows good object-oriented design often contains attributes that are accessed via getters (and perhaps setters):

public class Point {
    private int x, y;

    public int getX() { return x; }
    public void setX(int i) { x = i; }
}

The overhead for invoking a method call like this is quite high, especially relative to the amount of code in the method.

Fortunately, JVMs now routinely perform code inlining for these kinds of methods. Hence, you can write this code:

Point p = getPoint();
p.setX(p.getX() * 2);

The compiled code will essentially execute this:

Point p = getPoint();
p.x = p.x * 2;

Inlining is enabled by default. It can be disabled using the -XX:-Inline flag.

D- Escape Analysis

The C2 compiler performs aggressive optimizations if escape analysis is enabled (-XX:+DoEscapeAnalysis, which is true by default). For example, consider this class to work with factorials:

public class Factorial {
    private BigInteger factorial;
    private int n;

    public Factorial(int n) {
        this.n = n;
    }

    public synchronized BigInteger getFactorial() {
        if (factorial == null)
            factorial = ...; // computation of n! elided in the original
        return factorial;
    }
}

To store the first 100 factorial values in an array, this code would be used:

ArrayList<BigInteger> list = new ArrayList<BigInteger>();
for (int i = 0; i < 100; i++) {
    Factorial factorial = new Factorial(i);
    list.add(factorial.getFactorial());
}

The factorial object is referenced only inside that loop; no other code can ever access that object. Hence, the JVM is free to perform optimizations on that object:

  • It needn’t get a synchronization lock when calling the getFactorial() method.
  • It needn’t store the field n in memory; it can keep that value in a register. Similarly, it can store the factorial object reference in a register.
  • In fact, it needn’t allocate an actual factorial object at all; it can just keep track of the individual fields of the object.

This kind of optimization is sophisticated: it is simple enough in this example, but these optimizations are possible even with more-complex code.


5) Tiered Compilation Trade-offs

Question: Given the performance advantages it provides, is there ever a reason to turn tiered compilation off?

Answer:

  • One such reason might be when running in a memory-constrained environment.
  • For example, you may be running in a Docker container with a small memory limit, or in a cloud virtual machine that just doesn't have quite enough memory.
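In such an environment, tiered compilation can be turned off so that only the C2 compiler is used (a sketch; the jar name is illustrative):

    java -XX:-TieredCompilation -jar myapp.jar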

The table below shows the effect of tiered compilation on the code cache:

[Table: effect of tiered compilation on the code cache]

The C1 compiler compiled about four times as many classes and predictably required about four times as much memory for the code cache.


6) The GraalVM

The GraalVM is a new virtual machine. It provides a means to run Java code, of course, but also code from many other languages. This universal virtual machine can also run JavaScript, Python, Ruby, R, and traditional JVM bytecodes from Java and other languages that compile to JVM bytecodes (e.g., Scala, Kotlin, etc.). Graal comes in two editions: a full open source Community Edition (CE) and a commercial Enterprise Edition (EE). Each edition has binaries that support either Java 8 or Java 11.

The GraalVM has two important contributions to JVM performance:

  • First, an add-on technology allows the GraalVM to produce fully native binaries.

  • Second, the GraalVM can run in a mode as a regular JVM, but it contains a new implementation of the C2 compiler. This compiler is written in Java (as opposed to the traditional C2 compiler, which is written in C++).

Within the JVM, using the GraalVM compiler is considered experimental, so to enable it, you need to supply these flags: -XX:+UnlockExperimentalVMOptions, -XX:+EnableJVMCI, and -XX:+UseJVMCICompiler. The default for all those flags is false.
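Putting those together, a sketch of enabling the Graal compiler inside a regular JVM (the jar name is illustrative):

    java -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -XX:+UseJVMCICompiler -jar myapp.jar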

The following table shows the performance of the Graal compiler:

[Table: performance of the GraalVM compiler]


7) Precompilation

We began this chapter by discussing the philosophy behind a just-in-time compiler. Although it has its advantages, code is still subject to a warm-up period before it executes. What if in our environment a traditional compiled model would work better: an embedded system without the extra memory the JIT requires, or a program that completes before having a chance to warm up?
In this section, we'll look at two experimental features that address that scenario. Ahead-of-time compilation is an experimental feature of the standard JDK 11, and the ability to produce a fully native binary is a feature of the GraalVM.

A- Ahead-of-Time (AOT) Compilation

Ahead-of-time (AOT) compilation was first available in JDK 9 for Linux only, but in JDK 11 it is available on all platforms. From a performance standpoint, it is still a work in progress, but this section will give you a sneak peek at it.

AOT compilation allows you to compile some (or all) of your application in advance of running it. This compiled code becomes a shared library that the JVM uses when starting the application. In theory, this means the JIT needn’t be involved, at least in the startup of your application: your code should initially run at least as well as the C1 compiled code without having to wait for that code to be compiled.

In practice, it’s a little different: the startup time of the application is greatly affected by the size of the shared library (and hence the time to load that shared library into the JVM). That means a simple application like a “Hello, world” application won’t run any faster when you use AOT compilation (in fact, it may run slower depending on the choices made to precompile the shared library).
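As a sketch of the workflow in JDK 11 (the class and library names are illustrative): the jaotc tool compiles classes into a shared library, which is then given to the JVM at startup:

    jaotc --output libHelloWorld.so HelloWorld.class
    java -XX:AOTLibrary=./libHelloWorld.so HelloWorld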

B- GraalVM Native Compilation

AOT compilation was beneficial for relatively large programs but didn't help (and could hinder) small, quick-running programs. That is because it's still an experimental feature and because its architecture has the JVM load the shared library.

The GraalVM, on the other hand, can produce full native executables that run without the JVM. These executables are ideal for short-lived programs. If you ran the examples, you may have noticed references to GraalVM classes in some places (like ignored errors): AOT compilation uses GraalVM as its foundation. This is an Early Adopter feature of the GraalVM; it can be used in production with the appropriate license but is not subject to warranty.
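As a sketch, producing and running such an executable with GraalVM's native-image tool (the class name is illustrative; the tool writes a standalone binary):

    native-image HelloWorld
    ./helloworld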

Limitations also exist on which Java features can be used in a program compiled into native code. These limitations include the following:

  • Dynamic class loading (e.g., by calling Class.forName()).
  • Finalizers.
  • The Java Security Manager.
  • JMX and JVMTI (including JVMTI profiling).
  • Use of reflection often requires special coding or configuration.
  • Use of dynamic proxies often requires special configuration.
  • Use of JNI requires special coding or configuration.

🏃 See you in chapter 5 ...


🐒 Take a tip

Embrace your beliefs, but keep yourself open to change. 🌔

Embrace changes
