This is the second article in the JVM performance optimization series; its core topic is the Java compiler.
In it, the author (Eva Andreasson) first introduces the different types of compilers and compares the runtime performance of client-side compilation, server-side compilation, and tiered compilation. At the end of the article she introduces several common JVM optimizations, such as dead-code elimination, code inlining, and loop optimization.
Java's proudest feature, platform independence, comes from the Java compiler. Software developers do their best to write good Java applications, and a compiler runs behind the scenes to produce efficient executable code for the target platform. Different compilers suit different application requirements and thus produce different optimization results. The better you understand how compilers work, and the more kinds of compilers you know, the better you can optimize your Java programs.
This article highlights and explains the differences between the various Java virtual machine compilers, and discusses some optimizations commonly applied by just-in-time (JIT) compilers.
What is a compiler?
Simply put, a compiler takes a program in one programming language as input and produces a program in another, executable language as output. javac is the most familiar compiler; it ships with every JDK. javac takes Java source code as input and converts it into JVM-executable code, bytecode. The bytecode is stored in files ending in .class and is loaded into the Java runtime environment when the Java program starts.
Bytecode cannot be read directly by the CPU; it must first be translated into machine instructions that the current platform understands. Inside the JVM there is another compiler responsible for translating bytecode into instructions executable on the target platform. Some JVM compilers work through several stages of intermediate representation; for example, a compiler may pass through several different intermediate forms before translating bytecode into machine instructions.
We want our code to be as platform-agnostic as possible, so it is the last level of translation, from the lowest bytecode representation to real machine code, that actually binds the executable code to a specific platform's architecture. At the highest level, compilers divide into static compilers and dynamic compilers. We can choose a compiler based on the target execution environment, the optimization results we want, and the resource constraints we must meet. The previous article briefly discussed static and dynamic compilers; the following sections explain them in more depth.
Static compilation vs. dynamic compilation
The javac mentioned earlier is an example of static compilation. With a static compiler, the input code is compiled once, and the output is the form in which the program will be executed from then on. Unless you update the source code and recompile, the compiled output never changes: the input is static and the compiler is a static compiler.
With static compilation, the following program:

static int add7(int x) {
    return x + 7;
}

will be converted into bytecode similar to the following:

iload_0
bipush 7
iadd
ireturn
A dynamic compiler compiles one language into another on the fly, that is, while the program is running. The advantage of dynamic compilation and optimization is that the compiler can adapt to changes as the application runs. A Java runtime often operates in unpredictable, even changing, environments, so dynamic compilation suits it well. Most JVMs use a dynamic compiler, the JIT compiler. Note that dynamic compilation and code optimization require some additional data structures, threads, and CPU resources; the more advanced the optimizer or bytecode context analyzer, the more resources it consumes. These costs, however, are usually negligible compared with the significant performance improvements.
JVM Types and Platform Independence of Java
A feature common to all JVM implementations is compiling bytecode into machine instructions. Some JVMs interpret code when the application is loaded and use performance counters to find "hot" code; others do this by compiling. The main drawback of compilation is that it requires a lot of resources, but it also enables better performance optimization.
If you are new to Java, the intricacies of the JVM may well confuse you. The good news is that you don't need to sort them all out: the JVM manages compilation and optimization of the code, so you need not worry about machine instructions or how to write code to best match the architecture of the platform the program runs on.
From Java bytecode to executable code
Once your Java code is compiled into bytecode, the next step is to translate the bytecode instructions into machine code. This can be done by an interpreter or by a compiler.
Interpretation
Interpretation is the simplest way to execute bytecode. The interpreter looks up the hardware instruction corresponding to each bytecode instruction in a lookup table and sends it to the CPU for execution.
You can think of the interpreter as a dictionary: for each specific word (bytecode instruction), there is a specific translation (machine-code instruction). Because the interpreter executes each instruction immediately as it reads it, it cannot optimize across a set of instructions. And because a bytecode must be re-interpreted every time it is called, the interpreter runs very slowly. Interpretation executes code accurately, but the unoptimized output instruction stream may be far from optimal for the target platform's processor.
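To make the dictionary analogy concrete, here is a minimal, hypothetical sketch of a dispatch loop for a tiny stack machine. The opcode names and the ToyInterpreter class are invented for illustration; a real JVM interpreter is far more elaborate, but the key limitation is the same: each instruction is looked up and executed in isolation, with no cross-instruction optimization.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A toy interpreter for a tiny stack machine: each opcode is looked up
// and executed immediately, one at a time -- the "dictionary" model.
public class ToyInterpreter {
    // Invented opcodes, loosely modeled on JVM bytecodes.
    static final int ILOAD = 0;   // push the argument x
    static final int BIPUSH = 1;  // push the next operand in the stream
    static final int IADD = 2;    // pop two values, push their sum
    static final int IRETURN = 3; // pop and return the top of stack

    static int run(int[] code, int x) {
        Deque<Integer> stack = new ArrayDeque<>();
        int pc = 0;
        while (pc < code.length) {
            switch (code[pc++]) {          // the "dictionary lookup"
                case ILOAD   -> stack.push(x);
                case BIPUSH  -> stack.push(code[pc++]);
                case IADD    -> stack.push(stack.pop() + stack.pop());
                case IRETURN -> { return stack.pop(); }
            }
        }
        throw new IllegalStateException("missing ireturn");
    }

    public static void main(String[] args) {
        // The bytecode for add7: iload_0, bipush 7, iadd, ireturn
        int[] add7 = { ILOAD, BIPUSH, 7, IADD, IRETURN };
        System.out.println(run(add7, 35)); // prints 42
    }
}
```

Note that even if the same bytecode sequence runs a million times, this loop re-dispatches every instruction every time, which is exactly why pure interpretation is slow.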
Compilation
The compiler, by contrast, loads all the code to be executed into the runtime, so when it translates bytecode it can refer to all or part of the runtime context. Its decisions are based on code-graph analysis, such as comparing different execution branches and consulting runtime context data.
After a bytecode sequence is translated into a machine-code instruction set, optimizations can be performed on that instruction set. The optimized instructions are stored in a structure called the code cache. The next time those bytecodes are executed, the optimized code is fetched directly from the code cache and run. In some cases the compiler does not optimize immediately, but instead relies on a separate mechanism, performance counters, to decide when to optimize.
The advantage of the code cache is that the resulting instructions can be executed immediately, without reinterpretation or recompilation!
This can greatly reduce execution time, especially for Java applications where a method is called multiple times.
Optimization
With the introduction of dynamic compilation comes the opportunity to insert performance counters. For example, the compiler can insert a counter that is incremented every time a block of bytecode (corresponding to a specific method) is called. The compiler uses these counters to find "hot blocks", so it can determine which code blocks are worth optimizing for the greatest performance gain. Runtime profiling data helps the compiler make more optimization decisions online, further improving execution efficiency. As the profiling data becomes richer and more accurate, it uncovers more optimization opportunities and supports better decisions, such as how to better order instructions, whether to replace the original instruction sequence with a more efficient one, and whether to eliminate redundant operations.
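The counter-and-threshold idea can be sketched in a few lines. The HotCounter class and its threshold below are invented for illustration; real JVMs keep internal counters (HotSpot, for example, tracks method invocations and loop back-edges) and the thresholds are tunable JVM parameters, not application code.

```java
import java.util.HashMap;
import java.util.Map;

// A toy model of invocation counters: each call bumps a per-method
// counter, and when a counter crosses the threshold the method is
// marked "hot", i.e. a candidate for optimized (re)compilation.
public class HotCounter {
    private final Map<String, Integer> counters = new HashMap<>();
    private final int threshold;

    public HotCounter(int threshold) {
        this.threshold = threshold;
    }

    // Returns true exactly when this call pushes the method over the
    // threshold -- the moment a JIT would queue it for compilation.
    public boolean recordCall(String method) {
        int count = counters.merge(method, 1, Integer::sum);
        return count == threshold;
    }

    public static void main(String[] args) {
        HotCounter profiler = new HotCounter(3); // tiny threshold for demo
        for (int i = 0; i < 5; i++) {
            if (profiler.recordCall("add7")) {
                System.out.println("add7 became hot on call " + (i + 1));
            }
        }
    }
}
```

The interesting design point is that profiling itself costs cycles, which is why the counter update must be cheap: the expensive analysis only runs once a method has proven itself hot.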
An example

Consider the following Java code:

static int add7(int x) {
    return x + 7;
}
Javac will statically translate it into the following bytecode:
iload_0
bipush 7
iadd
ireturn
When this method is called, the bytecode is dynamically compiled into machine instructions. Once the performance counter for the method (if one exists) reaches a specified threshold, the method may also be optimized. The optimized result might look like the following machine instructions:
lea rax, [rdx + 7]
ret
Different compilers are suitable for different applications
Different applications have different needs. Enterprise server-side applications usually need to run for a long time, so they usually want more performance optimization; while client-side applets may want faster response times and less resource consumption. Let's discuss three different compilers and their pros and cons.
Client-side compilers
C1 is the well-known client compiler. It is enabled by starting the JVM with the -client flag. As its name suggests, C1 is a client-side compiler, ideal for client applications that have few system resources available or require fast startup. C1 optimizes code with the help of performance counters; it is a simple, lightweight approach to optimization.
Server-side compilers
For long-running applications such as server-side enterprise applications, a client-side compiler may not be enough; a server-side compiler like C2 is a better choice. C2 is enabled by adding -server to the JVM startup line. Because most server-side applications run for a long time, the C2 compiler can collect much more profiling data than it could for short-running, lightweight client applications, and can therefore apply more advanced optimization techniques and algorithms.
Tip: Warm up your server-side compiler
For server-side deployments, the compiler may need some time to optimize the "hot" code, so server-side deployments often require a warm-up phase. When performing performance measurements on a server-side deployment, always make sure your application has reached a steady state! Giving the compiler enough time to compile pays off handsomely.
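The warm-up tip translates into a simple measurement pattern: exercise the hot path many times before starting the clock. The sketch below is a minimal illustration with arbitrary iteration counts, not a rigorous benchmark; for serious measurements use a dedicated harness such as JMH, which automates warm-up and guards against common benchmarking pitfalls.

```java
public class WarmupDemo {
    static int add7(int x) {
        return x + 7;
    }

    public static void main(String[] args) {
        // Warm-up phase: call the method enough times that the JIT's
        // counters cross their thresholds and the method gets compiled
        // and optimized. The "sink" accumulator keeps the work from
        // being eliminated as dead code.
        long sink = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sink += add7(i);
        }

        // Measurement phase: only now is timing representative of the
        // application's steady state.
        long start = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            sink += add7(i);
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("steady-state time: " + elapsed
                + " ns (sink=" + sink + ")");
    }
}
```

Measuring without the first loop would mix interpreted, partially compiled, and fully optimized executions into one number, which is exactly what the warm-up advice warns against.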
The server-side compiler can obtain more performance tuning data than the client-side compiler, so that it can perform more complex branch analysis and find optimization paths with better performance. The more performance analysis data you have, the better your application analysis results will be. Of course, performing extensive performance analysis requires more compiler resources. For example, if the JVM uses the C2 compiler, it will need to use more CPU cycles, a larger code cache, etc.
Tiered compilation
Tiered compilation mixes client-side and server-side compilation. Azul first implemented tiered compilation in its Zing JVM; more recently the technique was adopted by Oracle's Java HotSpot JVM (as of Java SE 7). Tiered compilation combines the advantages of the client and server compilers. The client compiler is active in two situations: at application startup, and when performance counters reach lower-tier thresholds, to perform quick optimizations. The client compiler also inserts performance counters and prepares instruction sets for later advanced optimization by the server-side compiler. Tiered compilation is a resource-efficient way to profile: because the data is collected during low-overhead compiler activity, it can be used later for more advanced optimizations. This approach also provides more information than counters attached to purely interpreted code.
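One way to observe JIT activity from inside a program is the standard CompilationMXBean from the java.lang.management API; the small sketch below prints the compiler's name and accumulated compilation time. Whether tiered compilation is actually active depends on the JVM and its flags (on HotSpot since Java SE 7, -XX:+TieredCompilation); this snippet only reports what the platform exposes.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

// Query the JVM's JIT compiler through the standard management API.
public class CompilerInfo {
    public static void main(String[] args) {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null) {
            // Some JVMs can run in interpreted-only mode with no JIT.
            System.out.println("No JIT compiler available");
            return;
        }
        System.out.println("JIT compiler: " + jit.getName());
        if (jit.isCompilationTimeMonitoringSupported()) {
            // Cumulative time the JVM has spent in JIT compilation.
            System.out.println("Total compilation time: "
                    + jit.getTotalCompilationTime() + " ms");
        }
    }
}
```

Watching the total compilation time grow during the first minutes of a server's life is a simple, concrete way to see the warm-up phase discussed above.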
Figure 1 describes the performance comparison of interpreters, client-side compilation, server-side compilation, and multi-layer compilation. The X-axis is execution time (unit of time), and the Y-axis is performance (number of operations per unit time)
Figure 1. Compiler performance comparison
Relative to purely interpreted code, a client-side compiler brings roughly a 5 to 10 times performance improvement. How much you gain depends on the efficiency of the compiler, the kinds of optimizers available, and how well the application's design matches the target platform, though program developers can usually ignore the last factor.
Compared with a client-side compiler, a server-side compiler often brings an additional 30% to 50% performance improvement. In most cases that improvement comes at the cost of more resource consumption.
Tiered compilation combines the advantages of both compilers: client-side compilation gives short startup time and fast optimization, while server-side compilation performs more advanced optimizations later in the run.
Some common compiler optimizations
So far we have discussed what it means to optimize code and how and when the JVM performs code optimization. Next I will close this article by introducing some optimizations compilers actually use. JVM optimization actually happens at the bytecode stage (or a lower-level representation), but the Java language is used here to illustrate the methods. This section cannot cover every JVM optimization; rather, I hope these introductions inspire you to explore the hundreds of more advanced optimization methods and to innovate in compiler technology.
Dead code elimination
Dead code elimination, as the name suggests, is to eliminate code that will never be executed - that is, "dead" code.
If the compiler finds redundant instructions at runtime, it removes them from the executed instruction set. For example, in Listing 1 a variable is never used after a value is assigned to it, so the assignment can be ignored entirely at execution time. At the bytecode level, this means the value never needs to be loaded into a register. Not loading it costs less CPU time and thus speeds up the code, ultimately yielding a faster application; if the containing code is called many times per second, the effect is even more pronounced.
Listing 1 uses Java code to illustrate an example of assigning a value to a variable that will never be used.
Listing 1. Dead code:

int timeToScaleMyApp(boolean endlessOfResources) {
    int reArchitect = 24;
    int patchByClustering = 15;
    int useZing = 2;
    if (endlessOfResources)
        return reArchitect + useZing;
    else
        return useZing;
}
During the bytecode phase, if a variable is loaded but never used, the compiler can detect and eliminate the dead code, as shown in Listing 2. If you never perform this loading operation, you can save CPU time and improve the execution speed of the program.
Listing 2. Optimized code:

int timeToScaleMyApp(boolean endlessOfResources) {
    int reArchitect = 24;
    // the unused assignment to patchByClustering was removed here...
    int useZing = 2;
    if (endlessOfResources)
        return reArchitect + useZing;
    else
        return useZing;
}
Redundancy elimination is an optimization that improves application performance by removing duplicate instructions.
Many optimizations try to eliminate machine-level jump instructions (such as JMP on x86). A jump instruction changes the instruction-pointer register and thus redirects the program's execution flow; compared with other assembly instructions it is expensive, which is why we want to reduce or eliminate jumps. Inlining is a practical and well-known optimization for eliminating these transfers: because jumps are costly, inlining frequently called small methods into the caller's body brings many benefits. Listings 3 through 5 demonstrate the benefits of inlining.
Listing 3. The calling method:

int whenToEvaluateZing(int y) {
    return daysLeft(y) + daysLeft(0) + daysLeft(y + 1);
}

Listing 4. The called method:

int daysLeft(int x) {
    if (x == 0)
        return 0;
    else
        return x - 1;
}

Listing 5. The inlined method:

int whenToEvaluateZing(int y) {
    int temp = 0;
    if (y == 0)
        temp += 0;
    else
        temp += y - 1;
    if (0 == 0)
        temp += 0;
    else
        temp += 0 - 1;
    if (y + 1 == 0)
        temp += 0;
    else
        temp += (y + 1) - 1;
    return temp;
}
In Listings 3 through 5 we see a small method called three times inside another method body. The point is that inlining the called method directly into the caller costs less than executing three jump instructions.
Inlining a rarely called method may not make much difference, but inlining a "hot" (frequently called) method can bring a large performance improvement. The inlined code can often be optimized further, as shown in Listing 6.
Listing 6. Further optimization after inlining:

int whenToEvaluateZing(int y) {
    if (y == 0)
        return y;
    else if (y == -1)
        return y - 1;
    else
        return y + y - 1;
}
Loop optimization
Loop optimization plays an important role in reducing the overhead of executing a loop body. Overhead here means expensive jumps, many condition checks, and poor pipelining (that is, sequences of instructions that do no useful work and consume extra CPU cycles). There are many kinds of loop optimization; here are some of the more popular ones:
Loop fusion: when two adjacent loops execute the same number of iterations, the compiler tries to merge their bodies. If the two bodies are completely independent of each other, they can also be executed simultaneously (in parallel).
Loop inversion: at its most basic, a while loop is replaced by a do-while loop wrapped in an if statement. The replacement removes two jump operations per iteration, though it adds a conditional test and increases code size. This optimization is a good example of trading slightly more resources for more efficient code: the compiler weighs the costs and benefits and decides dynamically at runtime.
Loop tiling: the loop is reorganized so that the data it works on fits in the cache.
Loop unrolling: reduces the number of loop-condition checks and jumps. Think of it as executing several iterations "inline" without performing the condition check each time. Unrolling carries some risk, because it may hurt the pipeline and cause a large number of redundant instruction fetches, reducing performance. Once again, the compiler decides at runtime whether to unroll, and unrolls when doing so brings a sufficient performance improvement.
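To illustrate the loop fusion entry above, here is what the transformation looks like written by hand in Java (the compiler performs the equivalent rewrite transparently at a lower level). The method names are mine, and the fused form assumes the two arrays have the same length, mirroring the "same number of iterations" precondition.

```java
public class LoopFusionDemo {
    // Before: two adjacent loops with identical trip counts.
    static long separate(int[] a, int[] b) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) sum += a[i];
        for (int i = 0; i < b.length; i++) sum += b[i];
        return sum;
    }

    // After: one fused loop -- half the increments, condition checks,
    // and backward jumps. Valid because the two loop bodies are
    // independent of each other (requires a.length == b.length).
    static long fused(int[] a, int[] b) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
            sum += b[i];
        }
        return sum;
    }
}
```

Both methods compute the same result; only the loop-control overhead differs.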
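Loop inversion, likewise, can be shown as a hand-written source transformation (in practice the JIT applies it to the compiled code, not your source). The method names are invented for illustration.

```java
public class LoopInversionDemo {
    // Original while loop: the condition is tested at the top, so each
    // iteration costs a conditional jump past the body plus an
    // unconditional jump back up to the test.
    static int sumWhile(int n) {
        int i = 0, sum = 0;
        while (i < n) {
            sum += i;
            i++;
        }
        return sum;
    }

    // Inverted form: a do-while guarded by an if. The body now ends
    // with a single conditional backward jump, trading one extra
    // up-front test (and a little code size) for fewer jumps per
    // iteration.
    static int sumInverted(int n) {
        int i = 0, sum = 0;
        if (i < n) {
            do {
                sum += i;
                i++;
            } while (i < n);
        }
        return sum;
    }
}
```

The guarding if is what keeps the transformation safe when the loop would execute zero times.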
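Finally, a hand-unrolled version illustrates loop unrolling (again, the compiler does this at the machine-code level). The unroll factor of four is an arbitrary choice for the example; a remainder loop handles lengths not divisible by four.

```java
public class LoopUnrollDemo {
    // Rolled loop: one condition check and one backward jump per element.
    static long sum(int[] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++) s += a[i];
        return s;
    }

    // Unrolled by 4: one check and jump per four elements, plus a small
    // remainder loop for the leftover elements. The cost is larger code,
    // which is the instruction-fetch risk mentioned above.
    static long sumUnrolled(int[] a) {
        long s = 0;
        int i = 0;
        for (; i + 3 < a.length; i += 4) {
            s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        }
        for (; i < a.length; i++) s += a[i]; // remainder
        return s;
    }
}
```

The two methods are interchangeable in result; whether the unrolled one is actually faster depends on the processor, which is why the decision is left to the compiler at runtime.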
The above is an overview of how compilers at the bytecode level (or lower level) can improve the performance of applications on the target platform. What we have discussed are some common and popular optimization methods. Due to limited space, we only give some simple examples. Our goal is to arouse your interest in in-depth study of optimization through the above simple discussion.
Conclusion: Reflection Points and Key Points
Choose different compilers according to different purposes.
1. An interpreter is the simplest form of translating bytecode into machine instructions. Its implementation is based on an instruction lookup table.
2. The compiler can optimize based on performance counters, but it requires consuming some additional resources (code cache, optimization thread, etc.).
3. The client compiler can bring 5 to 10 times performance improvement compared to the interpreter.
4. The server-side compiler can bring about a 30% to 50% performance improvement compared to the client-side compiler, but it requires more resources.
5. Tiered compilation combines the advantages of both: client-side compilation for faster response times at startup, then the server-side compiler to optimize frequently called code.
There are many possible ways to optimize code. An important job of the compiler is to analyze all the candidate optimizations and weigh the cost of each against the performance benefit of the final machine instructions.