Yousef Zook

Java Performance - 2 - An Approach to Performance Testing


This article is part 3 of the Java Performance series, which summarizes the book Java Performance by Scott Oaks.

In the previous chapter we went through a brief outline of the book, the platforms, hardware and software, and the performance hints from The Complete Performance Story.

In this chapter we are going to discuss some interesting performance topics. We will understand the difference between:

  • Microbenchmarks
  • Macrobenchmarks
  • Mesobenchmarks

We are also going to talk about response time, batching, and throughput, understand variability, and see some interesting code examples.

Great, let's start the second chapter...

Chapter Title:

An Approach to Performance Testing

1) Test a Real Application

The first principle is that testing should occur on the actual product in the way the product will be used. There are, roughly speaking, three categories of code that can be used for performance testing, each with its own advantages and disadvantages. The category that includes the actual application will provide the best results.

A- Microbenchmarks

The first of these categories is the microbenchmark. A microbenchmark is a test designed to measure a very small unit of performance, for example:

  • The time to call a synchronized method versus a nonsynchronized method
  • The overhead in creating a thread versus using a thread pool
  • The time to execute one arithmetic algorithm versus an alternate implementation

Points to take care of
Consider the following code that measures the performance of different implementations of a method to compute the 50th Fibonacci number:

public void doTest() {
    // Main loop: time nLoops calls of the method under test
    double l;
    long then = System.currentTimeMillis();
    for (int i = 0; i < nLoops; i++) {
        l = fibImpl1(50);
    }
    long now = System.currentTimeMillis();
    System.out.println("Elapsed time: " + (now - then));
}

private double fibImpl1(int n) {
    if (n < 0) throw new IllegalArgumentException("Must be >= 0");
    if (n == 0) return 0d;
    if (n == 1) return 1d;
    double d = fibImpl1(n - 2) + fibImpl1(n - 1);
    if (Double.isInfinite(d)) throw new ArithmeticException("Overflow");
    return d;
}

The previous code has the following issues:
1- Microbenchmarks must use their results: The biggest problem with this code is that it never actually changes any program state. Because the result of the Fibonacci calculation is never used, the compiler is free to discard that calculation. A smart compiler will end up executing the following code:

long then = System.currentTimeMillis();
long now = System.currentTimeMillis();
System.out.println("Elapsed time: " + (now - then));

Why? Because the l variable is never read, the compiler can treat the calculation as redundant code and remove it before execution.
Solution: make sure the result is actually used, for example by storing l in a volatile instance field, as the final version of the test in this section does.
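Here is a minimal sketch of that idea. The class name ResultSink and the iterative fib helper are made up for illustration (they are not from the book); the only point is that each result is written to a volatile field, so the calculation has a visible effect.

```java
// Hypothetical sketch: keep the benchmark result alive by storing it
// in a volatile field, so the calculation has an observable effect.
final class ResultSink {
    // volatile: each assignment is an observable write
    private volatile double l;

    // Illustrative iterative Fibonacci, used only to keep the example fast.
    static double fib(int n) {
        if (n < 0) throw new IllegalArgumentException("Must be >= 0");
        double a = 0d;
        double b = 1d;
        for (int i = 0; i < n; i++) {
            double next = a + b;
            a = b;
            b = next;
        }
        return a;
    }

    // Runs the loop and returns the elapsed time in nanoseconds.
    long timeLoop(int nLoops) {
        long then = System.nanoTime();
        for (int i = 0; i < nLoops; i++) {
            l = fib(50); // the result is stored, not thrown away
        }
        return System.nanoTime() - then;
    }

    double lastResult() {
        return l;
    }
}
```

Reading lastResult() after the loop (or printing it) is one more way to make sure the value is consumed.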

2- Microbenchmarks must not include extraneous operations: This code performs only one operation: calculating the 50th Fibonacci number. A very smart compiler can figure that out and execute the loop only once, or at least discard some of the iterations of the loop since those operations are redundant.
Additionally, the performance of fibImpl1(1000) is likely to be very different from the performance of fibImpl1(1); if the goal is to compare the performance of different implementations, then a range of input values must be considered.
Random inputs provide that range, but generating them inside the timed loop would add the cost of the random number generator to the measurement. The easy way to avoid that is to build the input array first and then process the loop as follows:

int[] input = new int[nLoops];
for (int i = 0; i < nLoops; i++) {
    input[i] = random.nextInt();
}
long then = System.currentTimeMillis();
for (int i = 0; i < nLoops; i++) {
    try {
        l = fibImpl1(input[i]);
    } catch (IllegalArgumentException iae) {
        // negative inputs are rejected by fibImpl1; ignored here
    }
}
long now = System.currentTimeMillis();

3- Microbenchmarks must measure the correct input: The third pitfall here is the input range of the test: selecting arbitrary random values isn’t necessarily representative of how the code will be used. In this case, an exception will be immediately thrown on half of the calls to the method under test (anything with a negative value). An exception will also be thrown anytime the input parameter is greater than 1476, since that is the largest Fibonacci number that can be represented in a double.
Consider this alternate implementation:

public double fibImplSlow(int n) {
    if (n < 0) throw new IllegalArgumentException("Must be >= 0");
    if (n > 1476) throw new ArithmeticException("Must be <= 1476");
    return verySlowImpl(n);
}

Even if verySlowImpl is much slower than the implementation tested in point 2, this version will appear to perform better on random input, because the range checks on its first two lines reject most of the inputs before any real work is done.

4- No warm-up period: The previous code does not include a warm-up period. That matters because one of the performance characteristics of Java is that code performs better the more it is executed, a topic covered in Chapter 4 of the book and related to the JIT compilers in Java.
So the final version of the performance test should look like the following:

package net.sdo;

import java.util.Random;

public class FibonacciTest {
    // volatile so that every result is actually written
    private volatile double l;
    private int nLoops;
    private int[] input;

    public static void main(String[] args) {
        // the loop count is taken from the first command-line argument
        FibonacciTest ft = new FibonacciTest(Integer.parseInt(args[0]));
        ft.doTest(true);   // warm-up run, not reported
        ft.doTest(false);  // measured run
    }

    private FibonacciTest(int n) {
        nLoops = n;
        input = new int[nLoops];
        Random r = new Random();
        // inputs are generated up front, outside the timed loop
        for (int i = 0; i < nLoops; i++) {
            input[i] = r.nextInt(100);
        }
    }

    private void doTest(boolean isWarmup) {
        long then = System.currentTimeMillis();
        for (int i = 0; i < nLoops; i++) {
            l = fibImpl1(input[i]);
        }
        if (!isWarmup) {
            long now = System.currentTimeMillis();
            System.out.println("Elapsed time: " + (now - then));
        }
    }

    private double fibImpl1(int n) {
        if (n < 0)
            throw new IllegalArgumentException("Must be >= 0");
        if (n == 0) return 0d;
        if (n == 1) return 1d;
        double d = fibImpl1(n - 2) + fibImpl1(n - 1);
        if (Double.isInfinite(d)) throw new ArithmeticException("Overflow");
        return d;
    }
}

B- Macrobenchmarks

The best thing to use to measure performance of an application is the application itself, in conjunction with any external resources it uses. Testing the whole application with all of its external resources is called a macrobenchmark.
Complex systems are more than the sum of their parts; they will behave quite differently when those parts are assembled. Mocking out database calls, for example, may mean that you no longer have to worry about the database performance (and hey, you’re a Java person; why should you have to deal with someone else’s performance problem?). But the application then runs under different conditions than it will in production, so the results can be misleading.
The other reason to test the full application is one of resource allocation. In a perfect world, there would be enough time to optimize every line of code in the application. In the real world, deadlines loom, and optimizing only one part of a complex environment may not yield immediate benefits.

C- Mesobenchmarks

Java EE engineers tend to use the term microbenchmark for something else: benchmarks that measure one aspect of performance, but that still execute a lot of code.
An example of such a Java EE benchmark might be something that measures how quickly the response from a simple JSP can be returned from an application server. The code involved in such a request is substantial compared to a traditional microbenchmark: there is a lot of socket-management code, code to read the request, code to find (and possibly compile) the JSP, code to write the answer, and so on. From a traditional standpoint, this is not microbenchmarking.
This kind of test is not a macrobenchmark either: there is no security (e.g., the user does not log in to the application), no session management, and no use of a host of other Java EE features.
It sits somewhere in between, so it is called a mesobenchmark.

Common Code Examples

Many of the examples throughout the book are based on a sample application that calculates the “historical” high and low price of a stock over a range of dates, as well as the standard deviation during that time. Historical is in quotes here because in the application, all the data is fictional; the prices and the stock symbols are randomly generated.

The basic object within the application is a StockPrice object that represents the price range of a stock on a given day:

public interface StockPrice { 
    String getSymbol();
    Date getDate();
    BigDecimal getClosingPrice();
    BigDecimal getHigh();
    BigDecimal getLow();
    BigDecimal getOpeningPrice();
    boolean isYearHigh();
    boolean isYearLow();
    Collection<? extends StockOptionPrice> getOptions();
}

The sample applications typically deal with a collection of these prices, representing the history of the stock over a period of time (e.g., 1 year or 25 years, depending on the example):

public interface StockPriceHistory { 
    StockPrice getPrice(Date d);
    Collection<StockPrice> getPrices(Date startDate, Date endDate);
    Map<Date, StockPrice> getAllEntries();
    Map<BigDecimal,ArrayList<Date>> getHistogram();
    BigDecimal getAveragePrice();
    Date getFirstDate();
    BigDecimal getHighPrice();
    Date getLastDate();
    BigDecimal getLowPrice();
    BigDecimal getStdDev();
    String getSymbol();
}

The basic implementation of this interface loads a set of prices from the database:

public class StockPriceHistoryImpl implements StockPriceHistory {
    // fields (symbol, prices, firstDate, lastDate, msPerDay) and the
    // remaining interface methods are omitted from this excerpt

    public StockPriceHistoryImpl(String s, Date startDate, Date endDate, EntityManager em) {
        Date curDate = new Date(startDate.getTime());
        symbol = s;
        // walk the date range one day at a time
        while (!curDate.after(endDate)) {
            StockPriceImpl sp = em.find(StockPriceImpl.class, new StockPricePK(s, (Date) curDate.clone()));
            if (sp != null) {
                Date d = (Date) curDate.clone();
                if (firstDate == null) {
                    firstDate = d;
                }
                prices.put(d, sp);
                lastDate = d;
            }
            curDate.setTime(curDate.getTime() + msPerDay);
        }
    }
}

The architecture of the samples is designed to be loaded from a database, and that functionality will be used in the examples in Chapter 11. However, to facilitate running the examples, most of the time they will use a mock entity manager that generates random data for the series.

2) Understand Throughput, Batching, and Response Time

The second principle in performance testing involves various ways to look at the application’s performance. Which one to measure depends on which factors are most important to your application.

A- Elapsed Time (Batch) Measurements:

The simplest way of measuring performance is to see how long it takes to accomplish a certain task, for example:

  • Retrieve the history of 10,000 stocks for a 25-year period and calculate the standard deviation of those prices
  • Produce a report of the payroll benefits for the 50,000 employees of a corporation
  • Execute a loop 1,000,000 times

In the non-Java world, this testing is straightforward: the application is written, and the time of its execution is measured. In the Java world, there is one wrinkle to this: just-in-time compilation, which means that the program needs some time before it is fully optimized (that is why the code in the previous section needed a warm-up).

B- Throughput Measurements:

A throughput measurement is based on the amount of work that can be accomplished in a certain period of time.

  • In a client-server test, a throughput measurement means that clients have no think time. If there is a single client, that client sends a request to the server. When it receives a response, it immediately sends a new request.
  • This measurement is frequently referred to as transactions per second (TPS), requests per second (RPS), or operations per second (OPS).
  • All client-server tests run the risk that the client cannot send data quickly enough to the server. This may occur because there aren’t enough CPU cycles on the client machine to run the desired number of client threads, or because the client has to spend a lot of time processing the request before it can send a new request. In those cases, the test is effectively measuring the client performance rather than the server performance, which is usually not the goal.
  • It is common for tests that measure throughput to also report the average response time of their requests.
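
To make the "no think time" idea concrete, here is a rough sketch of a single client that fires operations back to back for a fixed window and reports operations per second. ThroughputProbe and its method names are invented for this example; a real client-server test would send requests over the network instead of calling a Runnable.

```java
// Hypothetical sketch: one client, no think time, fixed measurement window.
final class ThroughputProbe {
    // Runs op back to back for roughly durationMillis and returns
    // how many operations completed in that window.
    static long countOperations(Runnable op, long durationMillis) {
        long deadline = System.nanoTime() + durationMillis * 1_000_000L;
        long completed = 0;
        while (System.nanoTime() < deadline) {
            op.run();      // as soon as one operation ends, start the next
            completed++;
        }
        return completed;
    }

    // Converts a count over a window into operations per second (OPS).
    static double opsPerSecond(long completed, long durationMillis) {
        return completed * 1000.0 / durationMillis;
    }
}
```

For example, 500 operations completed in a 2,000 ms window works out to 250 OPS. Keep in mind the warning above: if the client itself is slow, this number describes the client, not the server.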

C- Response Time Tests

The last common test is one that measures response time: the amount of time that elapses between the sending of a request from a client and the receipt of the response.
The difference between a response time test and a throughput test is that the client threads in a response time test sleep for some period of time between operations. This is referred to as think time.
When think time is introduced into a test, throughput becomes fixed: a given number of clients executing requests with a given think time will always yield the same TPS.
The simplest way to introduce think time is for clients to sleep for a fixed period between requests:

while (!done) {
    time = executeOperation();
    Thread.sleep(30 * 1000);
}

In this case, the throughput is somewhat dependent on the response time.
An alternative is known as cycle time (rather than think time). Cycle time sets the total time between requests to 30 seconds, so that the time the client sleeps depends on the response time:

while (!done) {
    time = executeOperation();
    Thread.sleep(30 * 1000 - time);
}

This alternative yields a fixed throughput of 0.033 OPS per client regardless of the response time (assuming the response time is always less than 30 seconds in this example).
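
The arithmetic behind those two statements can be sketched in a few lines. ClientThroughput is a made-up helper, and it assumes an idealized client whose only delays are the sleep and the response time:

```java
// Hypothetical sketch: per-client throughput under think time vs. cycle time.
final class ClientThroughput {
    // Think time: the sleep is fixed, so one full cycle lasts
    // think + response seconds and throughput drops as responses slow down.
    static double withThinkTime(double thinkSeconds, double responseSeconds) {
        return 1.0 / (thinkSeconds + responseSeconds);
    }

    // Cycle time: the sleep shrinks as the response grows, so one full
    // cycle always lasts cycleSeconds (as long as the response fits in it).
    static double withCycleTime(double cycleSeconds, double responseSeconds) {
        if (responseSeconds > cycleSeconds) {
            throw new IllegalArgumentException("Response time exceeds cycle time");
        }
        return 1.0 / cycleSeconds;
    }
}
```

With a 30-second think time and a 2-second response, each client manages 1/32 = 0.03125 OPS. With a 30-second cycle time the same client stays at 1/30 (about 0.033) OPS whether the response takes 2 seconds or 20.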

There are two ways of measuring response time. Response time can be reported as:

  • Average: the individual times are added together and divided by the number of requests.
  • Percentile request: for example the 90th percentile response time. If 90% of responses are less than 1.5 seconds and 10% of responses are greater than 1.5 seconds, then 1.5 seconds is the 90th percentile response time.

[Figure: two response-time graphs showing why the percentile is important; the first graph is labeled "Normal"]
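
A small sketch can show why the two numbers tell different stories. ResponseStats is an invented helper, and the percentile uses the simple nearest-rank definition, which is only one of several definitions in use:

```java
import java.util.Arrays;

// Hypothetical sketch: average vs. nearest-rank percentile of response times.
final class ResponseStats {
    // Arithmetic mean of all response times.
    static double average(double[] times) {
        double sum = 0;
        for (double t : times) {
            sum += t;
        }
        return sum / times.length;
    }

    // Nearest-rank percentile: the smallest sample such that at least
    // pct percent of all samples are less than or equal to it.
    static double percentile(double[] times, double pct) {
        double[] sorted = times.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

Take 20 requests where 19 finish in 1 second and one outlier takes 100 seconds: the average is about 5.95 seconds, while the 90th percentile is still 1 second. One slow request drags the average far away from what most users actually saw.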


3) Understand Variability

The third principle involves understanding how test results vary over time. Programs that process exactly the same set of data will produce a different timing result each time they are run. Why? Because of the following:

  • Background processes on the machine will affect the application
  • The network will be more or less congested when the program is run
  • And so on

You must take statistics into account when measuring the performance of an application: run the test many times to be more confident of your results.
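
As a rough illustration of "run the test many times", the sketch below times the same task repeatedly and summarizes the spread. RepeatedRuns and its methods are hypothetical names, not something from the book:

```java
// Hypothetical sketch: repeat a measurement and summarize how much it varies.
final class RepeatedRuns {
    // Times task once per run and returns the elapsed milliseconds of each run.
    static double[] timeRuns(Runnable task, int runs) {
        double[] millis = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            millis[i] = (System.nanoTime() - start) / 1_000_000.0;
        }
        return millis;
    }

    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Sample standard deviation (divides by n - 1); needs at least two runs.
    static double stdDev(double[] xs) {
        double m = mean(xs);
        double sumSq = 0;
        for (double x : xs) {
            sumSq += (x - m) * (x - m);
        }
        return Math.sqrt(sumSq / (xs.length - 1));
    }
}
```

A large standard deviation relative to the mean is a hint that a single run, or a small difference between two runs, should not be trusted on its own.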

4) Test Early, Test Often

Fourth and finally, performance geeks like to recommend that performance testing be an integral part of the development cycle.

The typical development cycle does not make things any easier. A project schedule often establishes a feature-freeze date: all feature changes to code must be checked into the repository at some early point in the release cycle, and the remainder of the cycle is devoted to shaking out any bugs (including performance issues) in the new release. This causes two problems for early testing:

  • Developers are under time constraints to get code checked in to meet the schedule; they will balk at having to spend time fixing a performance issue when the schedule has time for that after all the initial code is checked in. The developer who checks in code causing a 1% regression early in the cycle will face pressure to fix that issue; the developer who waits until the evening of the feature freeze can check in code that causes a 20% regression and deal with it later.
  • Performance characteristics of code will change as the code changes. This is the same principle that argued for testing the full application (in addition to any module-level tests that may occur): heap usage will change, code compilation will change, and so on.

Despite these challenges, frequent testing during development is still worthwhile. A developer who introduces code causing a 5% regression may have a plan to address that regression as development proceeds: maybe her code depends on some as-yet-to-be integrated feature, and when that feature is available, a small tweak will allow the regression to go away. That’s a reasonable position, even though it means that performance tests will have to live with that 5% regression for a few weeks (and the unfortunate but unavoidable issue that said regression is masking other regressions).

Early, frequent testing is most useful if the following guidelines are followed:

A- Automate everything

  • All performance testing should be scripted
  • The scripts must be able to run the test multiple times
  • Perform t-test analysis on the results
  • Produce a report showing the confidence level that the results are the same
  • The automation must make sure that the machine is in a known state before tests are run
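
The t-test step could start from something like the sketch below. It computes Welch's variant of the t statistic (which does not assume equal variances) for two sets of timings; that choice and the WelchT name are assumptions made for this example. Turning the statistic into a confidence level additionally needs the t-distribution, which would come from a statistics library or a t table.

```java
// Hypothetical sketch: Welch's t statistic for baseline vs. specimen timings.
final class WelchT {
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Unbiased sample variance (divides by n - 1).
    static double sampleVariance(double[] xs) {
        double m = mean(xs);
        double sumSq = 0;
        for (double x : xs) {
            sumSq += (x - m) * (x - m);
        }
        return sumSq / (xs.length - 1);
    }

    // t = (mean(a) - mean(b)) / sqrt(var(a)/n_a + var(b)/n_b)
    static double tStatistic(double[] a, double[] b) {
        double standardError = Math.sqrt(
                sampleVariance(a) / a.length + sampleVariance(b) / b.length);
        return (mean(a) - mean(b)) / standardError;
    }
}
```

The further the statistic is from zero, the less likely it is that the two sets of runs have the same underlying performance.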

B- Measure everything

  • The automation must gather every conceivable piece of data that will be useful for later analysis
  • System information sampled throughout the run: CPU usage, disk usage, network usage, memory usage, and so on.
  • The monitoring information must also include data from other parts of the system, if applicable: for example, if the program uses a database, then include the system statistics from the database machine as well as any diagnostic output from the database
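
As a very small starting point, a test harness can record a few numbers that the JVM itself exposes. The SystemSample class below is a hypothetical sketch; it only covers CPU count, load average, and heap usage, and disk or network figures would have to come from operating system tools instead.

```java
import java.lang.management.ManagementFactory;

// Hypothetical sketch: one snapshot of JVM-visible system data.
final class SystemSample {
    final long timestampMillis;
    final int processors;
    final double loadAverage;   // negative when the platform does not report it
    final long usedHeapBytes;

    private SystemSample(long timestampMillis, int processors,
                         double loadAverage, long usedHeapBytes) {
        this.timestampMillis = timestampMillis;
        this.processors = processors;
        this.loadAverage = loadAverage;
        this.usedHeapBytes = usedHeapBytes;
    }

    // Takes one sample; call it periodically during the run and log the result.
    static SystemSample take() {
        Runtime rt = Runtime.getRuntime();
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        return new SystemSample(System.currentTimeMillis(),
                rt.availableProcessors(), load,
                rt.totalMemory() - rt.freeMemory());
    }

    @Override
    public String toString() {
        return timestampMillis + " cpus=" + processors
                + " load=" + loadAverage + " usedHeap=" + usedHeapBytes;
    }
}
```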

C- Run on the target system

A test that is run on a single-core laptop will behave very differently than a test run on a machine with a 256-thread SPARC CPU. That should be clear in terms of threading effects: the larger machine is going to run more threads at the same time, reducing contention among application threads for access to the CPU. At the same time, the large system will show synchronization bottlenecks that would be unnoticed on the small laptop.
Hence, the performance of a particular production environment can never be fully known without testing the expected load on the expected hardware. Approximations and extrapolations can be made from running smaller tests on smaller hardware, and in the real world, duplicating a production environment for testing can be quite difficult or expensive.

🏃 See you in chapter 3 ...

🐒take a tip

Play a sport to work your brain effectively. Exercise your body. 🏊

