Frank Rosner

Microbenchmarking your Scala Code

Motivation

[Image: loading spinner]

I am sure you recognize this loading spinner icon. I do not know anyone who likes to wait for the computer. However, when writing software I usually favour readability, maintainability, and extensibility over speed. I agree with Donald Knuth that premature optimization usually causes more problems than it solves.

Nevertheless at some point you are going to write code where performance matters, or at least bad performance hurts. In this situation it might be useful to look at the performance characteristics of your code. I personally like to combine two approaches:

  • Complexity analysis
  • Runtime benchmarks

In this blog post I want to focus on runtime benchmarks only, specifically microbenchmarking. The next section is going to set a few theoretical foundations. Afterwards we are going to look at ScalaMeter, a tool for automated performance testing in Scala. The last section contains a few examples, comparing the runtime performance of different implementations. We finish the blog post by summarizing the main points. The examples will be written in Scala and related to the Scala programming language but should be understandable for anyone with a bit of functional programming knowledge.

If you want to learn more about how to analyze the complexity of your algorithms, I can recommend the amazing book "Introduction to Algorithms" [1].

Microbenchmarking

In computing, a benchmark is the act of running a computer program, a set of programs, or other operations, in order to assess the relative performance of an object, normally by running a number of standard tests and trials against it. [2]

In benchmarking there are different granularity levels, similar to system tests, integration tests, and unit tests in functional testing. Microbenchmarking typically refers to isolated benchmarks of individual methods, e.g. API calls.

Similar to unit tests, having automated microbenchmarks cannot give you any guarantees. The results heavily depend on the selected input and interaction effects between different components of your architecture are not taken into consideration. Nevertheless they are a useful tool to compare the relative performance of different implementations. They can also be used for regression testing.

ScalaMeter

"ScalaMeter is a microbenchmarking and performance regression testing framework for the JVM platform that allows expressing performance tests in a way which is both simple and concise." [3]

A simple benchmark looks like this: [4]

import org.scalameter.api._

object RangeBenchmark extends Bench.LocalTime {
  val sizes = Gen.range("size")(300000, 1500000, 300000)

  val ranges = for {
    size <- sizes
  } yield 0 until size

  performance of "Range" in {
    measure method "map" in {
      using(ranges) in {
        r => r.map(_ + 1)
      }
    }
  }
}

It generates integer ranges from 0 to 300,000, 600,000, 900,000, 1,200,000, and 1,500,000, respectively. It then measures the run time of the map operation on these ranges and generates the following output:

Parameters(size -> 300000):  1.653809 ms
Parameters(size -> 600000):  3.282649 ms
Parameters(size -> 900000):  4.939347 ms
Parameters(size -> 1200000): 6.492767 ms
Parameters(size -> 1500000): 8.148826 ms

ScalaMeter provides a highly configurable testing framework with default configuration for different standard use cases from quick console reporting all the way to sophisticated regression testing with HTML reporting. I find the following features especially useful:

  • Concise and readable DSL for data generation and test specification
  • Configurable execution (e.g. separate JVM, warm-up runs, measured runs)
  • Configurable measurements and aggregations (ignoring GC, outlier elimination, mean, median, ...)
  • Configurable reporting (text, HTML, logging, charts, DSV, ...)
  • Configurable persistence (Java or JSON serialization)
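If you want to try the examples yourself, ScalaMeter can be wired into an sbt build. The following sketch is based on the ScalaMeter getting started guide; double-check the version and coordinates against the ScalaMeter homepage before using it:

```scala
// build.sbt (sketch): add ScalaMeter and register its test framework
libraryDependencies += "com.storm-enroute" %% "scalameter" % "0.9"
testFrameworks += new TestFramework("org.scalameter.ScalaMeterFramework")
// benchmarks distort each other's timings when run in parallel
parallelExecution in Test := false
```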

In the next section we are going to look at some experiments where I used ScalaMeter to perform the measurements.

Example Experiments

In this section we are going to look at three experiments:

  1. How do chained map operations perform compared to a single combined map operation?
  2. How do different collections perform when being sorted? How does the Scala sort implementation perform compared to the native Java one?
  3. When building up a collection, how does the performance differ when using a builder vs. concatenating?

All experiments are performed using ScalaMeter 0.9 and Scala 2.12.4. My computer has a 2016 3.3 GHz Intel Core i7 with 16 GB of RAM. I am using the Bench.OfflineReport, which executes the code in a separate JVM and applies an appropriate number of warm-up runs.

Chained Map Operations

Motivation

When working with collections in Scala, the map operation is quite common. xs.map(f) applies the function f to every element x in xs and returns the result. If you have two composable functions f and g and you want to apply both, you express that either as

  • xs.map(g compose f), or
  • xs.map(f).map(g)

In terms of the result, both operations are equivalent. The memory footprint and runtime however might differ, depending on the implementation of xs and map. If you are using a strictly evaluated collection, on every map call the result will be computed. If the collection is immutable, a new collection will be created with the resulting values.
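The equivalence of the two expressions can be sketched in a few lines (the functions f and g here are made up for illustration):

```scala
object MapFusion extends App {
  val f: Int => Int = _ + 1
  val g: Int => Int = _ * 2

  val xs = List(1, 2, 3)

  // both expressions compute the same result ...
  val chained = xs.map(f).map(g)    // ... but this creates an intermediate list
  val fused   = xs.map(g compose f) // ... while this makes a single pass

  assert(chained == fused) // List(4, 6, 8)
}
```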

In this experiment we want to look at the relative runtime performance of both expressions comparing a List (strict) and a SeqView (lazy).

Variables

val strictList = List.iterate(0, 1000000)(_ + 1)
val lazyList = strictList.view
val f: Int => Int = _ + 1
val fs = List.fill(10)(f)
val fsAndThen = fs.reduce(_ andThen _)

Experiments

Taking each of strictList and lazyList as l, we perform the following two experiments. Note that we omit the force command here for the sake of simplicity; in the actual experiment it is needed to trigger the computation of the view.

  • l.map(fsAndThen), which applies f 10 times in a single map operation
  • fs.foldLeft(l)((l, f) => l.map(f)), which applies f one time in each of the 10 map operations

Results

[Chart: run time of chained vs. single map operations on List and SeqView, with legend]

Looking at these results I find three notable observations:

  1. On the strict list, the chained map operations took more than three times longer on average than using the single map operation.
  2. This effect is not present when using the list view.
  3. The performance of the chained map on the list view is comparable to the strict list single map results.

Given these results, we can draw the following conclusions. Using chained map operations on strictly evaluated, immutable collections can have a significant performance impact. If performance matters, you should aim to combine your map operations. If you cannot combine the map operations yourself (maybe you are just providing a library, like Apache Spark), using a lazily evaluated collection can help reduce the run time significantly.

Sorting Data Structures

Motivation

Sorting a collection is required in many applications, be it showing a list of events ordered by their time of occurrence or preparing a table for being joined with another one using a merge-join algorithm. 20 years ago, developers had to be able to write efficient sorting algorithms themselves, as standard libraries were not as rich and computers not as fast.

Nowadays you will find fast-enough implementations of sorting algorithms in almost any standard library. If you are not dealing with strict performance requirements, this is also fine, as using available standard functions can make the code less buggy and more readable.

Scala offers a method to sort immutable collections called sorted, which is available for all standard sequence-like collections. In this experiment we want to compare the relative performance of sorted on different Scala data structures, and also compare it to the performance of the java.util.Arrays.sort method.
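Before measuring, it is worth convincing ourselves that the two approaches produce the same ordering. The following sketch mirrors the setup of the experiment: the Java sort works in place, so we copy the array first to keep the comparison fair to the immutable Scala variant:

```scala
import java.util.Arrays
import scala.util.Random

object SortComparison extends App {
  val xs = List.fill(1000)(Random.nextInt)

  // Scala: returns a new, sorted immutable list
  val scalaSorted = xs.sorted

  // Java: sorts an array in place, so copy first to keep the original intact
  val arr = xs.toArray
  val copy = new Array[Int](arr.length)
  Array.copy(arr, 0, copy, 0, arr.length)
  Arrays.sort(copy)

  // same ordering, very different machinery underneath
  assert(scalaSorted == copy.toList)
}
```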

Variables

val size = Gen.enumeration("size")(List.iterate(1, 7)(_ * 10): _*)
val list = for { s <- size } yield List.fill(s)(Random.nextInt)
val array = for { l <- list } yield l.toArray
val vector = for { l <- list } yield l.toVector

Experiments

Given the list, array, and vector as l, filled with random integers, we sort them using the Scala sorted method. For the array, we also apply the Java sort, which works in-place. In order to make the results comparable to Scala, which gives you a new collection back instead of modifying the existing one, we also copy the array first in another experiment.

  • l.sorted
  • util.Arrays.sort(l)
  • val newArray = new Array[Int](l.length)
    Array.copy(l, 0, newArray, 0, l.length)
    util.Arrays.sort(newArray)

Results

[Chart: sort run time for different collection sizes and types, with legend]

Looking at the performance for different collection sizes, there are no surprises (a log scale would have been better, I know...). The difference between the Scala and the Java array sort, however, surprised me.

[Chart: Scala sorted vs. Java Arrays.sort run time, with legend]

The Scala sorts are so much slower. Looking at the implementation of sorted, we can spot two details which might explain the difference in run time.

def sorted[B >: A](implicit ord: Ordering[B]): Repr = {
   val len = this.length
   val b = newBuilder
   if (len == 1) b ++= this
   else if (len > 1) {
     b.sizeHint(len)
     val arr = new Array[AnyRef](len)
     var i = 0
     for (x <- this) {
       arr(i) = x.asInstanceOf[AnyRef]
       i += 1
     }
     java.util.Arrays.sort(arr, ord.asInstanceOf[Ordering[Object]])
     i = 0
     while (i < arr.length) {
       b += arr(i).asInstanceOf[A]
       i += 1
     }
   }
   b.result()
 }

You can see that it also uses Arrays.sort internally. But why is it so much slower? I think the reason is that it not only copies the data to a new array but also has to copy it back into a collection of the original type. When sorting an immutable list you expect to get another immutable list back. This is done using a respective builder b (e.g. a ListBuffer):

while (i < arr.length) {
  b += arr(i).asInstanceOf[A]
  i += 1
}

But the creation of the initial array could also be a reason for it being slower. To make the copy, I was using Array.copy, which uses the native method java.lang.System.arraycopy under the hood. In the Scala method, the array is filled in a loop:

for (x <- this) {
  arr(i) = x.asInstanceOf[AnyRef]
  i += 1
}

Looking at those results, I think it is obvious that when it comes to sorting, you should always check the performance of your implementation if it matters to you. Using a sort algorithm that immediately outputs a new immutable collection instead of relying on an intermediate array would be the better choice. Nevertheless most of the time the benefits you gain from using standard methods outweigh the performance gain of custom solutions.

Concatenation vs. Builder

Motivation

One of the most common beginner mistakes when working with Java is building up strings using concatenation. In Java, strings are immutable. As they are so commonly used, the JVM maintains a string pool (see flyweight pattern).

This has an implication when concatenating strings. The following code will create three string objects:

var s1 = "Hello "
s1 = s1 + "World"

While this is not a problem in most cases, it can become one when used inside loops. When building up immutable data structures you should thus always rely on the respective builders, which are mutable, and then create the final, immutable structure when you are done. This is fine as long as the builder cannot be accessed outside the local scope.
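The builder pattern described above can be sketched as follows: the mutable builder stays confined to a local scope, and only the finished immutable value escapes (the helper joinDigits is made up for illustration):

```scala
object BuilderScope extends App {
  // build an immutable String without creating a new String object per step
  def joinDigits(n: Int): String = {
    val b = new StringBuilder        // mutable, but never leaves this method
    (0 until n).foreach(i => b ++= i.toString)
    b.result()                       // only the immutable result escapes
  }

  assert(joinDigits(5) == "01234")
}
```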

In this experiment we are going to compare the performance of building up immutable data structures step by step using concatenation on the immutable object vs. using a builder to construct the final structure.

Variables

val size = Gen.single("size")(10000)
val numbers = for { s <- size } yield 0 to s
val strings = for { range <- numbers } yield range.map(_.toString)

Experiments

Given the range of numbers and strings as xs, we compare the performance of constructing a new string, set, and list out of them step by step. First we use plain concatenation (+, and ::), then we use the appropriate builder.

String

  • xs.foldLeft("")((l, x) => l + x)
  • val b = StringBuilder.newBuilder
    xs.foreach(x => b ++= x)
    b.result()

Set

  • xs.foldLeft(Set.empty[Int])((l, x) => l + x)
  • val b = Set.newBuilder[Int]
    xs.foreach(x => b += x)
    b.result()

List

  • xs.foldLeft(List.empty[Int])((l, x) => x :: l)
  • val b = List.newBuilder[Int]
    xs.foreach(x => b += x)
    b.result()

Results

All three data structures are immutable. But what is the performance implication of using concatenation? Let's look at strings first.

[Chart: string concatenation vs. builder run time, with legend]

The graph shows that the overhead of creating additional string objects for each iteration is immense. How does it behave for sets?

[Chart: set concatenation vs. builder run time, with legend]

For sets it is actually even worse. This is probably because a whole new set has to be created for every iteration. Building up a hash set is no trivial operation, which makes the performance impact even higher than for strings. Does the same hold for lists?

[Chart: list concatenation vs. builder run time, with legend]

With lists the picture is different. The reason is that it takes only constant time to prepend an item to the front of the list. If we appended to the end instead, the run time might be longer, depending on whether the implementation also stores a pointer to the end of the list.
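One caveat worth noting about the prepend-based fold: it builds the list in reverse, whereas the builder preserves the input order. A minimal sketch:

```scala
object PrependOrder extends App {
  val xs = 1 to 5

  // folding with :: prepends, so the result comes out reversed
  val folded = xs.foldLeft(List.empty[Int])((l, x) => x :: l)
  assert(folded == List(5, 4, 3, 2, 1))

  // the builder appends internally and keeps the original order
  val b = List.newBuilder[Int]
  xs.foreach(b += _)
  assert(b.result() == List(1, 2, 3, 4, 5))
}
```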

If you don't know what collection implementation you are using, you can safely rely on the newBuilder function just to make sure.

Summary

In this blog post we looked at ScalaMeter, a tool for automated microbenchmarking and regression testing in Scala. We've seen three examples of how small details in the code can significantly change your runtime performance.

While it is better to stick to standard methods and libraries for the sake of maintainability and readability, custom solutions might be required if performance matters to you. Either way, by measuring the performance in an automated way you can gain more confidence about your code and also catch performance regressions automatically.

Did you write any performance tests in your projects? Did you know that the JVM needs to warm up before you can reliably measure runtime performance? Was there any performance regression introduced into one of your projects that you would've caught with some automated performance tests? What is your favourite benchmarking tool for the JVM? I'm curious to know your thoughts!

References

  • [1] Thomas H. Cormen, Clifford Stein, Ronald L. Rivest, and Charles E. Leiserson. 2001. Introduction to Algorithms (2nd ed.). McGraw-Hill Higher Education.
  • [2] Fleming, Philip J.; Wallace, John J. 1986. How not to lie with statistics: the correct way to summarize benchmark results. Communications of the ACM. 29 (3): 218–221.
  • [3] ScalaMeter Homepage
  • [4] ScalaMeter Getting Started Guide
  • Cover image taken by Jérôme S

Top comments (11)

rhymes

Loved it and I know zero about Scala.

One thing I wanted to ask: do you think Scala's sorted implementation should be improved?

"Unpacking" a Scala list to put it in a Java array to sort it and then re-pack it again as a Scala list seems a very inefficient algorithm.

Is it possible to improve on that? Keep in mind that I don't know how Scala works and I'm sure there's a very good reason for that being the implementation they chose.

Frank Rosner

I agree with you that it is indeed not very efficient. It basically requires you to copy the data two times. If you want to use the already highly optimized java.util.Arrays.sort, however, it is the only way to do it when you don't work with an array already.

What you could try is to use a sorting algorithm which works well for the data structure you are using, like merge sort for immutable linked lists. If you are using quick sort it probably performs very poorly, as the random access for pivoting will be linear in complexity.

It would actually be interesting to see how a specialized sorting algorithm that requires you to copy the data only once would perform.

About your question whether you can change the standard library: You can try. I don't know if the development team will like it :D But I have to admit that I also thought about it. IMO the reason for choosing this implementation is that it is good enough for most of the cases and it is general enough to work for all sequence-like data structures. Also it was easy to write and verify as it is basically just wrapping around the existing and optimized Java sort.

I'll post an update if I manage to write a merge sort and compare it.

Frank Rosner

I wrote a simple merge sort which is not tail recursive and compared the Scala sorted and mine on linked lists. Our intuition seems to be correct, you can be more efficient if you put some effort to adjust the algorithm.

  def merge(xs: List[Int], ys: List[Int]): List[Int] =
    (xs, ys) match {
      case (Nil, l) => l
      case (l, Nil) => l
      case (xh :: xt, yh :: yt) =>
        if (xh <= yh)
          xh :: merge(xt, ys)
        else
          yh :: merge(xs, yt)
    }

  def sort(xs: List[Int]): List[Int] =
    xs match {
      case Nil         => Nil
      case head :: Nil => head :: Nil
      case xs =>
        val (left, right) = xs.splitAt(xs.length / 2)
        merge(sort(left), sort(right))
    }

  performance of "List" in {
    measure method "sorted" in {
      using(list) in { l =>
        l.sorted
      }
    }
    measure method "merge-sort" in {
      using(list) in { l =>
        sort(l)
      }
    }
  }

[Chart: sorted vs. merge-sort run time on lists, with legend]

rhymes

I understand your point but, at least from the perspective of the user of a programming language, the expectation is (going to use Python's syntax here) that "mutablelist.sort()", "immutablelist.sort()", and so on are going to choose a different sorting algorithm for each type of caller.

I wasn't proposing to change Scala's implementation, just trying to understand why there's only one if there are so many containers with wildly different behaviours.

If you're curious, as far as i know this is still Python's sorting algorithm: timsort

Frank Rosner

the expectation is (going to use Python's syntax here) that "mutablelist.sort()" or "immutablelist.sort()" or and so on are going to choose a different sorting algorithm, for each type of the caller.

I'm not sure about whether one should have this expectation. The problem is that there are so many different possible implementations of a value sequence and so many different sorting algorithms. How would you possibly know what is the best sorting algorithm when designing the library? It might not only depend on the data structure but also the data itself. The strategy to have a general-purpose algorithm in your library that works everywhere is not a bad one IMHO. It will be good enough for all the cases where performance does not matter. If performance matters you anyway have to put some thought into picking the right data structure and right sorting algorithm.

just trying to understand why there's only one if there are so many containers with wildly different behaviours.

I don't know but I could imagine that it's good enough like this. Maybe the development focus of the language was just different and nobody complained / made an effort to change it.

This is also indicated by the documentation of scala.util.Sorting: "Note also that high-performance non-default sorts for numeric types are not provided. If this is required, it is advisable to investigate other libraries that cover this use case."

So it seems that if you need fast sorting, stick to arrays. If this is not fast enough, try a different library.

If you're curious, as far as i know this is still Python's sorting algorithm: timsort

Java > 6 also uses timsort for sorting arrays of objects (still quick sort for primitives afaik). But does Python have different data structures or only arrays? What happens if you try to call timsort on a linked list in Python? (sorry if my question is stupid but I didn't use Python a lot).

Thanks a lot for the discussion :) It's a lot of fun to dig deeper!

rhymes

Python's main container structures are immutable arrays (tuples) and mutable arrays (lists).

I don't know all the details of the implementation but they are both C arrays.

Lists: How are lists implemented? and this

Tuples: stackoverflow.com/a/14135893/4186181

Thank you for the discussion! :-)

Frank Rosner

I see. I learned a lot during the writing and discussion process :)

Liana Felt (she/her)

Awesome!!

Frank Rosner

Thank you!

Ben Halpern

Wow this is incredibly thorough. Amazing job.

Frank Rosner

Thank you so much! I'm always spending way more time than I initially wanted but glad to see that it's worth it :)