Discussion on: Microbenchmarking your Scala Code

View post

Replies for: Loved it and I know zero about Scala. One thing I wanted to ask: do you think Scala's sorted implementation should be improved? "Unpacking" an Sc...

I agree with you that it is indeed not very efficient. It basically requires you to copy the data two times. If you want to use the already highly optimized java.util.Arrays.sort, however, it is the only way to do it when you don't work with an array already.

What you could try is to use a sorting algorithm which works well for the data structure you are using, like merge sort for immutable linked lists. If you are using quick sort it probably performs very poorly, as the random access for pivoting will be linear in complexity.

It would actually be interesting to see how a specialized sorting algorithm that requires you to copy the data only once would perform.

About your question whether you can change the standard library: You can try. I don't know if the development team will like it :D But I have to admit that I also thought about it. IMO the reason for choosing this implementation is that it is good enough for most of the cases and it is general enough to work for all sequence-like data structures. Also it was easy to write and verify as it is basically just wrapping around the existing and optimized Java sort.

I'll post an update if I manage to write a merge sort and compare it.

rhymes • Mar 22 '18

I understand your point but, at least from the perspective of the user of a programming language, the expectation is (going to use Python's syntax here) that "mutablelist.sort()" or "immutablelist.sort()" or and so on are going to choose a different sorting algorithm, for each type of the caller.

I wasn't proposing to change Scala's implementation, just trying to understand why there's only one if there are so many containers with wildly different behaviours.

If you're curious, as far as i know this is still Python's sorting algorithm: timsort

Frank Rosner • Mar 22 '18

the expectation is (going to use Python's syntax here) that "mutablelist.sort()" or "immutablelist.sort()" or and so on are going to choose a different sorting algorithm, for each type of the caller.

I'm not sure about whether one should have this expectation. The problem is that there are so many different possible implementations of a value sequence and so many different sorting algorithms. How would you possibly know what is the best sorting algorithm when designing the library? It might not only depend on the data structure but also the data itself. The strategy to have a general-purpose algorithm in your library that works everywhere is not a bad one IMHO. It will be good enough for all the cases where performance does not matter. If performance matters you anyway have to put some thought into picking the right data structure and right sorting algorithm.

just trying to understand why there's only one if there are so many containers with wildly different behaviours.

I don't know but I could imagine that it's good enough like this. Maybe the development focus of the language was just different and nobody complained / made an effort to change it.

This is also indicated by the documentation of scala.util.Sorting: "Note also that high-performance non-default sorts for numeric types are not provided. If this is required, it is advisable to investigate other libraries that cover this use case."

So it seems that if you need fast sorting, stick to arrays. If this is not fast enough, try a different library.

If you're curious, as far as i know this is still Python's sorting algorithm: timsort

Java > 6 also uses timsort for sorting arrays of objects (still quick sort for primitives afaik). But does Python have different data structures or only arrays? What happens if you try to call timsort on a linked list in Python? (sorry if my question is stupid but I didn't use Python a lot).

Thanks a lot for the discussion :) It's a lot of fun to dig deeper!

rhymes • Mar 22 '18 • Edited

Python's main container structures are immutable arrays (tuples) and mutable arrays (lists).

I don't know all the details of the implementation but they are both C arrays.

Lists: How are lists implemented? and this

Tuples: stackoverflow.com/a/14135893/4186181

Thank you for the discussion! :-)

Frank Rosner • Mar 22 '18

I see. I learned a lot during the writing and discussion process :)

Frank Rosner • Mar 22 '18 • Edited

I wrote a simple merge sort which is not tail recursive and compared the Scala sorted and mine on linked lists. Our intuition seems to be correct, you can be more efficient if you put some effort to adjust the algorithm.

  def merge(xs: List[Int], ys: List[Int]): List[Int] =
    (xs, ys) match {
      case (Nil, l) => l
      case (l, Nil) => l
      case (xh :: xt, yh :: yt) =>
        if (xh <= yh)
          xh :: merge(xt, ys)
        else
          yh :: merge(xs, yt)
    }

  def sort(xs: List[Int]): List[Int] =
    xs match {
      case Nil         => Nil
      case head :: Nil => head :: Nil
      case xs =>
        val (left, right) = xs.splitAt(xs.length / 2)
        merge(sort(left), sort(right))
    }

  performance of "List" in {
    measure method "sorted" in {
      using(list) in { l =>
        l.sorted
      }
    }
    measure method "merge-sort" in {
      using(list) in { l =>
        sort(l)
      }
    }
  }