How JavaScript sorts?TimSort algorithm

#javascript #algorithms #python #sorting

The main reason that I decided to write this article is to explore what lies under the hood of the Array.prototype.sort() function. Firefox uses merge sort, while Chrome and Safari use Timsort. And there's a compelling reason to choose the latter.

Who developed it

Timsort was designed in 2002 by Tim Peters, primarily for use in the Python programming language. The algorithm was so effective that it was later implemented in other well-known programming languages such as Java. Timsort includes numerous implementation optimizations, a few heuristics, and some refined tuning. Its high-level principle is straightforward: The array to be sorted is first decomposed greedily into monotonic runs (sorted subarrays), which are then merged pairwise according to specific rules.

Comparisons with other sorting algorithms

Timsort outperforms both quicksort and merge sort in various scenarios.

Basic Overview

Timsort takes advantage of the fact that real-world data is often partially sorted. The algorithm iterates through the array once to identify already sorted parts, called "runs." These runs should be at least minRun in length, which we'll explain later.

The algorithm examines each element consecutively, determining if it is part of an ascending or descending run. When it encounters an element not part of the sequence, it registers the run and starts a new one. If a run's length is less than minRun, the algorithm extends the run by adding the next element and sorting the array using Insertion Sort.

After this loop, we have an array consisting of several runs, each with run.length >= minRun. These sorted runs are then intelligently merged.

Timsort is a stable sorting algorithm (order of elements with same value is kept) and strives to perform balanced merges (a merge that merges runs of similar sizes).

Insertion Sort

Insertion sort is a simple method of sorting a list by gradually moving each item to its correct position, just like arranging playing cards in your hand.

Here is insertion sort in action:

You can look at how it works step by step here: https://visualgo.net/en/sorting

Insertion Sort vs Merge Sort

Insertion Sort excels at sorting partially sorted and small arrays. Merge Sort, on the other hand, aims to call the merge() function as few times as possible, preferring to merge a small number of large sorted runs. This is where the minRun property comes in handy. The optimal run length balances the performance of Insertion Sort and the number of merges required. The sweet spot typically lies between 32 and 64.

MinRun Selection

The optimal minRun size varies for each array. Let's examine a scenario proposed by Tim Peters. Suppose we have 2112 elements in the array and a minRun of 32. If the data is randomly ordered, we'll likely end up with 66 runs, each 32 elements long. The first 64 runs merge perfectly into a single run of length 2048. Now, we have two runs with lengths 2048 and 64. This requires considerable data movement to put in place those 64 elements.

The solution is to choose a larger minRun — in this case, 33 — which is more likely to result in 64 runs, each with a length of 33. Consequently, all merges are perfectly balanced.

Runs Stack

Timsort also employs a clever merging pattern. For random data, it's nearly impossible to find a sequence with more than 32 elements. In this case, all our runs will likely have the same length, and merge() performs best when merging runs of equal size. However, merging them consecutively can lead to problems.

Imagine we have 1000 runs, each 32 elements long, and we've merged all but two of them. We'd have run A with a length of 999 * 32 and run B with a length of 32. This is not ideal. To avoid this issue, we examine the last three runs on the stack (let's call them X, Y, and Z, where Z is the length of the most recent) and check for these invariants:

Invariant 1: Z > Y + X
Invariant 2: Y > X

If both invariants hold true, the algorithm continues to add runs until one of them fails. When this occurs, it merges run Y with the lesser of Z and X. Merging Z and X directly could compromise stability, which ensures that elements with the same value maintain their relative positions after sorting.

This approach essentially ensures that the largest run is merged last.

Consider a stack of runs with lengths [128, 64, 32, 16, 8, 4, 2, 2]. Timsort will merge the two last runs, producing a new one with a length of 4, and keep going, making perfect merges, much like merging the squares in the 2048 game.

Galloping

Merging runs of different lengths is often necessary, particularly when data is partially sorted. To improve the speed of the merge, Tim Peters introduced the galloping technique.

Typically, the merge() function has a complexity of O(m+n), where m and n represent the lengths of the two runs. It must loop through one of the runs completely.

Consider the following case: we need to merge two arrays in ascending order, each containing 10,000 elements:

arr1 = [1, 2, 3...10,000] arr2 = [10,001, 10,002, 10,003...20,000]

Should we compare every value from arr1 to arr2[0] before merging the runs? Instead, we can detect that one of the arrays is consistently "winning" and trigger galloping. That way we move the whole winning portion of the run.

How can we do that? Rather than incrementing by 1 through the winning array, we double our increment until the winning streak breaks. When it does, we use binary search to find the new starting point.

In our case, we determine that arr1 is winning after the first seven comparisons. It's proven that a winning streak of seven comparisons is the best trigger for galloping. We move our pointer by 1, then 2, 4, 8, 16, etc. In our example, the winning streak will never end, so we will make log(n) comparisons.

All red elements are smaller than blue (here, 21). Thus they can be moved in a chunk to the final array

However, if the winning streak ends, we find the largest value in arr1 that is still smaller than the value under the arr2 pointer, push all elements before it into the resulting array, and start again from the pointer in arr2.

Summary

Timsort is ingenious algorithm. It combines multiple techniques from different sorts and improves them. It has more tricks up to his sleeve than I can cover in one article.
You can find the JS implementation here: https://github.com/lxsmnsyc/TimSort/blob/master/timsort.js

And finally, Timsort in action: