Tim sort, the fastest sort used in v8 and Python

Have you ever been curious about what algorithm is behind Array.sort()? After all, we've learnt so many sorting algorithms, each with its own advantages and disadvantages.

Chrome's engine v8 turns out to use a non-classic sorting algorithm - Tim Sort. It was created in 2002 by the major Python contributor Tim Peters to improve the performance of Python's list sort. The combination of algorithms it uses was decided by analysing real-world data. Compared to Quick Sort, which in the worst case takes O(n²) time, its worst case is just O(n log n), and its best case can achieve O(n). Here is a table of the famous sorting algorithms, with their complexities and stabilities:

comparing sorting algorithms

To keep stability and achieve the best performance, Tim decided to use Insertion Sort when the list is small.

Insertion sort is a classic sorting algorithm that basically does a forward comparison one by one to find the first item smaller than or equal to the current one, and moves the current item to the place right after it (assuming we are sorting in ascending order). Here is a straightforward GIF:

Insertion sort
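In code, a minimal sketch of the idea could look like this (a plain JavaScript illustration for this post, not any engine's actual code):

function insertionSort(list) {
  for (let i = 1; i < list.length; i++) {
    const current = list[i];
    let j = i - 1;
    // shift strictly larger items one place to the right;
    // using > (not >=) keeps equal items in order, so it stays stable
    while (j >= 0 && list[j] > current) {
      list[j + 1] = list[j];
      j--;
    }
    // place current right after the first item <= it
    list[j + 1] = current;
  }
  return list;
}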

To reduce the operations, it's more common to use the "upgraded version" of Insertion Sort - Binary Sort, which uses Binary Search to speed up the process of looking for the right place for each item.

Binary search
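A minimal sketch of Binary Sort, again just an illustration assuming ascending order (the optional start parameter is my own addition, anticipating the run-padding step described later):

function binaryInsertionSort(list, start = 1) {
  for (let i = start; i < list.length; i++) {
    const current = list[i];
    // binary search for the leftmost position where current fits;
    // <= makes it land after equal items, keeping the sort stable
    let lo = 0, hi = i;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (list[mid] <= current) lo = mid + 1;
      else hi = mid;
    }
    // shift everything after the found position one step right
    for (let j = i; j > lo; j--) list[j] = list[j - 1];
    list[lo] = current;
  }
  return list;
}

Binary search cuts the comparisons down to O(log n) per item, though the shifting still costs O(n) in the worst case - which is why this only pays off for small lists.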

When it comes to a long list, Tim sort uses the idea of Merge Sort - divide the long list into smaller "runs", conquer them one by one, then merge them back together.

Here's a straightforward GIF:

Merge sort

Dividing into runs

When using the classic Merge Sort, we often simply divide the list into runs of a fixed size - like 2. Tim experimented with and compared different run sizes, and found a slightly larger run size performs better - around 32~64 inclusive. You may ask: why 32 and 64? Where did these numbers come from? This was because the Python performance measuring tool Tim used could only test data sizes that are powers of 2.

In the meantime, he also found that in real-world data there are many parts already sorted in either ascending or descending order, which do not require sorting. Thus, he decided to divide runs at a "natural" size - the initial runs are not of a fixed size, but have a minimal size requirement - the "min run".

Here is how it divides:

  1. Set a descending flag by comparing the first 2 items; if there is only 1 item left, set it to false.
  2. Loop over the following items, checking whether they are still in ascending or "strict" descending order, until finding the item which breaks this order.
  3. Now we have a run in either ascending or "strict" descending order. If it is in "strict" descending order, reverse it. Only reversing the "strict" descending part keeps the sort stable.
  4. Then check whether this run length satisfies the "min run" constraint. If it is smaller than "min run", take the following items to pad it up to the min size, and do a binary sort starting from the padded items, to avoid sorting the already sorted items again.

You may notice there is one edge case that can happen with "natural runs" - the last run may not fulfil the min run requirement. This is expected and allowed.

Here is the pseudocode:

let nremaining = list.length;
const runs = [];
const minrun = merge_compute_minrun(nremaining);

function count_run(list, start) {
    let descending = false;
    let n = 1;

    if (start + n === list.length) return { size: n, descending };
    if (list[start + n] < list[start + n - 1]) {
        descending = true;
    }
    n++;
    while (
        start + n < list.length &&
        (descending
            ? list[start + n] < list[start + n - 1]   // keep "strict" descending
            : list[start + n] >= list[start + n - 1]) // keep ascending
    ) {
        n++;
    }
    return {
      size: n,
      descending
    };
}

do {
  const startIndex = list.length - nremaining;
  let { size, descending } = count_run(list, startIndex);
  if (descending)
    reverse(
      list,
      // start index
      startIndex,
      // length to reverse
      size
    );
  if (size < minrun) {
    const force = nremaining <= minrun ? nremaining : minrun;
    binarysort(
      list,
      // sort start index
      startIndex,
      // sort length
      force,
      // index to start binary sort from (skip the already sorted part)
      startIndex + size
    );
    size = force;
  }
  nremaining -= size;
  runs.push({ startIndex, size });
} while (nremaining);
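For example, with the code above, count_run([1, 2, 3, 0, 9], 0) returns { size: 3, descending: false }, while count_run([5, 3, 1, 2], 0) returns { size: 3, descending: true } - the strictly descending 5, 3, 1 will then be reversed in place before being padded up to the min run size.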

To balance the merge operations, it's better to have a power-of-2 number of runs. As a result, Tim sort divides the total length of the list by 2 until the number becomes lower than 64, then rounds it to get the min run:

function merge_compute_minrun(n)
{
    var r = 0;           /* becomes 1 if any 1 bits are shifted off */
    while (n >= 64) {
        r |= n & 1;
        n >>= 1;
    }
    return n + r;
}
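For example, merge_compute_minrun(2048) returns 32 (2048 / 32 = exactly 64 runs, a power of 2), while merge_compute_minrun(2049) returns 33, because the shifted-off bit bumps the result up so the final merges stay close to balanced.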

The balanced merge

The classic merge sort simply merges 2 adjacent runs. Since it compares items in both sorted arrays, finds the relatively smaller one, and then stores the merged list in memory, it takes quite some memory space. Tim found there could be some ways to deal with this kind of unnecessary waste.

So the first question we need to solve is: merging which 2 runs gives higher performance?

Assume there are 3 runs:

A: 1000 items

B: 2000 items

C: 1000 items

According to some blogs (I didn't find a very clear proof), merging runs of similar size like A and C is more balanced. But there is a problem - if unfortunately one item exists in both A and B (or B and C, or A, B and C), merging A and C first and then B will change the relative order of those equal items and break the stability.

As a result, Tim sort still only merges 2 adjacent runs, but not in a "fixed" order. Instead, it delays the unbalanced merges.

To achieve this, it compares the lengths of 3 adjacent runs. If they fulfil the following constraints:

  1. A > B + C
  2. B > C

then the invariant holds and merging is delayed. If it is violated, runs B and C are the relatively smaller runs, and merging the smaller runs first gives a higher chance of getting close to the length of the larger adjacent runs, so the merges can continue in a balanced way.

After this process, if there are still some runs left, it falls back to the normal merge order. A sketch of the invariant check is shown below.
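This sketch is loosely based on CPython's merge_collapse; merge_at is a hypothetical helper that merges runs[i] and runs[i + 1] into one run on the stack:

function merge_collapse(runs) {
  // runs is a stack of { startIndex, size }, most recent run on top
  while (runs.length > 1) {
    let n = runs.length - 2;
    // A > B + C violated: merge B with the smaller of A and C
    if (n > 0 && runs[n - 1].size <= runs[n].size + runs[n + 1].size) {
      if (runs[n - 1].size < runs[n + 1].size) n--;
      merge_at(runs, n);
    // B > C violated: merge B and C
    } else if (runs[n].size <= runs[n + 1].size) {
      merge_at(runs, n);
    } else {
      break; // both invariants hold, delay further merging
    }
  }
}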

Merging with gallop mode

When we try to merge large runs, since they are both sorted, there's a big chance of comparing many times in order to find the right place for an item. For example:

A: [0, 2, 4, 6, 8, 10, ..., 100]

B: [57, 77, 97, ...]

If the first item of B - 57 - starts comparing with A from its first item - 0, it takes 58/2 = 29 comparisons to find its place. And then the next one - 77 - again needs another (78 - 58)/2 = 10 comparisons.

Besides the meaningless comparisons, it also wastes memory space for storing that data. Thus, Tim sort uses galloping to speed up the process. Compared to Binary Search, the complexity is the same, but galloping favours data which is not far away from the search starting position.

It starts by choosing a place (not necessarily the first or last one) to start the search, then chooses the search direction based on whether the starting item is smaller or larger than the target data. If one probe fails, the next one tries a farther place, with the offset roughly doubling each time.

For example, if it starts to search from A[7], and A[7] is smaller than the target data, then it compares A[9], and next A[11], A[15], A[23], A[35]... until it finds one that is larger than or equal to the target data. Assume the larger one is A[35]; then it knows the target data must belong somewhere between A[23] and A[35].

Once it has this rough range, it switches to Binary Search to find the exact position.

Here is the pseudocode:

function gallop_left(
   key, // the data to merge
   a, // the run to merge into
   hint // index in a to start the search from
)
{
    let lastofs = 0; // will become the start point for binary search
    let ofs = 1;  // will become the end point for binary search
    const n = a.length;
    if (a[hint] < key) {
        // gallop from left to right
        const maxofs = n - hint;
        while (ofs < maxofs && a[hint + ofs] < key) {
          lastofs = ofs;
          ofs = (ofs << 1) + 1;
        }
        if (ofs > maxofs) ofs = maxofs;
        // translate back to offsets relative to the start of a
        lastofs += hint;
        ofs += hint;
    } else {
        // gallop from right to left
        const maxofs = hint + 1;
        while (ofs < maxofs && a[hint - ofs] >= key) {
            lastofs = ofs;
            ofs = (ofs << 1) + 1;
        }
        if (ofs > maxofs) ofs = maxofs;
        // translate back to offsets relative to the start of a
        let tmp = lastofs;
        lastofs = hint - ofs;
        ofs = hint - tmp;
    }

    // do binary search at the end, within a[lastofs] < key <= a[ofs]
    ++lastofs;
    while (lastofs < ofs) {
        let m = lastofs + ((ofs - lastofs) >> 1);
        if (a[m] < key) lastofs = m + 1;
        else ofs = m;
    }
    return ofs; // return the found index
}
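To make this concrete: with A = [0, 2, 4, ..., 100] from the earlier example, gallop_left(57, A, 0) probes A[1], A[3], A[7], A[15], then overshoots at A[31] = 62, and binary-searches between indexes 16 and 31 to return 29 - where 57 belongs, between A[28] = 56 and A[29] = 58. That is 9 comparisons instead of ~29 sequential ones.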

When to use gallop mode

There are likely to be big gaps between 2 runs when they just start to merge. For example, these 2 runs:

A: [0, 2, 4, 6, 8, 10, ..., 100]

B: [57, 77, 97, ...]

So here we can use gallop_left(b[0], a, 0) (which starts from a point on the left side) to quickly locate where b[0] should be in a, and gallop_right(a[a.length - 1], b, b.length - 1) (which starts from a point on the right side) to locate where a[a.length - 1] should be in b, shortening the size of data that truly needs a comparison and merge.

gallop_left and gallop_right do almost the same thing, just with slightly different edge handling: for equal items, gallop_left returns the leftmost valid position and gallop_right the rightmost, which keeps the merge stable.

Here is the pseudocode:

function merge(ms /* MergeState */, a, b) {
    // items of a before b[0]'s position are already in place
    let k = gallop_right(b[0], a, 0);
    if (k < 0) throw k;
    let merged = a.splice(0, k);
    if (a.length == 0)
      return merged.concat(b);
    else
      merged.push(b.shift());

    // items of b after a's last item's position are already in place
    let nb = gallop_left(a[a.length - 1], b, b.length - 1);
    if (nb <= 0) throw nb;
    const lastPart = b.splice(nb);
    lastPart.unshift(a.pop());

    // merge the rest, iterating over the shorter run
    if (a.length <= nb)
        merged = merged.concat(merge_lo(ms, a, b));
    else
        merged = merged.concat(merge_hi(ms, a, b));
    return merged.concat(lastPart);
}

Gallop mode is cool, but it only works well when there are large gaps between the runs; otherwise it only lowers the performance. Hence, Tim sort introduced some constraints for entering gallop mode.

In Python, it only enters gallop mode when an item has been compared 7 times and still has not found its place. And the most interesting thing is, this constraint gets lower if the merge keeps entering gallop mode, and is raised a bit again when galloping stops paying off. This feels like playing a fighting game - chaining a series of moves enlarges the damage, and if one move misses, the extra damage resets.

Here is the pseudocode:

const MIN_GALLOP = 7;

function merge_lo(
  ms, // MergeState
  a, // the shorter run
  b // the longer run
) {
    let min_gallop = ms.min_gallop;
    let merged = [];

    for (;;) {
        let acount = 0;          /* # of times A won in a row */
        let bcount = 0;          /* # of times B won in a row */

        // do sequential comparing first, until one run wins often
        // enough in a row to fulfil the galloping constraint
        do {
            if (b[0] < a[0]) {
                merged.push(b.shift());
                ++bcount;
                acount = 0;
                if (b.length == 0) return merged.concat(a);
            } else {
                merged.push(a.shift());
                ++acount;
                bcount = 0;
                if (a.length == 0) return merged.concat(b);
            }
        } while (acount < min_gallop && bcount < min_gallop);

        // galloping and merging
        ++min_gallop;
        do {
            // ease the constraint while galloping keeps winning
            ms.min_gallop = min_gallop -= min_gallop > 1;
            let k = gallop_right(b[0], a, 0);
            acount = k;
            if (k) {
                if (k < 0) throw k;
                merged = merged.concat(a.splice(0, k));
                if (a.length == 0) return merged.concat(b);
            }
            merged.push(b.shift());
            if (b.length == 0) return merged.concat(a);

            k = gallop_left(a[0], b, 0);
            bcount = k;
            if (k) {
                if (k < 0) throw k;
                merged = merged.concat(b.splice(0, k));
                if (b.length == 0) return merged.concat(a);
            }
            merged.push(a.shift());
            if (a.length == 0) return merged.concat(b);
        } while (acount >= MIN_GALLOP || bcount >= MIN_GALLOP);
        // penalize min_gallop for leaving gallop mode
        ms.min_gallop = ++min_gallop;
    }
}

There is actually another function called merge_hi. Since what it does is very similar to merge_lo, I will skip it.

That's it

Well, as for v8 (I seriously don't have the mood to read the source code...), according to here, it does not start Tim sort at the 32~64 min run threshold like Python does. It only starts when the list size is larger than 512, which means in most cases we only enter Binary Sort instead of the full Tim Sort.

The way uncle Tim thinks and the solutions he chose are quite fascinating to me - but not his C code 😂. I think you get what I mean, hahaha. Hope you enjoyed the reading.

