Bob Lied

Let me count the ways (PWC 244 Count Smaller)

TLDR

One pass over sorted data is a better strategy than searching through random data.

The Problem

Perl Weekly Challenge 244, Task 1, Count Smaller asks us to categorize data for its relative values:

You are given an array of integers. Write a 
script to calculate the number of integers 
smaller than the integer at each index.

Example 1

Input: @int = (8, 1, 2, 2, 3)
Output: (4, 0, 1, 1, 3)

For index = 0, count of elements less than 8 is 4.
For index = 1, count of elements less than 1 is 0.
For index = 2, count of elements less than 2 is 1.
For index = 3, count of elements less than 2 is 1.
For index = 4, count of elements less than 3 is 3.

Example 2

Input: @int = (6, 5, 4, 8)
Output: (2, 1, 0, 3)

Example 3

Input: @int = (2, 2, 2)
Output: (0, 0, 0)

Idea 1: Brute force search

The first idea, as usual, is brute force. For every element in the array, scan the array and count the values that are less than it. An optimization that immediately occurs to me is that the scan for smaller values is easier if the numbers are sorted: we can quit early and don't have to scan the entire list every time.

sub countSmaller_A($nums)
{
    my @sorted = sort { $a <=> $b } $nums->@*;
    my @smaller = ();
    for my $n ( $nums->@* )
    {
        my $count = 0;
        for my $s ( @sorted )
        {
            # Sorted order lets us stop at the first value that isn't smaller
            last if $s >= $n;
            $count++;
        }
        push @smaller, $count;
    }
    return \@smaller;
}

Searching in a sorted list has an optimization. We could find the index of the first element in @sorted that is greater than or equal to the target using binary search (or, more realistically, using List::MoreUtils::bsearch_index, because my odds of getting a binary search algorithm right are slim). Maybe I'll come back to this, but this is basically an O(n^2) algorithm, so there are probably better ideas to explore.
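For the curious, here's a sketch of what that binary-search refinement could look like, with a hand-rolled search for the first sorted element not smaller than the target (the name countSmaller_bsearch is mine, not from the solutions above):

```perl
use strict;
use warnings;
use feature 'signatures';
no warnings 'experimental::signatures';

# Sketch only: replace the inner linear scan with a binary search for
# the first sorted element >= $n; that index is the count of smaller values.
sub countSmaller_bsearch($nums)
{
    my @sorted = sort { $a <=> $b } $nums->@*;
    my @smaller;
    for my $n ( $nums->@* )
    {
        my ($lo, $hi) = (0, scalar @sorted);
        while ( $lo < $hi )
        {
            my $mid = int( ($lo + $hi) / 2 );
            if ( $sorted[$mid] < $n ) { $lo = $mid + 1 }
            else                      { $hi = $mid }
        }
        push @smaller, $lo;   # index of first element >= $n
    }
    return \@smaller;
}
```

That brings each lookup down to O(log n), for O(n log n) overall, which is the same order as the sort itself.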

Idea 2: Frequency counts

The second approach is to count how many times each value occurs, saving a frequency table. Then select from that table all the lesser numbers and add them up. This is a bit of a Perl tour-de-force, using array operations up the wazoo. The code is compact, but the complexity might not be worth it.

sub countSmaller_B($nums)
{
    use List::Util qw/sum0/;
    my %freq;
    $freq{$_}++ for $nums->@*;

    return [ map { my $i = $_; sum0 @freq{ grep { $_ < $i } keys %freq } } $nums->@* ];
}

Building the frequency table is relatively obvious, but what's going on in that last line?

  • return [ ... ] -- the answer is going to be an array reference to the list of numbers
  • return [ map { my $i = $_; ... } $nums->@* ] -- the list of numbers is going to be generated by mapping each of the original numbers to its count of smaller numbers. Each of the original numbers is going to be assigned to $i while we're doing the transformation
  • grep { $_ < $i } keys %freq -- we're going to select all keys from the frequency table %freq where the key is less than the number we're looking at
  • @freq{ grep...} -- Using the list of keys from the grep, take a hash slice from the %freq table. This will give us the counts of each value less than $i.
  • sum0 @freq... -- The sum0 function adds up the values in an array, defaulting to 0 if the array is empty.
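To make those steps concrete, here's what the intermediate values look like for Example 1's data when $i is 3 (the variable names here are just for illustration):

```perl
use strict;
use warnings;
use List::Util qw/sum0/;

my @int = (8, 1, 2, 2, 3);
my %freq;
$freq{$_}++ for @int;        # %freq is (8 => 1, 1 => 1, 2 => 2, 3 => 1)

my $i = 3;
my @lesser = grep { $_ < $i } keys %freq;   # (1, 2), in arbitrary hash order
my $count  = sum0 @freq{@lesser};           # 1 + 2 = 3
```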

Idea 3: Make one pass over sorted data

Let's return for a moment to that sorted array we built back in Idea 1. Once the array is sorted, it's pretty easy to find all the smaller numbers -- it's a monotonically increasing sequence as we scan from left to right.

There are two complications: repeated values, and the fact that the answer has to be in the same order as the original array. Hashes for caches and hash slices are going to come to our rescue.

sub countSmaller_C($nums)
{
    my @sorted = sort { $a <=> $b } $nums->@*;

    my %smaller;
    my $lessCount = 0;
    while ( defined(my $i = shift @sorted) )
    {
        # Record the count only the first time we see a value
        if ( ! exists $smaller{$i} )
        {
            $smaller{$i} = $lessCount;
        }
        $lessCount++;
    }

    # Hash slice restores the original order
    return [ @smaller{$nums->@*} ];
}

The first thing we do is take the hit of doing the sort, on the premise that we're going to get that performance back with a simpler search loop.

The %smaller variable is a hash that is going to contain our answer, but in an arbitrary order. We will un-arbitrary it at the end with a hash slice that uses the original array for keys -- that will solve our ordering problem.
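Here's that final hash slice in isolation, using the cache that a pass over Example 1's sorted data (1, 2, 2, 3, 8) would leave behind (the literal %smaller values below are filled in by hand):

```perl
use strict;
use warnings;

# The cache after one pass over sorted (1, 2, 2, 3, 8)
my %smaller = (1 => 0, 2 => 1, 3 => 3, 8 => 4);

my @int    = (8, 1, 2, 2, 3);
my @answer = @smaller{@int};   # hash slice in original order: (4, 0, 1, 1, 3)
```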

The $lessCount variable is an accumulator that will count up the number of smaller values as we pass over the sorted array.

The while loop will take elements off the sorted array from left to right. We saw last week that shifting elements off the array can be more efficient than using array indexing. A quick note: the test has to include defined because there could be a zero in the data, and that would test as false.
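A quick demonstration of why the defined test matters when a 0 can appear in the data:

```perl
use strict;
use warnings;

my @vals = (0, 5, 7);
my @wrong;
while ( my $v = shift @vals ) { push @wrong, $v }
# The first shift returns 0, which is false, so the loop exits
# immediately and @wrong stays empty.

my @vals2 = (0, 5, 7);
my @right;
while ( defined(my $v = shift @vals2) ) { push @right, $v }
# @right is (0, 5, 7): shift returns undef only when the array is empty.
```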

Within the loop, we use the %smaller hash as a cache for values we've already seen. That solves our repeated-values problem.

Time for benchmarking

This is only an exhibition; no wagering, please. As I've been doing for the past week, I brought out the Benchmark module for some simple comparison.

sub runBenchmark($repeat)
{
    use Benchmark qw/cmpthese/;
    my @data = map { int(rand(100)) } 1..100;

    cmpthese($repeat, {
        "simple   " => sub { countSmaller_A(\@data) },
        "frequency" => sub { countSmaller_B(\@data) },
        "one pass " => sub { countSmaller_C(\@data) },
    });
}

The benchmark generates a bigger data set than the task examples, and then runs the functions thousands of times to get something like a statistically valid result.

And here's the big reveal, sorted from slowest to fastest:

               Rate  frequency  simple  one pass
frequency    1339/s         --    -63%      -95%
simple       3647/s       172%      --      -88%
one pass    29268/s      2085%    702%        --

First of all, those compact array operations in Idea 2 look cute, but there's a lot of data access going on underneath. We were better off with the more readable loops of Idea 1.
But, wow! Look at the one-pass solution: roughly eight times faster than the simple loops, and over twenty times faster than the frequency table. Better algorithms beat micro-optimizations.
