Matt Kenefick

# Combining popular things and random things

Author's note: As always, if I glossed over something or the flow is confusing, let me know in the comments.

Weighted randomization is a way of selecting a random item with bias.

Letβs say you have a list of 1,000 movies ranked by popularity. You want to watch something different, but you also want something popular. You canβt simply randomize all 1,000 movies the standard way because then your selection has an equal chance of being good as it does of being bad. Nevertheless, we still want something βrandomβ that is likely βgood.β

The most basic (and incorrect) way to do this is to create duplicate items in a list and then shuffle that list.

Letβs say you are going through numbers 1β5 and you want "1" to show up more often than the others.

`[1, 2, 3, 4, 5]` gives every option an equal chance.
`[1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 4, 5]` gives the 1s a far better chance.

Simple, right?
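In JavaScript, the duplicated-list approach might be sketched like this (the function name and weights here are mine, purely for illustration):

```javascript
// Naive weighted pick: duplicate heavier items into a pool,
// then pick from the pool uniformly at random.
function naiveWeightedPick(items, weights) {
	const expanded = [];
	items.forEach((item, i) => {
		// Push `weights[i]` copies of each item into the pool
		for (let j = 0; j < weights[i]; j++) {
			expanded.push(item);
		}
	});
	// Every slot in the pool is equally likely
	return expanded[Math.floor(Math.random() * expanded.length)];
}

// "1" occupies 9 of the 13 slots, so it wins far more often
naiveWeightedPick([1, 2, 3, 4, 5], [9, 1, 1, 1, 1]);
```

It works for tiny lists, but as shown below, the pool size explodes as the list and weights grow.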

What do you do when you have thousands/millions of rows in a database and significantly varying weights?

If we use the list method above, the number of items in our set will grow substantially and dramatically affect performance.

##### Light Example:

Let's say we want to rank 100 baby names starting at a popularity of 100 equally distributed down to 1.

Meaning James = 100, John = 99, Robert = 98, etc.

We can use the formula for the sum 100 + 99 + 98 + 97 + … + 2 + 1. That means our basic set above would be 5,050 items long just to do a simple weighted comparison of 100 items.

```
// This set would be 5,050 items long
['James', 'James', 'James', [...], 'John', 'John', ...]
```

In a JavaScript sense, that'd be `names.length` == 100, but when sorting it `namesForSorting.length` == 5050. That's unacceptable.
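As a quick sketch (the placeholder names are mine, not real data), expanding 100 items with weights 100 down to 1 really does produce a 5,050-item pool:

```javascript
// 100 placeholder names with weights 100, 99, ..., 1
const names = Array.from({ length: 100 }, (_, i) => `name${i}`);

const namesForSorting = [];
names.forEach((name, i) => {
	const weight = 100 - i; // first name weighs 100, last weighs 1
	for (let j = 0; j < weight; j++) {
		namesForSorting.push(name);
	}
});

console.log(names.length);          // 100
console.log(namesForSorting.length); // 5050
```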

```
How we came up with that number (X = 100 names):

(X + 1) * (X / 2)
(100 + 1) * (100 / 2)
101 * 50 = 5,050
```

What if we want to compare... 65,250 items?

Letβs try 65,250 items in the formula to get a random weighted item using the list method above.

```
Same formula, new number:

(X + 1) * (X / 2)
(65,250 + 1) * (65,250 / 2)
65,251 * 32,625 = 2,128,813,875
```

Thereβs no way we should create a list with two billion one hundred twenty eight million eight hundred thirteen thousand eight hundred seventy-five items. That is gratuitous waste of resources that will only get worse. What about a million records? (500 billion, ouch)

Note: In this example, we're only using an equal distribution of popularity (1, 2, 3, 4, 5+). In reality, the ranks could be anything (100, 100, 100, 95.6, 91, 85, 85, 85, 84, 84, 84, …). That means you should expect your sets to have much higher quantities than our basic example.

Fortunately, thereβs a much faster way to get the same exact results.

### Practical

Letβs use the Top 16 Football Teams in the AFC from 2017.

These are the four steps:

1. Sum up all of your rank/popularity into a variable: X.
2. Generate a random number between 0 and X. We'll call it Y.
3. Iterate through your list of data. Subtract each row's rank/popularity from your random number (Y).
4. When Y ≤ 0, that is your weighted randomized object index.

Hereβs a JS Fiddle that ranks the top 16 teams, distributing a score of 0β100 points equally to each team. We run the test 5,000 times and youβll see how often each item is selected and how consistent it happens.

Out of 5,000 randomizations, these are the selections:

``````603 β New England Patriots
520 β Pittsburgh Steelers
512 β Jacksonville Jaguars
472 β Kansas City Chiefs
447 β Tennessee Titans
405 β Buffalo Bills
384 β Baltimore Ravens
336 β Los Angeles Chargers
279 β Cincinnati Bengals
264 β Oakland Raiders
219 β Miami Dolphins
197 β Denver Broncos
150 β New York Jets
105 β Indianapolis Colts
70 β Houston Texans
37 β Cleveland Browns
``````

What the above results show is that every team had a chance to be chosen at "random." The Patriots were chosen 603 times, whereas the Browns were chosen 37 times. The method didn't exclude the Browns for being unpopular, but they were certainly picked less often.
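An experiment like the fiddle's can be re-created with a self-contained sketch (the team names here are placeholders, and the weight spread is my approximation of an even 0–100 distribution):

```javascript
// 16 teams, weights spread evenly from 100 down toward 0
const teams = Array.from({ length: 16 }, (_, i) => ({
	name: `Team ${i + 1}`,          // placeholder names
	weight: 100 - i * (100 / 16),   // 100, 93.75, 87.5, ...
}));
const totalWeight = teams.reduce((sum, t) => sum + t.weight, 0);

// Run the weighted selection 5,000 times and tally the winners
const counts = new Array(teams.length).fill(0);
for (let run = 0; run < 5000; run++) {
	let y = Math.random() * totalWeight;
	for (let i = 0; i < teams.length; i++) {
		y -= teams[i].weight;
		if (y <= 0) {
			counts[i]++;
			break;
		}
	}
}

// Heavier teams should dominate, but every team stays in play
console.log(counts);
```

The exact counts vary run to run, but the shape of the distribution should resemble the results above.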

The benefit of this method is that instead of shuffling through 136 items (16 + 15 + 14 + … + 1), we only run a subtraction operation anywhere between 1 and 16 times. Fewer operations means less computing power.

As with our 65,250-item example above: instead of an expensive shuffling operation over a set of 2,128,813,875 items, we only run a simple subtraction operation anywhere between 1 and 65,250 times.

Question: How much processing power does it take to subtract an integer ~50 times… ~4,000 times… ~10,000 times?