DEV Community

boB Rudis
boB Rudis

Posted on • Originally published at rud.is on

Wicked Fast, Accurate Quantiles Using ‘t-Digests’ in R with the {tdigest} Package

@ted_dunning recently updated the t-Digest algorithm he created back in 2013. What is this “t-digest”? Fundamentally, it is a probabilistic data structure for estimating any percentile of distributed/streaming data. Ted explains it quite elegantly in this short video:

Said video has a full transcript as well.

T-digests have been baked into many “big data” analytics ecosystems for a while but I hadn’t seen any R packages for them (ref any in a comment if you do know of some) so I wrapped one of the low-level implementation libraries by ajwerner into a diminutive R package boringly, but appropriately named tdigest:

There are wrappers for the low-level accumulators and quantile/value extractors along with vectorised functions for creating t-digest objects and retrieving quantiles from them (including a tdigest S3 method for stats::quantile()).

This:

install.packages("tdigest", repos="https://cinc.rud.is/")

Enter fullscreen mode Exit fullscreen mode

will install from source or binaries onto your system(s).

Basic Ops

The low-level interface is more useful in “streaming” operations (i.e. accumulating input over time):

set.seed(2019-04-03)

td <- td_create()

for (i in 1:100000) {
  td_add(td, sample(100, 1), 1)
}

quantile(td)
## [1] 1.00000 25.62222 53.09883 74.75522 100.00000

Enter fullscreen mode Exit fullscreen mode

More R-like Ops

Vectorisation is the name of the game in R and we can use tdigest() to work in a vectorised manner:

set.seed(2019-04-03)

x <- sample(100, 1000000, replace=TRUE)

td <- tdigest(x)

quantile(td)
## [1] 1.00000 25.91914 50.79468 74.76439 100.00000

Enter fullscreen mode Exit fullscreen mode

Need for Speed

The t-digest algorithm was designed for both streaming operations and speed. It’s pretty, darned fast:

microbenchmark::microbenchmark(
  tdigest = tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1)),
  r_quantile = quantile(x, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
)
## Unit: microseconds
## expr min lq mean median uq max neval
## tdigest 22.81 26.6525 48.70123 53.355 63.31 151.29 100
## r_quantile 57675.34 59118.4070 62992.56817 60488.932 64731.23 160130.50 100

Enter fullscreen mode Exit fullscreen mode

Note that “accurate” is not the same thing as “precise”, so regular quantile ops in R will be close to what t-digest computes, but not always exactly the same.

FIN

This was a quick (but, complete) wrapper and could use some tyre kicking. I’ve a mind to add serialization to the C implementation so I can then enable [de]serialization on the R-side since that would (IMO) make t-digest ops more useful in an R-context, especially since you can merge two different t-digests.

As always, code/PR where you want to and file issues with any desired functionality/enhancements.

Also, whomever started the braces notation for package names (e.g. {ggplot2}): brilliant!

Oldest comments (1)

Collapse
 
ajwerner profile image
Andrew Werner

Hey, author of tdigestc here. Super happy to see you found it useful. Let me know if you have any questions or comments about the library. Also curious to hear if you had any trouble wrapping it for use in R.

The data structure in the library is based on the newer update to @ted_dunning's great work but the scale function is still using the older version. If you're finding traction, I'd be happy to extend it. In the meantime I had shifted my focus to a more featureful go implementation github.com/ajwerner/tdigest.