Lost In [SQL] Translation: Charting d[b]plyr Mapped SQL Function Support Across All Backends

#dplyr #r #rstats

Like more posts than I care to admit, this one starts innocently enough with a tweet by @gshotwell:

Is there a reference document somewhere of which dplyr commands work on various database backends? #rstats

— Gordon Shotwell (@gshotwell) April 9, 2019

Since I use at least 4 different d[b]plyr backends every week, this same question surfaces in my own noggin on occasion and I couldn’t resist going all Columbo on this mystery.

I should note that if you only really care about the backends that come with dbplyr @paleolimbot has you covered with this post, which also shows you the translated SQL!

Executing The Plan

There are at least 24 separate backends for dbplyr. Most folks won’t need more than one if their databases all have a decent ODBC or JDBC driver. To be able to use dplyr idioms with databases there needs to be a way to translate R code (e.g. function calls) into SQL. A ton of functions are pre-mapped in dbplyr already and most backend implementations start by relying on these defaults. Furthermore, since SQL is not nearly as “standard” across installations as one might think, some common tasks — such as string manipulation — have a default noop translation.

If you do have to switch across backends with any frequency, knowing which backend provides support for which functions might be nice, but there hasn’t been a reference for this until Dewey & I accepted Gordon’s challenge. What makes this a “challenge” is that you first have to figure out what packages provide a d[b]plyr backend interface then figure out what SQL translations they offer (they don’t necessarily have to inherit from the ones provided by dbplyr and may add other ones to account for SQL clauses that aren’t in functional form). So the first step was just a look through CRAN for which packages import dbplyr and also adding in some I knew were on GitHub:

library(stringi)
library(hrbrthemes)
library(tidyverse)

# All the pkgs from the home CRAN mirror that import 'dbplyr'
c(
  "arkdb", "bigrquery", "childesr", "chunked", "civis", "corrr", "cytominer", "dbplot",
  "dbplyr", "dexter", "dexterMST", "dlookr", "dplyr", "dplyr.teradata", "etl",
  "healthcareai", "hydrolinks", "implyr", "infuser", "ipumsr", "macleish", "mdsr",
  "mlbgameday", "modeldb", "MonetDBLite", "mudata2", "parsemsf", "pivot", "pleiades",
  "pool", "poplite", "RClickhouse", "replyr", "RPresto", "sergeant", "sparklyr",
  "sqlscore", "srvyr", "taxizedb", "valr", "wordbankr", "metis.tidy"
) -> pkgs

I ended up doing install.pkgs(pkgs) which was easy since I have a home CRAN mirror and use macOS (so binary package installs).

The presence of a dbplyr import does not mean a package implements a backend, so we have to load their namespaces and see if they have the core “tell” (i.e. they implement sql_translate()):

(map_df(pkgs, ~{
  tibble(
    pkg = .x,
    trans = loadNamespace(.x) %>%
      names() %>%
      keep(stri_detect_fixed, "sql_translate")
  )
}) -> xdf)
## # A tibble: 28 x 2
## pkg trans
## <chr> <chr>
## 1 bigrquery sql_translate_env.BigQueryConnection
## 2 civis sql_translate_env.CivisConnection
## 3 dbplyr sql_translate_env.ACCESS
## 4 dbplyr sql_translate_env.Oracle
## 5 dbplyr sql_translate_env.SQLiteConnection
## 6 dbplyr sql_translate_env.Impala
## 7 dbplyr sql_translate_env.OdbcConnection
## 8 dbplyr sql_translate_env.MySQLConnection
## 9 dbplyr sql_translate_env.PqConnection
## 10 dbplyr sql_translate_env.PostgreSQLConnection
## # … with 18 more rows

Now we know the types of connections that package has SQL translation support for. But, we’re looking for the actual functions they provide. To discover that, we’re going to make dummy classed connection objects and get the translations they offer.

However, some may take the defaults from dbplyr and not override them so we also need to test if they use the sql_not_supported() noop, which we can do by seeing if the function body has a call to stop() in it. We’re also going to ignore maths operators along the way:

(filter(xdf, stri_detect_fixed(trans, ".")) %>%
  filter(trans != "sql_translate_env.NULL") %>% # ignore NULL
  filter(trans != "sql_translate_env.Pool") %>% # ignore db connection pooling 
  filter(trans != "sql_translate_env.PrestoConnection") %>% # this one errored out
  mutate(ƒ = map(trans, ~{

    # get the sql translate functions
    con <- NA
    cls <- stri_replace_first_fixed(.x, "sql_translate_env.", "")
    class(con) <- cls

    env <- sql_translate_env(con)

    # but ^^ rly isn't a nice, tidy object, it's a list of environments
    # with functions in it so we have to iterate through it to extract
    # the function names.

    map_df(env, ~{

      part <- .x
      fs <- names(part)

      # but it's not just good enough to do that b/c a given function name
      # might just implement the "sql_not_supported()" pass through. So we have
      # to actually look to see if the function body has a "stop()" call in it
      # and ignore it if it does.

      map_df(fs, ~{
        tibble(ƒ = .x, src = paste0(as.character(body(part[[.x]])), collapse = "; ")) %>% # this gets the body of the function
          filter(!stri_detect_fixed(src, "stop(")) %>%
          filter(stri_detect_regex(ƒ, "[[:alpha:]]")) %>% # and we rly don't care about maths
          select(-src)
      })
    })
  })) %>%
  unnest(ƒ) %>%
  mutate(trans = stri_replace_first_fixed(trans, "sql_translate_env.", "")) -> xdf)
## # A tibble: 1,318 x 3
## pkg trans ƒ
## <chr> <chr> <chr>
## 1 bigrquery BigQueryConnection median
## 2 bigrquery BigQueryConnection gsub
## 3 bigrquery BigQueryConnection as.logical
## 4 bigrquery BigQueryConnection is.null
## 5 bigrquery BigQueryConnection case_when
## 6 bigrquery BigQueryConnection is.na
## 7 bigrquery BigQueryConnection if_else
## 8 bigrquery BigQueryConnection str_replace_all
## 9 bigrquery BigQueryConnection as.integer
## 10 bigrquery BigQueryConnection as.character
## # … with 1,308 more rows

The rest is all just ggplot2 basics:

 mutate(xdf, db = glue::glue("{pkg}\n{trans}")) %>% # make something useful to display for the DB/conn
  mutate(n = 1) %>% # heatmap block on
  complete(db, ƒ) %>% # complete the heatmap
  arrange(ƒ) %>%
  mutate(ƒ = factor(ƒ, levels=rev(unique(ƒ)))) %>% # arrange the Y axis in the proper order
  ggplot(aes(db, ƒ)) +
  geom_tile(aes(fill = n), color="#2b2b2b", size=0.125, show.legend=FALSE) +
  scale_x_discrete(expand=c(0,0.1), position = "top") +
  scale_fill_continuous(na.value="white") +
  labs(
    x = NULL, y = NULL,
    title = "SQL Function Support In Known d[b]plyr Backends"
  ) +
  theme_ipsum_ps(grid="", axis_text_size = 9) + # you'll need to use the dev version of hrbrthemes for this function; just sub out a diff theme if you already have hrbrthemes loaded
  theme(axis.text.y = element_text(family = "mono", size = 7))

Which makes:

(WP wouldn’t make the featured image linkable so I had to stick it in again to enabled the link so folks can make it full size which is absolutely necessary to see it).

FIN

If you do play with the above, don’t forget to go one more step and incorporate Dewey’s actual SQL mapping to see just how unstandardized the SQL standard is.

Contiguous code for the above is over at SourceHut.

DEV Community

Lost In [SQL] Translation: Charting d[b]plyr Mapped SQL Function Support Across All Backends

Executing The Plan

FIN

Top comments (0)

Read next

ReportLite Practice: Multi-layered cross-reports with complex formats

React Controlled Components V/S Uncontrolled Components

Unveiling React Server Components

Dose nestjs support auto instrumentation with Opentelementry