Scott Chamberlain

Posted on Aug 17, 2020 • Originally published at recology.info on Aug 17, 2020

taxizedb: an update

#r #taxonomy #database

taxizedb arose from pain in using taxize when dealing with large amounts of data in a single request or doing a lot of requests of any data size. taxize works with remote data sources on the web, so there’s a number of issues that can slow the response down: internet speed, server response speed (was a response already cached or not; or do they even use caching), etc.

The idea with taxizedb was to allow users to do the same things as taxize allows, but much faster by accessing the entire database for a data source on their own computer. The previous versions of taxizedb used a variety of different databases (MySQL/MariaDB, PostgreSQL, SQLite), so the technical barrier to entry was pretty high. In the newest version just released, we’ve drastically simplified the database situation, among other things.

taxadb was developed as an alternative approach to taxizedb and should also be considered when dealing with taxonomic names. It takes a different approach for the data, with tabular files hosted on GitHub releases, but is similar in that after downloading the data is put into a SQL database, SQLite by default (with other options for databases). taxadb user facing functions are different from those in taxizedb, and largely don’t overlap.

taxizedb quick links:

Install the latest version, if you don’t get v0.2.0 with

install.packages("taxizedb")

then use

install.packages("taxizedb", repos = "https://dev.ropensci.org")

Load the package

library(taxizedb)

All SQLite!

SQLite is shipped in nearly every device these days, so taxizedb now uses only SQLite for the database backend for each data source. Every person that installs taxizedb should have SQLite already installed. In addition, there’s no usernames/passwords/ports to setup with. How we’ve accomplished this is partly through automation:

NCBI: SQLite built within R from tabular files
ITIS: they provide a SQLite dump
Plantlist: is no longer updated; we build a SQLite manually from csv files
COL: a SQLite is built once a day via GitHub Actions
GBIF: a SQLite is built once a day via GitHub Actions
Wikidata: SQLite built within R from a tabular file
World Flora Online: SQLite built within R from a tabular file

Some of the databases have indices to speed up queries, making them a bit larger relative to no indices, but these days most people likely are willing to use up a little more disk space on their computer to have faster queries.

New data sources

Three new data sources were added:

NCBI taxonomy - all of this work was done by Zebulun Arendsee
World Flora Online - the replacement for The Plant List
Wikidata - the table wikidata-taxon-info, extracted taxon objects from Wikidata, last updated April 2018, on Zenodo

New functions: taxize equivalents

Three new high level functions matching those in taxize were added: children, classification, downstream. The taxize version of those functions are still good for smaller requests, but with larger requests, its probably best to use taxizedb. The most common problem where taxize becomes frustrating is with downstream where a user wants all species within a high taxonomic rank like phylum. The original work for these functions was done by Zebulun Arendsee.

Here’s a comparison of taxize vs. taxizedb with downstream - getting all species within the genus Bombus (bumble bees)

id_tx <- taxize::get_tsn("Bombus")
system.time(taxize::downstream(id_tx, db = "itis", downto = "species"))
#> user system elapsed 
#> 2.144 0.130 20.533

id_txdb <- taxizedb::name2taxid('Bombus', db = "itis")
system.time(taxizedb::downstream(id_txdb, db = "itis", downto = "species"))
#> user system elapsed 
#> 0.132 0.051 0.186

In addition, three new “mapping” functions were added that are similar to those in taxize, but with different names: name2taxid (scientific or common name to taxonomy ID); taxid2name (taxonomy ID to scientific name); taxid2rank (taxonomy ID to rank).

We saw name2taxid above. Below we get the taxonomic ID for COL, ITIS and GBIF for Bombus

name2taxid('Bombus', db = "col")
#> [1] "3993765"
name2taxid('Bombus', db = "itis")
#> [1] "154397"
name2taxid('Bombus', db = "gbif")
#> [1] "1340278"

Get the scientific name from a taxonomic ID

taxid2name(3993765, db = "col")
#> [1] "Bombus"
taxid2rank(3993765, db = "col")
#> [1] "genus"

These functions are quite fast too:

x <- taxize::names_list(rank = "species", size = 10000L)
system.time(name2taxid(x, db = "gbif", out_type = "summary"))
#> user system elapsed 
#> 0.074 0.014 0.092

Thoughts?

Get in touch if you have any feedback at https://github.com/ropensci/taxizedb/issues

DEV Community

taxizedb: an update

All SQLite!

New data sources

New functions: taxize equivalents

Thoughts?

Top comments (0)

Read next

Research AI Assistant Application - RAG

AI Bot: Intelligent Financial Query Assistant Powered by PostgreSQL and Ollama

Deploying PostgreSQL, MySQL, and MongoDB on Kubernetes: A Step-by-Step Guide

How to Set Up Prometheus Exporters for PostgreSQL and MongoDB in Kubernetes