John Colagioia (he/him)

Posted on Mar 28, 2020 • Originally published at john.colagioia.net

Recutils, the Plain-Text Database

#database #plaintext #linux

I originally posted a version of this to my blog as the first of my "tech tips" that are 90% notes to myself so that I don't need to keep researching. To my surprise, it has become one of the most popular posts, there, so maybe it'll be useful to someone here, too.

Background on Databases

Before I jump into recutils itself, if you're not already conversant in databases, I'll try to give a capsule version. For example, If you've taken the undergraduate database class or have written more than a few lines of SQL, you can probably safely skip down to Recutils, in General.

For the rest of you, here are some definitions, and then I'll go over some of the basics.

A database is a collection of related data, regardless of the form.

A relational database is a database (surprise!) where we store data elements based on their relationships (surprise again!) to other elements. Sometimes those relationships are described as records and sometimes they're described by metadata that connects separate records. The full definition centers on Codd's Rules.

A database management system (DBMS) is software that mediates access to a database to minimize the chances of a user or program shooting itself in the foot. For example, it might (depending on the system) moderate concurrent access or prevent you from deleting a record that another record points to.

Every database has a universe of discourse or mini-world that explains how to interpret the data, whether anybody bothered to write it down. We all somehow agree to never talk about this, even though "what does this data actually mean?" is one of the most important questions we can be asking, so I'm bringing it up here...

Instead of a universe of discourse, the thing we really get is a schema, the layout of what we store in each kind of record and how they relate to each other through foreign keys. I'll explain foreign keys in a bit, rather than try to assemble a coherent definition.

With that out of the way, record sets are called "relations" by theoreticians and implemented as tables, with records implemented as rows and individual elements as columns. The terms are mostly interchangeable, but you'll find table/row/column used by just about everybody in industry.

The last thing you probably need to know (unless you want to implement a DBMS) is that finding data ("querying") has what amounts to three steps.

First, we join the relations/tables we need, by which I mean taking records/rows that are connected through a common element/field/column. That commonality is (and should be specified as) a "foreign key," a kind of pointer into that other table's space. Even if we only care about one table, we still need to specify it.

Second, we filter the available records down to those whose elements are interesting to us, very similar to writing an if statement in general programming. We might want all the records, of course, which makes this step trivial. Or this might be extremely complicated, depending on what we want.

Third, we project the remaining records to just the fields that interest us. Again, we might want every field/column, making this part easier to specify.

If you're familiar with SQL, you'll probably recognize the SELECT column-or-columns FROM table-or-tables WHERE conditions-are-true format, where the FROM might list tables arbitrarily or might explicitly use JOIN to connect the tables. If you're not, recutils doesn't use SQL, so don't worry about it.

Now that we're all up to speed on databases, I can rant for a bit about why I care about recutils...

Recutils, in General

When I store information, I'm a big fan of light-weight, text-based formats for a variety of reasons.

They store and compress well, when that's important.
It's possible to quickly inspect or edit whatever I'm working on.
I can quickly search anything I'm working on with a tool like grep.
I'm not tied to specific applications.
It's easier to convert to another format, making it closer to future-proof.
Version control understands text better than anything else, so I can easily analyze changes over time.

For example, my blog, my novel, and my personal notes are generally all written in Markdown files. I store most of my spreadsheets as sets of CSV files, unless the formatting or calculated cells are important. In the rare cases I need something like a flow chart, it's usually in pic format.

Databases don't usually do any of this well, however. For the sake of efficiency, you typically dump all of your data into a binary file that the database management system has optimized for its own access and analytical purposes.

So, I got a little bit excited when I discovered recutils, a system to manage plain-text relational databases. It's not perfect (unless I'm missing something important), but it's definitely close to the sort of thing I would generally reach for, closer than using a free-form JSON file, for example.

File Organization

Unlike a traditional relational database management system, recutils needs you to make decisions about where to store records. Normally, the files are obscured from your vision, if only so that you're not tempted to monkey around with the content in an editor. Here, however, you need to decide what files you want and what you want to store in each.

And in fact, one of the clunkier aspects to recutils that I found is that it's difficult (impossible?) to work with an arbitrary subset of files, such as asking it to operate on two (and only two) files at once. So, there seem to be two options:

Put all the record types into one file, or
Separate the record types into different files, while specifying all of them with a wildcard (such as "*.rec") whenever working in situations that require more than one record type (like a Join.

Neither of these is ideal, of course, but the single file is probably acceptable for small databases and the latter is probably the most intuitive. I'll use the one-record-type-per-file approach, here.

Obviously, recutils doesn't actually care either way, and it's possible to combine the two, so that record types that are often used together can share a file, while the other clusters of types or individual types can be in their own files. You don't even really need to separate the files by type, and instead could assign records to files by any criteria you like, like your favorite color at the moment you're adding the records. You shouldn't do that, because it will make your life harder, but recutils isn't going to care, is my point.

Record Schema

While it's not strictly necessary to have a schema for records to conform to, it makes life significantly easier, if for no other reason than having some documentation. So, the header of a file of records, optionally, shows the types of each field, keys, and so forth.

It looks to me like unspecified fields are all arbitrary strings, multi-line with a plus sign (+) at the beginning of every line after the first. Other than that, if we are looking to store something like application users, we might have something like...

# The name of the "table" and a human-readable
# description
%rec: user
%doc: User information

# The non-multi-line string fields
%type: id int
%type: login line
%type: email email
%type: password line
%type: name line
%type: created_at: date
%type: updated_at: date
%type: role enum Guest User Moderator Administrator
%type: deleted bool

# Roles for fields; note that we sneak in "profile,"
# which is our multi-line string.
%key: id
%mandatory: login email password
%allowed: name profile created_at updated_at role deleted
# We could also %prohibit fields we want to ensure don't
# exist until we're ready to set them.

# Special restrictions
%auto: id created_at updated_at
%confidential: password
%unique: login email

That's probably more extensive than it needs to be for an example, but we see the record type name (%rec), decent coverage of the available types (int, line, email, date, enum, and bool, the different kinds of fields (other than prohibited), and constraints (%auto, %confidential, and %unique). I expect all of them are self-explanatory, except maybe line for a single line of text and confidential encrypting the field.

There's one useful type not shown here, that I'll get to in a bit.

Records

Each user record, now, is going to need to have our specified fields and the right types, all in name: value format. For example, I created the following user via the Fake Name Generator.

id: 7760
login: Brouch
email: VanessaJStolz@dayrep.com
password: $2a$11$WBkGS/r5/0tn2muqPXreVO40wy6dxdEzKj/lY5.y7BQ5YoOHKerNW
name: Vanessa J. Stolz
profile: A 78-year-old Libra from Atlanta, Georgia, who works
+ for De Pinna as a machine feeder and drives a blue 2000
+ Toyota Pod.
created_at: 2020-02-05 06:48:55-0500
updated_at: 2020-02-05 06:48:55-0500
role: Moderator
deleted: False

As a brief aside, recutils doesn't care about extra fields. So, if you want to add a birthday, significant other's name, and hat size to Vanessa's record above, but not anybody else's, that's probably going to be somewhat difficult to work with, but not destructive.

Beyond just editing our users.rec file manually, we have a few options to populate our file with data.

Conversion

If you have a pre-existing data source, recutils comes with two utilities to convert from common database types:

csv2rec turns each line of a CSV file into a record, with the header line providing the field names.
mdb2rec converts a Microsoft Access database (or a single table) into records.

I can confirm that the CSV program works well, but couldn't find an Access database to test.

Insertion

The recins utility, probably unsurprisingly, inserts records. Continuing with our users, it can be used something like...

recins --type=user --field=login --value=Gode1962 \
  --field=email --value=PhillipSCollins@teleworm.us \
  --field=password --value=password \
  --field=name --value="Phillip S. Collins" \
  --field=profile --value="Suburban driver of Dodge Rams" \
  --field=role --value=User \
  --field=deleted --value=False \
  --verbose users.rec

Assuming that none of the constraints have been violated, here, this will add the record to the users.rec file. If it fails, the --verbose argument will produce an error message explaining what went wrong.

If you have one record definition/schema in each file, recins will figure out where to store it, thankfully.

Exploring

We can get some basic information, now. The following gives us a count of each record type in the file.

recinf users.rec

The difference between that and retrieving the schema is a matter of an argument.

recinf --descriptor users.rec

And then we can check into more general queries of the data, though the syntax isn't SQL. We can pick a random set of records.

recsel --random 5 users.rec

We can get a "projection" of all data by specifying columns. The columns can also be renamed for convenience, which will be important, later.

recsel --print-values=login,email users.rec
recsel --print-values=login:Username,email:EMail users.rec

Replacing --print-values with just --print produces output that looks like a record, with name: value pairs. Also, if there are multiple types of records, we would need to specify --type=user.

We can also run arbitrary queries.

recsel --expression="id = 23573" users.rec
recsel --expression="email ~ gmail.com" users.rec

The expressions can get fairly sophisticated and this doesn't support SQL's syntax.

Deleting Records

Now that we understand insertion and query expressions, we should be able to figure out how to delete a record.

recdel --type=user --expression="login = Selits" --comment users.rec

The --comment argument deletes the record by commenting it out in the file, so that it can be found or restored later.

Multiple Tables

Now that we've settled our users, we can give the users something to work with, like favorite websites. We'll keep this website.rec schema simpler than our users.

%rec: website
%doc: Favorite websites submitted to the database.

%type: id int
%type: url line
%type: owner rec user
%type: created_at: date
%type: updated_at: date

%key: id
%mandatory: url owner

%auto: id created_at updated_at

The important field, here, sets up a foreign key:

%type: owner rec user

That is, owner is a reference to a record of type user. Because the user record type's key is an integer, the owner field is an integer, as well.

So, we have an example record.

id: 2031
url: https://www.gnu.org/software/recutils/manual/Foreign-Keys.html#Foreign-Keys
owner: 7760
created_at: 2020-02-05 06:48:55-0500
updated_at: 2020-02-05 06:48:55-0500

The 7760, in this case, maps this URL to the user table through its key, the id field.

Joining Tables

Now, we get to the important stuff. We can now get a list of URLs and the name of the person who submitted each one. Remember, we can't explicitly specify multiple files, so we either have a single file or a wildcard.

recsel --type=website --field owner \
  --print-values=url:URL,owner_name:Owner *.rec

And, of course, we could add a query expression to the command, if we wanted, or we could involve additional tables.

Note, incidentally, that the join is an inner join. One of the problems with recutils seems to be that other kinds of joins are not permitted.

Programming Support

In the grand scheme, a database that only allows ad hoc queries isn't particularly useful. Instead, we need some kind of API that can be used by software. This is only an API if you're writing shell scripts, obviously.

For programming in C, there's the librec-dev library package, and a lot of other languages can (indirectly) use that. But I only see bindings available for Python and PHP, with neither library updated in several years.

Also, the library documentation is "go read the source code," specifically the main header file. Apparently, the important functions in the library approximately map to...

librec Function	recutils Utility
`rec_db_query`	`recsel`
`rec_db_insert`	`recins`
`rec_db_delete`	`recdel`
`rec_db_set`	`recset`
`rec_int_check_db`	`recfix --check`

That's not spectacular, but still has some potential to improve, at least. Maybe at some point, when other things settle down, I'll revisit recutils specifically to document the library.

Other Capabilities

This just scratches the surface, honestly. Queries can be grouped. Aggregate functions are available. Subtypes are available, such as strings that conform to a regular expression or integers in a limited range, and can be named with %typedef. Records can have compound fields or be sorted. There are more encryption options than just %confidential. Output can be formatted with templates. Records can be updated. The recfix program performs diagnostics and repairs some issues.

In other words, it's a sophisticated system with a lot going for it, despite the handful of problems I've called out in the preceding discussion. It won't replace CSV files for my simple data sets, but for anything more complicated than that (other than calculation, which still needs a lightweight solution that doesn't have the overhead of a typical programming language), it looks pretty good!

As a good example of where I'll start using it, for the Silver Bat stories (more in that universe are definitely coming), I currently just have a Markdown file that lists all the characters. It includes their profiles from my profile generator, plus a brief biography and possible plot threads. When the time comes to share that information with other people, it would be nice to manage that data better and also turn it into a kind of concordance to be able to see where all the characters have been used.

For much more information, you can read through the documentation. It's not particularly friendly to newcomers who don't already understand how things work---which is hopefully nobody who got to the end of this post---but it's extensive enough to answer most questions that aren't related to the library.

All things considered, this seems like a decent foundation for projects where transparency is important.

Bonus: SQLite Export

Since I wanted to get up and running quickly with data with recutils, I decided to use a database I was already working with to populate my records. SQLite, which is where the data is, has a mildly clunky way of exporting tables to CSV files.

First, get into the SQLite shell.

sqlite3 database.sqlite3

Then, decide what you want to export and go for it.

.tables # <- This lists all tables
# Now that we know what tables are available,
# we can export them.
.headers on
.mode csv
.output users.csv
SELECT * FROM users;
# The table and file names don't need to match.
.output websites.csv
SELECT * FROM websites;
# If you don't issue a new .output directive,
# all your data will end up in the same file,
# which is going to make everything else more
# complicated than it needs to be.
.quit

Now, we have files that can be converted to record sets with csv2rec as described above.

Credits: The header image is work, technology, vintage, wheel, retro, clock, by an anonymous PxHere photographer, is made available under the terms of the CC0 1.0 Universal Public Domain Dedication.

DEV Community