I wanted to take a moment during my lunch break to talk about how to manage configuration in apps. Configuration is a topic near and dear to my heart, and I don't actually know of much writing I can point to, so I wanted to try to fill in the gap. This is a bit of a brain dump, but I thought it was more important to get it out there than it was to get it polished.
There are lots of different kinds of programs out there in this crazy world, and the considerations are different depending on what kind of program you're working on!
Today I want to focus on "apps" in distributed systems, by which I mean web servers, stream processors and to an extent data ETL jobs.
Those readers who are either steeped in webapp culture or who have gone down the Kubernetes rabbit hole may be familiar with a concept called 12-factor apps. The name comes from a manifesto published by the people that invented Heroku and, for those of you who are as impatient as I am, you can skip reading all 12 factors and pretend I said "apps like ones on Heroku". If you're not familiar with Heroku, "apps that can run on Kubernetes" is a pretty close substitute. The apps I'm talking about today are ones that can comfortably be written with the 12-factor app paradigm in mind.
Other programs I'm not covering here include desktop apps (where you're gonna use folders in your home directory), mobile apps (where your platform has a blessed answer for this anyway) and browser apps (not my realm of expertise).
These days, most apps use environment variables as a standard interface for base configuration. I believe that in most cases, this is the "right answer" for app configuration.
The core reason why this is such a good idea is that you're practically guaranteed to have the ability to set environment variables even in "weird" environments. Any Unix process - and as far as I know any Windows process as well - has access to environment variables, the same way as all processes have stdin/stdout/stderr and integer exit codes. Moreover, things like Heroku, Kubernetes, Docker and even systemd all have good support for environment variables. Finally - and this is pretty important - environment variables will work even if your filesystem is nonexistent or FUBAR.
It's worth noting that environment variables have a major weakness: the only data structure you get is strings. In other words, the type for os.environ is Mapping[str, str]. This means that a) nesting/namespacing is really hard; and b) you will need to do some post-processing to convert those string values to other data types.
Put a pin in this; I'll circle back on how to deal with this effectively in a minute.
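To make the strings-only point concrete, here's a tiny sketch. The variable names APP_DEBUG and APP_PORT are made up for illustration, and they're set inside the script only so that it runs on its own.

```python
import os

# Hypothetical variable names, set here only so the example is self-contained.
os.environ["APP_DEBUG"] = "false"
os.environ["APP_PORT"] = "8080"

debug_raw = os.environ["APP_DEBUG"]
port_raw = os.environ["APP_PORT"]

# Every value comes back as a str, whatever it "looks like".
print(type(debug_raw).__name__, type(port_raw).__name__)

# Naive conversion is a trap: any non-empty string is truthy,
# so this is True even though the value says "false".
naive_debug = bool(debug_raw)

# Numbers need an explicit conversion, which raises ValueError on junk input.
port = int(port_raw)
```

The bool() trap in particular is why ad-hoc conversion tends to go wrong.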
There are lots of cases where you still want to use files for configuration. Here are some use cases I think are compelling:
- The data in the configuration is "large". Environment variables don't necessarily have a problem here - I was double-checking this with my team's devops engineer the other day and I don't remember the limit because it was so big it didn't matter. Nevertheless, large pieces of data can be unwieldy, and having the option to offload that to a file can be handy.
- There's a standard file format for the configuration that other tools know how to use. For example, GCP client tools load creds from a file by default and, while it's possible to sidestep this, Google doesn't support it. Google might be wrong here - they are, but let's not get too distracted - but in these scenarios it's best to shrug and roll with it.
There are also a lot of cases where you want to store configuration in a database. Truth be told, most systems have some level of configuration in the database - after all, your users don't have access to your environment variables, and using a database means you can edit configuration in one place and have it accessible by all of your services more or less immediately.
In order to load configuration from files or databases, you need to know where they live - for files, this is filesystem paths, and for databases this is connection information. This data can live in environment variables; what your app will do, then, is load the paths and connstrings from the environment and use that to bootstrap everything else.
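Here's a rough sketch of that bootstrapping step for the file case. The variable name APP_CONFIG_PATH and the choice of JSON are assumptions for the sake of the example; the same shape works for a database connection string.

```python
import json
import os
from pathlib import Path


def load_file_config(environ=os.environ):
    """Use an environment variable to locate a config file, then load it."""
    # APP_CONFIG_PATH is a hypothetical name; the environment only holds the path.
    path = environ.get("APP_CONFIG_PATH")
    if path is None:
        return {}  # no file configured, so there is nothing extra to load
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Taking environ as a parameter (defaulting to os.environ) is just a convenience so you can pass a plain dict in tests.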
Environment variables are all well and good from the perspective of your running process, but they're set when the application is started and don't exist anywhere outside the context of a running process. Therefore, you have to have something that lives outside of your app that sets the environment variables.
The punchline is that in most cases your environment variables are stored either in flat files or in a special database. As examples, on Heroku there's a web UI where you can edit environment variables for your app, and in a Kubernetes-based platform you're probably storing most environment variables in a YAML file somewhere, with sensitive secrets being stored in a special store such as HashiCorp Vault, where they can be injected by your platform when the program is started.
For deployment, this system is often supplied for you by your platform (Heroku, the Kubernetes-based thing your devops team runs for you). However, for local development, you'll need to solve this problem yourself.
There are a number of solutions to setting up environment variables locally - a wrapper bash script, setting env vars in your IDE, writing a custom tool that loads them from YAML, etc. - but the most popular way of doing this is probably dotenv.
The general strategy of dotenv is that it has a semi-standard file format (an "env file") stored at a well-known location (.env in your project root), combined with a library that loads that file when it exists and convinces your runtime that the values in that file were actually environment variables. Basically you import the library, call the load method, and go about your life as though the environment variables were set externally.
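To show the idea rather than any particular library, here's a deliberately naive, hand-rolled loader. The function name load_env_file is mine, and the parsing rules (KEY=VALUE lines, # comments, optional surrounding quotes) are simplifying assumptions. Real implementations such as python-dotenv handle quoting, escaping and the like far more carefully.

```python
import os
from pathlib import Path


def load_env_file(path=".env", environ=os.environ):
    """Copy KEY=VALUE pairs from an env file into `environ`, if the file exists."""
    env_path = Path(path)
    if not env_path.is_file():
        return  # no file: assume the platform set real environment variables
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and anything we can't parse
        key, _, value = line.partition("=")
        # Values already present in the environment win over the file.
        environ.setdefault(key.strip(), value.strip().strip("'\""))
```

Letting the real environment win is a design choice, not a rule: it means the file only supplies values that nothing else has set.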
A good thing about dotenv is that it's widely supported - most languages have an implementation of the idea somewhere, and many tools (say foreman or PyCharm) support it.
There are some bad things about dotenv, and while I don't think they're show-stopping they are worth calling out. One is that the format is poorly specified and not always useful for other tools. I've run into so many problems where an "env file" meant for one tool was completely broken in another tool that ostensibly supported the format - super annoying. Bash in particular doesn't know what to do with env files - naively sourcing them in your shell won't do what you want it to do - since you need to use the export keyword in order to expose the variable to your app. Another issue, which is significantly worse, is that any secrets stored in your project root can accidentally get committed to source control. Whenever this happens it's Not Great, because anybody that has access to git now has access to the secrets, where often the list of people who should know the secret is not the same as the list of people who should be able to see the source code. This doesn't even account for hacks! Nothing feels worse than accidentally "sheeping your creds" and having to spend a day or more frantically cycling them.
If you use dotenv, make sure you add .env to your .gitignore. This isn't absolutely foolproof - you can still force-add an ignored file - but it'll help a lot. Don't forget to do this!
At some point, you need to actually read the environment variables into your app. Often - and this is where we go back to our pin - you'll need to make sure that required values are defined, set defaults where necessary, convert them into other data types (like booleans, integers, floats, or lists) and ensure that the values they have make sense. A lot of people do this stuff in-line and in an ad-hoc fashion, but I highly discourage this strategy because it gets messy real quick.
What I recommend is creating a centralized Config class and a centralized procedure for loading the config from the environment. This can manifest itself in a lot of ways depending on what you're working on, since you have to account for the opinions of your existing framework as well as your own sensibilities at any given time. I have not settled on a preferred approach for this just yet! But I can sketch out one way this can work.
I like to use a library in Python called attrs to create what you can call "data classes". These are loosely related to algebraic data types from functional programming, and like them they're light on behavior and are usually treated as immutable. Many other languages have answers to this, and some have more than one. For example, the Python standard library now has a module called dataclasses that's a lot like attrs; Scala has case classes, and C# now has record types. Anything along these lines will work well.
Classes created with attrs have default constructors that you don't want to fiddle with, and it's useful to separate the construction of the class instance from the loading of environment variables. The way to do this is to use a factory, potentially in a class method. For example, you may call config = Config.from_environment() in your code. This structure is worthwhile even if you're not using an attrs-like class.
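Here's one minimal way that could look. To keep it runnable without installing anything I've used the standard library's dataclasses instead of attrs, and the field names, variable names and defaults are all invented for the example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # Field names and defaults are illustrative, not from any real app.
    database_url: str
    port: int = 8000
    debug: bool = False

    @classmethod
    def from_environment(cls, environ=os.environ):
        """Build a Config from environment variables; the constructor stays plain."""
        try:
            database_url = environ["DATABASE_URL"]  # required, no default
        except KeyError:
            raise RuntimeError("DATABASE_URL must be set") from None
        return cls(
            database_url=database_url,
            port=int(environ.get("PORT", "8000")),
            debug=environ.get("DEBUG", "false").strip().lower() == "true",
        )
```

Because the generated constructor is untouched, tests can still build a Config directly, e.g. Config("sqlite://", 9000, True), without going anywhere near the environment.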
If you're familiar with domain driven design - abbreviated to DDD - then you'll probably be comfortable with this strategy. If you're not familiar with DDD but this reminds you a little of the models in your MVC apps, that's not a coincidence - MVC apps implement a lot of the general principles from DDD, though usually using the active record pattern instead of the repository pattern. The flippant summary for this crowd is that DDD means "have a modeling layer", and - basically - your config object is part of your domain model. As such, the general principles of DDD, which encourage nice tidy entity objects that delegate behavior to things like repositories and services, definitely apply.
I like to create the config once and pass it to anything that needs it using dependency injection, but you can also instantiate it and export it from your module as a singleton [1]. This strategy works well with Flask, which is designed in such a way that it's tough not to do this. Note to self, write "why I hate Flask" someday.
In the from_environment class method, iterate over the environment, do any parsing you need to do for the values, and return a new Config instance with everything set in there. You can brute force this, but with something like attrs you can reflect off the class definition by iterating over the __attrs_attrs__ class attribute. This attribute is a sequence of the attributes you can expect, and from there you can get both the name of each attribute and any type information it's annotated with. You can also set custom properties, if you need them.
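A sketch of the reflective version follows. Again I've swapped in the standard library's dataclasses.fields() so the example is self-contained; with attrs, attrs.fields(Config) plays the same role. The naming convention (upper-cased field name) and the trick of calling the annotation as a converter are my assumptions, and the trick only works for simple types like str and int.

```python
import os
from dataclasses import MISSING, dataclass, fields


@dataclass(frozen=True)
class Config:
    database_url: str
    port: int = 8000

    @classmethod
    def from_environment(cls, environ=os.environ):
        kwargs = {}
        for field in fields(cls):
            env_name = field.name.upper()  # assumed convention: port -> PORT
            if env_name in environ:
                # field.type is the annotation; here it doubles as the converter.
                kwargs[field.name] = field.type(environ[env_name])
            elif field.default is MISSING:
                # No value and no default means the variable is required.
                raise RuntimeError(f"{env_name} must be set")
        return cls(**kwargs)
```

Adding a new setting is now a one-line change to the class definition, which is most of the appeal.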
When loading environment variables, you want to treat variables of the same end types consistently. For example, if you accept "true" and "false" as values for one boolean, accepting "1" and "0" for others would be less than ideal. With that in mind: write a single implementation and reuse it for everything of the same type. You can do this either by looking up the type information for the attribute and using that as a key for looking up a conversion function, or by using attrs' converters functionality. Both strategies work - the former might be "cleaner" to some eyes and decouples it from class construction, but the latter is less effort and makes it easier to introduce minor customizations between variables (such as different defaults for blank/unset booleans) if need be.
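The type-lookup strategy can be sketched roughly like this. The names parse_bool, CONVERTERS and convert are hypothetical, and the decision to accept only "true" and "false" (case-insensitively) is just one possible rule.

```python
def parse_bool(raw: str) -> bool:
    """One shared rule for every boolean variable."""
    value = raw.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError(f"expected 'true' or 'false', got {raw!r}")


# Lookup keyed by the target type, so every field of a type is parsed alike.
CONVERTERS = {bool: parse_bool, int: int, float: float, str: str}


def convert(raw: str, target_type: type):
    """Convert a raw environment string using the shared converter for its type."""
    return CONVERTERS[target_type](raw)
```

Rejecting anything unexpected with a ValueError, rather than quietly treating it as false, means a typo in a deploy config fails loudly at startup.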
When it comes to container types, environment variables struggle to do anything more complicated than lists. For lists, my recommendation is to split the string on a character. ":" is a good choice, since it matches how POSIX does it with the PATH variable, but "," is OK too. Beyond that, reach for a file if need be.
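As a sketch, a list parser along these lines is only a couple of lines long. The name parse_list is made up, and dropping empty entries is my choice rather than anything standard.

```python
def parse_list(raw: str, sep: str = ":") -> list[str]:
    """Split a delimited string into items, dropping empty entries."""
    return [item for item in raw.split(sep) if item]
```

Filtering out empties means an unset or blank variable turns into an empty list instead of a list containing one empty string. The obvious limitation is that items can't contain the separator themselves.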
An interesting technique for managing namespaces is to denormalize them into the name of the environment variables. For example, if you wanted a data structure that looked like this:
dict(A=dict(U="foo"), B=dict(V="bar"))
you can name the corresponding environment variables A_U and B_V. You can build up fairly complicated structures this way! The downside is that the character you pick can't be used in naming things - this is particularly unfortunate for "_", since I like it to represent spaces between words. In practice, I tend to be less clever here and let things get a lil' leaky. Your mileage may vary.
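If you do want to be clever, rebuilding the nested structure from the flat names might look something like this. The function name nest_environment is hypothetical, and passing the list of names explicitly is a simplification so that unrelated variables get ignored.

```python
def nest_environment(environ, names, sep="_"):
    """Turn flat names like A_U into nested dicts like {"A": {"U": ...}}."""
    result = {}
    for name in names:
        # Everything before the last separator is a namespace; the rest is the key.
        *parents, leaf = name.split(sep)
        node = result
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = environ[name]
    return result
```

This sketch doesn't guard against a name being both a leaf and a namespace (say A and A_U), which is one more way the scheme can get leaky.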
At this stage, my brain dump is complete and I'm not sure I have anything else important to say here. If anything comes to mind I'll update this post accordingly. Otherwise I GOTTA GET BACK TO WORK! I hope this was helpful.
[1] Note that when I say "singleton" I'm not talking about playing Gang of Four straight. In this case, the module system ensures that a given module only gets loaded once, so you don't need to guard against double-instantiation and in fact doing so is a bad idea.