Dimitri Merejkowsky

Posted on Sep 9, 2017 • Updated on Oct 15, 2017 • Originally published at dmerej.info

Parsing Config Files The Right Way

#python


Parsing configuration files is something we programmers do every day.
But are you sure you're doing it the proper way?

Let's find out!

In the rest of this article, we'll assume we want to parse a configuration file containing a GitHub access token in a command-line program called frob.

In JavaScript

You may write something like this:

/* in config.json */
{
  "auth":
  {
    "github":
    {
      "token": "ab642ef9zf"
    }
  }
}

/* in frob.js */
const config = require('./config');
const token = config.auth.github.token;
...

Well, that's assuming we are using Node.js. Making this work in a browser or in any other JavaScript context is left as an exercise to the reader :)

There are several issues with the above approach, though. To explain them, we are going to switch to Python, a language I know a lot better, and go through a list of problems and potential solutions.

Syntax

First, using JSON for configuration files may not be such a good idea. So we're going to use YAML instead. Here are a few reasons why:

Like JSON, we can map directly to "plain old data" Python types (lists, dictionaries, integers, floats, strings and booleans)
Syntax is well-defined and all implementations behave the same. (It's not the case for JSON, see Parsing JSON is a Minefield for the details)
We can have comments in the configuration file.
File is easier to read for humans. Compare:

{
  "auth":
  {
    "github":
    {
      "token": "ab642ef9zf"
    }
  }
}

auth:
  github:
    token:  "ab642ef9zf"

Elements can be arbitrarily nested. (.ini files only have one level of "sections"; TOML does allow nested tables, but each level has to be spelled out in a dotted table name such as [auth.github])
There are several ways to express the same data, so we can choose whichever is more readable:

shopping_list:
 - eggs
 - bacon
 - tomatoes
 - beans

tags: ["python", "testing"]

Whitespace is significant, so the file has to be properly indented.
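
To make the point about comments concrete, here's a small sketch using Python's standard json module. The parses_as_json helper is a name made up for this example. It only shows that this particular parser rejects a //-style comment; other JSON parsers may be more lenient.

```python
import json

# A JSON document with a //-style comment, which the JSON grammar does not allow
commented = """
{
  // personal access token
  "token": "ab642ef9zf"
}
"""


def parses_as_json(text):
    """Return True if the stdlib json module accepts the text."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True
```

The same document without the comment line is accepted, so the comment alone is what breaks parsing here.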

Location

Second, the config.json file is hard-coded to be located right next to the source code.

This means it could easily get committed and pushed to a version control system, token included, if we are not careful.

So instead we'll try to be compatible with the freedesktop standards (the XDG Base Directory specification).

Basically this means we should:

Look for the config file in $XDG_CONFIG_HOME/frob.yml if the XDG_CONFIG_HOME environment variable is set.
If not, look for it in ~/.config/frob.yml
And if it's not found in the home directory, look for the default in /etc/xdg/frob.yml (or in the directories listed in $XDG_CONFIG_DIRS, if that variable is set)

Doing so follows the principle of least astonishment: many programs follow those rules today, so users of our implementation will expect us to do the same.

Fortunately, we don't have to implement all of this ourselves. We can use the pyxdg library:

import xdg.BaseDirectory

cfg_path = xdg.BaseDirectory.load_first_config("frob.yml")
if cfg_path:
    ...
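
If you're curious what that lookup amounts to, here's a rough sketch using only the standard library. It reflects my reading of the XDG lookup order described above and is not pyxdg's actual code. The find_config name and its environ and home parameters are assumptions made for this example (they make the function easy to test).

```python
import os
from pathlib import Path


def find_config(name, environ=None, home=None):
    """Return the first existing config path in XDG lookup order, or None."""
    environ = os.environ if environ is None else environ
    home = Path.home() if home is None else Path(home)

    # $XDG_CONFIG_HOME comes first; it defaults to ~/.config when unset or empty
    config_home = environ.get("XDG_CONFIG_HOME") or str(home / ".config")
    # System-wide directories come next; the default is /etc/xdg
    config_dirs = environ.get("XDG_CONFIG_DIRS") or "/etc/xdg"

    candidates = [Path(config_home)]
    candidates += [Path(d) for d in config_dirs.split(":") if d]
    for directory in candidates:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None
```

Calling find_config("frob.yml") with no other arguments uses the real environment and home directory. In real code, sticking with pyxdg is still the safer choice.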

Error handling

Sometimes the file won't exist at all, so we'll want to inform our user about that:

cfg_path = xdg.BaseDirectory.load_first_config("frob.yml")

if not cfg_path:
    raise InvalidConfig("frob.yml not found")

Sometimes the file will exist but read_text() will fail for some reason (like a permission issue):

import pathlib

try:
    config_file = pathlib.Path(cfg_path)
    contents = config_file.read_text()
except OSError as read_error:
    # Chain the original exception so the root cause stays in the traceback
    raise InvalidConfig(f"Could not read file {cfg_path}: {read_error}") from read_error
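
The snippets in this article use InvalidConfig without ever defining it. A minimal sketch, assuming a plain Exception subclass is enough, could look like this. The read_config_text helper is a name made up for this example; it just bundles the two checks we've seen so far.

```python
from pathlib import Path


class InvalidConfig(Exception):
    """Raised when the configuration file is missing, unreadable or malformed."""


def read_config_text(cfg_path):
    """Return the file contents, or raise InvalidConfig with a readable message."""
    # load_first_config() returns None when no file was found
    if not cfg_path:
        raise InvalidConfig("frob.yml not found")
    try:
        return Path(cfg_path).read_text()
    except OSError as read_error:
        message = f"Could not read file {cfg_path}: {read_error}"
        raise InvalidConfig(message) from read_error
```

Having a single exception type means the main() function of frob only needs one except clause to print a friendly message and exit.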

Sometimes the file will exist but will contain invalid YAML:

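import ruamel.yaml

contents = config_file.read_text()
# Recent ruamel.yaml releases expose parsing through a YAML() object
yaml = ruamel.yaml.YAML(typ="safe")
try:
    parsed = yaml.load(contents)
except ruamel.yaml.error.YAMLError as yaml_error:
    # Not every YAML error carries a position, and context_mark is often None;
    # problem_mark (0-based) points at the place where parsing failed
    mark = getattr(yaml_error, "problem_mark", None)
    details = format_error(mark.line, mark.column) if mark else str(yaml_error)
    message = f"{cfg_path}: YAML error: {details}"
    raise InvalidConfig(message) from yaml_error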
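
The format_error helper is not shown above. Here's one hypothetical way to write it, assuming the line and column it receives are 0-based (as YAML marks are). The optional contents parameter is an addition of mine to show the offending line.

```python
def format_error(line, column, contents=None):
    """Render a 0-based (line, column) position as a human-friendly message."""
    message = f"line {line + 1}, column {column + 1}"
    if contents is None:
        return message
    lines = contents.splitlines()
    if 0 <= line < len(lines):
        # Show the offending line with a caret under the reported column
        message += "\n" + lines[line] + "\n" + " " * column + "^"
    return message
```

Reporting 1-based positions matches what most text editors display, so users can jump straight to the problem.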

Schema

That's where things get tricky. What if the file exists, is readable, contains valid YAML code but the user made a typo when writing it?

Here are a few cases we should handle:

# empty config: no error

# `auth` section is here but does not contain
# a `github` entry: no error
auth:
  gitlab:
    ...

# `auth.github` section is here but does not
# contain `token`, this is an error:
auth:
  github:
    tken: "ab642ef9zf"

A naive way to handle this would be to write code like this:

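parsed = ruamel.yaml.YAML(typ="safe").load(contents)
# An empty file parses to None, so fall back to an empty dict
auth = (parsed or {}).get("auth")
if auth:
    github = auth.get("github")
    # A missing `github` section is allowed; a missing `token` is not
    if github is not None:
        token = github.get("token")
        if not token:
            raise InvalidConfig(
                "Expecting a key named 'token' in the "
                "'github' section of 'auth' config"
            )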

This gets tedious very quickly. A better way is to use the schema library:

import schema

auth_schema = schema.Schema(
    {
        schema.Optional("auth"): {
            schema.Optional("github"): {
                "token": str,
            }
        }
    }
)

try:
    # An empty file parses to None; validate it as an empty mapping
    auth_schema.validate(parsed or {})
except schema.SchemaError as schema_error:
    raise InvalidConfig(f"{cfg_path}: {schema_error}") from schema_error

Saving

Last but not least, sometimes we'll want to automatically save the configuration file.

In that case, it's important that the saved configuration file still resembles the original one.

With ruamel.yaml, this is done by loading and dumping in round-trip mode, which preserves comments and key order:

def save_token(token):
    # YAML() defaults to round-trip mode; older releases spelled this
    # ruamel.yaml.load(contents, ruamel.yaml.RoundTripLoader)
    yaml = ruamel.yaml.YAML()
    config = yaml.load(config_file.read_text())
    config["auth"]["github"]["token"] = token
    with config_file.open("w") as stream:
        yaml.dump(config, stream)

Conclusion

Phew! That was a lot of work for a seemingly easy task. But I do believe it's worth going through all this trouble: we covered a lot of edge cases and made sure we always raise very clear error messages. Users of code written like this will be very grateful when things go south. Cheers!

Thanks for reading this far :)

I'd love to hear what you have to say, so please feel free to leave a comment below, or read the feedback page for more ways to get in touch with me.

Top comments (9)

Massimo Artizzu • Sep 10 '17 • Edited

File is easier to read for humans. Compare: (followed by overly spaced JSON data)

Well that's just your preference of writing a JSON file. I'd have written something like this:

{
  "auth": {
    "github": {
      "token": "ab642ef9zf"
    }
  }
}

More verbose, yes, but not that much. Node's console.log would have inlined it all.

Syntax is well-defined and all implementations behave the same

?!
Are you sure about that? That's quite a statement since YAML is a superset of JSON (really!), so any problem with parsing JSON carries over to YAML. Plus you have YAML's own syntax.

Moreover, YAML is extensible, meaning that it can be impossible to parse a YAML file that's been extended for another platform.

There are plenty of reasons why one could prefer YAML over JSON, and you explained quite a few, but IMO these two aren't among them.

Whitespace is significant, so the file has to be properly indented.

While this sounds nice for readability, this also means that you need a validator to be fairly sure that your config file is OK, because if you mess up the indentation the file is still considered valid YAML. If you miss a parenthesis in JSON or a closing tag in XML, it won't parse.

If you don't want a validator, the rule of thumb is to avoid deeply nested YAML documents, which means this perk loses most of its meaning:

Elements can be arbitrary nested

Dimitri Merejkowsky • Sep 10 '17

Thanks for the feedback! A few remarks.

About JSON not being well-defined

See seriot.ch/parsing_json.php. True, most of the time you won't have any problem using different implementations of JSON parsers, but the devil is in the details. (Things like dates, text encoding, trailing commas or not-a-number floats).

YAML is a superset of JSON but its specification is more precise.

you need a validator to be fairly sure that your config file is ok,

Not sure what you mean by that. Personally, whenever I'm editing json, xml or yaml files, I have a linter that tells me if the syntax is OK.

If you mess up with the indentation the file is still considered valid YAML.

True. That's an argument that always comes back when you talk about whitespace significance. Python has the same problem, but personally I don't care that much. Fortunately, if this is an issue for you there are lots of alternatives.

avoid deeply nested (YAML) documents,

This is good advice and it applies to any configuration file ;)

Massimo Artizzu • Sep 11 '17 • Edited

YAML is a superset of JSON but its specification is more precise.

I'm not sure I'm following you here. The format specification of JSON is flawless (as far as we know); the parser implementation specification, on the other hand, is left with more freedom as it's intended as an interoperable format and thus the result depends on the language that has to deal with it.

Now, I doubt YAML is redefining the JSON format spec, but maybe you're talking about the implementation?

Anyway, the page you posted provides a lot of useful tests, but it's all about edge cases. And it would be very odd to have an edge case in a configuration file.

Not sure what you mean by that. Personally, whenever I'm editing json, xml or yaml files, I have a linter that tells me if the syntax is OK.

I have to clarify indeed: I meant that you need a schema to validate your YAML. As an example, consider the following:

{
  "brands": {
    "BMW": [ "Z4" ],
    "Chevrolet": [ "Matiz" ],
  "Ferrari": [ "458" ]
  }
}

If I mess up the indentation, I still have a good JSON. If I do the same with YAML it's quite different:

brands:
  BMW: [ "Z4" ]
  Chevrolet: [ "Matiz" ]
Ferrari: [ "458" ]

A linter wouldn't catch it. That's why I suggest keeping config files as flat as possible :/

Dimitri Merejkowsky • Sep 11 '17 • Edited

you're talking about the implementation?

Oh yes. Sorry.

If I mess up the indentation, I still have a good JSON. If I do the same with YAML it's quite different

Right. I see what you mean. Again, nothing new under the sun. You have exactly the same problem in Python.

Trivia: in a big Python project I was working on for quite a long time I only had a few bugs caused by incorrect indentation, but I can see why it's a big deal for lots of people :)

And the solution is the same: keep your code "flat" by using nice little helper functions.

Adrian B.G. • Sep 9 '17

My suggestion is: do NOT store tokens, auth, or other sensitive info in config files, even if they are in .gitignore. Switch to env variables, it's better from many perspectives.

Keep in your config files the things that do not change between your environments (staging, production, dev). See the 12factor for some reasons in favor, and here are some reasons against

I think JSON is better for JS projects. The main reason: it's simpler. You don't need a parser, and simple is better.

it maps directly in your code
it's javascript
your team does not need to learn a new config schema
it's already widely used, ex meteor

Dimitri Merejkowsky • Sep 10 '17

All good advice, and the link about pros and cons about environment variables is very interesting.

But note that the context is a bit different: the 12 factor app is a software as a service, and my article is about a command-line tool.

Also, you're right, if you're using JavaScript already, having config files in JSON (or even in JavaScript code!) is certainly a good idea.

Adrian B.G. • Sep 11 '17

Sorry, my bad. I thought the app was a Node.js web app.

When you start using hosting services, your team gets bigger and the project grows, I always hit the config files problem.

You can also fix some of the "cons" of ENV vars by using a stub file with all the ENV vars, which can also be used as a local DEV env. Some hosting services also allow keeping ENV vars in files, but that defeats the purpose.