The “let it crash” error handling strategy of Erlang, by Joe Armstrong

#erlang

This post was motivated by this tweet:


Click here to read the following text by the late Joe Armstrong in its original context.

On 5 Mar 2003, Luke Gorrie wrote:

... cut ...

Hope that clears things up for someone else who learned the other
definition than the Erlang guys :-)

Yes :-)

I have two "thumb rules":

1. Check inputs where they are "untrusted":
   - at a human interface
   - a foreign language program
2. Or when you want a better error diagnostic than the default one - in this case just exit with the better diagnostic.

For example if I'm parsing an integer I'd write

    I = list_to_integer(L)

or

    case (catch list_to_integer(L)) of
        {'EXIT', _} ->
        exit(["Most honored user I regrettably have to inform you
              that your input on line", Ln, "was not an integer
              in fact it was ",L, "which IMHO is wrong
              have a nice day 
              Mr. C. Computer"]);
        I -> I
    end.

The latter is "an industrial quality" error message :-)

Note (important) the semantics of both are to raise an exception in the
event of an error.
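The two variants can be put side by side in a small module. This is only a sketch: the module name `parse_demo`, the function names, and the shape of the exit reason are made up for illustration, and it uses `try ... catch` instead of the older `catch` expression.

```erlang
-module(parse_demo).
-export([parse_plain/1, parse_verbose/2]).

%% Default behaviour: raises error:badarg if L is not the text of an integer.
parse_plain(L) ->
    list_to_integer(L).

%% Same semantics (bad input still raises an exception), but the exit
%% reason carries a better diagnostic. Ln is the input line number.
%% The {not_an_integer, ...} term is a hypothetical format, not a convention.
parse_verbose(L, Ln) ->
    try list_to_integer(L)
    catch
        error:badarg ->
            exit({not_an_integer, [{line, Ln}, {input, L}]})
    end.
```

With this sketch, `parse_demo:parse_verbose("42", 1)` returns `42`, while `catch parse_demo:parse_verbose("4x", 7)` returns `{'EXIT', {not_an_integer, [{line, 7}, {input, "4x"}]}}`. Neither function returns a value on bad input.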

Aside: I once saw code like this:

x(a) -> 1;
x(b) -> 2;
x(X) -> 
    %% what do I do now
    io:format("expecting a or b").

The programmer had actually added a comment (What do I do now) - of course they had done the wrong thing.

The program:

x(a) -> 1;
x(b) -> 2.

Is correct.

Evaluating x(c) generates an exception as required.

In their modified program x(c) evaluates to the atom ok (i.e. the return
value of io:format) - which is incorrect.

If they had wanted a better diagnostic they should have written:

x(a) -> 1;
x(b) -> 2;
x(X) -> exit({x,expects,argument,'a or b'}).

If you do nothing to your code you get a good diagnostic anyway:

If x is in the module m and you call this in the shell
you'd get:

(catch m:x(c)).
{'EXIT',{function_clause,[{m,x,[c]},
                          {erl_eval,expr,3},
                          {erl_eval,exprs,4},
                          {shell,eval_loop,2}]}}

function_clause means you couldn't match a function head.

[{m,x,[c]}, ...

means you were calling function x with argument c.
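For readers on a recent Erlang/OTP, the same information can be picked out of the stack trace with `try ... catch`. The sketch below assumes OTP 21 or later (for the `Class:Reason:Stack` pattern). The `describe/1` helper is hypothetical and only there to show the shape of the data. Current releases add a fourth element with file and line information to each stack entry, so the trace looks slightly different from the 2003 shell output above.

```erlang
-module(m).
-export([x/1, describe/1]).

x(a) -> 1;
x(b) -> 2.

%% Hypothetical helper: call x/1 and report how the call ended.
describe(Arg) ->
    try x(Arg) of
        Value -> {ok, Value}
    catch
        error:function_clause:Stack ->
            %% For function_clause the first stack entry holds the actual
            %% argument list rather than just the arity.
            [{Mod, Fun, Args, _Location} | _] = Stack,
            {crashed, function_clause, {Mod, Fun, Args}}
    end.
```

Here `m:describe(a)` returns `{ok, 1}` and `m:describe(c)` returns `{crashed, function_clause, {m, x, [c]}}`. The diagnostic comes for free from the two-clause definition of `x/1`.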

So in this case NOT programming the error case results in

1. shorter code
2. clearer code
3. no chance of accidentally violating the spec by introducing ad hoc "out of spec" code to correct the error
4. perfectly acceptable error diagnostic

IMHO (3) is a big gain - specifications always say what to do if everything works - but never what to do if the input conditions are not met - the usual answer is "something sensible" - but what? You're the programmer.

In C etc. you have to write something if you detect an error. In Erlang it's easy - don't even bother to write code that checks for errors - "just let it crash".

Then write an independent process that observes the crashes (a linked process) - the independent process should try to correct the error; if it can't correct the error it should crash (same principle).

Each monitor should try a simpler error recovery strategy - until finally the error is fixed (this is the principle behind the error recovery tree behaviour).
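A minimal sketch of such an observer, assuming nothing beyond plain links and `trap_exit`. The module `watcher` and its `run/1` function are invented for this example and are not an OTP behaviour. `Jobs` is a list of zero-arity funs, ordered from the preferred strategy to the simplest fallback.

```erlang
-module(watcher).
-export([run/1]).

%% Hypothetical helper: run each strategy in a linked worker until one
%% of them finishes without crashing.
run(Jobs) ->
    %% Trap exits so that a crashing worker becomes an {'EXIT', ...}
    %% message instead of killing this process; restore the flag afterwards.
    Old = process_flag(trap_exit, true),
    try attempt(Jobs)
    after process_flag(trap_exit, Old)
    end.

%% Every strategy crashed: the observer cannot fix the error, so it
%% crashes too and leaves the problem to whoever is watching it.
attempt([]) ->
    exit(all_strategies_crashed);
attempt([Job | Simpler]) ->
    Parent = self(),
    Pid = spawn_link(fun() -> Parent ! {done, self(), Job()} end),
    receive
        {done, Pid, Value} ->
            %% Consume the worker's normal exit signal as well.
            receive {'EXIT', Pid, normal} -> ok end,
            Value;
        {'EXIT', Pid, _Reason} ->
            %% The worker crashed: fall back to the next, simpler strategy.
            attempt(Simpler)
    end.
```

For example, `watcher:run([fun() -> list_to_integer("x") end, fun() -> fallback end])` returns `fallback`: the first worker crashes, and the observer tries the simpler job. The worker funs contain no error handling at all.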

Why was error handling designed like this?

Easy - to make fault-tolerant systems you need TWO processors. You can never ever make a fault-tolerant system using just one processor - because if that processor crashes you are scomblonked.

One physical processor does the job - another separated physical processor watches the first processor, fixes errors if the first processor crashes - this is the simplest possible way of making a fault-tolerant system.

This principle is mirrored exactly in the Erlang process structure - this is because we want to have "location transparency" of processes - in other words at a certain level of abstraction we do not wish to know which physical processor an individual Erlang process runs on.

This is the fundamental reason why we use "remote error recovery" (i.e. handling the error in a different process from the process in which the error occurred) - it turns out that this has beneficial implications for the design of a system; mainly because there is a clean separation between doing a job, observing if the job was done, and fixing an error if an error has occurred.
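The separation can be illustrated with a monitor instead of a link. In this sketch the module `observe` and the function `outcome/1` are hypothetical names. The job runs in one process, and a different process only observes how it ended. What to do about a crash is left to the caller.

```erlang
-module(observe).
-export([outcome/1]).

%% Hypothetical helper: Job is a zero-arity fun that does the work.
%% The calling process does none of the work itself; it only observes.
outcome(Job) ->
    {Pid, Ref} = spawn_monitor(Job),
    receive
        {'DOWN', Ref, process, Pid, normal} -> finished;
        {'DOWN', Ref, process, Pid, Reason} -> {crashed, Reason}
    end.
```

`observe:outcome(fun() -> ok end)` returns `finished`, and `observe:outcome(fun() -> exit(boom) end)` returns `{crashed, boom}`. Fixing the error is a third, separate step that the caller can base on the returned term.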

This organization corresponds nicely to an idealized human organization of bosses and workers - bosses say what is to be done, workers do stuff. Bosses do quality control and check that things get done; if not they fire people, re-organize and tell other people to do the stuff. If they fail (the bosses) they get sacked etc.

Note: I said, idealized organization, usually if projects fail the bosses get promoted and given more workers for their next project.

Joe Armstrong