Introduction to the Fail fast! Principle in Software Development

Christian Neumanns

2016-10-25

Abstract

This article introduces the Fail fast! principle. What is it? When should we use it? How does it help us write better code?


Whenever an error occurs in a running software application, there are typically three possible error-handling approaches:

  1. The Ignore! approach: the error is ignored and the application continues execution

  2. The Fail fast! approach: the application stops immediately and reports an error

  3. The Fail safe! approach: the application acknowledges the error and continues execution in the best possible way
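To make the three approaches concrete, here is a minimal Python sketch that applies each of them to the same error: a quantity typed as "1O" (with the letter O) instead of "10". The function names and the "last known good value" fallback are assumptions made for illustration only; they are not part of any particular application.

```python
from typing import Optional, Tuple


def parse_quantity_ignore(text: str) -> int:
    """Ignore!: swallow the error and continue with an arbitrary value."""
    try:
        return int(text)
    except ValueError:
        return 0  # the caller never learns that something went wrong


def parse_quantity_fail_fast(text: str) -> int:
    """Fail fast!: stop immediately and report the error."""
    try:
        return int(text)
    except ValueError as error:
        raise ValueError(f"invalid quantity: {text!r}") from error


def parse_quantity_fail_safe(
    text: str, last_known_good: int
) -> Tuple[int, Optional[str]]:
    """Fail safe!: acknowledge the error and continue as well as possible."""
    try:
        return int(text), None
    except ValueError:
        warning = f"invalid quantity {text!r}; using {last_known_good} instead"
        return last_known_good, warning
```

Note that the Ignore! and Fail safe! variants both continue execution. The difference is that the Fail safe! variant makes the error visible to its caller and falls back to a deliberately chosen value.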

Which approach is the best one?

Which one should you apply in your application?

Before answering this vital question, let us first look at a simple example.

Suppose we have to write a (rudimentary) web application that displays a warning message near a water fountain to warn people that the water is polluted.

The following HTML code does the job:

<html>
   <body>
      <h2 style="color:red;">Important!</h2>
      <p>Please <b>DO NOT</b> drink this water!</p>
   </body>
</html>

The result displayed in the browser looks like this:

[Image: Fail fast principle: no problem]

Now let us insert a small bug. Instead of </b> we write <b> after DO NOT, as shown below:

<p>Please <b>DO NOT<b> drink this water!</p>

Two interesting questions arise:

  1. What should happen?

  2. What will happen?

The second question is easy to answer. We just have to feed the buggy HTML code to our browser. This is the result - as displayed in Chrome, Edge, Firefox, Internet Explorer, and Safari at the time of writing:

[Image: Fail fast principle: no panic]

Before reading on, ask yourself: "Which approach has been applied by the browsers?" ...

Obviously, the Fail fast! approach has not been applied because the application continued and did not report an error. The only noticeable difference is that more text is now displayed in bold. But the message as a whole is still displayed correctly and people are warned. Nothing to worry about too much!

Let’s try another bug. Instead of <b> we write <b before DO NOT, as shown below:

<p>Please <b DO NOT</b> drink this water!</p>

This is the result - again as displayed in the browsers mentioned before:

[Image: Fail fast principle: panic]

Panic! Now the program does exactly the opposite of what it is supposed to do. The consequences are terrible. Our life-saving application has mutated into a killer application (but not the kind of killer application we all dream of writing one day).

It is important to be aware that the above example is not just a theoretical, exaggerated one. There are a good number of real-life cases of ‘little bugs’ having catastrophic consequences, such as the Mariner 1 spacecraft, which was destroyed shortly after lift-off, reportedly due to a ‘missing hyphen’. For more examples, see: List of software bugs.

As we can see from the above example, the consequences of not applying the Fail fast! approach vary widely and can range from completely harmless to extremely harmful.

Note
Unless we look at the browsers' source code, we don't know whether the Ignore! or Fail safe! approach is applied. My guess is that the tokens DO and NOT in the HTML code are interpreted as attributes of tag b, but without values (such as DO="foo"), and without being part of standard HTML. Therefore they are ignored. The closing </b> is probably also simply ignored, because "drink this water" is displayed in bold.

So, what is the correct answer to the important question "What should happen?"

Well, it depends on the situation. There are, however, some general rules.


The first rule is:

This rule is well known and doesn't need any further explanation.

Remember rule 6 of The Ten Commandments for C Programmers, eloquently written in old English by Henry Spencer:


The second rule is:

The rationale behind this rule is easy to understand:

Failing fast is commonly considered good practice in software development. Here are a few supporting quotes:


However, the situation can change radically when the application runs in production mode. Unfortunately, there is no one-size-fits-all rule. Practice shows that it is generally better to apply the Fail fast! approach by default here too. The final damage resulting from an application that ignores an error and just continues arbitrarily is generally worse than the damage caused by an application that stops suddenly. For example, if an accounting application stops working suddenly, the user is angry. But if it silently ignores an error, continues, and produces wrong results (such as an unbalanced balance sheet), the user is very angry. ‘Angry’ is better than ‘very angry’. Therefore, in this case the Fail fast! approach is better.
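The accounting case can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the ledger is a plain list, and the names `post_entry` and `UnbalancedEntryError` are invented for this example.

```python
from decimal import Decimal


class UnbalancedEntryError(Exception):
    """Raised when the debits and credits of a journal entry differ."""


def post_entry(ledger: list, debits: list, credits: list) -> None:
    """Append a journal entry to the ledger, but only if it balances."""
    total_debits = sum(debits, Decimal("0"))
    total_credits = sum(credits, Decimal("0"))
    if total_debits != total_credits:
        # Fail fast: refuse to record data that would corrupt the books.
        raise UnbalancedEntryError(
            f"debits {total_debits} != credits {total_credits}"
        )
    ledger.append((tuple(debits), tuple(credits)))
```

With this check the application stops at the first unbalanced entry (the user is 'angry'), instead of producing an unbalanced balance sheet much later (the user is 'very angry').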

In our previous HTML example, the Fail fast! approach would also be much better. Suppose that, instead of continuing execution, the browsers displayed an error message. Then the developers would immediately become aware of the problem and the code could be fixed quickly and easily, without causing any harm. And even if the buggy code went into production (for strange reasons), the worst-case scenario would be less terrible. Displaying "Please drink this water" can be dreadful. On the other hand, not displaying any message, or just displaying an (incomprehensible) error message, would probably just result in a very low percentage of people daring to taste a small quantity of water.
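Browsers do not behave this way, but the idea can be sketched by feeding the same markup to a strict parser. The example below uses the XML parser from Python's standard library as a stand-in for a hypothetical strict browser. Treating the fragment as XML is an assumption made for illustration (real HTML is not XML), and `render_strict` is an invented name.

```python
import xml.etree.ElementTree as ET

CORRECT = "<p>Please <b>DO NOT</b> drink this water!</p>"
BUG_1 = "<p>Please <b>DO NOT<b> drink this water!</p>"
BUG_2 = "<p>Please <b DO NOT</b> drink this water!</p>"


def render_strict(markup: str) -> str:
    """Return the visible text, or raise ET.ParseError if the markup is malformed."""
    root = ET.fromstring(markup)  # fails fast on any syntax error
    return "".join(root.itertext())


def check(markup: str) -> str:
    """Report what a strict renderer would do with the given markup."""
    try:
        return "OK: " + render_strict(markup)
    except ET.ParseError as error:
        return f"REJECTED: {error}"
```

Here `check(CORRECT)` returns the text of the warning, whereas `check(BUG_1)` and `check(BUG_2)` both return a message starting with "REJECTED". Neither bug can silently turn the warning into an invitation to drink.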

In practice, each case must sometimes be studied individually and carefully. This is especially true if the greatest possible damage is high, such as in medical applications, money transfer applications or space invader applications. For example, applying the Fail fast! rule is obviously the right approach as long as a rocket to Mars has not yet taken off. But as soon as the rocket has lifted off, stopping the application (or, even worse, ignoring an error) is no longer an option. Now the Fail safe! approach must be applied in order to do the best we can.

Sometimes a good option is to fail fast, but minimize the damage. For example, if a run-time error occurs in a text editor application, the application should first automatically save the current text in a temporary file, then display a meaningful message to the user ("Sorry, ... but your current text is saved in a temporary file abc.tmp"), optionally send an error report to the developers, and then stop.
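A minimal Python sketch of this "save, report, stop" sequence could look as follows. The function names are assumptions made for illustration, an editing operation is modelled as a plain function from text to text, and sending an error report is left out.

```python
import tempfile


def emergency_save(text: str) -> str:
    """Write the unsaved text to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".tmp", delete=False, encoding="utf-8"
    ) as backup:
        backup.write(text)
        return backup.name


def run_editor_step(current_text: str, step) -> str:
    """Run one editing operation; on an unexpected error, save and stop."""
    try:
        return step(current_text)
    except Exception as error:
        path = emergency_save(current_text)
        # Re-raise so that the application stops instead of continuing
        # in an unknown state; the message tells the user where the text is.
        raise RuntimeError(
            "Sorry, an unexpected error occurred, but your current text "
            f"is saved in the temporary file {path}"
        ) from error
```

The application still fails fast, because the error is re-raised and not swallowed, but the user's work survives the crash.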


Hence, the third rule:


To summarize:

The same idea is expressed by the excellent Rule of Repair in The Art of Unix Programming, written by Eric Steven Raymond:


Note
Further information and examples are available in Wikipedia under Fail fast, Fail safe, and Fault tolerant computer system.

In any case, it is always helpful to use a development environment that supports the Fail fast! principle. For example, a compiled language supports the Fail fast! rule because compilers can immediately report a whole plethora of bugs. Here is an example of a stupid bug that easily escapes the human eye and can lead to 'unwanted surprises', such as a hanging system due to an infinite loop:

var row_index = 1
...
row_indx = row_index + 1

Typos like this (i.e. writing row_indx instead of row_index) are common and are immediately caught by any decent compiler or (even better) by an intelligent IDE.


Luckily there are a good number of very effective Fail fast! features that can be natively built into a programming language. They all rely on the following rule:

Examples of powerful Fail fast! language features are: static and semantic typing, compile-time null-safety (no null pointer errors at run-time!), design by contract, generic type parameters, integrated unit testing, etc.
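Python, used for the sketches in this article, has no built-in design by contract, so the following example only emulates the idea with explicit checks. The `withdraw` function and its rules are invented for illustration.

```python
def withdraw(balance: int, amount: int) -> int:
    """Return the new balance (in cents) after withdrawing `amount` cents."""
    # Preconditions: reject invalid input at the function boundary.
    if amount <= 0:
        raise ValueError(f"amount must be positive, got {amount}")
    if amount > balance:
        raise ValueError(f"amount {amount} exceeds balance {balance}")

    new_balance = balance - amount

    # Postcondition: a check on our own logic, not on the caller's input.
    assert 0 <= new_balance < balance, "postcondition violated"
    return new_balance
```

A call such as `withdraw(1000, -50)` fails at the boundary with a clear message, instead of quietly increasing the balance. Note that Python skips `assert` statements when run with the `-O` option, which is one reason why contracts that are part of the language itself are more dependable than this emulation.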

Even better than detecting errors early is to not allow them by design. This can be achieved if the programming language doesn't support error-prone programming techniques such as global mutable data, implicit type conversions, silently ignored arithmetic overflow errors, truthiness (e.g. "", 0 and null are treated as false), etc.
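Truthiness is easy to demonstrate in Python, which treats "", 0 and None as false in a condition. In the sketch below (all names are hypothetical), a discount of 0 percent is a legitimate value, but the first function confuses it with "no discount given".

```python
from typing import Optional

DEFAULT_DISCOUNT_PERCENT = 10


def price_with_truthiness(price_cents: int, discount: Optional[int]) -> int:
    # Bug: 0 is falsy, so an explicit discount of 0 percent is silently
    # replaced by the default discount.
    if not discount:
        discount = DEFAULT_DISCOUNT_PERCENT
    return price_cents * (100 - discount) // 100


def price_with_explicit_check(price_cents: int, discount: Optional[int]) -> int:
    # Only a missing value (None) triggers the default.
    if discount is None:
        discount = DEFAULT_DISCOUNT_PERCENT
    return price_cents * (100 - discount) // 100
```

For a price of 1000 cents and a discount of 0, the first function returns 900 and the second returns 1000. No error is reported in either case, which is exactly why a language that rejects a non-boolean condition at compile time prevents this class of bug by design.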


Therefore, we should always prefer a programming environment (= programming language + libraries + frameworks + tools) that supports the Fail fast! principle. We will debug less and produce more reliable and safer code in less time.