Fran Iglesias

Posted on Sep 9, 2018 • Originally published at franiglesias.github.io on Aug 19, 2018

Testing in context

#testing

Software quality assurance is a very broad discipline and, until now, I've been blogging only about unit testing, which is just one of the levels or kinds of testing we can use.

So, in this post, I'd like to take a more general look and put things in context.

Test taxonomy

Software tests can be grouped into two broad categories:

Functional tests
Non-functional tests

Functional tests

Functional tests verify that the software does what it was designed to do. For instance, an application for selling second-hand products should let its users buy and sell second-hand items and perform all the related tasks, such as registering products with their prices, descriptions and photos, contacting sellers and buyers, managing payments, and so on.

This can be verified at different levels:

Unit tests: they exercise the basic units in which code is organized, working in isolation.
Integration tests: they exercise sets of related units to verify that their relationships work as intended.
Acceptance tests: they exercise the entry and output points of the software to verify that its behavior is the one defined by the stakeholders (the people interested in using it).

These three levels together make up the so-called "test pyramid", which we'll discuss later.

We also consider the following amongst the functional tests:

Regression tests: tests that detect when changes in the code lead to undesired behavior. We could say that every test becomes a regression test once it passes, because it will fail if our changes to the code alter the expected behavior.
Characterization tests: tests we write when software has no other tests. Normally we create them by running the software under certain conditions and recording the results. We use them as a safety net when working with legacy code, so that we can make changes and write better tests.
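To make the idea of a characterization test more concrete, here is a minimal sketch in Python. The function `legacy_shipping_cost` and its rules are hypothetical, invented only to play the role of untested legacy code. The expected values were obtained by running that code, not from any specification.

```python
def legacy_shipping_cost(weight_kg, express):
    # Hypothetical legacy function whose rules nobody documented.
    cost = 4.0 + 1.5 * weight_kg
    if express:
        cost *= 2
    if weight_kg > 10:
        cost += 7
    return round(cost, 2)


# Outputs recorded by running the current code once. They describe what
# the code does today, not necessarily what it should do.
RECORDED_BEHAVIOR = {
    (1, False): 5.5,
    (1, True): 11.0,
    (12, False): 29.0,
    (12, True): 51.0,
}


def test_characterization():
    for (weight_kg, express), expected in RECORDED_BEHAVIOR.items():
        assert legacy_shipping_cost(weight_kg, express) == expected
```

If a refactoring changes any of these recorded results, the test fails and we know that we altered the existing behavior, whether or not that behavior was correct.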

Test Driven Development can be practiced at all three levels. The basic idea is that we write the tests first and then write the code that makes them pass. Once they pass, and after we remove redundancies, those tests become regression tests.

Non-functional tests

Non-functional tests refer to how the software works. Apart from what a feature does, software must also offer reliability, performance, responsiveness, low latency, etc. These qualities cut across all kinds of applications and programs and can be measured in several ways. Such tests check things like (among others):

Speed: it tells us whether the software returns results in the desired time.
Load: it tells us whether the software can support a certain workload, which could be measured in ways such as simultaneous connections, the amount of data that can be handled at the same time, etc.
Recovery: it tells us whether a system is able to recover properly after a failure.
Fault tolerance: it tells us whether a system reacts well when a system it depends on fails.

In some cases, (functional) acceptance tests can help us roughly check some aspects that belong to non-functional testing. For instance, how our system reacts if it cannot access information in another system.

Non-functional tests aim to check that the system operates within certain limits and that it reacts properly to circumstances outside its control.

These tests follow the same pattern as functional tests:

We define a scenario or known initial system state (Given).
We execute some action on the system (When).
We observe the system's response to see whether it matches the expected one (Then).

For example, we could decide that a particular page must be ready for the user to interact with in less than a second at a specific network speed (or define a table of network conditions and desired response times). So:

We put the system in a known state assuming certain conditions.
We measure the time the page takes to be ready to accept input.
We check that the time is below the desired limit.
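The three steps above can be sketched as a small timing test. This is only an illustration: `render_page` is a hypothetical stand-in for the real operation, and the one-second budget is an assumed threshold. A real speed test would measure the deployed page under controlled network conditions.

```python
import time


def render_page(items):
    # Hypothetical stand-in for the operation whose speed we care about.
    return "".join(f"<li>{item}</li>" for item in items)


def test_page_is_ready_within_budget():
    # Given: a known initial state and an assumed time budget.
    items = list(range(1000))
    budget_seconds = 1.0

    # When: we perform the action and measure how long it takes.
    start = time.perf_counter()
    html = render_page(items)
    elapsed = time.perf_counter() - start

    # Then: the result is usable and the time is below the limit.
    assert html.startswith("<li>0</li>")
    assert elapsed < budget_seconds
```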

The (functional) test pyramid

The functional test pyramid is a heuristic for deciding how much testing we do at each level.

The idea is:

Unit tests: they are at the base of the pyramid and we should have lots of them.
Integration tests: they are in the middle of the pyramid and we should have fewer.
Acceptance tests: they are at the top of the pyramid, so we should have only a few.

Putting it another way:

Given a feature of our software, we should have:

A few acceptance tests, covering the scenarios defined by stakeholders.
A greater number of integration tests, ensuring that the system's components interact as intended and know how to react when other components fail.
A huge number of unit tests, proving that software units do what they should and can handle different input conditions, reacting properly to invalid inputs.

But why?

Lots of unit tests

A unit test, being focused on a single, isolated software unit, should be:

Easy to write: a software unit should handle relatively few use cases and react to problems in a simple way, such as throwing exceptions.
Fast: if the unit has dependencies, they should be replaced (for test purposes) by simpler, cheaper test doubles, allowing the tests to run fast.
Repeatable: we can run the tests as many times as we need and get the same results as long as the behavior of the software unit has not changed.

These conditions allow us to do things such as:

Run the tests every time we change the software, getting immediate feedback if any behavior has been altered.
Diagnose problems immediately: since every test exercises one concrete piece of software in a specific scenario, a failing test tells us what the problem is and where it happened.

For these reasons, the logical thing is to have many unit tests that can be run quickly many times a day: when we fix a bug, when we introduce a change, when we commit, when we deploy, etc.

Is more better? As with other things in life, more is not always better. What we want is for the tests we apply to a software unit to cover all the relevant use cases, so that when one of them fails we know exactly which change caused the problem.

In unit tests, the behavior of other participating software units is simulated with test doubles, which just return preprogrammed responses. Because the collaborators' behavior is kept under control, the output of the software unit under test can only be attributed to its own behavior.
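As a minimal sketch of this idea, the following unit tests replace a collaborator with a stub that returns a preprogrammed value. The names `PriceConverter` and `FixedRateStub` are hypothetical and only serve the example.

```python
class FixedRateStub:
    """Test double: always returns the same preprogrammed rate."""

    def __init__(self, rate):
        self._rate = rate

    def rate_for(self, currency):
        return self._rate


class PriceConverter:
    """Unit under test; it depends on a collaborator that supplies rates."""

    def __init__(self, rate_provider):
        self._rate_provider = rate_provider

    def convert(self, amount, currency):
        if amount < 0:
            raise ValueError("amount must not be negative")
        return round(amount * self._rate_provider.rate_for(currency), 2)


def test_converts_using_the_provided_rate():
    converter = PriceConverter(FixedRateStub(1.25))
    assert converter.convert(10, "USD") == 12.5


def test_rejects_negative_amounts():
    converter = PriceConverter(FixedRateStub(1.25))
    try:
        converter.convert(-1, "USD")
    except ValueError:
        return
    raise AssertionError("expected ValueError for a negative amount")
```

Since the stub cannot fail or change its answer, a failure in these tests can only come from `PriceConverter` itself.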

Moderate quantity of integration tests

Integration tests exercise several software units related to a given process, isolated from the rest of the application.

In this case we don't simulate the behavior of those units with doubles, since our goal is to check that the units interact in the expected way. We may still have to mock the behavior of external systems (for instance, an API that provides us with data). And, of course, we use data crafted for the test environment.

The number of use cases grows geometrically with the number of units involved.
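A quick way to see this growth is to count combinations. The sketch below assumes, purely for illustration, that each unit can be in one of three situations; the real number depends on each unit.

```python
from itertools import product

# Assumption for illustration: every unit can be in one of three situations.
SITUATIONS = ("valid input", "invalid input", "failure")


def combined_scenarios(number_of_units):
    """Return every combination of situations across the given number of units."""
    return list(product(SITUATIONS, repeat=number_of_units))


for units in (1, 2, 3, 4):
    print(units, "unit(s):", len(combined_scenarios(units)), "combinations")
# 1 unit(s): 3 combinations
# 2 unit(s): 9 combinations
# 3 unit(s): 27 combinations
# 4 unit(s): 81 combinations
```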

The problem is that, besides the growing number of cases, integration tests are slower than unit tests, so we must take a different approach.

An integration test doesn't have to verify that each software unit does its job; we have already verified that with our unit tests. It verifies the behavior of the software units working as a whole, especially the cases where one of the units fails for some reason, so we can ensure that the others are able to handle the situation.
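Here is a minimal sketch of what such a test could look like. Two real units, `RateProvider` and `PriceConverter`, work together, and only the external API is faked. All the names are hypothetical, and the fallback rules are assumptions made for the example.

```python
class ExternalRatesUnavailable(Exception):
    """Raised when the (faked) external rates API cannot be reached."""


class FakeRatesApi:
    """Stands in for an external system; it is not one of our units."""

    def __init__(self, rates=None, down=False):
        self.rates = rates or {}
        self.down = down

    def fetch(self, currency):
        if self.down:
            raise ExternalRatesUnavailable(currency)
        return self.rates[currency]


class RateProvider:
    """Real unit: asks the API and remembers the last known rate."""

    def __init__(self, api):
        self._api = api
        self._last_known = {}

    def rate_for(self, currency):
        try:
            rate = self._api.fetch(currency)
        except ExternalRatesUnavailable:
            if currency in self._last_known:
                return self._last_known[currency]
            raise
        self._last_known[currency] = rate
        return rate


class PriceConverter:
    """Real unit: converts prices, or returns None if no rate is available."""

    def __init__(self, rate_provider):
        self._rate_provider = rate_provider

    def convert(self, amount, currency):
        try:
            rate = self._rate_provider.rate_for(currency)
        except ExternalRatesUnavailable:
            return None
        return round(amount * rate, 2)


def test_units_cooperate_when_the_api_works():
    converter = PriceConverter(RateProvider(FakeRatesApi({"USD": 1.25})))
    assert converter.convert(10, "USD") == 12.5


def test_converter_copes_when_the_api_is_down():
    converter = PriceConverter(RateProvider(FakeRatesApi(down=True)))
    assert converter.convert(10, "USD") is None


def test_last_known_rate_is_reused_during_an_outage():
    api = FakeRatesApi({"USD": 1.25})
    converter = PriceConverter(RateProvider(api))
    converter.convert(10, "USD")  # first call stores the rate
    api.down = True
    assert converter.convert(20, "USD") == 25.0
```

Note that the tests don't recheck the arithmetic of each unit in detail. They focus on how the units behave together, especially when something they depend on fails.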

Only a few acceptance tests

Acceptance tests check the system from the users' or stakeholders' point of view, so they exercise all the system components in action. In acceptance tests we don't simulate any of the system components, except when we need to communicate with external systems or mimic some infrastructure conditions, which we fake.

Our acceptance tests should run in a dedicated environment, identical to production.

In general, with acceptance tests we want to verify scenarios that are significant for stakeholders. For example:

A user wants to sign up for a service and provides all the required data, so she should receive a confirmation that she has been signed up.
A user wants to sign up for a service but doesn't provide the required data, or the data is invalid, so he should be told what data to correct and that the process has been aborted.
A user who wants to sign up for a service should receive adequate information if a system failure prevents the process from finishing.

Most acceptance tests can be designed with the following model in mind:

Correct Input from user + Correct System -> Correct output from the system and process effectively done
Invalid Input from user + Correct System -> Informative output from the system
Correct Input from user + Failing System -> Informative output from the system
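The three cases of this model can be sketched with the sign-up example. A real acceptance test would drive the deployed system through its public interface. Here a small in-process `SignUpService` (hypothetical, like the validation rules and messages) stands in for the whole system so the example stays runnable, and only the external mailer is faked.

```python
class MailerDown(Exception):
    """Signals that the (faked) confirmation mailer is unavailable."""


class SignUpService:
    """Hypothetical entry point standing in for the whole system."""

    def __init__(self, send_confirmation):
        self._send_confirmation = send_confirmation
        self.users = set()

    def sign_up(self, email, password):
        invalid = []
        if "@" not in email:
            invalid.append("email")
        if len(password) < 8:
            invalid.append("password")
        if invalid:
            return {"status": "rejected", "fix": invalid}
        try:
            self._send_confirmation(email)
        except MailerDown:
            return {"status": "failed", "message": "Please try again later."}
        self.users.add(email)
        return {"status": "confirmed"}


def working_mailer(email):
    pass  # pretends the confirmation was sent


def broken_mailer(email):
    raise MailerDown()


def test_correct_input_and_correct_system():
    service = SignUpService(working_mailer)
    assert service.sign_up("ana@example.com", "s3cretpass") == {"status": "confirmed"}
    assert "ana@example.com" in service.users


def test_invalid_input_and_correct_system():
    service = SignUpService(working_mailer)
    result = service.sign_up("not-an-email", "short")
    assert result == {"status": "rejected", "fix": ["email", "password"]}
    assert service.users == set()


def test_correct_input_and_failing_system():
    service = SignUpService(broken_mailer)
    result = service.sign_up("ana@example.com", "s3cretpass")
    assert result["status"] == "failed"
    assert service.users == set()
```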

Obviously, many processes have a variety of scenarios to take into consideration, which increases the number of tests needed. Nevertheless, that number will be lower than the number of use case and condition combinations covered by the unit tests of the software units involved.

Acceptance tests can be written in the Gherkin language, a structured way to define features and scenarios in natural language. This way, stakeholders or the product owner can help define them, working together with developers. After that, they are translated into a programming language using tools like JBehave, Cucumber or Behat.

Usefulness of the test pyramid

The first use of the test pyramid is to help us decide how many tests we need at each of its levels. It is only a heuristic, though: the proportions between the three levels also matter, and it is not easy to choose the right ones.

The test pyramid gives us three levels of resolution when we want to analyze the behavior of our application.

The acceptance level allows us to observe the processes of the application as a whole.
The integration level allows us to observe the processes in subsets of the software; failures in these tests point to bad communication between units.
The unit level allows us to observe individual software units; failures at this level allow us to diagnose our algorithms.

Ideally, with a good proportion of tests, a test that fails at the acceptance level will be reflected in failing tests at the other levels.

Failing tests at the integration level, but not at the unit level, indicate that some software units are not communicating properly, for instance because they are sharing information in an invalid format.
Failing tests at the unit level will be reflected at the integration level, and they indicate that some software unit is not working properly.

Sometimes the absence of failures is more informative:

A failing acceptance test that is not reflected in the other layers tells us that we lack tests in those layers. Accordingly, we should write new tests to cover that functionality at the integration and unit levels.
Failing unit tests that are not reflected in the other layers tell us that we lack tests there. Failing unit tests should be reflected in failing integration and acceptance tests.

The test pyramid also helps us keep test execution time at a practical level:

Acceptance tests are very slow. If we have relatively few (provided they are sufficient, of course), they will run in the shortest possible time and we will be able to launch them automatically before each deploy. If we need a huge number of acceptance tests, we should run them as a separate periodic task.
Integration tests are fairly fast. If we keep their number at an appropriate level, we can run them automatically on each pull request and deploy.
Unit tests are very fast, so we can run them on each commit.

For this structure to be truly effective, we would have to make sure that a failure at one level is reflected in the other two.

Obviously, we can optimize test execution with tools that help us run only the tests affected by the latest changes.

Smells in the pyramid of functional tests

By observing the proportion between the tests at the three levels, we can diagnose whether our test suite is well proportioned. If it is not, the goal should be to increase the number of tests at the levels that need it, and to review the levels that could have redundant tests.

In general, an excessive number of acceptance tests relative to unit tests allows us to detect failures, but not to diagnose them accurately.

On the other hand, too many unit tests with few acceptance tests will probably overlook many errors that will only be revealed in production.

Inverted pyramid

The inverted pyramid indicates that there are few unit tests and many acceptance tests.

Many acceptance tests could indicate that unnecessary scenarios are being tested, or that we are trying to verify the operation of specific units of the system from the outside, making up for missing unit tests with acceptance tests.

Having few unit tests makes it difficult or impossible to determine where the problems are when higher-level tests fail.

Crushed pyramid

The crushed pyramid indicates that there are too few acceptance tests relative to unit tests. Assuming that unit test coverage is adequate, what this smell is telling us is that we need more integration and acceptance tests.

In this situation the tests do not tell us much about how the application behaves as a whole, and we are probably relying too much on manual tests, which are less reliable and less accurate. Consequently, we will not be able to identify problems caused by poor communication between software units.

Diabolo shape

This shape indicates that we have few integration tests and that the acceptance tests are doing their job. We should analyze the acceptance tests and move some of them to the integration level.

If a test fails at the acceptance level and no unit tests fail, we cannot know which interaction between our units is malfunctioning.

Diamond shape

The diamond shape tells us that the integration tests are doing the work of the unit tests. A failure at this level does not make clear whether it is due to an integration problem or to a problem in a particular software unit.

The solution is to create more unit tests that help us to discriminate better.

Square shape

If instead of a pyramid we have a shape similar to a square or rectangle, we have a similar number of tests at each level. This indicates that either we have few unit tests or we have too many integration and acceptance tests that are probably doing lower-level work.
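As a rough illustration of these smells, the following sketch names the shape suggested by the number of tests at each level. The function `pyramid_shape` and its two thresholds are arbitrary assumptions made for the example, not established rules. The shape is only a hint about where to look.

```python
def pyramid_shape(unit, integration, acceptance, tolerance=0.2, crushed_ratio=0.05):
    """Name the shape suggested by the test counts (thresholds are arbitrary)."""

    def similar(a, b):
        # Two counts are "similar" if they differ by at most `tolerance`.
        return abs(a - b) <= tolerance * max(a, b)

    if similar(unit, integration) and similar(integration, acceptance):
        return "square"
    if integration + acceptance < crushed_ratio * unit:
        return "crushed pyramid"
    if unit > integration > acceptance:
        return "pyramid"
    if unit < integration < acceptance:
        return "inverted pyramid"
    if integration < unit and integration < acceptance:
        return "diabolo"
    if integration > unit and integration > acceptance:
        return "diamond"
    return "unclear"


print(pyramid_shape(500, 100, 20))   # pyramid
print(pyramid_shape(20, 100, 500))   # inverted pyramid
print(pyramid_shape(1000, 20, 5))    # crushed pyramid
print(pyramid_shape(500, 10, 200))   # diabolo
print(pyramid_shape(50, 300, 40))    # diamond
print(pyramid_shape(100, 95, 90))    # square
```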
