"The terms 'unit test' and 'integration test' have always been rather murky, even by the slippery standards of most software terminology."
-- Martin Fowler
"...there is no such thing as a Unit Test"
-- Michael Belivanakis
What is a Unit Test?
I typed the above question into a popular search engine, and the first three results I got back were as follows (emphasis mine):
A unit test is a block of code that verifies the accuracy of a smaller, isolated block of application code, typically a function or method. -- aws.amazon.com
A unit test is a way of testing a unit - the smallest piece of code that can be logically isolated in a system. In most programming languages, that is a function, a subroutine, a method or property. -- smartbear.com
Unit is defined as a single behaviour exhibited by the system under test (SUT), usually corresponding to a requirement. While it may imply that it is a function or a module (in procedural programming) or a method or a class (in object-oriented programming) it does not mean functions/methods, modules or classes always correspond to units. From the system-requirements perspective only the perimeter of the system is relevant, thus only entry points to externally-visible system behaviours define units. -- Kent Beck via Wikipedia
A problem I often see with modern software testing is that we lean toward the second definition too often: a unit is "the smallest piece of code that can be logically isolated in a system". The word "can" is carrying a lot of weight in this sentence. We can logically isolate just about anything.
"Is a unit a file?"
-- "Definitely not", I can hear you say.
"Assuming an object-oriented program, how about a class?"
-- "Probably not", you say with slightly less conviction.
"How about a method?"
-- "Probably", with a bit more confidence, agreeing with the first two results above.
"What if that method is 300 lines of code?"
-- "Ooh, yeah, you should probably break that out into smaller methods."
Suppose we do this. Let's break our 300-line method into, say, 10 methods of 30 lines each, since some CS professors seem to teach their students that 30 lines is a good rule of thumb for function length.
```scala
// before
def original(x: String, y: Int): Boolean = {
// ...
// hundreds of lines of code
// ...
}
// after
def improved(x: String, y: Int): Boolean = {
val intermediateValueA = a(x)
val intermediateValueB = b(intermediateValueA, y)
val intermediateValueC = c(intermediateValueB)
// ...
val intermediateValueH = h(intermediateValueG)
val intermediateValueI = i(intermediateValueH)
j(intermediateValueI)
}
private def a(x: String): Long = {
// ...
}
private def b(z: Long, y: Int): Double = {
// ...
}
// eight more private functions...
```
All of these methods could be `private` (or whatever your language's equivalent of that is). In that case, they can only be accessed by the class which contains them. They used to be all in one method anyway, so we can be sure nobody else is using this logic.
But now we face another decision: should we write unit tests for all of these individual methods? For many of us, our gut reaction will be "yes". This could make refactoring more difficult in the future, though, because "the closer your tests are to the implementation the more sensitive they are to changes".
Anecdotally, I've worked on codebases with hundreds of tests like this, all tightly coupled to the production implementation. Adding one field to a class meant updating a hundred or more tests which didn't care about this field at all, but needed the new field in order to compile. The test changes regularly took longer to implement than the production changes.
Writing unit tests for these smaller methods might also require us to make them more public than they need to be. The world outside this class doesn't care about these individual methods; all it cares about is the one `improved` method, which now ties them all together.
The real question is, are these functions "units" of code?
The answer is no.
As Kent Beck would say, these are not "entry points to externally-visible system behaviours".
The only externally-visible entry point in the refactored example above is the `improved` function, just as the `original` function was initially. But these pervasive ideas that...
- large functions should be broken up into smaller ones, and
- a "unit test" is a "test of a single function"
...combine to produce a result that is much worse than the sum of its parts: huge suites of tests, tightly coupled to unnecessarily public production code, that take too long to write and are difficult to maintain.
Outcomes like this lead many developers to believe things like...
"Most Unit Testing is Waste"
-- James O. Coplien
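By way of contrast, here is a small, hypothetical Scala sketch of testing only at the entry point. The `Slug` object and its helpers are invented for illustration, and plain `assert` stands in for whatever test framework you use. The helpers stay private, and the tests call only the one public method.

```scala
// Hypothetical example: one public entry point built from private helpers.
object Slug {
  // The only externally-visible behaviour: turn a title into a URL slug.
  def slugify(title: String): String =
    joinWithHyphens(words(stripPunctuation(title.toLowerCase)))

  // Private helpers: implementation details the tests never call directly.
  private def stripPunctuation(s: String): String =
    s.filter(c => c.isLetterOrDigit || c.isWhitespace)

  private def words(s: String): Seq[String] =
    s.split("\\s+").toSeq.filter(_.nonEmpty)

  private def joinWithHyphens(ws: Seq[String]): String =
    ws.mkString("-")
}

// Black-box tests: inputs in, observable output checked, nothing else.
object SlugTests {
  def run(): Unit = {
    assert(Slug.slugify("Hello, World!") == "hello-world")
    assert(Slug.slugify("  Most Unit Testing is Waste  ") == "most-unit-testing-is-waste")
    assert(Slug.slugify("!!!") == "")
  }
}
```

If the three helpers were later inlined back into `slugify`, or split into ten, none of these tests would need to change.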
Kinds of Tests
Rather than thinking of tests along the traditional unit / integration / end-to-end spectrum, I think it's helpful to think along a few other dimensions:
- is this test fast or slow?
- is this a black-box test or a white-box test?
- is this test informed by development or does it inform development?
Fast and Slow Tests
Let me start by asserting that "fast" is not synonymous with "good", and "slow" is not synonymous with "bad" in this context.
Fast tests are tests that run in a few seconds, milliseconds, microseconds, or less. Fast tests, therefore, must be entirely in-memory. They do no disk IO and they make no network calls. They can be run every single time a code change is made without being a roadblock to development speed, and should therefore be run as part of the developer's inner loop. Every time you compile, you can run these tests.
Slow tests take several seconds, minutes, or hours to run. The dividing line between fast and slow tests is somewhere around 2-5 seconds. Slow tests may require reading large input files from disk, doing lots of computation, or communicating across the network. That is, they are IO-, CPU-, or network-bound. Contract tests (which often spin up Docker containers) and performance tests (which may run gigabytes of data or thousands of requests through the system) are examples of slow tests. These tests should be run less regularly, as they can impact development speed. Running them before each commit to `main` / `master` is probably fine for tests shorter than a few minutes, while daily or weekly might be a good cadence for tests much longer than that.
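One way to keep slow tests out of the inner loop is to tag them. The sketch below is a hypothetical, framework-free runner. `Speed`, `TestCase`, `MiniRunner`, and the `RUN_SLOW_TESTS` environment variable are all made up for illustration. Most real test frameworks have their own tagging or filtering mechanism, so check your framework's documentation before rolling your own.

```scala
// Hypothetical speed tags.
sealed trait Speed
case object Fast extends Speed
case object Slow extends Speed

final case class TestCase(name: String, speed: Speed, body: () => Unit)

object MiniRunner {
  // Runs fast tests always, and slow tests only when asked; returns names run.
  def run(tests: Seq[TestCase], includeSlow: Boolean): Seq[String] = {
    val selected = tests.filter(t => t.speed == Fast || includeSlow)
    selected.foreach(_.body())
    selected.map(_.name)
  }
}

object SpeedTaggedSuite {
  val tests: Seq[TestCase] = Seq(
    TestCase("sorts a small list", Fast, () => assert(List(3, 1, 2).sorted == List(1, 2, 3))),
    // Placeholder body: imagine a round trip through a real database here.
    TestCase("round-trips through a real database", Slow, () => ())
  )

  // Inner loop: fast tests only. Before merging: set RUN_SLOW_TESTS=true (hypothetical flag).
  def main(args: Array[String]): Unit = {
    val includeSlow = sys.env.get("RUN_SLOW_TESTS").contains("true")
    MiniRunner.run(tests, includeSlow).foreach(println)
  }
}
```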
Black-Box and White-Box Tests
Black-box tests make no assumptions about the internals of the thing they are testing. They provide inputs and assert on observable outputs, and that's it. The observable output is usually a return value from a method, but a black-box test might instead assert that a side effect has occurred, like that a log line has been written, or that a metric has been recorded, or that some state has been mutated.
White-box tests specifically test the internals of the thing they are testing. They are introspective. Tests which have assertions like "when 'x' happens, function a() should call function b()" are white-box tests. They are explicitly testing how something should happen (how some code is implemented), rather than testing only that it has happened. Tests which rely heavily on mocking frameworks are often white-box tests, asserting that such-and-such a method has (or hasn't) been called in response to some inputs.
If you don't care about how something is implemented -- just that it does what it's supposed to do -- you should write a black-box test. This is usually the case, so opt for black-box tests as a default.
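Here is a hedged illustration: a black-box test and a white-box test of the same hypothetical `Checkout` class. `RateSource`, `FixedRates`, and `RecordingRates` are invented for this example. Both tests pass today, but only the white-box test is coupled to *how* `total` does its work.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical collaborator: looks up a currency conversion rate.
trait RateSource {
  def rate(from: String, to: String): BigDecimal
}

// Code under test: total a basket of prices (all in `from`) in currency `to`.
final class Checkout(rates: RateSource) {
  def total(prices: Seq[BigDecimal], from: String, to: String): BigDecimal = {
    val r = rates.rate(from, to)
    prices.map(_ * r).sum
  }
}

// Fixed-rate fake, used by the black-box test.
final class FixedRates(r: BigDecimal) extends RateSource {
  def rate(from: String, to: String): BigDecimal = r
}

// Spy that records every call, used by the white-box test.
final class RecordingRates(r: BigDecimal) extends RateSource {
  val calls = ListBuffer.empty[(String, String)]
  def rate(from: String, to: String): BigDecimal = { calls += ((from, to)); r }
}

object CheckoutTests {
  // Black-box: assert only on the observable output.
  def blackBox(): Unit = {
    val checkout = new Checkout(new FixedRates(BigDecimal(2)))
    assert(checkout.total(Seq(BigDecimal(1), BigDecimal("2.50")), "EUR", "USD") == BigDecimal(7))
  }

  // White-box: assert on how the answer was computed (exactly one rate lookup).
  // Refactoring `total` to look up the rate once per price breaks this test,
  // even though every total would still be correct.
  def whiteBox(): Unit = {
    val spy = new RecordingRates(BigDecimal(2))
    new Checkout(spy).total(Seq(BigDecimal(1), BigDecimal("2.50")), "EUR", "USD")
    assert(spy.calls.toList == List(("EUR", "USD")))
  }
}
```

The spy here is hand-rolled. `verify`-style assertions in a mocking framework couple a test to the implementation in the same way.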
Development-Informed Tests and Development-Informing Tests
Development-informed tests are written reactively, in that the production code is written first, and the tests are written afterward. Development-informed tests codify the behaviour of the system as-is. Traditional "unit tests" are almost exclusively development-informed tests.
Development-informing tests are written proactively, in that a test is written first, and the production code is written after. Test-Driven Development (TDD) is a software development methodology which encourages writing only development-informing tests, ensuring that 100% of the system's behaviour is always codified in tests.
Development-informing tests can also provide confidence that some tricky piece of logic has been implemented correctly. For example, you might write a regex to parse U.S. phone numbers, and -- at the same time -- add a handful of tests to ensure that you catch things like
- area codes surrounded by parentheses
- spaces vs. no spaces vs. hyphens
- the presence or absence of a `+1` country code
It can be hard to be sure -- just by staring at the regular expression -- that it catches all of these cases. Usually it's more convincing to write a handful of simple tests that show the most common edge cases are being handled correctly.
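Here is a minimal sketch of this in Scala. It assumes a simplified pattern: an optional `+1`, an area code with or without parentheses, and optional single spaces or hyphens between groups. It deliberately ignores real numbering rules, such as which digits an area code may start with. Treat it as an illustration, not a production-ready validator.

```scala
import scala.util.matching.Regex

object UsPhone {
  // Hypothetical, simplified pattern: optional "+1", then 3-3-4 digit groups.
  private val Pattern: Regex =
    """^(?:\+1[ -]?)?(?:\(\d{3}\)|\d{3})[ -]?\d{3}[ -]?\d{4}$""".r

  def isValid(s: String): Boolean = Pattern.pattern.matcher(s).matches()
}

object UsPhoneTests {
  def run(): Unit = {
    // area codes with and without parentheses
    assert(UsPhone.isValid("(555) 123-4567"))
    assert(UsPhone.isValid("555 123-4567"))
    // spaces vs. no spaces vs. hyphens
    assert(UsPhone.isValid("555 123 4567"))
    assert(UsPhone.isValid("5551234567"))
    assert(UsPhone.isValid("555-123-4567"))
    // with and without a +1 country code
    assert(UsPhone.isValid("+1 555-123-4567"))
    assert(UsPhone.isValid("+15551234567"))
    // inputs that should be rejected
    assert(!UsPhone.isValid("(555 123-4567"))   // unbalanced parenthesis
    assert(!UsPhone.isValid("555-1234"))        // too few digits
    assert(!UsPhone.isValid("+2 555 123 4567")) // wrong country code
  }
}
```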
I always write bug fix tests in a development-informing way, as well. First, I write a test which should pass, but which I expect to fail due to the presence of a bug. Then, I fix the bug in the production code, ensuring that the test now passes. This process shows that -- had the test existed originally -- it would have caught the bug. This gives confidence that the bug should not reappear in the future.
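Here is the same workflow, sketched with a hypothetical bug: a `median` function that returns the upper middle value for even-length input. The regression test takes the implementation as a parameter only so that both the "before" and "after" states fit in one snippet. In a real suite it would simply call the production function.

```scala
object Median {
  // Hypothetical bug: for even-length input this returns the upper middle value.
  // (Both versions assume non-empty input.)
  def buggy(xs: Seq[Int]): Double = {
    val s = xs.sorted
    s(s.size / 2).toDouble
  }

  // Fix: average the two middle values when the length is even.
  def fixed(xs: Seq[Int]): Double = {
    val s = xs.sorted
    val mid = s.size / 2
    if (s.size % 2 == 0) (s(mid - 1) + s(mid)) / 2.0 else s(mid).toDouble
  }
}

object MedianRegression {
  // Written before the fix: it should fail against `buggy` and pass against `fixed`.
  def passes(median: Seq[Int] => Double): Boolean =
    median(Seq(4, 1, 3, 2)) == 2.5 && median(Seq(3, 1, 2)) == 2.0
}
```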
"Most Unit Testing is Waste"
The three ways of looking at tests outlined above can provide insight into why developers like James O. Coplien believe that most unit testing is a waste of time.
Most Unit Tests are Development-Informed
In my experience, TDD is not practiced by most developers.
Most tests, therefore, are development-informed. A developer writes some production code and then writes a test, usually to ensure that some code coverage minimum is reached.
These tests are not written to catch bugs, and they are not written to help a developer think through some difficult implementation, and so their value is not immediately apparent.
Most Unit Tests do not test "Externally-Visible System Behaviours"
As mentioned earlier, the twin practices of (1) breaking large functions up into smaller ones and (2) writing tests for each function rather than for each externally-visible system behaviour lead to a proliferation of tests tightly coupled to the production implementation. These tests are, by their nature, fragile. They must be updated whenever the smallest implementation detail is changed, even if the externally-visible system behaviour is identical.
This often happens when using mocking frameworks, since every method called on a mocked object typically has to be stubbed, with its return value specified.
In the worst-case scenario, developers will sometimes copy-and-paste the production implementation directly into the test, asserting that the "expected" result from the test implementation equals the "actual" result from the production code. This kind of white-box test unquestionably adds no value, even if it does increase "code coverage".
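Here is a hedged illustration of the difference, using a hypothetical `Pricing.withTax` function (8% tax, rounded down to the cent). Both tests pass. Only the second one can catch a formula that was wrong from the start, because its expected values do not come from the production code.

```scala
object Pricing {
  // Hypothetical production code: add 8% sales tax, rounding down to the cent.
  def withTax(cents: Long): Long = cents * 108 / 100
}

object PricingTests {
  // Anti-pattern: "expected" is computed by a copy of the production formula,
  // so a wrong formula makes both sides wrong together and the test still passes.
  def tautological(): Unit = {
    val input = 1999L
    val expected = input * 108 / 100 // copied from Pricing.withTax
    assert(Pricing.withTax(input) == expected)
  }

  // Better: expected values worked out independently, by hand, from the spec.
  def knownAnswers(): Unit = {
    assert(Pricing.withTax(1000L) == 1080L) // 10.00 + 0.80
    assert(Pricing.withTax(1999L) == 2158L) // 19.99 * 1.08 = 21.5892, rounded down
  }
}
```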
A New Test Pyramid
The traditional test pyramid aims to emphasise to developers that they should be mostly writing "unit tests", with fewer "integration tests" and only a handful of "end-to-end tests". Although different formulations of the pyramid may use different terms for the latter two levels, almost all agree that the base of the pyramid should be composed of "unit tests". Google recommends a 70% / 20% / 10% split of unit / integration / end-to-end tests.
The idea is that you should cover most of the code's logic in small, fast tests which can be run over and over during the inner loop of development. Your integration tests should cover interactions between units; and your end-to-end tests should validate that an end user's actions result in some expected overall outcome.
That advice is fine, provided all developers agree on what constitutes a "unit test" or an "integration test". Clearly this is not the case. (See the search engine results at the top of this blog post.) However, we can use the objective criteria above (fast vs. slow, black-box vs. white-box, development-informed vs. development-informing) to construct a New Test Pyramid.
The Base
Opt for black-box tests wherever possible. Where external dependencies are required, prefer fake implementations rather than mocks (and add corresponding contract tests to ensure that each external dependency behaves as you think it does). This keeps the entire test in-memory, making it fast enough to run before each commit to `main` / `master`. You will find that most of your tests are these fast, black-box tests.
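Here is a minimal sketch of a fake plus a contract check. It assumes a hypothetical `UserRepository` port in front of some database; none of these names come from a real library. The fast test runs entirely in memory. The contract check is written once and could be run against both the fake and the real adapter, the latter in a slower suite.

```scala
import scala.collection.mutable

// Hypothetical port to an external dependency (e.g. a database).
trait UserRepository {
  def save(id: String, email: String): Unit
  def findEmail(id: String): Option[String]
}

// Fake: a real, working implementation that happens to live in memory.
final class InMemoryUserRepository extends UserRepository {
  private val rows = mutable.Map.empty[String, String]
  def save(id: String, email: String): Unit = rows.update(id, email)
  def findEmail(id: String): Option[String] = rows.get(id)
}

// Code under test: rejects obviously malformed addresses, stores a normalised copy.
final class Registration(repo: UserRepository) {
  def register(id: String, email: String): Either[String, Unit] =
    if (!email.contains("@")) Left(s"invalid email: $email")
    else Right(repo.save(id, email.trim.toLowerCase))
}

object RegistrationTests {
  // Fast and black-box: inputs in, observable state out, no call verification.
  def run(): Unit = {
    val repo = new InMemoryUserRepository
    val reg = new Registration(repo)
    assert(reg.register("u1", " Ada@Example.com ").isRight)
    assert(repo.findEmail("u1").contains("ada@example.com"))
    assert(reg.register("u2", "not-an-email").isLeft)
    assert(repo.findEmail("u2").isEmpty)
  }
}

// Contract: the same checks run against the fake and (in a slow suite) the real
// adapter, so the fake stays faithful. The overwrite rule is an assumed contract.
object UserRepositoryContract {
  def check(newRepo: () => UserRepository): Unit = {
    val repo = newRepo()
    assert(repo.findEmail("missing").isEmpty)
    repo.save("u1", "a@example.com")
    repo.save("u1", "b@example.com") // saving again overwrites
    assert(repo.findEmail("u1").contains("b@example.com"))
  }
}
```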
Note that this is not synonymous with "unit test". As discussed above, traditional "unit tests" are usually fast, but are sometimes white-box tests, and are often development-informed.
The Middle
Prefer development-informing tests over development-informed tests (prefer a TDD style of development). Development-informed tests are often written by rote and offer little value.
Prefer slow black-box tests over fast white-box tests. The former are easier to maintain, as they are less tightly coupled to the production implementation.
Traditional "integration tests" and "end-to-end" tests both fall into the "slow, black-box tests" category.
The Top
Write as few white-box tests as possible. That is, do as little code introspection as possible. Only test observable outputs.
Write development-informed tests only when necessary. If the production implementation works, it works. If it doesn't, you will find a bug, write a development-informing test, and fix the bug. This process is described above.
Conclusion
The traditional unit / integration / end-to-end categorization of tests is fuzzy at best. Differing interpretations of what constitutes a "unit test", combined with well-intentioned but misapplied advice on keeping code readable by reducing the number of lines per function, class, etc., have led to a proliferation of hard-to-maintain, low-value test suites that negatively impact developer productivity.
Categorizing tests objectively, using the three criteria described above, can lead to more maintainable tests which provide more value.
Top comments (33)
This is an interesting article, that covers a lot of ground -- which is great for people who are new to the pros and cons of unit (and other) testing.
But it also buys into several myths that can blind us to the value of testing.
MYTH #1. Tests should only call publicly available methods (functions, whatever). A la Kent Beck's "only entry points to externally-visible system behaviours define units".
Balderdash! The whole private/public/visible debate is arguing about how many angels dance on the head of a pin! The meaning of private or public depends ENTIRELY on the context. Just because some ancient languages (e.g. Java, which I love) have crude granularity for privacy levels, doesn't mean that's how reality works!
I used to build robots to inspect jet aircraft engines. We don't test aircraft engines by flying the entire plane! And yet the engines should certainly not be "public" to, say, the passengers! We remove the engine and put it in a special test-bed. The engine is "private" to the plane, but "public" to the test-bed.
E.g. Michael Feathers noted that, in the public-vs-private debate, often a particular set of "private" methods (say, ones that we really, really want to be able to test) is really a class looking to be extracted into a library of 'public' methods.
We build software in layers... like onions (tears included). The innermost layers need to be visible to the next layer out, and so on and so forth. Someday our language implementors will realize this, and we'll have more things like C++'s "friend" classes.
"We build software in layers... like onions (tears included)."
I think that's a reasonably good approach.
The product I work on is not written like that.
It's written like a big ball of mud. (The mud is probably an admixture that includes tears.) That is, in part, a byproduct of having a code base that is almost 4 decades old. And which (at least in great part) has evolved organically.
My programming language is C++.
It does not have unit tests as part of the core language. So there are many testing frameworks to choose from, like Google Test, Boost Test, Catch2, (my favorite) Doctest, and many others. I wish C++ had unit tests as part of the core language. Using a test framework causes lock-in, which is unfortunate.
My product does not use a testing framework. It uses a dozen testing frameworks. None of those testing frameworks is redundant; each is used for a particular aspect of the product. Some of those testing frameworks are focused on testing the system, some are focused on testing functionality, and some are for testing APIs. A couple are used for performance testing in a very narrow domain. The ones that could be focused on unit testing have been co-opted for integration testing.
I also wish that C++ had design by contract as part of the core language. If the C++ language had contracts, that would obviate a good part of the need for unit tests. Not all unit tests, just the ones where unit tests are used to express contracts.
Alas and alack. The C++ core language lacks unit tests. And the C++ core language lacks contracts.
(The context of Coplien's "most unit testing is waste" is about maintaining and running unit tests for code that is pretty much settled. It isn't against writing unit tests in the first place. TDD-style unit tests serve as a forcing function upon the programmer, which in turn makes their code abide by OO principles such as SOLID, or DRY/WET, or YAGNI, or GRASP, or KISS, et al. All of those OO principles were discovered in the OO era to shore up deficiencies in OO. Maybe if/when we move from the OO era to an FP era or a DSL era, we can dispense with those principles. But we'll probably discover FP or DSL deficiencies, and discover new principles to shore up FP deficiencies, or DSL deficiencies. To be determined. There is no silver bullet.)
Eljay, could you say more about what you mean by "unit tests as part of the core language"?
I'm not aware of ANY language that has unit-tests as part of the "core". E.g. Java, where sometimes we have to fight with the poor granularity of public/private, in order to make useful unit-tests.
But I would sincerely love to hear more about what this means, or might mean, to you (and others).
The D programming language has unit testing as part of the core language.
It also has design by contract (precondition, postcondition, invariant) as part of the core language.
Fascinating. Thank you!
Great analogy! Thanks for your insight, Charles.
That's a great article.
By the way, when someone submits a regular expression in a PR, to avoid the "now you have two problems" issue, I have a rule that you must have a (fast blackbox) unit test for the regexp.
Excellent job on this article! It sets the record straight on doing a lot of irrelevant work in the name of diligently "unit" testing all the things, and I love the premise that we should classify tests as merely white box or black box instead, and that black box tests should definitely be preferred. Good job. 🙂👍
There is only one thing I want to contest here.
This is often repeated by people who insist on TDD meaning "test first", and I don't buy it.
Writing the test first only "proves that the test would fail" in the same sense that running your program and looking at the output proves that the program works - you are essentially arguing in favor of "hands-on testing" this fact about your test suite.
Conducting a hands-on test by running your program only proves that this momentary version of the program worked - similarly, running your test (which is running a program) only proves that your test can fail for this exact momentary version of your program.
Just as running a program does not guarantee it will keep working, running your test and watching it fail does not guarantee that the test can still catch bugs in future versions of the program and/or test.
This is after all why we write automated tests in the first place.
The red/green approach can give you a sense of confidence, and probably increases the odds of writing a test that can catch bugs, but hands-on testing a test provides no guarantee of that. (If you want that guarantee, look to mutation testing.)
This feels hostile towards developers who don't practice TDD.
Personally, I never write tests to satisfy a code coverage metric - I only use code coverage to highlight potential areas for testing, and I will never write a low value test to increase a metric.
I reject the assumption that tests "aren't written to catch bugs" because someone typed in the code in a different order than you did. That's silly.
Tests can definitely help me think through difficult implementations - as I'm writing the code, I am always thinking about whether this code will be easy to test. As the code takes shape, I am always concerned with the structure and form as it relates to testing.
My added/updated code and tests are always in the same commit, as I'm sure are yours - my process is different from yours, but you can't look at my commits and see any difference.
I have tried the test-first approach, and it simply does not work for me. For one, I prefer strongly typed languages, and often, if I were to write a test first, it wouldn't even compile or run. Fighting an IDE the whole way because it can't complete code or statically check anything, is simply not a productive use of time for me.
The assumption you make about tests "not written to catch bugs" or "not written to help a developer think" because I chose to write the code first and test second, it sounds a lot like the kind of doctrine you're actually trying to dispel with this article.
I think you should reconsider that position. It's not true.
Writing tests first or code first, or writing bits of code and bits of tests, or changing or refining those bits as you progress -- whatever your approach -- it's simply a matter of difference of thought process. Our minds don't all work the same, and this kind of doctrine isn't helpful to those of us for whom this approach does not work.
I definitely encourage people to try the test-first approach. It works great for some people, and that's wonderful for them. But if something else works better for you, and if the outcome is quality code and high value tests, you do you. ✌️
That's fair! Thanks for sharing your viewpoint! I am constantly trying to remember that there is no "right" or "wrong" way of doing just about anything when coding. It's just that my opinions have been shaped by my experiences, good or bad, as have everyone else's. Thanks for reminding me of that :)
Whenever we talk about testing strategies, we always talk too little about the code that is being tested. Different code architectures and different pieces of software need different approaches to testing. The fact that some developers don't agree on what a "unit" is thus isn't a problem at all, as long as we tell the reader what a "unit of code" means and looks like in a testing strategy that suggests a certain % of unit tests.
If there is any single definition of a “unit” of code, it’s what Kent Beck described above. But, like you say, there are differently-sized units for different kinds of programs. But most discourse on the internet treats it as a standard chunk of code (usually a single function). I think this notion is harmful.
great...
MYTH #2. Tests should (or should not)... insert your opinion here.
Also balderdash! The primary purpose of tests is to help write (and maintain) WORKING CODE. Everything else is about context... or religion.
And that's why reflection mechanisms exist. For example, PHP's Reflection API allows you to "unlock" a method so it can be called even if it is private or protected.
So method visibility is not a blocker for unit testing.
But I think this tells you that you are testing at too fine-grained a level. If a method is invisible from “the outside”, it is not a “unit”, according to Beck’s definition. I think tests like this will make it harder for you to refactor that code.
I don't agree: the visibility of a method is a security measure, not a level of granularity.
Assuming we are running unit tests on our own code : there could be places where methods are private, still they need to be tested.
Flagging this as a High quality article because it most certainly is!
In my case, I test a PHP application where storing data correctly in the DB is crucial. What I do is mock any REST/XML-RPC API calls and any third-party service EXCEPT the DB.
For that, I create a dedicated test DB for each test.
Hi @awwsmm ,
Thank you for your insightful article on the complexities of unit testing! Your perspective on the definitions and implications of unit tests versus integration tests is refreshing and thought-provoking. I particularly appreciated your emphasis on the importance of black-box testing and the distinction between development-informed and development-informing tests.
This has given me a lot to consider regarding my own testing practices and how to improve the maintainability and effectiveness of my test suites. I look forward to applying these concepts in my work!
MYTH #3. There's a right answer to all of these questions.
Wrong. EVERY engineering decision has pros and cons. If you can't articulate them, then you're not being honest. All of the issues described in the original article occur on a spectrum. For any particular context (size of project, size of team, expected lifetime of project, etc., etc.) there is a "sweet spot" on the spectrum.
E.g. To do TDD or "development-informed" ("DI")? It's not an either-or question! Which approach, in context, produces better code? I was the unit-test evangelist for a project with 1M LOC and 20 developers... and under my nudging/wrist-slapping/rewarding, we went from 40% coverage to 75% coverage. We used TDD, DI, and what I call TND (Test Near Development: sometimes before, sometimes after, sometimes during)... and we had 20 THOUSAND tests.
They saved our bacon many, many, times.
I think this is the most important thing to impress on new developers: "it depends". I read somewhere recently that newbies tend to have a mindset that there is a single "correct" implementation, but as you gain experience, you learn that everything is negotiable. This is what I was trying (poorly, apparently) to get across in my original article: there is no such thing as a "unit test", a single definition which can be used to categorise tests into "unit" vs. "not unit". Rather, we should look at different aspects of tests, like their intention, what kind of testing style they use, etc. You've raised lots of great points, particularly that the things in my blog post are not to be taken as gospel, either.
Question everything!