As much as computer programming has advanced over the past two decades, developers and operators are still dealing with “works on my machine” problems — an application that works great on the laptop but is completely non-functional in production or on a colleague’s laptop. Why are we still having this problem?
I think of “works on my machine” as a function of how much control developers have over the production environments and how identical the development and production environments are. Over the short history of computer science, the pendulum has swung a couple of times, leading to more or fewer “works on my machine” problems.
Let’s think back to the early days of computer programming, when programming a computer involved punch cards. Any mistake on a punch card meant you had to punch those cards again. Developers were coding in production, and the cost of each mistake was high. But mistakes were immediately apparent and developers were working as close to production as possible. Everyone was working on the same machine, so there were no “works on my machine” issues.
As developers started using client-server systems and then programming on their own machines, the distance between the production environment and the development environment started increasing. This is when “works on my machine” started becoming a serious issue for software engineering teams.
Then came the cloud. At first, cloud was really Shadow IT, used and configured by developers to run non-critical applications. At that stage, developers had control over the cloud and “works on my machine” problems decreased.
Now, though, as cloud has moved from Shadow IT to mainstream and more layers of control have been put on how cloud environments are set up, the distance between what developers are doing in their IDEs and what the production environment looks like is increasing.
You can’t really work directly in the cloud, and there are good reasons that we don’t have developers working in the production environment as they did in the mainframe era. Now we have isolated systems for developers so that they can safely make mistakes while developing. At the same time, developers are being woken up at two in the morning because their code doesn’t work in production, and they don’t have the tools to easily debug the problem when everything worked perfectly on the laptop. The bugs are coming home to roost, one might say.
At the moment, most companies are addressing the “works on my machine” problem with a mixture of the following techniques:
Reducing velocity. More robust testing is one strategy for catching potential problems before they reach production. We would like to think that all testing is 100% automated and instantaneous, but that is not true. A more robust testing procedure will slow down development velocity and still not ensure that all “works on my machine” problems are caught before production.
Trial and error. Organizations talk about getting through issues in production by deploying more frequently or by using advanced deployment techniques like canary deployments. This is a euphemistic way of saying that they are using trial and error to solve “works on my machine” problems.
Establishing more stringent deployment procedures. Organizations also try to address “works on my machine” problems by establishing increasingly rigid deployment procedures and putting in both guardrails and roadblocks on the deployment pipeline, hoping that problems will be caught before production.
The problem with these approaches is that none of them actually solves the problem or gives the developer a better way to proactively ensure that the service will work correctly in production before it even enters the integration process.
After all these years and all these late nights of frustration, you’d think that software engineering as an industry would have figured out a better way to prevent “works on my machine” problems. The real solution, though, has to involve decreasing the distance between the development environment and the production environment so that developers are automatically able to develop in an environment that’s identical to production, including having access to the latest versions of upstream and downstream dependencies and running with the same configurations. As an industry, we talk a lot about shortening the feedback loop. Developers should be alerted that there might be a service compatibility issue or that an update won’t work in production before it leaves their machine, not after a failed canary deployment. That’s the only way we’ll end up eliminating the “works on my machine” problem for good.
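To make the idea of an early alert a little more concrete, here is a minimal sketch in Python of a check that could run on a developer's machine before code leaves it. It compares a snapshot of the local environment against a snapshot of production and lists every dependency version or configuration value that differs. Everything in it is an assumption made for illustration: the `Environment` class, the `find_drift` function, the `payments-api` service, and the `TIMEOUT_SECONDS` setting are hypothetical names, and how a team would actually obtain a production snapshot is left open.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """Hypothetical snapshot of an environment: dependency versions and config values."""
    name: str
    dependencies: dict[str, str] = field(default_factory=dict)
    config: dict[str, str] = field(default_factory=dict)


def find_drift(dev: Environment, prod: Environment) -> list[str]:
    """Return human-readable differences between two environment snapshots."""
    drift = []
    for label, dev_items, prod_items in (
        ("dependency", dev.dependencies, prod.dependencies),
        ("config", dev.config, prod.config),
    ):
        # Walk the union of keys so entries missing on either side are reported too.
        for key in sorted(dev_items.keys() | prod_items.keys()):
            dev_value = dev_items.get(key, "<missing>")
            prod_value = prod_items.get(key, "<missing>")
            if dev_value != prod_value:
                drift.append(
                    f"{label} {key}: {dev.name}={dev_value}, {prod.name}={prod_value}"
                )
    return drift


if __name__ == "__main__":
    # Illustrative values only; real snapshots would come from the actual environments.
    laptop = Environment(
        "laptop", {"payments-api": "2.3.0"}, {"TIMEOUT_SECONDS": "30"}
    )
    production = Environment(
        "production", {"payments-api": "2.4.1"}, {"TIMEOUT_SECONDS": "5"}
    )
    for line in find_drift(laptop, production):
        print("WARNING:", line)
```

A check like this only narrows the gap; it does not make the two environments identical. What it illustrates is the timing: the warning appears on the laptop, not after a failed canary deployment.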