Discussion on: Staging Environments Are Overlooked - Here's Why They Matter

View post

What do you think about using feature flags/feature toggles as sort of "staging" environment. I.e. having only one production environment and deploy risky and big changes under feature flag. Enablr flag only for test user, or small amount of beta customers.

And what about that staging can give "false sense of safety". Ive seen many times devs to test on staging thoroughly just to then think that they can "push to production and go home", while actually they missed bug that was only present on production, because of specific production data, setup or scale.

IMHO even with staging you still need a way to limit blast radius with production deployments. And if you have that, do you need staging any more?

Jason Skowronski • Oct 21 '19

Lukáš these are all great issues to point out! IMHO phased releases using future flags are a great way to test major functionality changes out. In fact there is a popular library called scientist that does just that in a controlled way github.com/github/scientist.

You're also right that's important to minimize the blast radius of missed problems. Using future flags exposes customers to potentially broken code, and possibly incompatible versions of dependent services. It can even lead to data loss or data corruption. Having a staging environment with proper test cases and that closely mirrors your production environment will hopefully catch the most obvious problems before they impact customers, and before your dev team leaves for the day.

It's great that your dev team trusts your staging environment so well that it gives them a false sense of safety. However, it's only as good as the test cases and how well it mirrors production. You can test for "known knowns" but tests miss "unknown unknowns". These missed cases can and do cause major production outages, even at established companies.

A common rule of thumb is to never make major changes and leave, especially on a Friday if you want to enjoy your weekend! I think it's important to have post-deployment monitoring in place. If there is a critical change in KPIs/SROs they should be alerted to fix the problem or rollback ASAP.

If you're at a small startup you might not have all this infrastructure and policies in place and its more common to cut corners for efficiency. It becomes more critical at larger companies where major outages or bugs can cause thousands or millions of dollars of damage through lost sales, damaged reputation, etc.