It's hard to stay positive all the time. Life at work is more complex than what the blog posts mention, or the YouTube tutorials that promise to take you from zero to success. Work life has more trouble and more pain than most influencers show us in their videos.
Today we had to revert, again, months of a team's work. That work involved hours of effort from the DEVs, the QAs, PMs, and other leaders in the organization. We like to think that we have a Majestic Monolith, as @DHH describes Hey (his email service), but we know that ours is not that majestic, because it was made by humans and has technical debt to pay.
I want to tell you this story, which is the story of my team. If they read this, they will know who is who in it. I just know that all of them are equally important in this task, which currently may seem like a failure, but we learned more about ourselves by deploying it again and removing it again.
It all started months ago, when we decided internally to work on the upgrade of our web framework. We had postponed this to stay focused on building features, but we decided it was time to pay this technical debt. The whole development team had been adding automated tests to our repo: unit tests, functional tests, Selenium tests. We increased the number of tests week by week.
We assigned the task of preparing the upgrade to one of our ninja devs (yes, we have several of those), and his sole job was to remove the manual dependencies we had on old libraries and ensure our product was still functional. Easy? Not at all... He spent hours, days, and months working on this, and every time he thought he was getting closer, new product issues were found and he had to correct them. He was a solo player, part of a successful band, but preparing our next way of working. His sole job was to lift us up, and that was not simple at all.
After multiple weeks of work, he finally got his branch (talking about Git) ready for our Quality Assurance team, and they began scouring it, inside out and upside down, for bugs or issues. That's their job. They began posting issues about unexpected behavior in the application, then other inconsistencies, and all of them were documented in this "super Jira issue".
He worried that he had broken important elements of the system; he even worried about his reputation. Some issues were caused by him, but others were existing issues that we had forgotten were there, in modules with low usage or in rare scenarios that are not really meant to happen.
All these lessons learned just made the team in general worry more about the work that was being planned. Leaders had to get into meetings and review things case by case to assess the criticality of this work, or even consider whether it should happen at all.
After deliberating, we decided to go ahead and deploy this into PROD. At this point, DEV, QA, Tech Leads, Managers, and others had already invested many hours in this process, so we felt confident about it.
We planned this for a weekend because we didn't want to cause any product disruptions. It was time. It was a Saturday, so... we (the team) deployed into PROD!
The first minutes of monitoring went well. A couple of minutes later, the QA team noticed that the sessions, which used to never expire, were expiring, but they thought it was related to something else... After hours of testing, we marked the release as stable and had a good weekend.
Monday morning came, and we got the first users complaining that their sessions were expiring and that they were not able to finish the courses they were trying to complete in the system. Support agents were unable to finish tasks because the system would log them out.
All the while this was happening, we were looking for what was different. If we supposedly have identical environments in PROD and DEV, how could we miss this? What was different?
At 11am, we decided to revert. We had held on through about 4 hours of phone calls and email tickets, and then we tossed the work of months.
What was the issue? We found out that our database team had determined that our session table had performance issues and could use some indexes and a combined field. Well, we never implemented this change in DEV. Why? Because we didn't think it was relevant. Why didn't the Tech Lead of the DEV team consider this? He thought it was not relevant... It was just a cascade of assumptions.
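The kind of drift described here, where PROD had indexes and a combined field on the session table that DEV never got, could be caught by comparing a schema snapshot from each environment before a release. Below is a minimal sketch in Python of just the comparison step. The table, column, and index names are hypothetical, and how the snapshots get exported is an assumption that depends on your database.

```python
def schema_drift(prod, dev):
    """Return human-readable differences between two schema snapshots.

    Each snapshot is a dict: table name -> {"columns": set, "indexes": set}.
    Producing the snapshots is left out; that depends on the database.
    """
    report = []
    for table in sorted(set(prod) | set(dev)):
        if table not in dev:
            report.append(f"{table}: table missing in DEV")
            continue
        if table not in prod:
            report.append(f"{table}: table missing in PROD")
            continue
        for kind, label in (("columns", "column"), ("indexes", "index")):
            prod_items = prod[table].get(kind, set())
            dev_items = dev[table].get(kind, set())
            for name in sorted(prod_items - dev_items):
                report.append(f"{table}: {label} '{name}' only in PROD")
            for name in sorted(dev_items - prod_items):
                report.append(f"{table}: {label} '{name}' only in DEV")
    return report


# Hypothetical snapshots, invented for illustration only.
prod_schema = {
    "sessions": {
        "columns": {"id", "user_id", "last_seen", "user_last_seen"},
        "indexes": {"idx_sessions_user", "idx_sessions_last_seen"},
    }
}
dev_schema = {
    "sessions": {
        "columns": {"id", "user_id", "last_seen"},
        "indexes": set(),
    }
}

for line in schema_drift(prod_schema, dev_schema):
    print(line)
```

An empty report would mean the two snapshots agree. Anything else is a question to ask before the deploy instead of after it.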
We had other priorities internally. The ninja DEV was working on the migration from Bitbucket to GitHub and implementing Jenkins to stop using the Bitbucket Pipelines that had become very expensive. It was a matter of cost savings in pandemic times, so we could hold that technical debt for a few more weeks.
The DEV was frustrated because the work he had spent months on was removed and all the teams moved on. He was congratulated at times, but he knew he hadn't taken this project to the finish line, so he kept working on it on the side, figuring out the session issue, which was eventually reverted by our database team.
After the failure, the project was scrutinized even more by some of the teams. There were talks about how much rework had been put in, the cost of it, whether it was even worth it, whether the DEV was a ninja or just a mariachi, etc.
He and his Lead decided to give it one more try.
They brought the project to me again (I'm the DOE, btw), and I decided to buy in. Another attempt, this time with a different strategy: instead of waiting all weekend, we would do the deploy at 5am, before our customers, who mainly work office hours, log in. We would run a quick test on the known issues (session duration, system loading times, etc.), and if everything looked under control, we would just let the upgrade stabilize and then resume adding the new features and bug fixes that we had in the pipeline.
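A quick test on known issues like this one can be written down as an explicit go/no-go check, so the decision to keep or revert a release does not depend on memory at 5am. Here is a minimal sketch in Python, assuming the measurements are collected by hand or by whatever monitoring you already have. The metric names and thresholds are made up for illustration and are not the ones we used.

```python
def go_no_go(measurements, limits):
    """Compare post-deploy measurements against agreed limits.

    limits maps a metric name to (minimum, maximum); None means no bound.
    Returns a list of failures; an empty list means "go".
    """
    failures = []
    for metric, (low, high) in limits.items():
        value = measurements.get(metric)
        if value is None:
            failures.append(f"{metric}: not measured")
        elif low is not None and value < low:
            failures.append(f"{metric}: {value} is below {low}")
        elif high is not None and value > high:
            failures.append(f"{metric}: {value} is above {high}")
    return failures


# Hypothetical limits: sessions should survive at least 8 hours,
# and a page should load in at most 2 seconds.
limits = {
    "session_lifetime_minutes": (480, None),
    "page_load_seconds": (None, 2.0),
}
observed = {"session_lifetime_minutes": 30, "page_load_seconds": 1.2}

failures = go_no_go(observed, limits)
print("GO" if not failures else "NO-GO: " + "; ".join(failures))
```

The point of returning a list instead of a single yes or no is that everyone on the call sees exactly which check failed.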
I decided to wake up at 4:45am, take a shower, put on perfume and a shirt, and look like I was ready for anything on the morning video conference.
The team deployed and began testing. It was an experience similar to launching a rocket: everyone on the line, monitoring the services, and even telling jokes during the idle times.
At 7am, customers started to come in, and everything looked normal. I tweeted about this, celebrating the effort and the boldness of the team in doing this on a Friday morning, against the no-deploy Friday rule that exists in the industry.
At 5pm, we called it a day. Everything had looked normal, but then we started to see an issue related to the garbage collection process that was new in this version. The database processes were starting to pile up. We noticed that the query that handles them had changed, and we touched base with the database team about some performance recommendations...
At 6pm, we decided to have our ninja dev work on a patch over the weekend, and he did. The patch consisted of bringing the improvements from the latest version into this still 2-year-old version. We were making small improvements to make sure we didn't break things.
We started receiving database alerts. Problems were piling up, and we had to begin manually killing processes.
The pull request for the patch was approved and deployed to a sandbox environment, and the QA team jumped into testing it very quickly. All this happened while our PROD server was suffering; response times were higher than normal.
The patch got approval from the QA team, and we were ready to deploy. But we hesitated. Our CTO began to ask us about consequences and reminded us that rushing things in emergency scenarios was not really worth it. We had a stable product before this upgrade, so we should just stay with that for the moment.
Personally, I didn't feel like it was the right decision, since we had the patch in our hands, but I knew he was right. It was a big patch, and the consequences could be worse than the slowness we were experiencing.
At 6pm on Monday, the revert was completed and the upgrade was removed from PROD. Now we will work on our next attempt to make this happen.
The team hasn't given up at all. When we discussed some of this in other forums, people jumped in to tell us things we could have prevented or done differently. We know them, but we also know that we learn from each experience, and we are not making the same mistakes.
As the popular phrase goes:
What doesn't kill you makes you stronger.