Avoid mutating data when you don't need to. There is a long story behind why mutation is bad here; jump down below if you don't care how we got to the point of discussing data mutation and how to avoid it for our specific task.
Side note: apologies to those with OCD, the images were created quickly, so I didn't align them perfectly 😬
So we are currently tasked with separating out a heavily used service and moving it to AWS for high availability, targeting any potential issues and eliminating any single points of failure along the way, all while ensuring that we fully test the functionality and have a proper rollback plan if anything were to go wrong.
One of the methods we use for rolling back is feature flipping, via LaunchDarkly. This allows us to slowly roll out new features in many different manners, as well as turn off failed or broken features. It helps us avoid the worry of deploying code that breaks something and having to roll everyone else's code back as well.
We do weekly releases and have many developers, so a rollback would mean everyone's code gets rolled back when only one part of it may be broken.
With feature flipping, we can just flip off the broken code, then push up a fix in the next release.
Unfortunately, there are some drawbacks that come with this practice.
Not all of our systems use LaunchDarkly; some fall back to config-based feature flipping.
On top of that, you can run into scenarios of deeply nested feature flips, even when following proper deprecation practices for those flips.
And then there is the scenario where we can't use feature flipping because it carries too much risk, and we can't easily do a rollback because too many systems would need to be changed over, which means we are all or nothing.
So we realized that we cannot safely feature flip the move of this service: there are too many moving parts. On top of that, due to the sheer number of systems that will be communicating with this service, we need to make it a one-time move.
So how do we ensure that we can trust moving to the new system? That it will handle the load as intended? That we have tested every scenario the current system handles?
As we always do here, it's time for some serious design discussion and planning. We need to build a system that lets us fully test against our current traffic, as well as run other stress tests.
I'll keep the design details limited for the sake of company privacy and to get to my original point I'm trying to make.
We came up with a plan to duplicate the entire traffic flow, record that information, replay it onto the new system concurrently, and then compare the findings.
We basically have the existing service, the re-player service and the system clone.
Don't worry, we have better names for them, but for now I'm keeping it simple for discussion's sake.
The basic idea is as below: we have our various HTTP requests [GET, POST, PUT, DELETE, ...] which hit the original service. We use an event listener to relay those to the re-player service, which records that data and then replays it to the clone service.
We take the response back from the service clone and record both the original call and the clone's response so we can compare them. We also have the ability to modify the request before sending it to the clone (think DELETE: we need to know which sessions match up).
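To make the record-and-compare idea concrete, here's a minimal sketch in Python. The names and the status-only comparison are my own simplifications, not the real service's schema:

```python
# Hypothetical record of one replayed call: the original request, plus the
# statuses we got back, so the two systems can be diffed later.
from dataclasses import dataclass

@dataclass
class RecordedCall:
    method: str           # GET, POST, PUT, DELETE, ...
    path: str
    body: bytes
    original_status: int  # what the existing service returned
    clone_status: int     # what the system clone returned

    def matches(self) -> bool:
        # Simplest possible comparison: did the clone answer the same way?
        return self.original_status == self.clone_status

call = RecordedCall("DELETE", "/sessions/42", b"", 204, 204)
print(call.matches())  # True
```

A real comparison would also diff headers and bodies, but status codes are enough to show the shape of the idea.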
So we have two designs on the table for the re-player. The first is the pretty standard design: you have data you need to send to a service, so you wrap it up in a variable and pass it as a POST.
So if we look at only the re-player service talking to the clone, it would look something like this.
It may look pretty straightforward, but there is a catch to it: we just created a service that requires you to pass data to it via POST only.
What if we are calling GET, PUT or so on? That is when you start to see the issue.
We just mutated the data to fit the required structure of the re-player service. Is there another option to avoid mutating the data?
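Concretely, design one's mutation might look like this. This is a stdlib-only sketch; the envelope field names are mine, not our actual contract:

```python
import json

def wrap_for_replayer(method: str, path: str, body: bytes) -> bytes:
    """Mutate any original request into the POST-only envelope."""
    envelope = {
        "original_method": method,  # hypothetical field names
        "original_path": path,
        "original_body": body.decode("utf-8"),
    }
    return json.dumps(envelope).encode("utf-8")

# Even a plain GET now has to become a POST payload:
payload = wrap_for_replayer("GET", "/users/7", b"")
print(json.loads(payload)["original_method"])  # GET
```

Every caller now has to know this envelope exists, and the re-player has to unwrap it before it can do anything.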
Let's avoid changing the call and see if we can still accomplish the same goal: replay the data from the original service to the re-player service, and then on to the clone service.
By telling your re-player service to accept any method, and using a little more logic within it to set up the data, you are no longer mutating the data in between services.
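Here's a sketch of design two with the HTTP plumbing stubbed out; `record` and `forward` are hypothetical hooks standing in for the real log write and the real client call to the clone:

```python
def make_replayer(record, forward):
    """Build a handler that accepts any method and passes it through untouched."""
    def handle(method: str, path: str, body: bytes):
        record(method, path, body)          # keep an unmutated copy for replay
        return forward(method, path, body)  # send to the clone exactly as received
    return handle

log = []
handle = make_replayer(
    record=lambda m, p, b: log.append((m, p, b)),
    forward=lambda m, p, b: ("200 OK", b),  # stand-in for the clone's response
)
status, body = handle("PUT", "/items/3", b'{"name": "x"}')
print(status)  # 200 OK
```

The point is that `handle` never rewrites the request; GET, PUT, and DELETE all flow through in their original shape.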
I do!! The why shows up when you go to test the service! Let's make this as simple as possible.
We will point curl, Postman, or whatever else we want at each service, with the exact same payload for all three services. Which ones will work and which won't?
We'll try design two first.
All looks good! We hit all three with different types of requests, and all passed.
Now let's try it with design one!
Yikes!! What is up with the nasty red? Wait, so we can't hit the re-player without modifying the data into a custom POST request? Vomit!
Well sure for now that is the goal, we will just make it go away and then all is well!
Hey, so what are we going to do when requests from the current service call over to the re-player service, but then the clone service fails? We should probably replay those, yeah?
What would be awesome is if we just logged all the HTTP requests and then had an automated system replay them any time we wanted. Or some type of queue that would retry them later.
Dang it, we can't! At least not without teaching those other services how to talk to the re-player service.
Now we can't do any work with the re-player service unless we adhere to its signature.
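For contrast: if the log held raw, unmutated requests (design two), the automated replay system wished for above is just a loop. This is a sketch under that assumption, with `send` as a hypothetical stand-in for a real HTTP client:

```python
def replay_failed(log, send):
    """Re-send every logged request; return the ones that still fail."""
    still_failing = []
    for method, path, body in log:
        status = send(method, path, body)
        if status >= 500:
            still_failing.append((method, path, body))
    return still_failing  # requeue these for the next pass

log = [("GET", "/a", b""), ("DELETE", "/b", b"")]
remaining = replay_failed(log, send=lambda m, p, b: 200)
print(len(remaining))  # 0
```

No envelope, no custom signature: the replayer works with anything that speaks plain HTTP.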
And that! That is why I care.
Thoughts? Comments? Am I wrong? Better design in mind? Let me know!