In a recent project, we had to replace the code that calculates prices with a new pricing service that offers more functionality.
Because price calculation is one of the most critical parts of our service, we faced the following questions:
- How do we know whether the new pricing service produces the same results as our existing calculations?
- If it is not working as expected, what are the inputs for which it is not working?
- How do we know whether we have integrated with the new service correctly?
- What is the performance impact of calling the new service? Is it better or worse?
We could have tested a set of scenarios with known inputs, but production is all about unexpected situations. We wanted a way to test the integration before releasing the changes to users. That is when we came across Scientist.
Scientist is a small Ruby library that provides the framework you need to test new code without releasing it. Its main building block is an experiment, which has two blocks of code: control and candidate. Both blocks are executed every time the experiment runs, but only the result of the control block is returned. Exceptions raised in the candidate block are rescued. At the end, the results of both blocks can be compared and logged for further analysis.
The idea was to create a Scientist experiment with the existing local price calculation as the control block and the call to the new pricing service as the candidate block, and then compare the results before going live.
There were a few things we wanted to do before using Scientist:
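To make those semantics concrete, here is a plain-Ruby sketch that mimics them. It is not Scientist's implementation: `run_experiment`, `timed` and `Observation` are hypothetical names used only for illustration, and the blocks run in a fixed order for simplicity.

```ruby
# Plain-Ruby illustration of the control/candidate semantics described above.
# NOT Scientist's code; all names here are hypothetical.
Observation = Struct.new(:value, :error, :duration)

# Runs the block, capturing its value (or exception) and how long it took.
def timed
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  value = yield
  Observation.new(value, nil, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started)
rescue StandardError => e
  Observation.new(nil, e, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started)
end

def run_experiment(control:, candidate:)
  control_obs   = timed { control.call }
  candidate_obs = timed { candidate.call } # candidate errors are captured, never raised

  matched = control_obs.error.nil? &&
            candidate_obs.error.nil? &&
            control_obs.value == candidate_obs.value

  # Hand both observations to the caller for logging or metrics.
  yield control_obs, candidate_obs, matched if block_given?

  # Only the control's outcome reaches the caller, including its errors.
  raise control_obs.error if control_obs.error
  control_obs.value
end
```

Calling `run_experiment(control: -> { 100 }, candidate: -> { raise "service down" })` returns 100: the candidate's failure is recorded in its observation but never reaches the caller.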
- Refactor the existing code into one class. This way we can use polymorphism to switch between the existing implementation, the new implementation and the Scientist experiment. We use Figaro to manage application configuration, so this refactoring lets us pass the class to use as configuration to our service. I will show below how we did that. After the refactoring, our existing pricing calculations lived in a LocalPricingEngine class with an estimate method.
- We needed a place to log the input params, what the control and candidate blocks returned, and the time it took to execute each block. There are a number of places where we could log this information, such as Redis or a Postgres database, but we chose our log file. We already had JSON logging through Lograge, so putting everything in the log file made it easy to retrieve this information in one line.
- Although low-level logging of each input and difference is useful, it is hard to get a summarised view of what happened from it. We needed to be able to find out how many mismatches happened and at what time of day (our price calculation varies by time of day). We already use DataDog for monitoring and alerting, so it made sense to put this information there. I will show code snippets on how we did this later.
We created three polymorphic implementations of PricingEngine:
- LocalPricingEngine: Existing implementation of price calculations.
- PricingServiceEngine: New implementation which calls the pricing service for calculations.
- ExperimentalEngine: Scientist experiment which calls the above classes for control and candidate blocks.
We also added a factory that chooses the right implementation based on our environment configuration:
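A minimal sketch of what the engines and the factory could look like. The pricing formula, the stubbed service call, the constant names and the allow-list are assumptions made for illustration; only the class names, the estimate method and the PRICING_ENGINE variable come from this post. Figaro loads its settings into ENV, so reading ENV directly works here.

```ruby
# Existing in-process calculation. The formula is a hypothetical placeholder.
class LocalPricingEngine
  def estimate(param1, param2, param3)
    (param1 * param2 + param3).round(2)
  end
end

# New implementation. The remote call is left out because the pricing
# service's API is not described in this post.
class PricingServiceEngine
  def estimate(param1, param2, param3)
    raise NotImplementedError, "call the pricing service here"
  end
end

class PricingEngineFactory
  DEFAULT_ENGINE  = "LocalPricingEngine"
  # Allow-list (an assumption) so an arbitrary class name is never instantiated.
  ALLOWED_ENGINES = %w[LocalPricingEngine PricingServiceEngine ExperimentalEngine].freeze

  # env is injectable to make the factory easy to test; it defaults to ENV.
  def self.engine(env = ENV)
    name = env.fetch("PRICING_ENGINE", DEFAULT_ENGINE)
    raise ArgumentError, "unknown pricing engine: #{name}" unless ALLOWED_ENGINES.include?(name)

    # The constant is resolved lazily, so ExperimentalEngine only needs to
    # exist when it is actually selected.
    Object.const_get(name).new
  end
end
```

Callers then use `PricingEngineFactory.engine.estimate(param1, param2, param3)` and never need to know which implementation is active.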
We replaced the call to LocalPricingEngine with the call to PricingEngineFactory.engine.
We can now change the pricing engine through the environment variable PRICING_ENGINE. It takes one of three values: LocalPricingEngine, PricingServiceEngine or ExperimentalEngine. If the variable is not set, LocalPricingEngine is used by default.
This is what the ExperimentalEngine looked like:
I trimmed down the variable names for better formatting. e represents an experiment object and p1, p2, p3 are param1, param2, param3 respectively.
This is the step I spent the most time figuring out, in spite of Scientist’s good documentation.
By default, Scientist doesn’t publish the results anywhere. If we want to publish them, we can do so by creating a class that includes Scientist::Experiment and overriding its publish method.
As mentioned earlier, we want to log two things:
- Experiment results to log file.
- Summaries to DataDog.
We needed a way to communicate the experiment results to Lograge. Since Lograge custom options can only be passed at the controller level, we used RequestStore to pass the results from the experiment to the controller.
This is what the controller configuration looked like:
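A sketch of that wiring, written so it runs without Rails. It relies on two hooks: the `append_info_to_payload` method that Rails controllers call for each request, and Lograge's `custom_options` lambda. The module name ExperimentLogging, the constant EXPERIMENT_LOG_OPTIONS and the :experiment key are hypothetical, and the RequestStore module below is only a stand-in for the request_store gem.

```ruby
# Stand-in for the request_store gem so the sketch runs on its own.
# In a Rails app, delete this and use the gem.
module RequestStore
  def self.store
    Thread.current[:request_store] ||= {}
  end
end

# Include this in ApplicationController. The hook copies whatever the
# experiment stored during the request into the instrumentation payload.
module ExperimentLogging
  def append_info_to_payload(payload)
    super
    payload[:experiment] = RequestStore.store[:experiment]
  end
end

# In the Rails environment configuration:
#   config.lograge.custom_options = EXPERIMENT_LOG_OPTIONS
# Lograge merges the returned hash into the request's single log line.
EXPERIMENT_LOG_OPTIONS = lambda do |event|
  experiment = event.payload[:experiment]
  experiment ? { experiment: experiment } : {}
end
```

Requests that never run an experiment produce an empty hash, so their log lines stay unchanged.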
We used a StatsD client to publish the summaries to DataDog.
This is what the experiment class looked like. You can replace MyService in MyServiceExperiment with your service or application name:
There are two methods to implement:
- enabled?: This controls whether the candidate block should run. In our case, it always returns true, since we already control this behaviour through the class polymorphism we implemented.
- publish: This method is called after both the control and candidate blocks have run. We publish the results to DataDog and, through RequestStore, to Lograge.
Finally, we needed to tell Scientist to use this new experiment we created. You can do so with the following code:
If you are using Rails, you can put this in the config/initializers folder.
Using Scientist for this use case helped us in the following ways:
- Test whether the new service can handle the load. Our new service was production ready, so we decided to send it the entire load. If you don't want to send all of it, you can use a run_if block or the enabled? method to control what percentage of calls run the experiment. See the Scientist README for details.
- Uncover edge cases. We had a rule to increase prices at a certain time of day. We had misconfigured the new service to stop the price increase one minute early, and we uncovered this when we saw a lot of mismatches during that one minute only. There was also an issue with how the final price was rounded. Although the difference was small for a single price calculation, it would add up to a lot over thousands of calculations.
- Get instant feedback. We could see the difference in prices as soon as changes were made either in the local price calculation or in the new service.
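For the load-sampling case mentioned above, here is a sketch of a percentage-based enabled? check. PercentSampling and the SCIENCE_PERCENT setting are hypothetical names; the module is meant to be included in an experiment class in place of an enabled? method that always returns true.

```ruby
# Hypothetical mixin: run the candidate for only a percentage of calls.
module PercentSampling
  # SCIENCE_PERCENT is an assumed setting name; 100 means "run every time".
  def percent_enabled
    Integer(ENV.fetch("SCIENCE_PERCENT", "100"))
  end

  def enabled?
    percent = percent_enabled
    # rand(100) is uniform over 0..99, so roughly `percent` out of every
    # 100 calls pass this check.
    percent > 0 && rand(100) < percent
  end
end
```

When enabled? returns false, Scientist runs only the control block, so the new service receives no traffic for that call.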
There are a couple of things to be aware of:
- The control and candidate blocks run sequentially, which increases the overall execution time. It would have been nice if they could run in parallel.
- The control and candidate results are compared using ==. You can override this behaviour by passing a block to e.compare. See the Scientist README for details.
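As an illustration of a custom comparison, here is a sketch of a comparator that accepts price differences up to a chosen tolerance. `price_comparator` is a hypothetical helper, and the tolerance is something you would choose deliberately: in our case even small rounding differences mattered, so a strict comparison was the right default.

```ruby
# Hypothetical helper: builds a comparator that treats two prices as equal
# when they differ by no more than `tolerance`.
def price_comparator(tolerance)
  ->(control, candidate) { (control - candidate).abs <= tolerance }
end

# Inside a science block, the comparator can be passed to e.compare:
#   e.compare(&price_comparator(0.005))
```

With a tolerance of 0.005, prices of 10.0 and 10.004 count as a match, while 17.0 and 17.01 are still reported as a mismatch.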
We migrated to the new service with greater confidence and without causing any issues for users. We built the integration incrementally, releasing once or twice a day without the risk of inconveniencing users. We caught bugs early, and production started giving us feedback as we built.
Using Scientist was pretty smooth. It is a very small library, and you can read through its code in less than an hour. The documentation is pretty good and covers most of the commonly asked questions.