Failed deployment
It happened at my previous FinTech job. The new feature was critically important. Financial regulations required us to start screening third parties participating in transactions with our clients against sanctions and PEP (politically exposed persons) lists. If a person was on one of these lists, transactions with them as sender or beneficiary had to be investigated by Compliance. Before these new requirements, we already had some mechanisms in place to block transactions for investigation, for example when the transaction amount exceeded a certain threshold.
So, one beautiful morning, after development and testing were finished, a new release was rolled out and traffic started coming to the new build. Monitoring showed that API calls to the company that provided the screening service were all successful; everything looked fine. However, within the hour the Compliance department raised an urgent issue: transactions that had been blocked for investigation started to land on clients’ accounts! If left unresolved, that could have led to revocation of our license, so a rush began: accounts that had received money in this manner were blocked, the release was rolled back, and transactions were manually adjusted. Quite a nervous and unpleasant day for everyone involved!
Let’s dig down to the roots of the problem
How did it happen? We had automatic tests, code reviews, sandbox testing… Let’s take a look at the code of the business logic. Disclaimer: the real code is proprietary, so I will show some pseudocode that looks approximately the same but does not reproduce the real procedures. Here’s the high-level implementation of the payout initiation process.
seqAnd(
    createTransaction(),
    label("CollectData"),
    displayPage("CollectPayoutData"),
    when(
        seqAnd(
            not(processHasAttribute("Admin")),
            not(productFlagIsSet("NoOtpConfirmation")))
    ).then(
        sendOtpSms(),
        displayPage("EnterOtp").onError(reset("CollectData"))
    ),
    unloadAccount(),
    when(
        volatile(
            pepSanctionsScreening())
    ).then(
        enqueueApprovalRequest(),
        displayPage("ComplianceCheck")
    ),
    when(
        transactionHasStatus("ToBeRejected")
    ).then(
        loadAccount(),
        failure()
    ),
    initiateTransfer(),
    …
If your impression was “What the hell is going on here?”, then we are on the same page: this is how I felt the whole time I worked with this code.
This example shows that a DSL was created to describe the business logic. This DSL could control both backend operations and the display of UI pages. The backend was written in C++; the frontend was a web UI with pages generated by the Bottle framework, written in Python. A business process written in this DSL was called an “Algorithm”. Every step of an “Algorithm” was an instance of a polymorphic class whose base interface included the methods start() to invoke a function stored in the instance, getNextSteps() to, who could have guessed, calculate the next step (the key method for flow control operators), and getState(), which returned one of five possible states of a step: NotInvoked, Running, Success, Failure or Reset.
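To give a feel for its shape, here is a minimal sketch of such a step interface. Everything beyond the three method names mentioned above is my reconstruction for illustration, not the real code.

#include <vector>

// Sketch of the legacy step interface (reconstructed from memory;
// names other than start(), getNextSteps() and getState() are
// illustrative, not the real code).
enum class StepState { NotInvoked, Running, Success, Failure, Reset };

class AlgorithmStep {
public:
    virtual ~AlgorithmStep() = default;
    // Invoke the function stored in this step instance.
    virtual void start() = 0;
    // Calculate the next step(s); the key method for flow
    // control operators like when() or seqAnd().
    virtual std::vector<AlgorithmStep*> getNextSteps() = 0;
    // One of the five states that every flow control operator
    // had to somehow map onto binary logic.
    virtual StepState getState() const = 0;
};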
Evaluation of an “Algorithm” went as follows. First, build a graph representing all the steps. Then query the DB for the states of the steps that have already been completed. Mark the completed steps in the graph and find the next step by evaluating the flow control steps. Evaluation of the flow control statements was not quite straightforward, as the five states mentioned above had to be mapped onto binary logic. Every operator did it in a slightly different manner, but in the end the next step was somehow calculated. If that step was a background step, the function contained inside it was invoked, and, depending on its result, the next evaluation iteration ran. If the step was a UI step, “Algorithm” evaluation was interrupted until the result of the user interaction came through yet another entity, a “mediator”. Just to make reasoning about program state more fun, there was a volatile wrapper that forced step reevaluation on every “Algorithm” run, and a reset wrapper that could mark some steps as NotInvoked, returning to a specific label just like the good old goto operator.
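Roughly, a single evaluation run can be pictured like the sketch below. This is again a reconstruction; all names here are illustrative.

// Illustrative reconstruction of a single "Algorithm" run;
// names and structure are my approximation, not the real code.
void runAlgorithm(AlgorithmGraph& graph, Db& db) {
    // Mark steps completed in previous runs; volatile-wrapped
    // steps are deliberately left unmarked so they run again.
    graph.markCompletedSteps(db.loadStepStates(graph.processId()));
    while (AlgorithmStep* step = graph.evaluateNextStep()) {
        if (step->isUiStep()) {
            // Interrupt evaluation; the result of the user
            // interaction will arrive later via the "mediator".
            return;
        }
        step->start();            // invoke the background step
        db.saveStepState(*step);  // persist its new state
        // The loop then reevaluates the flow control operators
        // with the updated states to find the next step.
    }
}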
Are you saying “This solution does not look particularly great”? Well, there’s more. The system was not just your average, boring product; it was a modern and fancy distributed system, so it used a framework called ZeroC Ice. This is an RPC framework that also provides deployment configuration, service discovery, SSL encryption and more. By itself it’s rather interesting, but in our case it was used almost everywhere. The majority of the functions in the system were exposed as RPCs, which made them effectively globally visible. And they could call other RPCs, which in turn called even more. Every call produced side effects that were quite often used in the condition expressions of “Algorithms” defined with the DSL, and these conditions were, of course, heavily nested. As a result, even though we could see in the monitoring interface which steps had already passed, it was almost impossible to tell what would happen next.
The error described at the beginning of the article was caused by multiple factors. The step that implemented screening was wrapped in volatile, so it was evaluated on every run of the “Algorithm”. That was necessary because otherwise, once the manual compliance control branch was taken, the other, “happy” path could never be selected, even if Compliance approved the transaction. The other factor was a timer that periodically reran the “Algorithm” so that it could check the relevant side effects and continue forward. When the new version with screening was invoked with the data of an already existing transaction that was blocked for compliance investigation, the sanctions screening request was sent (because it had not yet been done for the old transaction, right?). When the response came that screening was passed, the transaction was marked as clean, because the other checks (for amount and other attributes) that could have sent the transaction to investigation were already complete, so they did not run again. The now “clean” transaction proceeded happily to load money onto the account, putting the whole company in danger. Nobody anticipated that: neither the author (me), nor the reviewers, nor QA, nor the automatic tests. I must say, test coverage was not the greatest; with the abundance of long RPC chains, one had to mock tons of them for every test, and there was never enough time for that.
I can’t take it anymore
I thought, and started to design a refactoring. Based on my experience with the old “Algorithms” framework, I identified the following problems:
- Full-fledged flow control structures incite developers to move business logic to the DSL level.
- Each step knows nothing about the others, so the only way to pass data between them is through side effects, which leads to implicit dependencies, obscure logic and overall fragility of the system.
- Controlling the frontend from the backend bloats the code of a business process, and the different behavior of UI and backend steps produces two intertwined simultaneous flows inside the same process, which is difficult to follow and modify.
In other words, the rich feature set of the DSL framework was just too much and led to overcomplication of the system. Extending and modifying processes, with their complicated logic, implicit dependencies and unpredictable side effects, became too difficult and dangerous.
Another big architectural problem, not related to the framework itself, was the direct dependency of the business logic on the communication framework and the DB access API.
Given these problems, I set the following refactoring requirements:
- The process state should be persistent, and it should be possible to interrupt and then resume the process, for example for a retry or after receiving an asynchronous callback from a third-party service.
- The new framework should not allow expressing business logic in the structure of a process. All branching should be done in “normal” code, and the process structure should remain extremely simple. In fact, a linear sequence that just progresses from one step to the next was entirely enough in our case.
- It should be possible to pass data between steps in order to reduce reliance on side effects. The frontend should be separated from the backend and interact with the latter through an API, not through some weird additional entities.
- Business logic should be separated from implementation details, such as the RPC mechanism or DB queries.
There are some existing utilities and frameworks that I checked. One is boost::future, which supports continuations; another is AWS Step Functions. The former does not provide the capability to continue execution from an arbitrary point. The latter, despite having a nice UI for development and debugging, has the same problem our framework had: business logic would creep into the Step Functions configuration language. Also, we did not want to be bound to a specific vendor. The FSM framework from boost::statechart could have been used, but it also has a vastly excessive feature set compared to what I needed, and some of its abstractions, like events, do not really fit our use case.
Frontend separation was a challenge in itself, because we did not have any free frontend developers. In the end I had to learn React and compose the new UI myself. In fact, I did a lot of the work myself because I was just a common (senior) developer, so I had no formal authority to implement my ideas with someone else’s hands. I had to persuade my colleagues and management that this refactoring was worth doing, and it’s hard to persuade anybody by just describing how great the system will become. People might even agree and say “yes, it would be cool to implement it”, but nobody will do anything. This is exactly how my first presentation failed.
What works much better is a demonstration of an already working solution, or at least a working prototype. I put in a lot of hours, including my free time, and made a working MVP (a full redesign of one of the major business processes), and then it was a completely different conversation. For example, it turned out that one of the backend developers also knew React and was willing to help with the new UI for other processes. The simplicity of writing, testing and debugging with the new framework made a very good impression on the other developers. Management liked that it was now possible to easily implement some money-saving features; a large part of the work was already done, so the risks and required time were quite clear, plus they suddenly got extra frontend devs. So they said yes and officially allocated time for the project.
What did we get in the end
So, instead of being part of a complex class hierarchy, each step became just a free function taking a pair of arguments. The first argument is a struct with input parameters. The second is a “context” or “environment”: an object that encapsulates all the DB and RPC details. The return value of a step is passed to the next one as its first argument.
Here’s a simplified example of the implementation of a single step. Its purpose is to load money back onto a user’s account in case their outgoing transaction failed or was canceled by Compliance.
// .h
#include <optional>

// Each step defines an interface that the business logic
// needs in order to interact with other modules.
// This implements the Dependency Inversion principle:
// instead of depending on other modules, the business logic
// defines an interface on which those modules will depend.
class PossibleRevertCtxI {
public:
    virtual ~PossibleRevertCtxI() = default;
    virtual bool loadAccount(int consumerId, int amount) = 0;
    virtual int transactionAmount(int transactionId) = 0;
};
std::optional<TransactionData> possibleRevert(
ComplianceData data, PossibleRevertCtxI& ctx);
// .cpp
#include <stdexcept>
std::optional<TransactionData> possibleRevert(
ComplianceData data, PossibleRevertCtxI& ctx
) {
if (data.complianceDecision == REQUIRED) {
// We have to wait for external input, so stop
// the process on this step.
return std::nullopt;
}
if (data.complianceDecision == REJECT) {
ctx.loadAccount(
data.consumerId,
ctx.transactionAmount(data.transactionId));
throw std::runtime_error{
"Transaction rejected by Compliance." };
}
// decision is APPROVE or NOT_REQUIRED
TransactionData result;
result.consumerId = data.consumerId;
result.requestId = data.requestId;
result.transactionId = data.transactionId;
return result;
}
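One immediate benefit: such a step is trivial to unit test with a hand-written fake context, with no RPC mocking at all. The test below is my own sketch, not code from the repository.

#include <cassert>

// A hand-written fake of the step's context interface.
struct FakeRevertCtx : PossibleRevertCtxI {
    int loadedConsumer = -1;
    int loadedAmount = -1;
    bool loadAccount(int consumerId, int amount) override {
        loadedConsumer = consumerId;
        loadedAmount = amount;
        return true;
    }
    int transactionAmount(int) override { return 100; }
};

void testRejectedTransactionIsReverted() {
    FakeRevertCtx ctx;
    ComplianceData data;
    data.complianceDecision = REJECT;
    data.consumerId = 42;
    data.transactionId = 7;
    try {
        possibleRevert(data, ctx);
        assert(false && "expected an exception");
    } catch (const std::runtime_error&) {
        // The money must be loaded back before the step fails.
        assert(ctx.loadedConsumer == 42);
        assert(ctx.loadedAmount == 100);
    }
}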
A small library was created to take care of combining individual steps into chains. The change of state is dead simple: if a step returns a value, the chain switches to the next step; if a step returns std::nullopt, the chain does not switch and execution stops. Naturally, if a step throws an exception, the chain is also interrupted. It is possible to run a whole chain in one go, or run just one step at a time, and obtain the current step index and serialized parameters after each step in order to persist them. The library requires that the structures holding parameters be serializable, which means they must provide a std::string serialize() const method and must be constructible from a string. Another limitation is that a step function must accept one or two arguments (either only the input data, or the input data and a “context”).
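For illustration, here is one way a parameter struct such as ComplianceData could satisfy these requirements. This is a sketch: the Decision enum is my assumption based on the constants used in the step above, and the trivial comma-separated format stands in for the JSON used in the real code.

#include <sstream>
#include <string>

enum Decision { NOT_REQUIRED, REQUIRED, APPROVE, REJECT };

// Sketch of a struct satisfying the library's requirements:
// a serialize() method plus construction from a string.
struct ComplianceData {
    int consumerId = 0;
    int requestId = 0;
    int transactionId = 0;
    Decision complianceDecision = NOT_REQUIRED;

    ComplianceData() = default;
    explicit ComplianceData(const std::string& s) {
        // Parse the serialized representation (error handling
        // omitted for brevity).
        std::istringstream in{s};
        char sep = 0;
        int decision = 0;
        in >> consumerId >> sep >> requestId >> sep
           >> transactionId >> sep >> decision;
        complianceDecision = static_cast<Decision>(decision);
    }
    std::string serialize() const {
        std::ostringstream out;
        out << consumerId << ',' << requestId << ','
            << transactionId << ','
            << static_cast<int>(complianceDecision);
        return out.str();
    }
};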
A business process implementation became something like this:
steps_chain::ChainWrapper payoutProcess(
std::shared_ptr<ApiMock> api,
std::shared_ptr<DbMock> db,
std::shared_ptr<TimerMock> timer
) {
// Steps are designed with the Interface Segregation
// principle in mind, so they accept different
// types as their 'context'. So we have to wrap
// the steps in lambdas here, as ContextStepsChain
// requires identical type of 'context' as a second
// argument of all steps.
return steps_chain::ChainWrapper {
steps_chain::ContextStepsChain{
[](InitialData d, std::shared_ptr<PayoutContext> c) {
return unloadAccount(d, *c.get());
},
[](TransactionData d, std::shared_ptr<PayoutContext> c) {
return sanctionsScreening(d, *c.get());
},
[](ComplianceData d, std::shared_ptr<PayoutContext> c) {
return possibleRevert(d, *c.get());
},
[](TransactionData d, std::shared_ptr<PayoutContext> c) {
return startTransfer(d, *c.get());
}
},
std::make_shared<PayoutContext>(api, db, timer)
};
}
And these chains are executed in the following manner:
void runProcess(
steps_chain::ChainWrapper p,
const std::string& requestId,
std::shared_ptr<DbMock> db
) {
const auto data = db->fetchProcessData(requestId);
p.initialize(data.parameters, data.stepIdx);
try {
while (!p.is_finished()) {
if (p.advance()) {
const auto [idx, params] = p.get_current_state();
db->setProcessData(data.requestId, idx, params);
}
else {
// Log the interruption of the process.
return;
}
}
}
catch (const std::exception& ex) {
// Log the error and update process metadata.
return;
}
}
In this code, the chain state is initialized from the DB, but in case we receive additional data from an API call or a callback, it can be merged into the serialized parameters and thus passed to the next step. I used JSON as the serialization format partly because it’s easy to merge two JSON objects.
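As a sketch of that merging step, here is how it could look with nlohmann::json; I am using that library purely for illustration, the actual merging code is not shown in this article.

#include <nlohmann/json.hpp>
#include <string>

// Merge callback data into the persisted step parameters
// before resuming the chain.
std::string mergeCallbackData(
    const std::string& storedParams,  // serialized step parameters
    const std::string& callbackBody   // JSON payload from the callback
) {
    auto params = nlohmann::json::parse(storedParams);
    // update() overwrites existing keys and adds new ones.
    params.update(nlohmann::json::parse(callbackBody));
    return params.dump();
}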
Overall, I believe we got a pretty clear and simple system, with nice separation of responsibilities and isolation of the business logic from the details of the RPC framework and other I/O. Working with the new framework became way more pleasant, as my colleagues have confirmed.
You can find detailed code examples in my GitHub repository, along with the public version of the library under the MIT license.
During the development of the library I had to use a couple of interesting metaprogramming tricks, along with the nice External Polymorphism idiom, but they probably deserve an article of their own.
Thank you for your attention.