The importance of observability in modern software development cannot be overemphasized. This rule of engagement is as important as the development endeavour itself. Checking under the hood is key to every performance optimization drive: being able to tune services with the right insight into performance metrics, and into what drives those numbers, is as critical as the engineering effort itself.
This belief was further strengthened when we recently embarked on an audacious attempt to modernize the services that, until last week, powered the funds transfer capability of a financial institution. The legacy estate consisted of a proprietary solution on IBM infrastructure, which handled about 50% of the organisation's 5M-plus transactions per day, plus monolithic Java services distributed across six fairly well-resourced Linux servers (VMs) for performance and resilience.
Our goal was to collapse all of this into microservices and add features that had become critical to the business following a new digital product recently introduced to the market. The product was gaining wide acceptance in the industry, and the business needed to move fast to close out its existing limitations.
In total, 11 microservices were designed and developed to be the new heart of the funds transfer feature for the financial institution's digital channels (many thanks to the brilliance of Emmanuela Ike, Uchechi Obeme, Michael Akinrinmade, Ayodele Ayetigbo MBA). The first deployment targeted the USSD channel alone, ahead of the introduction of traffic from the mobile app channel, which carries higher transaction volume and value. The first deployment seemed largely successful, as we didn't receive any major complaints from customers; it was USSD, after all, and users hardly see red error messages. A seemingly successful month-long pilot without major complaints encouraged us to push for a full-scale deployment. Alas, we were mistaken, and very mistaken at that.
The introduction of mobile traffic failed within a few hours as several performance issues were revealed. We had a few guesses as to what went wrong, but one of the most brilliant DevOps engineers you can find around, Azeta Spiff, had earlier highlighted the need to implement Application Performance Monitoring (APM) from the Elastic Stack to support post-implementation management of the services. His brilliant idea and this beautiful tool provided the observability metrics that were key to unearthing bottlenecks, most of which lay outside the realm of the code itself: contention at the database layer, failures of external dependencies, and, of course, a few services that needed reworking for optimisation.
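For context, instrumenting a Java service with the Elastic APM agent is essentially a one-line change at JVM startup. The jar path, service name, and APM server URL below are illustrative placeholders, not our actual configuration:

```shell
# Attach the Elastic APM Java agent at JVM startup.
# Paths, names and URLs here are placeholders for illustration.
java -javaagent:/opt/elastic-apm-agent.jar \
     -Delastic.apm.service_name=funds-transfer-service \
     -Delastic.apm.server_url=http://apm-server:8200 \
     -Delastic.apm.environment=production \
     -jar funds-transfer-service.jar
```

Once attached, the agent ships transaction, span and error data to the APM server with no application code changes, which is what made retrofitting observability onto already-built services practical.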
Armed with the insights from APM, we were able to deal with the reported issues methodically. In one finding, we leveraged database partitioning to shave over 15 seconds off the response time of a query; although the query had a low estimated cost, it still performed badly in our production environment. In another instance, we had to redesign the autoscaling metrics on our Kubernetes cluster to give the services the resources they needed to reach cruising altitude.
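To illustrate the kind of fix involved (the table and column names here are hypothetical, not our actual schema), range-partitioning a large transactions table by date lets the database prune the scan down to the relevant partition instead of the whole table, e.g. in PostgreSQL:

```sql
-- Hypothetical schema: monthly range partitions on the timestamp
-- column, so date-filtered queries touch a single partition.
CREATE TABLE transactions (
    id         BIGINT NOT NULL,
    created_at TIMESTAMP NOT NULL,
    amount     NUMERIC(18, 2),
    status     VARCHAR(20)
) PARTITION BY RANGE (created_at);

CREATE TABLE transactions_2024_03 PARTITION OF transactions
    FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');
```

This is also a good example of why a low planner cost estimate can mislead: the estimate says little about cache misses and I/O on a huge table under production load, which is exactly what the APM traces exposed.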
Our target was, of course, an average latency below 500ms for most of the services, especially where no dependencies outside the organisation's network were involved. The past two weeks of putting the performance icing on what has been a very audacious adventure in software engineering were rewarding.
Over the weekend of March 15, 2024, we successfully rolled out the mobile app traffic again, and this time we couldn't have been prouder of the result. From the APM insights we could see success rates above 99% on average across the microservices, with average latency for transaction completion under 3 seconds (inclusive of calls to dependencies outside the organisation's network, which accounted for about 2 seconds of that). This is a huge performance optimisation, not to mention the cost savings the organisation will derive from the new implementation, as resource utilisation by the new services has been a fraction of that of the legacy capabilities.
Elastic APM, part of the Elastic Stack, is a fantastic tool for observability and will save you the time and money spent groping in the dark when faced with performance-related issues. Its open code base makes it even cooler, as there is a large community supporting it, and we can all contribute insights from our own environments to make it more robust. Every software engineer and technology shop wishing to deploy a performant system, particularly with microservices, must place the right value on observability, and Elastic is making that journey seamless and affordable. So why run blind when you can easily check under the hood with APM?
Top comments (3)
Welcome, and thank you for sharing your success!
I like the Elastic Stack too; I wrote an article about its installation and usage on Kubernetes clusters.
How long have you had APM active in production, if at all? Did it put pressure on the system?
Hello Benoit
Thank you and apologies for the delayed response.
The APM agent is deployed as part of every container running in production. We use it to track the containers' performance metrics so we can identify any issue for resolution. So far we haven't noticed any real performance overhead from the agent.
The only issue we had was a conflict with how the multi-processing feature of one of the services worked, but we easily fixed that with a configuration parameter on the APM agent.
Regards
Great, thanks !