This article was originally posted on Bizongo Engineering
As DevOps Engineers, we’re often asked what we do on a daily basis. My go-to response: whatever it takes to keep the organization’s infrastructure running flawlessly. Providing a dashboard that gives everyone a bird’s-eye view of vital metrics is one of those things.
You cannot guarantee reliability if you don’t know when things are broken.
In simple terms, it means observing applications for key parameters and collating the data to spot and troubleshoot errors and slowness, thereby improving uptime in adherence to SLAs.
The goal is to monitor key parameters across the spectrum:
- Host Metrics (CPU, Memory, Load, Disk IO, Network IO, etc)
- Application Metrics (Throughput, Response Time, Anomalies, Background Jobs like Sidekiq, Error Rates, etc) for APIs
- Have data collected in a time-series dashboard to analyse trends.
- Trigger alerts if any metric exceeds or drops below certain thresholds.
To get any of the above up and running quickly, we realised we needed stringent feature criteria. Here’s a gist of what we were looking for:
- Host and application metrics in an easy-to-view dashboard.
- Alerting via Email, Slack & other 3rd Party Integrations such as JIRA, GitHub and custom Web-hooks.
- Data Retention & Privacy to be compliant with regulations.
- Pricing Models along with applicable taxes.
Bizongo’s previous provider, Site24x7, while comprehensively featured, had multiple issues and didn’t fit our requirements in the long run.
Considering the criteria listed above, we ruled out open-source, self-hosted solutions such as Prometheus and Netdata for now, since building, debugging and maintaining them would be additional overhead.
During this activity, we had comprehensive discussions with a few SaaS providers, along with proof-of-concept demos. Our first experiment was NewRelic, highly recommended across the software industry. However, despite its best-in-class features, it turned out to be economically infeasible for Bizongo’s scaling infrastructure.
The next option we considered was DataDog, which very nearly solved our problems but raised concerns similar to NewRelic’s, so we didn’t pursue it in the long run.
Eventually, we discovered AppSignal and, after a preliminary talk with their team, proceeded to evaluate it on a 30-day trial. We installed it on our staging environment and subsequently on one of our production service backends to check its compatibility with our application, and thankfully, nothing went south. Meanwhile, we also raised support queries regularly, as we had with all of the providers above.
By the end of the trial period, we concluded it fulfilled all our requirements for performance monitoring. One notable drawback is that it supports only Ruby and Elixir applications as of now, but since we’re primarily a Ruby-on-Rails shop, we felt this could be overlooked.
Meanwhile, features such as affordable pricing based on requests rather than servers, frequent releases of new versions, a minimum 30-day data retention, and stellar support (queries are routed directly to the engineers who build the product rather than a technical help desk) won us over. Setting it up was straightforward:
- For backend Ruby apps — via the AppSignal gem, authenticated with a push API key (a best practice they recommend is to not commit it in code but to define it as the environment variable APPSIGNAL_PUSH_API_KEY).
- Further tweaking can be done via config/appsignal.yml in each app to define environments (qa, staging, production) and parameters that should be included or excluded.
- Once the app server (Puma, Unicorn, Passenger, etc.) is restarted, an agent service bundled as a gem extension starts relaying metrics.
- For frontend / non-Ruby apps — via a standalone agent installed using OS package managers (we’ve automated this via Ansible for newly provisioned servers).
- This can be configured via /etc/appsignal-agent.conf; the agent runs on localhost UDP port 8125 to relay metrics.
- Additionally, since version 1.0 of the Ruby gem, custom dashboards can be created in YAML format (v2.8.x provides out-of-the-box metrics for Puma and Sidekiq without any setup).
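For reference, here’s a minimal sketch of what such a config/appsignal.yml can look like. The app name and filtered parameters below are illustrative, not our actual config; check AppSignal’s documentation for the full option list.

```yaml
# config/appsignal.yml — illustrative sketch
default: &defaults
  # Read the push API key from the environment instead of committing it
  push_api_key: "<%= ENV['APPSIGNAL_PUSH_API_KEY'] %>"
  name: "my-rails-app"  # hypothetical app name

qa:
  <<: *defaults
  active: true

staging:
  <<: *defaults
  active: true

production:
  <<: *defaults
  active: true
  # Keep sensitive request parameters out of the collected data
  filter_parameters:
    - password
    - authentication_token
```

The file is ERB-processed, which is what makes the environment-variable pattern above work, and each top-level key maps to one of the environments mentioned earlier.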
- Once you set it up, here’s how the interface looks (images for representation only):
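Agents listening on UDP port 8125, as mentioned above, conventionally speak the StatsD line protocol. Here’s a minimal sketch of pushing a custom metric to such a listener from plain Ruby; the metric name is hypothetical, and the exact wire format the AppSignal agent expects is an assumption here — the gem’s own helpers are the supported route.

```ruby
require "socket"

# Format a metric in StatsD line format ("name:value|type").
# Type "g" is a gauge, "c" a counter. NOTE: treating the AppSignal
# agent as a StatsD listener is an assumption in this sketch.
def statsd_payload(name, value, type: "g")
  "#{name}:#{value}|#{type}"
end

# Fire-and-forget UDP send to the locally running agent.
def send_metric(name, value, host: "127.0.0.1", port: 8125)
  socket = UDPSocket.new
  socket.send(statsd_payload(name, value), 0, host, port)
ensure
  socket&.close
end

# Hypothetical gauge: current Sidekiq queue depth.
send_metric("sidekiq.queue_depth", 42)
```

Since UDP is connectionless, the send succeeds even if no agent is listening, so instrumentation like this never blocks or crashes the app.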
Right after adopting AppSignal, we spotted a few slow-performing APIs and fixed them, improving response times from a couple of seconds to milliseconds. We also now have a platform monitoring our infrastructure regardless of the number of servers, since we’re billed only for traffic. All this with precise alerting, historical data, nifty 3rd-party integrations (Slack, JIRA, GitHub, PagerDuty, etc.) and actionable metrics.
We hope the learnings in this post guide you one step closer to making your monitoring more reliable. Are there any other ideas you’d like to add to make it even better?