Why it's hard to build a metering pipeline for usage-based applications

AWS best practices mention the metering aspect in their SaaS Boost user guide.

In 2019, at an AWS re:Invent session, I noticed a very interesting diagram:


Many great companies rise to the occasion and tackle cloud-native deployment automation, monitoring, identity, and metrics, but somehow the metering and billing aspects have been neglected. A system of record built on top of a metering service is fundamentally different from one built on monitoring data.

In a microservices architecture, many companies have a dedicated metering service responsible for this functionality. From our experience, it's not easy to build such a service. Let's go over a few of the challenges of building a scalable and reliable usage metering service:

Count once and only once

- Make sure your customers are not overcharged (or undercharged).
- In a distributed environment, preventing duplicate records is not an easy task. You want the usage metering service to ignore duplicate records even when a client retries the same record; in other words, ingestion must be idempotent.
- Scaling to varied workloads is tricky. Usage patterns can fluctuate, and records must not be dropped.
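A minimal sketch of idempotent ingestion, assuming each event carries a client-generated idempotency key that is reused on retries (all names here are illustrative, not a real API):

```python
# Idempotent usage-event ingestion sketch. Replays of the same
# idempotency key are acknowledged but counted only once.
from dataclasses import dataclass

@dataclass(frozen=True)
class UsageEvent:
    idempotency_key: str  # unique per logical event, reused on client retries
    customer_id: str
    value: int

class Meter:
    def __init__(self):
        self._seen: set[str] = set()    # in production: a durable store, ideally with a TTL
        self.totals: dict[str, int] = {}

    def ingest(self, event: UsageEvent) -> bool:
        """Return True if the event was counted, False if it was a duplicate."""
        if event.idempotency_key in self._seen:
            return False  # safe to ack the retry without double-counting
        self._seen.add(event.idempotency_key)
        self.totals[event.customer_id] = self.totals.get(event.customer_id, 0) + event.value
        return True

meter = Meter()
event = UsageEvent("evt-123", "acct-1", 5)
meter.ingest(event)  # counted
meter.ingest(event)  # client retry: ignored, acct-1 still totals 5
```

In a real system the set of seen keys would live in a durable, replicated store, since the whole point is surviving process restarts and concurrent writers.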

Performance with large data scan ranges

Showing a user their usage over a longer period of time (for example, 6+ months) in a UI that loads quickly (under 1–2 seconds) requires a backend architecture and infrastructure built for scale, throughput, and low latency.

If you try to sum the total number of events across a large timeframe (for example, aggregated data over the last 12 months), a single query can quickly reach tens of millions of events. On a time-series database, the cost is going to be prohibitively high; in standard data warehouses, the latency is going to be very high.

Elasticsearch (or any other logs-based system), for example, cannot aggregate the last 6 months of per-customer data at acceptable latency.

You will have to run an aggregation on a cadence (usually daily) that sums up the events. This data pipeline is error-prone and can be tricky, and the price of an error directly impacts the customer and your revenue. Related challenges:

- Presenting usage in near real time, with auditable data (pointing to the exact transaction) that can be aggregated over multiple periods.
- Checking quotas in real time; for example, an account should not run more than 1,000 jobs in one day.
- Keeping costs under control at scale, where naive solutions become very expensive.
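The daily-cadence aggregation above can be sketched as a simple rollup: raw events are summed into per-customer daily buckets, and long-range queries then scan days instead of millions of raw rows (the tuple shapes and function names are illustrative):

```python
# Roll raw usage events up into (customer, day) buckets, then answer
# long-range usage queries from the buckets instead of the raw events.
from collections import defaultdict
from datetime import datetime, date

def rollup_daily(events):
    """events: iterable of (customer_id, timestamp, value) tuples."""
    buckets: dict[tuple[str, date], int] = defaultdict(int)
    for customer_id, ts, value in events:
        buckets[(customer_id, ts.date())] += value
    return dict(buckets)

def usage_over_range(buckets, customer_id: str, start: date, end: date) -> int:
    """Answer a 'usage over the last N months' query from the daily buckets."""
    return sum(v for (cid, day), v in buckets.items()
               if cid == customer_id and start <= day <= end)

events = [
    ("acct-1", datetime(2023, 1, 5, 10), 3),
    ("acct-1", datetime(2023, 1, 5, 12), 2),   # same day, merged into one bucket
    ("acct-1", datetime(2023, 6, 1, 9), 7),
]
buckets = rollup_daily(events)
total = usage_over_range(buckets, "acct-1", date(2023, 1, 1), date(2023, 6, 30))  # 12
```

The hard part in production is not the sum itself but making the pipeline exactly-once and backfill-safe, which is exactly where the errors described above creep in.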

Alerts and notifications

- Automating movement between pricing tiers
- Notifying when a large customer is using the product less than it did the week before
- Sending a Slack message to the account manager when a quota is exceeded
- Letting your customers define alerts based on their own cost and usage
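Two of the alert rules above can be sketched as pure checks over the aggregated totals; the Slack call is stubbed out here, and all names and thresholds are assumptions for illustration:

```python
# Evaluate usage alerts against aggregated totals and notify on match.
QUOTA_JOBS_PER_DAY = 1000

def quota_exceeded(jobs_today: int, quota: int = QUOTA_JOBS_PER_DAY) -> bool:
    return jobs_today > quota

def weekly_drop(this_week: int, last_week: int, threshold: float = 0.5) -> bool:
    """Flag a customer using the product much less than the week before."""
    return last_week > 0 and this_week < last_week * threshold

def notify_slack(message: str) -> None:
    # stub: in practice, POST this to a Slack incoming-webhook URL
    print(f"[slack] {message}")

if quota_exceeded(1200):
    notify_slack("acct-1 exceeded the 1000 jobs/day quota")
if weekly_drop(this_week=40, last_week=200):
    notify_slack("acct-1 usage dropped sharply week-over-week")
```

Keeping the rule evaluation separate from the notification channel makes it easy to let customers define their own rules later.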

Out-of-the-box dashboards

Usage data is critical to running a cloud business. Many personas in the organization want a view of the usage data, for example to:

- Track major customers
- Analyze feature usage over time
- Analyze churn
- Predict the impact of pricing plan changes

Developer artifacts

- In a microservices environment there are multiple programming languages, deployment models, and so on. Integrating with a usage service requires multiple SDKs with constant maintenance.
- The usage service will be used by multiple services and different teams. A well-documented, robust API simplifies each team's onboarding.
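A thin SDK over such an API might look like the following sketch; the class name, endpoint path, and payload shape are assumptions, not a real library:

```python
# Hypothetical thin client for a metering-service HTTP API.
import uuid

class MeteringClient:
    def __init__(self, base_url: str, send=None):
        self.base_url = base_url
        # transport is injectable so services and tests can supply their own
        self._send = send or self._default_send

    def _default_send(self, url: str, payload: dict) -> None:
        raise NotImplementedError("wire up an HTTP POST here")

    def record(self, customer_id: str, meter: str, value: int) -> dict:
        payload = {
            "idempotency_key": str(uuid.uuid4()),  # generated once, reused on retries
            "customer_id": customer_id,
            "meter": meter,
            "value": value,
        }
        self._send(f"{self.base_url}/v1/events", payload)
        return payload

sent = []
client = MeteringClient("https://metering.example.com",
                        send=lambda url, p: sent.append((url, p)))
client.record("acct-1", "api_calls", 1)
```

The small surface (one `record` call) is the point: the less each team has to learn, the faster the onboarding.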

Pricing plan models

Decouple engineering from product pricing decisions

While the engineering teams instrument and build the metering pipeline, the product team needs to innovate on pricing plans and make data-driven decisions. Democratizing the usage data enables product teams to choose the right pricing dimensions and test new pricing models against real data.

With Metering-as-a-Service, you get a robust, reliable, cost-effective, fully managed metering service.

When products adopt consumption-based pricing models, they encounter what may seem like a simple task: metering and tracking usage across their customers. It turns out to be a heavy lift.
