Many SaaS providers will happily sell you a turn-key log management system, and in the early days of a startup when you value time over money, purchasing one makes a lot of sense. AWS Cloudwatch logs are another good solution if you are already hosted on AWS. With any of these paid services, the cost is eventually going to start raising a few eyebrows and it might become prudent to build your own in-house. The Elasticsearch-Logstash-Kibana cocktail is the most popular deployment for this, but one SaaS provider I found calculated the TCO over running a moderately-sized ELK stack for a year at nearly $500,000!
We've developed our own system using Fluentd, AWS S3, and AWS Athena to ingest, store, and query our logs. With this combination, I estimate that ingesting 640GB per day of log data and keeping it searchable over the course of the entire year would cost only $79,248 — a savings of 83 percent! Not to mention the time spent monitoring, scaling, and operating the cluster. The performance of our system is good; queries that return the data for a single request complete in less than 3 seconds. Searching the full text of our logs for specific markers or references can take longer but it's still very manageable.
This system not only gives us fast access to the logs that we do have, but it also enables us to log additional information per request and keep logs searchable for longer than we otherwise would if we had to pay more for storage. But having this additional data is invaluable for helping us solve problems and answer questions in a rapid and responsive way.
In this post, I'm going to walk through what we did, provide some code and discussions to show our approach, and talk about opportunities for future work that have opened up. I'm also going to mention some "gotchas" and road-blocks that we encountered along the way.
There are four stages to a logging system: ingestion, aggregation, storage, and query. We're going to talk about them in reverse order.
Assuming we have our logs in a structured format, what would be an ideal interface to gain access to it? While Kibana is the preferred choice for many companies, Kibana provides a lot of things that we don't really need. Pretty charts and graphs are nice, but when debugging an error state, what we really need is quick access to our logs. Additionally, exposing something that resembles SQL allows writing sophisticated queries to answer questions as they emerge.
Amazon Athena is the technology we settled on to query our logs. Athena searches data that already exists in S3, and it has a few properties that make it ideal for the task at hand:
Athena is really inexpensive. We pay only for the data that we scan in the process of completing a query: only $5 per terabyte at the time of writing.
Athena understands and can query the parquet file format out of the box. Parquet is column-oriented, which means that it is able to scan data only in the columns we are searching. This is a huge boon to us since the vast majority of our searches are "get me all of the logs for this single request id". That's only a 128bit UUID, and Athena is able to quickly determine which chunks of data contain the value under search, saving us both time and money. Additionally, parquet is compressible, and Athena is able to query it while in its compressed format, saving even more money on scanning costs.
We wrote a small custom log search engine to act as a front-end to Athena's interface for our team. This interface has fields, basic query parameters, a date selector to help generate the SQL-like query language that Athena uses, as well as a few endpoints that kick off a query immediately based on request uuid, pod, and time. We embed links to our tool in Datadog for rapid log analysis of poorly performing endpoints.
As mentioned, Athena queries data that is stored in S3. It's a durable, reliable medium that ensures millisecond access to our data when we need it. S3 also comes with another benefit. We can set up lifecycle rules to move files to lower cost tiers over time. These lifecycle rules can be set up with just a few clicks (or a few lines of AWS CloudFormation) and then it's done.
Aggregating the logs from our services turned out to require some work. Instead of using an Amazon product off the shelf, we opted to use Fluentd to ingest, aggregate, and then process our logs. The configuration itself is very simple. We created a docker image that built Fluentd with
libjemalloc to keep the memory usage in check and
lib-arrow to generate the compressed data in Parquet format. This container could then be deployed to our standard ECS cluster and then treated like any other service.
Now that we have our query, aggregation, and storage services ready, the last step is to get data from our application into the Fluentd service. There are a couple of options for this. One was to set the docker logging provider to Fluentd and point it at the aggregation cluster we have deployed. This allows us to log to stdout, which would keep the logger relatively simple. However, we decided it was worth it to make our application aware of the aggregation service and install a client-logging library for Fluentd. Fluentd maintains a Ruby version of their logger, which works as a drop-in replacement for the logger on rails applications: fluent-logger-ruby, but this is language-agnostic and could be used anywhere.
Unfortunately, there is one area where we cannot rely on this system for logging — it's not able to log to itself! The ECS agent requires that the new container be able to log to its log provider immediately upon boot. This is a problem when the logging service itself needs to boot, initialize, and then pass health checks to start taking requests.
One way around this would be to bake a "forwarding" ECS service into the AMI for every ECS cluster instance. This forwarder could receive and buffer logs, then send them on to a private hostname configured to point to our new logging service.
Athena has a "sweet spot." Because of the columnar format of data and the ability for Athena to read the headers out of the parquet file before it fetches the whole object, Athena works best when the size of scanned files is around 200MB. Once it gets larger, Athena is missing out the ability to parallelize. When it is smaller, each query worker spends a lot of time reading and parsing the initial headers instead of scanning the actual data. We had to tweak our time keys and chunk sizes to get this just right.
There is a balancing act between chunk size and capacity. The more aggregation containers we were running, the more log pressure we needed to get the right amount of saturation to create files in the sweet spot. In our experience, the cluster really wants to be run at about 80% CPU. Higher than that and we risk the TCP log forwarding protocol providing upstream back-pressure to the services logging to it. Less than that and we end up creating log files that are not big enough to get full performance out of Athena. Since our use-case is an internal developer-facing tool and the risk of slowing down our main application is not worth the slightly better throughput, we have opted to slightly over-provision our containers. A second Fluentd container layer serving as a forwarder in between our application servers and our aggregation containers could help buffer this substantially. This will give us great performance without risk of back-pressure at the cost of slightly more complexity and cost overhead.
We first attempted to create an AWS glue table for our data stored in S3 and then have a Lambda crawler automatically create Glue partitions for Athena to use. This was a bad approach.
Partition projection tells Athena about the shape of the data in S3, which keys are partition keys, and what the file structure is like in S3. We partition our data by service, shard, year, month, day, and hour. In our testing, we found that partition projection was essential to getting full value out of Athena. Without it, many of our queries would take up to thirty seconds as Athena consulted our AWS glue tables to determine which data to scan. With partition projection, no such lookup was necessary, and the query could begin immediately, scanning only the relevant data.
Keep in mind that partitions are great for scoping down the amount of data to access, but by partitioning too aggressively, we noticed our chunk sizes dropping below the "sweet spot" discussed above. We noticed a similar problem partitioning our data by EC2 instance. Clumping all hosts together counterintuitively made our queries to the system a lot faster.
The power of this system is enormous. We push a staggering amount of log data into it, pay relatively little for it, and query it in seconds. Additionally, the ingestion is managed by a service that we run and provision ourselves. So it is easy to autoscale down when demand is low and scale up when we need the additional horsepower to keep up with demand. But now that we have this system running, other use cases have presented themselves that look promising:
- Analytics — We are able to log to Fluentd with a special key for analytics events that we want to later ETL and send to Redshift. We can create a new rule in our Fluentd config to take the analytics tag, and write it into the proper bucket for later Athena queries to export to Redshift, or for Redshift itself to query directly from S3 using Redshift Spectrum.
- Request correlation — At Aha!, every request to our service is issued a request_uuid. When we queue a background job or make call to another service, those services can all log using the same framework and can include that originating request_uuid in the appropriate field. This means we can trace everything that was involved or triggered from a single user request through all of our backend services through a single query interface and interlace all of the logs together in one timeline.
- Finding contemporary logs — Using this system, it is trivial to search for all logs from all requests with a very small timestamp window. If we know that we had a deadlock around a certain row in a database, this would allow you to find all processes that attempted to update that row, and then trace back to find the controller action that caused the problem.
If finding innovative solutions to problems like this sounds interesting, come work with us as we build the world's #1 product roadmapping software. We are not only creating a powerful platform for the biggest companies in the world to plan and manage their strategy, we are also building sophisticated tooling to help our talented and friendly development team continue to excel.
1: Assuming 35 Fargate container instances with 1.25 VCPU and 2GB memory each, 640GB ingested per day, 60-day S3 Standard Access storage, 335-day Infrequent Access storage, and 100 Athena queries per day each covering 100GB scanned. (This is a generous allowance.) Reserved ECS cluster instances drop this cost down further.