Yury Bushmelev

Posted on Feb 16, 2024

Random thoughts about logs, delivery pipelines, and everything

#logs #rsyslog #devops

While writing my previous article about logs, I found a few random notes with my ideas on the subject. Here, I'm going to share the most useful of them. Interestingly (though not surprisingly), most of these ideas are about log reduction.

Let me start with the most thought-provoking idea. Usually, people consider log delivery to be a forwarding-only plane. E.g., we can have a group of Fluent Bit instances that pick up the logs. These logs are then delivered to a Logstash cluster, for example, and finally end up in Elasticsearch. This is typically the expected data flow, and it works as long as you have enough resources (hardware and money) to process that volume of logs.

Imagine you have thousands of servers with hundreds of microservices producing terabytes of logs per day. Do you still consider those logs useful?

Imagine now that one microservice was deployed with a bug that caused it to log five error messages per request. Let's say we have 10 other microservices depending on this one that are somehow affected by this bug. Those are emitting two error messages per request because of the issue. Then we have 100 instances of each affected microservice deployed in our infrastructure. So in the end, we have this:

1 broken service × 5 error messages + 10 dependent services × 2 error messages = 25 error messages per request

25 × 100 instances = 2500 error messages per request.

Impressive enough, I'd say. In real life, it's not that simple, because not every request is processed by all of those 11 services. Dependencies are not that simple, either. But you get the idea now. How useful are hundreds of those log copies per request?
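The arithmetic above can be checked with a few lines of Python. This is only a sketch of the simplified model (every request touches every service and every instance); the function and parameter names are made up for illustration.

```python
def errors_per_request(broken_msgs: int, dependents: int,
                       dependent_msgs: int, instances: int) -> int:
    """Error messages per request in the simplified model from the text."""
    # One broken service plus its dependents, per set of instances.
    per_instance_set = 1 * broken_msgs + dependents * dependent_msgs
    return per_instance_set * instances


total = errors_per_request(broken_msgs=5, dependents=10,
                           dependent_msgs=2, instances=100)
print(total)  # 2500
```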

Here comes the main idea…

Log delivery control plane

It's becoming obvious to me that the ability to manage log delivery based on metrics and/or events is a really useful feature! This is where we might introduce the log delivery control plane.

I see at least two reasons to have such an entity in the infrastructure. Firstly, it should allow for a reduction in the overall amount of logs. Secondly, it should give some operational abilities to control the log flow.

Let's start with the operational abilities first.

Manage the log flow

Below are some cases we might want to handle during day-to-day operations:

Enable or disable logs globally or by an attribute.
Rate-limit logs globally or by an attribute.

This may use the following data as an attribute:

a service name
the service's environment (dev/staging/prod)
the service's location (country, city, or datacenter)
a log message's attribute (severity, etc.)

So if a service goes mad, it's possible to quickly disable its log collection across the whole fleet, or maybe just on staging. We may also want to drop log messages with a severity lower than warning by default. Then we can quickly enable or disable debug logs when we need them.
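To make the enable/disable cases more concrete, here is a minimal sketch of how a collector could evaluate rules pushed by the control plane. Everything in it is an assumption for illustration: the `Rule` shape, the first-match-wins order, the attribute names, and the severity scale are hypothetical, not something an existing tool provides.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical severity scale; unknown severities are never dropped by threshold.
SEVERITY_RANK = {"debug": 0, "info": 1, "warning": 2, "error": 3}


@dataclass
class Rule:
    action: str                                 # "drop" or "pass"
    match: dict = field(default_factory=dict)   # attribute -> required value
    below_severity: Optional[str] = None        # match only below this level

    def applies(self, record: dict) -> bool:
        for key, value in self.match.items():
            if record.get(key) != value:
                return False
        if self.below_severity is not None:
            rank = SEVERITY_RANK.get(record.get("severity"), len(SEVERITY_RANK))
            return rank < SEVERITY_RANK[self.below_severity]
        return True


def should_deliver(record: dict, rules: list) -> bool:
    """First matching rule wins; records matching no rule are delivered."""
    for rule in rules:
        if rule.applies(record):
            return rule.action == "pass"
    return True


rules = [
    Rule("drop", {"service": "checkout", "env": "staging"}),  # service gone mad
    Rule("pass", {"service": "payments"}),                    # debug temporarily on
    Rule("drop", below_severity="warning"),                   # default threshold
]
print(should_deliver({"service": "search", "env": "prod", "severity": "info"}, rules))
```

The rule order is the design choice that matters here: the temporary "pass" rule sits above the default threshold, so removing it from the list switches debug logs off again.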

Speaking of rate limiting, I have a strong opinion that it should be enabled everywhere by default. It's usually hard to select the proper numbers, though.
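One common way to implement such a limit is a token bucket. The sketch below is not tied to any particular log collector; the class name, the `rate`/`burst` parameters, and the injectable `clock` are assumptions made for the example. Dropped messages are counted so that the number can be exposed as a metric.

```python
import time


class TokenBucket:
    """Allow `rate` messages per second on average, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)
        self.updated = clock()
        self.dropped = 0  # expose this as a metric

    def allow(self) -> bool:
        now = self.clock()
        # Refill proportionally to the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        self.dropped += 1
        return False


# One bucket per attribute value, e.g. per service name (hypothetical numbers).
buckets = {}

def allow_for(service: str) -> bool:
    bucket = buckets.setdefault(service, TokenBucket(rate=1000, burst=2000))
    return bucket.allow()
```

Picking `rate` and `burst` is exactly the hard part mentioned above; the `dropped` counter at least shows how often a chosen limit actually bites.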

Reduce the log amount

Do you remember those 2500 messages coming from 100 instances and telling you about the same bug? Let's see what we can do to reduce the number of logs without reducing observability significantly.

Kind of DISCLAIMER:

The ideas below are not a general solution suitable for everyone. Consider your requirements (e.g., compliance and/or security) before applying anything described below.
It becomes really important to provide an easy-to-use UI to enable or disable logging for your developers. For example, they should be able to disable the automation if there is an ongoing incident.
It's assumed further that the metrics collection is already implemented in the infrastructure.

Sampling

I don't like logs that are collected when everything is fine. Nobody is going to read those logs, I believe. But they still consume your storage and waste your CPU cycles.

The simplest solution that comes to mind is to stop logging everything 24×7. As we have metrics, we can say if a deployment is good or bad based on them.

As long as things are good, we can have log collection disabled by default. Then we can enable logging for 10–15 minutes at a random point in time to get a sample for analysis. How long you can go without logging depends heavily on the nature of your logs. Sampling at least once per hour might be a good start.

Moreover, we should enable logging for 10–15 minutes after a deployment, because this is when errors are most likely to show up. Maybe we don't need to enable logging for everything everywhere, but just for the affected services (the deployed service and its dependencies, at least).

Also, we should enable logging if we see something wrong in the metrics. Here, I assume that on a 100-instance fleet, you'll see the same issue again very soon.
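The three triggers above (a random sample window, a recent deployment, a metrics alert) can be combined into one decision function. This is a rough sketch under my own assumptions: timestamps are plain seconds, the window is 15 minutes, the sampling period is one hour, and all names are hypothetical.

```python
import random
from typing import Optional

WINDOW_S = 15 * 60   # how long collection stays enabled
PERIOD_S = 60 * 60   # take at least one sample per period


def pick_sample_start(period_start: float, rng: random.Random) -> float:
    """Pick a random start so that the whole window fits inside the period."""
    return period_start + rng.uniform(0, PERIOD_S - WINDOW_S)


def collection_enabled(now: float, last_deploy: Optional[float],
                       metrics_alert: bool, sample_start: float) -> bool:
    """Decide whether logs should be collected at time `now`."""
    if metrics_alert:
        return True  # something looks wrong in the metrics
    if last_deploy is not None and 0 <= now - last_deploy < WINDOW_S:
        return True  # still inside the post-deployment window
    return 0 <= now - sample_start < WINDOW_S  # inside the random sample window


rng = random.Random()
start = pick_sample_start(period_start=0.0, rng=rng)
print(collection_enabled(now=start + 60, last_deploy=None,
                         metrics_alert=False, sample_start=start))
```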

Service dependencies

Now let's think about service dependencies. In the example above, we had 10 dependent services, each adding at least two error messages of its own per request. Imagine we could verify that a service depends on another service that has a known issue at the moment. Then we could skip collecting logs from the dependent services until the incident is resolved. That is, we'd have 5 × 100 = 500 error messages per request instead of 2500.

One may argue that a dependent service issue can be hidden in this way. I'd say you'll see it immediately after the original incident resolution. That may prevent you from fixing all visible issues at once, though. It's up to you to decide what is more important in your case.
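Here is a sketch of the dependency check, assuming the dependency graph is available as a plain mapping from a service to the services it calls. The graph source, the service names, and the function name are all hypothetical. A service is muted if it depends, directly or transitively, on a service with an open incident; the broken service itself keeps logging.

```python
def suppressed_services(depends_on: dict, incidents: set) -> set:
    """Services whose logs can be skipped while the given incidents are open."""

    def reaches_incident(service: str, seen: set) -> bool:
        # Depth-first search; `seen` protects against dependency cycles.
        for dep in depends_on.get(service, ()):
            if dep in seen:
                continue
            seen.add(dep)
            if dep in incidents or reaches_incident(dep, seen):
                return True
        return False

    return {
        svc for svc in depends_on
        if svc not in incidents and reaches_incident(svc, {svc})
    }


# The example from the text: one broken service and 10 dependents.
depends_on = {"core": []}
depends_on.update({f"dep-{i}": ["core"] for i in range(1, 11)})

muted = suppressed_services(depends_on, incidents={"core"})
remaining = 5 * 100  # only the broken service: 5 messages × 100 instances
print(len(muted), remaining)  # 10 500
```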

"Pre-shooting" buffer

Modern cameras have a neat feature. You can shoot some pictures, then go back in time and select a photo that was taken a few moments BEFORE you pressed the button. What if we apply the same idea to log delivery?

Imagine our log collector has a circular buffer that stores incoming messages. As long as there is no error message, we don't deliver anything. Older messages are silently dropped. Well, not really silently. The log collector counts everything and exposes the counters as metrics.

Boom! An error message is received! All messages will be delivered immediately, and delivery will continue until no errors are detected for a certain period of time.

It's really that simple. The biggest issue here is a long-term log storage requirement you may have. In that case, this method can still be implemented for the short-term storage delivery route only.

It'd be useful to have the ability to switch the log collector's operation mode quickly. I.e., it can do buffering by default, but can be switched to the normal delivery mode for 15 minutes after a deployment.
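The buffer described above fits in a few dozen lines. Below is a minimal sketch, assuming records are dicts with a `severity` key and that `deliver` is whatever callable ships a record downstream; the class and method names are made up. `deliver_until` is the hypothetical switch to normal delivery mode, e.g. for 15 minutes after a deployment.

```python
from collections import deque


class PreShootBuffer:
    def __init__(self, capacity: int, quiet_period: float, deliver):
        self.buffer = deque(maxlen=capacity)  # circular buffer
        self.quiet_period = quiet_period      # seconds without errors
        self.deliver = deliver
        self.deliver_deadline = None          # deliver directly until this time
        self.dropped = 0                      # expose this as a metric

    def deliver_until(self, ts: float) -> None:
        """Switch to normal delivery mode until `ts`."""
        self.deliver_deadline = max(self.deliver_deadline or ts, ts)

    def push(self, ts: float, record: dict) -> None:
        if record.get("severity") == "error":
            # Boom! Flush what happened BEFORE the error, oldest first.
            while self.buffer:
                self.deliver(self.buffer.popleft())
            self.deliver_deadline = ts + self.quiet_period
        if self.deliver_deadline is not None and ts <= self.deliver_deadline:
            self.deliver(record)
            return
        self.deliver_deadline = None
        if len(self.buffer) == self.buffer.maxlen:
            self.dropped += 1  # the oldest message is about to be overwritten
        self.buffer.append(record)


out = []
buf = PreShootBuffer(capacity=2, quiet_period=10, deliver=out.append)
for ts, sev, msg in [(0, "info", "a"), (1, "info", "b"), (2, "info", "c"),
                     (3, "error", "d"), (5, "info", "e"), (20, "info", "f")]:
    buf.push(ts, {"severity": sev, "msg": msg})
print([r["msg"] for r in out], buf.dropped)  # ['b', 'c', 'd', 'e'] 1
```

Message "a" is lost because the buffer only holds two records, and "f" stays in the buffer because it arrives after the quiet period has passed.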

Reduce a message size

This part is not really related to the control plane idea, but I decided to include it here to complete the log reduction topic. Consider it a bonus for those who have read this far.

Imagine we have 100k JSON log lines per second from our infrastructure. In every message, we have an “is_complete” Boolean field, which is false in about 75% of cases. Let's do some calculations.

len('"is_complete": false') × 100 000 msg/s × 75% = 20 × 75 000 = 1 500 000 bytes/s

1 500 000 bytes/s × 86 400 s/day = 129 600 000 000 bytes/day ≈ 120 GiB/day

So, just having the field costs us about 120 GiB of traffic and storage daily (without compression, replicas, or indices). As this is a Boolean field, we can stop adding it when it's false and save those 120 GiB per day. Moreover, there is usually a comma before or after the field. Add another few GiB per day!
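Here is a small sketch of the "don't emit the field when it's false" idea, using Python's standard `json` module. The helper names and the sample record are made up, and the consumer is assumed to treat a missing field as false. Note that with the default `json.dumps` separators the saving is 22 bytes per message rather than 20, because the comma and a space go away too.

```python
import json

FALSE_BY_DEFAULT = ("is_complete",)  # hypothetical list of such fields


def encode_compact(record: dict) -> str:
    """Serialize a record, omitting default-false fields that are False."""
    slim = {k: v for k, v in record.items()
            if not (k in FALSE_BY_DEFAULT and v is False)}
    return json.dumps(slim)


def decode(line: str) -> dict:
    """Parse a record, restoring omitted default-false fields."""
    record = json.loads(line)
    for key in FALSE_BY_DEFAULT:
        record.setdefault(key, False)
    return record


record = {"msg": "ok", "is_complete": False}
full = json.dumps(record)
compact = encode_compact(record)
saved = len(full) - len(compact)
print(compact, saved)  # {"msg": "ok"} 22
```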

Out of curiosity, let's calculate how much data a single char generates:

1 byte × 100 000 msg/s = 100 000 bytes/s = 8 640 000 000 bytes/day ≈ 8 GiB/day

Remember, every single character you have in your logs costs you some traffic and storage!
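Both calculations can be reproduced with a tiny helper, which is handy for plugging in your own numbers. The function name is made up; sizes are uncompressed bytes, as in the text.

```python
SECONDS_PER_DAY = 86_400
GIB = 2 ** 30


def daily_bytes(bytes_per_message: int, messages_per_second: int,
                fraction: float = 1.0) -> int:
    """Uncompressed bytes per day for a fragment present in `fraction` of messages."""
    return round(bytes_per_message * messages_per_second * fraction * SECONDS_PER_DAY)


field = '"is_complete": false'  # 20 characters
per_field = daily_bytes(len(field), 100_000, fraction=0.75)
per_char = daily_bytes(1, 100_000)

print(per_field, round(per_field / GIB, 1))  # 129600000000 120.7
print(per_char, round(per_char / GIB, 1))    # 8640000000 8.0
```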

If you think 100k msg/s is a lot, it's not. I saw 250k msg/s and even more during peak load at Lazada. I know a few companies where it's even higher.

Impressed? Take the numbers with a grain of salt. Traffic and storage are typically compressed, so the numbers are 2–10 times lower. On the other hand, storage is often replicated and has indices to speed up the search, so the numbers grow again by some factor.

Implementation

So you feel enthusiastic and would like to introduce the log delivery control plane into your infrastructure. Unfortunately, I'm not aware of any ready-to-use solution that can do most of the things above. I believe FAANG-level companies have something similar developed in-house and tied specifically to their infrastructure and needs. So you can do this as well. Check the features of your orchestration and configuration management engines. Maybe they're good enough to start with. If you asked me to implement such tooling, I'd go with rsyslog, Puppet, and Choria, mostly because I know this software very well, and it's flexible enough.

We're engineers here, right? ;)
