Master the art of log management: tips for handling high-volume logs

#devops #logs #observability

Managing logs in the modern world is a serious challenge. Logs, records of an action that took place at a point in time, are constantly generated by applications, servers, and network devices. These individual messages about events and errors give you information and context when things go wrong, which is genuinely helpful. But as your infrastructure grows or your application becomes more complex, you may eventually feel buried under a mountain of never-ending information.

A log management solution can give you visibility into the performance data you need to reduce mean time to detection (MTTD) and mean time to resolution (MTTR). Monitoring helps you navigate this constant inflow so you spend less time searching and more time solving problems. But how can you best sort through this mountain of data to get the insights you need when you encounter an issue?

In this post, I’ll share tips for handling high log volume and scaling your log management.

Factors contributing to high log volume

First, let’s understand why you might be running into a high volume of logs. There are many possible reasons, but some common ones are:

The growing complexity of your application architecture: As your architecture becomes more complex, the number of interactions between its components increases. More interactions mean more events, and more opportunities for your customers to encounter errors. For example, your app might use hundreds of microservices that interact with each other, generating massive amounts of log data.
Not using log rotation policies: You probably don't need, or want, to keep every log indefinitely. Instead, you’ll want to archive or compress older logs, or in some cases, delete them. A rotation policy sets rules for how large a log file can grow, how long you retain it, when to compress it, and when to notify stakeholders about rotation. For details on rotation, read about the lifecycle of application logs.
The granularity of information in your logs: If your logs are too detailed, you might run into issues creating, transmitting, and storing them. Depending on the situation, this can slow application response times or consume network bandwidth. It’s important to figure out what level of granularity is helpful and at what point it becomes an unnavigable soup.
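
As one illustration of a rotation policy, here is a minimal sketch using Python's standard logging module (an assumption; the post doesn't prescribe a language or tool). RotatingFileHandler rotates a file once it would exceed a size limit and keeps a fixed number of backups; assigning its namer and rotator attributes makes it compress each rotated file. The gzip_rotator and build_rotating_logger names are hypothetical, and the 10 MB limit and five backups are placeholder values, not recommendations.

```python
import gzip
import logging
import os
import shutil
from logging.handlers import RotatingFileHandler


def gzip_rotator(source: str, dest: str) -> None:
    """Compress the file being rotated out into dest, then delete the original."""
    with open(source, "rb") as src, gzip.open(dest, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(source)


def build_rotating_logger(path: str, max_bytes: int = 10_000_000,
                          backup_count: int = 5) -> logging.Logger:
    """Return a logger whose file rotates once it would exceed max_bytes.

    Rotated files become <path>.1.gz, <path>.2.gz, ...; files older than
    backup_count rotations are deleted automatically by the handler.
    """
    logger = logging.getLogger(f"rotating.{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep these records out of the root logger
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backup_count)
    handler.namer = lambda name: name + ".gz"  # rotated files get a .gz suffix
    handler.rotator = gzip_rotator             # compress instead of a plain rename
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

With a small max_bytes you can watch app.log.1.gz and app.log.2.gz appear next to app.log while the oldest archive is discarded. For age-based rather than size-based retention, the same module provides TimedRotatingFileHandler.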

It's important to understand the reasons behind your log volume, because they can help you decide how to manage it. While you might not be able to change the complexity of an entire application, you can take steps to reduce the complexity of the logs themselves.

Best practices for managing log volume

Perhaps the most important best practice is to forward your logs to a centralized location. This makes it easier to sort, search, and use the information your logs provide. You can do this with our application performance monitoring (APM) capability as well as with our infrastructure agent. Beyond keeping your logs in one place, there are some other best practices you should be aware of:

Standardize your log levels: Define the set of log levels your organization will use. For example, the New Relic infrastructure agent uses a subset of the industry-standard Syslog severity levels to simplify the categories you need to sort through.
Keep a consistent log format: For logs at the same level, define a standard for what information you do and don't log. It’s much easier to aggregate and sort through information that is uniform rather than inconsistent.
Log what’s valuable: Everything you log should have a purpose. From user events to application errors, record only the information you expect to be valuable, rather than capturing everything you possibly could. For example, Kubernetes health checks happen frequently, but they add little to your understanding of the system as a whole.
Consider using structured logging: People are fallible, and keeping free-text log messages consistent can be difficult. Structured logging helps by treating a log not as a written message but as a machine-parseable record of an event and its context. For more detail, read our documentation on log parsing or this blog post on setting up structured logging for New Relic with Python.
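
To make the structured logging and "log what's valuable" points concrete, here is a minimal sketch using Python's standard logging module. JsonFormatter emits every record as one JSON object with the same keys, and DropHealthChecks discards records for a health-check endpoint. Both class names, the context field, and the /healthz path are assumptions for illustration; in practice you might use an existing JSON logging library or your log forwarder's own parsing rules instead.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object with a fixed set of keys."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge structured fields passed as logger.info(..., extra={"context": {...}}).
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


class DropHealthChecks(logging.Filter):
    """Discard low-value records; here, assumed to be requests to /healthz."""

    def filter(self, record: logging.LogRecord) -> bool:
        return getattr(record, "context", {}).get("path") != "/healthz"
```

Through a handler with both attached, logger.info("request served", extra={"context": {"path": "/checkout", "status": 200}}) produces one JSON line with path and status as searchable fields, while the same call for /healthz is dropped before it is written.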

Also consider what you’ll need to know in a log when you’re diagnosing an issue. While none of these next points will help you generate fewer logs, they should help you generate more helpful logs:

Use a parseable log format: A consistent log structure helps you capture key information, including essentials like the date, time, and a description of an error rather than just a status code. If you keep a consistent format, these details will be easier to aggregate and dig into later. If you’re using a tool that handles parsing for you, choose one that lets you set custom log parsing rules if you need them.
Provide context alongside logs: A tool that shows where an error originated alongside the log itself can save you time. Learn more about how to simplify your troubleshooting with logs in context.
Ensure that you can filter logs the way you want: You'll obviously need to sort through logs, but think ahead about how and when you'll want to do that. You may be able to identify common problems or types of information that are relevant in multiple situations, not all of them emergencies. For example, you might filter logs that contain sensitive information, such as data that must be handled according to specific privacy or security requirements. Filtering is also useful when you audit your logs, because it lets you exclude events that don’t pertain to what you are checking. For more best practices on creating better logs, this blog post walks through some other considerations when managing logs.
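
Here is a minimal sketch of adding context and filtering sensitive data with Python's standard logging module, under the assumption that email addresses are the sensitive data you need to mask. logging.LoggerAdapter attaches fields such as a service name and request ID to every record, and RedactEmails is a hypothetical filter that masks matching text before a handler writes it. The regular expression is deliberately simple and will not catch every format of sensitive data.

```python
import logging
import re

# Deliberately simple pattern; real PII detection needs more than one regex.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class RedactEmails(logging.Filter):
    """Mask email addresses in the message before any handler writes it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = EMAIL.sub("[REDACTED]", record.getMessage())
        record.args = ()  # the message is now fully formatted
        return True       # keep the record, only the sensitive part is masked


def with_context(logger: logging.Logger, **context: str) -> logging.LoggerAdapter:
    """Attach fields such as service and request_id to every record."""
    return logging.LoggerAdapter(logger, context)
```

For example, with a handler formatter of "%(service)s %(request_id)s %(levelname)s %(message)s" and an adapter created with service="checkout" and request_id="req-42", calling warning("payment failed for %s", "jane@example.com") writes: checkout req-42 WARNING payment failed for [REDACTED].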

Techniques for managing log volume

Managing log volume is crucial for maintaining the health and performance of IT systems, and there are several techniques available that you may want to try.

Log throttling: Limit the number of logs created during a given time window. This keeps the flow of logs at a manageable level while still providing some information. You might use throttling for applications that generate an unusually high number of logs, such as debugging tools.
Log sampling: Like throttling, sampling reduces the number of logs you capture. Sampling can also be applied over a particular time window, but it selects logs throughout that window, either based on criteria you set or at random. Throttling, by contrast, simply stops capturing once the number of logs specified for the window is reached.
Dynamic log level adjustment: This technique automatically changes the log level based on your system’s needs. Raising or lowering the level changes how much data your logs generate. For example, the DEBUG level typically includes more granular detail than INFO.
Log analysis: Analyze your logs to find out why so many are being generated. Tools that use machine learning to normalize log data and detect patterns can help you pinpoint the sources of the volume. New Relic has a quickstart for log analysis that can help you get started right away.
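
The first three techniques can be sketched with filters from Python's standard logging module. ThrottleFilter uses a simple fixed time window, SampleFilter keeps a random fraction of lower-severity records while always keeping warnings and errors, and set_log_level is a hypothetical policy that switches to DEBUG while an error rate is high. The class names, thresholds, and the error-rate signal are assumptions; production systems often do this in a log shipper or agent rather than in application code.

```python
import logging
import random
import time
from typing import Callable, Optional


class ThrottleFilter(logging.Filter):
    """Pass at most `limit` records per `window` seconds and drop the rest."""

    def __init__(self, limit: int, window: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        super().__init__()
        self.limit = limit
        self.window = window
        self.clock = clock
        self.window_start = clock()
        self.count = 0

    def filter(self, record: logging.LogRecord) -> bool:
        now = self.clock()
        if now - self.window_start >= self.window:
            # The current window has expired: start a new one and reset the count.
            self.window_start = now
            self.count = 0
        self.count += 1
        return self.count <= self.limit  # drop everything past the limit


class SampleFilter(logging.Filter):
    """Keep roughly `rate` of records below WARNING; always keep WARNING and above."""

    def __init__(self, rate: float, rng: Optional[random.Random] = None) -> None:
        super().__init__()
        self.rate = rate
        self.rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:  # criterion: never sample away problems
            return True
        return self.rng.random() < self.rate   # otherwise keep a random fraction


def set_log_level(logger: logging.Logger, error_rate: float,
                  threshold: float = 0.05) -> None:
    """Hypothetical policy: log at DEBUG while the error rate is above threshold."""
    logger.setLevel(logging.DEBUG if error_rate > threshold else logging.INFO)
```

Attach the filters to a handler, for example handler.addFilter(ThrottleFilter(limit=100)), rather than to a logger: filters on a logger apply only to records logged directly on that logger, not to records propagated up from child loggers.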

While you can apply these techniques on your own, you can also use New Relic. Our log management capability automatically scales to the volume you need, with predictable pricing: your data and user rates don’t change as you scale.

Summary

In this blog post, you’ve learned techniques and best practices for managing log volume effectively. Following them can help your organization save on costs, improve system performance, and reduce MTTD and MTTR. You may even save time keeping security and privacy compliance up to company standards.

With New Relic, you can monitor your logs and find the signal within the noise. Get deeper visibility, near-instant search, and full contextual log information for any volume of queries. And with our simple, transparent pricing, data and user rates stay the same as you continue to scale.

Next steps

Want to learn more about log management available in New Relic? Visit our page that gives a brief overview of this capability and what it can do as just one part of our all-in-one observability platform.
Check out our documentation about getting started with log management.
You can get started managing your log volume effectively in just a few minutes with a free New Relic account. Your account includes 100 GB/month of free data ingest, one free full-access user, unlimited free basic users, and access to 500+ pre-built quickstart integrations.

DEV Community
