Odysseas

Quill: High-Performance Asynchronous C++ Logging Library

Introduction

Logging is an essential part of any software system, but it can easily become a performance bottleneck, especially in low-latency applications. Quill addresses this challenge by offering an asynchronous, cross-platform logging solution that minimizes the impact on your application's hot path.

Performance Focus

Quill is a feature-rich logging library built with performance in mind. Its design keeps the work done on the calling thread to a minimum, aiming to log faster than many traditional logging libraries.

Thread-Local Lock-Free Ring Buffer:

Each thread is equipped with its own lock-free ring buffer, facilitating efficient, contention-free logging. This architecture eliminates inter-thread synchronization for log writes, drastically reducing overhead.
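To make this concrete, below is a minimal, illustrative single-producer/single-consumer ring buffer (not Quill's actual implementation, which stores variable-length binary records). The producer and the consumer each own one index, so pushing and popping never take a lock; giving each application thread its own instance, for example as a thread_local, is what removes the contention:

#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Illustrative SPSC ring buffer: one producer thread pushes, one consumer thread pops.
template <typename T, std::size_t Capacity>
class SpscRing
{
public:
  bool try_push(T const& item)
  {
    std::size_t const head = head_.load(std::memory_order_relaxed);
    std::size_t const next = (head + 1) % Capacity;
    if (next == tail_.load(std::memory_order_acquire))
      return false; // queue is full
    buffer_[head] = item;
    head_.store(next, std::memory_order_release); // publish the slot to the consumer
    return true;
  }

  std::optional<T> try_pop()
  {
    std::size_t const tail = tail_.load(std::memory_order_relaxed);
    if (tail == head_.load(std::memory_order_acquire))
      return std::nullopt; // queue is empty
    T item = buffer_[tail];
    tail_.store((tail + 1) % Capacity, std::memory_order_release);
    return item;
  }

private:
  std::array<T, Capacity> buffer_{};
  std::atomic<std::size_t> head_{0}; // written only by the producer
  std::atomic<std::size_t> tail_{0}; // written only by the consumer
};

// Each application thread could own its own queue, e.g. (LogRecord is a placeholder type):
// thread_local SpscRing<LogRecord, 1024> my_queue;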

Compile-Time Metadata Generation:

Quill generates essential log metadata—such as file name, line number, and format string—at compile time. By shifting this workload to compile time, runtime performance is significantly enhanced.
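A sketch of the general technique (the MY_LOG macro and enqueue_log function below are hypothetical stand-ins, not Quill's API): the macro captures the source location and the format string into a static constexpr structure, so none of it has to be computed or copied at runtime:

#include <cstdint>
#include <cstdio>

// Metadata that is fully known at compile time for every log call site
struct LogMetadata
{
  char const* file;
  std::uint32_t line;
  char const* format;
};

// Stand-in for pushing the metadata pointer plus the arguments into the ring buffer
template <typename... Args>
void enqueue_log(LogMetadata const& meta, Args const&... args)
{
  std::printf("%s:%u -> %s (%zu args)\n", meta.file, static_cast<unsigned>(meta.line),
              meta.format, sizeof...(args));
}

// Requires C++20 for __VA_OPT__
#define MY_LOG(fmt, ...)                                          \
  do                                                              \
  {                                                               \
    static constexpr LogMetadata meta{__FILE__, __LINE__, fmt};   \
    enqueue_log(meta __VA_OPT__(,) __VA_ARGS__);                  \
  } while (0)

int main()
{
  MY_LOG("answer={}", 42); // file, line and "answer={}" are baked in at compile time
}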

Binary Log Message Serialization:

Instead of formatting log messages on-the-fly, Quill serializes argument data in binary form directly into the ring buffer. This approach minimizes processing on the critical path of application threads.
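A simplified illustration of the idea (not Quill's encoder, which also handles strings and other non-trivial types): the calling thread memcpy's the raw bytes of each argument into the buffer, and turning them back into text is deferred:

#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Copy the object representation of each argument, back to back, into dest.
// Returns the number of bytes written; decoding and formatting happen later.
template <typename... Args>
std::size_t encode_args(std::byte* dest, Args const&... args)
{
  static_assert((std::is_trivially_copyable_v<Args> && ...),
                "this sketch only handles trivially copyable arguments");
  std::size_t offset = 0;
  ((std::memcpy(dest + offset, &args, sizeof(Args)), offset += sizeof(Args)), ...);
  return offset;
}

int main()
{
  std::vector<std::byte> buffer(64);
  // An int and a double become 12 raw bytes on a typical platform; no text formatting here.
  std::size_t const written = encode_args(buffer.data(), 42, 3.14);
  (void)written;
}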

Asynchronous Backend Processing:

A dedicated backend thread retrieves binary data from the ring buffers, formats the log messages, and outputs them to the designated sinks (e.g., files, console).
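A toy version of that split (illustration only; it uses an ordinary mutex and condition variable rather than Quill's lock-free queues): the application thread only enqueues raw data, and the dedicated backend thread does the formatting and the writing:

#include <condition_variable>
#include <cstdio>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

std::mutex mtx;
std::condition_variable cv;
std::deque<std::pair<std::string, int>> pending; // (message, argument) records waiting to be formatted
bool done = false;

// Backend thread: drains the queue, formats, and performs the I/O.
void backend_loop()
{
  std::unique_lock<std::mutex> lock(mtx);
  while (!done || !pending.empty())
  {
    cv.wait(lock, [] { return done || !pending.empty(); });
    while (!pending.empty())
    {
      auto record = pending.front();
      pending.pop_front();
      lock.unlock();
      std::printf("%s %d\n", record.first.c_str(), record.second); // formatting + I/O off the hot path
      lock.lock();
    }
  }
}

int main()
{
  std::thread backend(backend_loop);

  // The "hot path": just enqueue the raw data and move on.
  {
    std::lock_guard<std::mutex> lock(mtx);
    pending.emplace_back("value =", 42);
  }
  cv.notify_one();

  // Shut down: tell the backend to finish once the queue is drained.
  {
    std::lock_guard<std::mutex> lock(mtx);
    done = true;
  }
  cv.notify_one();
  backend.join();
}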

This architecture enables Quill to deliver high performance by reducing work in application threads, capitalizing on compile-time optimizations, and leveraging asynchronous processing.

Example Usage

Here's a basic example of how to use Quill in your C++ application:

#include "quill/Backend.h"
#include "quill/Frontend.h"
#include "quill/LogMacros.h"
#include "quill/Logger.h"
#include "quill/sinks/ConsoleSink.h"
#include <string_view>

int main()
{
  // Start the backend thread that performs the formatting and the I/O
  quill::Backend::start();

  // Create (or reuse) a console sink and a logger that writes to it
  quill::Logger* logger = quill::Frontend::create_or_get_logger(
    "root", quill::Frontend::create_or_get_sink<quill::ConsoleSink>("sink_id_1"));

  LOG_INFO(logger, "Hello from {}!", std::string_view{"Quill"});
}

Alternatively, you can try it on Compiler Explorer.
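
For a slightly fuller example, the sketch below logs to a file instead of the console and lowers the logger's level so that debug statements are emitted. It follows the FileSink and set_log_level API shown in Quill's documentation, but treat it as a starting point and double-check the docs for the version you are using:

#include "quill/Backend.h"
#include "quill/Frontend.h"
#include "quill/LogMacros.h"
#include "quill/Logger.h"
#include "quill/sinks/FileSink.h"

#include <utility>

int main()
{
  // Start the backend thread that performs the formatting and the I/O
  quill::Backend::start();

  // A file sink writes the formatted log statements to "app.log"
  auto file_sink = quill::Frontend::create_or_get_sink<quill::FileSink>("app.log");

  quill::Logger* logger = quill::Frontend::create_or_get_logger("root", std::move(file_sink));

  // The default level is Info; lower it so that LOG_DEBUG statements are not filtered out
  logger->set_log_level(quill::LogLevel::Debug);

  LOG_DEBUG(logger, "Opened connection {} of {}", 1, 4);
  LOG_INFO(logger, "Application started");
}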

Get Involved

To dive deeper into Quill or contribute to the project, visit the GitHub repository or the Documentation page.

Top comments (7)

Paul J. Lucas

No performance graphs? You need to compare it to C's stdio.

Odysseas • Edited

Thank you for your comment. My logging library is asynchronous, meaning it defers formatting and I/O to another thread, allowing the main application to run without interruption. In contrast, C's stdio functions are synchronous, performing these operations immediately on the calling thread. Because of this, a direct performance comparison isn't applicable, as the two serve different use cases: my library focuses on minimizing logging impact, while stdio prioritizes simplicity and immediacy.

Regarding the charts, I currently have performance numbers available in a table on the GitHub page (github.com/odygrd/quill?tab=readme...). While I agree that visual charts would be helpful, it's a lower priority on my list at the moment.

Paul J. Lucas • Edited

My logging library is asynchronous ...

Yes, I know yours is asynchronous.

... it defers formatting and I/O to another thread, allowing the main application to run without interruption.

How do you guarantee that the string to be logged still exists? For example:

void f() {
    char const *s = "hello";
    LOG_INFO(logger, s);
}

If the call to LOG_INFO() returns and f() returns before the logger has actually logged the string on the other thread, then the pointer argument s becomes a dangling pointer.

In order to guarantee thread-safety, you'd have to copy the string in the current thread, then add the copy to the logger's queue — no?

If you copy, then copying also takes time — time you wouldn't have to spend if you logged synchronously.

In contrast, C's stdio functions are synchronous, performing these operations immediately on the calling thread.

Neither synchronization with nor context switches to other threads are zero-cost. Plus, if you have to copy strings to guarantee thread-safety (per above), it could be possible that the time to log a short string literal synchronously is less than the total time of copying the string, locking a mutex, inserting the copy into a data structure shared with the other thread, and unlocking the mutex.

You could use lock-free atomics in your implementation and that would eliminate the lock/unlock mutex time, but you still have to copy the strings to be logged.

For performance metrics, there are also other things you can measure like conversion of both integers and floating-point numbers to their string representations — assuming you don't just use the standard library functions directly.

Odysseas • Edited

The library employs a thread-local Single Producer Single Consumer (SPSC) lock-free queue, so each application caller thread has its own queue and there is no contention between producer threads. By default, string arguments passed to the logger are copied into pre-allocated space within this queue. The design minimizes synchronization overhead between the producer (the thread generating the log message) and the consumer (the logging thread), as the queue is optimized to reduce memory contention by synchronizing with the consumer only infrequently.

The cost of this approach is minimal, typically involving an atomic counter increment and a memcpy (per argument) operation.

When the string’s immutability and valid lifetime are guaranteed, the library provides an option to bypass copying. This is achieved through the StringRef class, which allows zero-copy string logging.

In contrast, synchronous logging with C's stdio functions introduces significant overhead on the hot path. With stdio, the entire log statement, including timestamp conversion and other formatting, is processed immediately on the calling thread. This not only adds latency due to the formatting process but also incurs I/O operation costs, especially when the internal buffer flushes to the file system.

The library leverages fmtlib for formatting, which efficiently handles conversions such as transforming integers and floating-point numbers into their string representations. Notably, this formatting is performed off the hot path on the backend thread, ensuring minimal impact on application performance.

To offer a practical comparison, you can run the following simplified benchmark code on your system. This example even uses a basic log pattern to minimize the formatting work done by printf, and it excludes the more granular parts of the timestamp (milliseconds, microseconds, and nanoseconds) to further help the printf benchmark with the time conversions.

godbolt.org/z/xchcdo9v6

This is the output of the above code on my machine:

[screenshot of the benchmark output]

Paul J. Lucas

In theory, can the hot path generate log messages faster than the back-end can actually emit them to the destination device (console or log file), so that you could conceivably overflow internal buffers or eventually exhaust memory?

Or at some point does calling the log API become a blocking call to wait for the queue to drain?

Odysseas • Edited

Yes, there's a single backend thread processing logs, with potentially multiple producer threads logging. The backend thread's performance depends on a few factors, such as:

Log statement size and argument types
Disk speed (as it writes to files)
CPU core availability (likely pinned on a shared non-critical core)

For example, running on an isolated CPU core on a tuned Linux box with an SSD, when logging messages with just 2 args (an int and a double), the backend throughput is around 4.50 million msgs/sec.

The library offers 5 user-selectable policies:

Bounded SPSC: Never allocates, with Blocking or Dropping behavior
Unbounded SPSC: Starts small, allocates on the hot path up to 2 GB, then Blocks or Drops
Unbounded SPSC Unlimited: Keeps allocating indefinitely, never Blocks or Drops

Users can configure queue size (starting size for unbounded, fixed for bounded).

By default, it uses Unbounded SPSC with Blocking when the 2 GB limit is reached.
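
For reference, policy selection is done through a frontend options customization point. The sketch below mirrors the pattern shown in Quill's docs, though the exact member names vary between versions, so verify against the documentation before using it:

#include "quill/Frontend.h"
#include "quill/Logger.h"

#include <cstdint>

// Assumed member names, mirroring Quill's documented FrontendOptions; check your version's docs.
struct CustomFrontendOptions
{
  static constexpr quill::QueueType queue_type = quill::QueueType::BoundedDropping; // bounded queue, drop on overflow
  static constexpr std::uint32_t initial_queue_capacity = 131'072;                  // starting size in bytes
  static constexpr std::uint32_t blocking_queue_retry_interval_ns = 800;            // only used by blocking queues
  static constexpr bool huge_pages_enabled = false;
};

// The frontend and logger types are then parameterized on these options
using CustomFrontend = quill::FrontendImpl<CustomFrontendOptions>;
using CustomLogger = quill::LoggerImpl<CustomFrontendOptions>;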

The library notifies about queue allocations, blocked messages, and dropped messages, providing visibility into its behaviour under load.

Paul J. Lucas

Everything you've written in the comments in response to me would have made a far more interesting blog post than your original one, i.e., a blog about the design and implementation choices, thread-safety, etc., of the logging library.