Mark Zlamal

Posted on Aug 18, 2022

Docker Solution: CockroachDB with Grafana Logging & Monitoring

#cockroachdb #docker #grafana

what is this?

A prescriptive approach to deploying CockroachDB with integrated logging, monitoring, and alerting through Grafana.

what's special here?

This blog provides an overview of the containerized database environment that leverages key services such as Prometheus, Fluentd, Loki, and a handful of supporting components bundled into a single package. This package is hosted on GitHub, pre-configured using typical settings and can be easily adjusted to match your environment.

github.com/cockroachlabs/Docker-CRDB

This Docker-CRDB GitHub repository provides the core project files while this article highlights the components and how the overall solution fits together.

your takeaways?

Coverage of the components and technologies used in this CockroachDB / Logging solution.
Overview of the architecture, how everything is tied together within Docker. No need to be a subject matter expert in these areas.
A pre-built reference solution that you can download and make your own.

Figure: Technologies and components leveraged in this blog

logging is complicated

It's complicated because there are many ways to architect a solution, many ways to deploy the solution. It can be containerized and orchestrated through Kubernetes. It can be deployed as your own self-managed cloud solution, or you can leverage Grafana’s paid subscription for their integrated cloud-managed services.

The good news is that there is extensive documentation on logging, configuration, workshops, technologies, approaches, repositories, fragments, tips, and tricks.

The bad news is that there is extensive documentation on logging, configuration, workshops, technologies, approaches, repositories, fragments, tips, and tricks.

In my journey I wasn't looking for deep dives into these topics. I just wanted a reference solution that's pre-built, pre-configured, and operational with a minimal number of integration steps.

At the same time I didn't want automation through Ansible or Terraform, since they can hide key integration aspects, potentially taking away from my understanding of how everything is tied together.

This project is deployed and operated as a set of independent and interacting Docker containers, where each container runs a single image that manages a single task. No orchestration through Kubernetes, and everything runs locally.

...to the architecture

This solution is divided into 2 perspectives using a Docker bridge network.

Figure: Core architecture (GitHub)

The top half (public network) represents the pre-configured endpoints that the host machine can connect to in order to interact with the platform. Typical clients are browsers, workload apps, CockroachDB clients, etc.
The lower half (private network) is the isolated and containerized 'sandbox' of apps and services that interconnect using virtualized network ports and hosts within Docker and the Docker bridge network.

the public network

The upper half is the public network where we can interact with the database and all available logging services. This is effectively the host machine and all the endpoints accessible to workloads outside of Docker. All the ports shown here (e.g., 3000, 9090, 8080, 26257, …) are defined as defaults in this project's settings, and must be available host ports when this project is deployed.

If you have any conflicts, say another unrelated project/app/workload uses one or more of these ports, then you can adjust the port number through the provided configuration and docker-compose files.
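Before deploying, it can help to check whether those default ports are actually free. Here is a minimal sketch in Python, assuming the four ports named above and a check against the loopback address only; the function names are my own and are not part of the repository.

```python
import socket

# Default host ports named in this article; adjust to match your settings.
DEFAULT_PORTS = [3000, 9090, 8080, 26257]

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if we can bind to host:port, i.e. nothing else holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True

def find_conflicts(ports=DEFAULT_PORTS):
    """List the ports from `ports` that are already taken on this machine."""
    return [p for p in ports if not port_is_free(p)]

if __name__ == "__main__":
    taken = find_conflicts()
    print("ports in use:", taken if taken else "none")
```

Any port this reports as taken is a candidate for remapping in the configuration and docker-compose files.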

the private network

The lower half is the private network, under the umbrella of Docker running on the host. You can see all the pieces, connected together in a networked chain of sources and sinks that processes the logging data eventually consumed by the user.

Port conflicts typically do not occur in the private network because each container is treated as a virtualized host within Docker. The key example here is the set of 3 Fluentd instances that all listen on port 5170: because each is treated as a unique host (with its own hostname), you can distinguish between them and connect accordingly.
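One way to picture this: an endpoint is a (hostname, port) pair, and two listeners only clash when both parts match. The sketch below models that idea in Python. The hostnames fluentd01 to fluentd03 are hypothetical stand-ins, not necessarily the names used in the repository.

```python
from collections import Counter

def find_port_clashes(endpoints):
    """Return the (host, port) pairs that are claimed more than once."""
    counts = Counter(endpoints)
    return sorted(ep for ep, n in counts.items() if n > 1)

# Inside the bridge network each container has its own hostname
# (hypothetical names), so port 5170 can be reused three times.
private = [("fluentd01", 5170), ("fluentd02", 5170), ("fluentd03", 5170)]

# If the same three listeners shared a single host, they would collide.
shared = [("host", port) for _, port in private]

print(find_port_clashes(private))  # no clashes on the bridge network
print(find_port_clashes(shared))   # the shared host reports port 5170
```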

Each orange box is a container running a single image of the highlighted component, and these all run inside a bridge network. This bridge is a private Docker network, conceptually similar to a virtual private cloud. Every running container is treated as a virtual host and as a distinct service with ports that can be exposed across the bridge network. Each container (i.e., host) has visibility to all other containers within this network, but only to the exposed ports defined by the docker-compose configuration file. Outside of this private network, none of these exposed ports are visible or accessible to the host machine running Docker.

data and log flow

Starting at the right-hand side, we have a containerized 3-node CockroachDB cluster. Normally the database sends logs directly into the cockroach-data/logs folder, but here it’s configured to use fluentd as our log sink, and this is the first step in the chain.

so what’s Fluentd?

It’s an open source data collector. It unifies log collection and consumption in a formatted, tagged, buffered, and consistent way across all your applications. Fluentd can then save this structured data back into the filesystem or, as in this project, send the formatted data over the local network to Loki, which is the next piece of the puzzle.
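In this project Fluentd does the forwarding for you, but it can be useful to see the shape of what Loki accepts on its HTTP push endpoint (/loki/api/v1/push). The Python sketch below only builds the JSON body. The job label value is borrowed from the query examples later in this article, and the log line is made up for illustration.

```python
import json
import time
from typing import Optional

def loki_push_payload(line: str, labels: dict, ts_ns: Optional[int] = None) -> str:
    """Build a JSON body for Loki's push API: one stream, one log entry."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    body = {
        "streams": [
            {
                "stream": labels,                  # labels identify the stream
                "values": [[str(ts_ns), line]],    # [nanosecond timestamp, text]
            }
        ]
    }
    return json.dumps(body)

# Illustrative values only: the label matches the LogQL examples, the line is invented.
print(loki_push_payload("example log line", {"job": "CRDB01"}, ts_ns=1_660_000_000_000_000_000))
```

The timestamp is sent as a string of nanoseconds, and the labels are what you later select on in queries such as {job=~"CRDB01|CRDB02|CRDB03"}.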

...and Loki?

Loki is a scalable, multi-tenant log aggregator and time-series database. It’s similar to Prometheus but specifically designed for text-based analytics, with indexing, searching, scanning, and querying available inside Grafana.

what about Prometheus?

In parallel to the above log flows, there is another network link from CockroachDB directly to the Prometheus container. This connection carries the operational metrics that allow us to run queries on Cockroach statistics, usage, SQL, data volumes, contention rates, etc.
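CockroachDB serves these metrics in the Prometheus text format on each node's HTTP port (under /_status/vars). As a rough illustration of what that text looks like, here is a small Python parser. The sample values are invented, and the parser assumes lines carry no trailing timestamp.

```python
def parse_metrics(text: str) -> dict:
    """Parse 'name value' lines of Prometheus text format, skipping comments."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and '# HELP' / '# TYPE' comments
        name_and_labels, _, value = line.rpartition(" ")
        samples[name_and_labels] = float(value)
    return samples

# Made-up sample using metric names that appear in the queries in this article.
SAMPLE = """\
# TYPE sql_update_count counter
sql_update_count 42
# TYPE sql_txn_contended_count counter
sql_txn_contended_count 3
"""

print(parse_metrics(SAMPLE))
```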

destination: Grafana!

Finally, in Grafana we define data sources that connect to both Loki and Prometheus, and this is the final destination for our logs and metrics.

As mentioned earlier we have a handful of endpoints exposed to the public network, notably from Cockroach and Grafana so we can access their fancy UI and run logging queries against the CockroachDB nodes.

alerting

The alerting framework is separated from the core architecture because it's an optional capability and requires API/Service keys from 3rd party cloud services. In our example, when a trigger in Grafana is activated, it calls a NodeJS endpoint that issues an API call to Twilio and SendGrid. These services send live email and global SMS messages to a recipient, notifying them of this alert with context.

In Grafana, the alert is configured to monitor Prometheus metrics. When the threshold value is reached, Grafana triggers this alert and calls a webhook, sending a JSON payload that contains the properties of the alert along with custom fields such as API key information and recipient details.

The NodeJS application that listens on this alerting endpoint (webhook URL & port) formats the data into a human-readable form, leveraging email templates and SMS formatting, and sends the new payload to Twilio through their API services.
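To make the formatting step concrete, here is a minimal sketch of turning a webhook payload into a short SMS text. It is written in Python rather than NodeJS and is not the code from the repository. It assumes the payload carries a top-level status and a list of alerts, each with an alertname label, and the alert name in the example is made up.

```python
def format_sms(payload: dict, max_len: int = 160) -> str:
    """Condense an alert webhook payload into a single SMS-sized line."""
    status = payload.get("status", "unknown").upper()
    alerts = payload.get("alerts", [])
    names = [a.get("labels", {}).get("alertname", "unnamed") for a in alerts]
    text = f"[{status}] {len(alerts)} alert(s): {', '.join(names)}"
    return text[:max_len]  # truncate so the message stays short

# Hypothetical payload for illustration.
example = {
    "status": "firing",
    "alerts": [{"labels": {"alertname": "ContendedQueries"}}],
}
print(format_sms(example))
```

The real application also renders email templates and passes along the custom fields. This sketch covers only the text-condensing idea.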

Figure: Alerting architecture (GitHub)

alerting key capabilities

Monitoring and alerting using quantitative metrics from Prometheus, triggered when thresholds are reached or exceeded.
Monitoring and alerting using events from text/string/JSON logs as our triggers.

the GitHub repository

The folders represent all the operational containers running the complete solution. Specifically, there is a 1:1 relationship between each folder in GitHub and each container in the architecture diagram. This was intentional, to make it easy to learn, locate, and adjust any component independently of all the others.

configuration files & settings

The architecture diagram highlights each configuration file that governs that particular aspect of this platform, and they can be found in the corresponding GitHub folders. While everything is wired together right out of the box using typical values, you have full control of all the settings and communication. This facilitates an easy and flexible integration into an existing Cockroach environment, and as a bonus you get to learn how all the pieces work together.

The key aspect in this organization is that each folder contains a docker-compose YAML file that defines the container, including the hostname, the image to use, and which ports to expose and map publicly. You run docker-compose -f <file> up, and this creates the container from the image along with the properties in the spec. The image is kept in your local Docker image store, and the container is launched in your Docker environment.
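If you prefer to script the per-folder launch, the command line can be assembled as in the Python sketch below. The folder name and the docker-compose.yml file name are assumptions for illustration, so check the repository for the actual names and startup order.

```python
from pathlib import PurePosixPath

def compose_up_command(folder: str,
                       compose_file: str = "docker-compose.yml",
                       detached: bool = True) -> list:
    """Build the argument list for 'docker-compose -f <file> up' for one folder."""
    cmd = ["docker-compose", "-f", str(PurePosixPath(folder) / compose_file), "up"]
    if detached:
        cmd.append("-d")  # run the container in the background
    return cmd

# "grafana" is a hypothetical folder name.
print(" ".join(compose_up_command("grafana")))
# To actually launch it: subprocess.run(compose_up_command("grafana"), check=True)
```

Note that -f belongs to docker-compose itself and comes before the up subcommand.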

implementation details

The GitHub repo covers the remaining details such as tools/prerequisites, establishing a Docker bridge network, and creating certificates for CockroachDB.

Cheat-sheets, fragments, and command shortcuts are included with example values to help stand-up the environment quickly.

Finally the startup sequence of containers and a listing of the endpoints are given for convenience. Note that the endpoints have defined ports based on the default settings of this project.

running CockroachDB and Grafana

We need to create data-source connections from our log sources into Grafana. According to the architecture diagram, we’ll establish the Loki and Prometheus endpoints in the Grafana UI as shown in this screenshot:

We need to define a default contact point which includes a webhook URL to the alerts container:

Define the alert parameters that are sent as a payload to the alerts container. Note that the SMS_Key field syntax must have the format '<SID>:<Secret>'.
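A receiving application would need to split that field back into its two parts. Here is a minimal Python sketch of that step, assuming only the '<SID>:<Secret>' format stated above. The function name and the example values are hypothetical.

```python
from typing import Tuple

def parse_sms_key(sms_key: str) -> Tuple[str, str]:
    """Split an SMS_Key of the form '<SID>:<Secret>' at the first colon."""
    sid, sep, secret = sms_key.partition(":")
    if not sep or not sid or not secret:
        raise ValueError("SMS_Key must look like '<SID>:<Secret>'")
    return sid, secret

# Placeholder values, not real credentials.
print(parse_sms_key("my-sid:my-secret"))
```

Splitting on the first colon only means a secret that itself contains a colon stays intact.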

Example Log Queries

Below are a few example queries that you can test out against your CockroachDB cluster.

Prometheus log query examples

A: Alerting Rule Query:

rate(sql_distsql_contended_queries_count{instance=~"crdb-node01:8080|crdb-node02:8080|crdb-node03:8080"}[3m:10s])

B: When this rate is > 0.373
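To get a feel for what this rule evaluates, here is a simplified Python sketch: a per-second rate computed from the first and last counter samples in a window, compared against the 0.373 threshold. The real PromQL rate() also handles counter resets and extrapolates to the window edges, which this sketch does not attempt. The sample numbers are invented.

```python
THRESHOLD = 0.373  # threshold from the alerting rule above

def simple_rate(samples):
    """Per-second increase between the first and last (timestamp_s, value) samples."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time range")
    return (v1 - v0) / (t1 - t0)

def should_alert(samples, threshold=THRESHOLD):
    """True when the simplified rate exceeds the threshold."""
    return simple_rate(samples) > threshold

# Invented counter readings over a 3-minute (180 s) window.
busy = [(0, 100), (90, 140), (180, 190)]
quiet = [(0, 100), (90, 120), (180, 150)]
print(should_alert(busy), should_alert(quiet))
```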

The CRDB admin console provides basic views of the Prometheus data, such as "SQL Statements" (queries) and "SQL Statement Contention". These charts can be replicated in Grafana using the following queries:

SQL Statements (queries)

rate(sql_update_count{instance=~"crdb-node01:8080|crdb-node02:8080|crdb-node03:8080"}[1m:2s])

SQL Statement Contention

rate(sql_txn_contended_count{instance=~"crdb-node01:8080|crdb-node02:8080|crdb-node03:8080"}[1m:1s])

Other examples that show interesting views into the data

rate(sql_mem_distsql_current{instance=~"crdb-node01:8080|crdb-node02:8080|crdb-node03:8080"}[5m:10s])

admission_admitted_kv{instance=~"crdb-node01:8080|crdb-node02:8080|crdb-node03:8080"}

rate(admission_admitted_kv{instance=~"crdb-node01:8080|crdb-node02:8080|crdb-node03:8080"}[2m:10s])

rate(sql_insert_count_internal{instance=~"crdb-node01:8080|crdb-node02:8080|crdb-node03:8080"}[3m:10s])

sql_contention_resolver_retries{instance=~"crdb-node01:8080|crdb-node02:8080|crdb-node03:8080"}

rate(sql_contention_resolver_retries{instance=~"crdb-node01:8080|crdb-node02:8080|crdb-node03:8080"}[3m:10s])

sql_stats_mem_current{instance=~"crdb-node01:8080|crdb-node02:8080|crdb-node03:8080"}

sql_contention_resolver_queue_size{instance=~"crdb-node01:8080|crdb-node02:8080|crdb-node03:8080"}
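All of these queries share the same instance matcher, so they are easy to generate. The Python sketch below builds the query strings and a URL for Prometheus' HTTP query API (/api/v1/query). The base address http://localhost:9090 is an assumption based on the default ports in this project.

```python
from urllib.parse import urlencode

NODES = ["crdb-node01:8080", "crdb-node02:8080", "crdb-node03:8080"]

def instance_matcher(nodes=NODES) -> str:
    """Build the instance=~"a|b|c" label matcher used throughout the examples."""
    return 'instance=~"' + "|".join(nodes) + '"'

def rate_query(metric: str, window: str, nodes=NODES) -> str:
    """Wrap a metric in rate(...) over the given range or subquery window."""
    return f"rate({metric}{{{instance_matcher(nodes)}}}[{window}])"

def query_url(expr: str, base: str = "http://localhost:9090") -> str:
    """URL for an instant query; the base address is an assumed default."""
    return f"{base}/api/v1/query?" + urlencode({"query": expr})

expr = rate_query("sql_update_count", "1m:2s")
print(expr)
print(query_url(expr))
```

Pasting the printed expression into Grafana's query editor should give the same result as typing the query by hand.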

Loki log query examples

Loki digs into the cockroach logs folder, capturing all the text-based messages that occur within the database. This service is necessary to capture connectivity issues, gossip-protocol updates, system events, and other activities related to a distributed database.

Exact case string-search:

{job=~"CRDB01|CRDB02|CRDB03"} |= "circuitbreaker"

Case-insensitive search using a regex line filter expression:

{job=~"CRDB01|CRDB02|CRDB03"} |~ "(?i)CircuitBREAKER"

Capture log lines that mention both circuit breakers and connection (issues):

{job=~"CRDB01|CRDB02|CRDB03"} |~ "(?i)Circuitbreaker" |~"connection"
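The difference between the |= and |~ filters can be mimicked in a few lines of Python. This is only an analogy: Loki evaluates these filters itself using Go's RE2 regex syntax, and the log lines here are invented.

```python
import re

def contains(line: str, needle: str) -> bool:
    """Analogue of the |= filter: exact, case-sensitive substring match."""
    return needle in line

def matches(line: str, pattern: str) -> bool:
    """Analogue of the |~ filter: regex search anywhere in the line."""
    return re.search(pattern, line) is not None

# Invented log lines for illustration.
lines = [
    "circuitbreaker tripped for connection to n2",
    "CircuitBreaker reset",
    "gossip status update",
]

exact = [l for l in lines if contains(l, "circuitbreaker")]
insensitive = [l for l in lines if matches(l, "(?i)CircuitBREAKER")]
chained = [l for l in insensitive if matches(l, "connection")]
print(len(exact), len(insensitive), len(chained))
```

Chaining filters, as in the last query above, narrows the result step by step: each filter only sees the lines that passed the previous one.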

Rate of circuit breaker messages over a timeframe of 50 seconds

rate( ( {job=~"CRDB01|CRDB02|CRDB03"} |~ "(?i)Circuitbreaker")[50s] )

Rate of gossip messages over a timeframe of 20 seconds

rate( ( {job=~"CRDB01|CRDB02|CRDB03"} |~ "(?i)gossip")[20s] )
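Conceptually, a log rate is the number of matching entries in the window divided by the window length in seconds. Here is a small Python sketch of that arithmetic, using invented timestamps. Loki's own evaluation may differ in details such as window boundaries.

```python
def log_rate(timestamps, now: float, window_s: float) -> float:
    """Entries per second among timestamps that fall in (now - window_s, now]."""
    recent = [t for t in timestamps if now - window_s < t <= now]
    return len(recent) / window_s

# Invented arrival times (in seconds) of matching log lines.
hits = [10, 30, 55, 70]
print(log_rate(hits, now=70, window_s=50))
```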

GitHub Repo URL: github.com/cockroachlabs/Docker-CRDB