Getting Valuable Insights from ClickHouse Error Logs using ClickSight

#ansible #clickhous #monitoring #tutorial

Introduction

When managing a production ClickHouse cluster, you might face numerous challenges. One of them is finding the root cause in case of a crash. During such incidents, querying system tables like system.errors or system.text_log is not possible. Hence, we need to search into the clickhouse-server.err.log file to identify the root cause and address it effectively. Although using Unix tools is the most efficient approach, we should aim to do it as quickly as possible.

To gain better and faster insights from ClickHouse error logs, I developed an Ansible Playbook named ClickSight. This playbook performs log aggregation on ClickHouse logs from all specified nodes. ClickSight is available under the Apache license on GitHub as the ClickSight project. In the next section, we will see how to use it and customize it to suit your needs.

It's essential to have a monitoring dashboard, such as this one, which can help us identify issues like a high number of mutations or high RAM usage. However, in complex situations, it might not be sufficient to find the root cause, and that's when error logs become crucial.

How ClickSight Works

ClickSight leverages various useful Unix commands, such as grep, cat, tail, cut, and sed, for working with text files. Additionally, Unix's ability to pass the output of one command to another using pipes (|) allows us to chain commands together. For example, we can use grep to find specific lines in a file and then use cut to select all text after the > character with cat myfile.txt | grep "some arbitrary phrase" | cut -d">" -f2. ClickSight harnesses the power of these Unix commands in an Ansible playbook, making log aggregation a breeze. With ClickSight, you only need to run the playbook, and the results will be available on your local machine.

Setting Up ClickSight

To use ClickSight, you need at least one active ClickHouse server and Ansible installed on your system. It's recommended to install it on a Unix-based OS. Additionally, ensure that you have access to the ClickHouse server as a sudo user, or at least your user should have access to the clickhouse-server.err.log file.

For detailed setup instructions for ClickSight and Ansible, please refer to the Prerequisites section of the repository.

Running ClickSight

A comprehensive guide on how to run ClickSight can be found in the Run ClickSight section of the repository. Feel free to follow that guide, and if you have any questions, don't hesitate to ask here or on GitHub.

Analyzing ClickHouse Error Logs

ClickSight offers five modes of operation:

All fatal errors in clickhouse-server.err.log with details: In case of a crash, a fatal error is likely present, explaining the cause of the crash.
Timeline of errors: On production systems, there might be numerous errors every minute, making it challenging to track them in log files. ClickSight can help by displaying the timeline of errors extracted from ClickHouse log lines. The error names are based on the logger_name, providing valuable information about the error category.
Timeline of errors associated with a query_id: ClickHouse provides query_id for some logs associated with executed queries, denoted by {}. ClickSight can display the timeline of errors linked to specific query_ids.
Timeline of errors associated with a specific query_id: ClickSight allows you to view all errors and their occurrences associated with a specific query_id.
Timeline of a specific error: Sometimes, we suspect a particular error and want to know when and how often it occurs. ClickSight can provide a detailed timeline for specific errors.

Contributing to ClickSight

I welcome and appreciate contributions to the ClickSight project. Whether you want to report issues, suggest improvements, or submit new features, your input is valuable in making ClickSight even more useful.

1. Reporting Issues or Feature Requests

If you encounter any problems while using ClickSight or have ideas for new features, please don't hesitate to report them. To do so:

Go to the ClickSight GitHub repository: ClickSight GitHub repository.
Click on the "Issues" tab.
Click on the green "New Issue" button.
Provide a descriptive title and detailed description of the issue you encountered or the feature you want to suggest.
If it's a bug, include steps to reproduce the problem.

Contributing Code

If you're a developer and want to contribute directly to the ClickSight codebase, follow these steps:

Fork the ClickSight repository to your GitHub account using the "Fork" button in the top-right corner.
Clone the forked repository to your local development environment.
Create a new branch for your contribution: git checkout -b my-feature.
Make your changes and improvements.
Test your changes thoroughly to ensure they work as expected.
Commit your changes with clear and concise messages: git commit -m "Add my awesome feature".
Push the changes to your forked repository: git push origin my-feature.
Create a pull request (PR) by navigating to the original ClickSight repository and clicking on "New Pull Request."
Describe your changes in the PR, including any relevant information or context.
I will review your PR, provide feedback, and work with you to merge the changes into the main ClickSight project.