Monitoring Solr (3 Part Series)
As the first part of the three-part series on monitoring Apache Solr, this article explores which Solr metrics are important to monitor and why. The second part of the series covers Solr open source monitoring tools and identifies the tools and techniques you need to monitor and administer Solr and SolrCloud in production.
When first thinking about installing Solr you usually ask yourself a question - should I go with the master-slave environment or should I dedicate myself to SolrCloud? This question will remain unanswered in this blog post, but what we want to mention is that it is important to know which architecture you will monitor. When dealing with SolrCloud clusters you not only want to monitor per node metrics but also cluster-wide information and the metrics related to Zookeeper.
Solr is usually a crucial part of the system it runs in. It is used as a search and analysis engine for your data - part of it or all. Such a critical part of the whole architecture needs to be both fault tolerant and highly available. Solr approaches that in two ways. The first is the legacy architecture, also called master-slave. It is based on a clear distinction between the master server, which is responsible for indexing the data, and the slave servers, which are responsible for delivering the search and analysis results.
When the data is pushed to the master it is transformed into a so-called inverted index based on the configuration that we provide. The disk-based inverted index is divided into smaller, immutable pieces called segments, which are then used for searching. The segments can also be combined together into larger segments in a process called segment merging for performance reasons - the more segments you have, the slower your searches can be and vice versa.
Once the data has been written in the form of segments on the master’s disk, it can be replicated to the slave servers. This is done in a pull model. The slave servers use HTTP to copy the binary data from the master node. Each slave does that on its own and works separately, copying the changed data over the network. We can already see a dozen places that we should monitor and understand.
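One of those places is how far each slave lags behind the master. Below is a minimal sketch, assuming you have already obtained the index generation number of every node (for example from Solr's replication handler). The function name, host names, and sample numbers are hypothetical and only illustrate the comparison.

```python
def lagging_slaves(master_generation, slave_generations, max_lag=1):
    """Return {slave: lag} for slaves more than max_lag generations behind.

    master_generation and the values of slave_generations are assumed to be
    index generation numbers collected from each node by other means.
    """
    lagging = {}
    for slave, generation in slave_generations.items():
        # A slave should never be ahead of the master, so clamp at zero.
        lag = max(0, master_generation - generation)
        if lag > max_lag:
            lagging[slave] = lag
    return lagging


# Sample values, invented for illustration only.
print(lagging_slaves(42, {"slave1": 42, "slave2": 38}))
```

A slave that stays several generations behind for a long time usually deserves a look at its network and I/O metrics.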
Having a single master node is not something that we would call fault tolerant, because it is a single point of failure. Because of that, a second type of architecture was introduced with the Solr 4.0 release - SolrCloud. It is based on the assumption that the data is distributed among a virtually unlimited number of nodes and each node can perform both indexing and search processing roles. The data is divided into so-called shards, and physical copies of each shard, called replicas, can be created on demand and kept in sync in a near real-time manner, allowing for true fault tolerance and high availability. However, for that to happen we need an additional piece of software - an Apache Zookeeper cluster that helps Solr manage and configure its nodes.
When the data is pushed to any of the Solr nodes that are part of the cluster, the first thing that happens is that the data is forwarded to the leader of the relevant shard. The leader stores the data in a write-ahead log called the transaction log and, depending on the replica type, sends the data to the replicas for processing. The data is then indexed and written to disk in the inverted index format. This can cause additional I/O requirements - indexing may also trigger segment merging, and finally the index needs to be refreshed for the data to become visible for searching, which requires yet another I/O operation.
When you send a search query to a SolrCloud cluster, the node that is hit by the query initially propagates the query to the shards that need to be queried in order to provide full data visibility. Each distributed search is done in two phases - scatter and gather. The scatter phase is dedicated to finding which shards have the matching documents, the identifiers of those documents and their scores. The gather phase is dedicated to rendering the search results by retrieving the needed documents from the shards that have them indexed. Each search phase requires I/O to read the data from disk, memory to store the results and intermediate steps required to perform the search, CPU cycles to calculate everything and network to transport the data.
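The two phases can be illustrated with a toy model. This is only a sketch of the general scatter and gather idea, not Solr's actual implementation. The shard contents, scores, and function names are all made up for the example.

```python
import heapq

# Toy shards: each maps document id -> (score for the query, stored document).
# All data here is invented for illustration.
SHARDS = {
    "shard1": {"a": (0.9, {"title": "Solr basics"}), "b": (0.2, {"title": "Misc"})},
    "shard2": {"c": (0.7, {"title": "SolrCloud"}), "d": (0.5, {"title": "ZooKeeper"})},
}


def scatter(shards, rows):
    """Phase 1: ask every shard for the ids and scores of its top `rows` hits."""
    candidates = []
    for shard_name, docs in shards.items():
        top = heapq.nlargest(rows, docs.items(), key=lambda item: item[1][0])
        candidates.extend((score, doc_id, shard_name) for doc_id, (score, _) in top)
    return candidates


def gather(shards, candidates, rows):
    """Phase 2: merge the candidates and fetch only the winning documents."""
    winners = heapq.nlargest(rows, candidates)
    return [(doc_id, shards[shard][doc_id][1]) for _, doc_id, shard in winners]


def distributed_search(shards, rows):
    return gather(shards, scatter(shards, rows), rows)


print(distributed_search(SHARDS, 2))
```

Note that only ids and scores travel in the first phase. The stored documents are read and transported only for the final winners, which is why both phases show up separately in I/O and network metrics.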
Let's now look at how we can monitor all those metrics that are crucial to our indexing and searching.
One of the tools that comes out of the box with the Java Development Kit and can come in handy when you need quick, ad-hoc monitoring is jconsole. It is a GUI tool that lets you get basic metrics about your JVM, like memory usage, CPU utilization, JVM threads and loaded classes. In addition to that, it also allows us to read metrics exposed by Solr itself in the form of JMX MBeans. Whatever metrics were exposed by Solr creators in that form can be read using jconsole. Things like average query response time for a given search handler, number of indexing requests or number of errors - all can be read via the JMX MBeans. The problem with jconsole is that it doesn't show us the history of the measurements.
Jconsole is not the only way of reading the JMX MBean values - there are other tools that can do that, like the open sourced JMXC or our open-source Sematext Java Agent. Unlike the tools that export the data in text format, our Sematext Java Agent can ship the data to Sematext Cloud - a full stack observability solution which can help you get detailed insight into your Solr metrics.
The second option for gathering Solr metrics is an API introduced in Solr 6.4 - the Solr Metrics API. It supports on-demand metrics retrieval using an HTTP-based API for cores, collections, nodes, and the JVM. However, the flexibility of the API doesn't come from it being available on demand, but from its ability to report the data to various destinations by using Reporters. Right now, out of the box, the data can be exported with a little configuration to:
- JMX - JMX MBeans, something we already discussed
- SLF4J - logs or any destination that SLF4J supports
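For the on-demand side of the Metrics API, a request is just an HTTP GET with a few filtering parameters. The sketch below only builds such a URL. The base URL `http://localhost:8983/solr` and the helper name are assumptions, and the `group` and `prefix` values are examples of filters you may want to adjust for your version of Solr.

```python
from urllib.parse import urlencode


def metrics_url(base_url, group=None, prefix=None):
    """Build a Metrics API URL for the given Solr base URL.

    base_url is an assumption, e.g. http://localhost:8983/solr.
    group and prefix narrow down which metrics are returned.
    """
    params = {"wt": "json"}
    if group:
        params["group"] = group
    if prefix:
        params["prefix"] = prefix
    return f"{base_url.rstrip('/')}/admin/metrics?{urlencode(params)}"


print(metrics_url("http://localhost:8983/solr", group="jvm", prefix="memory.heap"))
```

The resulting URL can be fetched with any HTTP client and the JSON response stored over time, which gives you the history that jconsole lacks.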
We will cover the open source Solr monitoring solutions in greater detail in the second part of this three-part monitoring series.
Having knowledge about how we can monitor Solr, let's now look into top Solr metrics that we should keep an eye on.
Each handler in Solr provides information about the rate of the requests that are sent to it. Knowing how many requests in total and per handler are handled by your Solr node or cluster may be crucial to diagnose whether operations are going properly or not. A sudden drop or spike in the request rate may point to a failure in one of the components of your system. Cross-referencing the request rate metric with the request latency can tell you how fast your requests are at a given rate, or reveal issues that are coming your way when the number of requests stays the same but the latency starts to grow.
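If your monitoring collects a cumulative request counter per handler, the rate can be derived from two consecutive samples. This is a minimal sketch under that assumption. The function name and the restart handling are our own choices, not something Solr prescribes.

```python
def request_rate(prev_count, prev_time, curr_count, curr_time):
    """Requests per second between two samples of a cumulative counter.

    Times are expressed in seconds, counts are total requests seen so far.
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    delta = curr_count - prev_count
    # A counter that went backwards usually means the node restarted;
    # treat the current value as the number of requests since the restart.
    if delta < 0:
        delta = curr_count
    return delta / elapsed


# 600 new requests over 60 seconds.
print(request_rate(1000, 0.0, 1600, 60.0))
```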
Request latency, a measurement of how fast your requests are, is available for each of the handlers separately, just like the request rate. It means that we are able to easily see the latency of our queries and update requests. If you have various search handlers dedicated to different search needs, e.g. one for product search and one for article search, you can easily measure the latency of each of the handlers and see how fast the results for a given type of data are returned. Cross-referencing the latency of the request with metrics like garbage collector work, JVM memory utilization, I/O utilization, and CPU utilization allows for easy diagnostics of performance problems.
Commit events in Solr come in various flavors. There are manual commits, sent with or without the indexing requests. There are also automatic commits - ones that are fired after certain criteria are met - either a given amount of time has passed or the number of documents exceeded a threshold. Why are they important? They are responsible for data persistence and data visibility. The hard commit flushes the data to the disk and clears the transaction log. The soft commit reopens the searcher object, allowing Solr to see new segments and thus serve new data to your users. However, it is crucial to balance the number of commits and the time between them - they are not free and we need to be sure that our commits are neither too frequent nor too far apart.
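The automatic commit criteria can be pictured as a simple check. The sketch below is a simplified model of that decision, not Solr's code. The parameter names `max_docs` and `max_time_ms` are modeled on the `maxDocs` and `maxTime` auto commit settings, and the function itself is hypothetical.

```python
def should_auto_commit(pending_docs, ms_since_first_pending,
                       max_docs=None, max_time_ms=None):
    """Decide whether an automatic commit is due (simplified model).

    pending_docs: documents added since the last commit.
    ms_since_first_pending: time since the first uncommitted document.
    """
    if pending_docs == 0:
        return False  # nothing to commit
    if max_docs is not None and pending_docs >= max_docs:
        return True   # document threshold reached
    if max_time_ms is not None and ms_since_first_pending >= max_time_ms:
        return True   # time threshold reached
    return False


print(should_auto_commit(500, 20000, max_docs=10000, max_time_ms=15000))
```

Lower thresholds make data durable or visible sooner at the cost of more frequent commits, which is exactly the balance that commit metrics help you tune.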
Caches play a crucial role in Solr performance, especially when it comes to Solr master-slave architecture. The data that is cached can be easily accessed without the need for expensive disk operations. The caches are not free - they require memory, and the more information you would like to cache, the more memory it will require. That's why it is important to monitor the size and hit rate of the caches. If your caches are too small, the hit rate will be low and you will see lots of evictions - removals of data from the caches - which cost CPU time and cause garbage collection, effectively reducing your node performance. Caches that are too large, on the other hand, will increase the amount of data on the JVM heap, pushing the garbage collector even further and again lowering the effective performance of your nodes. You can also cross-reference the utilization of the cache with commit events - remember, each commit event discards the entries inside the cache, causing its refresh and warm-up, which uses resources such as CPU and I/O.
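A quick way to reason about a cache is to look at its hit ratio together with its evictions. The following sketch assumes you have the cumulative lookups, hits, and evictions of a cache. The 0.8 threshold and the function name are arbitrary choices for illustration.

```python
def cache_health(lookups, hits, evictions, min_hit_ratio=0.8):
    """Summarize cache effectiveness from cumulative counters."""
    hit_ratio = hits / lookups if lookups else 0.0
    return {
        "hit_ratio": hit_ratio,
        # A low hit ratio combined with evictions suggests the cache is too small.
        "possibly_undersized": hit_ratio < min_hit_ratio and evictions > 0,
    }


# Sample counters, invented for illustration.
print(cache_health(lookups=1000, hits=250, evictions=400))
```

A low hit ratio without evictions points in a different direction, for example queries that rarely repeat, so the two numbers are best read together.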
The above are the key Solr metrics to pay attention to, although there are other useful Solr metrics, too.
Apache Solr is Java software and as such is greatly dependent on the performance of the whole Java Virtual Machine and its parts, such as the garbage collector. The JVM itself doesn’t work in isolation and is dependent on the operating system and its resources, such as available physical memory, the number of CPU cores and their speed, and the speed of the I/O subsystem. Let's look into crucial metrics that we should be aware of.
The majority of the operations performed by Solr are to some degree dependent on the processing power of the CPU. When you index data it needs to be processed before it is written to the disk - the more complicated the analysis configuration, the more CPU cycles will be needed for each document. Query-time analytics, such as facets, need to process a vast number of documents in under a second for Solr to be able to return query results in a timely manner. The Java virtual machine also requires CPU processing power for operations such as garbage collection. Correlating the CPU utilization with other metrics, e.g. request rate or request latency, may reveal potential bottlenecks or point us to potential improvements.
Free memory and swap space are very important when you care about performance. The swap space is used by the operating system when there is not enough physical memory available and applications need more memory. In such a case memory pages may be swapped, which means that they are taken out of physical memory and written to the dedicated swap partition on the hard drive. When the data from those swapped memory pages is needed, the operating system loads it from the swap space back into physical memory. You can imagine that such an operation takes time - even the fastest solid-state drives are orders of magnitude slower than RAM. Being aware of the implications of swapping, we can now easily say that JVM applications don't like to be swapped - it kills performance. Because of that, you want to keep your Solr JVM heap memory from being swapped. You should either closely monitor your memory usage and swap usage and correlate them with your Solr performance, or completely disable swapping.
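On Linux, swap usage can be read from `/proc/meminfo`. The sketch below parses that format from a string, assuming a Linux host that reports `SwapTotal` and `SwapFree` in kB. The sample text is invented and the function name is our own.

```python
def swap_used_kb(meminfo_text):
    """Return used swap in kB, given the contents of /proc/meminfo."""
    values = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[name.strip()] = int(fields[0])
    return values["SwapTotal"] - values["SwapFree"]


# Sample content, invented for illustration only.
SAMPLE_MEMINFO = """MemTotal:       16384000 kB
MemFree:         1024000 kB
SwapTotal:       2097148 kB
SwapFree:        2000000 kB
"""

print(swap_used_kb(SAMPLE_MEMINFO))
```

On a real host you would pass in the text read from `/proc/meminfo` and alert when the value stays above zero on a Solr node.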
In addition to monitoring the system memory, you should also pay close attention to JVM memory and the utilization of its various pools. Having the JVM memory pools fully utilized, especially the old generation space, will result in extensive garbage collection and can leave your Solr completely unusable.
Apache Solr is a very I/O-intensive application - it writes the data when indexing and reads the data when a search is performed. The more data you have, the higher the utilization of your I/O subsystem will be, and of course the performance of the I/O subsystem directly affects your Solr performance. Correlating the I/O read and write metrics with request latency and CPU usage can highlight potential bottlenecks in your system, allowing you to scale the whole deployment better.
When data is used inside a JVM-based application it is put onto the heap, first into the smaller young generation and later moved to the usually larger old generation heap space. Assigning an object to an appropriate heap space is one of the garbage collector's responsibilities. The major responsibility, and the one we are most interested in, is the cleaning up of objects that are no longer used. When an object inside the Java code is no longer in use it can be taken out of the heap in the process of garbage collection. That process is run from time to time, for example a few times a second for the young generation and every now and then for the old generation heap. We need to know how fast the collections are, how often they happen and whether everything is healthy.
If your garbage collection process is not stopping the whole application and the old generation garbage collection is not running constantly, that is a good sign - it usually means you have a healthy environment. Keep in mind that correlating garbage collector metrics with memory utilization and performance measurements like request latency may reveal memory issues.
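One simple number to watch is the share of wall-clock time spent in garbage collection. The sketch below assumes you sample a cumulative collection time in milliseconds, the way JVMs typically report it per collector. The function name is hypothetical and no particular threshold is implied.

```python
def gc_overhead_percent(prev_gc_ms, curr_gc_ms, interval_ms):
    """Percentage of an interval spent in garbage collection.

    prev_gc_ms and curr_gc_ms are two samples of a cumulative GC time
    counter; interval_ms is the wall-clock time between the samples.
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return 100.0 * (curr_gc_ms - prev_gc_ms) / interval_ms


# 600 ms of collection time during a 60 second interval.
print(gc_overhead_percent(1000, 1600, 60000))
```

Plotting this value next to request latency makes it easy to see whether latency spikes line up with garbage collector activity.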
Having a healthy Zookeeper ensemble is crucial when running a SolrCloud cluster. It is responsible for keeping collection configurations and the collection state required for the SolrCloud cluster to work, helping with leader election and so on. When Zookeeper is in trouble, your SolrCloud cluster will not be able to accept new data, move shards around or accept new nodes joining the cluster - the only thing that may still work is queries, and only to some extent.
Because a healthy Zookeeper cluster is a required piece of every SolrCloud cluster, it is crucial to have full observability of the Zookeeper ensemble. You should keep an eye on metrics like:
- Statistics of connections established with Zookeeper
- Request latency
- Memory and JVM memory utilization
- Garbage collector time and count
- CPU utilization
- Quorum status
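Several of these values are reported by Zookeeper's `mntr` four-letter command as tab-separated key and value lines. The sketch below parses such output from a string. The sample text is invented, the function name is our own, and depending on your Zookeeper version the command may need to be explicitly enabled on the server.

```python
def parse_mntr(output):
    """Parse tab-separated `mntr` output into a dictionary."""
    stats = {}
    for line in output.splitlines():
        key, sep, value = line.partition("\t")
        if sep:
            # Keep numeric values as integers, everything else as text.
            stats[key] = int(value) if value.isdigit() else value
    return stats


# Sample output, invented for illustration only.
SAMPLE_MNTR = (
    "zk_version\t3.4.13\n"
    "zk_avg_latency\t2\n"
    "zk_num_alive_connections\t5\n"
    "zk_server_state\tfollower\n"
)

stats = parse_mntr(SAMPLE_MNTR)
print(stats["zk_avg_latency"], stats["zk_server_state"])
```

Collecting this from every ensemble member shows whether exactly one node reports itself as the leader and whether latency or connection counts drift on any of them.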
Solr is an awesome search engine and analytics platform, allowing for blazingly fast data indexing and retrieval. Keeping all the relevant Solr and OS metrics under observation is way easier when used with the right monitoring tool. That's why in the second part of the monitoring Solr series, we take a look at the possible options when it comes to monitoring Solr using open source tools. The last part of the series will cover production ready Solr monitoring with Sematext.