It was both worrying and very interesting: Kafka went down and wouldn't restart. The cause turned out to be an Operating System (OS) level setting that had never been tuned, not anything in Kafka itself. I wasn't used to looking so far under the hood of how Kafka works, but the investigation led me to delve into how Java interacts with the OS.
Kafka starts, but stops very soon after with a memory allocation error like the one below:
 INFO [TransactionCoordinator id=3] Starting up. (kafka.coordinator.transaction.TransactionCoordinator)
 Java HotSpot(TM) 64-Bit Server VM warning: Attempt to protect stack guard pages failed.
 Java HotSpot(TM) 64-Bit Server VM warning: Attempt to deallocate stack guard pages failed.
 INFO [TransactionCoordinator id=3] Startup complete. (kafka.coordinator.transaction.TransactionCoordinator)
 Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f4e0ff1b000, 12288, 0) failed; error='Cannot allocate memory' (errno=12)
 #
 # There is insufficient memory for the Java Runtime Environment to continue.
 # Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
 # An error report file with more information is saved as:
 # /opt/kafka/bin/hs_err_pid2423.log
 Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f4eed723000, 262144, 0) failed; error='Cannot allocate memory' (errno=12)
 [thread 139981501064960 also had an error]
 Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f4fe5cc0000, 65536, 1) failed; error='Cannot allocate memory' (errno=12)
 [thread 139980200949504 also had an error]
The fix is to increase the maximum number of memory maps the OS allows a process to create:
sysctl -w vm.max_map_count=<number of memory maps>
# the default on Linux is 65530, e.g.:
sysctl -w vm.max_map_count=262144
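Note that a `sysctl -w` change only lasts until the next reboot. To make the new limit survive restarts, it also needs to go into a sysctl configuration file. A minimal sketch, assuming a conventional drop-in filename (`/etc/sysctl.d/99-kafka.conf`) and an illustrative value of 262144:

```shell
# Apply the new limit to the running kernel now
sysctl -w vm.max_map_count=262144

# Persist it so it survives a reboot
echo 'vm.max_map_count=262144' > /etc/sysctl.d/99-kafka.conf

# Reload the file to confirm it parses and applies cleanly
sysctl -p /etc/sysctl.d/99-kafka.conf
```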
It turns out that Kafka allocates memory maps for each log segment file, in each partition, in each topic. When you keep Kafka messages indefinitely and also have a large number of partitions, the process eventually hits the limit on the number of memory maps it can allocate. This is described in detail in the Kafka docs (third bullet point in that section).
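One way to gauge how high the limit needs to be is to count the segment and index files sitting under the broker's log directories, since each is a candidate for a mapping. The path below is just an assumption; use whatever `log.dirs` is set to in your `server.properties`:

```shell
# Count the per-segment files (.log plus its .index and .timeindex
# companions) across all topic-partition directories
find /var/lib/kafka/logs -type f \
  \( -name '*.log' -o -name '*.index' -o -name '*.timeindex' \) | wc -l
```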
To find out how many memory maps a Java process is using, first look up its process id, then count its mappings:

jps -l # lists Java process ids alongside their main class
pmap <id of kafka process> | wc -l
This is the kind of problem that only surfaces when Kafka is restarted. It demonstrates how important it is to test restarts (i.e. failover and failback scenarios), even in production. Being able to restart with confidence is vital for restoring service quickly when things go wrong.
This also shows the need for good monitoring of OS level values, not just application level values or simple CPU and memory consumption. In this case the number of memory maps will only keep growing when you retain messages forever or take on more load. Knowing what to monitor is crucial, and sometimes it can only be understood by reading through all the documentation or finding public postmortems to learn from.
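As a sketch of that kind of monitoring, a check could compare a process's current mapping count against `vm.max_map_count` and warn when it gets close. The 80% threshold and the use of the script's own pid (`$$`) as a stand-in are assumptions; in practice you would point it at the broker's pid:

```shell
#!/bin/sh
# Warn when a process has used 80% or more of the allowed memory maps.
pid=$$                                    # stand-in; use the Kafka broker's pid
used=$(wc -l < /proc/"$pid"/maps)         # one line per mapping
limit=$(cat /proc/sys/vm/max_map_count)   # the per-process limit
if [ "$used" -ge $((limit * 80 / 100)) ]; then
  echo "WARN: $used of $limit memory maps in use by pid $pid"
fi
```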