High CPU and zombie threads on Amazon Aurora Mysql 5.6

#aws #rds #auroramysql

Recently noticed some high avg CPU utilization on an Amazon Aurora Mysql Databases running Mysql 5.6 (oscar:5.6.mysql_aurora.1.22.2). Something that was noticed that I thought was interesting to share were zombie threads or threads that were running for a long period of time and never finished as well as threads that were not possible to be killed.

These were simple DDL statements that were triggered by a little reporting engine that created a bunch of temporary tables to gather some aggregations.

A quick look up on the process list tells us that there are some DDL statements stuck for 4 days as shown below:

mysql> show full processlist;
| Id       | User            | Host                | db         | Command | Time   | State                     | Info                
| 77569519 | app        | x.x.x.x:yyyyy | test | Query   | 404949 | init                      | DROP TEMPORARY TABLE IF EXISTS temp1
:::::

TRX status for the same:

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_TRX where trx_mysql_thread_id = 77569519 \G
*************************** 1. row ***************************
                    trx_id: 124803462108
                 trx_state: RUNNING
               trx_started: 2022-09-17 21:01:45
     trx_requested_lock_id: NULL
          trx_wait_started: NULL
                trx_weight: 33614
       trx_mysql_thread_id: 77569519
                 trx_query: DROP TEMPORARY TABLE IF EXISTS temp1
       trx_operation_state: NULL
         trx_tables_in_use: 0
         trx_tables_locked: 0
          trx_lock_structs: 14
     trx_lock_memory_bytes: 376
           trx_rows_locked: 0
         trx_rows_modified: 33600
   trx_concurrency_tickets: 0
       trx_isolation_level: READ COMMITTED
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 0
          trx_is_read_only: 0
trx_autocommit_non_locking: 0
1 row in set (0.00 sec)

The initial suspect was disk issues causing these long running queries but was ruled out as metrics seemed ok and the database appeared to have plenty of Local Storage to deal with temporary tables. The next attempt to recover from these were to kill the long query to free up the CPU cycles.

mysql> call mysql.rds_kill(77569519);
Query OK, 0 rows affected (0.00 sec)

mysql> call mysql.rds_kill_query(77569519);
Query OK, 0 rows affected (0.00 sec)

No luck despite attempts to kill the query and even the connection. While rds_kill_query did not change anything rds_kill did change the command status from Query to killed. Neither of these were helpful in this case and the trx_state continued to be RUNNING.

mysql> show full processlist;
| Id       | User            | Host                | db         | Command | Time   | State                     | Info                
| 77569519 | app        | x.x.x.x:yyyyy | test | Killed  | 422937 | init     | DROP TEMPORARY TABLE IF EXISTS temp1
:::::

Next up was to seek some help from AWS Support and thus gathered the below recommendations:

Reboot the Amazon Aurora Cluster (or trigger a failover).
Upgrade From Amazon Aurora 1.x to Amazon Aurora to 2.x. Particularly 2.07.8 which has some fixes from the community edition for stability around temporary tables.

Note that Aurora 2.x would mean an upgrade to Mysql 5.7.x from a compatibility standpoint.

Hope this helps!

DEV Community

High CPU and zombie threads on Amazon Aurora Mysql 5.6

Top comments (0)

Read next

Storing Certificate Files in AWS Secrets Manager and Accessing Them in Kubernetes Secrets

VPC y Subredes en AWS - Parte 2: Configuración de Conectividad Segura entre Recursos y hacia Internet

Resolving ECS Task Definition Security Risks Detected by AWS Security Hub Using Secrets Manager

Joins, Scale, and Denormalization