Recently noticed some high avg CPU utilization on an Amazon Aurora Mysql Databases running Mysql 5.6 (oscar:5.6.mysql_aurora.1.22.2)
. Something that was noticed that I thought was interesting to share were zombie threads or threads that were running for a long period of time and never finished as well as threads that were not possible to be killed.
These were simple DDL statements that were triggered by a little reporting engine that created a bunch of temporary tables to gather some aggregations.
A quick look up on the process list tells us that there are some DDL statements stuck for 4 days as shown below:
mysql> show full processlist;
| Id | User | Host | db | Command | Time | State | Info
| 77569519 | app | x.x.x.x:yyyyy | test | Query | 404949 | init | DROP TEMPORARY TABLE IF EXISTS temp1
:::::
TRX status for the same:
mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_TRX where trx_mysql_thread_id = 77569519 \G
*************************** 1. row ***************************
trx_id: 124803462108
trx_state: RUNNING
trx_started: 2022-09-17 21:01:45
trx_requested_lock_id: NULL
trx_wait_started: NULL
trx_weight: 33614
trx_mysql_thread_id: 77569519
trx_query: DROP TEMPORARY TABLE IF EXISTS temp1
trx_operation_state: NULL
trx_tables_in_use: 0
trx_tables_locked: 0
trx_lock_structs: 14
trx_lock_memory_bytes: 376
trx_rows_locked: 0
trx_rows_modified: 33600
trx_concurrency_tickets: 0
trx_isolation_level: READ COMMITTED
trx_unique_checks: 1
trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
trx_adaptive_hash_latched: 0
trx_adaptive_hash_timeout: 0
trx_is_read_only: 0
trx_autocommit_non_locking: 0
1 row in set (0.00 sec)
The initial suspect was disk issues causing these long running queries but was ruled out as metrics seemed ok and the database appeared to have plenty of Local Storage to deal with temporary tables. The next attempt to recover from these were to kill the long query to free up the CPU cycles.
mysql> call mysql.rds_kill(77569519);
Query OK, 0 rows affected (0.00 sec)
mysql> call mysql.rds_kill_query(77569519);
Query OK, 0 rows affected (0.00 sec)
No luck despite attempts to kill the query and even the connection. While rds_kill_query
did not change anything rds_kill
did change the command status from Query
to killed
. Neither of these were helpful in this case and the trx_state
continued to be RUNNING
.
mysql> show full processlist;
| Id | User | Host | db | Command | Time | State | Info
| 77569519 | app | x.x.x.x:yyyyy | test | Killed | 422937 | init | DROP TEMPORARY TABLE IF EXISTS temp1
:::::
Next up was to seek some help from AWS Support and thus gathered the below recommendations:
- Reboot the Amazon Aurora Cluster (or trigger a failover).
- Upgrade From Amazon Aurora
1.x
to Amazon Aurora to2.x
. Particularly2.07.8
which has some fixes from the community edition for stability around temporary tables.
Note that Aurora 2.x
would mean an upgrade to Mysql 5.7.x
from a compatibility standpoint.
Hope this helps!
Top comments (0)