Franck Pachot

🔓 tracing writes and locks in YugabyteDB

Here is a quick example to show how to trace, in a lab, the writes and lock intents in YugabyteDB.

Start YugabyteDB with TEST_docdb_log_write_batches set to true. This flag is described as: "Dump write batches being written to RocksDB (For testing only!)"

docker run -d --name yb -p7000:7000 -p9000:9000 -p5433:5433 \
yugabytedb/yugabyte:2.12.5.0-b24 bin/yugabyted start \
--tserver_flags="TEST_docdb_log_write_batches=true,tserver_enable_metrics_snapshotter=false" \
--daemon=false 

Create a test table:

docker exec -it yb ysqlsh -e -h localhost -c "
create table demo (k text primary key, v text);
"

Log everything that happens on this table's tablets:

docker exec -it yb bash -c "
tail -f /root/var/logs/tserver/yb-tserver.INFO
" | 
awk "/$(
docker exec -t yb bash -c '
yb-admin -init_master_addrs localhost list_tablets ysql.yugabyte demo
' | awk '/^[0-9a-f]{32}/{print $1}' | paste -sd'|'
)/" &

The table is identified by ysql.yugabyte demo (YSQL is the PostgreSQL API, yugabyte is the name of the PostgreSQL database, and demo is the name of my table). This works only if the table name exists in a single schema. See https://dev.to/yugabyte/oid-tableid-53dh for more about yb-admin list_tablets.
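
To check what the awk filter will match, you can run the yb-admin command from the pipeline on its own; the first column of each output line is the 32-character tablet ID:

docker exec -it yb bash -c "
yb-admin -init_master_addrs localhost list_tablets ysql.yugabyte demo
"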

Single-shard transaction

Run the command you want to see the writes for:

docker exec -it yb ysqlsh -e -h localhost -c "
insert into demo values ('franck','yugabyte')
returning to_hex(yb_hash_code(k))
"

My table is hash-sharded on the first column of the primary key (the default when not specifying ASC or DESC) and I return the yb_hash_code in hexadecimal to identify my row in the logs.
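
You can also compute this hash code for the rows already in the table, to know which DocKey prefix to look for in the logs:

docker exec -it yb ysqlsh -e -h localhost -c "
select k, to_hex(yb_hash_code(k)) from demo;
"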

I see two writes (PutCF is a put to a RocksDB column family) to the tablet's RegularDB ([R]), as 2 key/value pairs:

I0601 08:22:57.754878   116 tablet.cc:1279] T 864f8ab29c13463b99a81ce9f3fc8b6b P 1be5957dcc1244e78d436f10412e23f3: Wrote 2 key/value pairs to kRegular RocksDB:
  T 864f8ab29c13463b99a81ce9f3fc8b6b P 1be5957dcc1244e78d436f10412e23f3 [R]: Frontiers: { smallest: { op_id: 1.2 hybrid_time: { physical: 1654071777750191 } history_cutoff: <invalid> hybrid_time_filter: <invalid> max_value_level_ttl_expiration_time: <invalid> } largest: { op_id: 1.2 hybrid_time: { physical: 1654071777750191 } history_cutoff: <invalid> hybrid_time_filter: <invalid> max_value_level_ttl_expiration_time: <initial> } }
  T 864f8ab29c13463b99a81ce9f3fc8b6b P 1be5957dcc1244e78d436f10412e23f3 [R]: 1. PutCF: SubDocKey(DocKey(0x86bc, ["franck"], []), [SystemColumnId(0); HT{ physical: 1654071777750191 }]) => null
  T 864f8ab29c13463b99a81ce9f3fc8b6b P 1be5957dcc1244e78d436f10412e23f3 [R]: 2. PutCF: SubDocKey(DocKey(0x86bc, ["franck"], []), [ColumnId(1); HT{ physical: 1654071777750191 w: 1 }]) => "yugabyte"

I've started an RF=1 cluster, but a tablet may have multiple peers, and each one receives the writes. T identifies the tablet and P identifies the peer (it is actually the UUID of the tablet server).

[R] is the RegularDB where committed data is stored. When transactions involve more than one shard, the provisional records are first written to the IntentsDB [I] and moved to [R] at commit.

Single-shard transactions are easy. The row is directly stored as one Document with a SubDocument per column, and locks are maintained in the tserver's memory.

The Document is identified by the primary key, which is prefixed by the hash code with hash sharding: DocKey(0x86bc, ["franck"], [])

The RocksDB key includes the transaction time HT{ physical: 1654071777750191 }, which is the cluster's hybrid logical time used to order the changes.

There is one SubDocument for the table row itself (SystemColumnId(0)) and one per column (I have only one here, ColumnId(1)). The value for this SubDocument is the value of the column: => "yugabyte"
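
Since each column is a separate SubDocument, an update touching a single column should, as far as I understand, write only that SubDocument. You can check it with a single-row update (the new value is just an illustration) and look for a single PutCF on ColumnId(1) in the logs:

docker exec -it yb ysqlsh -e -h localhost -c "
update demo set v='docdb' where k='franck'
"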

Provisional records

In a SQL distributed database, transactions often involve multiple shards. Here is a simple SELECT FOR UPDATE:

docker exec -it yb ysqlsh -e -h localhost -c "
select * from demo for update;
"

In real life, this will be part of a transaction that reserves the row for a future update. To ensure ACID properties, YugabyteDB replicates all intents through the tablet's Raft group, so that the transaction can continue even if a new leader is elected.
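
As a sketch, such a reservation flow could look like the following (the new value is just an illustration). For the trace below, though, I keep the standalone SELECT FOR UPDATE:

docker exec -it yb ysqlsh -e -h localhost -c "
begin;
select * from demo for update;
update demo set v='reserved' where k='franck';
commit;
"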

Now, the logs show only the provisional records, written to the IntentsDB as 5 key/value pairs:

I0601 08:39:43.775913   116 tablet.cc:1279] T 864f8ab29c13463b99a81ce9f3fc8b6b P 1be5957dcc1244e78d436f10412e23f3: Wrote 5 key/value pairs to kIntents RocksDB:
  T 864f8ab29c13463b99a81ce9f3fc8b6b P 1be5957dcc1244e78d436f10412e23f3 [I]: Frontiers: { smallest: { op_id: 1.5 hybrid_time: { physical: 1654072783772289 } history_cutoff: <invalid> hybrid_time_filter: <invalid> max_value_level_ttl_expiration_time: <invalid> } largest: { op_id: 1.5 hybrid_time: { physical: 1654072783772289 } history_cutoff: <invalid> hybrid_time_filter: <invalid> max_value_level_ttl_expiration_time: <initial> } }
  T 864f8ab29c13463b99a81ce9f3fc8b6b P 1be5957dcc1244e78d436f10412e23f3 [I]: 1. PutCF: TXN META 819b7090-4d35-49e3-a70b-1ff240a7efdb => { transaction_id: 819b7090-4d35-49e3-a70b-1ff240a7efdb isolation: SNAPSHOT_ISOLATION status_tablet: e13f6206212a4381b8de921f33dff562 priority: 18446744070215993959 start_time: { physical: 1654072783762938 } }
  T 864f8ab29c13463b99a81ce9f3fc8b6b P 1be5957dcc1244e78d436f10412e23f3 [I]: 2. PutCF: SubDocKey(DocKey(0x86bc, ["franck"], []), []) [kStrongRead, kStrongWrite] HT{ physical: 1654072783772289 } => 'x\x81\x9bp\x90M5I\xe3\xa7\x0b\x1f\xf2@\xa7\xef\xdbw\x00\x00\x00\x00l' (78819B70904D3549E3A70B1FF240A7EFDB77000000006C) (error: Corruption (yb/docdb/primitive_value.cc:1265): Error: failed to decode value TransactionId(819b7090-4d35-49e3-a70b-1ff240a7efdb) WriteId(0) : Failed to decode value in 6C: kRowLock is not allowed in a RocksDB PrimitiveValue)
  T 864f8ab29c13463b99a81ce9f3fc8b6b P 1be5957dcc1244e78d436f10412e23f3 [I]: 3. PutCF: TXN REV 819b7090-4d35-49e3-a70b-1ff240a7efdb HT{ physical: 1654072783772289 } => SubDocKey(DocKey(0x86bc, ["franck"], []), []) [kStrongRead, kStrongWrite] HT{ physical: 1654072783772289 }
  T 864f8ab29c13463b99a81ce9f3fc8b6b P 1be5957dcc1244e78d436f10412e23f3 [I]: 4. PutCF: SubDocKey(DocKey([], []), []) [kWeakRead, kWeakWrite] HT{ physical: 1654072783772289 w: 1 } => TransactionId(819b7090-4d35-49e3-a70b-1ff240a7efdb) none
  T 864f8ab29c13463b99a81ce9f3fc8b6b P 1be5957dcc1244e78d436f10412e23f3 [I]: 5. PutCF: TXN REV 819b7090-4d35-49e3-a70b-1ff240a7efdb HT{ physical: 1654072783772289 w: 1 } => SubDocKey(DocKey([], []), []) [kWeakRead, kWeakWrite] HT{ physical: 1654072783772289 w: 1 }

There are no modifications to the row, but locks are also provisional records, one for the table and one for the row:

  • SubDocKey(DocKey(0x86bc, ["franck"], []), []) for the row, identified by the primary key, where the lock is [kStrongRead, kStrongWrite]
  • SubDocKey(DocKey([], []), []) for the table where the lock is [kWeakRead, kWeakWrite]

In the IntentsDB, a reverse index is maintained to keep track of all entries belonging to a transaction. The transaction metadata (TXN META) records information such as the transaction_id, the status_tablet (the tablet ID of the transaction status table), the isolation level (SNAPSHOT_ISOLATION is the default), the start time, and the priority used to decide which transaction is canceled in case of conflict. Each new SubDocument maintains the reverse index with a TXN REV entry. This is why, in addition to the two provisional records, we see one TXN META and two TXN REV entries, for a total of 5 writes.

After the commit, five more entries are logged:

I0601 08:39:48.801558   156 tablet.cc:1279] T 864f8ab29c13463b99a81ce9f3fc8b6b P 1be5957dcc1244e78d436f10412e23f3: Wrote 5 key/value pairs to kIntents RocksDB:
  T 864f8ab29c13463b99a81ce9f3fc8b6b P 1be5957dcc1244e78d436f10412e23f3 [I]: Frontiers: { smallest: { op_id: 1.6 hybrid_time: { physical: 1654072788783928 } history_cutoff: <invalid> hybrid_time_filter: <invalid> max_value_level_ttl_expiration_time: <invalid> } largest: { op_id: 1.6 hybrid_time: { physical: 1654072788783928 } history_cutoff: <invalid> hybrid_time_filter: <invalid> max_value_level_ttl_expiration_time: <invalid> } }
  T 864f8ab29c13463b99a81ce9f3fc8b6b P 1be5957dcc1244e78d436f10412e23f3 [I]: 1. SingleDeleteCF: TXN META 819b7090-4d35-49e3-a70b-1ff240a7efdb
  T 864f8ab29c13463b99a81ce9f3fc8b6b P 1be5957dcc1244e78d436f10412e23f3 [I]: 2. SingleDeleteCF: SubDocKey(DocKey(0x86bc, ["franck"], []), []) [kStrongRead, kStrongWrite] HT{ physical: 1654072783772289 }
  T 864f8ab29c13463b99a81ce9f3fc8b6b P 1be5957dcc1244e78d436f10412e23f3 [I]: 3. SingleDeleteCF: TXN REV 819b7090-4d35-49e3-a70b-1ff240a7efdb HT{ physical: 1654072783772289 }
  T 864f8ab29c13463b99a81ce9f3fc8b6b P 1be5957dcc1244e78d436f10412e23f3 [I]: 4. SingleDeleteCF: SubDocKey(DocKey([], []), []) [kWeakRead, kWeakWrite] HT{ physical: 1654072783772289 w: 1 }
  T 864f8ab29c13463b99a81ce9f3fc8b6b P 1be5957dcc1244e78d436f10412e23f3 [I]: 5. SingleDeleteCF: TXN REV 819b7090-4d35-49e3-a70b-1ff240a7efdb HT{ physical: 1654072783772289 w: 1 }

This is the cleanup that happens once the provisional records have been moved to the RegularDB. The writes to the RegularDB are not logged here.

Locks

As I use this to check which locks are acquired, here is a short explanation (this is fully documented). The model is quite simple: "weak" records an intent on a subset of a data structure, and "strong" on the entire structure. If you have used an Oracle database, "weak" is similar to the "Sub-", "Row-", or "R" prefix, as in "Row-X", a table lock showing the intention to write a row, which is what SELECT FOR UPDATE does. In YugabyteDB, this is a "kWeakWrite" on DocKey([], []), []. What Oracle calls "share" or "exclusive" is actually the intent to Read or Write. In YugabyteDB, write locks may not conflict, even when strong. When they must conflict, because the write depends on the old value, a strong read intent is recorded (see the two-session sketch after the list below).

My SELECT FOR UPDATE here shows the following locks:

  • SubDocKey(DocKey([], []), []) [kWeakRead, kWeakWrite] to record the weak intention of reading and writing some row in this table.
  • SubDocKey(DocKey(0x86bc, ["franck"], []), []) [kStrongRead, kStrongWrite] to record the strong intention on this specific row. In Oracle, this one would be the lock flag stored within the block for each modified row.
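
To see two strong intents conflict, you can hold the first transaction open while a second session tries to lock the same row. Here is a sketch using pg_sleep to keep the first transaction alive; depending on the version and transaction settings, the second session will either wait or fail with a conflict error:

docker exec yb ysqlsh -h localhost -c "
begin;
select * from demo for update;
select pg_sleep(10);
commit;
" &
sleep 2
docker exec yb ysqlsh -h localhost -c "
select * from demo for update;
"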
