Recently, I began studying how databases work under the hood, and this article is the result of a small part of those studies. Today, I will share what I have learned about transactions and how Database Management Systems (DBMS) minimize errors, both in terms of data inconsistency and general failures that can affect the availability and performance of the database. I encourage everyone to read the documentation of the DBMS they are using to make better use of the tools it provides and to improve the performance of their applications. This article is a generalization of how DBMSs implement these properties; only the documentation will give you concrete answers.
Transactions
We can say that a transaction is a unit of work. Sometimes, it is difficult to perform all the data manipulation you want to do with just one query; it may even be impossible. That's where a transaction comes in – to execute a sequence of queries.
Transactions are generally used for data manipulation (though nothing prevents a transaction from containing only read queries). For this purpose, they have a set of properties that makes data manipulation and querying safer in terms of data quality. This set of properties is formed by the concepts of Atomicity, Consistency, Isolation, and Durability, which form the acronym ACID. Later, we will delve deeper into how they work and what they represent.
Returning to transactions, each DBMS addresses the topic in its own way, with some prioritizing performance in certain operations and others in different ones, but in general the way they work is quite similar. We will start with the commands that initiate and terminate transactions and what exactly they do.
Transaction Handling
- BEGIN
When we give the command BEGIN to the database, we are asking it to start a new "branch." In this context, a "branch" is something like a checkpoint in a game. We start our branch here, can perform our SQL manipulation or queries, and have some "privileges," such as the ability to roll back without the database applying the changes we instructed within the branch.
- COMMIT
As I mentioned earlier, starting a transaction, or, as we are calling it, a "branch," gives us some privileges, such as the ability to revert to the database's state at the point where we began that transaction. But how do we confirm these changes (by changes, I mean the DML operations executed within that same branch)? With the COMMIT command, we ask the database to save the changes we made. In other words, all DML we performed in the transaction is permanently applied to the database and cannot be reverted to the previous state without further DML commands.
- ROLLBACK
The ROLLBACK command tells the database that we do not want any of the queries executed "within" that transaction to be applied. In other words, everything executed in the transaction will be undone. Some databases allow the use of the ROLLBACK command on DDL operations (e.g., create table) and the like, while others do not. This depends on the DBMS you are using. I recommend reading the documentation of the DBMS you will use in your project.
- ERRORS
As I mentioned earlier, DBMSs implement transactions in their own way, and error behavior can vary among them. But it is almost a standard that the database performs a ROLLBACK whenever an error occurs, to preserve data reliability. This behavior ties directly into the concept of atomicity, which we will look at next; first, the sketch below shows the commands above working together.
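As a minimal sketch of these commands in sequence (PostgreSQL-style syntax; the accounts table and its columns are hypothetical, used only to show where BEGIN, COMMIT, and ROLLBACK fit):

```sql
-- Start a transaction ("branch"), make a change, and keep it
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;   -- the change is now permanent

-- Start another transaction, make a change, and discard it
BEGIN;
UPDATE accounts SET balance = balance - 999 WHERE id = 1;
ROLLBACK; -- the balance is back to what it was at this transaction's BEGIN
```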
ACID
Atomicity
Atomicity is one of the four concepts that dictate the behavior of a transaction, and it mainly deals with cases of failure. The core idea of atomicity is that a transaction is a unit of work in which all queries must execute successfully, or none of them takes effect. While the transaction is "open" and you are performing queries or manipulations, if a query fails, a constraint is violated, the database crashes for any reason, or an external force such as a power outage strikes, the database will perform a ROLLBACK. Even in more extreme cases, such as a crash while the database is performing a COMMIT (writing changes to disk), it will undo all the queries that had succeeded and return to the state before the transaction started.
Example:
Let's perform a bank transfer between João and Maria with two queries: one to withdraw money from João's account and another to deposit it into Maria's account.
Withdrawing from João's account:
UPDATE BANK_ACCOUNTS SET BALANCE = BALANCE - 100 WHERE NAME = 'JOAO'
Depositing into Maria's account:
UPDATE BANK_ACCOUNTS SET BALANCE = BALANCE + 100 WHERE NAME = 'MARIA'
Committing changes:
COMMIT
- [x] João's query: executed
- [ ] Maria's query: not executed (ERROR IN THE DATABASE)
In our example, the database encountered a memory error and could not write both changes to disk. So, instead of deducting the amount from João's account without crediting it to Maria's (which would leave a hole of 100 dollars in the system and serve inconsistent data), the database wrote neither change and reverted to the state before the transaction started.
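As a sketch, the transfer above can be run as a single atomic unit, using the BANK_ACCOUNTS table from the example, so that either both updates are applied or neither is:

```sql
BEGIN;
UPDATE BANK_ACCOUNTS SET BALANCE = BALANCE - 100 WHERE NAME = 'JOAO';
UPDATE BANK_ACCOUNTS SET BALANCE = BALANCE + 100 WHERE NAME = 'MARIA';
COMMIT;
-- If anything fails before the COMMIT completes, the DBMS rolls back
-- and neither balance changes.
```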
Isolation
As most database management systems (DBMS) allow multiple connections from different sources, each executing various queries, there is a need for transactions executed by these different connections to be isolated, meaning they do not share their operations and changes until they are committed. The principle of isolation ensures that transactions within a database are executed separately, even when occurring concurrently.
Within isolation, there is something we call read phenomena, which describes the errors that can occur when operations are executed in different transactions running simultaneously. There are four phenomena in total:
Dirty read
A dirty read occurs when a transaction reads data that another transaction has modified but not yet committed. When this happens, there is a possibility that the change will never be committed, and the query will have returned inconsistent data. To illustrate this, let's use the following table:
sku | quant | price |
---|---|---|
1 | 5 | 20 |
2 | 10 | 15 |
We initiate the following query in Transaction 1:
select sku, quant*price from orders
The result will be as follows:
sku | quantXprice |
---|---|
1 | 100 |
2 | 150 |
Next, Transaction 2 begins in another database connection and performs the following manipulation:
update orders set quant = quant + 5 where sku = 1
The orders table will now look like this:
sku | quant | price |
---|---|---|
1 | 10 | 20 |
2 | 10 | 15 |
Transaction 1 then performs another query:
select sum(quant*price) from orders
The query result will be:
sum(quantXprice) |
---|
350 |
Then Transaction 2 executes the ROLLBACK command, undoing all the changes it made, and the orders table returns to its initial state. Consequently, the correct result of the sum(quant*price) operation is not 350 but rather 250.
This example illustrates the concept of a "dirty read" and how it can lead to inconsistent data when transactions are not isolated: one transaction sees data that has been modified by another transaction that has not yet committed its changes.
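Here is a rough sketch of that timeline in SQL. Note that PostgreSQL, for instance, never actually serves dirty reads (it treats READ UNCOMMITTED as READ COMMITTED), so this scenario is only observable on engines that genuinely expose READ UNCOMMITTED:

```sql
-- Session 1
BEGIN;
SELECT sku, quant * price FROM orders;   -- returns 100 and 150

-- Session 2
BEGIN;
UPDATE orders SET quant = quant + 5 WHERE sku = 1;   -- not committed yet

-- Session 1: under READ UNCOMMITTED it can see the uncommitted change
SELECT SUM(quant * price) FROM orders;   -- 350: a dirty read

-- Session 2
ROLLBACK;   -- the data behind that 350 never officially existed
```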
Non-repeatable Read
A non-repeatable read is quite similar to a dirty read, with the difference being that Transaction 2 commits its change before Transaction 1 performs the second query. This leads to data inconsistency. You might think, "But if the change has been committed, shouldn't my report have the most up-to-date number?" However, this contradicts the principle of isolation, which aims to isolate the states and changes of each transaction.
In a large system where the database experiences a high volume of changes, this can be problematic. Since transactions can observe changes made by other transactions, the results can be inconsistent or unexpected.
To illustrate, let's use the same table from the dirty read example:
sku | quant | price |
---|---|---|
1 | 5 | 20 |
2 | 10 | 15 |
Transaction 1 executes a query:
select sku, quant*price from orders
The result will be as follows:
sku | quantXprice |
---|---|
1 | 100 |
2 | 150 |
Then Transaction 2 begins in another database connection and performs the following manipulation, followed by a commit:
update orders set quant = quant + 5 where sku = 1;
commit;
The orders table will now permanently look like this:
sku | quant | price |
---|---|---|
1 | 10 | 20 |
2 | 10 | 15 |
Transaction 1 then performs another query:
select sum(quant*price) from orders
The query result will be:
sum(quantXprice) |
---|
350 |
The two examples look alike because the phenomena are closely related. The key difference is that a dirty read does not require the change to be committed for other transactions to observe it, whereas in a non-repeatable read the change must be committed before other transactions observe it.
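As a sketch of how the second read could be protected, if Transaction 1 runs under REPEATABLE READ (PostgreSQL syntax shown, using the example table above), the committed change from Transaction 2 stays invisible until Transaction 1 ends:

```sql
-- Session 1
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT sku, quant * price FROM orders;       -- 100 and 150

-- Session 2
BEGIN;
UPDATE orders SET quant = quant + 5 WHERE sku = 1;
COMMIT;

-- Session 1 keeps reading the snapshot it started with
SELECT SUM(quant * price) FROM orders;       -- still 250, not 350
COMMIT;
```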
Phantom Read
A phantom read presents the same issue as the other two phenomena: data inconsistency. The difference is that a phantom read involves rows that were not read initially. Instead of another transaction changing rows that were already read, rows are added that were not part of the initial read but that affect the second result. It's called a phantom read because rows that alter the result were never read, since the change occurred after the transaction's initial query.
Let's illustrate with the same table example:
sku | quant | price |
---|---|---|
1 | 5 | 20 |
2 | 10 | 15 |
Transaction 1 performs the following query:
select sku, quant*price from orders
The result will be as follows:
sku | quantXprice |
---|---|
1 | 100 |
2 | 150 |
Then Transaction 2 begins in another database connection and performs the following manipulation, followed by a commit:
insert into orders values (3, 15, 2);
commit;
The table will now look like this:
sku | quant | price |
---|---|---|
1 | 5 | 20 |
2 | 10 | 15 |
3 | 15 | 2 |
Now, Transaction 1 performs another query:
select sum(quant*price) from orders
The query result will be:
sum(quantXprice) |
---|
280 |
In this case, the row with SKU 3 was added in a separate transaction but affected the consistency of Transaction 1's queries, adding a "phantom row."
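A sketch of how this can be avoided, in PostgreSQL syntax, where SERIALIZABLE (and PostgreSQL's snapshot-based REPEATABLE READ) keeps the newly inserted row invisible to Transaction 1:

```sql
-- Session 1
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT sku, quant * price FROM orders;       -- 100 and 150

-- Session 2
BEGIN;
INSERT INTO orders VALUES (3, 15, 2);
COMMIT;

-- Session 1: the new row stays invisible inside this transaction
SELECT SUM(quant * price) FROM orders;       -- still 250, no phantom row
COMMIT;
```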
Lost Updates
Lost updates aren't exactly a read phenomenon, but we'll include them in this topic. A lost update occurs when two transactions attempt to modify the same row and one of them ends up overwriting the change made by the other. Because transactions don't share their status (what they read or changed, when they finished, and so on), a write-write conflict occurs in which one transaction's change is silently lost once the other transaction writes to the same row and commits.
Let's illustrate this with the same table used in the previous examples:
sku | quant | price |
---|---|---|
1 | 5 | 20 |
2 | 10 | 15 |
Transaction 1 executes the following query:
update orders set quant = quant + 5 where sku = 1
The table will look like this (important to note that no commit has occurred):
sku | quant | price |
---|---|---|
1 | 10 | 20 |
2 | 10 | 15 |
Afterward, Transaction 2 begins, where it also executes a modification query and commits:
update orders set quant = quant + 10 where sku = 1;
commit;
The table is now permanently updated as follows:
sku | quant | price |
---|---|---|
1 | 15 | 20 |
2 | 10 | 15 |
Right after that, Transaction 1 performs the query:
select sku, quant*price from orders
The expected result was:
sku | quantXprice |
---|---|
1 | 200 |
2 | 150 |
However, because of the lost update, Transaction 1 now reads from the committed table (where Transaction 2 had just applied its change), and the result will be:
sku | quantXprice |
---|---|
1 | 300 |
2 | 150 |
This means the uncommitted change made by Transaction 1 was discarded without an explicit command, resulting in an unexpected outcome and data inconsistency.
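One common way to avoid a lost update is to lock the row before modifying it. A minimal sketch using SELECT ... FOR UPDATE (PostgreSQL syntax), with the table from the example:

```sql
-- Session 1
BEGIN;
SELECT quant FROM orders WHERE sku = 1 FOR UPDATE;   -- locks the row
UPDATE orders SET quant = quant + 5 WHERE sku = 1;

-- Session 2: its own update on sku = 1 now blocks until Session 1 ends
-- UPDATE orders SET quant = quant + 10 WHERE sku = 1;   -- waits...

-- Session 1
COMMIT;   -- only now does Session 2's update run, on top of quant = 10
```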
These are the phenomena that the concept of database isolation aims to eliminate. So, what are isolation levels exactly?
Isolation Levels
Isolation is divided into different levels because each level aims to address specific issues (the read phenomena). Each DBMS implements its isolation model based on the problems and demands it seeks to address (in some DBMSs, you can choose the isolation level for each transaction). The isolation levels are:
- Read Uncommitted: no isolation; any change can be observed from outside the transaction, whether committed or not.
- Read Committed: minimal isolation; any change made and committed within a transaction can be observed by other transactions.
- Repeatable Read: medium isolation; the transaction ensures that once a query reads a row, that row remains unchanged for the rest of the transaction's execution.
- Snapshot: true isolation; each query in the transaction runs against the tuples that were committed before the transaction started. It's like taking a "snapshot" of the current state of the database and executing queries against it, completely isolating that transaction from changes made by any other transaction.
- Serializable: the strictest level; it ensures that even if multiple transactions run simultaneously, the final result is the same as if they had been executed sequentially, one after the other.
The table below illustrates which problems each isolation level addresses.
Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read | Lost Updates |
---|---|---|---|---|
Read Uncommitted | Possible | Possible | Possible | Possible |
Read Committed | Prevented | Possible | Possible | Possible |
Repeatable Read | Prevented | Prevented | Possible | Possible |
Snapshot | Prevented | Prevented | Prevented | Prevented |
Serializable | Prevented | Prevented | Prevented | Prevented |
These isolation levels offer a range of trade-offs between data consistency and performance, allowing you to choose the level that best suits your application's requirements.
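In DBMSs that let you choose the level, it is typically set when the transaction starts or as a session default. A short sketch in PostgreSQL syntax:

```sql
-- Check the level new transactions will use in this session
SHOW default_transaction_isolation;

-- Change the default for the current session
SET default_transaction_isolation = 'repeatable read';

-- Or choose a level for a single transaction only
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- ... queries and DML ...
COMMIT;
```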
Consistency
Throughout the text, I mentioned the term "data inconsistency," which basically means that data does not hold the value it should after manipulation commands are executed and ends up not matching real-world events. In this section, we will address data inconsistencies within the database.
To illustrate this, let's use two example tables:
Table "orders":
sku | quant | price |
---|---|---|
1 | 2 | 3 |
2 | 5 | 2 |
Table "relat_orders":
sku | amount |
---|---|
1 | 5 |
2 | 10 |
In this example, we have a relationship inconsistency. The "relat_orders" table has the "amount" column, which is the result of quant x price from the "orders" table. SKU 1 should have an "amount" of 6, but it actually has 5 because the tables do not have the proper relationship. To facilitate table modeling within the database, DBMSs have constraints, which are rules applied to columns to standardize behavior or relationships between them. Examples of constraints include unique, not null, and foreign key constraints.
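As a sketch of those constraints, assuming hypothetical definitions for the two example tables, a foreign key would at least guarantee that every row in relat_orders points to an existing order (keeping amount equal to quant * price would still require a trigger or application logic, which is outside the scope here):

```sql
CREATE TABLE orders (
    sku   int PRIMARY KEY,      -- unique and not null
    quant int NOT NULL,
    price int NOT NULL
);

CREATE TABLE relat_orders (
    sku    int REFERENCES orders (sku),   -- foreign key constraint
    amount int NOT NULL
);
```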
The concept of consistency in ACID refers to a database that maintains the integrity of all constraints. The term "integrity" refers to the accuracy or correctness of the data's state within a database, where the database cannot transition from a consistent state to an inconsistent state. Let's work with some scripts to better understand this concept.
Table creation script (PostgreSQL):
CREATE TABLE orders (
    id serial,
    sku int,
    status varchar(10) NOT NULL
);
Tuple insertion script (PostgreSQL):
INSERT INTO orders (sku) VALUES (1);
The database cannot accept the second script because the "status" column must have a value in every inserted tuple (due to the NOT NULL constraint). If the insertion were accepted, the database would transition from a consistent state to an inconsistent one, because a constraint would have been violated, consequently violating the principle of consistency.
Durability
This is the last property in the set that makes the database so reliable throughout the process of creating, managing, and storing data. Durability ensures that all changes made within a transaction are written to non-volatile memory after a commit and cannot be altered in any other way except through a subsequent transaction. In other words, durability ensures that data remains stored and can be queried and manipulated even in the event of a power outage, errors, or any type of database failure that leaves it inactive for a certain period.
The trade-off of durability is quite clear. To establish the durability property, data must be written to non-volatile memory as mentioned earlier (usually a hard drive), and non-volatile memories are naturally slow (I will write about how databases store data on disk and why they are slow in certain operations in the future).
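As a sketch of this trade-off in PostgreSQL, the synchronous_commit setting controls whether COMMIT waits for the write-ahead log to reach disk:

```sql
-- Default: COMMIT returns only after the WAL record is flushed to disk
SET synchronous_commit = on;

-- Faster commits, but the most recent transactions can be lost on a crash
SET synchronous_commit = off;
```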
In-memory databases like Redis, which use RAM (volatile memory), trade away some of the guarantees of the ACID properties in favor of performance for both write and read operations. In terms of durability, the trade-off is also quite clear. Continuing with Redis, the low-latency access to data comes with the risk that an unpredictable event (a power outage, errors, etc.) shuts down the service and causes data loss, since the data is not saved in non-volatile memory. Redis also implements things quite similar to atomicity, consistency, isolation, and durability in its own way. It caters to very specific demands, and these demands are very different from what databases that use non-volatile memory aim to solve.
Conclusion
With this, we have been able to understand how a database with ACID properties minimizes errors for both read and write operations. It's important to read about how the DBMS you are using in your project implements these "concepts." What I presented here is a generalization of how databases approach transaction management and data storage.
Finally, I recommend the "Fundamentals of Database Engineering" course available on Udemy (I'm not someone who usually pays for courses, but this one is really worth it). I'll leave the link below.
Course: Fundamentals of Database Engineering
Thank you very much to everyone who read this!