What's This About?
I've been in software development for many (many) years, and teaching a class in web development for a while now. Part of the curriculum is devoted to MySQL, and when that part of the semester is over I always feel like so much has been left out.
I've been keeping notes on things I've shared with my students to try and build up their knowledge of databases, and I'm going to put them here in the hopes that more people get some use out of them.
What's a Database Transaction?
It's all the rage to explain things like the listener is 5 years old, so I'm going to go with that.
A 5-year-old with two bank accounts, for sure, but a 5-year-old nonetheless.
Let’s say you’re on your banking app and you want to transfer $100 from one account to another - from savings to checking, to keep it simple. From what you may already know about databases, you’ve probably guessed that the money doesn’t really “go” anywhere. The transfer happens in two steps:
1) Your savings account balance goes down by $100
2) Your checking account balance goes up by $100
Now, suppose something goes wrong in that split-second between steps 1 and 2. The power goes out in the bank’s data center or something. Now, you’re stuck with $100 less in your savings account but without the extra $100 in checking, right?
Fortunately, no. If that was how it happened, you’d see a lot of headlines such as “bank power outage leaves millions stranded without weekend drinking money.” and I’m sure I’d remember reading that. So, what keeps problems like that from happening?
Well, if you thought there might be a problem with this transfer like I described above, which option would you choose?
1) Go ahead and do it, and if I lose $100, I lose $100.
2) Go ahead and do it, and if anything goes wrong just put the $100 back into my savings account and I’ll try it later.
You’re going to pick #2, right? At the very least, you want to know that if the transfer doesn’t happen all the way, you can just go back to square one and try another time.
That’s how databases do it: they use what’s called a transaction. According to techopedia, a transaction is
“a logical unit that is independently executed for data retrieval or updates”
Which is a terrible definition for a 5-year-old. Forget that.
A database transaction is just a series of steps that have to happen in an all or not-at-all way. Just like the transfer between your two bank accounts, if everything can’t be done, it’s best to just roll everything back to where we started.
In databases, transactions aren’t just about money, and they happen all the time. Let’s say you change a customer’s ID number in the database and then have to change all of that customer’s orders to reflect the new ID for the customer. That might involve hundreds of steps (1 step to change the customer’s ID and countless steps to change the orders for that customer), and if anything happened in the middle of that process - not just a power outage but maybe something as simple as one of the order records being used by another user - it’d best to just get an error message explaining what happened, put everything (the customer and the orders) back where you found them, and try again. The last thing you want is for the customer record and, say, 30 of the order records to be changed, leaving dozens of order records unchanged.
(Insert ACID Joke Here)
Which brings us to the topic of ACID.
ACID is an acronym coined by database people to describe what makes a transaction a transaction (or what a piece of software has to do before it can say that it’s a database).
Let’s go in the opposite order with the letters since I think they’re easier to understand that way (and DICA probably got laughed out of committee)
(D)urable - once a transaction is completed (or committed, as we say), it can’t be undone or somehow not happen. So, once that customer record and its orders are completely updated, you could reboot the whole data center and the customer and orders would still be changed.
Note: this might sound like a silly requirement in that it’s so easy. In reality, a lot of databases do their busiest work in the computer’s memory and only write things out to disk once in a while. This means that it’s possible for data to be changed and somehow not be saved unless the database is very careful.
Also, you might be thinking “if I transferred money from savings to checking, then transferred it back from checking to savings, that’s the same thing as undoing it, right?” No. It’s kind of the same in the sense that things ended up just like they started. On the other hand, that's like saying that taking a vacation to France would be undone if you happened to take off and land on the same runway. Moving money to checking and moving it back would actually be two transactions. Go try it now, then check your bank statement. There’s no “backsies” there: just you moving money around like the world’s smallest hedge fund trader.
(I)solated - my transaction doesn’t have anything to do with your transaction. If I’m at an ATM transferring $100 from one account to another, the fact that you’re inside the bank closing on a mortgage has nothing to do with me. The bank is able to handle lots and lots of transactions at the same time without any of them stepping on each other, and a database can too.
(C)onsistent - once a transaction is committed, anyone looking at the database should see the same story (assuming they have permission to, of course). If I transfer that $100 from savings to checking, I should be able to check my balance and see that the money has moved. In fact, if I somehow checked my account at an ATM all the way across the world, I should also see that the money has moved. A database won’t say that the transaction is done until it’s done for everyone to see.
--- Nerd Alert ---
This has been a big difference between SQL and NoSQL databases. SQL databases (like MySQL, Microsoft SQL Server, and Oracle) tend to go for immediate consistency, like what I described above. That’s their thing. NoSQL databases like MongoDB go for eventual consistency. Eventual consistency would be like saying that if you transfer that $100, you could see it on your phone app and maybe at your local bank, but someone at a bank across town or on the other side of the world might not see it for a few seconds. It’s not a bad trade off - most people probably don’t care if someone on the other side of the world can see that transfer the instant it happens - but it’s important to know which kind of consistency you have.
Eventual consistency isn't considered good enough for something to be ACID. Upon hearing this, many members of the NoSQL database community decided that they didn't want to be ACID anyway, and instead they'd go for being:
1) Basically Available - generally up and running
2) Soft-state - updates don't always show up instantly, but
3) Eventually consistent - they will soon enough
Otherwise known as BASE (get it?)
MongoDB looks like it’s gotten almost immediate consistency in its latest version, so there may be changes to look out for soon in the NoSQL world.
--- End of Nerd Alert ---
Why is eventual consistency considered a good idea? Because NoSQL databases like MongoDB are heavily invested in things like sharding, which allows you to spread a database over lots of different servers rather than putting it on one big server. Among other things, sharding lets you add a server - old, new, cheap, powerful - to a MongoDB setup and have the new server take on some of the work. It also makes it easier to have databases in different parts of the world which act like one big database. That is very, very cool, but it sometimes means that a transaction has to travel around a lot before all the parts of the database know about it. Fans of eventual consistency believe that it’s better to just sort of admit up front that transactions may not be immediately visible to everyone everywhere (since, as I mentioned, that’s not usually what people are looking for anyway).
(A)tomic - this is that “all or not-at-all” thing from before. It means that all of the steps in a transaction have to be done, or none of them are. Just like how an atom is the smallest thing there is (or used to be, I guess), a transaction is the smallest bit of work a database should do. If it can’t do every step in the transaction, just put everything back and walk away (so to speak).
So how do databases do all of this? Let’s go back to the money transfer example. In this case, the database would write something in a special file known as a transaction log. It’s like a to-do list for databases. When the database is told to lower the savings account balance by $100 and raise the checking account balance by $100, it writes both steps out in the transaction log. When it does step 1 (lowering savings by $100), it puts the database equivalent of a checkmark next to step 1. When it does step 2, it puts another checkmark next to step 2. Then it notices that it’s done everything it was supposed to and sort of scribbles out both steps to tell itself that the whole transaction was done.
If something drastic like a power outage happens, the database knows that the first thing it needs to do after it starts back up is look at that to-do list for anything that wasn’t completely scribbled out. Those things are uncommitted transactions, and when it finds one it looks at the steps to see which of them it was able to do (for instance, lower savings by $100). To put things right, the database will undo those steps (or perform a rollback, as we say).
Okay, that's probably good for now. Please leave me some feedback, especially if you want to see something else like this covered!