Utsav

Posted on May 12, 2023

Database management systems

Introduction

DBMS is perhaps one of the most important concepts that every back-end engineer needs to have a reasonable level of mastery of. This paper aims to help readers understand all the basic concepts, We won't touch on the details considering there is a lot to cover here, But i'll try to make it comprehensive

Transactions

Transactions are a series of logically related operations. They are a fundamental part of DBMS.

Operations involving transactions:

Read operation: This operation is used to read data from the database and store it in the buffer memory
Write operation: This operation is used to write data to the database from the buffer.
Commit: This operation is used to save the work done permanently
Rollback: Used to rollback the work done, in case of failures.

ACID Properties

This is an important principle if you want to have a robust database.

Atomicity

One transaction must occur in its entirety, if the transaction fails, No changes should be reflected in the database. It should either be a complete success or a complete failure.

Consistency

The changes must remain consistent throughout the database, i.e. the data must remain correct after the transaction.

Isolation

Every transaction must be independent of other concurrent transactions, It can be a huge problem if transactions are allowed to interfere with each other.

Durability

This property ensures that the data in the database perists even after system failures.

CAP Theorem

This theorem is used by system desginers working with distributed systems to manage trade offs while desgining shared-data systems(over networks). The theorem states that it is not possible to guarantee all three desriable properties from a distributed system. It consists of three properties:

Consistency: Please do not confuse this consistency with the one in ACID properties, consistency in this context refers to maintaining the consistency of all copies of the database across all nodes in the distrubuted system.
Availability: This property means that the database should always be available for operations and all nodes in the network should be able to respond in reasonable time.
Partition Tolerance: Partition tolerance can be applied over a network if the nodes are separated into clusters (It's kind of like strongly connected components in a graph), So when a fault occurs in a cluster, the outage is restricted to that cluster.

Joins

Joins are used in SQL when you require data from two tables. Let me demonstrate with an example:

Let's say you need permanent address and bank account details of an employee, but they're located in two different tables, here you'd need to join the two tables.

You'll start to realise you need some common attribute for these employees to connect these two tables, It can't be names because two people can have the same names, it needs to be something that is guaranteed to be unique for all employees. Here comes the concept of primary keys and foreign keys. If you have a number that is unique for all records, you can use that attribute to match the records from the two tables and get the desired attributes. Now, lets get into joins, there's all kinds of them.

Inner join: Essentially gets all the data that's common for both of the tables, discards everything else.

Left join: Gets all the data from the table that's been joined on and the data that's common for both.

Right join: The opposite of the left join.

Outer join: Gets everything

Aggregation functions

Aggregate functions are usually used in conjunction with the group by clause. It applies the function on fields based on the attributes mentioned in the group by clause. For example

    SELECT employees, sum(transactions) AS revenue
    FROM profits GROUP BY employees

The query above will get the sum of transactions for each employee, which in turn tells you the total amount of revenue brought in by every single employee.

count(): this function aggregates the count of an attribute.
sum(): gets the sum of attributes
avg(): gets the average of given attributes
max(): gets the max of given attributes
min(): gets the min of given attributes.

Normalizations

This property defines constraints for the database to follow in order to make it more robust, fault tolerant and accurate. There are a lot of normal forms but we'll only go through the ones that are practically used.

First Normal Form: Requires the tables to have primary keys, so each record can be identified uniquely and every column should have a unique as well.

Second Normal Form: First normal form eliminates repeated data, this normal form deals with redundancies, there's a fine distinction between the two.

ID	Course	Fee
1	C1	199
2	C2	299
3	C1	199

In the table above, ID is the primary key, but the attribute Fee is directly dependent on the course, not the primary key of the table, you can see "199" appearing twice, this is the redundancy that the second normal form aims to eliminate.

ID	Course
1	C1
2	C2
3	C1

Course	Fee
C1	199
C2	299

This would be the state of the table after applying the second normal form.

Third Normal Form: In the example above, The Fee attribute had a transitive dependency on the course attribute, our example eliminates that as well.

Indexes

Indexes are special lookup tables that databases use to speed up data retrieval. They're automatically created for primary and unique key constraints.

In the background, the working of indexing is quite complex so we won't get into too much detail here, Let's scratch the surface:

Indexes are essentially a data structure designed for faster data retrieval, it involves a key and a reference to the data. There are multiple types of indexing methods, we'll discuss the primary indexes here;

Primary Index: The indexes created on the basis of primary key constraints, these indexes are automatically created and require no explicit invocation.

Locking mechanisms

Locking mechanisms are crucial for concurrency control.

Concurrency control: Essentially means that concurrent transactions must not interfere with each other. It can be hard to maintain consistency and integrity if you don't have mechanisms in place to protect transactions.

Now let's talk about different locking protocols:

Simplistic Locks: This mechanism locks the data that is being operated on, the lock is released when the transaction is complete.

Pre-Claiming Locks: This mechanism lists out all the data that needs to be operated on, and requests the database for locks on that data, If the locks are granted, the transaction will proceed, otherwise it will wait until the locks are granted.

Two-Phase Locks: This mechanism divides the process into two phases of Aquiring and releasing locks.

Growing phase: In the growing phase, the transaction only acquires locks on data items but cannot release locks.
Shrinking phase: In this phase, the transaction may release locks, but no new locks can be acquired.

Database Isolation Levels

Isolation levels are essentially levels of strictness by which isolation is enforced by the system.

Lets look at some levels in the order of their strictness:

Read Uncommited: This is the lowest level, Concurrent transactions can read uncommited changes by other ongoing transactions.

Read Commited: In this isolation level, the transactions can only read the commited changes by other transactions.

Repeatable Read: This level is mostly isolated but phantom reads are still possible, i.e. if the transaction calls the read operation twice and a commited transaction happens in between those calls, The output of the two queries can differ.

Serilizable: This is completely isolated, the transactions read like they were serialized.

Triggers

A trigger is a stored procedure that is invoked when a special event occurs, this "special event" can be defined by the user. Let's look at an example:

create trigger stud_marks 
before INSERT 
on 
Student 
for each row 
set Student.total = Student.subj1 + Student.subj2 + Student.subj3, Student.per = Student.total * 60 / 100;

Above SQL statement will create a trigger in the student database so that, whenever marks are entered before inserting this data into the database, trigger will compute those two values and insert with the entered values.

DEV Community