Database Indexing Internals Explained

#computerscience #database #distributedsystems #microservices

There are two basic kinds of indices:
• Ordered indices. Based on a sorted ordering of the values.
• Hash indices. Based on a uniform distribution of values across a range of buckets.
The bucket to which a value is assigned is determined by a function, called a hash
function.

A file may have several indices, on different search keys. If the file containing the records is sequentially ordered, a clustering index is an index whose search key
also defines the sequential order of the file.

Indices whose search key
specifies an order different from the sequential order of the file are called nonclustering
indices, or secondary indices

An index entry, or index record, consists of a search-key value and pointers to one or
more records with that value as their search-key value. The pointer to a record consists
of the identifier of a disk block and an offset within the disk block to identify the record
within the block.
There are two types of ordered indices that we can use:

Dense index: In a dense index, an index entry appears for every search-key value
in the file. In a dense clustering index, the index record contains the search-key
value and a pointer to the first data record with that search-key value. The rest of
the records with the same search-key value would be stored sequentially after the
first record, since, because the index is a clustering one, records are sorted on the
same search key.
In a dense nonclustering index, the index must store a list of pointers to all
records with the same search-key value.

Sparse index: In a sparse index, an index entry appears for only some of the search-
key values. Sparse indices can be used only if the relation is stored in sorted order
of the search key; that is, if the index is a clustering index. As is true in dense
indices, each index entry contains a search-key value and a pointer to the first data
record with that search-key value. To locate a record, we find the index entry with
the largest search-key value that is less than or equal to the search-key value for
which we are looking. We start at the record pointed to by that index entry and
follow the pointers in the file until we find the desired record.

Multilevel Indices
Dense Index Overview:

A dense index has one index entry for each record in the relation.
Example: A relation with 1,000,000 tuples and 100 index entries per 4 KB block results in 10,000 index blocks.
Issues with Large Indices:

Large relations require large indices. For 100,000,000 tuples, the index occupies 1,000,000 blocks (4 GB).
Large indices are often stored as sequential files on disk due to their size.
Index search can become costly due to multiple random I/O operations.
Binary Search on Index:

Binary search requires up to ⌈log2(b)⌉ block reads for an index occupying b blocks.
Example: For a 10,000-block index, binary search requires 14 random reads (taking ~140 ms on a magnetic disk).
Each block read is random, increasing search time.
Overflow blocks make binary search more complex and potentially less efficient.
Sequential Search:

Sequential search reads all b blocks, which may take longer than binary search.
However, sequential reads can sometimes benefit from lower access costs compared to random reads.
Multilevel Indexing:

Solution to reduce search time for large indices.
Treat the large index (inner index) as a sequential file and construct a sparse outer index.
The outer index points to blocks of the inner index, which contains pointers to actual data blocks.
Example of Multilevel Index:

Inner index with 10,000 blocks → Outer index with 100 blocks.
Outer index (if in main memory) reduces the search to a single inner index block read instead of 14 with binary search.
14x improvement in search efficiency.
Multilevel Index with Extremely Large Files:

For a relation with 100,000,000 tuples:
Inner index: 1,000,000 blocks.
Outer index: 10,000 blocks (40 MB).
If the outer index is too large for main memory, another level of indexing is added, creating a multilevel index.
Advantages of Multilevel Index:

Significantly fewer I/O operations compared to binary search or sequential search.
Supports efficient searching for very large relations.
Relation to Tree Structures:

Multilevel indices resemble tree structures (e.g., binary trees).
Higher levels of indices act like parent nodes, and lower levels act like child nodes.

Key Takeaways
Multilevel indices optimize the search process for large datasets by reducing the number of I/O operations.
Sparse indexing at higher levels minimizes memory usage while maintaining search efficiency.
They are an efficient alternative to binary or sequential searches for extremely large relations.

This is Part 1 of the upcoming series on Database Indexing.

These are just my notes from the one of my favourite books: Database System Concepts

You can find me on
X

DEV Community

Database Indexing Internals Explained

Top comments (0)

Read next

How I Went From Being a DBA to DevOps Engineer

DynamoDB-style Limits for Predictable SQL Performance?

Everyone Uses Postgres… But Why?

Doctrine’s Collection filter method - a double-edged sword