In this chapter we will be going through how database indexes work and how they are implemented.
What is an Index?
- An index is a data structure to speed up retrieval of data records based on some search key.
- A search key is a sequence of
k
data attributes.- It can be the primary key of the database
- It can also be the composite key of the database
- Indexes are also stored at files on the computer.
- A unique index is an index where the search key is unique.
- Otherwise it is a non-unique index.
- Indexes are stored as file
- Records in an index file are referred as data entries
Types of Indexes
In databases, there are 2 main types of database indexes:
- Tree-based index
- These are based on the sorting of search trees
- IE: Binary Search Trees, B+ Trees, etc
- Hash-based index
- Data entries are accessed using hash functions
- IE: Static hashing, extendible hashing, linear hashing, etc
How to choose an index
Considerations | Description |
---|---|
Search Performance | How fast can we find a record in the index for a given query (Can be ranged or equality search |
Storage overhead | How much space does the index take up in the disk |
Update performance | How much does it cost to update a record in the index |
Format of data entries
In a database index, there are various ways to store data entries.
Format | Description | Example |
---|---|---|
Format 1 | The actual record with the search key is stored at the leaf pages | (k, Record data) |
Format 2 | The records are stored in tuples of (k, RID) at the leaf pages | (k, RID) |
Format 3 | The records are stored as tuples of (k, List of RIDs). All entries with the key k will be stored together | (k, List of RIDs) |
Note
- For format 1, finding the relevant leaf pages is equivalent to finding the actual records.
- In formats 2 & 3, there is 1 additional lookup to find the actual page containing the record if it is required.
B+ Tree Index
Some terminologies for a B+ Tree
- Leaf nodes: Store sorted data entries
k*
denote a data entry in the form of (k, RID)k
= search key value corresponding data recordRID
=RID
of corresponding data record
- Leaf nodes are doubly linked
- Internal nodes: Store sorted pointers to child nodes
- In the form of (
p1
,k1
,p2
,k2
,……,pn
,kn
)- Where
k1
<k2
<k3
, etc pi
= disk page address (Root node of subtree)
- Where
- In the form of (
- For each data entry in the B+ tree, the value of
k
must be between the 2 keys in the parent node adjacent to it. - Each
(ki, pi)
is an index entry,ki
serves as a separator between 2 pointers between the contents pointed bypi-1
andpi
It has the following properties
- Dynamic structure, it adapts to data updates gracefully.
- Height-balanced index structure
- Order of index tree,
d
is a subset of positive integersd
controls the space utilization of index nodes- Each non-root node contains
m
entries whered <= m <= 2d
- The root node contains
m
entries where1 <= m <= 2d
If you find the above definition confusing, here is a diagram of a B+ tree:
In the diagram you can see:
d
value is 1- So each internal node can have a maximum of 2 entries
- The root can have less but currently does not.
- Each internal node (root + 1 layer below root) consists of 2 numbers (
k
values) and 3 pointers (p
values). - Each data entry in the leaf node lies between the 2 keys in the parent node adjacent to it.
- IE:
0007
is between the node0007
and0008
.
- IE:
- For nodes at the edge, the missing k value is assumed to be -infinity or +infinity or inferred from further parents.
- IE:
0006
is between -infinity and0007
,0017
is between0017
and +infinity. - IE:
0008
is between0008
and0009
and0011
is between0011
and0012
- IE:
To visualize it yourself, visit B+ Tree Visualization
B+ Tree Index Search
There are 2 types of searches which are supported in a B+ Tree index.
- Equality search
- Ranged search
Equality queries
To search for a record in a B+ tree, we need to do the following:
- Start at the root node
- Find the largest key value that is smaller than the search key,
ki
- If the search key exists
- search the subtree at pointer
pi
- otherwise search the subtree at
p0
- search the subtree at pointer
- If the search key exists
- We can repeat this recursively to find the value we want.
Ranged queries
To search for a range of records in a B+ tree, we need to do the following:
- Start at the root node
- Find the largest key value that is smaller than the search key,
ki
- If the search key exists
- search the subtree at pointer
pi
- otherwise search the subtree at
p0
- search the subtree at pointer
- If the search key exists
- When we arrive at the leaf node
- We collect all values and follow the pages until the first record that is larger than the search key is found
Insertion into B+ Tree
If we want to insert a value (k, RID)
into the B+ Tree, we need to do the following:
- Find the leaf node (
l
) where the value should be inserted. - If the leaf node is not full (IE:
m < 2d
)- Insert the value into the leaf node
- Done
- Otherwise
- Allocate a new leaf node (
n*
), - Put the last
d+1
entries ofl
inton*
- Update the sibling pointers in the parent node (Parent of root is null)
- Update the pointers of
l
andn*
to point to each other
- Allocate a new leaf node (
To see a detailed visualization, visit B+ Tree Visualization
There are optimizations that can be done to reduce the number of disk accesses.
We can check the page to the left and right of the current page.
If there is space, we can move 1 entry over and update the pointer between the 2 pages in the parent node.
B+ Tree deletion
To delete a value (k, RID)
from the B+ Tree, we need to do the following:
- Find the leaf node (
l
) where the value should be deleted. - Remove the entry
- If the current leaf does not fall below the minimum threshold (IE:
m >= d
)- Done
- Otherwise
- If a sibling has additional nodes over the minimum threshold
- Move 1 node from the sibling to the current node and update the internal node
- done
- Otherwise
- Merge the current node with the sibling
- Update the parent node to remove the pointer to the current node
- If a sibling has additional nodes over the minimum threshold
Similar to insertion, a visualization can be found at B+ Tree Visualization which can help us better understand the process.
Bulk loading a B+ Tree
If we have a large collection of records that we want to make a B+ tree for, we can do it in 2 ways.
Insert records into B+ tree 1 at a time
This is the same as the insertion process we have seen before. The process is time consuming, especially for large datasets.
Bulk loading
We can also bulk load the B+ tree.
This are the steps to bulk load a tree:
- Sort the data entries by search keys
- Load the leaf pages of B+ Tree with sorted entries
- Initialize the B+ tree with an empty root page
- For each leaf page
- Insert its index entry into the rightmost parent of leaf level page of B+ Tree
This is a video explanation for bulk loading in a B+ Tree which can help us better understand the process.