Unit 3 Storage Strategies Indices B-Trees Hashing
Unit 3 Storage Strategies Indices B-Trees Hashing
Indexing in DBMS:
Index structure:
The first column of the database is the search key that contains a copy of
the primary key or candidate key of the table. The values of the primary
key are stored in sorted order so that the corresponding data can be
accessed easily.
The second column of the database is the data reference. It contains a set
of pointers holding the address of the disk block where the value of the
particular key can be found.
Ordered indices:
The indices are usually sorted to make searching faster. The indices which are sorted are known as
ordered indices.
Example: Suppose we have an employee table with thousands of record and each of which is 10
bytes long. If their IDs start with 1, 2, 3....and so on and we have to search student with ID-543.
o In the case of a database with no index, we have to search the disk block from starting till it
reaches 543. The DBMS will read the record after reading 543*10=5430 bytes.
o In the case of an index, we will search using indexes and the DBMS will read the record after
reading 542*2= 1084 bytes which are very less compared to the previous case.
Primary Index
o If the index is created on the basis of the primary key of the table, then it is known as primary
indexing. These primary keys are unique to each record and contain 1:1 relation between the
records.
o As primary keys are stored in sorted order, the performance of the searching operation is
quite efficient.
o The primary index can be classified into two types: Dense index and Sparse index.
Dense index
o The dense index contains an index record for every search key value in the data file. It makes
searching faster.
o In this, the number of records in the index table is same as the number of records in the main
table.
o It needs more space to store index record itself. The index records have the search key and a
pointer to the actual record on the disk.
Sparse index
o In the data file, index record appears only for a few items. Each item points to a block.
o In this, instead of pointing to each record in the main table, the index points to the records
in the main table in a gap.
Clustering Index
o A clustered index can be defined as an ordered data file. Sometimes the index is created on
non-primary key columns which may not be unique for each record.
o In this case, to identify the record faster, we will group two or more columns to get the unique
value and create index out of them. This method is called a clustering index.
o The records which have similar characteristics are grouped, and indexes are created for these
group.
Example: suppose a company contains several employees in each department. Suppose we use a clustering
index, where all employees which belong to the same Dept_ID are considered within a single cluster, and
index pointers point to the cluster as a whole. Here Dept_Id is a non-unique key.
The previous schema is little confusing because one disk block is shared by records which belong to the
different cluster. If we use separate disk block for separate clusters, then it is called better technique.
Secondary Index
In the sparse indexing, as the size of the table grows, the size of mapping also grows. These
mappings are usually kept in the primary memory so that address fetch should be faster. Then the
secondary memory searches the actual data based on the address got from mapping. If the mapping
size grows, then fetching the address itself becomes slower. In this case, the sparse index will not be
efficient. To overcome this problem, secondary indexing is introduced.
In secondary indexing, to reduce the size of mapping, another level of indexing is introduced. In this
method, the huge range for the columns is selected initially so that the mapping size of the first level
becomes small. Then each range is further divided into smaller ranges. The mapping of the first level
is stored in the primary memory, so that address fetch is faster. The mapping of the second level
and actual data are stored in the secondary memory (hard disk).
For example:
o If you want to find the record of roll 111 in the diagram, then it will search the highest entry
which is smaller than or equal to 111 in the first level index. It will get 100 at this level.
o Then in the second index level, again it does max (111) <= 111 and gets 110. Now using the
address 110, it goes to the data block and starts searching each record till it gets 111.
o This is how a search is performed in this method. Inserting, updating or deleting is also done
in the same manner.
B+ Tree
o The B+ tree is a balanced binary search tree. It follows a multi-level index format.
o In the B+ tree, leaf nodes denote actual data pointers. B+ tree ensures that all leaf nodes
remain at the same height.
o In the B+ tree, the leaf nodes are linked using a link list. Therefore, a B+ tree can support
random access as well as sequential access
Structure of B+ Tree
o In the B+ tree, every leaf node is at equal distance from the root node. The B+ tree is of the
order n where n is fixed for every B+ tree.
o It contains an internal node and leaf node.
Internal node
o An internal node of the B+ tree can contain at least n/2 record pointers except the root node.
o At most, an internal node of the tree contains n pointers.
Leaf node
o The leaf node of the B+ tree can contain at least n/2 record pointers and n/2 key values.
o At most, a leaf node contains n record pointer and n key values.
o Every leaf node of the B+ tree contains one block pointer P to point to next leaf node.
So, in the intermediary node, we will find a branch between 50 and 75 nodes. Then at the end, we
will be redirected to the third leaf node. Here DBMS will perform a sequential search to find 55.
B+ Tree Insertion
Suppose we want to insert a record 60 in the below structure. It will go to the 3rd leaf node after 55.
It is a balanced tree, and a leaf node of this tree is already full, so we cannot insert 60 there.
In this case, we have to split the leaf node, so that it can be inserted into tree without affecting the
fill factor, balance and order
The 3rd leaf node has the values (50, 55, 60, 65, 70) and its current root node is 50. We will split the
leaf node of the tree in the middle so that its balance is not altered. So we can group (50, 55) and
(60, 65, 70) into 2 leaf nodes.
If these two has to be leaf nodes, the intermediate node cannot branch from 50. It should have 60
added to it, and then we can have pointers to a new leaf node.
This is how we can insert an entry when there is overflow. In a normal scenario, it is very easy to find the node
where it fits and then place it in that leaf node.
B+ Tree Deletion
Suppose we want to delete 60 from the above example. In this case, we have to remove 60 from the
intermediate node as well as from the 4th leaf node too. If we remove it from the intermediate node,
then the tree will not satisfy the rule of the B+ tree. So we need to modify it to have a balanced tree.
After deleting node 60 from above B+ tree and re-arranging the nodes, it will show as follows:
Hashing in DBMS
In a huge database structure, it is very inefficient to search all the index values and reach the desired
data. Hashing technique is used to calculate the direct location of a data record on the disk without
using index structure.
In this technique, data is stored at the data blocks whose address is generated by using the hashing
function. The memory location where these records are stored is known as data bucket or data
blocks.
In this, a hash function can choose any of the column value to generate the address. Most of the
time, the hash function uses the primary key to generate the address of the data block. A hash
function is a simple mathematical function to any complex mathematical function. We can even
consider the primary key itself as the address of the data block. That means each row whose address
will be the same as a primary key stored in the data block.
The above diagram shows data block addresses same as primary key value. This hash function can also be a
simple mathematical function like exponential, mod, cos, sin, etc. Suppose we have mod (5) hash function
to determine the address of the data block. In this case, it applies mod (5) hash function on the primary
keys and generates 3, 3, 1, 4 and 2 respectively, and records are stored in those data block addresses.
Types of Hashing:
Static Hashing
In static hashing, the resultant data bucket address will always be the same. That means if we
generate an address for EMP_ID =103 using the hash function mod (5) then it will always result in
same bucket address 3. Here, there will be no change in the bucket address.
Hence in this static hashing, the number of data buckets in memory remains constant throughout.
In this example, we will have five data buckets in the memory used to store the data.
When a record needs to be searched, then the same hash function retrieves the address of the
bucket where the data is stored.
o Insert a Record
When a new record is inserted into the table, then we will generate an address for a new record
based on the hash key and record is stored in that location.
o Delete a Record
To delete a record, we will first fetch the record which is supposed to be deleted. Then we will delete
the records for that address in memory.
o Update a Record
To update a record, we will first search it using a hash function, and then the data record is updated.
If we want to insert some new record into the file but the address of a data bucket generated by the
hash function is not empty, or data already exists in that address. This situation in the static hashing
is known as bucket overflow. This is a critical situation in this method.
To overcome this situation, there are various methods. Some commonly used methods are as
follows:
1. Open Hashing
When a hash function generates an address at which data is already stored, then the next bucket
will be allocated to it. This mechanism is called as Linear Probing.
For example: suppose R3 is a new address which needs to be inserted, the hash function generates
address as 112 for R3. But the generated address is already full. So the system searches next available
data bucket, 113 and assigns R3 to it.
2. Close Hashing
When buckets are full, then a new data bucket is allocated for the same hash result and is linked
after the previous one. This mechanism is known as Overflow chaining.
For example: Suppose R3 is a new address which needs to be inserted into the table, the hash
function generates address as 110 for it. But this bucket is full to store the new data. In this case, a
new bucket is inserted at the end of 110 buckets and is linked to it.
Dynamic Hashing
o The dynamic hashing method is used to overcome the problems of static hashing like bucket
overflow.
o In this method, data buckets grow or shrink as the records increases or decreases. This
method is also known as Extendable hashing method.
o This method makes hashing dynamic, i.e., it allows insertion or deletion without resulting in
poor performance.
For example:
Consider the following grouping of keys into buckets, depending on the prefix of their hash address:
The last two bits of 2 and 4 are 00. So it will go into bucket B0. The last two bits of 5 and 6 are 01, so it will
go into bucket B1. The last two bits of 1 and 3 are 10, so it will go into bucket B2. The last two bits of 7 are
11, so it will go into B3.