Chap. 6 Hash-Based Indexing: Abel J.P. Gomes
6 Hash-Based Indexing
Not chaos-like, together crushed and bruised, But, as the world, harmoniously confused: Where order in variety we see. -- Alexander Pope
Bibliography: 1. R. Ramakrishnan and J. Gehrke. Database Management Systems. Addison-Wesley, 2003 (Chapter 11).
1. Objectives
What is the intuition behind hash-structured indexes? Why are they especially good for equality searches but useless for range selections?
What is Extendible Hashing? How does it handle search, insert, and delete?
What is Linear Hashing? How does it handle search, insert, and delete?
What are the similarities and differences between Extendible and Linear Hashing?
2. Introduction
The basic idea is to use a hashing function, which maps a search-key value (of a field) into a record or bucket of records. As for any index, 3 alternatives for data entries k*:
1. Actual data record (with key value k)
2. <k, rid of matching data record>
3. <k, list of rids of matching data records>
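As a rough illustration (the type names below are invented for this sketch, not taken from the text), the three alternatives could be modelled as:

```python
from dataclasses import dataclass
from typing import List, Tuple

Rid = Tuple[int, int]          # a record id: (page id, slot number)

@dataclass
class DataRecord:              # Alternative 1: the data entry is the record itself
    key: int
    payload: str

@dataclass
class KeyRid:                  # Alternative 2: <k, rid of matching data record>
    key: int
    rid: Rid

@dataclass
class KeyRidList:              # Alternative 3: <k, list of rids of matching data records>
    key: int
    rids: List[Rid]

entry = KeyRid(key=42, rid=(7, 3))   # e.g. key 42 lives on page 7, slot 3
```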
Hash-based indexes are best for equality selections. They cannot support range searches. Static and dynamic hashing techniques exist; the trade-offs are similar to ISAM vs. B+ trees.
3. Static Hashing
A bucket is a unit of storage containing one or more records (a bucket is typically a disk block). In a hash file organization we obtain the bucket of a record directly from its search-key value using a hash function. A hash function h is a function from the set of all search-key values K to the set of all bucket addresses B. The hash function is used to locate records for access, insertion, and deletion. Records with different search-key values may be mapped to the same bucket; thus the entire bucket has to be searched sequentially to locate a record.
The # of primary pages is fixed; they are allocated sequentially and never de-allocated; overflow pages are used if needed. h(key) mod B = the bucket to which the data entry with search key key belongs (B = # of buckets).
[Figure: h(key) maps a search key to one of the primary bucket pages, numbered 0 to B-1; overflow pages are chained off a primary page when it fills up.]
Buckets contain data entries. The hash function works on the search key field of record r and must distribute values over the range 0, ..., B-1. h(key) = (a*key + b) usually works well; a and b are constants, and a lot is known about how to tune h.
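A minimal sketch of this bucket computation in Python (the constants a, b and the number of buckets below are illustrative choices, not values from the text):

```python
# Static hashing: map a search-key value to a primary bucket number in 0..B-1.
A_CONST, B_CONST = 31, 17      # the constants a and b (assumed values; tuning them is a separate topic)
NUM_BUCKETS = 4                # B, the number of primary bucket pages

def bucket_of(key: int) -> int:
    """h(key) = (a*key + b), taken mod B so the result lands in 0..B-1."""
    return (A_CONST * key + B_CONST) % NUM_BUCKETS

print([bucket_of(k) for k in (10, 11, 12, 13)])   # [3, 2, 1, 0] -- keys spread across the buckets
```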
Long overflow chains can develop and degrade performance. Extendible and Linear Hashing: dynamic techniques to fix this problem.
Let us assume we have 2 data entries per bucket and 4 buckets, numbered 0 to 3. Insert a, b, c, d, e with h(a) = 1, h(b) = 2, h(c) = 1, h(d) = 0, h(e) = 1. Bucket 0 receives d and bucket 2 receives b; bucket 1 fills up with a and c, so inserting the data entry e forces the creation of an overflow page chained to bucket 1.
Now suppose the data entries f and g, both hashing to bucket 1, are inserted after e, so bucket 1's primary page holds a and c and its overflow pages hold e, f and g. Deleting e, f and c then frees enough space to eliminate the overflow pages: g is moved up into the primary page of bucket 1. A code sketch of this insert/delete behaviour follows below.
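The following Python sketch mirrors both examples: 4 buckets with 2 data entries per page, overflow pages chained as needed, and a delete that moves entries up and drops overflow pages that are no longer needed. The class and method names are invented for illustration, and the hash values are simply the ones assumed above.

```python
CAPACITY = 2  # data entries per page, as in the example

class StaticHashFile:
    """Illustrative static hash file: fixed primary buckets plus chained overflow pages."""

    def __init__(self, num_buckets: int):
        # each bucket is a list of pages; page 0 is the primary page
        self.buckets = [[[]] for _ in range(num_buckets)]

    def insert(self, key, h):
        pages = self.buckets[h]
        if len(pages[-1]) == CAPACITY:          # last page full: allocate an overflow page
            pages.append([])
        pages[-1].append(key)

    def delete(self, key, h):
        for page in self.buckets[h]:
            if key in page:
                page.remove(key)
                break
        # compact: move entries up and drop overflow pages that became unnecessary
        entries = [k for page in self.buckets[h] for k in page]
        self.buckets[h] = [entries[i:i + CAPACITY]
                           for i in range(0, len(entries), CAPACITY)] or [[]]

hf = StaticHashFile(4)
for key, h in [("a", 1), ("b", 2), ("c", 1), ("d", 0),
               ("e", 1), ("f", 1), ("g", 1)]:
    hf.insert(key, h)         # bucket 1: primary [a, c], overflow [e, f], overflow [g]
for key in ("e", "f", "c"):
    hf.delete(key, 1)         # afterwards bucket 1 is just its primary page: [a, g]
print(hf.buckets[1])          # [['a', 'g']]
```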
5. Rule of Thumb
Try to keep space utilization between 50% and 80%!
Utilization = (# keys used) / (total # keys that fit).
If < 50%, we are wasting space; if > 80%, overflows become significant (how significant depends on how good the hash function is and on the # of keys per bucket).
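In code, the rule of thumb is just a ratio check (the numbers below are made up for illustration):

```python
def utilization(keys_used: int, num_buckets: int, keys_per_bucket: int) -> float:
    """Fraction of the available data-entry slots that are actually occupied."""
    return keys_used / (num_buckets * keys_per_bucket)

u = utilization(keys_used=60, num_buckets=50, keys_per_bucket=2)   # 0.6
assert 0.5 <= u <= 0.8        # inside the recommended 50%-80% band
```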
Two dynamic hashing techniques follow: Extendible Hashing and Linear Hashing.
7. Extendible Hashing
Situation: a bucket (primary page) becomes full. Why not re-organize the file by doubling the # of buckets? Because reading and writing all the pages would be expensive! Idea: use a directory of pointers to buckets, and double the # of buckets by doubling the directory, splitting only the bucket that overflowed. The directory is much smaller than the file, so doubling it is much cheaper. Only one page of data entries is split, and no overflow page is needed. The trick lies in how the hash function is adjusted!
The directory is an array of size 4. To find the bucket for k, take the last `global depth` bits of h(k). For example: if h(k) = 5 = binary 101, it is in the bucket pointed to by directory entry 01.
[Figure: GLOBAL DEPTH = 2. The DIRECTORY has 4 slots, 00, 01, 10, 11, pointing respectively to Bucket A (4*, 12*, 32*, 16*), Bucket B (1*, 5*, 21*, 13*), Bucket C (10*) and Bucket D (15*, 7*, 19*), each with LOCAL DEPTH = 2.]
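Taking the "last global depth bits" of h(k) is just a bit mask. A tiny sketch, using the directory from the figure above (bucket names are labels only):

```python
def dir_slot(hashed_key: int, global_depth: int) -> int:
    """Directory index = the last `global_depth` bits of h(k)."""
    return hashed_key & ((1 << global_depth) - 1)

directory = ["Bucket A", "Bucket B", "Bucket C", "Bucket D"]   # slots 00, 01, 10, 11
assert dir_slot(5, global_depth=2) == 0b01     # h(k) = 5 = binary 101 -> last two bits are 01
print(directory[dir_slot(5, 2)])               # Bucket B
```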
Insert: if the bucket is full, split it (allocate a new page and re-distribute the entries). If necessary, double the directory. (As we will see, splitting a bucket does not always require doubling; we can tell by comparing the global depth with the local depth of the split bucket.) For example: inserting h(k) = 20 (binary 10100) causes doubling.
[Figure: after inserting 20* (binary 10100), GLOBAL DEPTH = 3 and the DIRECTORY has 8 slots, 000 to 111. Bucket A has split: Bucket A (LOCAL DEPTH 3) keeps 32*, 16*, and its split image Bucket A2 (LOCAL DEPTH 3) holds 4*, 12*, 20*. Buckets B (1*, 5*, 21*, 13*), C (10*) and D (15*, 7*, 19*) keep LOCAL DEPTH 2, so two directory slots point to each of them.]
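To make the split and doubling mechanics concrete, here is a small Python sketch, not the textbook's code: the class and variable names are invented, buckets hold 4 data entries, and data entries are represented directly by their hashed value (as in the figures, where 32*, 16*, ... denote data entries). Loading the "before" figure and then inserting 20* reproduces the example above: bucket A splits into A (32*, 16*) and its split image A2 (4*, 12*, 20*), and the directory doubles to 8 slots.

```python
class Bucket:
    def __init__(self, local_depth: int, capacity: int = 4):
        self.local_depth = local_depth
        self.capacity = capacity
        self.entries = []

class ExtendibleHashFile:
    def __init__(self, global_depth: int = 2, capacity: int = 4):
        self.global_depth = global_depth
        self.capacity = capacity
        # start with one distinct bucket per directory slot
        self.directory = [Bucket(global_depth, capacity) for _ in range(2 ** global_depth)]

    def _slot(self, h: int) -> int:
        return h & ((1 << self.global_depth) - 1)   # last `global depth` bits of h

    def insert(self, h: int) -> None:
        bucket = self.directory[self._slot(h)]
        if len(bucket.entries) < bucket.capacity:
            bucket.entries.append(h)
            return
        # Bucket full: split it; double the directory only if local depth == global depth.
        if bucket.local_depth == self.global_depth:
            self.directory += self.directory[:]     # doubling = copying the pointers
            self.global_depth += 1
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth, bucket.capacity)   # the "split image"
        new_bit = 1 << (bucket.local_depth - 1)
        # Directory slots that referred to the old bucket and have the new bit set
        # are re-pointed to the split image.
        for i, b in enumerate(self.directory):
            if b is bucket and (i & new_bit):
                self.directory[i] = image
        # Redistribute the old bucket's entries according to the new bit.
        old_entries, bucket.entries = bucket.entries, []
        for e in old_entries:
            (image if (e & new_bit) else bucket).entries.append(e)
        self.insert(h)    # retry the pending insert (may trigger a further split)

ehf = ExtendibleHashFile(global_depth=2, capacity=4)
for h in (4, 12, 32, 16, 1, 5, 21, 13, 10, 15, 7, 19):   # the "before" figure
    ehf.insert(h)
ehf.insert(20)                      # 20 = 10100: bucket A splits, directory doubles
print(ehf.global_depth)             # 3
print(sorted(ehf.directory[0].entries), sorted(ehf.directory[4].entries))  # [16, 32] [4, 12, 20]
```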
Summary
Hash-based indexes: best for equality searches, cannot support range searches. Static Hashing can lead to long overflow chains. Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (Duplicates may require overflow pages.)
A directory keeps track of the buckets and doubles periodically. It can get large with skewed data, and there is additional I/O if it does not fit in main memory.
END OF CHAPTER