[go: up one dir, main page]

0% found this document useful (0 votes)
55 views15 pages

Chap. 6 Hash-Based Indexing: Abel J.P. Gomes

Extendible hashing avoids overflow pages by splitting buckets when they become full and inserting new data entries. It uses a directory of pointers to buckets, doubling the directory size when needed to split an overflowed bucket. This allows splitting to be done with a single data page rewrite instead of reorganizing the whole file. The hash function is adjusted when splitting to map keys to the new bucket structure. Linear hashing is similar but handles deletions differently by allowing buckets to merge. Both techniques are dynamic hashing methods that adapt the hash table as the dataset changes over time.
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
55 views15 pages

Chap. 6 Hash-Based Indexing: Abel J.P. Gomes

Extendible hashing avoids overflow pages by splitting buckets when they become full and inserting new data entries. It uses a directory of pointers to buckets, doubling the directory size when needed to split an overflowed bucket. This allows splitting to be done with a single data page rewrite instead of reorganizing the whole file. The hash function is adjusted when splitting to map keys to the new bucket structure. Linear hashing is similar but handles deletions differently by allowing buckets to merge. Both techniques are dynamic hashing methods that adapt the hash table as the dataset changes over time.
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

Chap.

6 Hash-Based Indexing
Not cahos-like, together crushed and bruised, But, as the world harmoniously confused: Where order in variety we see. -- Alexander Pope

Abel J.P. Gomes

Bibliography: 1. R. Ramakrishnan and J. Gehrke. Database Management Systems. Addison-Wesley, 2003 (cap.11).

1. Objectives

What is the intuition behind hash-structured indexes? Why are they especially good for equality searches but useless for range selections? What is Extendible Hashing? How does it handle search, insert, and delete? What is Linear Hashing? How does it handle search, insert, and delete? What are the similarities and differences between Extendible and Linear Hashing?

2. Introduction

The basic idea is to use a hashing function, which maps a search-key value (of a field) into a record or bucket of records. As for any index, 3 alternatives for data entries k*:
Actual data record (with key value k) <k, rid of matching data record> <k, list of rids of matching data records>

Hash-based indexes are best for equality selections. They cannot support range searches. Static and dynamic hashing techniques exist; trade-offs
similar to ISAM vs. B+ trees.
3

3. Static Hashing

A bucket is a unit of storage containing one or more records (a bucket is typically a disk block). In a hash file organization we obtain the bucket of a record directly from its search-key value using a hash function. Hash function h is a function from the set of all search-key values K to the set of all bucket addresses B. Hash function is used to locate records for access, insertion as well as deletion. Records with different search-key values may be mapped to the same bucket; thus entire bucket has to be searched sequentially to locate a record.
4

3. Static Hashing (cont.1)

# primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed. h(key) mod B = bucket to which data entry with key key belongs. (B = #of buckets)
0 2 key ... ... ...

N-1
primary bucket pages

...
overflow pages
5

3. Static Hashing (cont.2)


Buckets contain data entries. Hash function works on search key field of record r. Must distribute values over range 0,...,B-1.

h(key)=(a*key+b) usually works well. a and b are constants; lots known about how to tune h.

Long overflow chains can develop and degrade performance. Extendible and Linear Hashing: dynamic techniques to fix this problem.

4. Example to illustrate inserts and overflows


Let us assume we have 2 records/bucket and 4 buckets numbered from 0 to 3.. Inserting the data entry e leads to creation of an overflow bucket page, as illustrated below: INSERT: h(a) = 1 h(b) = 2 h(c) = 1 h(d) = 0 h(e) = 1 0 1 2 3

d a c b e

4. Example to illustrate deletes

Deleting the data entry c after e and f leads to elimination of overflow bucket containing the data entry d; d is then moved into the primary bucket 1. 0 1 2 3

DELETE: e f c

a b c d e f g
maybe move g up
8

5. Rule of Thumb

Try to keep space utilization between 50% and 80%! Utilization = # keys used total # keys that fit If <50%, wasting space If > 80%, overflows significant depends on how good hash function is & on # keys/bucket

6 How do we cope with growth?

Overflows and reorganizations Dynamic hashing:


Extensible Linear

10

7 Extendible Hashing

Situation: bucket (primary page) becomes full. Why not re-organize file by doubling # of buckets? Reading and writing all pages is expensive! Idea: use directory of pointers to buckets, double # of buckets by doubling the directory, splitting just the bucket that overflowed! Directory much smaller than file, so doubling it is much cheaper. Only one page of data entries is split. No overflow page! Trick lies in how hash function is adjusted!

11

7.1 Extendible Hashing: Example

Directory is array of size 4. To find bucket for k, take last global depth # of bits of h(k). For example: If h(k)=5=binary 101, it is in bucket pointed to by 01.

LOCAL DEPTH GLOBAL DEPTH

2 4* 12* 32* 16* Bucket A

2 00 01

2 1* 5* 21* 13* Bucket B

10 11

2 10* Bucket C

DIRECTORY

2 15* 7* 19* DATA PAGES


12

Bucket D

7.2 Extendible Hashing: Inserting


Insert: If bucket is full, split it (allocate new page, re-distribute). If necessary, double the directory. (As we will see, splitting a bucket does not always require doubling; we can tell by comparing global depth with local depth for the split bucket.) For example: insert h(k)=20 (causes doubling)
2 Bucket A 32*16* 2 1* 5* 21*13* Bucket B 2 10* 2 000 001 010 011 100 Bucket D 101 110 111 Bucket A2 (`split image' of Bucket A) DIRECTORY LOCAL DEPTH GLOBAL DEPTH 3 2 1* 5* 21*13* Bucket B 2 10* 2 15* 7* 19* 3 4* 12* 20* Bucket C 3 32*16* Bucket A

LOCAL DEPTH GLOBAL DEPTH 2 00 01 10 11 DIRECTORY

Bucket C

15* 7* 19* 2 4* 12* 20*

Bucket D Bucket A2 (`split image' 13 of Bucket A)

Summary

Hash-based indexes: best for equality searches, cannot support range searches. Static Hashing can lead to long overflow chains. Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (Duplicates may require overflow pages.)

Directory to keep track of buckets, doubles periodically. Can get large with skewed data; additional I/O if this does not fit in main memory.

14

END OF CHAPTER

Summary (cont.)

Key compression increases fanout, reduces height. Bulk loading can be much faster than repeated inserts for creating a B+ tree on a large data set. Most widely used index in database management systems because of its versatility. One of the most optimized components of a DBMS.

15

You might also like