Advanced Data Structures

REEMA THAREJA
Assistant Professor
Department of Computer Science
University of Delhi

S. RAMA SREE
Professor
Published in India by
Oxford University Press
Ground Floor, 2/11, Ansari Road, Daryaganj, New Delhi 110002, India

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence, or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above.
ISBN-13: 978-0-19-948717-2
ISBN-10: 0-19-948717-0

Oxford University Press disclaims any responsibility for the material contained therein.
and some are highly specialized for specific tasks. For example, B-trees are particularly well-suited for the implementation of databases. Similarly, hash tables are widely used in compiler implementations to look up identifiers. Advanced data structures such as priority queues, AVL trees, red-black trees, multi-way search trees, and digital search trees provide more capabilities to organize data. For example, tries are used in spell checkers and in auto-completion of commands, both of which are based on prefix searching.
All in all, data structures are used in almost every program. Specific data structures are essential ingredients of many efficient algorithms, and make possible the management of enormous amounts of data, such as large databases and internet indexing services. Hence, a course on data structures is offered to help students master the design and applications of various data structures.
This book is designed to serve as a textbook for an advanced course in data structures. It covers efficient data storage and related applications, and helps readers understand newly established techniques.
s
re
Organization of the Book
The book is organized into 8 chapters and 1 appendix.
Chapter 1 provides a detailed explanation of external sorting using k-way merging, parallel operations using buffer handling, run generation, and optimal merging of runs.
Chapter 2 describes static hashing, hash functions including the secure hash function, concepts of overflow handling, and theoretical evaluation of overflow techniques. The chapter also discusses dynamic hashing with as well as without using directories.
Chapter 3 focuses on binary trees, their traversal schemes, and representation in memory. The chapter also discusses the variants of simple binary trees, namely, expression trees and tournament trees.
Chapter 4 discusses the structure, properties, and applications of binary heaps or priority queues. It also details the underlying concept, structure, properties, implementation, and examples of binomial heaps (or queues). As one of the major applications of priority queues, the selection problem and the event simulation problem are also covered.
Chapter 5 deals with efficient binary search trees. Some very important data structures, such as optimal binary search trees, AVL trees, and red-black trees, are discussed in detail. The chapter brings out the drawbacks of binary search trees and demonstrates the various operations performed on these height-balanced trees.
Chapter 6 introduces the concept of multi-way search trees. It discusses their variants, the B tree and the B+ tree, along with their definitions, properties, operations, and examples in sufficient depth.
Chapter 7 introduces the various types of digital search structures. It discusses digital search trees, tries, Patricia, binary tries, compressed binary tries with skip fields as well as labeled edges, multi-way tries, and compressed multi-way tries, along with a detailed description of operations such as insertion, deletion, and searching. The chapter ends with an introduction to the applications of complex tries in internet packet forwarding, IP routing in 1-bit tries, fixed-stride tries, and variable-stride tries.
Chapter 8 contains a detailed explanation of graphs as a data structure. It discusses the memory representation, traversal schemes, and real-world applications of graphs. It also discusses concepts such as deducing the minimum spanning tree of a given graph using Kruskal's and Prim's algorithms and finding the shortest path using Dijkstra's, Warshall's, and Bellman–Ford's algorithms.
Online Resources
To aid teachers and students, the book is accompanied by online resources that are available at india.oup.com/orcs/9780199487172. The content of the online resources is as follows.
For Faculty
• Chapter-wise PPTs
• Solutions manual for select chapter-end exercises
For Students
• Projects
• Quizzes
• Additional programs
Reema Thareja
S. Rama Sree
Theraja, for his inspiration and guidance in writing this book.
Last but not least, I would like to thank the editorial team at Oxford University Press, India for their help and support.
Reema Thareja
I take this opportunity to acknowledge the people who have been instrumental in the successful completion of this book.
First and foremost, I would like to show my greatest appreciation to my husband, Mr S.N.S.V.S.C. Ramesh, for his unfailing understanding, motivation, extended sacrifice, and unconditional love and support while writing this book. I would also like to appreciate my children, Sri Hasini and Sri Harshini, for their cooperation. I would like to express my gratitude to my parents, Sri P. Samba Siva Rao and Smt. P. Swarupa Rani, and my in-laws, Sri S. Viswanadha Sastry and Smt. S. Lakshmi Reddy, Principal, Aditya Engineering College, for extending their guidance and support by giving valuable feedback.
My thanks and appreciation also go to my colleagues at Aditya Engineering College, Mrs A. Vanathi, Mr M. Raja Babu, Mrs N. N. S. Eswari, and S. Surekha, for helping me a lot in writing this book, and to the people who have willingly helped me out with their abilities.
I am thankful to the editorial team at Oxford University Press, India for their support.
S. Rama Sree
The authors are thankful to the following reviewers for their valuable suggestions and constructive feedback that helped in making the book better.
Dr Rajya Lakshmi
JNTU Kakinada University College of Engineering, Narsaraopeta
Prof. P. Chakradhar
Rama Chandra Engg College, Eluru
KKR and KSR Inst of Technology and Sciences, Guntur
Prof. Ch. Ravi Teja
St. Marys Womens Engg College, Guntur
Prof. N. Eswar Rao
Mittapalli Engineering College, Guntur
Detailed Contents
Preface iv
Acknowledgements vii
1. External Sorting 1
1.1 Introduction 1
1.2 k-way Merge Sort 2
1.2.1 2-way Merge Sort 2
1.2.2 3-way Merge Sort 3
1.2.3 4-way Merge Sort 3
1.2.4 Merging Runs 4
1.2.5 Time Complexity 6
1.3 Buffer Handling for Parallel Operation 7
1.4 Run Generation 15
1.5 Optimal Merging of Runs 17
2. Hashing 27
2.1 Introduction to Static Hashing 27
3. Trees 63
3.1 Introduction 63
3.1.1 Basic Terminology 63
3.2 Types of Trees 64
3.2.1 General Trees 64
3.2.2 Forests 64
3.2.3 Binary Trees 65
3.2.4 Binary Search Trees 69
3.2.5 Expression Trees 69
3.2.6 Tournament Trees 70
3.3 Creating a Binary Tree from a General Tree 70
3.4 Traversing a Binary Tree 72
3.4.1 Pre-order Traversal 72
5.1 Binary Search Trees 107
5.1.1 Operations on Binary Search Trees 109
5.1.1.1 Searching for a Node in a Binary Search Tree 109
5.1.1.2 Inserting a New Node in a Binary Search Tree 110
5.1.1.3 Deleting a Node from a Binary Search Tree 111
5.2 Optimal Binary Search Tree (OBST) 116
5.3 Self-balancing Binary Search Tree 122
5.4 AVL Trees 122
7.1 Introduction to Digital Search Trees 183
7.1.1 Operations on Digital Search Trees 184
7.1.1.1 Insertion 184
7.1.1.2 Searching 186
7.1.1.3 Deletion 188
7.2 Binary Tries and Patricia 191
7.2.1 Binary Tries 191
7.2.2 Compressed Binary Trie 192
8. Graphs 227
8.1 Introduction 227
8.2 Graph Terminology 228
8.3 Directed Graphs 229
8.3.1 Terminology of a Directed Graph 229
8.3.2 Transitive Closure of a Directed Graph 230
8.4 Bi-connected Components 232
1
External Sorting

LEARNING OBJECTIVE
Sorting is one of the major operations used in various applications of computer science. Sorting refers to the arrangement of elements in a certain order. It is classified into two types, namely, internal sorting and external sorting. This chapter discusses the various techniques and algorithms of external sorting.
1.1 INTRODUCTION
The term sorting means arranging the elements of an array so that they are placed in some relevant order, which may be either ascending or descending. There are two types of sorting: internal sorting and external sorting. Internal sorting is concerned with the ordering of elements present in a file stored in the computer's memory. A few of the widely used internal sorting techniques are bubble sort, insertion sort, selection sort, heap sort, quick sort, merge sort, and radix sort. In many applications, sorting must be performed on very large files consisting of trillions of records for their efficient use in the application concerned. Internal sorting techniques are inefficient for sorting such huge files because they require the entire file to be present in the main memory of the computer, which is practically impossible. Hence, external sorting methods are needed to sort large files.
External sorting is applied when the data to be sorted is too voluminous to fit in memory. Because of their large volumes, such files are stored on external storage devices like magnetic tapes, magnetic disks, etc. External sorting techniques need to consider the type of medium on which the files reside. If it is assumed that the files reside on disks, the term block refers to the smallest unit of data that can be read from or written to the disk at a particular instant of time. In general, a block contains several records.
The files present on the disk are read into internal memory in terms of blocks, one block at a time. The block of records stored in internal memory is sorted using any of the existing internal sorting techniques. Each sorted block of records is called a run. Now, the file is viewed as a collection of runs. The runs are written on to the external storage devices as and when required. The most popular method for sorting files on external storage devices is external merge sort, which efficiently gathers and processes the runs stored on external devices.
1.2.1 2-Way Merge Sort
In 2-way merge sort, 2 runs are merged at a time to generate a single run twice as long. The merging process is repeated until a single run is generated.

Technique
For easy understanding, consider that there are 6000 records to be sorted and the internal memory capacity is 500 records. Let Ri j represent the jth run in the ith pass. The runs generated in the first pass are R1 1 to R1 12. In the second pass, R1 1 and R1 2 are merged, resulting in run R2 1, which consists of the sorted list of the first 1000 records. The next two runs, R1 3 and R1 4, are merged, resulting in R2 2. Likewise, the remaining four pairs of runs are merged in this pass, resulting in runs R2 1 to R2 6. Similarly, in the third pass, R2 1 and R2 2 are merged to form R3 1. Likewise, two other runs are generated, resulting in runs R3 1 to R3 3. In the fourth pass, R3 1 and R3 2 are merged to form run R4 1. The last run, R3 3, is carried over as it is to R4 2. In the fifth pass, runs R4 1 and R4 2 are merged to form run R5 1, the final sorted file. The run generation using 2-way merge sort is shown in Fig. 1.1.
Figure 1.1 Run generation using 2-way merge sort: the 12 runs of 500 records each (R1 1 to R1 12, covering records 1-6000) are merged pairwise into 6 runs of 1000 records (R2 1 to R2 6), then into 3 runs of 2000 records (R3 1 to R3 3), then into R4 1 (records 1-4000) and R4 2 (records 4001-6000), and finally into the single sorted run R5 1 (records 1-6000)
Example 1.1 Consider 6000 records available on a disk which are to be sorted. Only 500 records can reside in the internal memory of the computer. The block size of the disk is 100 records. Sort the file using 2-way merge sort.
Solution The steps involved in sorting the file are as follows:
Step 1 Read five blocks of data, i.e., 500 records at a time, from the file residing on the disk into internal memory. Sort these records using any internal sorting technique to generate 12 runs (6000 records / 500 records = 12 runs). These runs are written back on to the disk after sorting.
Step 2 Merge 2 runs at a time to generate a new run twice as long, until all runs in the pass are processed.
Step 3 Repeat Step 2 until a single run is generated.
Step 4 EXIT
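For illustration, the steps of Example 1.1 can be simulated in a few lines of Python. This is a sketch of the pass structure only: the record values and in-memory lists are stand-ins for disk blocks, and `heapq.merge` stands in for the run-merging step.

```python
import heapq
import random

MEMORY = 500          # records that fit in internal memory
records = list(range(6000))
random.shuffle(records)

# Pass 1 (run generation): internally sort memory-sized chunks -> 12 runs.
runs = [sorted(records[i:i + MEMORY]) for i in range(0, len(records), MEMORY)]
assert len(runs) == 12

# Subsequent passes: 2-way merge, pairing runs until a single run remains.
merge_passes = 0
while len(runs) > 1:
    merged = []
    for i in range(0, len(runs), 2):
        pair = runs[i:i + 2]                     # an odd run is carried over as is
        merged.append(list(heapq.merge(*pair)))  # merge two sorted runs
    runs = merged
    merge_passes += 1

print(merge_passes)               # 4 merge passes (passes 2-5 in the text)
print(runs[0] == sorted(records))  # True
```

The 12 runs shrink to 6, 3, 2, and finally 1, matching Fig. 1.1.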
1.2.2 3-Way Merge Sort
The k-way merge sort where k = 3 is a 3-way merge sort. In 3-way merge sort, 3 runs are merged at a time to generate a single run thrice as long. The merging process is repeated until a single run is generated. Considering the same Example 1.1 for 3-way merge sort yields the runs shown in Fig. 1.2.
In the second pass, R1 1 to R1 3 are merged, resulting in run R2 1, which consists of the sorted list of the first 1500 records. The next three runs, R1 4, R1 5, and R1 6, are merged, resulting in R2 2. Likewise, four runs emerge from this pass, i.e., R2 1 to R2 4. Similarly, in the third pass, R2 1 to R2 3 are merged to form R3 1. The last run, R2 4, is carried over as it is to R3 2. In the fourth pass, R3 1 and R3 2 are merged to form run R4 1, the final sorted file.
Figure 1.2 3-way merge sort: the 12 runs of 500 records each (R1 1 to R1 12) are merged three at a time into 4 runs of 1500 records (R2 1 to R2 4), then into R3 1 (records 1-4500) and R3 2 (records 4501-6000), and finally into the single sorted run R4 1 (records 1-6000)
1.2.3 4-Way Merge Sort
In the first pass, if the records are divided into 12 runs of 500 records each, then we have runs R1 1 to R1 12. In the second pass, R1 1 to R1 4 are merged to form R2 1. In the same way, R2 2 and R2 3 are generated by merging R1 5, R1 6, R1 7, R1 8 and R1 9, R1 10, R1 11, R1 12, respectively. In the third pass, all the runs (R2 1, R2 2, and R2 3) are merged together to form the final sorted run, i.e., R3 1.
Figure 1.3 4-way merge sort: the 12 runs of 500 records each are merged four at a time into 3 runs of 2000 records (R2 1 to R2 3), and then into the single sorted run R3 1 (records 1-6000)
From the above, we observe that as we increase the number of runs merged at a time, there is a substantial decrease in the number of passes. To summarize, a k-way merge combines k runs into one sorted run, where k ≥ 2. The algorithm for k-way merge sort is shown in Fig. 1.4.
k-way_mergesort(File1, M, k)
Step 1: Divide File1 into runs of M records each and sort each run using
        an internal sort
Step 2: Repeat for every set of k runs:
            Merge k runs into one sorted run with size as k times the input run size
            Write this single run back onto disk
        [END OF LOOP]
Step 3: If more than 1 run is generated in Step 2 then
            Repeat Step 2
        Else
            The file on the external storage is sorted
Step 4: EXIT

Figure 1.4 Algorithm for k-way merge sort
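The merge step of the algorithm above can be sketched with a min-heap holding the front record of each run. This is an illustrative in-memory sketch (the function name and list-based runs are my own, not from the text):

```python
import heapq

def merge_k_runs(runs):
    """Merge k sorted runs into one sorted run (the merge step of a k-way merge)."""
    out = []
    # Heap entries: (front record, run index, position within that run).
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    while heap:
        record, i, pos = heapq.heappop(heap)
        out.append(record)
        if pos + 1 < len(runs[i]):                      # run i still has records
            heapq.heappush(heap, (runs[i][pos + 1], i, pos + 1))
    return out

print(merge_k_runs([[3, 5, 12, 15], [2, 4, 10, 17], [1, 6, 8, 18]]))
# [1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 17, 18]
```

With a heap, selecting the next output record costs O(log2 k) comparisons rather than the k - 1 of a plain scan; the simpler scan is what the text describes next.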
In a 2-way merge, take the smallest record of each run and place it in the smallest set. The smaller of the two records in the smallest set is the smallest record overall. This record is put in the output run and removed from the corresponding run. The next smallest record of that run is then moved to the smallest set. This process is repeated until the two initial runs are empty. A k-way merge behaves identically, but at each step the smallest of k records is selected and removed from its run.
Example 1.2 Consider 3 runs with 4 records each, as shown in Fig. 1.5. Implement the 3-way merge technique.

Run 1:      3  5 12 15
Run 2:      2  4 10 17
Run 3:      1  6  8 18

Output Run: 1 2 3 4 5 6 8 10 12 15 17 18

Figure 1.5 Three runs ready to be merged and the output run resulting from the merge
Solution
• Consider the smallest record of each run and add it to the smallest set: {3, 2, 1}.
• Take the smallest record of the smallest set, 1, add it to the output run, and delete it from the original run. Move 6, the next record of the third run, to the smallest set. At this point, the output run is {1}. The step-by-step process of merging the runs is as follows:
Step 2: The three records in the smallest set are {3, 2, 6}. Remove 2 from the second run and append it to the output run: {1, 2}. Move 4 to the smallest set.
Step 3: The three records in the smallest set are {3, 4, 6}. Remove 3 from the first run and append it to the output run: {1, 2, 3}. Move 5 to the smallest set.
Step 4: The three records in the smallest set are {5, 4, 6}. Remove 4 from the second run and append it to the output run: {1, 2, 3, 4}. Move 10 to the smallest set.
Step 5: The three records in the smallest set are {5, 10, 6}. Remove 5 from the first run and append it to the output run: {1, 2, 3, 4, 5}. Move 12 to the smallest set.
Step 6: The three records in the smallest set are {12, 10, 6}. Remove 6 from the third run and append it to the output run: {1, 2, 3, 4, 5, 6}. Move 8 to the smallest set.
Step 7: The three records in the smallest set are {12, 10, 8}. Remove 8 from the third run and append it to the output run: {1, 2, 3, 4, 5, 6, 8}. Move 18 to the smallest set.
Step 8: The three records in the smallest set are {12, 10, 18}. Remove 10 from the second run and append it to the output run: {1, 2, 3, 4, 5, 6, 8, 10}. Move 17 to the smallest set.
Step 9: The three records in the smallest set are {12, 17, 18}. Remove 12 from the first run and append it to the output run: {1, 2, 3, 4, 5, 6, 8, 10, 12}. Move 15 to the smallest set.
Step 10: The three records in the smallest set are {15, 17, 18}. Remove 15 from the first run and append it to the output run: {1, 2, 3, 4, 5, 6, 8, 10, 12, 15}. The first run is now empty, so the merge continues as a 2-way merge instead of a 3-way merge.
Step 11: The two records in the smallest set are {17, 18}. Remove 17 from the second run and append it to the output run: {1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 17}. Now the second run is also empty; only the third run remains non-empty.
Step 12: The record of the last run, {18}, is appended to the output run, and the final sorted run is obtained: {1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 17, 18}.
The result of merging the three initial runs is {1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 17, 18}.
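The smallest-set procedure traced above can be written out mechanically. A minimal sketch, assuming in-memory runs (the function name and dict-based smallest set are my own choices):

```python
def smallest_set_merge(runs):
    """Merge sorted runs by repeatedly extracting the minimum of the smallest set."""
    runs = [list(r) for r in runs]
    # The smallest set holds the current front record of each non-empty run.
    smallest = {i: runs[i].pop(0) for i in range(len(runs)) if runs[i]}
    output = []
    while smallest:
        # Pick the smallest record in the smallest set and append it to the output run.
        i = min(smallest, key=smallest.get)
        output.append(smallest.pop(i))
        # Move the next record of the corresponding run into the smallest set.
        if runs[i]:
            smallest[i] = runs[i].pop(0)
    return output

result = smallest_set_merge([[3, 5, 12, 15], [2, 4, 10, 17], [1, 6, 8, 18]])
print(result)  # [1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 17, 18]
```

When a run empties, it simply drops out of the smallest set, which is exactly the transition from a 3-way to a 2-way merge seen in Steps 10-12.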
• Seek time: It is the time taken to position the read/write heads on the required cylinder; it depends on the number of cylinders across which the head moves.
• Latency time: It is the time taken to position the head on the right sector of the disk.
• Transmission time: It is the time taken to transmit a block of data to/from the disk.
To analyse the complexity of external sorting, the following terms are used:
ts  = maximum seek time
tl  = maximum latency time
trw = time to read or write one block of records
tIO = time to input or output one block = ts + tl + trw
ntm = time taken to merge n records from the input buffers to the output buffer
Here, we assume that every time a block of records is read from or written on to the disk, the maximum seek and latency times are incurred. This may not be true in reality, but the assumption simplifies the analysis. The computing time for the various operations performed in Example 1.1, to sort the 6000 records stored on disk, is given in Fig. 1.6.
Operations                                                          Time
Step 1: Read 60 blocks of records from the input file (60tIO),      120tIO + 12tIS
        perform the internal sort (12tIS), and write the 60
        blocks back on to the disk (60tIO).
Step 2: Merge the 12 runs of 500 records each, two runs at          120tIO + 6000tm
        a time.
Step 3: Merge the six runs of 1000 records each (20 blocks          120tIO + 6000tm
        per merge), two runs at a time.
Step 4: Merge two runs of 2000 records each, with a total           80tIO + 4000tm
        of 40 blocks.
Step 5: Merge the run of 4000 records with the run of 2000          120tIO + 6000tm
        records.
Total computing time = 560tIO + 22000tm + 12tIS

Figure 1.6 Computing time for sorting the 6000 records of Example 1.1
From Fig. 1.6, it is evident that the computing time for the sorting process depends mainly on the number of passes made over the data to get it sorted. The first pass is the one in which the data is divided into runs after an internal sort is carried out on chunks of size equal to the capacity of the internal memory.
The merging of runs requires 3⅔ passes over the entire set of records. In merging, the first pass merges the 12 runs of length 500 records each, yielding 6 runs of length 1000 records each. The time taken for this merge is 6000tm. The second pass merges these 6 runs, yielding 3 runs of length 2000 records each. The time taken for this merge is 6000tm. In the third pass, only 2 of the 3 runs are merged, i.e., 2 runs of length 2000 records each are merged, yielding one run of length 4000 records, with the remaining run of length 2000 records untouched. This is counted as 2/3 of a pass. The time taken for this merge is 4000tm. The fourth pass merges the two remaining runs, one of length 4000 records and the other of 2000 records, yielding the single sorted run of length 6000 records. The time taken for this merge is 6000tm.
The total number of passes required for sorting is 4⅔ (1 pass for generating runs and 3⅔ passes for merging the runs). Therefore, the total input and output time is 2 × (4⅔) × 60tIO = 560tIO. The time required for merging is 3⅔ × 6000tm = 22000tm.
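The pass arithmetic above can be checked with exact fractions (a small verification sketch; the symbolic units tIO and tm are carried implicitly):

```python
from fractions import Fraction

blocks = 60                        # 6000 records / 100 records per block
passes = 1 + 3 + Fraction(2, 3)    # run generation + 3 full merge passes + 2/3 pass

# Each pass reads and writes every block once: 2 x (4 2/3) x 60 = 560 block I/Os.
total_io = 2 * passes * blocks

# Merging touches the records for 3 2/3 passes: (3 2/3) x 6000 = 22000 record merges.
total_merge = (3 + Fraction(2, 3)) * 6000

print(total_io)     # 560  -> 560 tIO
print(total_merge)  # 22000 -> 22000 tm
```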
Due to the strong correlation of computing time with the number of passes made over the data, the complexity analysis of the k-way merge sort is mainly concerned with counting the number of passes required.
In general, if we start with m runs, a 2-way merge requires log2 m + 1 passes to sort the records. The number of passes can be reduced by using a higher-order merge, i.e., a k-way merge with k ≥ 2. Therefore, a k-way merge with m runs requires logk m + 1 passes for sorting the data. However, when we use a higher-order merge, k runs of length s1, s2, s3, ..., sk cannot be merged in O(Σi si) time, because the record to be output next is no longer just the smaller of the two possibilities of a 2-way merge; instead, all k possibilities must be considered, which takes k − 1 comparisons per record output. The computing time thus becomes O((k − 1) Σi si). As logk m passes are made, the total number of key comparisons is n(k − 1) logk m = n(k − 1) log2 m / log2 k, where n is the number of records to be sorted. So, (k − 1)/log2 k is the factor by which the number of key comparisons increases.
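The growth of the factor (k − 1)/log2 k can be tabulated quickly (an illustrative sketch, not from the text):

```python
import math

# Factor by which key comparisons grow in a k-way merge: (k - 1) / log2(k).
for k in (2, 4, 8, 16):
    print(k, round((k - 1) / math.log2(k), 2))
# 2 1.0
# 4 1.5
# 8 2.33
# 16 3.75
```

The factor grows with k, which is why, as noted below, the optimal k is a trade-off rather than "as large as possible".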
As the value of k increases, the input/output time decreases. But beyond a certain value of k, the input/output time will increase despite the decrease in the number of passes made. The optimal value of k depends on the disk parameters and the amount of internal memory available for buffers.
With a higher-order merge, the amount of I/O needed may be reduced with no significant loss in internal processing speed, but the reduction is not as large as the drop to logk m passes suggests. This is mainly because the number of input buffers needed in a k-way merge increases with k.
1.3 BUFFER HANDLING FOR PARALLEL OPERATION
Having introduced the k-way merge sorting technique in the previous section, let us now discuss in more detail the process of merging runs internally using buffers. In the k-way merge sorting technique, we need at least k input buffers and 1 output buffer for performing the merge operation. The major steps involved in the k-way merge are:
• Read: Load the records from the runs into the input buffers.
• Merge: Perform the merge operation on the input buffers and store the result in the output buffer.
• Write: Store the records from the output buffer on to the disk.
In the previous section, the read, merge, and write operations were performed sequentially. Instead of performing these operations in sequence, we can perform them in parallel to improve efficiency, i.e., we can perform the read, merge, and write operations simultaneously. However, the strategy of k input buffers and 1 output buffer is not sufficient to handle these parallel operations.
While the output buffer is being written on to the disk, the merge operation cannot proceed, as there is no place to collect the merged records. So, we need 2 output buffers: one for the write operation and one for the merge operation. Similarly, if content is being loaded into an input buffer, merging will be delayed. This can be handled by taking 2 input buffers per run: if one input buffer is busy with a read operation, the other input buffer is used for the merge operation. Therefore, we require 2k input buffers and 2 output buffers to handle the operations in parallel, i.e., a double buffering strategy is used. The parallel operations are demonstrated in Example 1.3.
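The benefit of overlapping read, merge, and write can be modelled roughly before working the example. This is an illustrative sketch with made-up per-block timings (not from the text); real gains depend on the actual disk and CPU speeds:

```python
# Per-block times for read, merge, and write, in arbitrary units (illustrative only).
t_read, t_merge, t_write = 1.0, 0.5, 1.0
blocks = 60

# Sequential: each block is read, then merged, then written, one after another.
sequential = blocks * (t_read + t_merge + t_write)

# Pipelined with double buffering: once the pipeline fills, the three operations
# proceed concurrently, so each further block costs only the slowest stage.
pipelined = (t_read + t_merge + t_write) + (blocks - 1) * max(t_read, t_merge, t_write)

print(sequential)  # 150.0
print(pipelined)   # 61.5
```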
Example 1.3 Using buffer handling for parallel operations, implement a 2-way merge for the runs shown below. Consider the block size as 2, so each buffer can hold 2 records.

Run 1: 2 4 6 8 9 10
Run 2: 3 5 7 16 21 26

Solution As a 2-way merge is to be used, 4 input buffers and 2 output buffers are required. Let in[1] and in[2] represent the input buffers for loading the records from Run1, and in[3] and in[4] the input buffers for loading the records from Run2. Let ou[0] and ou[1] represent the output buffers. Initially, all the input and output buffers are empty.
Step 1: Read–Load in[1] with a block from Run1. The configuration of buffers after the read operation is shown in Fig. 1.7.

Figure 1.7 Configuration of buffers after step 1

Step 2: Read–Load in[3] with a block from Run2. The configuration of buffers after the read operation is shown in Fig. 1.8.

Figure 1.8 Configuration of buffers after step 2

Step 3: Read–Load in[2] with the next block from Run1. Merge–Perform the merge operation on input buffers in[1] and in[3], and store the merged records in ou[0]. The configuration of buffers after the read and merge operations is shown in Fig. 1.9.

Figure 1.9 Configuration of buffers after step 3 (ou[0] holds 2 and 3)
Step 4: Read–Load in[4] with a block from Run2. Merge–Perform the merge operation on input buffers in[1], in[2], and in[3], and store the merged records in ou[1]. Write–Store the content of ou[0] on to the disk. The configuration of buffers after the read, merge, and write operations is shown in Fig. 1.10.

Figure 1.10 Configuration of buffers after step 4 (input buffers for Run1: 6 8; for Run2: 7 16; ou[1] holds 4 and 5)

Step 5: Read–Load in[1] with the next block from Run1. Merge–Perform the merge operation on the input buffers, and store the merged records in ou[0]. Write–Store the content of ou[1] on to the disk. The configuration of buffers after these operations is shown in Fig. 1.11.

Figure 1.11 Configuration of buffers after step 5 (ou[0] holds 6 and 7)

Step 6: Read–Load in[3] with the next block from Run2. Merge–Perform the merge operation on the input buffers, and store the merged records in ou[1]. Write–Store the content of ou[0] on to the disk. The configuration of buffers after these operations is shown in Fig. 1.12.

Figure 1.12 Configuration of buffers after step 6 (ou[1] holds 8 and 9)
Step 7: Merge–Perform the merge operation on input buffers in[1], in[3], and in[4], and store the merged records in ou[0]. Write–Store the content of ou[1] on to the disk. The configuration of buffers after the merge and write operations is shown in Fig. 1.13.

Figure 1.13 Configuration of buffers after step 7 (input buffers for Run2: 21 26; ou[0] holds 10 and 16)
After Step 7, merging is delayed until another record is loaded from Run1 into in[1].
In the above example, even though we used 2 buffers per run, we reached a situation in which processing had to be held up due to a lack of input records from a run. Simply assigning 2 buffers per run does not solve the problem. Therefore, input buffers have to be assigned to the runs cleverly to avoid input delay, i.e., an individual buffer may be assigned to any run depending upon need. A buffer that is assigned to a run when needed is called a floating buffer, and the k-way merge algorithm using floating buffers is called the buffering algorithm.
In the buffer assignment strategy of floating buffers, at any time there will be at least one input buffer for each run. The remaining buffers are filled on a priority basis. The following assumptions are made for the k-way merge with floating buffers (buffering algorithm):
1. To perform the operations of reading from the disk and writing on to the disk simultaneously, we use two disk drives.
2. While a read or write operation is in progress, the CPU cannot reference the same block of memory, i.e., the same output buffer cannot be used for filling in and for writing out. If this were possible, only one output buffer would be sufficient.
Example 1.4 Implement a k-way merge with floating buffers for the runs shown below. Assume that the last block of each run is loaded with the sentinel record with key +∞. The blocks in Run1 and Run2 are as follows:

Run 1: 2 4 6 8 9 10 +∞
Run 2: 3 5 7 16 21 26 +∞
Solution As we have considered two runs, k = 2, and there will be 2 linked queues, one for each run, which are initially empty. There are four floating buffers (b1, b2, b3, and b4), initially placed on the stack as in Fig. 1.14. Let N be the run from which the next block is to be loaded and s be the buffer size. Here, s = 2. Let LastKey[i] be the key value most recently placed into an input buffer from run i, where 1 ≤ i ≤ k. Let N = i where LastKey[i] is minimum; N is the run from which the next block is to be loaded. In the next load operation, a floating buffer will be popped from the stack and then loaded with a block from run N. Whenever a floating buffer becomes empty, it is pushed back on to the stack of floating buffers. Let M represent the sentinel value.

Figure 1.14 Stack of 4 floating buffers (b1, b2, b3, b4)

The process of 2-way merge with floating buffers is shown in Fig. 1.15 and is explained in a line-by-line manner below.
Line 1: Take two floating buffers (one for each run) from the stack of floating buffers and assign
a buffer to each run. Load the first block from Run1 into floating buffer b1 and first block from
Run2 into floating buffer b2. LastKey[1] = 4 and LastKey[2] = 5. Here, N = 1, as the LastKey[1] is
minimum. The next block is to be loaded from Run1.
Line 2: The minimum keys (2 and 3) in the input buffers are merged into the output buffer and
the next record in Run1 is loaded into the popped buffer b3 and is added to the linked queue at
Run1. Now N = 2, as the LastKey[2] = 5 is minimum when compared to LastKey[1]. As the output
buffer is full, store its contents on to the disk.
Line 3: The minimum keys (4 and 5) in the input buffers are merged into the output buffer and
the next record in Run2 is loaded into the popped buffer b4 and is added to the linked queue at
Run2. Now N = 1, as the LastKey[1] is minimum. The empty buffer b1 is pushed back on to the
stack. As the output buffer is full, store its contents on to the disk.
Line 4: The minimum keys (6 and 7) in the input buffers are merged into the output buffer and the next record in Run1 is loaded into the popped buffer b1 and is added to the linked queue at Run1. Now N = 1, as the LastKey[1] is minimum. The empty buffer b2 is pushed back on to the stack. As the output buffer is full, store its contents on to the disk.
Line 5: The minimum keys (8 and 9) in the input buffers are merged into the output buffer and
the next record M (sentinel record) in Run1 is loaded into the popped buffer b2 and is added to the
linked queue at Run1. Now N = 2, as the LastKey[2] is minimum. The empty buffer b3 is pushed
back on to the stack. As the output buffer is full, store its contents on to the disk.
Line 6: The minimum keys (10 and 16) in the input buffers are merged into the output buffer
and the next record in Run2 is loaded into the popped buffer b3 and is added to the linked queue
at Run2. Now N = 2, as the LastKey[2] is minimum. Finally, the empty buffers b1, b4 are pushed
back on to the stack. As the output buffer is full, store its contents on to the disk.
Line 7: The minimum keys (21 and 26) in the input buffers are merged into the output buffer and the next record M (sentinel record) in Run2 is loaded into the popped buffer and is added to the linked queue at Run2. As all the input buffers contain the sentinel record, there is no further load step. Finally, the empty buffer is pushed back on to the stack. As the output buffer is full, store its contents on to the disk.
Line 8: As the sentinel record is placed in the output buffer, the entire process terminates.
Line   Run 1     Run 2     Output      Next block from
1      2 4       3 5       no output   Run 1
2      4 6 8     5         2 3         Run 2
3      6 8       7 16      4 5         Run 1
4      8 9 10    16        6 7         Run 1
5      10 M      16        8 9         Run 2
6      M         21 26     10 16       Run 2
7      M         M         21 26       no next
8      M         M         -           -

Figure 1.15 Process of 2-way merge with floating buffers
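The LastKey rule that produces the "Next block from" column of Fig. 1.15 can be sketched on its own. This is a simplified model of Example 1.4 (my own illustration; it tracks only which run is read next, ignoring the buffer pool and queues):

```python
INF = float("inf")  # the sentinel record M

run1 = [[2, 4], [6, 8], [9, 10], [INF]]
run2 = [[3, 5], [7, 16], [21, 26], [INF]]

def next_run_to_load(last_keys):
    """The run whose most recently loaded key is smallest is read next."""
    return min(range(len(last_keys)), key=lambda i: last_keys[i])

runs = [run1, run2]
pos = [1, 1]                               # index of the next block to load per run
last = [run1[0][-1], run2[0][-1]]          # LastKey[i] after the initial loads
loads = []
while min(last) != INF:                    # stop once every run has reached its sentinel
    n = next_run_to_load(last)
    loads.append(n + 1)                    # record the 1-based run number, as in the text
    last[n] = runs[n][pos[n]][-1]          # load the next block; update LastKey[n]
    pos[n] += 1

print(loads)  # [1, 2, 1, 1, 2, 2] -> matches the 'Next block from' column
```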
Example 1.5 Implement a k-way merge with floating buffers for the runs shown below. Assume that the last block of each run is loaded with the sentinel record with key +∞. The blocks in Run1, Run2, and Run3 are as follows:

Run 1: 22 27 28 30 31 32 35 +∞
Run 2: 25 31 36 38 40 62 72 +∞
Run 3: 26 30 33 35 42 45 52 +∞

Solution As we have considered three runs, k = 3, and there will be 3 linked queues, one for each run, which are initially empty. There are six floating buffers (b1, b2, b3, b4, b5, and b6), initially placed on the stack as in Fig. 1.16. Let N be the run from which the next block is to be loaded and s be the buffer size. Here, s = 2. Let LastKey[i] be the key value most recently placed into the input buffer from run i, where 1 ≤ i ≤ k. Let N = i where LastKey[i] is minimum; N is the run from which the next block is to be loaded. In the next load operation, a floating buffer will be popped from the stack and then loaded with a block from run N. Whenever a floating buffer becomes empty, it is pushed back on to the stack of floating buffers. Let M represent the sentinel value.

Figure 1.16 Stack of 6 floating buffers (b1, b2, b3, b4, b5, b6)
The process of 3–way merge with floating buffers is shown in Fig. 1.17 and is explained in a
line-by-line manner below.
Line 1: Take three floating buffers (one for each run) from the stack of floating buffers and assign
a buffer to each run. Load the first block from Run1 into floating buffer b1, the first block from
Run2 into floating buffer b2, and the first block from Run3 into floating buffer b3. LastKey[1] = 27,
LastKey[2] = 31 and LastKey[3] = 30. Here N = 1, as the LastKey[1] is minimum. The next block is
to be loaded from Run1.
Line 2: The minimum keys (22 and 25) in the input buffers are merged into the output buffer
and the next record in Run1 is loaded into the popped buffer from the stack and is added to
the linked queue at Run1. Now N = 1, as the LastKey[1] is minimum. (Here, Run1 and Run3 have
equal keys 30, but the minimum indexed run is considered as per our assumptions.) As the
output buffer is full, store its contents on to the disk.
Line 3: The minimum keys (26 and 27) in the input buffers are merged into the output buffer
and the next record in Run1 is loaded into the popped buffer from the stack and is added to the
linked queue at Run1. Now N = 3, as the LastKey[3] is minimum. The empty buffer in the linked
queue at Run1 is pushed back on to the stack. As the output buffer is full, store its contents onto
the disk.
Line 4: The minimum keys (28 and 30) in the input buffers are merged into the output buffer
and the next record in Run3 is loaded into the popped buffer from the stack and is added to the
linked queue at Run3. Now N = 2, as the LastKey[2] is minimum. The empty buffer in the linked
queue at Run1 is pushed back on to the stack. As the output buffer is full, store its contents on
to the disk.
Line 5: The minimum keys (30 and 31) in the input buffers are merged into the output buffer
and the next record in Run2 is loaded into the popped buffer from the stack and is added to the
linked queue at Run2. Now set N = 1, as the LastKey[1] is minimum. As the output buffer is full,
store its contents on to the disk.
Line 6: The minimum keys (31 and 32) in the input buffers are merged into the output buffer
and the next record in Run1 is loaded into the popped buffer from the stack and is added to the
linked queue at Run1. Now set N = 3, as the LastKey[3] is minimum. The empty buffer in the linked
queue at Run1 is pushed back on to the stack. As the output buffer is full, store its contents on
to the disk.
Line 7: The minimum keys (33 and 35) in the input buffers are merged into the output buffer
and the next record in Run3 is loaded into the popped buffer from the stack and is added to the
linked queue at Run3. Now set N = 2, as the LastKey[2] is minimum. As the output buffer is full,
store its contents on to the disk.
Line 8: The minimum keys (35 and 36) in the input buffers are merged into the output buffer
and the next record in Run2 is loaded into the popped buffer from the stack and is added to the
linked queue at Run2. Now set N = 3, as the LastKey[3] is minimum. The empty buffer in the linked
queue at Run3 is pushed back on to the stack. As the output buffer is full, store its contents on
to the disk.
Line 9: The minimum keys (38 and 40) in the input buffers are merged into the output buffer
and the next record in Run3 is loaded into the popped buffer from the stack and is added to the
linked queue at Run3. Now set N = 2, as the LastKey[2] is minimum. The empty buffer in the linked
queue at Run2 is pushed back on to the stack. As the output buffer is full, store its contents on
to the disk.
Line 10: The minimum keys (42 and 45) in the input buffers are merged into the output buffer
and the next record in Run2 is loaded into the popped buffer from the stack and is added to the
linked queue at Run2. Now, the minimum key of the last loaded records from each run is +∞
(M), which indicates there is no next load operation. The empty buffer in the linked queue at Run3
is pushed back on to the stack. As the output buffer is full, store its contents on to the disk.
Line 11: The minimum keys (52 and 62) in the input buffers are merged. As there was no next run
(N value) in the previous step, no load operation is performed and N need not be updated. The empty
buffer in the linked queue at Run2 is pushed back on to the stack. As the output buffer is full,
store its contents on to the disk.
Line 12: The minimum keys (72 and M) in the input buffers are merged into the output buffer.
As the sentinel record is merged, the process terminates with this step.
Line   Run 1         Run 2       Run 3         Output      Next block from
1      22 27         25 31       26 30         no output   Run 1
2      27 28 30      31          26 30         22 25       Run 1
3      28 30 31 32   31          30            26 27       Run 3
4      31 32         31          30 33 35      28 30       Run 2
5      32            31 36 38    33 35         30 31       Run 1
6      35 M          36 38       33 35         31 32       Run 3
7      M             36 38       35 42 45      33 35       Run 2
8      M             38 40 62    42 45         35 36       Run 3
9      M             62          42 45 52 M    38 40       Run 2
10     M             62 72 M     52 M          42 45       no next
11     M             72 M        M             52 62       no next
12     M             M           M             72 M

Figure 1.17 Line-by-line trace of the 3–way merge with floating buffers

The algorithm depicting the process of k–way merge using floating buffers is presented in
Fig. 1.18.
s
buffer is returned to the stack of empty buffers. However, if an input buffer
s
becomes empty and the output buffer gets +∞ merged into it, the empty buffer is
re
left on the queue. If the output buffer gets +∞ merged into it, the contents of
ou are copied onto disk, the merge terminates and the control goes to Step 6.
yP
Step 4: Wait for any ongoing disk input/output to complete.
Step 5: Repeat Steps 2 to 4 until there is no input block to be read.
Step 6: EXIT
sit
Fig 1.18 Algorithm for k–way merge with floating buffers
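Stripped of the disk I/O and floating-buffer machinery, the merge step of the algorithm above can be sketched in C++. The sketch below is a minimal in-memory model under our own naming, with INT_MAX standing in for the sentinel M: each run ends with a sentinel key, and at every step the run with the minimum current key is selected, mirroring the role of LastKey and N.

```cpp
#include <vector>
#include <climits>

// Minimal in-memory sketch of k-way merge with a sentinel key (M = INT_MAX).
// Each run is appended with the sentinel; at every step the run whose current
// front key is minimum supplies the next output record, mirroring the
// selection of run N in the floating-buffer algorithm (without any disk I/O).
std::vector<int> kWayMerge(std::vector<std::vector<int>> runs) {
    int k = runs.size();
    for (auto &r : runs) r.push_back(INT_MAX);   // sentinel record M
    std::vector<int> pos(k, 0), out;
    while (true) {
        int n = 0;                               // run with the minimum front key
        for (int i = 1; i < k; i++)
            if (runs[i][pos[i]] < runs[n][pos[n]]) n = i;
        if (runs[n][pos[n]] == INT_MAX) break;   // only sentinels remain
        out.push_back(runs[n][pos[n]++]);        // move the record to the output
    }
    return out;
}
```

Merging the three runs of Example 1.5 this way yields the single sorted sequence beginning 22, 25, 26, 27 and ending with 72. The linear scan costs O(k) per record; the winner and loser trees discussed in the next section reduce this selection to O(log k).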
With the existing internal sorting techniques, it is possible to generate runs whose size is equal
to the number of records that can be held in the internal memory. However, by using a loser
tree, this can be done in a better way. A loser tree can be understood clearly by first knowing
about winner trees. A winner tree is a complete binary tree which helps us to obtain the smallest
key from a set of keys, where each node represents the smaller of its two children. The root
consists of the smallest value in the entire tree.
The construction of the winner tree may be compared with a tournament tree. In a tournament
tree, each external node represents a player and each internal node represents the winner of the
match played between the players represented by its children nodes. Similarly, each internal node
of the winner tree represents the winner of a tournament and the root node is the overall winner
of the entire tournament. Each internal node consists of the smallest value of its children and the
overall winner is the key with the smallest value in the entire tree. The keys or records are present
in the external nodes or leaf nodes only. All the internal nodes consist of only the pointers to the
winners of each tournament played at the respective levels of the tree. Here, in the k–way merge
sort, the leaf nodes of the winner tree consist of the first record from each of the k runs being
merged. The concept of merging using loser trees can be better understood using the following example.
Example 1.6 Consider merging of 4 runs using winner tree and loser tree for the runs shown
below.
Run 1: 15 25 30 46
Run 2: 10 18 20 93
Run 3: 24 56 68 76
Run 4:  8 12 16 28
Solution To construct the winner tree, the first keys from all the 4 runs will be initially loaded
into the external nodes as shown in Fig. 1.19.
Figure 1.19 Winner tree (leaves 15, 10, 24, 8; winners 10 and 8; root 8)
Figure 1.20 Winner tree after first record is output (leaves 15, 10, 24, 12; root 10)
The tournament is played between the sibling nodes and the resulting winner (smallest key) is
placed in the parent node. The root node consists of the overall winner of the tournament, which is
the result of the tournament played between the winners of the previous level in the tree. Every new
comparison takes place at every level of the tree, starting from the leaf nodes towards the root.
In Fig. 1.19, the internal nodes 10 and 8 represent the winners of the respective tournaments
played between 15, 10 and 24, 8 respectively. The root node of the winner tree consists of the
overall winner, i.e., 8. This value pointed to by the root is stored onto the disk.
When a record is output in the winner tree, the next record to be placed in the external node
is taken from the run from which the smallest record was output in the previous pass.
Hence, in this example, the smallest record at the root, 8, is taken from Run4. So, the next value
to be input is taken from Run4. The next value in Run4 to be input is 12. The value 12 is
compared against the other values, which are basically the losers in the previous pass. The winner
tree constructed after the first record is output is shown in Fig. 1.20. The process repeats until all the
elements from the runs are processed.
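The repeated select-smallest-and-replace cycle described above can be sketched with a min-heap of (key, run) pairs, which plays the same role as the winner tree's internal nodes: each pop yields the overall winner in O(log k) time. The function name and the in-memory representation of the runs are our own choices, an illustrative sketch rather than the textbook's implementation.

```cpp
#include <vector>
#include <queue>
#include <utility>
#include <functional>

// Selection sketch in the spirit of a winner tree: a min-heap of
// (key, run index) pairs. Popping the heap yields the overall winner;
// the next record of the winning run is then pushed, so each output
// record costs O(log k) comparisons.
std::vector<int> mergeByTournament(const std::vector<std::vector<int>>& runs) {
    std::vector<int> pos(runs.size(), 0), out;
    std::priority_queue<std::pair<int,int>,
                        std::vector<std::pair<int,int>>,
                        std::greater<std::pair<int,int>>> pq;
    for (int r = 0; r < (int)runs.size(); r++)
        if (!runs[r].empty()) pq.push({runs[r][0], r});
    while (!pq.empty()) {
        auto [key, r] = pq.top(); pq.pop();      // overall winner
        out.push_back(key);
        if (++pos[r] < (int)runs[r].size())      // next record from the winner's run
            pq.push({runs[r][pos[r]], r});
    }
    return out;
}
```

For the runs of Example 1.6, the first records 15, 10, 24, 8 enter the tournament; 8 is output first, after which 12 from Run4 enters, exactly as in Figs. 1.19 and 1.20.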
Another efficient way of generating runs is by using loser trees, in which the number of
comparisons is reduced. The winner tree can be restructured in such a way that instead of placing
pointers to winners as internal nodes, pointers to losers are placed as internal nodes. Such trees
are called loser trees. A special node is added on top of the root node to represent the overall
winner of the tournament. The leaf nodes of a loser tree consist of the first records of all the runs,
as in the winner tree.
The loser tree for Example 1.6 is shown in Fig. 1.21. The internal nodes 15 and 24 represent
the losers of the respective tournaments played between 15, 10 and 24, 8 respectively. The root
node consists of the loser of the winner nodes 10 and 8, i.e., 10. The overall winner consists of the
winner of 10 and 8 (i.e., 8).
In the first step, the overall winner is output and therefore the value 8 is copied onto the
disk. The next value is to be considered from Run4, as the output value 8 is from Run4. The value
to be input is now 12. This is copied into the leaf node. But once the overall winner is output, the
records with which the input needs to play its tournaments are readily available in the parent
nodes. As a result, the sibling nodes are not accessed when the tournaments are being played.
The loser tree after the first record is output is shown in Fig. 1.22. The value 12 in the leaf node is
compared with its parent 24 and not with its sibling 24. The root node consists of the loser
of the nodes 10 and 12, i.e., 12. The overall winner consists of the winner of 10 and 12 (i.e., 10).
The process repeats until all the elements from the runs are processed.
Figure 1.21 Loser tree (leaves 15, 10, 24, 8; losers 15 and 24; root 10; overall winner 8)
Figure 1.22 Loser tree after the first record is output (leaves 15, 10, 24, 12; root 12; overall winner 10)
The time required to merge n records from k runs is O(n log k), as it takes
O(log k) time to adjust the loser tree each time a record is output.
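The loser tree can be sketched compactly as an array ls[0..k-1], where ls[0] holds the overall winner and ls[1..k-1] hold the losers of the internal matches; after a record is output, only the path from its leaf to the top is replayed against the stored losers. The layout and helper names below are our own choices for illustration, with an INT_MAX sentinel marking an exhausted run.

```cpp
#include <vector>
#include <climits>
#include <utility>

// Loser-tree merge of k sorted runs. ls[0] holds the overall winner's run
// index; ls[1..k-1] hold the losers of the internal matches. The leaf of
// run s sits at position s + k, so its parent is (s + k) / 2. After each
// output, only the path from the winner's leaf to the top is replayed
// against the losers already stored in the parents: O(log k) per record.
std::vector<int> loserTreeMerge(const std::vector<std::vector<int>>& runs) {
    int k = runs.size();
    std::vector<int> pos(k, 0);
    auto key = [&](int r) {
        if (r < 0) return INT_MIN;   // dummy player used only while building
        return pos[r] < (int)runs[r].size() ? runs[r][pos[r]] : INT_MAX;  // sentinel
    };
    std::vector<int> ls(k, -1);      // every match starts against a dummy
    auto replay = [&](int s) {       // replay matches from the leaf of run s
        for (int t = (s + k) / 2; t > 0; t /= 2)
            if (key(s) > key(ls[t])) std::swap(s, ls[t]);  // loser stays, winner moves up
        ls[0] = s;
    };
    for (int s = k - 1; s >= 0; s--) replay(s);            // build the tree
    std::vector<int> out;
    while (key(ls[0]) != INT_MAX) {
        out.push_back(key(ls[0]));   // output the overall winner
        pos[ls[0]]++;                // fetch the next record of that run
        replay(ls[0]);
    }
    return out;
}
```

Replaying only against the stored parents is what the text means by the sibling nodes not being accessed: each output record costs one root-to-leaf path of comparisons, i.e., O(log k).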
When the run sizes are equal, the task of merging is very simple and can be implemented using a loser
tree as discussed in the previous section. But when runs of different sizes are to be merged, the
process of merging is challenging and the main question is how the merging can be optimized.
For merging runs of different sizes, a merge tree is used.
A merge tree is a tree constructed by merging the runs of different sizes. In a merge tree, a
circular node is called an internal node and a square node is called an external node, which is
used to represent the initial runs. The possible ways of merging runs of different sizes can be
demonstrated clearly by considering an example of four runs of lengths 3, 6, 8, and 14 respectively.
The runs are merged in two different ways using 2–way merge. The two merge trees generated
are shown in Fig. 1.23(a) and Fig. 1.23(b). In the first merge tree, shown in Fig. 1.23(a), the merging
is started by combining the runs of size 3 and 6 represented as external nodes. The output is a
single run of size 9 represented as an internal node. This run of size 9 is next merged with the
run of size 8 (external node), which results in a run of size 17 represented as an internal node.
This run is finally merged with the run of size 14 (external node) to result in a single run of size
31 represented as the root node.
In the second merge tree, shown in Fig. 1.23(b), the merging again starts by combining
the runs of size 3 and 6 represented as external nodes to result in a run of size 9 represented as an
internal node. The next step is to merge the next external nodes, i.e., the runs of size 8 and 14, to
result in a run of size 22 represented as an internal node. Finally, the two internal nodes are merged,
i.e., the run of size 9 is merged with the run of size 22 to obtain a single run of size 31 represented
as the root node.
Figure 1.23 (a) Merge tree combining ((3, 6), 8) and 14 (b) Merge tree combining (3, 6) and (8, 14)
Now the question is which among the two is optimal, i.e., which type of merge has the minimum
merging time. This can be easily obtained by calculating the weighted external path length. The
weighted external path length (WEPL) of a tree T is the total merge time calculated by adding
the product of the weight of an external node and the depth of the external node from the root node.
The weight of an external node is the run size. The depth of an external node is the length of the
path from the root node to the external node. The WEPL can be calculated using the following
equation.
WEPL(T) = w1d1 + w2d2 + ... + wndn
where wi and di are the weight and depth of external node i.
For Fig. 1.23(a),
WEPL(T) = 3*3+6*3+8*2+14*1
= 57
For Fig. 1.23(b),
WEPL(T) = 3*2+6*2+8*2+14*2
= 62
The cost of a k–way merge of n runs of lengths qi, where 1 ≤ i ≤ n, is minimized by utilizing a merge
tree of degree k that has the minimum WEPL. Applying this concept to Fig. 1.23(a) and
Fig. 1.23(b), we observe that Fig. 1.23(a) has the minimum WEPL value of 57. Therefore, the
merging scheme of Fig. 1.23(a) is preferred.
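The minimum-WEPL merge pattern can also be computed directly by a greedy rule: repeatedly merge the two smallest runs, adding the size of each resulting run to the total cost. The function below is our own illustrative sketch of this rule, not code from the book.

```cpp
#include <vector>
#include <queue>
#include <functional>

// Total record moves for an optimal 2-way merge of runs with the given sizes:
// repeatedly merge the two smallest runs and add the merged size to the cost.
// For sizes {3, 6, 8, 14} this reproduces the minimum WEPL of 57.
long optimalMergeCost(const std::vector<long>& sizes) {
    std::priority_queue<long, std::vector<long>, std::greater<long>>
        pq(sizes.begin(), sizes.end());
    long total = 0;
    while (pq.size() > 1) {
        long a = pq.top(); pq.pop();
        long b = pq.top(); pq.pop();
        total += a + b;        // cost of this merge = size of the resulting run
        pq.push(a + b);        // the merged run re-enters the pool
    }
    return total;
}
```

Here optimalMergeCost({3, 6, 8, 14}) returns 57, matching Fig. 1.23(a), while the fixed pairing of Fig. 1.23(b) costs 9 + 22 + 31 = 62.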
Huffman worked on this problem and came up with a solution by constructing a binary tree
with minimum weighted external path length, called the Huffman tree. The Huffman algorithm
is applied for constructing the Huffman tree. The two main activities done using the Huffman
algorithm are constructing a Huffman tree and finding the Huffman codes by traversing the nodes
from the root to the leaf nodes in the Huffman tree. In the Huffman tree, the leaf nodes represent the
characters and the internal nodes represent the intermediary values. The process of constructing
a Huffman tree and finding the Huffman codes is demonstrated in Example 1.7.
Note HUFFMAN TREE Huffman coding is an entropy encoding algorithm developed by David A. Huffman
that is widely used as a lossless data compression technique. The Huffman coding algorithm uses a variable
length code table to encode a source character where the variable-length code table is derived on the basis of
the estimated probability of occurrence of the source character.
The key idea behind the Huffman algorithm is that it encodes the most common characters using shorter strings of
bits than those used for less common source characters.
The algorithm works by creating a binary tree of nodes that are
stored in an array. A node can be either a leaf node or an internal
node. Initially, all the nodes in the tree are at the leaf level and store
the source character and its frequency of occurrence (also known
as weight).
While the internal node is used to store the weight and contains links
to its child nodes, the external node contains the actual character.
Conventionally, a '0' represents following the left child and a '1'
represents following the right child. A finished tree that has n leaf
nodes will have n – 1 internal nodes.
Before proceeding to Huffman coding, let us first learn how to calculate the length of the paths in the tree. The external path length of a binary
tree is defined as the sum of all path lengths summed over each path from the root to an external node. The
internal path length is defined in the same manner: it is the sum of all path lengths summed over
each path from the root to an internal node.
Thus, LI + 2n = LE, where LI is the internal path length, LE the external path length, and n the
number of internal nodes. Now if the tree has n external nodes and each external
node is assigned a weight, then the weighted path length P is defined as the sum of the weighted path lengths.
Therefore, P = W1L1 + W2L2 + …. + WnLn
where, Wi and Li are the weight and path length of an external node Ni.
Example 1.7 Consider six characters q1 to q6 with the following values for each node.
q1 = 4, q2 = 6, q3 = 8, q4 = 9, q5 = 15, q6 = 28. Construct a Huffman tree and find the code for each
character.
Solution Step 1: Create six leaf nodes for the six characters as shown below.
4 6 8 9 15 28
Step 2: Select two minimum nodes from the given character list. In this example q1 and q2 values
are minimum. Construct a Huffman tree as shown in Fig. 1.26 with q1 as left child and q2 as right
child. Calculate the internal node value as the sum of the left and right child, i.e., 10. Replace q1
and q2 in the list with this value 10. The list now contains 10 (the internal node value in Fig. 1.26),
q3 = 8, q4 = 9, q5 = 15, and q6 = 28.
Figure 1.26 Binary tree with nodes q1 and q2
Figure 1.27 Binary tree with nodes q3 and q4
Step 3: Consider the next two minimum nodes from the list. The values of q3 and q4 are minimum.
Construct a tree with q3 as left child and q4 as right child. Create an internal node with value as
the sum of the values of left and right child, i.e., 17 as shown in Fig. 1.27. Replace q3 and q4 in
the list with this value 17. The list now contains 10, 17, q5 = 15, and q6 = 28.
Step 4: Consider the next two minimum nodes from the list. The values 10 and q5 are minimum.
Construct the tree with the internal node 10 as left child and q5 as right child. Create an internal
node with value as the sum of the values of left and right child, i.e., 25 as shown in Fig. 1.28.
Replace 10 and q5 in the list with this value 25. The list now contains 17, 25, and q6 = 28.
Step 5: Consider the next two minimum nodes from the list. The values 17 and 25 are minimum.
Construct the tree with the internal node 17 as left child and 25 as right child. Create an internal
node with value as the sum of the values of the left and right child, i.e., 42, as shown in Fig. 1.29.
Replace 17 and 25 in the list with this value 42. The list now contains 42 and q6 = 28.
Figure 1.28 Binary tree after q5 is inserted
Step 6: Consider the next two minimum nodes from the list. The values 28 and 42 are minimum.
Construct the tree with the internal node 42 as right child and 28 as left child. Create an internal
node with value as the sum of the values of the left and right child, i.e., 70, as shown in Fig. 1.30.
Replace 28 and 42 in the list with this value 70. The list now contains 70.
Consider the next two minimum nodes from the list. The list consists of only one value. The
process stops and the final binary tree constructed is the Huffman tree shown in Fig. 1.30.
Figure 1.29 Binary tree after nodes 17 and 25 are inserted
Figure 1.30 Huffman tree
The Huffman tree in Fig. 1.30, constructed using the Huffman algorithm, has a minimum WEPL.
The algorithm to construct a Huffman tree is shown in Fig. 1.32.
Step 1: Use the given characters in the list and create a leaf node for each character.
Step 2: Select two nodes with minimum values in the given list of characters.
Step 3: Construct a binary tree by creating a new internal node with value as the sum
of the two minimum values from the list. The minimum value of the list is considered as a
left child and the next minimum value is considered as a right child. Replace the two
nodes selected in the list with the internal node value.
Step 4: Repeat Steps 2 and 3 until the list has only one value.
Step 5: EXIT
Figure 1.32 Algorithm to construct a Huffman tree
The time complexity for constructing a Huffman tree is O(n log n), where n is the number of
unique characters.
Programming Example
1. Write a C++ program to construct a Huffman tree.
#include<iostream>
#include<vector>
#include<string>
#include<climits> //for INT_MAX
using namespace std;
class Huffman;
//A Tree Node
class node
{
node *Left;
node *Right;
double value;
string character;
string Code;
friend class Huffman;
};
class Huffman
{
public:
node SetHuffmanTree();
void BFS(node * temproot,string s);
void SetHuffmanCode();
vector<node> V;
// V records all the nodes created in the whole process
node extractMin()
{
double temp = (double) INT_MAX;
vector<node>::iterator i,pos;
//Declaring iterator to access vector elements
for(i=V.begin();i!=V.end();i++)
{
if(temp>(*i).value)
//comparing values
{
pos=i;
temp=(*i).value;
}
}
node tempNode = (*pos);
V.erase(pos);
return tempNode;
}
};
node Huffman::SetHuffmanTree()
{
//repeatedly combine the two minimum nodes until only the root remains
while(V.size() > 1)
{
node *InternalNode = new node;
node *tempNode1 = new node;
node *tempNode2 = new node;
*tempNode1 = extractMin();
*tempNode2 = extractMin();
InternalNode->Left = tempNode1;
InternalNode->Right = tempNode2;
//create new internal node by Adding values
InternalNode->value = tempNode1->value+tempNode2->value;
V.push_back(*InternalNode);
}
return V[0]; //root of the Huffman tree
}
void Huffman::BFS(node *root1,string str)
{
root1->Code = str;
if(root1->Left==NULL && root1->Right==NULL)
{
//leaf node: print the character and its Huffman code
cout<<" "<<root1->character<<"\t\t"<<root1->Code<<endl;
}
else
{
//Appending 0 for left child
root1->Left->Code =str.append("0");
str.erase(str.end()-1);
//Appending 1 for right child
root1->Right->Code = str.append("1");
str.erase(str.end()-1);
BFS(root1->Left,str.append("0"));
str.erase(str.end()-1);
BFS(root1->Right,str.append("1"));
str.erase(str.end()-1);
}
}
void Huffman::SetHuffmanCode()
{
int size,i;
double F;
string tempString="";
cout<<"Enter the number of characters to encode "<<endl;
cin>>size;
for(i=0;i<size;i++)
{
cout<<"Enter character to be encoded and its value "<<endl;
node InternalNode;
cin>>tempString;
cin>>F;//Reading values
InternalNode.value=F;
InternalNode.character=tempString;
InternalNode.Left=NULL;
InternalNode.Right=NULL;
V.push_back(InternalNode);
}
//build the tree and print the code for each character
node Root = SetHuffmanTree();
BFS(&Root,"");
}
int main()
{
Huffman h;
h.SetHuffmanCode();
return 0;
}
Output
Enter the number of characters to encode
6
Enter character to be encoded and its value
q1 4
Enter character to be encoded and its value
q2 6
Enter character to be encoded and its value
q3 8
Enter character to be encoded and its value
q4 9
POINTS TO REMEMBER
• External sorting is a term for a class of sorting algorithms that can handle massive amounts of data.
• In external sorting, each of the sorted blocks of records is called a run.
• k–way merge algorithms are a specific type of sequence merge algorithms that specialize in taking in multiple sorted lists and merging them into a single sorted list.
• In k–way merge, read, write, and merge operations are performed in serial. In order to improve its efficiency, these operations are carried out in parallel.
• The input delay in buffer handling can be overcome with the help of floating buffers.
• A winner tree is a complete binary tree which helps us to obtain the smallest key from a set of keys, where each node represents the smaller of its two children.
• The weighted external path length (WEPL) of a tree T is the total merge time calculated by adding the product of the weight of an external node and the depth of the external node from the root node.
• Using the Huffman algorithm, a Huffman tree with minimum WEPL can be constructed.
EXERCISES

Review Questions
1. Write an algorithm for k–way merge sort.
2. Explain the step-by-step process of merging runs.

Programming Exercises
1. Write a program to implement a Huffman tree.
2. Write a program to implement a loser tree.

Multiple-choice Questions
4. Files from which records are selected are called:
(a) Scan files (b) Spanned files (c) Non dense files (d) Indexed files
5. In sort merge strategy, the small sorted sub-files are called:
(a) Runs (b) Folders (c) Blocks (d) Indexes
6. Finding the location of a given item in a collection of items is called:
(a) Discovering (b) Finding (c) Searching (d) Mining
7. Which of the following is an external sorting?
(a) Insertion sort (b) Bubble sort (c) Merge sort (d) Tree sort
8. Which of the following is an internal sorting?
(a) Tape sort (b) 2–way merge sort (c) Merge sort (d) Tree sort
9. Merging k sorted tables into a single sorted table is called:
(a) k–way merging (b) kth merge (c) k+1 merge (d) k-1 merge
10. (a) finding the median (b) merging (c) selection (d) all of the above
11. What is the time complexity of Huffman coding?
(a) O(n) (b) O(n (log n)^2) (c) O(n log n) (d) O(n^2)
12. Which of the following is not a correct statement?
(a) Internal sorting is used if the number of items to be sorted is very large
(b) External sorting is used if the number of items to be sorted is very large
(c) External sorting needs auxiliary storage
(d) Internal sorting needs
13. What is the basic algorithm used for external sorting?

True or False
1. k–way merge is used for external sorting.
2. The number of merge passes over the runs can be increased by using a high-order merge than 2–way merge.
3. In k–way merge with buffer handling, assigning two buffers per run solves the problem of input delay.
4. Loser trees are also called as tournament trees.
5. A winner tree is a tree in which every internal node of the tree at level i is the winner of the tournament played by the child nodes at level i-1.
6. Sorting is always performed on the elements stored in primary memory.
7. The process of sorting a list stored in a file in secondary memory is known as internal sorting.
8. The time required to read or write is considered to be significant in evaluating the performance of external sorting.

Fill in the Blanks
2. ______ output buffers are used.
3. ______ is used for generating runs in which the number of comparisons is reduced.
4. In k–way merge with ______, k linked queues are used.
5. In the decode tree, ______ represents a left branch and ______ represents a right branch.
6. The 2-way merge sort principle could be extended to k ordered lists, in which case it is termed as ___________.
7. By using __________, the smallest element can be determined in O(log k) time.