
Advanced Data Structures

REEMA THAREJA
Assistant Professor
Department of Computer Science
University of Delhi

S. RAMA SREE
Professor
Department of Computer Science and Engineering
Aditya Engineering College, Andhra Pradesh


Oxford University Press is a department of the University of Oxford.
It furthers the University's objective of excellence in research, scholarship,
and education by publishing worldwide. Oxford is a registered trade mark of
Oxford University Press in the UK and in certain other countries.

Published in India by
Oxford University Press
Ground Floor, 2/11, Ansari Road, Daryaganj, New Delhi 110002, India

© Oxford University Press 2018

The moral rights of the author/s have been asserted.

First published in 2018

All rights reserved. No part of this publication may be reproduced, stored in
a retrieval system, or transmitted, in any form or by any means, without the
prior permission in writing of Oxford University Press, or as expressly permitted
by law, by licence, or under terms agreed with the appropriate reprographics
rights organization. Enquiries concerning reproduction outside the scope of the
above should be sent to the Rights Department, Oxford University Press, at the
address above.

You must not circulate this work in any other form
and you must impose this same condition on any acquirer.

ISBN-13: 978-0-19-948717-2
ISBN-10: 0-19-948717-0

Typeset in Times New Roman
by Archetype, New Delhi 110063

Printed in India by Magic International (P) Ltd., Greater Noida

Cover image: Login / Shutterstock

Third-party website addresses mentioned in this book are provided
by Oxford University Press in good faith and for information only.
Oxford University Press disclaims any responsibility for the material contained therein.


Preface
Data structures form an important discipline and are among the most heavily researched topics in computer
science today. Organizing or structuring data is crucial for the design and implementation of
efficient algorithms. Any area of computer science that requires efficient problem-solving
techniques relies on the application of appropriate data structures during program development.
A data structure is a way of organizing data that also captures the relationships among the
data elements. There are different kinds of data structures to suit different kinds of applications,
and some are highly specialized for specific tasks. For example, B-trees are particularly well-
suited for the implementation of databases. Similarly, hash tables are widely used in compiler
implementations to look up identifiers. Advanced data structures such as priority queues, AVL
trees, red-black trees, multi-way search trees, and digital search trees provide more capabilities
to organize data. For example, tries are used in spell checkers and in auto-completion of commands,
both based on prefix searching.
All in all, data structures are used in almost every program. Specific data structures are essential
ingredients of many efficient algorithms, and make possible the management of enormous amounts
of data, such as large databases and internet indexing services. Hence, a course on data structures
is offered to help students master the design and applications of various data structures and also
use them in writing effective programs and real-time applications.

About the Book

This book is designed to serve as a textbook for an advanced course in data structures offered to
undergraduate as well as postgraduate students of computer science engineering and information
technology. It can also be used as a reference text and resource by young researchers working on
efficient data storage and related applications for understanding the newly established techniques
of a rapidly growing research field.
The book aims to introduce the complex and advanced concepts of data structures and illustrate
their use in problem solving. It provides a comprehensive introduction to the design and analysis
of advanced algorithms and data structures. Every chapter in this book contains several examples
and programs to impart practically sound knowledge of the concepts. To further enhance and test
the understanding of the subject, a variety of chapter-end exercises are provided.
The overall objective of this book is to give the reader a sound understanding of data structures
and to prepare them for taking up challenging tasks in the area. Efforts have been made to acquaint
the reader with the techniques and applications in the area.
The book in its present form completely caters to the Advanced Data Structures syllabus
requirements of universities like JNTU.


Key Features of the Book

The following are the important features of the book:
• Offers simple explanations of the concepts, supported with diagrams, algorithms, and examples
for easy understanding.
• Provides comprehensive coverage of k-way merge sort, hashing techniques, heaps, AVL
trees, red-black trees, B-trees, B+ trees, Patricia, digital search trees, and multi-way tries.
• Includes algorithms along with examples for each operation on the various data structures.
• Provides numerous solved examples and programs in C++ to illustrate the application and
implementation of concepts.
• Includes plenty of chapter-end exercises such as review questions, multiple-choice questions,
fill in the blanks, and true or false questions (with answers) to help readers revise and practise
the concepts learnt.
• Provides two solved model question papers to help students prepare for their university
examinations.
Organization of the Book
The book is organized into 8 chapters and 1 appendix.
Chapter 1 provides a detailed explanation of external sorting using k-way merging, parallel
operations using buffer handling, run generation, and optimal merging of runs.
Chapter 2 describes static hashing, hash functions including the secure hash function, concepts of
overflow handling, and theoretical evaluation of overflow techniques. The chapter also discusses
dynamic hashing with as well as without the use of directories.
Chapter 3 focuses on binary trees, their traversal schemes, and their representation in memory.
The chapter also discusses the variants of simple binary trees, namely, expression trees and
tournament trees.
Chapter 4 discusses the structure, properties, and applications of binary heaps (or queues). It also
details the underlying concept, structure, properties, implementation, and examples of binomial
heaps (or queues). As two of the major applications of priority queues, the selection problem and
the event simulation problem are also covered.
Chapter 5 deals with efficient binary search trees. Some very important data structures like
optimal binary search trees, AVL trees, and red-black trees are discussed in detail. It exposes
the drawbacks of binary search trees and demonstrates the various operations performed on these
height-balanced trees.
Chapter 6 introduces the concept of multi-way search trees. It discusses its variants, the B tree and
the B+ tree, along with their definitions, properties, operations, and examples in sufficient depth.
Chapter 7 introduces the various types of digital search structures. It discusses digital search trees,
tries, Patricia, binary tries, compressed binary tries with skip fields as well as labeled edges, multi-
way tries, and compressed multi-way tries, along with a detailed description of their operations
such as insertion, deletion, and searching. The chapter ends with an introduction to the applications
of complex tries in internet packet forwarding, IP routing in 1-bit tries, fixed-stride tries, and
variable-stride tries.
Chapter 8 contains a detailed explanation of graphs as a data structure. It discusses the memory
representation, traversal schemes, and applications of graphs in the real world. It also discusses
concepts such as deriving the minimum spanning tree of a given graph using Kruskal's and Prim's
algorithms and finding the shortest path using Dijkstra's, Warshall's, and Bellman-Ford
algorithms.

The book contains an appendix with answers to chapter-end objective-type questions.

Online Resources
To aid teachers and students, the book is accompanied by online resources that are available at
india.oup.com/orcs/9780199487172. The contents of the online resources are as follows.

For Faculty
• Chapter-wise PPTs
• Solutions manual for select chapter-end exercises

For Students
• Projects       • Quizzes       • Additional programs

Reema Thareja
S. Rama Sree


Acknowledgements
The writing of this textbook was a mammoth task for which a lot of help was required from many
people. Fortunately, I have had the wholehearted support of my family, friends, and fellow members
of the teaching staff at Shyama Prasad Mukherji College, New Delhi.
My special thanks would always go to my parents, Mr Janak Raj Thareja and Mrs Usha
Thareja, and my siblings, Pallav, Kimi, and Rashi, who were a source of abiding inspiration and
divine blessings for me. I am especially thankful to my son, Goransh, who has been very patient
and cooperative in letting me realize my dreams. My sincere thanks go to my uncle, Mr B.L.
Theraja, for his inspiration and guidance in writing this book.
Last but not least, I would like to thank the editorial team at Oxford University Press, India
for their help and support.

Reema Thareja

I take this opportunity to acknowledge the people who have been instrumental in the successful
completion of this book.
First and foremost, I would like to show my greatest appreciation to my husband, Mr S.N.S.V.S.C.
Ramesh, for his unfailing understanding, motivation, extended sacrifice, and unconditional love
and support while writing this book. I would also like to thank my children, Sri Hasini and Sri
Harshini, for their cooperation. I would like to express my gratitude to my parents, Sri P. Samba
Siva Rao and Smt. P. Swarupa Rani, and my in-laws, Sri S. Viswanadha Sastry and Smt. S. Lakshmi
Kameswari, for their blessings and encouragement.
I am indebted to Mr N. Sesha Reddy, Chairman, Aditya Group of Educational Institutions, Mr
N. Satish Reddy, Vice Chairman, Aditya Group of Educational Institutions, and Dr M. Sreenivasa
Reddy, Principal, Aditya Engineering College, for extending their guidance and support by giving
valuable feedback.
My thanks and appreciation also go to my colleagues at Aditya Engineering College, Mrs A.
Vanathi, Mr M. Raja Babu, Mrs N. N. S. Eswari, and S. Surekha, for helping me a lot in writing
this book, and to all the people who have willingly helped me out with their abilities.
I am thankful to the editorial team at Oxford University Press, India for their support.

S. Rama Sree

The authors are thankful to the following reviewers for their valuable suggestions and constructive
feedback that helped in making the book better.
Dr Rajya Lakshmi
JNTU Kakinada University College of Engineering, Narsaraopeta
Prof. P. Chakradhar
Rama Chandra Engg College, Eluru


Prof. P. Srinivasa Rao
JNTU Kakinada University College of Engineering, Narsaraopeta
Dr Pullela S.V.V.S.R. Kumar
Aditya College of Engineering, Surampalem
Dr N. Leelavathy
Pragati Engineering College, Surampalem
Prof. K. Rama Rao
Sri Vasavi Engineering College, Tadepalligudem
Dr Guru Kesav Das
Eluru Engineering College, Eluru
Dr M.V. Rama Sundari
Sasi Institute of Technology, Tadepalligudem
Prof. R. Ramesh
KKR and KSR Inst of Technology and Sciences, Guntur
Prof. Ch. Ravi Teja
St. Marys Womens Engg College, Guntur
Prof. N. Eswar Rao
Mittapalli Engineering College, Guntur


Detailed Contents

Preface iv
Acknowledgements vii

1. External Sorting 1
1.1 Introduction 1
1.2 k–way Merge Sort 2
1.2.1 2-way Merge Sort 2
1.2.2 3–way Merge Sort 3
1.2.3 4–way Merge Sort 3
1.2.4 Merging Runs 4
1.2.5 Time Complexity 6
1.3 Buffer Handling for Parallel Operation 7
1.4 Run Generation 15
1.5 Optimal Merging of Runs 17

2. Hashing 27
2.1 Introduction to Static Hashing 27
2.2 Hash Tables 28
2.3 Hash Functions 29
2.4 Different Hash Functions 30
2.4.1 Division Method 30
2.4.2 Multiplication Method 31
2.4.3 Mid-Square Method 31
2.4.4 Folding Method 32
2.5 Secure Hash Functions 32
2.6 Collision Resolution (or Overflow Handling) Techniques 33
2.6.1 Open Addressing Techniques 33
2.6.1.1 Linear Probing 33
2.6.1.2 Quadratic Probing 39
2.6.1.3 Double Hashing 42
2.6.1.4 Rehashing 45
2.6.2 Chaining Technique 49
2.6.3 Comparison between Open Hashing and Closed Hashing Techniques Used for Resolving Collisions 53
2.7 Dynamic Hashing 53
2.7.1 Motivation for Dynamic Hashing 53
2.7.2 Dynamic Hashing (or Extended Hashing) Using Directories 54
2.7.3 Directory-less Dynamic Hashing 57
2.8 Pros and Cons of Hashing 58
2.9 Applications of Hashing 59

3. Trees 63
3.1 Introduction 63
3.1.1 Basic Terminology 63
3.2 Types of Trees 64
3.2.1 General Trees 64
3.2.2 Forests 64
3.2.3 Binary Trees 65
3.2.4 Binary Search Trees 69
3.2.5 Expression Trees 69
3.2.6 Tournament Trees 70
3.3 Creating a Binary Tree from a General Tree 70
3.4 Traversing a Binary Tree 72
3.4.1 Pre-order Traversal 72
3.4.2 In-order Traversal 73
3.4.3 Post-order Traversal 73
3.4.4 Level-order Traversal 74
3.4.5 Constructing a Binary Tree from Traversal Results 74
3.5 Applications of Trees 75

4. Priority Queues (Heaps) 79
4.1 Introduction – Priority Queues 79
4.2 Binary Heaps – Model and Simple Implementation 83
4.2.1 Binary Heap – Structure Property 84
4.2.2 Binary Heap – Order Property 84
4.3 Basic Heap Operations 85
4.3.1 Inserting a New Element in a Binary Heap 85
4.3.2 Deleting an Element from a Binary Heap 87
4.4 Other Heap Operations 88
4.5 Applications of Priority Queues 92
4.5.1 The Selection Problem 92
4.5.2 Event Simulation 93
4.6 Binomial Heaps (or Queues) 94
4.7 Binomial Heap (or Queue) Structure and Implementation 95
4.8 Binomial Queue Operations 96
4.8.1 Creating a New Binomial Heap 96
4.8.2 Finding the Node with Minimum Key 96
4.8.3 Linking and Uniting Two Binomial Heaps 97
4.8.4 Inserting a New Node 99
4.8.5 Extracting the Node with Minimum Key 100
4.8.6 Decreasing the Value of a Node 101
4.8.7 Deleting a Node 101
4.8.8 Lazy Binomial Queue 103
4.9 Comparison between Binary and Binomial Heaps 103

5. Efficient Binary Search Trees 107
5.1 Binary Search Trees 107
5.1.1 Operations on Binary Search Trees 109
5.1.1.1 Searching for a Node in a Binary Search Tree 109
5.1.1.2 Inserting a New Node in a Binary Search Tree 110
5.1.1.3 Deleting a Node from a Binary Search Tree 111
5.2 Optimal Binary Search Tree (OBST) 116
5.3 Self-balancing Binary Search Tree 122
5.4 AVL Trees 122
5.4.1 Operations on AVL Trees 124
5.5 Red-Black Trees 140
5.5.1 Properties and Representation of Red-Black Trees 140
5.5.2 Operations on Red-Black Trees 141
5.5.2.1 Searching for a Node in a Red-Black Tree 141
5.5.2.2 Inserting a Node in a Red-Black Tree 142
5.5.2.3 Deleting a Node from a Red-Black Tree 148
5.5.2.4 Joining and Splitting Red-Black Tree 152
5.5.3 Applications of Red-Black Trees 155

6. Multi-way Search Trees 159
6.1 M-way Search Trees 159
6.1.1 Definition and Properties 159
6.1.2 Searching an M-way Search Tree 160
6.2 B-Trees 161
6.2.1 Definitions 161
6.2.2 Properties of B-trees 161
6.2.3 Number of Elements in a B-tree 161
6.2.4 Searching for an Element in a B-tree 162
6.2.5 Inserting a New Element in a B-tree 163
6.2.6 Deleting an Element from a B-tree 164
6.2.7 Applications of B-trees 172
6.3 B+ Trees 172
6.3.1 Searching a B+ Tree 173
6.3.2 Inserting a New Element in a B+ Tree 173
6.3.3 Deleting an Element from a B+ Tree 174
6.4 2-3 Trees 175
6.4.1 Searching for an Element in a 2-3 Tree 176
6.4.2 Inserting a New Element in a 2-3 Tree 177
6.4.3 Deleting an Element from a 2-3 Tree 178

7. Digital Search Structures 183
7.1 Introduction to Digital Search Trees 183
7.1.1 Operations on Digital Search Trees 184
7.1.1.1 Insertion 184
7.1.1.2 Searching 186
7.1.1.3 Deletion 188
7.2 Binary Tries and Patricia 191
7.2.1 Binary Tries 191
7.2.2 Compressed Binary Trie 192
7.2.3 Patricia 192
7.2.3.1 Searching Patricia 193
7.2.3.2 Inserting into Patricia 195
7.2.3.3 Deleting a Node from Patricia 197
7.3 Multi-way Tries 200
7.3.1 Definition 200
7.3.2 Searching a Trie 203
7.3.3 Sampling Strategies 203
7.3.4 Inserting into a Trie 205
7.3.5 Deletion from a Trie 206
7.3.6 Keys with Different Length 206
7.3.7 Height of a Trie 207
7.3.8 Space Required and Alternative Node Structures 207
7.3.9 Prefix Search and Applications 209
7.3.10 Compressed Tries 211
7.3.11 Compressed Tries with Digit Numbers 211
7.3.11.2 Inserting into a Compressed Trie with Digit Numbers 212
7.3.11.3 Deleting an Element from a Compressed Trie with Digit Numbers 213
7.3.12 Compressed Tries with Skip Fields 214
7.3.12.1 Searching a Compressed Trie with Skip Fields 215
7.3.12.2 Inserting into a Compressed Trie with Skip Fields 215
7.3.12.3 Deleting an Element from a Compressed Trie with Skip Fields 217
7.3.13 Compressed Tries with Labeled Edges 217
7.3.13.1 Searching a Compressed Trie with Labeled Edges 218
7.3.13.2 Inserting into a Compressed Trie with Labeled Edges 218
7.3.13.3 Deleting an Element from a Compressed Trie with Labeled Edges 219
7.4 Tries and Internet Packet (IP) Forwarding 220
7.4.1 IP Routing 220
7.4.2 1-bit Tries 221
7.4.3 Fixed–stride Tries 221
7.4.4 Variable–stride Tries 222
8. Graphs 227
8.1 Introduction 227
8.2 Graph Terminology 228
8.3 Directed Graphs 229
8.3.1 Terminology of a Directed Graph 229
8.3.2 Transitive Closure of a Directed Graph 230
8.4 Bi-connected Components 232
8.5 Representation of Graphs 232
8.5.1 Adjacency Matrix Representation 232
8.5.2 Adjacency List Representation 235
8.6 Graph Traversal Algorithms 237
8.6.1 Breadth-First Search 238
8.6.2 Depth-First Search 241
8.7 Topological Sorting 244
8.8 Minimum Spanning Trees 250
8.8.1 Kruskal's Algorithm 252
8.8.2 Prim's Algorithm 256
8.9 Shortest Path Algorithms 260
8.9.1 Dijkstra's Algorithm 260
8.9.2 Bellman-Ford Algorithm 263
8.9.3 Warshall's Algorithm 267
8.9.4 Floyd-Warshall Algorithm 268
8.10 Applications of Graphs 271
Appendix: Answers to Objective-type Questions 275
Solved Model Question Paper-I 278
Solved Model Question Paper-II 280
About the Authors 282


CHAPTER 1

External Sorting

LEARNING OBJECTIVE

Sorting is one of the major operations used in various applications of computer
science. Sorting refers to the arrangement of elements in a certain order. It is classified
into two types, namely, internal sorting and external sorting. This chapter discusses
the various techniques and algorithms of external sorting.

1.1 INTRODUCTION
The term sorting means arranging the elements of an array so that they are placed in some relevant
order, which may be either ascending or descending. There are two types of sorting: internal sorting
and external sorting. Internal sorting is concerned with the ordering of elements present in a file
stored in the computer's memory. A few of the widely used internal sorting techniques are bubble sort,
insertion sort, selection sort, heap sort, quick sort, merge sort, and radix sort. In many applications,
sorting must be performed on very large files consisting of trillions of records for their efficient
use in the application concerned. The internal sorting techniques are inefficient for sorting these
huge files as they require the entire file to be present in the main memory of the computer, which
is practically impossible. So, the need for external sorting methods arises to sort large files.
External sorting is applied when there is voluminous data to be sorted that cannot fit in the
memory. Because of their large volumes, the files are stored on external storage devices like
magnetic tapes, magnetic disks, etc. The external sorting techniques need to consider the type of
medium on which the files are present. If it is assumed that the files are residing on disks, the term
block refers to the smallest unit of data that can be read from or written to a disk at a particular
instant of time. In general, a block contains several records.
The files present on the disk are read into the internal memory in terms of blocks, one block at
a time. The block of records stored in internal memory is sorted using any of the existing
internal sorting techniques. Each sorted block of records is called a run. Now, the file is
viewed as a collection of runs. The runs are written onto the external storage devices as and when
required. The most popular method for sorting files on external storage devices is external merge
sort, which is efficient in gathering and processing the runs stored on external devices.

1.2 k–WAY MERGE SORT

External merge sort is performed in two phases. The first phase involves run generation and
the second phase involves the merging of runs to form a larger run. The merging of runs is
repeated until a single run is generated, with the sorted file as its outcome. If k runs
are merged at a time, the external merge sort is known as a k–way merge sort. To understand the
k–way merge sort better, let us first look at the 2–way merge sort with an example.

1.2.1 2-Way Merge Sort

The k–way merge sort where k = 2 is a 2–way merge sort. In 2–way merge sort, 2 runs are merged
at a time to generate a single run twice as long. The merging process is repeated until a single
run is generated.
Technique
For easy understanding, consider that there are 6000 records to be sorted and the internal memory
capacity is 500 records. Let Ri,j represent the jth run in the ith pass. The runs generated in the first
pass are R1,1 to R1,12. In the second pass, R1,1 and R1,2 are merged, resulting in run R2,1, which
consists of the sorted list of the first 1000 records. The next two runs, R1,3 and R1,4, are merged,
resulting in R2,2. Likewise, the remaining runs are merged in the second pass, resulting in runs
R2,1 to R2,6. Similarly, in the third pass, R2,1 and R2,2 are merged to form R3,1, and two other runs
are generated, resulting in runs R3,1 to R3,3. In the fourth pass, R3,1 and R3,2 are merged to form run
R4,1. The last run, R3,3, is carried over as it is to R4,2. In the fifth pass, runs R4,1 and R4,2 are merged
to form run R5,1, the final sorted file. The run generation using 2–way merge sort is shown in Fig. 1.1.

Pass 1 (12 runs):  R1,1: 1–500   R1,2: 501–1000   R1,3: 1001–1500   R1,4: 1501–2000   ...   R1,12: 5501–6000
Pass 2 (6 runs):   R2,1: 1–1000   R2,2: 1001–2000   ...   R2,6: 5001–6000
Pass 3 (3 runs):   R3,1: 1–2000   R3,2: 2001–4000   R3,3: 4001–6000
Pass 4 (2 runs):   R4,1: 1–4000   R4,2: 4001–6000
Pass 5 (1 run):    R5,1: 1–6000

Figure 1.1 2–way merge sort


Example 1.1 Consider 6000 records available on a disk which are to be sorted. The internal
memory of the computer can hold only 500 records at a time. The block size of the disk is
100 records. Sort the file using 2–way merge sort.
Solution The steps involved in sorting the file are as follows:
Step 1 Read five blocks of data, i.e., 500 records at a time, from the file residing on the
disk into internal memory. Sort these records using any internal sorting technique to generate 12
runs (runs = 6000 records/500 records). These runs are written back onto the disk after sorting.
Step 2 Merge 2 runs at a time to generate a new run twice as long in the next pass,
until all runs in the pass are processed.
Step 3 Repeat Step 2 until a single run is generated.
Step 4 EXIT
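Step 1 of this procedure can also be expressed in C++, the language used for the programs in
this book. The following is only a minimal sketch under simplifying assumptions: records are
plain integers read from a text file, and the function name generateRuns and the "runN.txt" file
names are illustrative stand-ins for real block-oriented disk I/O, not the book's program.

#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Sketch of Step 1 (run generation): repeatedly fill the internal memory
// (capacity M records), sort it with an internal sort, and write each
// sorted run to its own file on disk.
int generateRuns(const char* inputFile, std::size_t M = 500) {
    std::ifstream in(inputFile);
    std::vector<int> memory;
    memory.reserve(M);
    int runCount = 0, record;

    auto flushRun = [&]() {
        std::sort(memory.begin(), memory.end());           // internal sort
        std::ofstream out("run" + std::to_string(runCount++) + ".txt");
        for (int r : memory) out << r << '\n';             // write run to disk
        memory.clear();
    };

    while (in >> record) {
        memory.push_back(record);
        if (memory.size() == M) flushRun();                // memory is full
    }
    if (!memory.empty()) flushRun();                       // final partial run
    return runCount;   // Example 1.1: 6000 records / 500 = 12 runs
}

With M = 500, a 6000-record input file yields the 12 runs used throughout this example.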

1.2.2 3–Way Merge Sort

The k–way merge sort where k = 3 is a 3–way merge sort. In 3–way merge sort, 3 runs are merged
at a time to generate a single run thrice as long. The merging process is repeated until a single run
is generated. Considering the same Example 1.1 for 3–way merge sort yields the runs as shown
in Fig. 1.2.
In the second pass, R1,1 to R1,3 are merged, resulting in run R2,1, which consists of the sorted list
of the first 1500 records. The next three runs, R1,4, R1,5, and R1,6, are merged, resulting in R2,2.
Likewise, four runs emerge from the second pass, i.e., R2,1 to R2,4. Similarly, in the third pass,
R2,1 to R2,3 are merged to form R3,1. The last run, R2,4, is carried over as it is to R3,2. In the
fourth pass, runs R3,1 and R3,2 are merged to form run R4,1, the sorted output.

Pass 1 (12 runs):  R1,1: 1–500   R1,2: 501–1000   R1,3: 1001–1500   ...   R1,12: 5501–6000
Pass 2 (4 runs):   R2,1: 1–1500   ...   R2,4: 4501–6000
Pass 3 (2 runs):   R3,1: 1–4500   R3,2: 4501–6000
Pass 4 (1 run):    R4,1: 1–6000

Figure 1.2 3–way merge sort

1.2.3 4–Way Merge Sort

The k–way merge sort where k = 4 is a 4–way merge sort. In 4–way merge sort, 4 runs are merged
at a time to generate a single run four times as long. The process is repeated until a single run is
generated. Considering again Example 1.1 for 4–way merge sort yields the runs as shown in Fig. 1.3.
In the first pass, the records are divided into 12 runs of 500 records each, generating runs
R1,1 to R1,12. In the second pass, R1,1 to R1,4 are merged to form R2,1. In the same way, R2,2 and
R2,3 are generated by merging R1,5, R1,6, R1,7, R1,8 and R1,9, R1,10, R1,11, R1,12, respectively. In the
third pass, all the runs (R2,1, R2,2, and R2,3) are merged together to form the final sorted run, i.e., R3,1.

Pass 1 (12 runs):  R1,1: 1–500   R1,2: 501–1000   R1,3: 1001–1500   R1,4: 1501–2000   ...   R1,12: 5501–6000
Pass 2 (3 runs):   R2,1: 1–2000   R2,2: 2001–4000   R2,3: 4001–6000
Pass 3 (1 run):    R3,1: 1–6000

Figure 1.3 4–way merge sort

From the above, we observe that as we increase the number of runs merged at a time, there is a
substantial decrease in the number of passes. To summarize, a k–way merge combines k runs into
one sorted run, where k ≥ 2. The algorithm for k–way merge sort is shown in Fig. 1.4.
k–way_mergesort(File1, M, k)
File1: unsorted file on disk
M: max records that can be stored and sorted in internal memory at a time
k: number of runs to be merged

Step 1: Run Generation
        Repeat until all records in File1 are processed into runs
            Read M records into main memory and sort internally
            Write this sorted sub-list (run) onto disk
        [END OF LOOP]
Step 2: Merging of Runs
        Repeat until all runs are processed
            Merge k runs into one sorted run with size k times the input run size
            Write this single run back onto disk
        [END OF LOOP]
Step 3: If more than 1 run is generated in Step 2 then
            Repeat Step 2
        Else
            The file on the external storage is sorted
Step 4: EXIT

Figure 1.4 Algorithm for k–way merge sort
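The two phases of Fig. 1.4 can be simulated in memory with a short C++ sketch. This is an
illustration only, not the book's program: vectors stand in for disk files, and, for brevity, the
k runs of a group are combined by repeated 2-way merges (a direct k-way merge of the run
fronts is sketched in Section 1.2.4 below).

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Merge two sorted runs into one (2-way merge).
static std::vector<int> merge2(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    std::merge(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
    return out;
}

// In-memory simulation of Fig. 1.4: Step 1 cuts the input into sorted runs
// of at most M records; Steps 2-3 repeatedly merge groups of k runs,
// pass after pass, until a single run remains.
std::vector<int> kWayMergeSort(std::vector<int> file, std::size_t M, std::size_t k) {
    std::vector<std::vector<int>> runs;                       // Step 1
    for (std::size_t i = 0; i < file.size(); i += M) {
        std::vector<int> run(file.begin() + i,
                             file.begin() + std::min(i + M, file.size()));
        std::sort(run.begin(), run.end());
        runs.push_back(run);
    }
    while (runs.size() > 1) {                                 // Steps 2-3
        std::vector<std::vector<int>> next;
        for (std::size_t i = 0; i < runs.size(); i += k) {
            std::vector<int> merged = runs[i];
            for (std::size_t j = i + 1; j < std::min(i + k, runs.size()); ++j)
                merged = merge2(merged, runs[j]);             // combine the group
            next.push_back(merged);
        }
        runs = next;                                          // one pass completed
    }
    return runs.empty() ? std::vector<int>{} : runs[0];
}

With M = 500 and k = 2, 3, or 4, this reproduces the pass structure of Figs 1.1 to 1.3.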

1.2.4 Merging Runs

As the k–way merge sort involves merging of runs in every pass, let us now look at the merging
process in more detail. The simplest merging technique is the 2–way merge, where two sorted runs
are merged into one run. In a 2–way merge, we first identify the smallest record of
each run and place it in the smallest set. The smaller of the two records in the smallest set is the
smallest record overall. This record is put in the output run and removed from the corresponding
run. The next smallest record in the corresponding run is moved to the smallest set. This process
is repeated until the two initial runs are empty. The k–way merge behaves identically, but at each
step the smallest of k records is selected and removed from its run.
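A minimal C++ sketch of this smallest-set technique follows (illustrative names, with in-memory
vectors standing in for runs on disk): at each step the fronts of the non-empty runs form the
smallest set, and k-1 comparisons pick the overall smallest.

#include <cstddef>
#include <vector>

// k-way merge via the "smallest set": pos[i] marks the front of run i;
// each step removes the overall smallest front record and refills the
// smallest set from the same run.
std::vector<int> kWayMerge(const std::vector<std::vector<int>>& runs) {
    std::size_t k = runs.size();
    std::vector<std::size_t> pos(k, 0);   // next unread index in each run
    std::vector<int> output;

    while (true) {
        int best = -1;                    // run holding the smallest front key
        for (std::size_t i = 0; i < k; ++i) {
            if (pos[i] < runs[i].size() &&
                (best == -1 || runs[i][pos[i]] < runs[best][pos[best]]))
                best = static_cast<int>(i);
        }
        if (best == -1) break;            // every run is exhausted
        output.push_back(runs[best][pos[best]++]);
    }
    return output;
}

Merging the three runs of Example 1.2 below with this function reproduces its output run
{1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 17, 18}.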

Example 1.2 Consider 3 runs with 4 records each, as shown in Fig. 1.5. Implement the 3-way
merge technique.

Run 1: 3 5 12 15
Run 2: 2 4 10 17
Run 3: 1 6 8 18

Output Run: 1 2 3 4 5 6 8 10 12 15 17 18

Figure 1.5 Three runs ready to be merged and the output run resulting from the merge

Solution
• Consider the smallest record of each run and add it to the smallest set: {3, 2, 1}.
• Take the smallest record of the smallest set, 1, add it to the output run, and delete it from the
original run. At this point, the output run is {1}. The step-by-step process of merging the
three runs of Fig. 1.5 is as follows:
Step 1: The three records in the smallest set are {3, 2, 1}. Remove the smallest record, 1, from the
third run and put it in the output run: {1}. Move 6 to the smallest set.
Step 2: The three records in the smallest set are {3, 2, 6}. Remove 2 from the second run and append
it to the output run: {1, 2}. Move 4 to the smallest set.
Step 3: The three records in the smallest set are {3, 4, 6}. Remove 3 from the first run and append
it to the output run: {1, 2, 3}. Move 5 to the smallest set.
Step 4: The three records in the smallest set are {5, 4, 6}. Remove 4 from the second run and append
it to the output run: {1, 2, 3, 4}. Move 10 to the smallest set.
Step 5: The three records in the smallest set are {5, 10, 6}. Remove 5 from the first run and append
it to the output run: {1, 2, 3, 4, 5}. Move 12 to the smallest set.
Step 6: The three records in the smallest set are {12, 10, 6}. Remove 6 from the third run and append
it to the output run: {1, 2, 3, 4, 5, 6}. Move 8 to the smallest set.
Step 7: The three records in the smallest set are {12, 10, 8}. Remove 8 from the third run and append
it to the output run: {1, 2, 3, 4, 5, 6, 8}. Move 18 to the smallest set.
Step 8: The three records in the smallest set are {12, 10, 18}. Remove 10 from the second run and
append it to the output run: {1, 2, 3, 4, 5, 6, 8, 10}. Move 17 to the smallest set.
Step 9: The three records in the smallest set are {12, 17, 18}. Remove 12 from the first run and append
it to the output run: {1, 2, 3, 4, 5, 6, 8, 10, 12}. Move 15 to the smallest set.

Step 10: The three records in the smallest set are {15, 17, 18}. Remove 15 from the first run and
append it to the output run: {1, 2, 3, 4, 5, 6, 8, 10, 12, 15}. The first run is now empty, so the merge
proceeds as a 2-way merge instead of a 3-way merge.
Step 11: The two records in the smallest set are {17, 18}. Remove 17 from the second run and append
it to the output run: {1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 17}. Now, the second run is also empty; only the
third run remains non-empty.
Step 12: The record of the last run, {18}, is appended to the output run and the final run
is obtained: {1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 17, 18}.
The result of merging the three initial runs is {1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 17, 18}.

1.2.5 Time Complexity

To analyze the time complexity of external sorting, the three factors influencing the read/write
time from a disk must be known. They are as follows:
• Seek time: It is the time taken to position the read/write heads on the required cylinder, and
it depends on the number of cylinders across which the head moves.
• Latency time: It is the time taken to position the head on the right sector of the disk.
• Transmission time: It is the time taken to transmit the block of data to/from the disk.
To perform the analysis of the complexity of external sort, the following terms are used:
ts = maximum seek time
tl = maximum latency time
trw = time to read or write one block of records
tIO = time to input or output one block (ts + tl + trw)
tIS = time taken to internally sort the records in memory
ntm = time taken to merge n records from input buffers to the output buffer
Here, we assume that every time a block of records is read from or written onto the disk, the
maximum seek and latency times are incurred. This may not be true in reality, but the
assumption simplifies the analysis. The computing time for the various operations performed in
Example 1.1, to sort 6000 records stored on disk, is given in Fig. 1.6.
Operations                                                              Time
Step 1: Read 60 blocks of records from the input file (60tIO),          120tIO + 12tIS
        perform the internal sort with time 12tIS, and write the
        60 blocks back onto the disk (60tIO).
Step 2: Merge runs 1 to 12 of 500 records each, two runs at a time.     120tIO + 6000tm
Step 3: Merge two runs of 1000 records each, with a total of            120tIO + 6000tm
        20 blocks, at a time.
Step 4: Merge two runs of 2000 records each, with a total of            80tIO + 4000tm
        40 blocks.
Step 5: Merge one run of 4000 records with one run of 2000              120tIO + 6000tm
        records.
Total Computing Time = 560tIO + 22000tm + 12tIS

Figure 1.6 Computing times for operations performed in Example 1.1

From Fig. 1.6, it is evident that the computing time for the sorting process depends mainly
on the number of passes made over the data to get it sorted. In the first pass, the data
is divided into runs after an internal sort is carried out on chunks of data with size equal to the
capacity of the internal memory.
The merging of runs requires 3⅔ passes over the entire set of records. In merging, the first
pass is used to merge 12 runs of length 500 records each, yielding 6 runs of length 1000 records
each. The time taken for this merge is 6000tm. The second pass is used to merge the 6 runs, yielding
3 runs of length 2000 records each. The time taken for this merge is 6000tm. In the third pass,
only 2 of the 3 runs are merged, i.e., 2 runs of length 2000 records each are merged, yielding a run
of length 4000 records and leaving another run of length 2000 records. This is counted as 2/3 of a
pass. The time taken for this merge is 4000tm. The fourth pass is used to merge two runs, one run
of length 4000 records and the other of 2000 records, yielding a single run of length 6000
records. The time taken for this merge is 6000tm.
The total number of passes required for sorting is 4⅔ (1 pass for generating runs and 3⅔ passes
for merging the runs). Therefore, the total input and output time is 2 × (4⅔) × 60tIO = 560tIO. The
time required for merging is 3⅔ × 6000tm = 22000tm.
Due to the strong correlation of computing time with the number of passes made over the
data, the complexity of the k–way merge sort is mainly concerned with counting the number
of passes required.
In general, if we start with m runs, a 2–way merge requires ⌈log2 m⌉ + 1 passes to sort the
records. The number of passes can be reduced with the use of a higher-order merge, i.e., a k–way
merge with k ≥ 2. A k–way merge with m runs requires only ⌈logk m⌉ + 1 passes for sorting
the data. However, when we use a higher-order merge, k runs of lengths s1, s2, s3, ..., sk cannot
be merged in O(s1 + s2 + ... + sk) time, because the next record to be output is no longer the
smaller of the two candidate keys of a 2–way merge; instead, all k possibilities, i.e., the k smallest
keys present in the k runs, need to be considered.
The straightforward way to merge k runs is to make k-1 comparisons to determine the next record
to output. The computing time then becomes O((k-1)(s1 + s2 + ... + sk)). As ⌈logk m⌉ merge passes
are made, the total number of key comparisons is n(k-1) logk m = n(k-1) log2 m / log2 k, where n is
the number of records to be sorted. So, (k-1)/log2 k is the factor by which the number of key
comparisons increases.
As the value of k increases, the input/output time will decrease. But beyond a certain value of
k, the input/output time will increase despite the decrease in the number of passes being made.
The optimal value of k depends on disk parameters and the amount of internal memory available
for buffers.
With a higher-order merge, the amount of I/O needed may be reduced with no significant loss in
internal processing speed, but not by as much as the reduction to ⌈logk m⌉ passes indicates.
This is mainly because the number of input buffers needed in a k–way merge increases with k.
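This trade-off is easy to tabulate; the following short, illustrative C++ program (not from the
book) prints the comparison-growth factor (k-1)/log2 k for a few values of k.

#include <cmath>
#include <cstdio>

// Illustrative only: print the factor (k-1)/log2(k) by which the number
// of key comparisons grows when a k-way merge replaces a 2-way merge.
int main() {
    for (int k = 2; k <= 16; k *= 2)
        std::printf("k = %2d  factor = %.2f\n", k, (k - 1) / std::log2(k));
    return 0;
}

It prints 1.00, 1.50, 2.33, and 3.75 for k = 2, 4, 8, and 16: the number of passes shrinks
logarithmically while the per-pass comparison cost grows almost linearly, which is why an
optimal value of k exists.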
1.3 BUFFER HANDLING FOR PARALLEL OPERATION
Now that the k–way merge sorting technique has been introduced in the previous section, let us
discuss in more detail the process of merging runs internally using buffers. In the k–way merge
sorting technique, we need at least k input buffers and 1 output buffer for performing the merge
operation. The major steps involved in the k–way merge sort are:
• Read: Load the records from runs into input buffers.
• Merge: Perform the merge operation on input buffers and store the result in the output buffer.
• Write: Store the records from the output buffer onto the disk.
In the previous section, the read, merge, and write operations were shown to be performed
sequentially. Instead of performing these operations in sequence, we can opt to perform them
in parallel to improve efficiency, i.e., we can perform the read, merge, and write
operations simultaneously. But the k input buffers and 1 output buffer strategy is not sufficient to
handle the parallel operations.
While the output buffer is being written onto the disk, it is not possible to perform the merge
operation as there is no place to collect the merged records. So, we need 2 output buffers, one
for the write operation and one for the merge operation. Similarly, if content is being loaded into
an input buffer, merging will be delayed. This can be handled by taking 2 input buffers per run. If
one input buffer is busy with a read operation, then the other input buffer is used for the merge
operation. Therefore, we require 2k input buffers and 2 output buffers to handle the operations in
parallel, i.e., a double buffering strategy is used. The parallel operations are demonstrated in
Example 1.3.
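The output side of this double buffering strategy can be illustrated with a small, self-contained
C++ toy. The assumptions are ours: the merged stream is a ready-made sorted vector and
"writing to disk" is just printing, so only the buffer-toggling logic is real.

#include <cstddef>
#include <cstdio>
#include <vector>

// Toy simulation of double buffering on the output side: while one output
// buffer is (conceptually) being written to disk, the merge keeps filling
// the other.
int main() {
    std::vector<int> merged = {2, 3, 4, 5, 6, 7, 8, 9};  // pretend merge output
    const std::size_t BLOCK = 2;                         // records per buffer
    std::vector<int> out[2];
    int active = 0;                                      // buffer being filled

    for (int rec : merged) {
        out[active].push_back(rec);                      // "merge" into buffer
        if (out[active].size() == BLOCK) {
            // In a real system this write would proceed in parallel with
            // the next merge step; here we just flush and swap.
            std::printf("write buffer %d: %d %d\n", active,
                        out[active][0], out[active][1]);
            out[active].clear();
            active = 1 - active;                         // switch buffers
        }
    }
    return 0;
}

In a real implementation the write of one buffer is initiated asynchronously while the merge
continues filling the other; the toggle active = 1 - active is the heart of the scheme.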

Example 1.3 Using buffer handling for parallel operations, implement 2–way merge for the
runs shown below. Consider the block size as 2; each buffer can hold 2 records.

Run 1: 2 4 6 8 9 10
Run 2: 3 5 7 16 21 26

Solution As 2–way merge is to be used, 4 input buffers and 2 output buffers are needed.
Let in[1], in[2] represent the input buffers for loading the records from Run1 and
in[3], in[4] represent the input buffers for loading the records from Run2. Let ou[0] and ou[1]
represent the output buffers. Initially, all the input and output buffers are empty.
Step 1: Read–Load in[1] with a block from Run1. The configuration of buffers after the read
operation is shown in Fig. 1.7.

Input buffers for Run1: in[1] = {2, 4}   in[2] = { }     Input buffers for Run2: in[3] = { }   in[4] = { }
Output buffers: ou[0] = { }   ou[1] = { }

Figure 1.7 Configuration of buffers after step 1

Step 2: Read–Load in[3] with a block from Run2. The configuration of buffers after the read
operation is shown in Fig. 1.8.

Input buffers for Run1: in[1] = {2, 4}   in[2] = { }     Input buffers for Run2: in[3] = {3, 5}   in[4] = { }
Output buffers: ou[0] = { }   ou[1] = { }

Figure 1.8 Configuration of buffers after step 2


Step 3: Read–Load in[2] with a block from Run1.
Merge–Perform the merge operation on input buffers in[1] and in[3]. Store the merged records in
ou[0]. The configuration of buffers after the read and merge operations is shown in Fig. 1.9.

Input buffers for Run1: in[1] = {4}   in[2] = {6, 8}     Input buffers for Run2: in[3] = {5}   in[4] = { }
Output buffers: ou[0] = {2, 3}   ou[1] = { }

Figure 1.9 Configuration of buffers after step 3

Step 4: Read–Load in[4] with a block from Run2.
Merge–Perform the merge operation on input buffers in[1], in[2], and in[3]. Store the merged
records in ou[1].
Write–Store the content of ou[0] onto the disk. The configuration of buffers after the read, merge,
and write operations is shown in Fig. 1.10.

Input buffers for Run1: in[1] = { }   in[2] = {6, 8}     Input buffers for Run2: in[3] = { }   in[4] = {7, 16}
Output buffers: ou[0] = { }   ou[1] = {4, 5}

Figure 1.10 Configuration of buffers after step 4
Step 5: Read–Load in[1] with a block from Run1.
Merge–Perform the merge operation on input buffers in[2] and in[4]. Store the merged records in ou[0].
Write–Store the content of ou[1] onto the disk. The configuration of buffers after the read, merge,
and write operations is shown in Fig. 1.11.

Input buffers for Run1: in[1] = {9, 10}   in[2] = {8}     Input buffers for Run2: in[3] = { }   in[4] = {16}
Output buffers: ou[0] = {6, 7}   ou[1] = { }

Figure 1.11 Configuration of buffers after step 5


Step 6: Read–Load in[3] with a block from Run2.
Merge–Perform the merge operation on input buffers in[1], in[2], and in[4]. Store the merged
records in ou[1].
Write–Store the content of ou[0] onto the disk. The configuration of buffers after the read, merge,
and write operations is shown in Fig. 1.12.

Input buffers for Run1: in[1] = {10}   in[2] = { }     Input buffers for Run2: in[3] = {21, 26}   in[4] = {16}
Output buffers: ou[0] = { }   ou[1] = {8, 9}

Figure 1.12 Configuration of buffers after step 6

Step 7: Merge–Perform the merge operation on input buffers in[1], in[3], and in[4]. Store the merged
records in ou[0].
Write–Store the content of ou[1] onto the disk. The configuration of buffers after the merge and write
operations is shown in Fig. 1.13.

Input buffers for Run1: in[1] = { }   in[2] = { }     Input buffers for Run2: in[3] = {21, 26}   in[4] = { }
Output buffers: ou[0] = {10, 16}   ou[1] = { }

Figure 1.13 Configuration of buffers after step 7

After Step 7, merging will be delayed until another record is loaded from Run1 into in[1].
In the above example, even though we used 2 buffers per run, we reached a situation in which
processing has to be held up due to a lack of input records from a run. Simply assigning 2 buffers
per run does not solve the problem. Therefore, input buffers have to be assigned to the runs
cleverly to avoid the input delay, i.e., an individual buffer may be assigned to any run depending
upon need. This type of buffer, which is assigned to a run when needed, is called a floating buffer,
and the k–way merge algorithm using floating buffers is called the buffering algorithm.
In the buffer assignment strategy of floating buffers, at any time there will be at least one
input buffer for each run. The remaining buffers are filled on a priority basis. The following
assumptions are made for the k–way merge with floating buffers (buffering algorithm):
1. For performing the operations of reading from a disk and writing onto a disk simultaneously,
we use two disk drives.
2. While a read or write operation is being processed, the CPU cannot make reference to the same
block of memory, i.e., the same output buffer cannot be used for filling in and writing out.
If this were possible, only one output buffer would be sufficient.
3. Input and output buffers are of the same size.
4. The end of each run has a sentinel record with a very large key, say +∞, and all other records
have key values less than that of the sentinel record.
5. The time to merge into an output buffer equals the time to read a block.
6. In case of equal keys, the run with the smallest index is chosen for the next read operation.
7. Input buffers are queued in k queues, one queue for each run. Empty buffers are placed on
a linked stack.
The process of implementing the k–way merge using floating buffers is demonstrated in detail in
Examples 1.4 and 1.5.
Example 1.4 Implement k–way merge with floating buffers for the runs shown below. Assume
that the last block of each run is loaded with the sentinel record with key +∞. The blocks in
Run1 and Run2 are as follows:

Run 1: 2 4 6 8 9 10 +∞
Run 2: 3 5 7 16 21 26 +∞

Solution As we have considered two runs, k = 2 and there will be 2 linked queues, one for each
run, which are initially empty. There are four floating buffers (b1, b2, b3, and b4) initially placed
on the stack, as shown in Fig. 1.14. Let N be the run from which the next block is to be loaded and
s be the buffer size. Here, s = 2. Let LastKey[i] be the key value most recently placed into the input
buffer from run i, where 1 ≤ i ≤ k. Let N = i where LastKey[i] is minimum; N is the run from which
the next block is to be loaded. In the next load operation, a floating buffer will be popped from the
stack and then loaded with a block from run N. Whenever a floating buffer becomes empty, it is
pushed back onto the stack of floating buffers. Let M represent the sentinel value.

Stack (top to bottom): b1, b2, b3, b4

Figure 1.14 Stack of 4 floating buffers

The process of 2–way merge with floating buffers is shown in Fig. 1.15 and is explained in a
line-by-line manner below.
Line 1: Take two floating buffers (one for each run) from the stack of floating buffers and assign
a buffer to each run. Load the first block from Run1 into floating buffer b1 and the first block from
Run2 into floating buffer b2. LastKey[1] = 4 and LastKey[2] = 5. Here, N = 1, as LastKey[1] is
minimum. The next block is to be loaded from Run1.
Line 2: The minimum keys (2 and 3) in the input buffers are merged into the output buffer and
the next block in Run1 is loaded into the popped buffer b3 and is added to the linked queue at
Run1. Now N = 2, as LastKey[2] = 5 is minimum when compared to LastKey[1]. As the output
buffer is full, its contents are stored onto the disk.
Line 3: The minimum keys (4 and 5) in the input buffers are merged into the output buffer and
the next block in Run2 is loaded into the popped buffer b4 and is added to the linked queue at
Run2. Now N = 1, as LastKey[1] is minimum. The empty buffer b1 is pushed back onto the
stack. As the output buffer is full, its contents are stored onto the disk.
Line 4: The minimum keys (6 and 7) in the input buffers are merged into the output buffer and
the next block in Run1 is loaded into the popped buffer b1 and is added to the linked queue at
Run1. Now N = 1, as LastKey[1] is minimum. The empty buffer b2 is pushed back onto the
stack. As the output buffer is full, its contents are stored onto the disk.
Line 5: The minimum keys (8 and 9) in the input buffers are merged into the output buffer and
the next block M (sentinel record) in Run1 is loaded into the popped buffer b2 and is added to the
linked queue at Run1. Now N = 2, as LastKey[2] is minimum. The empty buffer b3 is pushed
back onto the stack. As the output buffer is full, its contents are stored onto the disk.
Line 6: The minimum keys (10 and 16) in the input buffers are merged into the output buffer
and the next block in Run2 is loaded into the popped buffer b3 and is added to the linked queue
at Run2. Now N = 2, as LastKey[2] is minimum. Finally, the empty buffers b1 and b4 are pushed
back onto the stack. As the output buffer is full, its contents are stored onto the disk.
Line 7: The minimum keys (21 and 26) in the input buffers are merged into the output buffer
and the next block M (sentinel record) in Run2 is loaded into the popped buffer and is added to
the linked queue at Run2. As all the input buffers contain the sentinel record, there is no further
load step. Finally, the empty buffer is pushed back onto the stack. As the output buffer is full,
its contents are stored onto the disk.
Line 8: As the sentinel record is placed in the output buffer, the entire process terminates.

Line   Run 1     Run 2     Output      Next block from
1      2 4       3 5       no output   Run 1
2      4 6 8     5         2 3         Run 2
3      6 8       7 16      4 5         Run 1
4      8 9 10    16        6 7         Run 1
5      10 M      16        8 9         Run 2
6      M         21 26     10 16       Run 2
7      M         M         21 26       no next
8      M         M

Figure 1.15 The status of linked queues for 2 runs
Example 1.5 Implement k–way merge with floating buffers for the runs shown below. Assume
that the last block of each run is loaded with the sentinel record with key +∞. The blocks in
Run1, Run2, and Run3 are as follows:

Run 1: 22 27 28 30 31 32 35 +∞
Run 2: 25 31 36 38 40 62 72 +∞
Run 3: 26 30 33 35 42 45 52 +∞
Solution As we have considered three runs, k = 3 and there will be 3 linked queues, one for
each run, which are initially empty. There are six floating buffers (b1, b2, b3, b4, b5, and b6)
initially placed on the stack, as shown in Fig. 1.16.
  Let N be the run from which the next block is to be loaded and s be the buffer size. Here,
s = 2. Let LastKey[i] be the key value most recently placed into the input buffer from run i,
where 1 ≤ i ≤ k. Let N = i where LastKey[i] is minimum; N is the run from which the next block
is to be loaded.
  In the next load operation, a floating buffer will be popped from the stack and then loaded
with a block from run N. Whenever a floating buffer becomes empty, it is pushed back onto the
stack of floating buffers. Let M represent the sentinel value.

Stack (top to bottom): b1, b2, b3, b4, b5, b6

Figure 1.16 Stack of 6 floating buffers

  The process of 3–way merge with floating buffers is shown in Fig. 1.17 and is explained in a
line-by-line manner below.
Line 1: Take three floating buffers (one for each run) from the stack of floating buffers and assign
a buffer to each run. Load the first block from Run1 into floating buffer b1, the first block from
Run2 into floating buffer b2, and the first block from Run3 into floating buffer b3. LastKey[1] = 27,
LastKey[2] = 31, and LastKey[3] = 30. Here, N = 1, as LastKey[1] is minimum. The next block is
to be loaded from Run1.
Line 2: The minimum keys (22 and 25) in the input buffers are merged into the output buffer
and the next block in Run1 is loaded into a buffer popped from the stack and is added to
the linked queue at Run1. Now N = 1, as LastKey[1] is minimum. (Here, Run1 and Run3 have
equal keys 30, but the run with the smaller index is considered, as per our assumptions.) As
the output buffer is full, its contents are stored onto the disk.
Line 3: The minimum keys (26 and 27) in the input buffers are merged into the output buffer
and the next block in Run1 is loaded into a buffer popped from the stack and is added to the
linked queue at Run1. Now N = 3, as LastKey[3] is minimum. The empty buffer in the linked
queue at Run1 is pushed back onto the stack. As the output buffer is full, its contents are stored
onto the disk.
Line 4: The minimum keys (28 and 30) in the input buffers are merged into the output buffer
and the next block in Run3 is loaded into a buffer popped from the stack and is added to the
linked queue at Run3. Now N = 2, as LastKey[2] is minimum. The empty buffer in the linked
queue at Run1 is pushed back onto the stack. As the output buffer is full, its contents are stored
onto the disk.
Line 5: The minimum keys (30 and 31) in the input buffers are merged into the output buffer
and the next block in Run2 is loaded into a buffer popped from the stack and is added to the
linked queue at Run2. Now set N = 1, as LastKey[1] is minimum. As the output buffer is full,
its contents are stored onto the disk.
Line 6: The minimum keys (31 and 32) in the input buffers are merged into the output buffer
and the next block in Run1 is loaded into a buffer popped from the stack and is added to the
linked queue at Run1. Now set N = 3, as LastKey[3] is minimum. The empty buffer in the linked
queue at Run1 is pushed back onto the stack. As the output buffer is full, its contents are stored
onto the disk.
Line 7: The minimum keys (33 and 35) in the input buffers are merged into the output buffer
and the next block in Run3 is loaded into a buffer popped from the stack and is added to the
linked queue at Run3. Now set N = 2, as LastKey[2] is minimum. As the output buffer is full,
its contents are stored onto the disk.
Line 8: The minimum keys (35 and 36) in the input buffers are merged into the output buffer
and the next block in Run2 is loaded into a buffer popped from the stack and is added to the
linked queue at Run2. Now set N = 3, as LastKey[3] is minimum. The empty buffer in the linked
queue at Run3 is pushed back onto the stack. As the output buffer is full, its contents are stored
onto the disk.
Line 9: The minimum keys (38 and 40) in the input buffers are merged into the output buffer
and the next block in Run3 is loaded into a buffer popped from the stack and is added to the
linked queue at Run3. Now set N = 2, as LastKey[2] is minimum. The empty buffer in the linked
queue at Run2 is pushed back onto the stack. As the output buffer is full, its contents are stored
onto the disk.
Line 10: The minimum keys (42 and 45) in the input buffers are merged into the output buffer
and the next block in Run2 is loaded into a buffer popped from the stack and is added to the
linked queue at Run2. Now, the minimum key of the last loaded records from each run is +∞
(M), which indicates there is no next load operation. The empty buffer in the linked queue at Run3
is pushed back onto the stack. As the output buffer is full, its contents are stored onto the disk.
Line 11: The minimum keys (52 and 62) in the input buffers are merged. As there is no next run
(N value) from the previous step, no load operation is performed and there is no need to update N.
The empty buffer in the linked queue at Run2 is pushed back onto the stack. As the output buffer
is full, its contents are stored onto the disk.
Line 12: The minimum keys (72 and M) in the input buffers are merged into the output buffer.
As the sentinel record is merged into the output, the process terminates with this step.

Line   Run 1         Run 2        Run 3        Output      Next block from
1      22 27         25 31        26 30        no output   Run 1
2      27 28 30      31           26 30        22 25       Run 1
3      28 30 31 32   31           30           26 27       Run 3
4      31 32         31           30 33 35     28 30       Run 2
5      32            31 36 38     33 35        30 31       Run 1
6      35 M          36 38        33 35        31 32       Run 3
7      M             36 38        35 42 45     33 35       Run 2
8      M             38 40 62     42 45        35 36       Run 3
9      M             62           42 45 52 M   38 40       Run 2
10     M             62 72 M      52 M         42 45       no next
11     M             72 M         M            52 62       no next
12     M             M                         72 M

Figure 1.17 Status of linked queues for 3 runs


  The algorithm depicting the process of k–way merge using floating buffers is presented in
Fig. 1.18.

Let k be the no. of runs, s be the buffer size


Step 1: Set up k linked queues, insert the first block of each run in an input buffer
attached to the corresponding queue. The remaining k input buffers are pushed
into a linked stack of empty input buffers. Let the output buffer be represented
as ou.
Step 2: Let LastKey[i] be the key value most recently placed into the input buffer from
run i, where 1 ≤ i ≤ k. Let N = i where LastKey[i] is minimum. N is the run from
which the next block is to be loaded. If LastKey[N] ≠ +∞, then insert the next
block from run N into a free input buffer.
Step 3: Select s minimum keys from all the input buffers, merge them into the output
buffer ou and initiate the writing of ou onto disk. Merging continues until
the output buffer gets full. If an input buffer becomes empty before the output
buffer gets full, the next buffer on the same queue is considered and the empty
buffer is returned to the stack of empty buffers. However, if an input buffer
becomes empty just as +∞ is merged into the output buffer, the empty buffer is
left on the queue. If +∞ is merged into the output buffer, the contents of ou
are copied onto disk, the merge terminates and the control goes to Step 6.
Step 4: Wait for any ongoing disk input/output to complete.
Step 5: Repeat Steps 2 to 4 until there is no input block to be read.
Step 6: EXIT
Fig 1.18 Algorithm for k–way merge with floating buffers
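
As an illustration, the following minimal C++ sketch performs the merge-selection step of a
k–way merge in memory using a min-heap; it deliberately omits the disk reads, floating
buffers, and linked queues of Fig. 1.18, and simply reuses the three runs traced in Fig. 1.17.

#include <iostream>
#include <queue>
#include <utility>
#include <vector>
using namespace std;

// Each heap entry is (key, (run index, position within that run)).
typedef pair<int, pair<int, int> > Entry;

vector<int> kWayMerge(const vector<vector<int> >& runs)
{
    priority_queue<Entry, vector<Entry>, greater<Entry> > heap;
    // Load the first record of every run, as Step 1 loads one block per run.
    for (int r = 0; r < (int) runs.size(); r++)
        if (!runs[r].empty())
            heap.push(make_pair(runs[r][0], make_pair(r, 0)));

    vector<int> output;
    while (!heap.empty())
    {
        Entry e = heap.top(); heap.pop();
        output.push_back(e.first);                 // move the minimum key out
        int r = e.second.first, i = e.second.second + 1;
        if (i < (int) runs[r].size())              // replace it with the next
            heap.push(make_pair(runs[r][i], make_pair(r, i))); // record of the same run
    }
    return output;
}

int main()
{
    vector<vector<int>> runs = {{22, 27, 28, 30, 31, 32, 35},
                                {25, 31, 36, 38, 40, 62, 72},
                                {26, 30, 33, 35, 42, 45, 52}};
    for (int key : kWayMerge(runs))
        cout << key << " ";                        // prints the fully sorted sequence
    cout << endl;
    return 0;
}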

1.4 RUN GENERATION


With the existing internal sorting techniques, it is possible to generate runs whose size is equal
to the number of records that can be held in the internal memory. However, by using a loser
tree, run generation can be performed in a better way. A loser tree can be understood clearly by
first knowing about winner trees. A winner tree is a complete binary tree which helps us obtain
the smallest key from a set of keys, where each node represents the smaller of its two children.
The root holds the smallest value in the entire tree.
The construction of the winner tree may be compared with a tournament tree. In a tournament
tree, each external node represents a player and each internal node represents the winner of the
match played between the players represented by its children. Similarly, each internal node
of the winner tree represents the winner of a tournament and the root node is the overall winner
of the entire tournament. Each internal node holds the smaller value of its two children, and the
overall winner is the key with the smallest value in the entire tree. The keys or records are present
only in the external (leaf) nodes. The internal nodes hold only pointers to the winners of the
tournaments played at the respective levels of the tree. Here, in the k–way merge sort, the leaf
nodes of the winner tree hold the first record from each of the k runs being merged. The
concept of merging using loser trees can be better understood using the following example.

Example 1.6 Consider merging of 4 runs using a winner tree and a loser tree for the runs shown
below.


Run 1   Run 2   Run 3   Run 4
15      10      24      8
25      18      56      12
30      20      68      16
46      93      76      28

Solution To construct the winner tree, the first keys from all the 4 runs are initially loaded
into the external nodes as shown in Fig. 1.19.

Figure 1.19 Winner tree (leaves 15, 10, 24, 8; internal winners 10 and 8; root 8)
Figure 1.20 Winner tree after first record is output (leaves 15, 10, 24, 12; internal winners 10 and 12; root 10)
sit
The tournament is played between sibling nodes and the resulting winner (smallest key) is
placed in the parent node. The root node holds the overall winner of the tournament, which is
the result of the tournament played between the winners of the previous level in the tree. Every
new comparison takes place at every level of the tree, starting from the leaf nodes and moving
towards the root.
In Fig. 1.19, the internal nodes 10 and 8 represent the winners of the respective tournaments
played between 15, 10 and 24, 8. The root node of the winner tree holds the overall winner,
i.e., 8. The value pointed to by the root is stored onto the disk.
When a record is output from the winner tree, the next record to be placed in the external node
is taken from the run from which the smallest record was output in the previous pass.
Hence, in this example, the smallest record at the root, 8, was taken from Run 4, so the next
value to be input is taken from Run 4. The next value in Run 4 is 12. The value 12 is
compared against the other values, which are basically the losers of the previous pass. The winner
tree constructed after the first record is output is shown in Fig. 1.20. The process repeats until all
the elements from the runs are processed.
Another efficient way of generating runs is by using loser trees, in which the number of
comparisons is reduced. The winner tree can be restructured so that instead of placing
pointers to winners in the internal nodes, pointers to losers are placed there. Such trees
are called loser trees. A special node is added on top of the root node to represent the overall
winner of the tournament. The leaf nodes of a loser tree hold the first records of all the runs,
as in the winner tree.
The loser tree for Example 1.6 is shown in Fig. 1.21. The internal nodes 15 and 24 represent
the losers of the respective tournaments played between 15, 10 and 24, 8. The root
node holds the loser of the two winners 10 and 8, i.e., 10. The overall winner node holds the
winner of 10 and 8, i.e., 8.


In the first step, the overall winner is output and therefore the value 8 is copied onto the
disk. The next value is to be taken from Run 4, as the output value 8 came from Run 4. The value
to be input is now 12. This is copied into the leaf node. Once the overall winner is output, the
records against which the input needs to play its tournaments are readily available in the parent
nodes. As a result, the sibling nodes are not accessed while the tournaments are being played.
The loser tree after the first record is output is shown in Fig. 1.22. The value 12 in the leaf node is
compared with its parent 24 and not with its sibling 24. The root node holds the loser
of the nodes 10 and 12, i.e., 12. The overall winner node holds the winner of 10 and 12, i.e., 10.
The process repeats until all the elements from the runs are processed.
Figure 1.21 Loser tree (overall winner 8; root 10; internal losers 15 and 24; leaves 15, 10, 24, 8)
Figure 1.22 Loser tree after the first record is output (overall winner 10; root 12; internal losers 15 and 24; leaves 15, 10, 24, 12)
iv

The time required to generate all the runs for a list of n records using a loser tree with k
leaves is O(n log k), as it takes O(log k) time to adjust the loser tree each time a record is
output.
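
As a sketch of this idea, the following C++ fragment (assuming, for simplicity, that the number
of runs k is a power of two) merges the four runs of Example 1.6 using a winner tree stored in
an array; internal node i records which run won the tournament below it, and INT_MAX stands in
for the sentinel M. A loser tree differs only in that each internal node keeps the loser, which
is why replaying a path needs no sibling accesses.

#include <iostream>
#include <vector>
#include <climits>
using namespace std;

int main()
{
    // The four runs of Example 1.6; INT_MAX plays the role of the sentinel M.
    vector<vector<int>> runs = {{15, 25, 30, 46}, {10, 18, 20, 93},
                                {24, 56, 68, 76}, {8, 12, 16, 28}};
    int k = (int) runs.size();               // assumed to be a power of two
    vector<int> pos(k, 0);                   // next unread index in each run
    vector<int> leaf(k);                     // current key of each run
    for (int r = 0; r < k; r++) leaf[r] = runs[r][0];

    // tree[1..k-1] are internal nodes; each stores the index of the run
    // that won (held the smaller key in) the tournament below it.
    vector<int> tree(2 * k);
    for (int r = 0; r < k; r++) tree[k + r] = r;
    auto winner = [&](int a, int b) { return leaf[a] <= leaf[b] ? a : b; };
    for (int i = k - 1; i >= 1; i--)
        tree[i] = winner(tree[2 * i], tree[2 * i + 1]);

    int total = 0;
    for (int r = 0; r < k; r++) total += (int) runs[r].size();

    // Repeatedly output the overall winner (tree[1]) and replay only the
    // tournaments on the path from the changed leaf up to the root.
    for (int out = 0; out < total; out++)
    {
        int r = tree[1];
        cout << leaf[r] << " ";
        leaf[r] = (++pos[r] < (int) runs[r].size()) ? runs[r][pos[r]] : INT_MAX;
        for (int i = (k + r) / 2; i >= 1; i /= 2)
            tree[i] = winner(tree[2 * i], tree[2 * i + 1]);
    }
    cout << endl;
    return 0;
}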

1.5 OPTIMAL MERGING OF RUNS


When the run sizes are equal, the task of merging is very simple and can be implemented using a
loser tree as discussed in the previous section. But when runs of different sizes are to be merged,
the process of merging is challenging and the main question is how the merging can be optimized.
For merging runs of different sizes, a merge tree is used.
A merge tree is a tree constructed by merging runs of different sizes. In a merge tree, a
circular node is called an internal node and a square node is called an external node; the
external nodes represent the initial runs. The possible ways of merging runs of different sizes
can be demonstrated clearly by considering an example of four runs of lengths 3, 6, 8, and 14.
The runs are merged in two different ways using 2–way merge. The two merge trees generated
are shown in Fig. 1.23(a) and Fig. 1.23(b). In the first merge tree, shown in Fig. 1.23(a), the
merging starts by combining the runs of size 3 and 6, represented as external nodes. The output
is a single run of size 9, represented as an internal node. This run of size 9 is next merged with
the run of size 8 (external node), which results in a run of size 17, represented as an internal
node. This run is finally merged with the run of size 14 (external node) to give a single run of
size 31, represented by the root node.
In the second merge tree, shown in Fig. 1.23(b), the merging first combines the runs of size 3
and 6 (external nodes) to give a run of size 9, represented as an internal node. The next step
merges the remaining external nodes, i.e., the runs of size 8 and 14, to give a run of size 22,
represented as an internal node. Finally, the two internal nodes are merged, i.e., the run of size 9
is merged with the run of size 22, to obtain a single run of size 31 represented by the root node.

Figure 1.23 2–way merge trees: (a) the runs of size 3 and 6 are merged first, then the run of size 8, and finally the run of size 14; (b) the runs of size 3 and 6 and the runs of size 8 and 14 are merged pairwise, and the two results are then merged

Now the question is which of the two is optimal, i.e., which type of merge has the minimum
merging time. This can be easily determined by calculating the weighted external path length. The
weighted external path length (WEPL) of a tree T is the total merge time, calculated by adding up
the product of the weight of each external node and the depth of that external node from the root.
The weight of an external node is the run size. The depth of an external node is the length of the
path from the root node to the external node. The WEPL can be calculated using the following
equation:

WEPL(T) = Σ (weight of external node i) × (depth of node i from the root)


For Fig. 1.23(a),
WEPL(T) = 3*3 + 6*3 + 8*2 + 14*1
        = 57
For Fig. 1.23(b),
WEPL(T) = 3*2 + 6*2 + 8*2 + 14*2
        = 62
The cost of a k–way merge of n runs of lengths qi, where 1 ≤ i ≤ n, is minimized by using a merge
tree of degree k that has the minimum WEPL. Applying this to Fig. 1.23(a) and
Fig. 1.23(b), we observe that Fig. 1.23(a) has the minimum WEPL value of 57. Therefore, the
merging technique followed in Fig. 1.23(a) is considered the optimal merge.
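
The optimal 2–way merge pattern can also be computed without drawing the merge trees, by
repeatedly merging the two smallest runs; this is the same greedy rule that the Huffman
algorithm, discussed next, formalizes. The short C++ sketch below applies it to the run sizes
3, 6, 8, and 14 of Fig. 1.23 and prints 57, the minimum WEPL obtained above.

#include <iostream>
#include <queue>
#include <vector>
using namespace std;

int main()
{
    // The four run sizes of Fig. 1.23.
    priority_queue<long long, vector<long long>, greater<long long>> pq;
    long long sizes[] = {3, 6, 8, 14};
    for (long long s : sizes) pq.push(s);

    long long totalCost = 0;
    while (pq.size() > 1)
    {
        long long a = pq.top(); pq.pop();   // smallest run
        long long b = pq.top(); pq.pop();   // next smallest run
        totalCost += a + b;                 // merging them moves a + b records
        pq.push(a + b);                     // the merged run re-enters the pool
    }
    cout << "Minimum merge cost (WEPL) = " << totalCost << endl;  // prints 57
    return 0;
}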


Another way of finding a binary tree with minimum WEPL is by using Huffman codes. Consider
the messages M1, M2, ..., Mn+1 for which we have to derive the optimal set of codes. Each code
is a binary string used for transmission of the message. At the receiver end, the code is decoded
using a decode tree, which is a binary tree where the external nodes represent the messages.
The binary bits in the code of a message determine the branching needed at each level to reach
the correct external node. Let '0' represent a left branch and '1' represent a right branch of
the decode tree, as shown in Fig. 1.24. The codes 00, 01, and 1 represent the messages M1, M2,
and M3 respectively. These codes are called Huffman codes.
The size of message Mi is directly proportional to the number of bits in its code, and the cost
of decoding a code word is the same as the WEPL. It follows that the decoding time can be
minimized by constructing a decode tree with minimum WEPL.

Figure 1.24 Decode tree (a '1' from the root reaches M3; a '0' reaches an internal node from which '0' leads to M1 and '1' to M2)
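
The decoding procedure itself is straightforward to sketch. The following C++ fragment
hard-codes the decode tree of Fig. 1.24 (the DecodeNode type is introduced here purely for
illustration) and decodes the bit string 01100 into M2, M3, M1.

#include <iostream>
#include <string>
using namespace std;

// Illustrative node type: '0' follows the left branch and '1' the right
// branch until a message (external node) is reached.
struct DecodeNode
{
    const char *message;                    // non-null only at external nodes
    DecodeNode *left;
    DecodeNode *right;
};

int main()
{
    // The tree of Fig. 1.24: codes M1 = 00, M2 = 01, M3 = 1.
    DecodeNode m1 = {"M1", NULL, NULL};
    DecodeNode m2 = {"M2", NULL, NULL};
    DecodeNode m3 = {"M3", NULL, NULL};
    DecodeNode inner = {NULL, &m1, &m2};    // reached by a leading '0'
    DecodeNode root  = {NULL, &inner, &m3};

    string bits = "01100";                  // encodes M2, M3, M1
    DecodeNode *cur = &root;
    for (size_t i = 0; i < bits.size(); i++)
    {
        cur = (bits[i] == '0') ? cur->left : cur->right;
        if (cur->message != NULL)           // external node: emit and restart
        {
            cout << cur->message << " ";
            cur = &root;
        }
    }
    cout << endl;
    return 0;
}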


Huffman worked on this problem and came up with a solution: construct a binary tree
with minimum weighted external path length, called the Huffman tree. The Huffman algorithm
is applied for constructing the Huffman tree. The two main activities done using the Huffman
algorithm are constructing a Huffman tree and finding the Huffman codes by traversing the nodes
from the root to the leaf nodes of the Huffman tree. In the Huffman tree, the leaf nodes represent
the characters and the internal nodes represent the intermediary values. The process of constructing
a Huffman tree and finding the Huffman codes is demonstrated in Example 1.7.

Note HUFFMAN TREE Huffman coding is an entropy encoding algorithm developed by David A. Huffman
that is widely used as a lossless data compression technique. The Huffman coding algorithm uses a variable-
length code table to encode a source character, where the variable-length code table is derived on the basis of
the estimated probability of occurrence of the source character.
The key idea behind the Huffman algorithm is that it encodes the most common characters using shorter strings of
bits than those used for less common source characters.
The algorithm works by creating a binary tree of nodes that are stored in an array. A node can be either a leaf
node or an internal node. Initially, all the nodes in the tree are at the leaf level and store the source character
and its frequency of occurrence (also known as weight).
While an internal node is used to store the weight and contains links to its child nodes, an external node contains
the actual character. Conventionally, a '0' represents following the left child and a '1' represents following the
right child. A finished tree that has n leaf nodes will have n – 1 internal nodes.
The running time of the algorithm depends on the length of the paths in the tree. So, before going into further
details of Huffman coding, let us first learn how to calculate the length of the paths in the tree. The external path
length of a binary tree is defined as the sum of all path lengths summed over each path from the root to an external
node. The internal path length is defined in the same manner, as the sum of all path lengths summed over each path
from the root to an internal node. Look at the binary tree given in Fig. 1.25.

Figure 1.25 Binary tree

The internal path length, LI = 0 + 1 + 2 + 1 + 2 + 3 + 3 = 12
The external path length, LE = 2 + 3 + 3 + 2 + 4 + 4 + 4 + 4 = 26
Note that LI + 2 * n = 12 + 2 * 7 = 12 + 14 = 26 = LE
Thus, LI + 2n = LE, where n is the number of internal nodes. Now, if the tree has n external nodes and each external
node is assigned a weight, then the weighted path length P is defined as the sum of the weighted path lengths:
P = W1L1 + W2L2 + … + WnLn
where Wi and Li are the weight and path length of an external node Ni.
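
The relation LE = LI + 2n is easy to check mechanically. The small C++ sketch below walks an
arbitrary example tree (not the tree of Fig. 1.25), treating the null links as the external
nodes of the extended binary tree, and verifies the identity.

#include <iostream>
using namespace std;

struct Node { Node *left, *right; };

int n = 0, LI = 0, LE = 0;

// Accumulate depths: null links are external nodes, real nodes are internal.
void walk(Node *t, int depth)
{
    if (t == NULL) { LE += depth; return; }  // external node
    n++; LI += depth;                        // internal node
    walk(t->left, depth + 1);
    walk(t->right, depth + 1);
}

int main()
{
    // An arbitrary tree with 4 internal nodes.
    Node d = {NULL, NULL};
    Node b = {&d, NULL};
    Node c = {NULL, NULL};
    Node a = {&b, &c};
    walk(&a, 0);
    cout << "LI = " << LI << ", LE = " << LE
         << ", LI + 2n = " << LI + 2 * n << endl;  // LE equals LI + 2n
    return 0;
}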

Example 1.7 Consider six characters q1 to q6 with the following values for each node.
q1 = 4, q2 = 6, q3 = 8, q4 = 9, q5 = 15, q6 = 28. Construct a Huffman tree and find the code for each
character.
Solution Step 1: Create six leaf nodes for the six characters as shown below.

4 6 8 9 15 28

Step 2: Select the two minimum nodes from the given character list. In this example, the values
of q1 and q2 are minimum. Construct a Huffman tree as shown in Fig. 1.26 with q1 as the left
child and q2 as the right child. Calculate the internal node value as the sum of the left and
right child, i.e., 10. Replace q1 and q2 in the list with this value 10. The list now contains
10 (the internal node value in Fig. 1.26), q3 = 8, q4 = 9, q5 = 15, and q6 = 28.

Figure 1.26 Binary tree with nodes q1 and q2 (root 10; children 4 and 6)
Figure 1.27 Binary tree with nodes q3 and q4 (root 17; children 8 and 9)

Step 3: Consider the next two minimum nodes from the list. The values of q3 and q4 are minimum.
Construct a tree with q3 as left child and q4 as right child. Create an internal node with value as
the sum of the values of left and right child, i.e., 17 as shown in Fig. 1.27. Replace q3 and q4 in
the list with this value 17. The list now contains 10, 17, q5 = 15, and q6 = 28.
Step 4: Consider the next two minimum nodes from the list. The values 10 and q5 are minimum.
Construct the tree with the internal node 10 as left child and q5 as right child. Create an internal
node with value as the sum of the values of left and right child, i.e., 25 as shown in Fig. 1.28.
Replace 10 and q5 in the list with this value 25. The list now contains 17, 25, and q6 = 28.

Step 5: Consider the next two minimum nodes from the list. The values 17 and 25 are minimum.
Construct the tree with the internal node 17 as the left child and 25 as the right child. Create
an internal node with value as the sum of the values of the left and right child, i.e., 42, as
shown in Fig. 1.29. Replace 17 and 25 in the list with this value 42. The list now contains
42 and q6 = 28.

Figure 1.28 Binary tree after q5 is inserted (root 25; children 10 and 15; node 10 has children 4 and 6)

Step 6: Consider the next two minimum nodes from the
list. The values 28 and 42 are minimum. Construct the tree with the internal node 42 as the
right child and 28 as the left child. Create an internal node with value as the sum of the
values of the left and right child, i.e., 70, as shown in Fig. 1.30. Replace 28 and 42 in
the list with this value 70. The list now contains 70.
Consider the next two minimum nodes from the list. The list now consists of only one value, so
the process stops, and the final binary tree constructed is the Huffman tree shown in Fig. 1.30.
Figure 1.29 Binary tree after nodes 17 and 25 are inserted (root 42; left child 17 with children 8 and 9; right child 25 with children 10 and 15; node 10 has children 4 and 6)
Figure 1.30 Huffman tree (root 70; left child 28; right child the subtree of Fig. 1.29 rooted at 42)

The Huffman tree in Fig. 1.30, constructed using the Huffman algorithm, has the minimum WEPL.


The WEPL = 4*4 + 6*4 + 15*3 + 8*3 + 9*3 + 28*1
         = 16 + 24 + 45 + 24 + 27 + 28
         = 164
Considering that '0' represents the left child and '1' represents the right child of the Huffman
tree, as shown in Fig. 1.31, the Huffman code for each character is obtained by traversing the
Huffman tree from the root node to that character's leaf node. The Huffman codes are q1 = 1100,
q2 = 1101, q3 = 100, q4 = 101, q5 = 111, q6 = 0.

Figure 1.31 Huffman tree with codes (the tree of Fig. 1.30 with every left branch labelled 0 and every right branch labelled 1)

The algorithm to construct a Huffman tree is shown in Fig. 1.32.

Step 1: Use the given characters in the list and create a leaf node for each character.
Step 2: Select the two nodes with minimum values in the given list of characters.
Step 3: Construct a binary tree by creating a new internal node whose value is the sum
of the two minimum values from the list. The minimum value of the list is taken as the
left child and the next minimum value as the right child. Replace the two selected
nodes in the list with the internal node value.
Step 4: Repeat Steps 2 and 3 until the list has only one value.
Step 5: EXIT
Figure 1.32 Algorithm for constructing Huffman tree

The time complexity for constructing a Huffman tree is O(n log n), where n is the number of
unique characters.

Programming Example
1. Write a C++ program to construct a Huffman tree.
#include<iostream>
#include<vector>
#include<string>
#include<climits>
using namespace std;
class Huffman;
//A tree node: leaf nodes store a character and its value (weight);
//internal nodes store the sum of their children's values
class node
{
    node *Left;
    node *Right;
    double value;
    string character;
    string Code;
    friend class Huffman;
};
class Huffman
{
public:
    node SetHuffmanTree();
    void BFS(node *temproot, string s);
    void SetHuffmanCode();
    vector<node> V;
    //V records all the nodes created in the whole process
    node extractMin()
    {
        //Remove and return the node with the minimum value in V
        double temp = (double) INT_MAX;
        vector<node>::iterator i, pos;
        //Declaring iterators to access vector elements
        for(i = V.begin(); i != V.end(); i++)
        {
            if(temp > (*i).value)
            //comparing values
            {
                pos = i;
                temp = (*i).value;
            }
        }
        node tempNode = (*pos);
        V.erase(pos);
        return tempNode;
    }
};
node Huffman::SetHuffmanTree()
{
    while(!V.empty())//Checking whether vector is empty or not
    {
        node *InternalNode = new node;
        node *tempNode1 = new node;
        node *tempNode2 = new node;
        *tempNode1 = extractMin(); //Finding first minimum element
        *tempNode2 = extractMin(); //Finding second minimum element
        InternalNode->Left = tempNode1;
        InternalNode->Right = tempNode2;
        //create new internal node by adding the two minimum values
        InternalNode->value = tempNode1->value + tempNode2->value;
        V.push_back(*InternalNode);
        if(V.size() == 1)//only the root node exists
        {
            break;
        }
    }
    //Returning root node
    return V[0];
}
//Recursively traverse the tree: append '0' for a left branch and '1' for a
//right branch, and print the code at every character (leaf) node
void Huffman::BFS(node *Ptrroot, string str)
{
    if(Ptrroot == NULL)
    {
        cout<<"Elements not in vector";
        return;
    }
    node *root1 = Ptrroot;
    root1->Code = str;
    if(root1->Left == NULL && root1->Right == NULL)
    {
        cout<<" "<<root1->character<<"\t\t"<<root1->Code<<endl;
    }
    else
    {
        BFS(root1->Left, str + "0");  //Appending 0 for left child
        BFS(root1->Right, str + "1"); //Appending 1 for right child
    }
}
void Huffman::SetHuffmanCode()
{
    int size, i;
    double F;
    string tempString = "";
    cout<<"Enter the number of characters to encode"<<endl;
    cin>>size;
    for(i = 0; i < size; i++)
    {
        cout<<"Enter character to be encoded and its value"<<endl;
        node LeafNode;
        cin>>tempString;
        cin>>F;//Reading the value (weight)
        LeafNode.value = F;
        LeafNode.character = tempString;
        LeafNode.Left = NULL;
        LeafNode.Right = NULL;
        V.push_back(LeafNode);
    }
    node root = SetHuffmanTree();//build the Huffman tree
    cout<<"Character"<<"\t"<<"Code"<<endl;
    BFS(&root, "");//traverse the tree and print the codes
}
int main()
{
    Huffman h;
    h.SetHuffmanCode();
    return 0;
}

Output
Enter the number of characters to encode
6
Enter character to be encoded and its value
q1 4
Enter character to be encoded and its value
q2 6
Enter character to be encoded and its value
q3 8
Enter character to be encoded and its value
q4 9


Enter character to be encoded and its value


q5 15
Enter character to be encoded and its value
q6 28
Character Code
q6 0
q3 100
q4 101
q1 1100
q2 1101
q5 111

POINTS TO REMEMBER
• External sorting is a term for a class of sorting algorithms that can handle massive amounts
  of data.
• In external sorting, each of the sorted blocks of records is called a run.
• k–way merge algorithms are a specific type of sequence merge algorithms that specialize in
  taking in multiple sorted lists and merging them into a single sorted list.
• In k–way merge, read, write and merge operations are performed in serial. In order to improve
  efficiency, these operations are carried out in parallel by using buffer handling.
• The input delay in buffer handling can be overcome with the help of floating buffers.
• A winner tree is a complete binary tree which helps us to obtain the smallest key from a set
  of keys, where each node represents the smaller of its two children.
• The weighted external path length (WEPL) of a tree T is the total merge time, calculated by
  adding the product of the weight of each external node and the depth of that external node
  from the root node.
• Using the Huffman algorithm, a Huffman tree with minimum WEPL can be constructed.
EXERCISES
Review Questions
1. Write an algorithm for k–way merge sort.
2. Explain the step-by-step process of merging runs in k–way merge sort.
3. Write an algorithm for k–way merge sort with floating buffers.
4. Explain the importance of loser trees in run generation.
5. Explain the steps involved in constructing a Huffman tree.
6. Illustrate the importance of buffer handling for parallel operations in k–way merge sort.
7. Elaborate the usage of the Huffman tree in the process of optimal merging of runs in k–way
   merge sort.
8. Consider an example of six characters q1 to q6 with the following values of each node:
   q1 = 5, q2 = 9, q3 = 12, q4 = 13, q5 = 16. Construct the minimum binary tree using the
   Huffman algorithm.

Programming Exercises
1. Write a program to implement a Huffman tree.
2. Write a program to implement a loser tree.

Multiple-choice Questions
1. In external sorting, the number of runs that can be merged in every pass is called:
   (a) Degree of merging (b) Degree of passing
   (c) Degree of sorting (d) Degree of runs
2. Files that can fit in the available buffer space in a phase of external sorting must be
   read into:
   (a) Multilevel indexes (b) Search nodes
   (c) Main memory (d) Processing unit
3. The procedure of sorting algorithms for larger records that do not fit in main memory and
   are stored on disk is classified as:
   (a) Parser sorting (b) External sorting
   (c) Internal sorting (d) Secondary sorting
4. Files from which algorithms are selected for records are called:
   (a) Scan files (b) Spanned files
   (c) Non dense files (d) Indexed files
5. In the sort merge strategy, the small sorted sub-files are called:
   (a) Runs (b) Folders
   (c) Blocks (d) Indexes
6. Finding the location of a given item in a collection of items is called:
   (a) Discovering (b) Finding
   (c) Searching (d) Mining
7. Which of the following is an external sorting technique?
   (a) Insertion sort (b) Bubble sort
   (c) Merge sort (d) Tree sort
8. Which of the following is an internal sorting technique?
   (a) Tape sort (b) 2–way merge sort
   (c) Merge sort (d) Tree sort
9. Merging k sorted tables into a single sorted table is called:
   (a) k–way merging (b) kth merge
   (c) k+1 merge (d) k-1 merge
10. The function used to modify the way of sorting the keys of records is called:
    (a) Indexing function (b) Hash function
    (c) Addressing function (d) All of the above
11. What is the time complexity of Huffman coding?
    (a) O(N) (b) O(N(logn)^2)
    (c) O(nlogn) (d) O(N^2)
12. Which of the following is not a correct statement?
    (a) Internal sorting is used if the number of items to be sorted is very large
    (b) External sorting is used if the number of items to be sorted is very large
    (c) External sorting needs auxiliary storage
    (d) Internal sorting needs no auxiliary storage
13. What is the basic algorithm used for external sorting?
    (a) finding the median (b) merging
    (c) selection (d) all of the above

True or False
1. k–way merge is used for external sorting.
2. The number of merge passes over the runs can be increased by using a higher-order merge
   than 2–way merge.
3. In k–way merge with buffer handling, assigning two buffers per run solves the problem of
   input delay.
4. Loser trees are also called tournament trees.
5. A winner tree is a tree in which every internal node at level i is the winner of the
   tournament played by the child nodes at level i-1.
6. Sorting is always performed on the elements stored in primary memory.
7. The process of sorting a list stored in a file in secondary memory is known as internal
   sorting.
8. The time required to read or write is considered to be significant in evaluating the
   performance of external sorting.

Fill in the Blanks
1. k–way merge on m runs requires ______ passes over the data.
2. In k–way merge with buffer handling, for parallel operations ______ input buffers and
   ______ output buffers are used.
3. ______ is used for generating runs in which the number of comparisons is reduced.
4. In k–way merge with ______, k linked queues are used.
5. In the decode tree, ______ represents a left branch and ______ represents a right branch.
6. The 2-way merge sort principle could be extended to k ordered lists, in which case it is
   termed as ___________.
7. By using __________ the smallest element can be determined in O(log k) time.