DSA - Unit VI File Organization
DSA - Unit VI File Organization
File Management
A file management system is used for file maintenance (or management) operations. It
is a type of software that manages data files in a computer system.
File management, in its simplest form, refers to the process of organizing and
controlling data files stored on the cloud, or a device, such as a computer or
smartphone. Virtual filing systems involve the way files are named, stored, retrieved,
and organized.
File Handling
File Handling is the storing of data in a file using a program. In C programming
language, the programs store results, and other data of the program to a file using file
handling in C. Also, we can extract/fetch data from a file to work with it in the
program.
File handling in C enables us to create, update, read, and delete the files stored on the
local file system through our C program. The following operations can be performed
on a file.
o Creation of the new file
o Opening an existing file
o Reading from the file
o Writing to the file
o Deleting the file
o Moving to a specific location in a file
o Closing a file
What is File?
A file a container in a computer system that stores data, information, settings, or commands,
which are used with a computer program. In graphical user interface (GUI), such as
Microsoft operating systems, represent the files as icons, which associate to the program that
opens the file. For instance, the picture is shown as an icon;
Mode Description
Grade 1/4
Records:
A collection of related fields treated as a single as a single unit is called a record. For
example a employee‟s record includes a set of fields that contains Employer Number,
Employee Name, Grade and designation etc.
Files:
A collection of related records treated as a single unit is called file. File is also known as data
set. Files are stored in disk like hard disk, CD-ROM or DVD-ROM etc. A Student file may
contain the records of hundreds of students. Each student‟s record consists of same fields but
each field contains different data.
Types of file:
1) Text File: A text file is the one in which data is stored in the form of ASCII characters
and is normally used for storing a stream of characters. Text files are organized around lines,
each of which ends with a newline character („\n‟). The source code files are themselves text
files.
2) Binary File
A binary file is the one in which data is stored in the file in the same way as it is stored in the
main memory for processing. It is stored in binary format instead of ASCII characters. It is
normally used for storing numeric information (int, float, double). Normally a binary file can
be created only from within a program and its contents can be read only by a program.
File Organization:
File organization ensures that records are available for processing. It is used to determine an
efficient file organization for each base relation.
File organization defines the way the data record is stored and retrieved from the files. File
organization provides the data record relationship specifying how the data records are
mapped to the disk blocks. We can either store fixed length record in the files or use flexible
file structure to support data records of varying length.
There are multiple ways of file organization such as sequential, direct, indexed sequential and
others. We can select the file organization type based on the use case, information retrieval
performance, ease of information management and others.
Given below are the most commonly used file organizations:
Sequential files – The data records are stored in a pre-defined order. For instance, the data
records are sequenced based on the key values. In this file organization type we need to
sequentially access the record.
Relative files – The data record is stored at a fixed location that can be accessed using a
static key value. As we can access the data record using a fixed value, we can retrieve or
insert the data record randomly in this file organization type.
Direct files – Similar to the relative file that supports non-integer key value.
Indexed sequential files – In this file organization we add an index to provide the random
access to the sequential files.
Indexed files – The data records are indexed to improve the performance of the data
retrieval. Figure 1 provides various file organizations that are most commonly used.
This file organization is useful for immediate access to large amount of information. It is
used in accessing large databases.
It is also called as hashing.
Types of indexing
1. Primary Index
In primary indexing, the data file is sorted or clustered based on the primary key as
shown in the below figure.
An index file (also known as the index table) is created alongside the data file.
The index file contains pairs of primary key values and pointers to the corresponding
data records.
Each entry in the index file corresponds to a block or page in the data file.
Advantages
1. Gives quick access to records, particularly for Small datasets.
2. Effective for range searches since each key value has an entry.
Disadvantages
1. Can be memory-intensive and may require a significant amount of storage space.
2. Insertions and deletions result in a higher maintenance overhead because the index must
be updated more frequently.
Sparse Indexing:
Sparse indexing involves having fewer index entries than records. The index entries point
to blocks of records rather than individual records. While it reduces storage overhead, it
may require additional disk accesses during retrieval.
OR
Sparse index contains an index entry only for some records. In the place of pointing to all
the records in the main table index points records in a specific gap. This indexing helps you
to overcome the issues of dense indexing in DBMS.
No of Index Entry = No of Block
Advantages
1. Uses less storage space than thick indexes, particularly for large datasets.
2. Lessens the effect that insertions and deletions have from index maintenance
operations.
Disadvantages
1. Since there may not be an index entry for every key value, access may involve
additional steps.
2. might not be as effective as dense indexes for range queries.
The index size is larger in dense index. In sparse index, the index size is smaller.
Time to locate data in index table is less. Time to locate data in index table is more.
There is more overhead for insertions and Sparse indexing have less overhead for
deletions in dense index. insertions and deletions.
Records in dense index need not to be In case of sparse index, records need to be
clustered. clustered.
Computing time in RAM (Random access In sparse index, computing time in RAM is
memory) is less with dense index. more.
Data pointers in dense index point to each In sparse index, data pointers point to
record in the data file. fewer records in data file.
2. Secondary Index
Secondary indexing is a database management technique used to create additional
indexes on data stored in a database. The main purpose of secondary indexing is to
improve the performance of queries and to simplify the search for specific records
within a database. A secondary index provides an alternate means of accessing data in a
database, in addition to the primary index. The primary index is typically created when
the database is created and is used as the primary means of accessing data in the
database. Secondary indexes, on the other hand, can be created and dropped at any
time, allowing for greater flexibility in managing the database.
Benefits
Improved Query Performance: Secondary indexes can improve the performance of
queries by reducing the amount of data that needs to be scanned to find the desired
records.
Flexibility: Secondary indexes provide greater flexibility in managing a database, as
they can be created and dropped at any time.
Simplified Search: Secondary indexes simplify the search for specific records within a
database, making it easier to find the desired data.
Reduced Data Storage Overhead: Secondary indexes use a compact data structure
that requires less space to store compared to the original data.
Types of Secondary Indexes
B-tree Index: A B-tree index is a type of index that stores data in a balanced tree
structure. B-tree indexes are commonly used in relational databases and provide
efficient search, insert, and delete operations.
Hash Index: A hash index is a type of index that uses a hash function to map data to a
specific location within the index. Hash indexes are commonly used in non-relational
databases, such as NoSQL databases, and provide fast access to data.
Bitmap Index: A bitmap index is a type of index that uses a bitmap to represent the
data in a database. Each bit in the bitmap represents a specific record in the database,
and the value of the bit indicates whether the record is present or not. Bitmap indexes
are commonly used in data warehousing and business intelligence applications, as they
provide efficient access to large amounts of data.
When to Use Secondary Indexing
Secondary indexing should be used in database management systems when there is a need
to improve the performance of data retrieval operations that search for data based on
specific conditions. Secondary indexing is particularly useful in the following scenarios:
Queries with Complex Search Criteria: Large Data Sets
Frequently Accessed Data:
Sorting and Aggregating Data:
Data Structure:
3. Clustering Index
What is Clustering Indexing?
Clustering indexing is a database indexing technique that is used to physically arrange the
data in a table based on the values of the clustered index key. This means that the rows in
the table are stored on disk in the same order as the clustered index key. With a clustered
index, the database can more efficiently retrieve data because it doesn‟t have to scan the
entire table to find the data it needs. Instead, it can use the clustered index to quickly locate
the data, resulting in faster query execution times and improved overall performance.
Advantages
Improved Query Performance: Clustering indexing results in faster query
performance, as the data is stored in a way that makes it easier to retrieve the desired
information. This is because the index is built based on the clustered data, reducing the
number of disk I/Os required to retrieve the data.
Reduced Disk Space Usage: Clustering indexing reduces the amount of disk space
required to store the index. This is because the index contains only the information
necessary to retrieve the data, rather than storing a copy of the data itself.
Better Handling of Complex Queries: Clustering indexing provides better
performance for complex queries that involve multiple columns. This is because the
data is stored in a way that makes it easier to retrieve the relevant information.
Streams in C++ :-
We give input to the executing program and the execution program gives back the output.
The sequence of bytes given as input to the executing program and the sequence of bytes
that comes as output from the executing program are called stream. In other words, streams
are nothing but the flow of data in a sequence.
The input and output operation between the executing program and the devices like
keyboard and monitor are known as “console I/O operation”. The input and output
operation between the executing program and files are known as “disk I/O operation”.
Classes for File stream operations :-
The I/O system of C++ contains a set of classes which define the file handling methods.
These include ifstream, ofstream and fstream classes. These classes are derived from
fstream and from the corresponding iostream class. These classes, designed to manage the
disk files, are declared in fstream and therefore we must include this file in any program
that uses files.
1. ios:-
ios stands for input output stream.
This class is the base class for other classes in this class hierarchy.
This class contains the necessary facilities that are used by all the other derived
classes for input and output operations.
2. istream:-
istream stands for input stream.
This class is derived from the class „ios‟.
This class handle input stream.
The extraction operator(>>) is overloaded in this class to handle input streams from
files to the program execution.
This class declares input functions such as get(), getline() and read().
3. ostream:-
ostream stands for output stream.
This class is derived from the class „ios‟.
This class handle output stream.
The insertion operator(<<) is overloaded in this class to handle output streams to files
from the program execution.
This class declares output functions such as put() and write().
4. streambuf:-
This class contains a pointer which points to the buffer which is used to manage the
input and output streams.
5. fstreambase:-
This class provides operations common to the file streams. Serves as a base for fstream,
ifstream and ofstream class.
This class contains open() and close() function.
6. ifstream:-
This class provides input operations.
It contains open() function with default input mode.
Inherits the functions get(), getline(), read(), seekg() and tellg() functions from the
istream.
7. ofstream:-
This class provides output operations.
It contains open() function with default output mode.
Inherits the functions put(), write(), seekp() and tellp() functions from the ostream.
8. fstream:-
This class provides support for simultaneous input and output operations.
Inherits all the functions from istream and ostream classes through iostream.
9. filebuf:-
Its purpose is to set the file buffers to read and write.
We can also use file buffer member function to determine the length of the file.
In C++, files are mainly dealt by using three classes fstream, ifstream, ofstream available in
fstream headerfile.
ofstream: Stream class to write on files
ifstream: Stream class to read from files
fstream: Stream class to both read and write from/to files.
Example 1:
/* File Handling with C++ using ifstream & ofstream class object*/
/* To write the Content in File*/
/* Then to read the content of file*/
#include <iostream>
// Driver Code
int main()
{
// Creation of ofstream class object
ofstream fout;
string line;
// fout.open("sample.txt", ios::app)
fout.open("sample.txt");
// Press -1 to exit
if (line == "-1")
break;
Example 2:
/* File Handling with C++ using fstream class object */
/* To write the Content in File */
/* Then to read the content of file*/
#include <iostream>
#include <fstream>
// Driver Code
int main()
{
// Creation of fstream class object
fstream fio;
string line;
// Press -1 to exit
if (line == "-1")
break;
while (fio) {
return 0;
}
Example 3:
Q: write a single file handling program in c++ to reading and writing data on a
file.
#include<iostream>
#include<fstream>
ofstream fout("d:/student.doc");
fout.close();
ifstream fin("d:/student.doc");
fin.close();
cout<<endl<<rno<<"\t"<<name<<"\t"<<fee;
return 0;
}
Example 4:
// Q: WA C++ file handling program to read data from the file called student.doc
#include<iostream>
#include<fstream>
main()
{
int rno,fee;
char name[50];
ifstream fin("d:/student.doc");
fin>>rno>>name>>fee; //read data from the file student
fin.close();
cout<<endl<<rno<<"\t"<<name<<"\t"<<fee;
return 0;
}
ios::out Creates file if it does not exist. All existing data is erased from the file as new data is
written into the file.
ios::app Creates file if it does not exist. New data is written at the end of the file without
disturbing the existing data.
ios::ate Open file for output. Data can be written anywhere in the file, similar to append.
ios::trunc Discard file content, which is the default action for ios::out.
Linked organization
In a linked organization, the elements or components of the system are connected through
relationships, but the order in which they occur may not be fixed. In Linked Organization,
elements might or might not be stored in consecutive memory locations and the order is
determined by the links between elements. This makes it easy to insert and delete elements
without requiring any movement of other elements and it can be extended or reduced
according to requirements.
Linked organizations differ from sequential organizations essentially in that the logical
sequence of recordsis generally different from the physical sequence.
• In sequential i th record is placed at location li, then the i+1strecord is placed at li + c
where c is the length of i th record or some fixed constant.
• In linked organization the next logical record is obtained by following link value from
present record. Linking in order of increasing primary key eases insertion deletion.
• Searching for a particular record is difficult since no index is available, so only sequential
search possible.
• We can facilitate indexes by maintaining indexes corresponding to ranges of employee
numbers eg. 501-700, 701-900. all records with same range will be linked together in a list.
• We can generalize this idea for secondary key level also. We just set up indexes for each
key and allow recordsto be in more than one list. This leads to the multilist structure for file
representation.
1. Data is stored in nodes that are linked Data is stored in a linear sequence
together
2. Each node contains data and a pointer to Each element is stored one after the other
the next node
3. Allows for fast insertions and deletions Allows for fast traversal and access
5. Can be used for implementing data Can be used for implementing data
structures like linked lists and trees structures like arrays and stacks
7. Can be used for dynamic data structures Suitable for static data structures
10. Pointers may be pointing to null or non- All elements are stored in contiguous
existent memory in case of broken links memory
11. Can have multiple pointers in case of Only one pointer is required to traverse
bidirectional links the list
12. Can have a circular linked list Only a linear sequential list is possible
13. Can have multiple head and tail pointers Only one head and tail pointer is required
14. Can have variable length of data in each Fixed length of data in each element
node
15. Can be used for implementing more Can be used for simple data structures like
complex data structures like graphs. queues and stacks.
Coral Rings
This doubly linked multilist structure is used. Each list is circular list with headnode •
„Alink‟field is used to link all records with same key value
• „Blink‟ is used for some records back pointer and for others it is pointer to head
• node. Owing to these back pointers, deletion is easy without going to start.
• Indexes are maintained as per multilists.
Inverted Files
• Inverted files are similar to multilists. Multilists records with the same key value are linked
together with link information being kept in individual record. In case of inverted files the
link information is kept in index itself.
• EG. We assume that every key is dense. Since the index entries are variable length, index
maintenance becomes complex fro multilists. Benefits being Boolean queries require only
one access per record satisfying the query. Queries of type k1=xx and k2=yy can be handled
similarly by intersecting two lists.
• The retrieval works in two steps. In the first step, the indexes are processed to obtain a list
of records satisfying the query and in the second, these records are retrieved using the list.
The no. of disk accesses needed is equal to the no. of records being retrieved + the no. to
process the indexes.
• Inverted files represent one extreme of file organization in which only the index structures
are important. The records themselves can be stored in any way.
• Inverted files may also result in space saving compared with other file structures when
record retrieval doesn‟t require retrieval of key fields. In this case key fields may be deleted
from the records unlike multilist structures
Cellular Partitions
To reduce the file search times, the storage media may be divided into cells. A cell may be an
entire disk pack or it may simply be a cylinder.
• Lists are localized to lie within a cell.
• A multilist organization in which the list for key1=prog list included records on several
different cylinders then we could break the list into several smaller lists where each prog list
included only those records in the same cylinder.
• The index entry for prog will now contain several entries of the type (addr, length) where
addr is a pointer to start of a list of records with key1=prog and length is the no. of records on
the list.
• By doing this all records of the same cell may be accessed without moving the read/write
heads.
Multiway Merging
Merging is a process of combining two or more types of structures into one single
structure, which is an important component of algorithms such as merge sort.
If we have two or generally more arrays, we can merge them and get a single array or
list.
Practically speaking, the size of the merged array is the same as the total size of the
merged arrays.
Two-Way Merge
A two-way merging, also known as binary merging, is generally an algorithm that takes two
sorted lists and merges them into one list in sorted order. It‟s a widely used approach in
merge sort that outputs the minimum item in each step given list in sorted order and builds a
sorted list with all items in any of the input lists in proportion to the total of the lengths of the
input lists.
If we are merging two arrays size and respectively, then the merged array size will be the
sum of and . Furthermore, merging requires maximum comparisons to get the merged array.
The detailed implementation of the two-way merge is shown in the diagrams below.
We compare the first items in two arrays in sorted order and then append the smaller element
to the output array:
After appending the smaller item to our output array, we continue comparing and appending
them to until the input arrays are empty:
At this point, is empty. The other input array must still be empty, so we take the smaller item
in and append it to . We repeat this process until is empty.
k-way merge
The k-way merge pattern involves merging k sorted arrays into a single sorted array. It
exploits the sorted input to achieve merged order efficiently.
-way merge algorithms, also known as multiway merges, are algorithms that take -sorted lists
as input and produce one sorted list as an output with size equal to the sum of sizes of all
input arrays.
The term “-way merge algorithms” refers to merging approaches that accept more than two
sorted lists:
The detailed implementation of the – way merge is shown in the diagram below:
Use k-way merge when:
You need to combine multiple sorted arrays or lists
Finding the kth smallest/largest element in sorted arrays
We first create tournament trees by comparing the list‟s index items and append the winner
with the least value to the output list: