ADBMS PYQ
Question 1 (2022)
a) Explain different types of anomalies.
[Frequently Asked]
Anomalies are problems that can occur in poorly designed databases where data is stored redundantly . Redundancy means storing the same piece of information in
multiple places, which can lead to inconsistencies when data is added, updated, or deleted .
1. Update Anomaly
What it is: This occurs when an update to a single piece of data is not applied to all of its repeated instances. If a fact is stored in multiple rows and
you update only some of them, the database becomes inconsistent .
Example: Imagine an employee's name ( Ename ) is stored in three different tables: Employee , Emp_Salary , and Emp_Designation . If the
employee's name changes, you must update it in all three tables. If you miss one, the data will be inconsistent .
2. Insertion Anomaly
What it is: This anomaly happens when you cannot add certain information to the database unless some other, unrelated information is also present .
Example: You cannot add a new employee to the Emp_Salary table until that employee has been added to the main Employee table first. This
prevents you from storing salary information for a newly hired employee who is not yet in the main system .
3. Deletion Anomaly
What it is: This occurs when deleting one piece of data unintentionally causes the loss of other, different data. Deleting data from one table may
require deleting related data from other tables to maintain consistency .
Example: If you delete an employee record from the main Employee table, you must also delete their corresponding records from the Emp_Salary
and Emp_Designation tables. If you don't, those tables will contain information about a non-existent employee, leading to inconsistency .
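The following is a minimal sketch, using Python's built-in sqlite3 module and a denormalized Emp_Project table (illustrative, modeled on the employee-project example used later in these notes), of how redundancy produces an update anomaly:
```python
import sqlite3

# Hypothetical denormalized table: employee data repeated once per project assignment.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Emp_Project (Emp_Id INT, Ename TEXT, Address TEXT, Project_Id TEXT)")
con.executemany(
    "INSERT INTO Emp_Project VALUES (?, ?, ?, ?)",
    [(1, "Sachin", "Malad", "P1"),
     (1, "Sachin", "Malad", "P2")],   # the same employee is stored twice
)

# Update anomaly: the address is changed in only one of the two rows...
con.execute("UPDATE Emp_Project SET Address = 'Andheri' WHERE Emp_Id = 1 AND Project_Id = 'P1'")

# ...so the database now holds two conflicting addresses for the same employee.
print(con.execute("SELECT DISTINCT Address FROM Emp_Project WHERE Emp_Id = 1").fetchall())
```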
b) Explain relational model constraints.
[Important Definition]
Relational model constraints, also known as integrity constraints, are rules that the database management system (DBMS) enforces to ensure the correctness,
consistency, and completeness of data .
1. Domain Integrity
This ensures that all values in a column are of the correct type and are within a valid range .
NOT NULL: Ensures that a column cannot have a NULL (unknown or missing) value. For example, every student must have a name .
CHECK: Restricts the values that can be entered into a column by a specific condition. For example, CHECK(Age > 18) ensures that the value for age must be
greater than 18 .
DEFAULT: Assigns a default value to a column if no value is specified during an insert operation. For example, DEFAULT 'NA' could be used for a customer's
name if it's not provided .
2. Entity Integrity
This ensures that each row (or entity) in a table is uniquely identifiable .
UNIQUE: Guarantees that all values in a column are unique. No two rows can have the same value in that column. However, it allows for one NULL value .
PRIMARY KEY: A primary key is a column (or set of columns) that uniquely identifies every row in a table. It is similar to a UNIQUE constraint but does not
allow NULL values .
3. Referential Integrity
This preserves the defined relationships between tables when records are entered or deleted .
FOREIGN KEY: This is the main mechanism for enforcing referential integrity. A foreign key is a column in one table that refers to the primary key of another
table . This creates a link between the two tables.
Example: An Emp table might have a Did (Department ID) column that is a foreign key referencing the Did primary key in the Department table. This
ensures that you cannot assign an employee to a department that does not exist .
Dealing with Foreign Key Violations: When a record in a parent table (the one with the primary key) is deleted or updated, the DBMS can take several actions on
the dependent records in the child table:
RESTRICT / NO ACTION: Prevents the deletion or update on the parent table if there are matching records in the child table. This is the default behavior.
CASCADE: If a record in the parent table is deleted, all corresponding records in the child table are also automatically deleted .
SET NULL: If a record in the parent table is deleted, the foreign key values in the corresponding child table records are set to NULL .
SET DEFAULT: The foreign key values in the child table are set to their specified default value .
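A hedged sketch of these constraints in SQL, run here through Python's sqlite3 module (table and column names are illustrative; note that SQLite only enforces foreign keys when PRAGMA foreign_keys is switched on):
```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")      # SQLite enforces FOREIGN KEY only when enabled

con.execute("""
CREATE TABLE Department (
    Did   INTEGER PRIMARY KEY,               -- entity integrity: unique, non-NULL identifier
    Dname TEXT NOT NULL                      -- domain integrity: value required
)""")

con.execute("""
CREATE TABLE Emp (
    Eid    INTEGER PRIMARY KEY,
    Ename  TEXT NOT NULL,
    Age    INTEGER CHECK (Age > 18),         -- domain integrity: valid range
    Status TEXT DEFAULT 'Active',            -- domain integrity: default value
    Email  TEXT UNIQUE,                      -- entity integrity: no duplicates
    Did    INTEGER REFERENCES Department(Did)
              ON DELETE SET NULL             -- referential action when the parent row is deleted
              ON UPDATE CASCADE
)""")

con.execute("INSERT INTO Department (Did, Dname) VALUES (10, 'Sales')")
con.execute("INSERT INTO Emp (Eid, Ename, Age, Did) VALUES (1, 'Asha', 25, 10)")

# Referential integrity: inserting an employee for a non-existent department is rejected.
try:
    con.execute("INSERT INTO Emp (Eid, Ename, Age, Did) VALUES (2, 'Ravi', 30, 99)")
except sqlite3.IntegrityError as e:
    print("Rejected:", e)
```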
c) Explain the concept of polymorphism in detail.
[Important Definition]
Polymorphism is an object-oriented concept that gives an object the ability to take on multiple forms or patterns . It allows the same programming code, operator, or
symbol to have different implementations and work with different data types .
Core Idea: The core idea is "one interface, multiple functions." An action is defined in a general way, and the specific implementation of that action changes
based on the object that performs it.
Example: The textbook uses the example of vehicles. A car, a truck, and a bike all have a "brake" function. The action vehicle.brake() is the same interface for all of them. However, the internal mechanism for braking is different for each vehicle (e.g., disc brakes vs. drum brakes). The brake operation is therefore polymorphic: its behavior changes depending on which type of vehicle performs it.
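A short Python sketch of this idea (class names are illustrative): one brake() interface, with a different implementation for each vehicle type:
```python
class Vehicle:
    def brake(self):
        raise NotImplementedError("each vehicle supplies its own braking mechanism")

class Car(Vehicle):
    def brake(self):
        return "Car: disc brakes engage on all four wheels"

class Bike(Vehicle):
    def brake(self):
        return "Bike: drum brake engages on the rear wheel"

# One interface, multiple forms: the call is identical for every vehicle,
# but the behaviour depends on the object's actual type.
for vehicle in (Car(), Bike()):
    print(vehicle.brake())
```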
Question 2
a) Explain basic properties of OOP.
Object-Oriented Programming (OOP) concepts are used in Object-Oriented Database Management Systems (OODBMS) to handle complex data. The basic properties
are:
1. Identity (Object Identity)
Every object is given a unique, system-generated Object Identifier (OID) that is permanent and independent of the object's attribute values (its state).
This is different from a relational primary key, which is based on data values.
2. Encapsulation
This is the bundling of data (attributes) and the operations (methods) that work on that data into a single unit, called an object . It is related to data
hiding , which protects an object's internal state from outside interference .
3. Inheritance
This allows a new type (a subtype or child class) to be created from an existing type (a supertype or parent class). The new type inherits all the
attributes and operations of the existing type and can add its own unique characteristics .
4. Complexity
Objects in an OODBMS can have complex internal structures. An object's state can contain other objects, allowing for nested and multi-level data
models .
5. Persistency
This is the ability of an object to exist in memory even after the program that created it has finished executing. Persistent objects are stored
permanently in the database and can be retrieved later .
6. Relationship
Relationships between two objects are maintained using their unique OIDs. One object can store the OID of another object to create a reference or link
between them .
[Important Definition]
A database schema is the logical structure, design, or blueprint of a database . It defines the database entities (like tables), their attributes (columns), the relationships
between them, and all the constraints that apply to the data . The schema is defined during database design and is not expected to change frequently .
Schema: The definition of the database structure (the empty tables and their columns). It's the constant framework.
Instance: The actual data stored in the database at a specific point in time. An instance changes whenever data is added, updated, or deleted .
Physical (Internal) Level: This is the lowest level and describes how the data is physically stored on storage devices like disks. It deals with file organization, data structures, and access paths (like indexes).
Logical (Conceptual) Level: This level describes the logical structure of the entire database for all users. It shows the tables, columns, data types, relationships, and constraints without getting into the physical storage details. This is the level most database administrators and developers work with.
Data abstraction and encapsulation are related but distinct OOP concepts. The textbook states that the idea of encapsulation comes from abstract data types
(ADTs), which is the principle behind data abstraction .
Data Abstraction is a design concept focused on hiding complex implementation details while showing only the essential features of an object. It answers the
question, "What does the object do?". For example, you know a car can accelerate, but you don't need to know the details of the engine's combustion cycle to
do it.
Encapsulation is the implementation mechanism for abstraction and data hiding. It is the process of bundling an object's data (attributes) and the methods
that operate on that data into a single unit . It answers the question, "How is it implemented?". Encapsulation controls access to the data, often by making it
private and only allowing it to be modified through public methods.
Difference between Data Abstraction and Encapsulation:
Main Focus: Data Abstraction focuses on hiding implementation complexity; Encapsulation focuses on bundling data and methods together.
Goal: Data Abstraction aims to simplify the view of a system by exposing only essential functionalities; Encapsulation aims to achieve data hiding and protect the internal state of an object from direct outside access.
Implementation: Data Abstraction is achieved using abstract classes and interfaces (conceptual); Encapsulation is achieved by making data members private and providing public methods to access them (practical).
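A brief Python sketch of the distinction (the Account class is illustrative, not from the textbook): the abstract interface expresses what an object can do, while encapsulation keeps its state private behind methods:
```python
from abc import ABC, abstractmethod

class Account(ABC):                  # abstraction: only the essential operations are exposed
    @abstractmethod
    def deposit(self, amount): ...
    @abstractmethod
    def balance(self): ...

class SavingsAccount(Account):       # encapsulation: the state is hidden behind methods
    def __init__(self):
        self.__balance = 0.0         # name-mangled "private" attribute

    def deposit(self, amount):
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self.__balance += amount

    def balance(self):
        return self.__balance

acc = SavingsAccount()
acc.deposit(100.0)
print(acc.balance())                 # 100.0 -- callers use the interface, never the hidden field
```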
Question 4
a) Explain architecture of distributed database.
A distributed database is a collection of logically related databases that are physically distributed across multiple sites connected by a computer network . The
architecture of a Distributed DBMS (DDBMS) defines how functionality is split across different processes.
1. Client-Server Systems
Architecture: This model consists of multiple client processes and one or more server processes . Clients are responsible for the user interface and
application logic, while servers are responsible for data management, query processing, and transaction execution . A client sends a query to a server,
which processes it and returns the result .
Limitation: A standard client-server architecture does not allow a single query to transparently access data from multiple servers at once .
2. Collaborating Server Systems
Architecture: This is an extension of the client-server model where there is no longer a strict distinction between clients and servers. It is a collection of database servers, each capable of managing its local data.
Functionality: When a server receives a query that requires data from other servers, it takes on the role of a coordinator. It decomposes the query into
sub-queries, sends them to the appropriate servers for execution, and then combines their results to answer the original query .
3. Middleware Systems
Architecture: This approach uses a special software layer called middleware to manage queries and transactions across multiple, independent, and
often heterogeneous database servers .
Functionality: The middleware layer handles the complex task of distributed query processing, but it generally does not store any data itself . The
individual database servers only need to be capable of handling local requests. This architecture is useful for integrating existing legacy database
systems .
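A toy Python sketch of the collaborating-server idea (server names and data are invented): the server that receives a query acts as coordinator, sends sub-queries to the other servers, and combines their partial results:
```python
# Each "server" manages only its local data (here, rows of a Sales relation).
SERVERS = {
    "pune_server":   [("Pune", 1200), ("Pune", 800)],
    "mumbai_server": [("Mumbai", 1500)],
}

def local_total(server_name):
    """Sub-query executed locally at one server."""
    return sum(amount for _, amount in SERVERS[server_name])

def coordinator_total_sales():
    """The receiving server coordinates: decompose, fan out sub-queries, combine results."""
    return sum(local_total(name) for name in SERVERS)

print(coordinator_total_sales())   # 3500
```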
Parallelism in databases is used to improve performance by executing operations simultaneously. This is achieved in two ways:
1. Inter-query Parallelism
Important Definition: This form of parallelism involves executing different, independent queries at the same time on different processors .
Goal: The main goal of inter-query parallelism is to increase the overall throughput of the system, which is the number of transactions or queries that can be
completed in a given period .
Example: If you have 10 separate queries that each take 2 seconds to run sequentially (totaling 20 seconds), a system with 10 processors can run all of them
in parallel, completing all 10 queries in just 2 seconds .
2. Intra-query Parallelism
Important Definition: This form of parallelism involves taking a single large or complex query, breaking it down into smaller sub-tasks , and executing those
tasks simultaneously on multiple processors .
Goal: The main goal of intra-query parallelism is to reduce the response time of a single, long-running query .
How it works: This can be achieved in two ways:
1. Executing different operations within the same query in parallel (e.g., performing two separate joins at the same time).
2. Parallelizing a single operation (e.g., using multiple processors to sort one large table) .
Unit of Parallelism: Inter-query parallelism works on multiple different queries; intra-query parallelism works on a single query.
Best For: Inter-query parallelism suits systems with many small, independent transactions; intra-query parallelism suits systems with large, complex queries (like in data warehousing).
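An illustrative Python sketch of the two forms using a thread pool (data and queries are invented; a real parallel DBMS would use separate processors, the threads here only show the structure):
```python
from concurrent.futures import ThreadPoolExecutor

sales = list(range(1, 1_000_001))            # one large relation (illustrative)

def query_a():                               # independent query 1
    return sum(sales)

def query_b():                               # independent query 2
    return max(sales)

with ThreadPoolExecutor() as pool:
    # Inter-query parallelism: two whole queries run concurrently (improves throughput).
    fa, fb = pool.submit(query_a), pool.submit(query_b)
    total, largest = fa.result(), fb.result()

    # Intra-query parallelism: one SUM query split into partial sums (improves response time).
    chunks = [sales[i:i + 250_000] for i in range(0, len(sales), 250_000)]
    total_parallel = sum(pool.map(sum, chunks))

print(total == total_parallel)               # True
```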
[Important Definition]
A database catalog is a system-owned repository that stores metadata—that is, data about the data. This includes table and column definitions, constraints, indexes,
and user permissions.
In a distributed database system, the catalog must also store information about how the data is distributed across the network. Distributed Catalog Management
refers to the strategies for storing, managing, and accessing this distributed metadata.
Data Fragmentation: The catalog must store the fragmentation schema, which describes how each global relation is broken down into fragments (e.g., the
conditions for horizontal fragments or the column lists for vertical fragments) .
Data Replication: It needs to maintain a replication schema, which details how many copies (replicas) of each fragment exist and where they are located .
Data Allocation: The catalog holds the allocation schema, which describes the specific site where each fragment or replica is physically stored .
This information is critical for the DDBMS, especially for the query optimizer, which needs to know the location and characteristics of data fragments to create an
efficient execution plan that minimizes costs like data transfer over the network .
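A small Python sketch of the kind of metadata such a catalog might hold (relation, fragment, and site names are hypothetical):
```python
# Hypothetical catalog entries for one global relation "Customer".
catalog = {
    "Customer": {
        "fragmentation": {                      # how the global relation is split
            "Customer_Pune":   "rows WHERE city = 'Pune'",
            "Customer_Mumbai": "rows WHERE city = 'Mumbai'",
        },
        "replication": {                        # which sites hold a copy of each fragment
            "Customer_Pune":   ["site_A", "site_C"],
            "Customer_Mumbai": ["site_B"],
        },
        "allocation": {                         # the primary site of each fragment
            "Customer_Pune":   "site_A",
            "Customer_Mumbai": "site_B",
        },
    }
}

def sites_holding(relation, fragment):
    """The query optimizer consults the catalog to learn where a fragment can be read."""
    return catalog[relation]["replication"][fragment]

print(sites_holding("Customer", "Customer_Pune"))   # ['site_A', 'site_C']
```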
[PYQ]
Question 1 (2023)
a) Explain referential integrity and domain integrity constraint with suitable example.
[Important Definition]
Integrity constraints are rules that ensure the correctness and consistency of data in a database. Domain and referential integrity are two fundamental types.
1. Domain Integrity
Definition: Domain integrity ensures that all values entered into a column are valid. It specifies the set of permissible values for an attribute . This is enforced
using three main constraints:
NOT NULL: This constraint ensures that a column cannot store a NULL value . For example, every student in a Student table must have a name.
CHECK: This constraint validates that a value meets a specific condition before being inserted . For example, in an Employee table, a constraint
CHECK(Age > 18) would ensure no one under 18 can be registered.
DEFAULT: This constraint provides a default value for a column when no value is specified by the user during an insert operation . For instance, a
column Status could have a default value of 'Active' .
2. Referential Integrity
Definition: Referential integrity is a rule that preserves the defined relationships between tables. It ensures that a foreign key value in one table always refers to
a valid, existing primary key in another table .
Example: Consider two tables, Employee and Department .
Department table has a primary key Did (Department ID).
Employee table has a foreign key Did that refers to the Department table's primary key.
This constraint ensures that you cannot add an employee to the Employee table with a Did that does not exist in the Department table. This
prevents having "orphan" records and maintains data consistency .
Object-Oriented Database Management Systems (OODBMS) are built on the principles of Object-Oriented Programming to handle complex data structures. The basic
properties are:
1. Identity (Object Identity): Every object has a unique, system-generated Object Identifier (OID) that is immutable and independent of the object's attribute
values. This allows two objects to have the same state but still be distinct entities .
2. Encapsulation: This is the bundling of an object's data (attributes) and the methods (operations) that act on that data into a single unit. It hides the internal
implementation details from the outside world .
3. Inheritance: This allows a new class (subtype) to be created from an existing class (supertype), inheriting all its attributes and methods. This promotes code
reuse and the creation of class hierarchies .
4. Complexity: Objects can represent complex real-world entities with nested structures. An object's state can be built from other objects, allowing for multi-level
complexity .
5. Persistency: An object is persistent if it continues to exist in the database even after the program that created it has terminated. This is a core feature that
separates a database object from a transient program variable .
The Relational Data Model, first proposed by E. F. Codd, is a way of logically structuring data using tables, also known as relations . It is the foundation of most
modern database systems.
Relation: A relation is a table with columns and rows . Each table in a database has a unique name.
Attribute: An attribute is a named column of a relation. It represents a property of the entity (e.g., Name , Age , Address for a Student entity) .
Tuple: A tuple is a row in a relation. It represents a single record or a single instance of an entity (e.g., all the information for one specific student) .
Domain: A domain is the set of all permissible values for a given attribute. For example, the domain for an Age attribute might be integers between 0 and 120 .
Question 2
a) Explain Database schema with suitable example.
[Important Definition]
A database schema is the logical blueprint or structure of a database . It is defined during the design phase and specifies the tables, the columns in each table, the
data types for each column, the relationships between tables, and the constraints on the data . The schema represents the logical view of the entire database and is
not expected to change frequently .
Schema: The design or structure of the database (e.g., the CREATE TABLE statements).
Instance: The actual data contained in the database at a specific moment in time. The instance changes as data is added, updated, or deleted .
Example of a Database Schema: An example is the HR Schema for a company database, which might include several related tables :
EMPLOYEE Table: With columns like employee_id (Primary Key), first_name , last_name , job_id (Foreign Key), department_id (Foreign Key).
DEPARTMENTS Table: With columns department_id (Primary Key), department_name , location_id (Foreign Key).
JOBS Table: With columns job_id (Primary Key), job_title , min_salary .
LOCATIONS Table: With columns location_id (Primary Key), street_address , city .
This structure, showing the tables, their columns, and the primary/foreign key relationships, represents the database schema.
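A minimal sqlite3 sketch of the schema-versus-instance distinction, using a single illustrative table rather than the full HR schema:
```python
import sqlite3

con = sqlite3.connect(":memory:")

# Schema: the fixed structure, defined once at design time.
con.execute("CREATE TABLE EMPLOYEE (employee_id INTEGER PRIMARY KEY, first_name TEXT NOT NULL)")

# Instance: the data at one point in time; it changes with every insert, update, or delete.
con.execute("INSERT INTO EMPLOYEE VALUES (1, 'Asha')")
print(con.execute("SELECT * FROM EMPLOYEE").fetchall())   # instance now: [(1, 'Asha')]

con.execute("INSERT INTO EMPLOYEE VALUES (2, 'Ravi')")
print(con.execute("SELECT * FROM EMPLOYEE").fetchall())   # a new instance: [(1, 'Asha'), (2, 'Ravi')]
```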
[Important Definition]
An OID (Object Identifier) is a unique identifier that is assigned by the Object-Oriented Database system to each object when it is created . Unlike a primary key in a
relational database, an OID is not based on the attribute values of the object. Its purpose is to provide a permanent and unique identity for an object throughout its
lifetime .
Uniqueness: An OID is unique system-wide. Once assigned to an object, it can never be assigned to another object .
System Generated: OIDs are generated automatically by the database system; they cannot be manually assigned or changed by the user .
Invariant: The OID of an object cannot be altered during its entire lifetime .
Immutable: Once an OID is used for an object, it cannot be regenerated or reused for another object, even after the original object is deleted .
Invisible: The OID is generally not visible to the end-user .
Independent: The value of an OID is independent of the state (attribute values) of the object. Two objects can have identical attribute values but will always
have different OIDs .
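A short Python analogy (not how an OODBMS actually generates OIDs, only an illustration of identity versus state): two objects with identical attribute values remain distinct identities:
```python
class Employee:
    def __init__(self, name, age):
        self.name, self.age = name, age

    def __eq__(self, other):
        # Equality compares state (attribute values), the value-based notion of "sameness".
        return (self.name, self.age) == (other.name, other.age)

e1 = Employee("Asha", 30)
e2 = Employee("Asha", 30)

print(e1 == e2)   # True  -- identical state
print(e1 is e2)   # False -- still two distinct identities (analogous to two different OIDs)
```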
c) Compare Relational and Object Oriented Databases.
Relational Database Management Systems (RDBMS) and Object-Oriented Database Management Systems (OODBMS) represent different approaches to storing and
managing data.
Data Model: In an RDBMS, data is stored in two-dimensional tables consisting of rows and columns; in an OODBMS, data is stored as objects, which encapsulate both data (attributes) and behavior (methods).
Data Structure: An RDBMS is best suited for simple, structured data that fits well into tables; an OODBMS is designed to handle complex, semi-structured, or unstructured data (e.g., multimedia, nested objects).
Relationships: In an RDBMS, relationships are represented using foreign keys to link rows in different tables; in an OODBMS, relationships are represented by direct object references (OIDs), where one object can contain a reference to another object.
Query Language: An RDBMS uses the standardized Structured Query Language (SQL); an OODBMS uses Object Query Language (OQL), which is designed to work with objects and their methods.
Question 3
a) Explain Shared Disk and Shared Memory Parallel Database Architecture.
These are two common architectures for building parallel database systems, where multiple processors work together to improve performance.
1. Shared Memory Architecture
Architecture: In this model, multiple CPUs are connected via a high-speed interconnection network to a single, global shared memory and a single set of shared disks. All processors can access all memory and all disks.
How it Works: Processors communicate with each other by reading and writing to the shared memory, which is very fast. Each processor typically has a local
cache to reduce contention for the shared memory bus .
Advantages:
Very efficient and fast communication between processors .
Simple to implement load balancing.
Disadvantages:
The shared memory bus or network can become a bottleneck as more processors are added .
It is not easily scalable beyond a certain number of processors (e.g., 32 or 64) .
2. Shared Disk Architecture
Architecture: In this model, each processor has its own private memory, but all processors can access a single, shared set of disks through an interconnection network. These systems are often called clusters.
How it Works: Since each processor has its own memory, there is no memory bus bottleneck. Processors coordinate access to the shared data on the disks.
Advantages:
Provides good fault tolerance. If one processor fails, the other processors can still access all the data on the shared disks and take over its tasks .
It is more scalable than a shared memory architecture.
Disadvantages:
The interconnection network to the disks can become a bottleneck as a large amount of data may need to be shipped between the processors'
memories .
[Important Definition]
Parallel Query Optimization is the process of finding the most efficient query execution plan in a parallel database environment . Since data and processing are
distributed across multiple nodes, the optimizer must consider not only traditional costs like disk I/O but also communication costs, data partitioning, and how to best
parallelize each step of the query.
Two-Phase Approach to Optimization: A common strategy for parallel query optimization involves two steps :
1. Join Ordering (Uni-processor Optimization): In the first phase, a traditional query optimizer is used to determine the best sequence of operations (like joins)
without considering parallelism. This produces a sequential query plan .
2. Parallel Resource Allocation: In the second phase, this sequential plan is adapted for a parallel environment. The optimizer decides how to partition the data,
which operations to run in parallel, and how to allocate processors and memory to each task in the plan .
Fragmentation and replication are two key strategies used in distributed database design to manage how data is stored across multiple sites.
1. Fragmentation
Definition: Fragmentation is the process of breaking a single logical database or table into smaller pieces called fragments. These fragments can then be
stored on different computers in the network . The key rule is that the original table must be reconstructible from its fragments .
Types of Fragmentation:
Horizontal Fragmentation: The table is split into groups of rows (tuples). Each fragment contains a subset of the rows based on a specific condition
(e.g., all customers from 'Pune' in one fragment, 'Mumbai' in another) . The original table is reconstructed using UNION .
Vertical Fragmentation: The table is split into groups of columns (attributes). Each fragment contains a subset of columns, but the primary key must be
included in all fragments to link them back together . The original table is reconstructed using JOIN .
2. Replication
Definition: Replication is the process of creating and maintaining multiple copies of data or data fragments on different sites in the network .
Goals of Replication:
Increased Availability: If one site containing the data fails, users can still access the data from another site that has a replica. This makes the system
more fault-tolerant .
Faster Query Evaluation: Queries can be performed on a local copy of the data instead of accessing a remote site, which reduces network delays and
improves response time .
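A minimal Python/sqlite3 sketch (table and fragment names illustrative) of horizontal and vertical fragments and their reconstruction with UNION and JOIN:
```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Horizontal fragments: rows split by city, one fragment per site.
CREATE TABLE Customer_Pune   (Cid INTEGER PRIMARY KEY, Name TEXT, City TEXT);
CREATE TABLE Customer_Mumbai (Cid INTEGER PRIMARY KEY, Name TEXT, City TEXT);
INSERT INTO Customer_Pune   VALUES (1, 'Asha', 'Pune');
INSERT INTO Customer_Mumbai VALUES (2, 'Ravi', 'Mumbai');

-- Vertical fragments: columns split, with the primary key repeated in both fragments.
CREATE TABLE Emp_Personal (EmpID INTEGER PRIMARY KEY, Name TEXT, Address TEXT);
CREATE TABLE Emp_Payroll  (EmpID INTEGER PRIMARY KEY, Salary INTEGER);
INSERT INTO Emp_Personal VALUES (1, 'Asha', 'Pune');
INSERT INTO Emp_Payroll  VALUES (1, 50000);
""")

# Reconstruct the horizontal fragmentation with UNION.
print(con.execute(
    "SELECT * FROM Customer_Pune UNION SELECT * FROM Customer_Mumbai").fetchall())

# Reconstruct the vertical fragmentation with a JOIN on the shared primary key.
print(con.execute(
    "SELECT p.EmpID, p.Name, p.Address, s.Salary "
    "FROM Emp_Personal p JOIN Emp_Payroll s ON p.EmpID = s.EmpID").fetchall())
```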
Question 4
a) Explain parallel join operation.
A parallel join is a technique used in parallel databases to speed up the process of joining two large tables (relations). The fundamental strategy is to decompose the
large join into a collection of smaller, independent joins that can be executed simultaneously on different processors .
1. Partitioning: Both relations that need to be joined (say R1 and R2) are partitioned across the available processors using the exact same partitioning function
on their join attributes . This is the most crucial step.
2. Data Distribution: This partitioning ensures that all the tuples from R1 and R2 that could possibly match each other (i.e., have the same join attribute value) are
sent to the same processor .
3. Local Join: Each processor then performs a join operation locally on the subset of data it has received. Since all matching tuples are guaranteed to be on the
same processor, no further communication between processors is needed for the join itself.
4. Result Aggregation: The results from the local joins on all processors are then combined to produce the final result of the overall join.
Types of Parallel Joins: The partitioning method determines the type of parallel join:
Parallel Hash Join: If hash partitioning is used on the join attribute, it results in a parallel hash join algorithm .
Parallel Sort-Merge Join: If range partitioning is used on the join attribute, it results in a parallel sort-merge join algorithm, with the added benefit that the final
output is already sorted .
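A toy Python sketch of a parallel hash join (relation contents and the number of partitions are invented): both relations are hash-partitioned on the join attribute, each partition pair is joined independently, and the partial results are combined:
```python
from concurrent.futures import ThreadPoolExecutor

N_PARTITIONS = 4                       # one partition per (hypothetical) processor

emp  = [(1, "Asha", 10), (2, "Ravi", 20), (3, "Meera", 10)]   # (Eid, Ename, Did)
dept = [(10, "Sales"), (20, "HR")]                            # (Did, Dname)

def partition(rows, key_index):
    """Hash-partition a relation on its join attribute."""
    parts = [[] for _ in range(N_PARTITIONS)]
    for row in rows:
        parts[hash(row[key_index]) % N_PARTITIONS].append(row)
    return parts

def local_join(args):
    """Join one pair of partitions; no other partition can contain matching tuples."""
    emp_part, dept_part = args
    dept_by_id = {d[0]: d for d in dept_part}
    return [(e[0], e[1], dept_by_id[e[2]][1]) for e in emp_part if e[2] in dept_by_id]

emp_parts, dept_parts = partition(emp, 2), partition(dept, 0)

with ThreadPoolExecutor(max_workers=N_PARTITIONS) as pool:
    results = pool.map(local_join, zip(emp_parts, dept_parts))

# Every matching (Eid, Ename, Dname) row, assembled partition by partition.
print([row for part in results for row in part])
```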
The Two-Phase Commit (2PC) protocol is a distributed algorithm that ensures all parts of a transaction are either committed or aborted across all participating sites,
thus guaranteeing atomicity.
Roles:
Coordinator: The transaction manager at the site where the transaction originated.
Subordinates (or Cohorts): The transaction managers at the other sites involved in the transaction.
Phase 1: The Voting (Prepare) Phase
1. The Coordinator sends a PREPARE message to all Subordinates, asking them if they are ready to commit.
2. Upon receiving the PREPARE message, each Subordinate determines if it can successfully commit its part of the transaction.
If it can, it writes a prepare record to its log, becomes "prepared," and sends a YES vote to the Coordinator .
If it cannot (due to an error), it writes an abort record to its log and sends a NO vote to the Coordinator .
Phase 2: The Commit (Decision) Phase
3. The Coordinator collects the votes from all Subordinates.
4. The Coordinator makes a final decision:
If all Subordinates voted YES: The Coordinator decides to COMMIT. It writes a commit record to its own log and sends a COMMIT message to all Subordinates.
If even one Subordinate voted NO (or timed out): The Coordinator decides to ABORT. It writes an abort record to its own log and sends an ABORT message to all Subordinates.
5. Each Subordinate receives the final decision from the Coordinator. It writes a commit or abort record to its log as instructed, performs the required action, and releases any locks.
6. Each Subordinate sends an ACK (acknowledgment) message back to the Coordinator to confirm it has completed the action.
7. Once the Coordinator receives ACK from all Subordinates, it considers the transaction complete and writes an end record to its log.
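A simplified Python simulation of the 2PC decision logic (no real logging, locking, or network messages):
```python
class Subordinate:
    def __init__(self, name, can_commit):
        self.name, self.can_commit = name, can_commit
        self.log = []                                    # stands in for the site's local log

    def prepare(self):
        """Phase 1: force-write a prepare/abort record and vote YES or NO."""
        self.log.append("prepare" if self.can_commit else "abort")
        return "YES" if self.can_commit else "NO"

    def decide(self, decision):
        """Phase 2: record the coordinator's decision, act on it, and acknowledge."""
        self.log.append(decision)
        return "ACK"

def two_phase_commit(subordinates):
    votes = [s.prepare() for s in subordinates]          # Phase 1: collect votes
    decision = "commit" if all(v == "YES" for v in votes) else "abort"
    acks = [s.decide(decision) for s in subordinates]    # Phase 2: broadcast the decision
    assert all(a == "ACK" for a in acks)                 # coordinator then writes its end record
    return decision

print(two_phase_commit([Subordinate("site_A", True), Subordinate("site_B", True)]))   # commit
print(two_phase_commit([Subordinate("site_A", True), Subordinate("site_B", False)]))  # abort
```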
[Important Definition]
Distributed transaction processing is the mechanism that manages transactions in a distributed database system to ensure they maintain the database's consistency
and integrity, especially in the face of concurrent operations and system failures .
Key Concepts:
[PYQ]
Question 1 (2024)
a) Explain Entity Integrity Constraints & Referential Integrity Constraints with example.
[Important Definition]
Integrity constraints are rules used to maintain the accuracy and consistency of data in a relational database. Entity integrity and referential integrity are two of the
most critical types.
1. Entity Integrity
Definition: This constraint ensures that each row (or entity) in a table is uniquely identifiable. It is enforced primarily by the PRIMARY KEY . A primary key is a
column or a set of columns whose values uniquely identify a single row in the table .
Rules: The entity integrity constraint has two main rules:
1. A primary key's value must be unique for each row.
2. A primary key cannot contain NULL values .
Example: In a STUDENT table with columns (ID, Name, Email) , the ID column would be the primary key. Entity integrity ensures that every student has a
unique ID and that no student record can exist without an ID. An attempt to insert a new student with an existing ID or a NULL ID would be rejected by the
database.
2. Referential Integrity
Definition: This constraint is used to maintain consistency between related tables. It ensures that a relationship between two tables remains valid. It is
enforced using a FOREIGN KEY . A foreign key is a column in one table that points to the primary key of another table .
Rules: Referential integrity ensures that a value in the foreign key column must either match an existing value in the primary key of the referenced table or be
NULL .
Example: Consider an EMPLOYEE table with a Did column (Department ID) and a DEPARTMENT table where Did is the primary key. The Did in the
EMPLOYEE table is a foreign key referencing DEPARTMENT . Referential integrity ensures that you cannot assign an employee to a department that does not
exist (i.e., the Did value you insert into EMPLOYEE must already be present in the DEPARTMENT table) .
b) Write a short note on i) Class Inheritance & its types ii) Polymorphism
i) Class Inheritance & its types
Definition: Inheritance is a fundamental concept in object-oriented systems where a new class (the subtype) can be created based on an existing class (the
supertype ) . The subtype automatically inherits all the attributes and functions (methods) of its supertype, and can also have its own additional, specific
functions . This concept is similar to specialization in the ER model and promotes reusability .
Type Hierarchy: Inheritance leads to the creation of a type hierarchy (or class hierarchy), which shows the supertype/subtype relationships among all the
classes in the system .
Example: A PERSON class can be defined as a supertype with general attributes like Name , Address , and Birthdate . From this, two subtypes can be
created:
STUDENT subtype-of PERSON : Inherits all attributes from PERSON and adds its own specific attributes like Standard , Fees , Roll_No .
TEACHER subtype-of PERSON : Inherits all attributes from PERSON and adds Salary , Major , Grade .
Types of Inheritance: The provided textbook explains the general concept of creating subtypes from supertypes to form a hierarchy but does not detail
different types of inheritance (like single, multiple, or multi-level).
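A small Python sketch of the PERSON type hierarchy described above, using a subset of the listed attributes:
```python
class Person:                                   # supertype: general attributes only
    def __init__(self, name, address, birthdate):
        self.name, self.address, self.birthdate = name, address, birthdate

class Student(Person):                          # STUDENT subtype-of PERSON
    def __init__(self, name, address, birthdate, standard, roll_no):
        super().__init__(name, address, birthdate)   # inherit the supertype's attributes
        self.standard, self.roll_no = standard, roll_no

class Teacher(Person):                          # TEACHER subtype-of PERSON
    def __init__(self, name, address, birthdate, salary, major):
        super().__init__(name, address, birthdate)
        self.salary, self.major = salary, major

s = Student("Asha", "Pune", "2005-06-01", "XII", 17)
print(s.name, s.roll_no)    # inherited attribute + subtype-specific attribute
```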
ii) Polymorphism
Definition: Polymorphism is an object-oriented property that allows an object to take on multiple forms . It means that the same operation or function name
can have different implementations depending on the object it is called on .
Core Idea: The main idea is "one name, multiple forms." This allows for writing more generic and flexible code.
Example: A Vehicle class could have a method called move() . Subtypes like Car , Boat , and Airplane would all inherit this method but implement it
differently. Car.move() would implement driving, Boat.move() would implement sailing, and Airplane.move() would implement flying. The action is
polymorphic because the same method name ( move() ) results in different behaviors. The textbook uses a similar example of a break operation on different
vehicles .
Question 2
a) Explain insert, update & Delete anomalies with suitable example.
[Frequently Asked]
Anomalies are data inconsistencies that occur in a database, typically due to data redundancy in a poorly designed (non-normalized) schema . Redundant data leads
to problems when modifying the database.
1. Insertion Anomaly
Definition: This occurs when you are unable to insert a new piece of information into the database because another, unrelated piece of information is
missing .
Example: Consider a single table that stores both employee details and the project they are assigned to: (Emp_Id, Ename, Address, Project_Id,
Project_Name) . If a new employee is hired but not yet assigned to a project, you cannot add their details to the table because the Project_Id would
be null, and if the project ID is part of the primary key, this is not allowed. You are blocked from inserting a valid fact (a new employee) because another
fact (a project assignment) is not yet available.
2. Update Anomaly
Definition: This occurs due to data redundancy. If a piece of information is stored in multiple rows, and an update is made to only some of them, the
database becomes inconsistent .
Example: Using the same employee-project table, if an employee like 'Sachin' works on two different projects, his Address ('Malad') will be listed in
two separate rows . If 'Sachin' moves to a new address, you must update this information in both rows. If you only update one, the database will have
conflicting information about where 'Sachin' lives, leading to an inconsistent state.
3. Deletion Anomaly
Definition: This occurs when the deletion of one piece of information unintentionally leads to the loss of another, different piece of information that
you wanted to keep .
Example: In the employee-project table, if an employee works on only one project, and that project is completed and its record is deleted, you might
lose all information about that employee. For instance, if 'Suhas' only works on project 'FID', deleting the row for project 'FID' would also remove
'Suhas's Ename and Address from the database entirely, even though 'Suhas' is still an employee of the company.
The Object-Oriented paradigm was developed to better handle the complexity of modern applications. In the context of databases, Object-Oriented Database Systems
(OODBS) were proposed as an alternative to relational systems, particularly for domains where complex objects play a major role .
Handling Complex Data: Traditional relational databases struggle to efficiently store and manage complex data types like multimedia files, geographic
information systems (GIS) data, and computer-aided design (CAD) models. OODBS are designed to handle these complex, nested, and non-structured data
types naturally as objects.
Overcoming Impedance Mismatch: In application development, programmers work with objects. When using a relational database, these objects must be
translated into a tabular format (rows and columns), a process which is often complex and inefficient. This is called the "object-relational impedance
mismatch." OODBS eliminate this by allowing objects to be stored directly, creating a more seamless integration between the application and the database.
1. Identity (Object Identity): Each object has a unique Object Identifier (OID) that is independent of its values .
2. Encapsulation: Data (attributes) and operations (methods) are bundled together into a single object, hiding the internal details .
3. Inheritance: New classes (subtypes) can be created from existing classes (supertypes), inheriting their properties and promoting code reuse .
4. Complexity: Objects can have complex internal structures, containing other objects or collections of objects .
5. Persistency: Objects can be stored permanently in the database and exist even after the program that created them terminates .
6. Polymorphism: The ability for an operation to be implemented in different ways depending on the object it is acting upon .
Question 3
a) Explain in details architectures for parallel database.
[Frequently Asked]
A parallel database system uses multiple processors and disks in parallel to improve performance. There are three main architectures:
1. Shared Memory Architecture
Architecture: Multiple CPUs are connected to a single, global shared memory and a shared set of disks through a common communication bus or interconnection network. All processors have equal access to all memory and disk space.
Functionality: Processors communicate by writing to and reading from the shared memory, which is extremely fast. To reduce contention, each processor has
a private cache .
Advantages:
Efficient and fast inter-processor communication .
Simplified load balancing.
Disadvantages:
The shared memory bus becomes a bottleneck as the number of CPUs increases, limiting scalability .
Not scalable beyond 32 or 64 processors.
2. Shared Disk Architecture
Architecture: In this model, each processor has its own private (local) memory, but all processors can access a common set of shared disks through an interconnection network. These systems are often called clusters.
Functionality: Since memory is not shared, the memory bus is not a bottleneck. Processors communicate by passing messages over the network, and all have
access to the entire database on the shared disks.
Advantages:
High degree of fault tolerance. If one processor fails, others can take over its tasks since the database is accessible to all .
Disadvantages:
The network connecting the processors to the shared disks can become a bottleneck as data is shipped between processors .
3. Shared Nothing Architecture
Architecture: Each processor (or node) is part of a self-contained unit with its own local memory and its own local disk(s). The nodes communicate with each other exclusively by sending messages over a high-speed network. Data is partitioned across all the nodes.
Functionality: Queries are processed locally on a node as much as possible. If data from another node is needed, a message is sent to that node to request it.
Advantages:
Highly scalable. Thousands of processors can be added, as there are no shared resources to create a bottleneck .
Eliminates I/O and memory contention found in other architectures.
Disadvantages:
The cost of accessing data on a non-local node is high, as it requires network communication .
System performance is highly dependent on how well the data is partitioned across the nodes.
A distributed transaction management system is responsible for ensuring the properties of transactions (like atomicity and consistency) are maintained in a
distributed database system, where a single transaction can span multiple physical sites.
Key Concepts:
A distributed transaction is a transaction that accesses data on more than one site. It is treated as a single logical unit but is composed of several sub-
transactions , each executing on a different site .
The main challenge is ensuring atomicity, meaning the transaction must either commit at all participating sites or abort at all sites. A partial commit would
leave the database in an inconsistent state.
The Two-Phase Commit (2PC) Protocol: This is the standard protocol used to manage distributed transactions and ensure atomicity.
Roles:
Coordinator: The transaction manager at the site where the transaction originated .
Subordinates: The transaction managers at all other sites involved .
Step-by-Step Process:
Phase 1: The Voting (Prepare) Phase
1. The Coordinator sends a PREPARE message to all Subordinates.
2. Each Subordinate that can commit its sub-transaction force-writes a prepare record to its log and votes YES; otherwise it writes an abort record and votes NO.
3. Each Subordinate sends its vote back to the Coordinator.
Phase 2: The Decision Phase
4. The Coordinator collects votes from all Subordinates.
5. It makes a global decision:
If ALL Subordinates voted YES, the Coordinator decides to COMMIT.
If ANY Subordinate voted NO (or failed to respond), the Coordinator decides to ABORT.
6. The Coordinator force-writes its final decision (commit or abort) to its own log.
7. The Coordinator sends the final decision (COMMIT or ABORT message) to all Subordinates.
8. Each Subordinate receives the final decision, writes it to its log, and then either commits or aborts its sub-transaction accordingly.
9. Each Subordinate sends an ACK (acknowledgment) message to the Coordinator. The transaction is complete once the Coordinator receives all acknowledgments.
Question 4
a) Differentiate between Interquery Parallelism & Intraquery Parallelism.
Inter-query and Intra-query parallelism are two distinct strategies for improving performance in a parallel database system. The key difference lies in what is being
parallelized: multiple queries or a single query.
Definition: Inter-query parallelism means multiple, independent queries are executed simultaneously on different processors; intra-query parallelism means a single, large query is broken down into smaller sub-tasks, which are then executed simultaneously on different processors.
Primary Goal: Inter-query parallelism aims to increase system throughput (the number of queries completed per unit of time); intra-query parallelism aims to reduce the response time (the execution time) of a single, complex query.
Unit of Work: For inter-query parallelism, the unit of parallelism is an entire query; for intra-query parallelism, it is a sub-task or an operation within a query.
Example: Inter-query parallelism is running a query for Customer A on CPU 1 at the same time as a query for Customer B runs on CPU 2. Intra-query parallelism is breaking down a query that calculates total sales across all regions so that CPU 1 calculates sales for the North while CPU 2 calculates sales for the South in parallel.
b) Explain distributed database system architecture and how data is stored in distributed database system.
A distributed database system architecture defines how different DBMS-related processes and components are structured across a network. There are three main
approaches:
Client-Server Systems: This architecture separates functionality between client processes (which handle user interaction) and server processes (which
manage data and transactions). A client sends a query to a server, which executes it and returns the result .
Collaborating Server Systems: This is an evolution where multiple database servers can work together to execute a single query. When a server receives a
query that needs data from other sites, it coordinates with those other servers to generate the final result .
Middleware Systems: This architecture uses a special software layer (middleware) to integrate several independent, and possibly heterogeneous, database
systems. The middleware handles the complexity of executing a query across these different systems, while each system only needs to manage its local data .
Data is physically stored across multiple sites using two primary techniques: fragmentation and replication.
Data Fragmentation: This is the process of splitting a database into smaller logical units called fragments, which can be stored at different sites .
Horizontal Fragmentation: The table is divided by rows. For example, an Orders table could be fragmented by year, with all 2023 orders at Site A and
all 2024 orders at Site B .
Vertical Fragmentation: The table is divided by columns. For example, an Employee table could be split into a fragment with personal info (EmpID,
Name, Address) at one site and another with confidential info (EmpID, Salary, SSN) at a more secure site .
Data Replication: This is the process of storing multiple copies of data fragments at different sites .
Need for Replication: Replication is done to increase availability (if one site fails, data is still accessible) and improve performance (queries can be
executed on a local copy, reducing network delay) .