
Unit #5

Advanced Transaction Processing and Concurrency Control

1] What is a transaction?
➔ A transaction is a sequence of one or more operations that form a logical unit
of work. Transactions ensure the consistency and integrity of a database by
either completing all operations or leaving the database unchanged in case of a
failure. The main operations of a transaction are:
- Read(A): The read operation R(A) reads the value of A from the database and
stores it in a buffer in main memory.
- Write(A): The write operation W(A) writes the value back to the database
from the buffer.
(Note: a write does not always reach the database immediately; it first updates
only the buffer, which is why dirty reads can occur.)

Example: Suppose an employee of a bank transfers Rs 800 from X's account to Y's
account. This small transaction contains several low-level tasks:

X's Account                          Y's Account

Open_Account(X)                      Open_Account(Y)
Old_Balance = X.balance              Old_Balance = Y.balance
New_Balance = Old_Balance - 800      New_Balance = Old_Balance + 800
X.balance = New_Balance              Y.balance = New_Balance
Close_Account(X)                     Close_Account(Y)
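
The same transfer can be written as a single SQL transaction. This is a minimal
sketch, assuming a hypothetical accounts(account_id, balance) table; the exact
transaction syntax differs slightly between database systems:

-- Transfer Rs 800 from account X to account Y as one logical unit of work
START TRANSACTION;
UPDATE accounts SET balance = balance - 800 WHERE account_id = 'X';
UPDATE accounts SET balance = balance + 800 WHERE account_id = 'Y';
COMMIT;      -- both updates become permanent together
-- If any step fails before COMMIT, a ROLLBACK undoes both updates.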

- Properties of a transaction
A transaction has four properties, known as the ACID properties. They are
used to maintain consistency in the database before and after the
transaction. Let's look at each of them:
1] Atomicity: A transaction is treated as a single, indivisible unit of
execution; it is either fully completed or fully aborted. Either all
operations of the transaction take place, or none do. There is no midway,
i.e., the transaction cannot occur partially.
Atomicity involves the following two operations:
- Abort: If a transaction aborts, then all the changes made are not
visible.
- Commit: If a transaction commits, then all the changes made are
visible.
Example: Consider a transaction T consisting of two parts, T1 and T2:
a transfer of 100 from account X to account Y, where T1 deducts 100 from X
and T2 adds 100 to Y.

If the transaction fails after completion of T1 but before completion of
T2 (say, after Write(X) but before Write(Y)), then the amount has been
deducted from X but not added to Y. This results in an inconsistent
database state. Therefore, the transaction must be executed in its
entirety in order to ensure the correctness of the database state.

2] Consistency: This means that integrity constraints must be maintained
so that the database is consistent before and after the transaction. It
refers to the correctness of a database. Referring to the example above,
the total amount before and after the transaction must be maintained.
- Total before T occurs = 500 + 200 = 700.
- Total after T occurs = 400 + 300 = 700.
Therefore, the database is consistent. Inconsistency occurs in case T1
completes but T2 fails. As a result, T is incomplete.

3] Isolation: This property ensures that multiple transactions can occur
concurrently without leading to an inconsistent database state.
Transactions occur independently, without interference. Changes made by a
particular transaction are not visible to any other transaction until that
change has been written to memory or committed. This property ensures that
concurrent execution of transactions results in a state equivalent to one
that would be obtained if they were executed serially in some order.
Let X = 50,000 and Y = 500, so the correct total is X + Y = 50,500.
Consider two transactions T and T''. Suppose T transfers 50 from Y to X
and has executed up to Write(Y) (so Y = 450 has already been written) when
T'' starts. Because of this interleaving of operations, T'' reads the
correct value of X (50,000, not yet updated) but an intermediate value of
Y, and the sum computed by
- T'': (X + Y = 50,000 + 450 = 50,450)
is thus not consistent with the true total:
- (X + Y = 50,000 + 500 = 50,500).
This results in database inconsistency, due to an apparent loss of 50
units. Hence, transactions must take place in isolation, and changes
should be visible only after they have been committed.

4] Durability: This property ensures that once the transaction has
completed execution, its updates and modifications to the database are
written to disk and persist even if a system failure occurs. These updates
become permanent and are stored in non-volatile memory, so the effects of
the transaction are never lost.

Schedule: A schedule is the process of arranging the operations of a group of
transactions into one sequence and executing them in a predefined order. It
is a sequence of execution of operations from various transactions. A
schedule is required because when multiple transactions execute in parallel,
they may affect each other's results; to resolve this, the order of the
transactions is controlled by creating a schedule.
Types of Schedules:
- Serial Schedule: A schedule in which the transactions are defined to
execute one after another is called serial schedule.
- Non-Serial Schedule: A schedule in which the transactions are
defined to execute in any (interleaved) order is called a non-serial schedule.
- Transaction Control:
Following are the commands used for transaction control:
i. COMMIT: The COMMIT command is used to save changes made by the
transaction in the database.
Syntax:
COMMIT;
ii. ROLLBACK: The ROLLBACK command is used to undo changes made by the
transaction that have not yet been committed.
Syntax:
ROLLBACK;

iii. SAVEPOINT: A SAVEPOINT marks a point within a transaction to which the
transaction can later be rolled back, without rolling back the entire
transaction.
Syntax:
SAVEPOINT savepoint_name;
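
A small combined sketch of these commands, assuming a hypothetical orders
table; ROLLBACK TO SAVEPOINT undoes only the work done after the savepoint:

START TRANSACTION;
INSERT INTO orders (order_id, amount) VALUES (1, 500);
SAVEPOINT after_first_order;
INSERT INTO orders (order_id, amount) VALUES (2, 900);
ROLLBACK TO SAVEPOINT after_first_order;  -- undoes only the second insert
COMMIT;                                   -- the first insert is saved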

- Transaction states: These are the states through which a transaction passes
during its lifetime. They describe the current status of the transaction and
how further processing of the transaction will proceed, and they govern the
rules that decide whether the transaction will commit or abort.

The basic states in a simple transaction model are often:


1] Active: The initial state where the transaction is actively performing its
operations.
2] Partially Committed: The transaction has executed all its operations
successfully and is ready to be committed. In this state, the changes made
by the transaction are typically recorded but not yet permanent.
3] Committed: The transaction has been successfully completed, and its
changes are made permanent in the database. This is the point where the
transaction moves from the "partially committed" state to the
"committed" state.
4] Failed: An error occurs during the transaction, and it cannot be
completed.
5] Aborted: If an error occurs during the transaction or if it is explicitly
rolled back, the transaction enters the "aborted" state. In this state, any
changes made by the transaction are undone or rolled back.
6] Terminated: After a transaction has either been committed or aborted,
it enters the "terminated" state. This signifies the end of the transaction's
life cycle.
(NOTE: "aborted" specifically refers to a state where a transaction is rolled back due to an
error or an explicit request. "Failed" is a more general term encompassing various issues
that can prevent successful completion. "Terminated" is a final state indicating the end of a
transaction's life cycle, whether it completed successfully, was aborted, or encountered a
failure.)

- Serializability: Serializability is a concept in database management that
ensures that the execution of multiple transactions produces a result as if
they were executed in some sequential order.

1. Dirty Read Anomaly: A dirty read occurs when one transaction reads a
value that has been modified by another transaction but not yet
committed. If the second transaction is later rolled back, the first
transaction will have read an inconsistent or "dirty" value.
Schedule S1:
T1                          | T2
R(A)                        |
W(A)                        |
                            | R(A)   <- dirty read (reads T1's uncommitted write)
                            | W(A)
                            | commit;
(T1 never commits; it is later rolled back.)

In this scenario, T2 has read a value that was modified by T1 but never
committed. If T1 is rolled back, the read operation in T2 was based on
incomplete or erroneous information.
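
Most SQL systems only permit dirty reads under the weakest isolation level. A
hedged sketch using the standard isolation-level syntax (MySQL and SQL Server
support READ UNCOMMITTED; PostgreSQL accepts the syntax but never actually
exposes dirty reads), with a hypothetical accounts table:

-- Session running T2 (the reader)
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
START TRANSACTION;
SELECT balance FROM accounts WHERE account_id = 'A';  -- may see T1's uncommitted write
COMMIT;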
2. Lost Update Anomaly: A lost update anomaly occurs when two transactions
read the same data item and both update it based on the value they read; the
later update overwrites the earlier one, so the changes made by the first
transaction are lost.
Example: Consider two transactions, T1 and T2:
1. T1 reads a value X.
2. T2 reads the same value X.
3. T1 updates X and commits.
4. T2 updates X (using the stale value it read earlier) and commits,
overwriting T1's changes.

2] Concurrency control and its techniques/protocols


- Concurrency control is a very important concept which ensures that the
simultaneous execution or manipulation of data by several processes or
users does not result in data inconsistency.
- Concurrency control provides a procedure that is able to control
concurrent execution of the operations in the database.
- Concurrency control is provided in a database to:
(i) enforce isolation among transactions.
(ii) preserve database consistency through consistency preserving
execution of transactions.
(iii) resolve read-write and write-read conflicts.
- There are two main types of concurrency control protocols:
• Lock-based concurrency control protocols
• Timestamp-based concurrency control protocols

1] Lock-based concurrency control protocol


A lock is a mechanism that provides exclusive access to a resource for a
transaction. When a transaction acquires a lock on a resource, it means
that it has the right to perform specific operations (read or write) on that
resource, and other transactions may be prevented from accessing or
modifying it until the lock is released.
- Shared Lock (S): Also known as a read-only lock. As the name suggests, it can
be shared between transactions because, while holding this lock, a
transaction may only read the data item and does not have permission to
update it.
- Exclusive Lock (X): The data item can be both read and written. This lock is
exclusive and cannot be held simultaneously by more than one transaction on
the same data item.
- Types of lock-based protocols:
1] Simplistic lock protocol
2] Pre-claiming lock protocol
3] Two-phase locking protocol (2PL)
4] Strict two-phase locking protocol
i] Simplistic Lock Protocol: The simplistic lock-based protocol is a
straightforward approach to concurrency control. Transactions request
and acquire locks on data items before performing any read or write
operations. This protocol may lead to inefficiency and reduced
concurrency, as locks are held for the entire duration of a transaction, even
if they are not needed.

ii] Two-Phase Locking (2PL) Protocol: The Two-Phase Locking Protocol is
a more advanced approach than the simplistic protocol. Transactions are
divided into two phases: the growing phase (acquiring locks) and the
shrinking phase (releasing locks). Once a transaction releases a lock in the
shrinking phase, it cannot acquire any new locks. It may lead to deadlocks
if not managed properly.
iii] Strict Two-Phase Locking Protocol (Strict 2PL): Strict Two-Phase
Locking is an extension of the Two-Phase Locking Protocol. Transactions
follow a more restrictive rule: all exclusive (write) locks are held until
the transaction commits or aborts. This prevents other transactions from
reading uncommitted data, but it increases contention for locks and can
reduce concurrency.

iv] Preclaiming Lock Protocol: The Preclaiming Lock Protocol is a proactive
approach to concurrency control. Transactions request locks in advance,
claiming the full set of locks they need before starting execution. It
requires a good understanding of the locks needed in advance, which may not
be feasible for all applications, and it can be inefficient if lock
requirements change during execution.
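
In SQL, shared and exclusive row locks are usually taken implicitly, but many
systems also let a transaction request them explicitly. A hedged sketch using
PostgreSQL-style locking clauses on a hypothetical accounts table:

START TRANSACTION;
-- Shared (read) lock: other transactions may still read the row,
-- but cannot update or delete it until this transaction ends.
SELECT balance FROM accounts WHERE account_id = 'X' FOR SHARE;
-- Exclusive (write) lock: blocks other transactions from locking,
-- updating, or deleting this row until this transaction ends.
SELECT balance FROM accounts WHERE account_id = 'Y' FOR UPDATE;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 'Y';
COMMIT;   -- under (strict) 2PL, all locks are released here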

2] Timestamp based protocols: The Timestamp-Based Concurrency Control
Protocol is a technique used to manage concurrent access to data in a
database. It assigns a unique timestamp to each transaction and uses these
timestamps to order the transactions, ensuring a consistent and serializable
execution. The Timestamp-Based Protocol helps prevent conflicts and maintain
isolation between transactions.
- Multiple Granularity Locking:
Multiple Granularity Locking is a concurrency control technique that
allows transactions to acquire locks at different levels of granularity
within a database. Granularity refers to the size or scope of the data
items on which locks are applied. By supporting multiple levels of
granularity, this technique provides a balance between concurrency and
contention for locks.
- Levels of Granularity:
• Database Level: Locks at this level cover the entire database.
• Table Level: Locks at this level cover entire tables.
• Page Level: Locks at this level cover individual pages of a table.
• Row Level: Locks at this level cover individual rows within a table.
• Field or Attribute Level: Locks at this level cover individual fields or
attributes within a row.

--------------------------------------------------------------------------------------------
Unit #4
Data Warehousing, Data Mining, and Information Retrieval

1] Centralized Database Architecture: In a centralized database architecture, all
components of the database system are located on a single server. The server is
responsible for managing data storage, processing queries, and handling user
interfaces.
Characteristics:
Simplified Structure: All database components are on a single server, leading to
a simpler overall architecture.
Limited Scalability: The system's capacity to handle increased workloads is
constrained by the capabilities of the central server.
Single Point of Failure: The central server represents a single point of failure, and
if it goes down, the entire database system becomes inaccessible.
Advantages:
Simplicity: Centralized architectures are easier to set up and manage.
Data Consistency: Easier to maintain data consistency since all data resides in
one location.
Disadvantages:
Scalability Issues, Reliability Concerns
2] Client-Server Database Architecture: In a client-server database architecture,
the database functions are divided between a server and multiple clients that
interact with the server over a network. The server manages data storage and
processing, while clients handle user interfaces and application logic.
Characteristics:
Scalability: Easier to scale by adding more servers or upgrading server hardware
to handle increased workloads.
Distributed Processing: The workload is distributed between the server and
multiple clients, allowing for parallel processing.
Improved Reliability: Multiple servers can be employed, reducing the impact of
a single server failure.
Advantages:
Scalability: Can handle larger datasets and increased user loads more effectively.
Resource Utilization: Distributes processing tasks, optimizing resource
utilization.
Fault Tolerance: Redundancy in servers enhances fault tolerance.
Disadvantages:
Complexity: The architecture is more complex compared to a centralized system.
Cost: Setting up and maintaining the infrastructure, including multiple servers,
can be more expensive.

3] Parallel Database Systems: Parallel database systems use multiple processors
or nodes to perform database operations simultaneously. Each processor
operates independently, and the data is partitioned among them.
- Architecture:
• Shared-Nothing Architecture: Each processor has its own memory and
storage, and there is no shared memory between processors.
• Data Partitioning: The database is divided into partitions, and each
processor works on a subset of the data.
- Processing Model:
• Data Parallelism: Operations are divided into smaller tasks that are
performed in parallel on different processors.
• Task Coordination: A central coordinator may assign tasks to different
processors to ensure balanced workloads.
- Advantages:
• Improved Performance: Parallel processing enables faster execution of
queries and data-intensive operations.
- Challenges:
• Load Balancing: Distributing tasks evenly among processors to avoid
performance bottlenecks.
• Coordination Overhead: Managing the synchronization and
communication between processors.

4] Distributed Database Systems: Distributed database systems store data
across multiple nodes or servers that are connected via a network. The data
distribution can be geographically dispersed.
- Architecture:
• Shared-Everything or Shared-Nothing: In shared-everything, all nodes can
access the entire dataset, while in shared-nothing, each node has its own
data.
• Data Replication or Partitioning: Data can be replicated across nodes or
partitioned based on certain criteria.
- Processing Model:
• Task Distribution: Operations can be distributed among nodes based on
data location or task specialization.
• Coordination and Consistency: Ensuring consistency across distributed
nodes, often requiring coordination mechanisms.
- Advantages:
• Fault Tolerance: Redundant copies of data or task distribution contribute
to system resilience.
• Geographical Distribution: Suitable for applications that require data to be
stored and processed in different locations.
- Challenges:
• Data Consistency: Ensuring that all distributed copies of data remain
consistent.
• Network Latency: Communication delays between nodes can impact
performance.
5] ODBC (Open Database Connectivity): ODBC is a standard application
programming interface (API) that enables communication between applications
and database management systems (DBMS).
- Components:
• Driver Manager: Manages ODBC drivers and provides a standard interface
for applications.
• ODBC Driver: Acts as a bridge between the application and the database,
translating ODBC function calls into commands understood by the specific
DBMS.
- Key Features:
• Database Independence: ODBC allows applications to be independent of
the underlying database, as long as there is an ODBC driver for that
database.
• Data Source Name (DSN): DSN is used to define the connection details,
making it easier to configure and manage connections.
- Usage:
• ODBC is widely used in Windows environments and is supported by
various programming languages, including C, C++, and languages that
support COM (Component Object Model).
6] JDBC (Java Database Connectivity): JDBC is a Java-based API that allows Java
applications to interact with relational databases.
- Components:
• JDBC API: Defines a set of Java interfaces and classes for database
connectivity.
• JDBC Driver: A platform-specific implementation of the JDBC interfaces,
allowing Java applications to connect to a particular database.
- Key Features:
• Platform Independence: JDBC provides a platform-independent way to
connect to databases, making it suitable for Java applications running on
different operating systems.
• SQL Support: Allows the execution of SQL queries and statements from
Java applications.
• Transaction Management: JDBC supports transaction management,
allowing applications to control transactions explicitly.
- Usage:
• JDBC is primarily used in Java applications and is a standard part of the
Java Standard Edition (SE) library.
7] Data Warehousing: Data warehousing involves the collection, storage, and
management of large volumes of structured and sometimes unstructured data
from different sources within an organization. The purpose of a data warehouse
is to provide a centralized and consolidated view of data, making it easier for
decision-makers to analyze and derive insights. Key aspects of data warehousing
include:
- Central Repository: A data warehouse serves as a central repository for
historical and current data from various sources, including transactional
databases, logs, and external systems.
- Integration: Data from diverse sources is integrated into a consistent
format within the data warehouse, ensuring uniformity and facilitating
analysis.
- Subject-Oriented: Data warehouses are organized based on subject areas
or themes, enabling users to focus on specific aspects of the business.
- Historical Data: The inclusion of historical data allows for trend analysis,
forecasting, and the identification of patterns over time.
- Query and Reporting: Users can query and generate reports from the data
warehouse, enabling them to make informed decisions based on
comprehensive and integrated information.
- ETL Processes: Extract, Transform, Load (ETL) processes are employed to
extract data from source systems, transform it into the desired format,
and load it into the data warehouse.
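
A minimal sketch of the load step of an ETL process, assuming hypothetical
staging_sales and sales_fact tables; real ETL pipelines typically add data
cleansing and surrogate-key lookups during the transform step:

-- Load: move transformed, aggregated rows from the staging area into the
-- warehouse fact table.
INSERT INTO sales_fact (sale_date, product_id, store_id, total_amount)
SELECT s.sale_date, s.product_id, s.store_id, SUM(s.amount)
FROM staging_sales s
GROUP BY s.sale_date, s.product_id, s.store_id;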

8] Data Mining: Data mining is the process of discovering patterns, relationships,
and valuable information from large datasets using various techniques,
algorithms, and statistical models. The objective of data mining is to uncover
hidden knowledge, extract meaningful patterns, and make predictions or
classifications based on the available data. Key aspects of data mining include:
- Pattern Recognition: Data mining involves identifying patterns, trends, and
associations within the data that may not be immediately apparent.
- Predictive Modeling: Data mining enables the development of predictive
models to forecast future trends or outcomes based on historical data
patterns.
- Classification: Grouping data into predefined categories or classes based
on identified characteristics.
- Clustering: Identifying natural groupings or clusters within the data based
on similarities between data points.
- Association Rule Mining: Discovering relationships and associations
between variables, often used in market basket analysis and
recommendation systems.
- Regression Analysis: Modeling the relationships between variables to
predict or estimate numeric values.
- Anomaly Detection: Identifying outliers or irregularities in the data that
deviate from the expected patterns.
- Decision Support: Data mining provides valuable insights that support
decision-making processes in various domains, such as finance,
marketing, healthcare, and more.

NOTE: In essence, data mining is the analytical process that extracts knowledge
and patterns from data so that decisions can be made based on that knowledge,
while data warehousing provides the infrastructure for storing and managing the
data required for such analysis. This is how a data warehouse supports a
decision support system.

9] Information Retrieval:
- Information retrieval (IR) is the process of obtaining relevant information
from a large repository or database.
- It involves searching for, retrieving, and presenting information in
response to a user's query or request.
- Information retrieval systems are designed to efficiently locate and
deliver relevant documents, files, or resources based on the user's
information needs.
- IR is a crucial component in various applications, including search engines,
databases, digital libraries, and document management systems.
10] Embedded SQL:
- Embedded SQL (Structured Query Language) refers to the integration of
SQL statements directly within the code of a host programming language,
such as C, C++, COBOL, or Java. These statements often use specific syntax
to differentiate them from the host language code.
- It allows developers to embed database operations directly in their
application code, facilitating seamless interaction between the application
and the database.
- The primary goal of embedded SQL is to combine the power of a
programming language with the data manipulation capabilities of SQL,
streamlining database access within application logic.
- Before the application code is compiled, it undergoes a preprocessing
stage where the embedded SQL statements are recognized and
processed. During this stage, the SQL statements are separated, and
necessary transformations are applied.
================================================================
Unit #3
Advanced SQL, Relational Database Design and Data Normalization

1] Functions and Procedures:


- Functions: A function is a named, reusable block of code that performs a
specific task and often returns a value. In database systems, functions are
commonly used to encapsulate logic that can be applied to data stored in the
database.
Example: Let's say we want to create a function that calculates the total salary
for an employee based on their basic salary and any bonuses they might have:
-- Creating a function
CREATE FUNCTION CalculateTotalSalary (basicSalary INT, bonus INT)
RETURNS INT
BEGIN
DECLARE total INT;
SET total = basicSalary + bonus;
RETURN total;
END;
-- Using the function
SELECT employee_id, CalculateTotalSalary(basic_salary, bonus) AS total_salary
FROM employees;
- Procedures: A procedure is a named set of SQL statements that can be stored
and executed on demand. Unlike functions, procedures may not return a value
directly. They are often used to perform operations or actions within the
database.
Example: Let's create a simple procedure that updates the job title of an
employee based on their years of service:
-- Creating a procedure
CREATE PROCEDURE UpdateJobTitle (emp_id INT, years_of_service INT)
BEGIN
-- Promote the employee to 'Senior ...' only after 5 or more years of service
IF years_of_service >= 5 THEN
UPDATE employees
SET job_title = CONCAT('Senior ', job_title)
WHERE employee_id = emp_id;  -- parameter renamed so it does not shadow the column
END IF;
END;
-- Calling the procedure
CALL UpdateJobTitle(101, 7);

2] Triggers and Views:


- Triggers: Triggers are special types of stored procedures that automatically
execute in response to certain events on a particular table or view. These events
can include INSERT, UPDATE, DELETE, and other database-related actions.
Triggers are often used to enforce business rules, maintain data integrity, or
perform additional actions when changes are made to a table.
- Triggers cannot be executed/invoked manually.
- Triggers cannot receive parameters.
- A transaction cannot be committed or rolled back inside a trigger.
Syntax:
CREATE TRIGGER trigger_name
BEFORE/AFTER INSERT/UPDATE/DELETE
ON table_name
FOR EACH ROW
BEGIN
-- statements to execute when the trigger fires
END;
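
A concrete hedged example in MySQL-style syntax, assuming hypothetical
employees and salary_audit tables (inside the MySQL client a DELIMITER change
is also needed around the trigger body):

CREATE TRIGGER trg_salary_audit
AFTER UPDATE ON employees
FOR EACH ROW
BEGIN
-- OLD and NEW hold the row values before and after the update
IF OLD.salary <> NEW.salary THEN
INSERT INTO salary_audit (employee_id, old_salary, new_salary, changed_on)
VALUES (OLD.employee_id, OLD.salary, NEW.salary, NOW());
END IF;
END;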
- Views: Views are virtual tables derived from one or more base tables or other
views. They allow you to present data in a structured way, abstracting the
underlying complexity of the database schema. Views are useful for simplifying
queries, enhancing security by controlling access to specific columns, and
encapsulating complex SQL logic.
Syntax:
CREATE VIEW view_name AS
SELECT column1, column2, ...
FROM table_name
WHERE condition;
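
A concrete hedged example, assuming a hypothetical employees table; the view
can then be queried like an ordinary table:

CREATE VIEW senior_employees AS
SELECT employee_id, name, job_title
FROM employees
WHERE years_of_service >= 5;

-- Query the view as if it were a table
SELECT * FROM senior_employees WHERE job_title LIKE 'Senior%';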

Comparison of Views, Triggers, and Stored Procedures:
- Definition: Views are virtual tables based on queries; triggers are automated
actions executed in response to specific events; stored procedures are
precompiled sets of SQL statements.
- Execution: Views are explicitly queried by users; triggers are implicitly
executed on specific events; stored procedures are explicitly called by users
or applications.
- Purpose: Views simplify complex queries; triggers automate actions on specific
events and enforce rules; stored procedures encapsulate and modularize SQL code.
- Usage: Views provide a virtual table for simplified querying; triggers respond
to data changes or events and maintain data integrity; stored procedures
encapsulate business logic and improve code organization.
- Access control: Views determine which columns and rows users can see; triggers:
N/A (they usually involve data modification); stored procedures depend on SQL
permissions and access control.
- Transaction control: Views have no direct impact on transactions; triggers may
include transaction control statements (e.g., COMMIT, ROLLBACK); stored
procedures can include transaction control statements to ensure atomicity.
- Parameters: Views do not accept parameters; triggers do not accept parameters;
stored procedures can accept input parameters, improving flexibility.
- Event triggering: Views are triggered by SELECT operations; triggers are
triggered by specific events such as INSERT, UPDATE, DELETE; stored procedures
are explicitly called when needed.
- Timing: Views: N/A; triggers can be classified as BEFORE or AFTER triggers;
stored procedures are executed when explicitly called.
- Implicit variables: Views: N/A; triggers have access to OLD and NEW values
representing the row before and after the change; stored procedures: N/A.
3] Normalization:
- It is the process of organizing the data in the database. Normalization is
used to minimize the redundancy from a relation or set of relations.
- Normalization divides a larger table into smaller tables and links them using
relationships. The normal forms are used to reduce redundancy in the
database tables.
- The main reason for normalizing the relations is to eliminate undesirable
characteristics like Insertion, Update, and Deletion anomalies. Failure to
eliminate anomalies leads to data redundancy and can cause data
integrity issues and other problems as the database grows.
- Normalization consists of a series of guidelines that helps to guide you in
creating a good database structure.
• Insertion Anomaly: An insertion anomaly occurs when one cannot insert a
new tuple into a relation due to a lack of data.
• Deletion Anomaly: A deletion anomaly refers to the situation where the
deletion of data results in the unintended loss of some other important
data.
• Update Anomaly: An update anomaly occurs when an update of a single
data value requires multiple rows of data to be updated.

4] Functional Dependency:
- In a relational database, a functional dependency exists between two sets
of attributes in a table when the value of one set of attributes uniquely
determines the value of another set. This concept is fundamental to
database normalization and helps maintain data integrity.
- In a functional dependency X→Y, X is called the determinant, and Y is
called the dependent. It means that for every unique combination of X,
there is only one corresponding value of Y.
- Types of Functional Dependencies:
• Trivial Functional Dependency: A functional dependency X→Y is
trivial if Y is a subset of X.
• Non-Trivial Functional Dependency: A functional dependency X→Y
is non-trivial if Y is not a subset of X.
- Attribute Closure: The attribute closure is a set of attributes that can be
functionally determined from a given set of attributes.
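
A short worked example of attribute closure, for a hypothetical relation
R(A, B, C, D) with functional dependencies F = {A → B, B → C}:
{A}+ = {A}          (start with the given attributes)
     = {A, B}       (A → B adds B)
     = {A, B, C}    (B → C adds C)
Since D is not in {A}+, A by itself is not a super key of R; however,
{A, D}+ = {A, B, C, D}, so {A, D} is a super key (in fact a candidate key).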
5] Normal Form:
i] First Normal Form (1NF)
- A relation is in 1NF if every attribute contains only atomic values.
- It states that an attribute of a table cannot hold multiple values; it must
hold only single-valued attributes.
- First normal form disallows multi-valued attributes, composite
attributes, and their combinations.
- Example: A table in which one column stores several phone numbers for a
student violates 1NF; storing one phone number per row (for example in a
separate table) restores 1NF, as sketched below.
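
A minimal DDL sketch of the 1NF design, assuming hypothetical student tables:

-- Violates 1NF: one column holds several phone numbers, e.g.
-- (1, 'Asha', '98200 11111, 98200 22222')

-- 1NF design: every attribute value is atomic
CREATE TABLE student (
roll_no INT PRIMARY KEY,
name    VARCHAR(50)
);
CREATE TABLE student_phone (
roll_no INT,
phone   VARCHAR(15),
PRIMARY KEY (roll_no, phone),
FOREIGN KEY (roll_no) REFERENCES student(roll_no)
);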
ii] Second Normal Form (2NF)
- For 2NF, the relation must be in 1NF.
- In the second normal form, all non-key attributes must be fully functionally
dependent on the primary key.
iii] Third Normal Form (3NF)
- A relation is in 3NF if it is in 2NF and contains no transitive dependency
of a non-prime attribute on the primary key.
- 3NF is used to reduce data duplication and to achieve data integrity.
- Equivalently, a relation is in third normal form if it satisfies at least one
of the following conditions for every non-trivial functional dependency X → Y:
• X is a super key.
• Y is a prime attribute, i.e., each element of Y is part of some
candidate key.

o Super Key: A super key is a set of one or more attributes (columns) that,
taken collectively, can uniquely identify a tuple (row) in a table. It is a
superset of a candidate key, meaning it may include additional attributes
beyond what is necessary for uniqueness.
o Candidate Key: A candidate key is a minimal super key, meaning it is a set
of attributes that uniquely identifies each tuple, and no subset of these
attributes has the same uniqueness property. In other words, removing
any attribute from a candidate key would result in the loss of uniqueness.
o Primary Key: A primary key is a selected candidate key that is chosen to
uniquely identify each tuple in a table. It is a special type of candidate key,
and by definition, it must be unique and non-null for each tuple in the
table. A primary key is unique and NOT NULL.
{super keys are sets of attributes that uniquely identify tuples, candidate keys are minimal super keys, and a
primary key is a selected candidate key used to uniquely identify tuples within a relational table.}
iv] Boyce-Codd Normal Form (BCNF)
- BCNF is an advanced version of 3NF; it is stricter than 3NF.
- A table is in BCNF if, for every functional dependency X → Y, X is a super key
of the table.
- For BCNF, the table should be in 3NF, and for every FD, the LHS must be a
super key.

v] Fourth Normal Form (4NF)
- A relation is in 4NF if it is in Boyce-Codd normal form and has no
multi-valued dependency.
- For a dependency A → B, if for a single value of A multiple values of B
exist, then the relation has a multi-valued dependency.

o Multivalued Dependency: It occurs when two attributes in a table are
independent of each other but both depend on a third attribute.
A multivalued dependency involves at least two attributes that are
dependent on a third attribute, which is why it always requires at least
three attributes.

6] Database Design: Database design is a systematic process that involves
defining the structure of a database to meet the specific needs of an
organization. The design process typically consists of three phases:
i] Initial Phase:
- Requirements Analysis:
Identify and document the information requirements of the organization
or application. This involves understanding the data that needs to be
stored, the relationships between different pieces of information, and the
types of queries or transactions that will be performed.
- User Interviews:
Engage with stakeholders, including end-users, to gather insights into their
needs, expectations, and the nature of the data they work with. This helps
in understanding the user perspective and ensures that the database
design aligns with actual user needs.
- Data Collection:
Collect existing data sources, if any, that will be used in the database. This
may involve studying existing documents, forms, reports, and data files to
understand the current data landscape.
- Initial Problem Statement:
Develop an initial problem statement outlining the goals and objectives of
the database design project based on the gathered requirements.

ii] Conceptual Phase:


- Entity-Relationship Modeling:
Create an Entity-Relationship Diagram (ERD) to represent the entities
(objects), relationships, and attributes in the system. This diagram serves
as a high-level abstraction of the data model and helps in visualizing the
structure of the database.
- Normalization:
Identify and normalize the data to eliminate redundancy and achieve a
more organized and efficient database structure. This involves applying
normalization rules to ensure that the database is in at least the third
normal form (3NF).
- Define Constraints:
Specify any constraints on the data, such as primary keys, foreign keys, and
unique constraints (see the DDL sketch after this list). Constraints ensure
data integrity and enforce rules for maintaining the consistency of the
database.
- Review and Feedback:
Collaborate with stakeholders, including users and database
administrators, to review the conceptual design. Collect feedback and
make adjustments based on the input received.
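
A hedged DDL sketch of such constraints, using hypothetical department and
employee tables:

CREATE TABLE department (
dept_id   INT PRIMARY KEY,                  -- primary key constraint
dept_name VARCHAR(50) UNIQUE NOT NULL       -- unique and not-null constraints
);
CREATE TABLE employee (
emp_id  INT PRIMARY KEY,
dept_id INT NOT NULL,
salary  DECIMAL(10,2) CHECK (salary >= 0),  -- check constraint
FOREIGN KEY (dept_id) REFERENCES department(dept_id)  -- foreign key constraint
);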

iii] Final Phase:


- Physical Database Design:
Translate the conceptual design into a physical design that specifies how
the database will be implemented on the chosen database management
system (DBMS). This includes defining data types, indexes, and storage
structures.
- Implementation of Security Measures:
Incorporate security measures to control access to the database. This may
involve defining user roles, permissions, and authentication mechanisms
to ensure data privacy and integrity.
- Data Loading:
Populate the database with initial data, either through manual entry or
data migration processes.
- Testing:
Conduct thorough testing of the database to ensure that it functions as
intended. This includes testing for data accuracy, query performance, and
overall system stability.
- Documentation:
Create comprehensive documentation that includes data dictionaries,
schema diagrams, and any other relevant information for future reference
and maintenance.
- Deployment:
Deploy the database to the production environment, making it accessible
for end-users. Monitor the system during the initial period to address any
issues that may arise.
---------------------------------------------------------------------------------------------------------
Unit #1
Introduction to Database System and E-R Models

1] Database architecture and its components:


i] Query processing unit (QPU):
- A Query Processing Unit (QPU) is a component within a database
management system (DBMS) responsible for executing queries and
retrieving the requested data from the database.
- Query processing involves a series of steps that transform a high-level
query written by a user or an application into an efficient plan for data
retrieval.
- The primary goal of the Query Processing Unit is to optimize query
execution, minimize response time, and utilize resources effectively. The
typical steps in query processing include:
• Query parsing: The QPU starts by parsing the user's query, breaking
it down into its constituent parts, and checking its syntax and
semantics for correctness.
• Query optimization: The QPU generates multiple execution plans
for the query and evaluates their cost estimates. The goal is to
choose the most efficient plan that minimizes resource usage and
reduces the overall execution time.
• Query compilation: Once the optimal execution plan is determined,
the QPU compiles the query into a form that the database engine
can efficiently execute.
• Execution: During execution, the QPU interacts with other
components, such as the storage manager and buffer manager, to
retrieve and manipulate the data.
• Data retrieval: The QPU retrieves the required data from the
storage structures, such as tables or indexes, based on the
execution plan.
• Result formation: The retrieved data is processed and formatted
according to the query requirements.
• Transaction management: The QPU ensures that the query
execution is consistent with the database's transactional
requirements. This involves managing locks, handling rollbacks in
case of errors, and maintaining the overall integrity of the database.
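
Most database systems let you inspect the execution plan the optimizer chooses.
A hedged example using the EXPLAIN keyword (supported, with differing output
formats, by MySQL and PostgreSQL among others) on hypothetical employee and
department tables:

-- Show the chosen execution plan instead of (or before) running the query
EXPLAIN
SELECT e.name, d.dept_name
FROM employee e
JOIN department d ON e.dept_id = d.dept_id
WHERE d.dept_name = 'Sales';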
ii] Storage manager:
- The storage manager is a crucial component in the architecture of a
Database Management System (DBMS), responsible for managing the
storage of data on the physical storage media.
- It interacts with other components to ensure efficient and secure data
storage and retrieval.
- The key storage manager components include:
• Buffer Manager: The buffer manager is responsible for managing a
buffer or cache in the system's memory. It stores a subset of data
from the database, providing faster access to frequently used data.
It reads data from disk into the buffer when needed and writes
modified data back to disk.
• File Manager: The file manager is responsible for managing the
physical storage of data on disk. It interacts with the operating
system's file system to read and write data files. It manages space
allocation within data files and also handles file creation, deletion,
and modification.
• Authorization and Integrity Manager: The authorization and
integrity manager enforces security measures and ensures data
integrity within the database. It controls user access, enforces
rules to maintain data consistency, and implements access control
mechanisms to restrict user privileges.
• Transaction Manager: The transaction manager ensures the
atomicity, consistency, isolation, and durability (ACID properties) of
database transactions. It controls concurrent access to data,
preventing conflicts and ensuring data integrity.
iii] Disk storage:
- The disk storage component in database architecture refers to the physical
storage of data on non-volatile storage devices, typically hard disk drives
or solid-state drives.
- This component plays a critical role in the persistent storage of database
files, including data files, index files, and other structures.
- Let's explore key components related to disk storage in a database system:
• Data Files: Data files store the actual data of the database, including
tables, records, and attributes. These files hold the persistent information
that is retrieved and manipulated by database operations.
• Index Files: Index files are structures that enhance data retrieval efficiency
by providing a quick lookup mechanism. They contain key-value pairs,
where the keys are pointers to the actual data in the data files.
• Data Dictionary: The data dictionary (or metadata repository) is a central
repository that stores metadata, which describes the structure of the
database. It includes information about tables, columns, data types,
constraints, and relationships.
• Statistical Data: Statistical data provides summary information about the
distribution and characteristics of the data in the database. It aids the
query optimizer in making informed decisions about query execution
plans.
