
Advanced Databases and Mining

UNIT II: Transaction Processing: Consistency, Atomicity, Isolation and Durability, Serializable
Schedule, Recoverable Schedule, Concurrency Control, Time-stamp based protocols,
Isolation Levels, Online Analytical Processing,
Database Performance Tuning and Query Optimization: Query Tree, Cost of Query, Join,
Selection and Projection Implementation Algorithms and Optimization; Database Security:
Access Control, MAC, RBAC, Authorization, SQL Injection Attacks.

Transaction Processing: Consistency, Atomicity, Isolation and Durability, Serializability

In the world of Database Management Systems (DBMS), transactions
are fundamental operations that allow us to modify and retrieve data.
However, to ensure the integrity of a database, it is important that these
transactions are executed in a way that maintains consistency,
correctness, and reliability. This is where the ACID properties come into
play.
ACID stands for Atomicity, Consistency, Isolation, and Durability.
These four key properties define how a transaction should be processed
in a reliable and predictable manner, ensuring that the database remains
consistent, even in cases of failures or concurrent accesses.

What Are Transactions in DBMS?


A transaction in DBMS refers to a sequence of operations performed as
a single unit of work. These operations may involve reading or writing data
to the database. To maintain data integrity, DBMS ensures that each
transaction adheres to the ACID properties. Think of a transaction like an
ATM withdrawal. When we withdraw money from our account, the
transaction involves several steps:
 Checking your balance.
 Deducting the money from your account.
 Adding the money to the bank's record.
For the transaction to be successful, all steps must be completed. If any
part of this process fails (e.g., if there’s a system crash), the entire
transaction should fail, and no data should be altered. This ensures the
database remains in a consistent state.
The Four ACID Properties
1. Atomicity: "All or Nothing"
Atomicity ensures that a transaction is atomic: either the entire
transaction completes fully or it does not execute at all. There is no in-
between state; transactions do not occur partially. If a transaction has
multiple operations and one of them fails, the whole transaction is rolled
back, leaving the database unchanged. This avoids partial updates that
can lead to inconsistency.
 Commit: If the transaction is successful, the changes are permanently
applied.
 Abort/Rollback: If the transaction fails, any changes made during the
transaction are discarded.
Example: Consider the following transaction T consisting of two parts, T1
and T2: a transfer of $100 from account X to account Y, where T1 debits
$100 from X and T2 credits $100 to Y.
If the transaction fails after completion of T1 but before completion of T2 ,
the database would be left in an inconsistent state. With Atomicity, if any
part of the transaction fails, the entire process is rolled back to its original
state, and no partial changes are made.
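A minimal sketch of this all-or-nothing behaviour in Python using the standard sqlite3 module (the accounts table, column names, and starting balances are illustrative assumptions, not taken from the text): both updates are committed together, or neither is applied.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("X", 500), ("Y", 200)])
conn.commit()

try:
    with conn:  # one transaction: commit on success, rollback on any error
        conn.execute("UPDATE accounts SET balance = balance - 100 WHERE name = 'X'")  # T1
        conn.execute("UPDATE accounts SET balance = balance + 100 WHERE name = 'Y'")  # T2
except sqlite3.Error:
    pass  # the rollback already happened; neither balance was changed

print(dict(conn.execute("SELECT name, balance FROM accounts")))  # {'X': 400, 'Y': 300}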
2. Consistency: Maintaining Valid Data States
Consistency ensures that a database remains in a valid state before and
after a transaction. It guarantees that any transaction will take the
database from one consistent state to another, maintaining the rules and
constraints defined for the data. In simple terms, a transaction should only
take the database from one valid state to another. If a transaction violates
any database rules or constraints, it should be rejected, ensuring that only
consistent data exists after the transaction.
Example: Suppose the sum of all balances in a bank system should
always be constant. Before a transfer, the total balance is $700. After the
transaction, the total balance should remain $700. If the transaction fails
in the middle (like updating one account but not the other), the system
should maintain its consistency by rolling back the transaction.
Total before T occurs = 500 + 200 = 700 .
Total after T occurs = 400 + 300 = 700 .

3. Isolation: Ensuring Concurrent Transactions Don't Interfere


This property ensures that multiple transactions can occur concurrently
without leading to the inconsistency of the database state. Transactions
occur independently without interference. Changes occurring in a
particular transaction will not be visible to any other transaction until that
particular change in that transaction is written to memory or has been
committed.
This property ensures that when multiple transactions run at the same
time, the result will be the same as if they were run one after another in a
specific order. This property prevents issues such as dirty reads (reading
uncommitted data), non-repeatable reads (data changing between two
reads in a transaction), and phantom reads (new rows appearing in a
result set after the transaction starts).
Example: Consider two transactions T and T'', with initial
values X = 500 and Y = 500.
Transaction T:
 T wants to transfer $50 from X to Y.
 T reads X (value: 500), deducts $50 from X (new X = 450), and adds
$50 to Y (new Y = 550).
Transaction T'':
 T'' runs concurrently, reads X and Y, and calculates the sum X + Y.
If T'' reads X (500) before T has debited it and reads Y (550) after T has
credited it, it computes 500 + 550 = 1050, even though the correct total is
always 1000 (500 + 500 before T, and 450 + 550 after T). Isolation ensures
that T'' does not see such a partial, intermediate state of T. Both
transactions should be independent, and T'' should only see either the
state before T started or the state after T commits. This prevents
inconsistent results such as the incorrect sum calculated by T''.
4. Durability: Persisting Changes
This property ensures that once the transaction has completed execution,
the updates and modifications to the database are stored in and written to
disk and they persist even if a system failure occurs. These updates now
become permanent and are stored in non-volatile memory. In the event
of a failure, the DBMS can recover the database to the state it was in after
the last committed transaction, ensuring that no data is lost.
Example: After successfully transferring money from Account A to
Account B, the changes are stored on disk. Even if there is a crash
immediately after the commit, the transfer details will still be intact when
the system recovers, ensuring durability.
How ACID Properties Impact DBMS Design and Operation
The ACID properties, in totality, provide a mechanism to ensure the
correctness and consistency of a database in a way such that each
transaction is a group of operations that acts as a single unit, produces
consistent results, acts in isolation from other operations, and updates
that it makes are durably stored.
1. Data Integrity and Consistency
ACID properties safeguard the data integrity of a DBMS by ensuring that
transactions either complete successfully or leave no trace if interrupted.
They prevent partial updates from corrupting the data and ensure that
the database transitions only between valid states.
2. Concurrency Control
ACID properties provide a solid framework for managing concurrent
transactions. Isolation ensures that transactions do not interfere with
each other, preventing data anomalies such as lost updates, temporary
inconsistency, and uncommitted data.
3. Recovery and Fault Tolerance
Durability ensures that even if a system crashes, the database can
recover to a consistent state. Thanks to
the Atomicity and Durability properties, if a transaction fails midway, the
database remains in a consistent state.
Property          Responsible for maintaining the property
Atomicity         Transaction Manager
Consistency       Application programmer
Isolation         Concurrency Control Manager
Durability        Recovery Manager

Advantages of ACID Properties in DBMS


1. Data Consistency: ACID properties ensure that the data remains
consistent and accurate after any transaction execution.
2. Data Integrity: It maintains the integrity of the data by ensuring that
any changes to the database are permanent and cannot be lost.
3. Concurrency Control: ACID properties help to manage multiple
transactions occurring concurrently by preventing interference between
them.
4. Recovery: ACID properties ensure that in case of any failure or crash,
the system can recover the data up to the point of failure or crash.
Disadvantages of ACID Properties in DBMS
1. Performance Overhead: ACID properties can introduce performance
costs, especially when enforcing isolation between transactions or
ensuring atomicity.
2. Complexity: Maintaining ACID properties in distributed systems (like
microservices or cloud environments) can be complex and may require
sophisticated solutions like distributed locking or transaction
coordination.
3. Scalability Issues: ACID properties can pose scalability challenges,
particularly in systems with high transaction volumes, where traditional
relational databases may struggle under load.

Serializable Schedule, Recoverable Schedule, Concurrency Control, Time-stamp
based Protocols, Isolation Levels, Online Analytical Processing
Types of Schedules in DBMS
Scheduling, as the name suggests, is the process of lining up transactions
and executing them one by one. When multiple transactions are running
concurrently and the order of operations needs to be set so that the
operations do not overlap each other, scheduling is brought into play and
the transactions are timed accordingly. The basics of transactions and
schedules are discussed in the Concurrency Control and Transaction
Isolation Levels in DBMS articles. Here we will discuss the various types
of schedules.

1. Serial Schedules: Schedules in which the transactions are executed
non-interleaved, i.e., no transaction starts until the currently running
transaction has ended, are called serial schedules. Example: Consider the
following schedule involving two transactions T1 and T2.
T1              T2
R(A)
W(A)
R(B)
W(B)
                R(A)
                R(B)
where R(A) denotes that a read operation is performed on data item A.
This is a serial schedule, since the transactions execute serially in the
order T1 —> T2.
2. Non-Serial Schedule: This is a type of scheduling where the
operations of multiple transactions are interleaved. This might give rise
to concurrency problems. The transactions are executed in a
non-serial manner, while keeping the end result correct and the same as
that of a serial schedule. Unlike a serial schedule, where one transaction
must wait for another to complete all its operations, in a non-serial
schedule a transaction proceeds without waiting for the previous
transaction to complete. A non-serial schedule can be divided further into
Serializable and Non-Serializable schedules.
1. Serializable: Serializability is used to maintain the consistency of
the database. It is mainly applied to non-serial schedules to verify
whether the schedule will lead to any inconsistency or not. A serial
schedule, on the other hand, does not need a serializability check,
because it starts a transaction only when the previous transaction
is complete. A non-serial schedule is said to be serializable only
when it is equivalent to some serial schedule of the same n
transactions. Since concurrency is allowed in this case, multiple
transactions can execute concurrently, and a serializable schedule
helps in improving both resource utilization and CPU
throughput. These are of two types:
1. Conflict Serializable: A schedule is called conflict serializable if it
can be transformed into a serial schedule by swapping non-
conflicting operations. Two operations are said to be conflicting if
all of the following conditions are satisfied:
 They belong to different transactions
 They operate on the same data item
 At least one of them is a write operation
2. View Serializable: A schedule is called view serializable if it is view-
equivalent to a serial schedule (one with no overlapping
transactions). Every conflict-serializable schedule is also view
serializable, but a view-serializable schedule that contains blind
writes may not be conflict serializable.
2. Non-Serializable: The non-serializable schedule is divided into two
types, Recoverable and Non-recoverable Schedule.
1. Recoverable Schedule: Schedules in which transactions commit
only after all transactions whose changes they read commit are
called recoverable schedules. In other words, if some transaction
Tj reads a value updated or written by some other transaction
Ti, then the commit of Tj must occur after the commit of
Ti. Example: Consider the following schedule involving two
transactions T1 and T2.
T1              T2
R(A)
W(A)
                W(A)
                R(A)
commit
                commit
This is a recoverable schedule, since T1 commits before T2, which
makes the value read by T2 correct. There can be three types of
recoverable schedule:
1. Cascading Schedule: When a failure in one transaction leads to
the rolling back or aborting of other dependent transactions, such
scheduling is referred to as cascading rollback or cascading abort.
Example: if T2 has read a value written by an uncommitted
transaction T1 and T1 later aborts, T2 must also be rolled back.
2. Cascadeless Schedule: Schedules in which transactions read
values only after all transactions whose changes they are
going to read have committed are called cascadeless schedules.
This avoids a situation where a single transaction abort leads to a
series of transaction rollbacks. A strategy to prevent cascading
aborts is to disallow a transaction from reading uncommitted
changes of another transaction in the same schedule. In other
words, if some transaction Tj wants to read a value updated or
written by some other transaction Ti, then Tj must read it only
after Ti commits. Example: Consider the following schedule
involving two transactions T1 and T2.
T1              T2
R(A)
W(A)
                W(A)
commit
                R(A)
                commit
This schedule is cascadeless, since the updated value of A is
read by T2 only after the updating transaction T1 commits.
Example: Consider the following schedule involving two
transactions T1 and T2.
T1              T2
R(A)
W(A)
                R(A)
                W(A)
abort
                abort
This is a recoverable schedule, but it does not avoid cascading
aborts. It can be seen that if T1 aborts, T2 will have to be
aborted too in order to maintain the correctness of the
schedule, as T2 has already read the uncommitted value
written by T1.
3. Strict Schedule: A schedule is strict if, for any two
transactions Ti and Tj, whenever a write operation of Ti precedes
a conflicting operation of Tj (either read or write), the
commit or abort event of Ti also precedes that conflicting
operation of Tj. In other words, Tj can read or write a value
updated or written by Ti only after
Ti commits or aborts. Example: Consider the following schedule
involving two transactions T1 and T2.
T1              T2
R(A)
                R(A)
W(A)
commit
                W(A)
                R(A)
                commit
This is a strict schedule, since T2 reads and writes A (which is
written by T1) only after the commit of T1.
2. Non-Recoverable Schedule: Example: Consider the following
schedule involving two transactions T1 and T2.
T1              T2
R(A)
W(A)
                W(A)
                R(A)
                commit
abort
T2 read the value of A written by T1 and committed. T1 later
aborted; therefore the value read by T2 is wrong, but since
T2 had already committed, this schedule is non-recoverable.
Note - It can be seen that:
1. Cascadeless schedules are stricter than recoverable schedules, i.e., they
are a subset of recoverable schedules.
2. Strict schedules are stricter than cascadeless schedules, i.e., they are a
subset of cascadeless schedules.
3. Serial schedules satisfy the constraints of all recoverable, cascadeless
and strict schedules and hence are a subset of strict schedules.
The relation between the various types of schedules can therefore be
depicted as: serial ⊂ strict ⊂ cascadeless ⊂ recoverable ⊂ all schedules.

Example:
Consider the following schedule:
S:R1(A), W2(A), Commit2, W1(A), W3(A), Commit3, Commit1
Which of the following is true?
(A) The schedule is a view serializable schedule and a strict recoverable schedule.
(B) The schedule is a non-serializable schedule and a strict recoverable schedule.
(C) The schedule is a non-serializable schedule and is not a strict recoverable schedule.
(D) The schedule is a serializable schedule and is not a strict recoverable schedule.
Solution:
The schedule can be re-written as:
T1              T2              T3
R(A)
                W(A)
                Commit
W(A)
                                W(A)
                                Commit
Commit

First of all, it is a view serializable schedule, as it has a view-equal serial
schedule T1 —> T2 —> T3 which satisfies the initial read, the updated
read, and the final write on variable A required for view serializability.
Now we can see there is a write-write pair performed by transaction T1
followed by T3, which violates the above-mentioned condition for strict
schedules: T3 is supposed to perform its write operation only after T1
commits, and this is violated in the given schedule. Hence the given
schedule is serializable but not strict recoverable.
So, option (D) is correct.
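As a companion to the serializability discussion above, here is a minimal Python sketch (not part of the original text) of the standard conflict-serializability test: build a precedence graph over the transactions and check it for cycles. The encoding of a schedule as (transaction, operation, item) triples is an illustrative assumption.

def conflict_serializable(schedule):
    """schedule: list of (txn, op, item) triples, e.g. ('T1', 'R', 'A')."""
    # Build the precedence graph: an edge Ti -> Tj exists when an operation
    # of Ti conflicts with a later operation of Tj (different transactions,
    # same data item, at least one of the two operations is a write).
    edges = set()
    for i, (ti, op_i, x_i) in enumerate(schedule):
        for tj, op_j, x_j in schedule[i + 1:]:
            if ti != tj and x_i == x_j and "W" in (op_i, op_j):
                edges.add((ti, tj))

    # The schedule is conflict serializable iff this graph has no cycle.
    nodes = {t for t, _, _ in schedule}
    visiting, done = set(), set()

    def has_cycle(node):
        visiting.add(node)
        for a, b in edges:
            if a == node and (b in visiting or (b not in done and has_cycle(b))):
                return True
        visiting.discard(node)
        done.add(node)
        return False

    return not any(has_cycle(n) for n in nodes if n not in done)

# The serial schedule T1 -> T2 shown earlier is trivially conflict serializable:
serial = [("T1", "R", "A"), ("T1", "W", "A"), ("T1", "R", "B"),
          ("T1", "W", "B"), ("T2", "R", "A"), ("T2", "R", "B")]
print(conflict_serializable(serial))   # True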
Timestamp based Concurrency Control
Timestamp-based concurrency control is a method used in database
systems to ensure that transactions are executed safely and consistently
without conflicts, even when multiple transactions are being processed
simultaneously. This approach relies on timestamps to manage and
coordinate the execution order of transactions. Refer to the timestamp of
a transaction T as TS(T).
What is Timestamp Ordering Protocol?
The Timestamp Ordering Protocol is a method used in database systems
to order transactions based on their timestamps. A timestamp is a unique
identifier assigned to each transaction, typically determined using the
system clock or a logical counter. Transactions are executed in the
ascending order of their timestamps, ensuring that older transactions get
higher priority.
For example:
 If Transaction T1 enters the system first, it gets a timestamp TS(T1) =
007 (assumption).
 If Transaction T2 enters after T1, it gets a timestamp TS(T2) = 009
(assumption).
This means T1 is "older" than T2 and T1 should execute before T2 to
maintain consistency.
Key Features of Timestamp Ordering Protocol:
Transaction Priority:
 Older transactions (those with smaller timestamps) are given higher
priority.
 For example, if transaction T1 has a timestamp of 007 and
transaction T2 has a timestamp of 009, T1 will execute first as it
entered the system earlier.
Early Conflict Management:
 Unlike lock-based protocols, which manage conflicts during execution,
timestamp-based protocols start managing conflicts as soon as a
transaction is created.
Ensuring Serializability:
 The protocol ensures that the schedule of transactions is serializable.
This means the transactions can be executed in an order that is
logically equivalent to their timestamp order.
Basic Timestamp Ordering

[Figure: Precedence graph for TS ordering]


The Basic Timestamp Ordering (TO) Protocol is a method in database
systems that uses timestamps to manage the order of transactions. Each
transaction is assigned a unique timestamp when it enters the system,
ensuring that all operations follow a specific order and making the
schedule conflict-serializable and deadlock-free.
 Suppose an old transaction Ti has timestamp TS(Ti); a new
transaction Tj is assigned timestamp TS(Tj) such that TS(Ti) < TS(Tj).
 The protocol manages concurrent execution such that the timestamps
determine the serializability order.
 The timestamp ordering protocol ensures that any conflicting read and
write operations are executed in timestamp order.
 Whenever some transaction T tries to issue an R_item(X) or a
W_item(X) operation, the Basic TO algorithm compares the timestamp
of T with R_TS(X) and W_TS(X) to ensure that the timestamp order is
not violated.
This describes the Basic TO protocol in the following two cases:
Whenever a transaction T issues a W_item(X) operation, check the
following conditions:
 If R_TS(X) > TS(T) or W_TS(X) > TS(T), then abort and roll back T and
reject the operation; else
 Execute the W_item(X) operation of T and set W_TS(X) to the
larger of TS(T) and the current W_TS(X).
Whenever a transaction T issues a R_item(X) operation, check the
following conditions:
 If W_TS(X) > TS(T), then abort and roll back T and reject the operation;
else
 If W_TS(X) <= TS(T), then execute the R_item(X) operation of T and
set R_TS(X) to the larger of TS(T) and the current R_TS(X).
Whenever the Basic TO algorithm detects two conflicting operations that
occur in an incorrect order, it rejects the latter of the two operations by
aborting the Transaction that issued it.
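A minimal Python sketch of the two Basic TO rules above; the class layout and method names are illustrative assumptions, not a specific DBMS API.

class BasicTO:
    """Basic Timestamp Ordering: tracks R_TS/W_TS per item and
    rejects an operation that arrives 'too late' in timestamp order."""

    def __init__(self):
        self.r_ts = {}  # item -> largest timestamp that has read it
        self.w_ts = {}  # item -> largest timestamp that has written it

    def write(self, ts, item):
        # Reject if a younger transaction has already read or written the item.
        if self.r_ts.get(item, 0) > ts or self.w_ts.get(item, 0) > ts:
            return "abort"                      # roll back T, reject W_item(X)
        self.w_ts[item] = max(ts, self.w_ts.get(item, 0))
        return "ok"

    def read(self, ts, item):
        # Reject if a younger transaction has already written the item.
        if self.w_ts.get(item, 0) > ts:
            return "abort"                      # roll back T, reject R_item(X)
        self.r_ts[item] = max(ts, self.r_ts.get(item, 0))
        return "ok"

sched = BasicTO()
print(sched.write(ts=7, item="X"))   # ok    (T1 with TS = 007 writes X)
print(sched.read(ts=9, item="X"))    # ok    (T2 with TS = 009 reads X)
print(sched.write(ts=7, item="X"))   # abort (T2, which is younger, already read X)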

Advantages of Basic TO Protocol


 Conflict Serializable: Ensures all conflicting operations follow the
timestamp order.
 Deadlock-Free: Transactions do not wait for resources, preventing
deadlocks.
 Strict Ordering: Operations are executed in a predefined, conflict-free
order based on timestamps.
Drawbacks of Basic Timestamp Ordering (TO) Protocol
 Cascading Rollbacks : If a transaction is aborted, all dependent
transactions must also be aborted, leading to inefficiency.
 Starvation of Newer Transactions : Older transactions are prioritized,
which can delay or starve newer transactions.
 High Overhead: Maintaining and updating timestamps for every data
item adds significant system overhead.
 Inefficient for High Concurrency: The strict ordering can reduce
throughput in systems with many concurrent transactions.

Transaction Isolation Levels in DBMS


The levels of transaction isolation in DBMS determine how the
concurrently running transactions behave and, therefore, how data
consistency is balanced against performance. There are four basic levels
(Read Uncommitted, Read Committed, Repeatable Read, and Serializable)
that provide different degrees of data protection, from fast access with
possible inconsistency to strict accuracy at the cost of performance.
Choosing the right one depends on whether the need is speed or data
integrity.
What is the Transaction Isolation Level?
In a database management system, transaction isolation levels define the
degree to which the operations in one transaction are isolated from the
operations of other concurrent transactions. In other words, it defines how
and when the changes made by one transaction are visible to others to
assure data consistency and integrity.
As we know, to maintain consistency in a database, it follows ACID
properties. Among these four properties (Atomicity, Consistency, Isolation,
and Durability) Isolation determines how transaction integrity is visible to
other users and systems. It means that a transaction should take place in
a system in such a way that it is the only transaction that is accessing the
resources in a database system.
Isolation levels define the degree to which a transaction must be isolated
from the data modifications made by any other transaction in the database
system. A transaction isolation level is defined by the following
phenomena:
 Dirty Read - A Dirty read is a situation when a transaction reads data
that has not yet been committed. For example, Let's say transaction 1
updates a row and leaves it uncommitted, meanwhile, Transaction 2
reads the updated row. If transaction 1 rolls back the change,
transaction 2 will have read data that is considered never to have
existed.
 Non-Repeatable Read - A non-repeatable read occurs when a
transaction reads the same row twice and gets a different value each
time. For example, suppose transaction T1 reads data. Due
to concurrency, another transaction T2 updates the same data and
commits. Now if transaction T1 rereads the same data, it will retrieve a
different value.
 Phantom Read - A phantom read occurs when the same query is
executed twice, but the rows retrieved by the two executions differ. For
example, suppose transaction T1 retrieves a set of rows that satisfy
some search criteria. Now, Transaction T2 generates some new rows
that match the search criteria for Transaction T1. If transaction T1 re-
executes the statement that reads the rows, it gets a different set of
rows this time.
Based on these phenomena, The SQL standard defines four isolation
levels:
1. Read Uncommitted - Read Uncommitted is the lowest isolation level.
In this level, one transaction may read not yet committed changes
made by other transactions, thereby allowing dirty reads. At this level,
transactions are not isolated from each other.
2. Read Committed - This isolation level guarantees that any data read
is committed at the moment it is read. Thus it does not allow dirty read.
The transaction holds a read or write lock on the current row, and thus
prevents other transactions from reading, updating, or deleting it.
3. Repeatable Read - This is a more restrictive isolation level. The
transaction holds read locks on all rows it references and write locks
on referenced rows for update and delete actions. Since other
transactions cannot read, update or delete these rows, it
avoids non-repeatable reads.
4. Serializable - This is the highest isolation level.
A serializable execution is guaranteed to be serializable: it is an
execution of operations in which concurrently executing transactions
appear to be executing serially.
The table given below depicts the relationship between isolation levels
and the read phenomena they allow:
Isolation Level       Dirty Read      Non-Repeatable Read   Phantom Read
Read Uncommitted      Possible        Possible              Possible
Read Committed        Not possible    Possible              Possible
Repeatable Read       Not possible    Not possible          Possible
Serializable          Not possible    Not possible          Not possible
Anomaly Serializable is not the same as Serializable: being free of all
three phenomena is necessary, but not sufficient, for a schedule to be
truly serializable. Transaction isolation levels are used in database
management systems (DBMS) to control the level of interaction
between concurrent transactions.
The four standard isolation levels are:
1. Read Uncommitted: This is the lowest level of isolation where a
transaction can see uncommitted changes made by other transactions.
This can result in dirty reads, non-repeatable reads, and phantom
reads.
2. Read Committed: In this isolation level, a transaction can only see
changes made by other committed transactions. This eliminates dirty
reads but can still result in non-repeatable reads and phantom reads.
3. Repeatable Read: This isolation level guarantees that a transaction
will see the same data throughout its duration, even if other
transactions commit changes to the data. However, phantom reads are
still possible.
4. Serializable: This is the highest isolation level where a transaction is
executed as if it were the only transaction in the system. All
transactions must be executed sequentially, which ensures that there
are no dirty reads, non-repeatable reads, or phantom reads.
The choice of isolation level depends on the specific requirements of the
application. Higher isolation levels offer stronger data consistency but can
also result in longer lock times and increased contention, leading to
decreased concurrency and performance. Lower isolation levels provide
more concurrency but can result in data inconsistencies.
In addition to the standard isolation levels, some DBMS may also support
additional custom isolation levels or features such as snapshot isolation
and multi-version concurrency control (MVCC) that provide alternative
solutions to the problems addressed by the standard isolation levels.
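As an illustration of how an application typically selects one of these levels, here is a hedged sketch that issues standard SQL through a generic Python DB-API connection. The exact statement and the default level vary by DBMS (some systems use a session-level variant), and the accounts table is an assumption.

def run_report(conn):
    """Run a read-only report under a stricter isolation level.
    `conn` is assumed to be any DB-API 2.0 connection with autocommit off."""
    cur = conn.cursor()
    # Ask for repeatable reads so both aggregates below see the same rows.
    cur.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ")
    cur.execute("SELECT COUNT(*) FROM accounts")
    row_count = cur.fetchone()[0]
    cur.execute("SELECT SUM(balance) FROM accounts")
    total_balance = cur.fetchone()[0]
    conn.commit()  # end the transaction and release its locks/snapshot
    return row_count, total_balance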
Advantages of Transaction Isolation Levels
 Improved concurrency: Transaction isolation levels can improve
concurrency by allowing multiple transactions to run concurrently
without interfering with each other.
 Control over data consistency: Isolation levels provide control over
the level of data consistency required by a particular application.

 Reduced data anomalies: The use of isolation levels can reduce data
anomalies such as dirty reads, non-repeatable reads, and phantom
reads.
 Flexibility: The use of different isolation levels provides flexibility in
designing applications that require different levels of data consistency.
Disadvantages of Transaction Isolation Levels
 Increased overhead: The use of isolation levels can increase
overhead because the database management system must perform
additional checks and acquire more locks.
 Decreased concurrency: Some isolation levels, such as Serializable,
can decrease concurrency by requiring transactions to acquire more
locks, which can lead to blocking.
 Limited support: Not all database management systems support all
isolation levels, which can limit the portability of applications across
different systems.
 Complexity: The use of different isolation levels can add complexity to
the design of database applications, making them more difficult to
implement and maintain.
OLAP Full Form
OLAP stands for Online Analytical Processing. It is a technology that is
used to organize large business databases and support business
intelligence. OLAP databases are divided into one or more cubes, and
each cube is organized and designed by a cube administrator to fit the
way you retrieve and analyze data, so that it is easier to create and use
the PivotTable and PivotChart reports that you require.

Characteristics of OLAP
 Fast: The system is targeted to deliver most responses to the user
within about five seconds, with the simplest analysis taking no more
than one second and very few taking more than 15 seconds.
 Analysis: The system can cope with any business logic and statistical
analysis that is relevant for the application and the user, while keeping
it simple enough for the target user. Although some pre-programming
may be needed, it is not acceptable if all application definitions have to
be pre-programmed; the user should be able to define new ad hoc
calculations as part of the analysis and to report on the data in any
desired way, without having to program (which excludes products
such as Oracle Discoverer).
 Share: The system implements all the security requirements for
confidentiality and, if multiple write access is required, concurrent
update at an appropriate level. Not all functions require users to write
data back, but for the growing number that do, the system should be
able to manage multiple updates in a timely, secure manner.
 Multidimensional: The OLAP system must provide a multidimensional
conceptual view of the data, including full support for hierarchies, as
this is certainly the most logical way to analyze businesses and
organizations.
 Information: The system should be able to hold all the data needed by
the applications. Data sparsity should be handled in an efficient way.
Advantages of OLAP
 Quick query performance due to optimized storage, multidimensional
indexing and caching.
 Smaller on-disk size of data compared to data stored in a relational
database, due to compression techniques.
 Automated computation of higher-level aggregates of the data.
 It is very compact for low-dimension data sets. Array models provide
natural indexing.
 Effective data extraction achieved through the pre-structuring of
aggregated data.
Disadvantages of OLAP
 Within some OLAP solutions the processing step (data load) can be
quite long, especially on large data volumes. This is usually remedied
by doing only incremental processing, i.e., processing only the data
that has changed (usually new data) instead of reprocessing the entire
data set.
 Some OLAP techniques introduce data redundancy.
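As a small illustration of the cube idea, the pandas snippet below builds a PivotTable-style roll-up of a toy fact table; pandas and the column names are assumptions used only for illustration.

import pandas as pd

# Toy fact table: one row per sale.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 120, 90, 130],
})

# A PivotTable-style roll-up (regions x quarters, summed revenue),
# conceptually one face of an OLAP cube.
cube_slice = pd.pivot_table(sales, values="revenue",
                            index="region", columns="quarter", aggfunc="sum")
print(cube_slice)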

Database Performance Tuning and Query Optimization: Query Tree, Cost of Query

Query Tree

 A Query Tree is a tree-based data structure that represents a database query,
especially in relational algebra form.
 Each node in the tree represents a relational operation:
o Leaf nodes represent base relations (tables).

o Internal nodes represent operations like selection (σ), projection (π), join (⨝),
union (∪), etc.

Purpose:

 Helps visualize and understand the logical flow of operations in a query.


 Used by query optimizers to systematically explore different ways to execute a query.

Example:

For a SQL query like:

SELECT name FROM Employees WHERE department = 'Sales';

The query tree might look like:

π_name
|
σ_department='Sales'
|
Employees

Where π is projection, σ is selection.

Benefits:

 Enables reordering of operations for optimization.


 Facilitates cost estimation by breaking down query into steps.
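A minimal sketch of how a query tree like the one above could be represented and printed programmatically in Python; the Node class and its field names are illustrative assumptions, not a real optimizer's API.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    op: str                        # 'relation', 'select', 'project', 'join', ...
    arg: str = ""                  # relation name, predicate, or attribute list
    left: Optional["Node"] = None
    right: Optional["Node"] = None

# pi_name( sigma_department='Sales'( Employees ) )
tree = Node("project", "name",
            Node("select", "department = 'Sales'",
                 Node("relation", "Employees")))

def show(node, depth=0):
    """Print the tree top-down, mirroring the diagram above."""
    print("  " * depth + f"{node.op} {node.arg}".strip())
    for child in (node.left, node.right):
        if child:
            show(child, depth + 1)

show(tree)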

2. Cost of Query

 Query cost is an estimate of the resources needed to execute a query.


 Usually measured in:
o Disk I/O operations (most expensive)
o CPU cycles
o Memory usage

Why Cost Matters:

 The query optimizer generates multiple query plans (different query trees).
 It estimates the cost of each plan.
 The plan with the lowest cost is chosen for execution.

Factors Affecting Cost:



 Size of tables involved.


 Availability and type of indexes.
 Join method chosen.
 Selectivity of predicates (how many tuples qualify).

Cost Estimation Example:

 A selection operation cost depends on whether an index exists.


o With an index, cost is low (logarithmic scan).
o Without an index, cost is high (full table scan).
 A join operation cost varies greatly by join algorithm:
o Nested Loop Join: high cost if no indexes.
o Hash Join: cheaper if hash fits in memory.
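A rough back-of-the-envelope sketch of these comparisons in Python; the formulas are simplified textbook I/O estimates and the numbers are illustrative, not measurements from any particular DBMS.

import math

def selection_cost(num_blocks, has_index, tree_height=3, matching_blocks=1):
    """Estimated disk I/O for a selection."""
    if has_index:
        # Traverse a B-tree, then fetch the few matching blocks.
        return tree_height + matching_blocks
    return num_blocks              # full table scan

def nested_loop_join_cost(blocks_r, blocks_s, buffer_blocks=1):
    """Block nested loop join: read R once, read S once per buffer-load of R."""
    return blocks_r + math.ceil(blocks_r / buffer_blocks) * blocks_s

def hash_join_cost(blocks_r, blocks_s):
    """Simple in-memory hash join estimate: read each relation once."""
    return blocks_r + blocks_s

print(selection_cost(10_000, has_index=False))   # 10000 I/Os (full scan)
print(selection_cost(10_000, has_index=True))    # ~4 I/Os (index lookup)
print(nested_loop_join_cost(100, 1_000))         # 100100 I/Os
print(hash_join_cost(100, 1_000))                # 1100 I/Os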

Join
A Join operation combines rows from two or more tables based on a related column between
them. It's one of the most expensive and critical operations in query processing, so optimizing
joins is key.

Common Join Algorithms:

1. Nested Loop Join (NLJ)


o For each tuple in the outer relation, scan all tuples in the inner relation to find
matching tuples.
o Simple but inefficient for large datasets.
o Cost: O(|R| × |S|), where R and S are relations.
2. Block Nested Loop Join (BNLJ)
o Improvement over NLJ by processing blocks of tuples instead of one tuple at a
time.
o Reduces I/O by loading blocks in memory.
3. Sort-Merge Join (SMJ)
o Sort both relations on the join attribute.
o Merge the sorted lists to find matches.
o Efficient if inputs are already sorted or can be sorted quickly.
4. Hash Join
o Build a hash table on the smaller relation using the join attribute.
o Probe the hash table with tuples from the other relation.
o Very efficient for equality joins, especially when hash fits in memory.
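A compact Python sketch of the two most commonly contrasted algorithms above, nested loop join and hash join, over lists of tuples; the relation layouts and key positions are illustrative assumptions.

from collections import defaultdict

def nested_loop_join(r, s, r_key, s_key):
    """For each tuple of the outer relation, scan the inner relation."""
    return [(tr, ts) for tr in r for ts in s if tr[r_key] == ts[s_key]]

def hash_join(r, s, r_key, s_key):
    """Build a hash table on the smaller relation, probe with the other.
    Result pairs are (build tuple, probe tuple)."""
    build, probe = (r, s) if len(r) <= len(s) else (s, r)
    build_key, probe_key = (r_key, s_key) if build is r else (s_key, r_key)
    table = defaultdict(list)
    for t in build:
        table[t[build_key]].append(t)
    return [(b, p) for p in probe for b in table[p[probe_key]]]

employees = [(1, "Asha", 10), (2, "Ravi", 20)]     # (id, name, dept_id)
departments = [(10, "Sales"), (20, "HR")]          # (dept_id, dept_name)
print(nested_loop_join(employees, departments, 2, 0))
print(hash_join(employees, departments, 2, 0))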

Join Optimization Tips:

 Choose the smallest relation as the build input in hash join.


 Use indexes if available.
 Reorder joins to join smaller relations first.
 Consider join types: inner join, outer join, semi-join affect optimization.

Selection
Selection operation filters rows from a relation that satisfy a condition (predicate).

Implementation Algorithms for Selection:

1. Linear Search (Full Table Scan)


o Scan every tuple in the relation and apply the predicate.
o Costly for large tables.
2. Index-based Selection
o Use an index to quickly find qualifying tuples.
o Much faster if predicate is on indexed attribute.
o Types of indexes:
 B-tree index: Efficient for range queries and equality.
 Hash index: Efficient for equality queries.
3. Conjunctive Selection Optimization
o When multiple conditions are combined (AND), apply the most restrictive and
cheapest condition first to reduce intermediate results.
4. Disjunctive Selection (OR conditions)
o Might require scanning multiple indexes or a full scan.
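A small sketch contrasting a full scan with an index-based equality selection, using a plain Python dict as a stand-in for a hash index; the table and attribute names are illustrative.

employees = [
    {"id": 1, "name": "Asha", "department": "Sales"},
    {"id": 2, "name": "Ravi", "department": "HR"},
    {"id": 3, "name": "Meera", "department": "Sales"},
]

def linear_selection(rows, predicate):
    """Full table scan: apply the predicate to every tuple."""
    return [r for r in rows if predicate(r)]

# A hash index on 'department': attribute value -> list of matching rows.
dept_index = {}
for row in employees:
    dept_index.setdefault(row["department"], []).append(row)

def index_selection(index, value):
    """Equality selection via the index: non-matching rows are never touched."""
    return index.get(value, [])

print(linear_selection(employees, lambda r: r["department"] == "Sales"))
print(index_selection(dept_index, "Sales"))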

Selection Optimization Tips:

 Use indexes wherever possible.


 Push selection operators as close to the base tables as possible in query trees
(predicate pushdown).
 Combine multiple selection predicates to reduce scans.

Projection Implementation Algorithms in Advanced Databases and Data Mining
1. Challenges in Advanced Settings

 High-dimensional data: Projection often involves selecting subsets of many


attributes in large datasets.
 Massive data volumes: Data too large to fit in memory, requiring external memory
algorithms.
 Complex data types: Including unstructured, semi-structured, and complex nested
data.
 Duplicate elimination at scale: Often costly in big data scenarios.
 Integration with mining algorithms: Projection may be part of preprocessing or
feature selection.

2. Common Projection Techniques in Advanced Databases

a) Bitmap-based Projection

 Used in columnar and bitmap-indexed databases.


 Each attribute is stored separately as a bitmap or column.
 Projection corresponds to retrieving specific columns or bitmaps and combining
them.
 Efficient for:
o Large read-heavy workloads.
o Analytical queries in data warehouses.
 Advantages:
o Fast column access.
o Low I/O for projecting few attributes.
 Example:
o Apache Parquet, ORC, and other columnar formats use this for fast projection.
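A toy sketch of column-oriented storage and projection as described above, with each attribute kept as its own Python list; real columnar formats such as Parquet add encoding, compression and metadata on top of this idea.

# Column store: one Python list per attribute instead of one list of rows.
columns = {
    "id":         [1, 2, 3],
    "name":       ["Asha", "Ravi", "Meera"],
    "department": ["Sales", "HR", "Sales"],
}

def project(store, attributes):
    """Projection touches only the requested columns; the others are never read."""
    return {attr: store[attr] for attr in attributes}

print(project(columns, ["name"]))   # {'name': ['Asha', 'Ravi', 'Meera']}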

b) Compressed Projection

 Projection implemented over compressed data formats.


 Avoids decompressing entire tuples, decompresses only the needed columns.
 Uses techniques like dictionary encoding, run-length encoding.
 Reduces I/O and CPU, especially important for big data.

c) Projection in Distributed Systems

 In systems like Hadoop, Spark, and distributed NoSQL stores:


o Projection is pushed down to data nodes to reduce network traffic.
o Partial projection results are combined at the coordinator node.
o Optimized to minimize data shuffling.
 Techniques:
o Map-side projection: Select columns before shuffle.
o Column pruning: Avoid reading irrelevant columns from storage.

d) Approximate Projection

 In data mining or approximate query processing:


o Projection may involve sampling or sketching.
o Allows fast approximate results when exact duplicate elimination is too
expensive.
 Used in:
o OLAP cubes,
o Large-scale feature selection.

3. Projection Algorithms in Data Mining Context



 Often part of feature selection or dimensionality reduction.


 Projection algorithms include:
o Principal Component Analysis (PCA): Projects data onto principal
components, reducing dimensionality.
o Random Projection: Uses random matrices to reduce dimensions with
provable distance preservation.
o Feature Subset Selection: Selecting subsets of attributes based on statistical
or mining criteria.
 These are more advanced “projection” operations, beyond simple attribute
selection, and used to optimize mining algorithms.
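As an illustration of the random projection idea mentioned above, here is a minimal NumPy sketch; the dimensions and data are arbitrary, and NumPy itself is an assumption rather than something named in the text.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))              # 100 points in 1000 dimensions

k = 50                                        # target dimensionality
R = rng.normal(size=(1000, k)) / np.sqrt(k)   # random Gaussian projection matrix
X_low = X @ R                                 # 100 points in 50 dimensions

# Pairwise distances are approximately preserved (Johnson-Lindenstrauss).
d_high = np.linalg.norm(X[0] - X[1])
d_low = np.linalg.norm(X_low[0] - X_low[1])
print(round(d_high, 2), round(d_low, 2))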

4. Duplicate Elimination in Advanced Contexts

 Duplicate elimination can be:


o Performed with distributed hash or sort algorithms,
o Or approximated with Bloom filters or count-min sketches for scalability.

Optimization of Database Security in Advanced Databases and Data Mining

Introduction
In the era of Big Data and advanced analytics, databases and data mining systems handle
enormous volumes of heterogeneous and sensitive data distributed across cloud and hybrid
environments. Ensuring robust security in such advanced database systems while maintaining
performance and usability is a critical challenge.

Optimization of database security refers to designing and implementing security


mechanisms that provide strong protection with minimal impact on system efficiency,
scalability, and usability. This is particularly important in advanced databases and data
mining due to the complexity and scale of operations.

1. Security Challenges in Advanced Databases and Data Mining
1.1. Scalability and Performance Constraints

 Large-scale data repositories containing billions of tuples with many attributes.


 Complex and computationally intensive queries and mining algorithms.
 Need for low-latency response in real-time or streaming analytics.

1.2. Data Heterogeneity and Distribution



 Data stored in diverse formats (relational, NoSQL, graph, spatial).


 Distributed storage across multiple physical locations or clouds.
 Varying security policies and compliance requirements across systems.

1.3. Sensitive Information Exposure

 Data mining operations may inadvertently reveal sensitive or personal data.


 Risk of privacy breaches due to aggregation or inference attacks.
 Potential insider threats due to complex data access needs.

1.4. Dynamic and Fine-Grained Access Needs

 Users with varied roles require differentiated access rights.


 Access control must adapt dynamically to mining tasks and context.
 Support for multi-tenant environments where different users share infrastructure.

2. Optimization Goals for Database Security


 Efficiency: Security mechanisms must minimize overhead in CPU, memory, and I/O.
 Scalability: Solutions must work effectively at petabyte scale and beyond.
 Strong Security Guarantees: Confidentiality, integrity, availability, and privacy.
 Minimal Usability Impact: Security should not hinder legitimate mining or
querying.
 Adaptability: Support dynamic security policies in evolving data mining
environments.

3. Key Areas and Optimization Techniques


3.1. Access Control Optimization

Role-Based Access Control (RBAC) and Attribute-Based Access Control (ABAC):

 Policy Indexing and Caching:


To speed up authorization, frequently used policy decisions are cached, avoiding
costly re-evaluations. Index structures over policies allow rapid lookup of permissions
relevant to a user and data object.
 Predicate Pushdown:
Access control predicates are embedded directly into query execution plans. This
enables filtering unauthorized data early, reducing data movement and processing
overhead.
 Dynamic and Context-Aware Access:
Security decisions are dynamically adapted based on user context (e.g., location, time)
or mining task requirements. Such context-aware controls reduce unnecessary denials
and improve performance.

 Distributed Enforcement:
In distributed databases, access checks are performed at the data nodes to minimize
communication overhead and ensure localized enforcement.
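A small Python sketch of the predicate-pushdown idea above: an access-control predicate derived from the requesting role is appended to the query before execution. The policy table and SQL fragments are illustrative assumptions; a real system would rewrite the query plan rather than the SQL text.

# Illustrative row-level policies: the predicate each role is allowed to see.
POLICIES = {
    "analyst": "region = 'EU'",   # analysts see only EU rows
    "auditor": "1 = 1",           # auditors see everything
}

def rewrite_with_policy(base_query, role):
    """Embed the access-control predicate into the query itself, so
    unauthorized rows are filtered as early as possible."""
    predicate = POLICIES.get(role)
    if predicate is None:
        raise PermissionError(f"no policy defined for role {role!r}")
    joiner = " AND " if " WHERE " in base_query.upper() else " WHERE "
    return base_query + joiner + predicate

query = "SELECT customer_id, amount FROM sales"
print(rewrite_with_policy(query, "analyst"))
# SELECT customer_id, amount FROM sales WHERE region = 'EU'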

3.2. Encryption Optimization

Selective and Columnar Encryption:

 Encrypt only sensitive attributes rather than entire tables or databases, reducing
encryption and decryption costs.
 In columnar stores (common in analytics), encrypt individual columns so queries on
unencrypted columns remain efficient.

Hardware-Accelerated Cryptography:

 Utilize CPU extensions like AES-NI for fast symmetric encryption.


 Offload cryptographic computations to dedicated hardware accelerators when
available.

Encrypted Query Processing:

 Use searchable encryption or order-preserving encryption schemes optimized for


typical query patterns, enabling queries without full decryption.
 Design specialized encrypted indexes for range, equality, or similarity searches that
minimize performance penalties.

3.3. Privacy-Preserving Data Mining (PPDM)

Differential Privacy:

 Add mathematically calibrated noise to query results or mining outputs to protect


individual data contributions.
 Optimize privacy budget allocation across multiple queries to maximize data utility.
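A tiny Python sketch of the Laplace mechanism often used for differential privacy on a count query; the epsilon values and the use of NumPy are illustrative assumptions.

import numpy as np

def dp_count(true_count, epsilon, sensitivity=1.0):
    """Release a count with Laplace noise scaled to sensitivity / epsilon."""
    rng = np.random.default_rng()
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# A smaller epsilon spends less privacy budget but adds more noise.
print(dp_count(1000, epsilon=1.0))   # e.g. 1000.7
print(dp_count(1000, epsilon=0.1))   # e.g. 1013.4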

Scalable Anonymization:

 Use distributed algorithms for k-anonymity and l-diversity that scale horizontally.
 Minimize data distortion by optimizing grouping and generalization techniques.

Secure Multi-Party Computation (SMPC):

 Protocol optimizations reduce communication rounds and computation.


 Parallelism and specialized protocols for common mining tasks (e.g., secure
clustering, classification).

3.4. Intrusion Detection and Anomaly Detection

 Apply machine learning models trained on normal query and access patterns to detect
anomalies indicative of attacks or misuse.
 Hierarchical and focused monitoring reduces overhead by concentrating on high-risk
areas.
 Integrate detection with automated response to quarantine suspicious activities.

3.5. Mitigation of SQL Injection and Query-based Attacks

 Enforce parameterized queries and prepared statements throughout the system.


 Use static and dynamic code analysis tools to detect vulnerabilities early in the
development cycle.
 Optimize input validation routines to balance security and performance.

3.6. Security-Aware Query Optimization

 Extend query optimizers to include security costs in cost models.


 Generate query execution plans that minimize exposure of sensitive data and avoid
expensive security checks when possible.
 Rewriting queries to incorporate access predicates and privacy-preserving
transformations.

4. System-Level Security Optimization


Trusted Execution Environments (TEE)

 Use hardware enclaves (e.g., Intel SGX) to execute sensitive computations securely
even on untrusted cloud servers.
 TEEs reduce the need for heavy cryptographic overhead by providing hardware-
isolated execution.

Secure Audit and Logging

 Employ intelligent logging with anomaly detection to reduce storage and analysis
overhead.
 Use blockchain or distributed ledgers for tamper-evident audit trails.

5. Research Trends and Future Directions



 Adaptive Security Models: Using machine learning to dynamically adjust access


policies and detect threats.
 Privacy-Utility Trade-off Frameworks: Formal methods and algorithms to balance
mining accuracy and privacy guarantees.
 Blockchain for Decentralized Security: Use of smart contracts and decentralized
trust for multi-tenant and federated systems.
 Quantum-Resistant Cryptography: Preparing for future threats with quantum-
secure algorithms.

Access Control in Advanced Databases and Data Mining

1. Introduction
Advanced databases and data mining systems manage large-scale, often heterogeneous, and
sensitive datasets. The complexity and volume of data, combined with the dynamic nature of
mining operations, necessitate sophisticated access control mechanisms that balance
security, privacy, and performance.

Access control in this context is not only about restricting unauthorized access but also about
enabling controlled, privacy-preserving data mining and analytic operations.

2. Challenges in Access Control for Advanced Databases and Data Mining
 Heterogeneity: Data may be stored in relational, NoSQL, graph, or distributed
formats.
 Large Scale: Massive datasets require scalable access control enforcement.
 Fine-Grained Control: Mining algorithms may need access to subsets (rows,
columns) of data.
 Dynamic and Contextual Access: Access rights may depend on mining tasks, user
roles, or data sensitivity.
 Privacy Preservation: Access control must integrate with privacy policies to prevent
leakage through mining results.
 Distributed Environment: Data and mining operations span multiple sites or cloud
environments with varying policies.

3. Access Control Models in Advanced Databases and Mining
3.1 Enhanced Role-Based Access Control (RBAC)

 Dynamic Role Assignment: Roles assigned or revoked based on context (time,


location, current mining task).
 Task-Based Access Control (TBAC): Access tied to specific mining tasks or
workflows.
 Hierarchical RBAC: Roles inherit permissions, supporting complex organizational
structures in mining projects.

3.2 Attribute-Based Access Control (ABAC)

 Enables fine-grained and context-aware controls.


 Attributes can include user identity, role, data classification, request time, device type,
and mining task parameters.
 Ideal for enforcing privacy constraints dynamically during mining.

3.3 Usage Control (UCON) Models

 Extend traditional access control by adding continuity, mutability, and obligation


constraints.
 Useful in mining where access may need to be revoked or adjusted during data
analysis based on detected privacy risks or policy changes.

4. Access Control Mechanisms Specific to Data Mining


4.1 Row and Column Level Access Control

 Mining algorithms often require subsets of data.


 Enforce row-level filtering to exclude unauthorized records.
 Column-level control restricts access to sensitive attributes (e.g., personally
identifiable information).

4.2 Query Rewriting and Policy Embedding

 Access control policies are embedded into mining queries.


 Queries are rewritten to include predicates that enforce access rights, preventing
unauthorized data exposure during mining.

4.3 Secure Multi-Party Computation (SMPC)

 Allows multiple parties to collaboratively mine data without revealing raw data to
each other.
 Enforces access restrictions cryptographically rather than by system policy alone.

4.4 Privacy-Preserving Mining Access Control

 Integrates with privacy models like differential privacy.


 Controls what mining results a user can see, preventing inference of sensitive data.

 Enforces result-level access control, filtering or perturbing outputs based on user


privileges.

5. Architecture Considerations
Distributed Enforcement

 Access control decisions are pushed close to data sources to minimize data leakage.
 Nodes enforce local policies consistent with global access control.

Policy Management and Scalability

 Centralized policy management with distributed enforcement mechanisms.


 Use of policy indexing, caching, and policy conflict resolution for efficiency.

Audit and Monitoring

 Continuous monitoring of mining access patterns to detect policy violations or


anomalies.
 Integration with intrusion detection systems specific to mining environments.

6. Optimization Techniques
 Policy Caching: Reduce overhead of repetitive access checks.
 Predicate Pushdown: Filter unauthorized data early in query plans.
 Role and Attribute Hierarchies: Simplify policy management.
 Adaptive Access Control: Dynamically adjust permissions based on mining phase or
detected risk.
 Hardware-Assisted Enforcement: Use of Trusted Execution Environments (TEEs)
to secure sensitive operations.

7. Case Study: Access Control in Privacy-Preserving Data Mining (PPDM)
 Users submit mining requests with specific privileges.
 System enforces access and privacy policies on raw data and mining results.
 Results are perturbed or masked if user lacks required permissions.
 Access logs and audits ensure accountability.

MAC, RBAC, Authorization, and SQL Injection Attacks in Advanced Databases and Data Mining

1. Mandatory Access Control (MAC) in Advanced Databases and Data Mining
Overview:

 MAC enforces system-defined security policies based on data classification and user
clearance levels.
 Every data object (e.g., database record or mining dataset) and subject (user or mining
process) is assigned a security label (e.g., Confidential, Secret, Top Secret).
 Access decisions are based on these labels and enforced without user discretion.

Role in Advanced Databases and Mining:

 Advanced databases often contain sensitive or classified data requiring strict control.
 In distributed or cloud mining environments, MAC ensures that mining algorithms or
users only access data consistent with their clearance.
 MAC enforces non-bypassable restrictions, preventing data leaks even if a mining
algorithm tries to access unauthorized data.
 Example: In healthcare data mining, MAC ensures that only users with appropriate
clearance can analyze patient records at a certain sensitivity level.
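A minimal Python sketch of a MAC-style "no read up" check over ordered security labels (the simple-security rule of the Bell-LaPadula model); the label names and function are illustrative.

LEVELS = {"Unclassified": 0, "Confidential": 1, "Secret": 2, "Top Secret": 3}

def can_read(subject_clearance, object_label):
    """Simple security property (no read up): the subject's clearance
    must dominate the object's classification."""
    return LEVELS[subject_clearance] >= LEVELS[object_label]

print(can_read("Secret", "Confidential"))   # True
print(can_read("Confidential", "Secret"))   # False: access denied by MAC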

Challenges:

 Rigidity can limit flexibility for complex mining workflows.


 Integration with dynamic mining tasks and fine-grained data subsets can be complex.

2. Role-Based Access Control (RBAC) in Advanced Databases and Mining
Overview:

 RBAC simplifies permission management by assigning users to roles aligned with


organizational functions.
 Permissions to perform operations on data are assigned to roles, not users directly.
 Supports hierarchical roles and separation of duties.

Importance in Advanced Databases and Mining:

 Mining teams often have members with distinct roles (data scientist, analyst, auditor).
 RBAC enables controlled access to mining data and tools based on these roles.

 Dynamic mining projects benefit from flexible role assignments, where roles can be
activated/deactivated based on project phase or user context.
 RBAC supports least privilege by ensuring users have only necessary permissions.

Enhancements:

 Integration with workflow and task-based models to align mining tasks with access
control.
 Incorporation of attribute-based conditions for fine-grained control alongside roles.
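A toy RBAC check in Python, mapping roles to permissions and users to roles; all names are illustrative assumptions.

ROLE_PERMISSIONS = {
    "data_scientist": {"read_dataset", "run_mining_job"},
    "analyst":        {"read_dataset"},
    "auditor":        {"read_audit_log"},
}
USER_ROLES = {
    "asha": {"data_scientist"},
    "ravi": {"analyst", "auditor"},
}

def is_authorized(user, permission):
    """Permissions are granted to roles, never to users directly."""
    return any(permission in ROLE_PERMISSIONS.get(role, set())
               for role in USER_ROLES.get(user, set()))

print(is_authorized("asha", "run_mining_job"))   # True
print(is_authorized("ravi", "run_mining_job"))   # False (least privilege)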

3. Authorization in Advanced Databases and Mining


Overview:

 Authorization is the process of granting or denying permissions to authenticated users


or processes.
 Determines who can perform what actions on which data in a database or mining
system.

Advanced Database and Mining Context:

 Authorization must consider fine-grained controls—down to rows or columns in


large datasets.
 Mining queries often involve complex operations; authorization frameworks must
evaluate both data access and operation legitimacy.
 In multi-tenant cloud mining environments, authorization ensures data isolation and
privacy between tenants.
 Often combined with privacy-preserving mechanisms to authorize access without
exposing raw sensitive data.

Techniques:

 Embedding authorization rules into mining query processors.


 Policy-driven authorization engines that evaluate dynamic contexts.
 Integration with encryption and secure computation protocols to enforce access
without exposing raw data.

4. SQL Injection Attacks in Advanced Databases and Mining
Overview:

 SQL Injection (SQLi) is a critical security threat where attackers insert malicious SQL
code into input fields.

 Allows unauthorized data access, data corruption, or complete compromise of the


database.

Relevance to Advanced Databases and Mining:

 Mining systems often expose query interfaces for data exploration or model training.
 If these interfaces accept unsanitized input, attackers can execute SQLi to access
sensitive mining datasets.
 Successful SQLi can compromise both data confidentiality and integrity, impacting
mining outcomes.

Prevention Strategies:

 Parameterized queries and prepared statements: Separate code from data,


preventing injection.
 Input validation and sanitization: Reject or escape suspicious characters or patterns.
 Use of stored procedures and ORM frameworks that inherently avoid direct query
string manipulation.
 Monitoring and anomaly detection: Identify unusual query patterns indicating
possible SQLi.
 Principle of least privilege: Limit database account permissions used by mining
applications to minimize damage.
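A minimal sketch contrasting an injectable query built by string concatenation with a parameterized query, using Python's sqlite3; the table and input values are illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

user_input = "x' OR '1'='1"   # classic injection payload

# Vulnerable: the payload becomes part of the SQL text and matches every row.
unsafe = f"SELECT * FROM users WHERE name = '{user_input}'"
print(conn.execute(unsafe).fetchall())              # [('alice', 'admin')]

# Safe: the driver sends the value separately; it is treated as data, not code.
safe = "SELECT * FROM users WHERE name = ?"
print(conn.execute(safe, (user_input,)).fetchall())  # []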

Advanced Considerations:

 SQLi detection in complex mining query interfaces that support dynamic, ad-hoc
queries.
 Protection in distributed or federated mining systems where input passes through
multiple layers.
