DBMS Notes Unit 5
Distributed Databases: Architecture, Data Storage, Transaction Processing, Query processing and
optimization – NOSQL Databases: Introduction – CAP Theorem – Document Based systems – Key
value Stores – Column Based Systems – Graph Databases. Database Security: Security issues – Access
control based on privileges – Role Based access control – SQL Injection – Statistical Database security
– Flow control – Encryption and Public Key infrastructures – Challenges
5.1 Distributed Databases: Architecture, Data Storage, Transaction Processing, Query processing
and optimization
Distributed database system
A distributed database is a database that is not limited to a single system; it is spread over
different sites, i.e., across multiple computers or a network of computers. A distributed database system
is located at various sites that do not share physical components. This may be required when a particular
database needs to be accessed by various users globally. It must be managed so that, to its users, it looks
like one single database.
(iii) Federated architecture: In this architecture, each site in the distributed database system
maintains its own independent database, but the databases are integrated through a middleware layer
that provides a common interface for accessing and querying the data.
(iv) Shared-Nothing Architecture: Data is divided up and spread among several nodes in a
shared-nothing architecture, with each node in charge of a particular portion of the data. Resources are
not shared across nodes, and each node runs independently. Because the system can add additional
nodes as needed without affecting the existing nodes, this design offers great scalability and
fault tolerance.
Query Processing:
Query processing refers to the activities involved in translating high-level language (HLL) queries
into operations at the physical file level, applying query-optimization transformations, and actually
evaluating the queries.
The translation step of query processing is handled by the query parser. When a user
executes a query, the parser checks the syntax of the query, verifies the names of the relations in the
database, the tuples, and finally the required attribute values in order to generate the internal form of the
query. The parser creates a tree of the query, known as the 'parse tree', and then translates it into a
relational algebra expression. In doing so, it also replaces any uses of views that appear in the query.
Suppose a user executes a query. There are various methods of extracting the data from the database.
For example, in SQL, a user wants to fetch the records of the employees whose salary is greater than
10000. The query for this is
Select emp_name from Employee where salary > 10000;
It is translated to the relational algebra as:
σsalary>10000 (πemp_name (Employee))
πemp_name (σsalary>10000 (Employee))
Evaluation
In addition to the relational algebra translation, the translated relational algebra expression must be
annotated with the instructions used for specifying and evaluating each operation.
Thus, after translating the user query, the system produces and executes a query evaluation plan. A query
execution engine is responsible for generating the output of the given query: it takes the query execution
plan (query evaluation plan), executes it, and finally produces the output for the user query.
Optimization
Query optimization is the process of selecting an efficient execution plan for evaluating the
query. After the query is parsed, the parsed query is passed to the query optimizer, which generates different
execution plans to evaluate the parsed query and selects the plan with the least estimated cost. The catalog
manager helps the optimizer choose the best plan by generating the cost of each plan.
Query optimization is thus a process of transforming a query into an equivalent form that can be
evaluated more efficiently. The essence of query optimization is to find an execution plan that
minimizes the time needed to evaluate the query. To achieve this optimization goal, we need to accomplish
two main tasks: first, find the best plan, and second, reduce the time involved in executing the query plan.
The optimization step transforms the initial algebraic query, using relational algebra transformations,
into other equivalent algebraic queries until the best one is found.
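As a rough illustration of cost-based plan selection, the Python sketch below compares the estimated costs of two candidate plans for the salary query above and keeps the cheaper one; the plans, cost formulas, and numbers are invented for illustration and are not taken from any real optimizer.

# Illustrative only: toy cost-based plan selection for
#   SELECT emp_name FROM Employee WHERE salary > 10000;
# The plans and cost formulas are hypothetical, not from a real DBMS.

def cost_full_scan(num_rows, rows_per_block=100):
    """Estimated block reads for a full table scan of Employee."""
    return num_rows / rows_per_block

def cost_index_scan(num_matching, index_height=3):
    """Estimated block reads using a (hypothetical) B+-tree index on salary."""
    return index_height + num_matching  # worst case: one block read per matching row

def choose_plan(num_rows, num_matching):
    plans = {
        "full table scan + filter": cost_full_scan(num_rows),
        "index scan on salary":     cost_index_scan(num_matching),
    }
    # The optimizer keeps the plan with the least estimated cost.
    best = min(plans, key=plans.get)
    return best, plans

if __name__ == "__main__":
    best, plans = choose_plan(num_rows=100_000, num_matching=500)
    for name, cost in plans.items():
        print(f"{name}: estimated cost = {cost:.0f} block reads")
    print("chosen plan:", best)

A real optimizer derives such estimates from catalog statistics (relation sizes, index heights, selectivities) rather than fixed constants.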
Types of NoSQL databases and the name of the database system that falls in that category are:
NoSQL databases also use sharding to scale horizontally. In this process, data is partitioned and placed on
numerous machines while preserving its ordering. The high flexibility of NoSQL
allows it to handle massive data volumes efficiently.
Finally, the auto replication capability of NoSQL also helps ensure high availability, as the
solution responds with replication to the last-known consistent state in case of any failure.
d) Minimal downtime
NoSQL databases are robust and feature minimal downtime. Business continuity is bolstered
through serverless architecture and the creation of multiple copies of data to be retained across
nodes. In case of a node malfunction, a different node will grant access to its copy of the data.
1. Consistency
Consistency means that all the nodes (databases) inside a network will have the same copies of a
replicated data item visible for various transactions. It guarantees that every node in a distributed cluster
returns the same, most recent, and successful write. It refers to every client having the same view of the
data. There are various types of consistency models. Consistency in CAP refers to sequential
consistency, a very strong form of consistency.
For example, a user checks his account balance and knows that he has 500 rupees. He spends
200 rupees on some products. Hence the amount of 200 must be deducted changing his account balance
to 300 rupees. This change must be committed and communicated with all other databases that hold this
user’s details. Otherwise, there will be inconsistency, and the other database might show his account
balance as 500 rupees which is not true.
For example, user A is a content creator with 1000 other users subscribed to his channel.
Another user B, who is far away from user A, tries to subscribe to user A's channel. Since the distance
between the two users is huge, they are connected to different database nodes of the social media network.
If the distributed system follows the principle of availability, user B must be able to subscribe to user
A's channel.
For example, take the same social media network, where two users are trying to
find the subscriber count of a particular channel. Due to some technical fault, there is a network
outage, and the second database, used by user B, loses its connection with the first database. The
subscriber count is still shown to user B with the help of a replica of the data that was copied from
database 1 prior to the network outage. Hence the distributed system is partition tolerant.
The CAP theorem implies that a distributed system can only provide two out of three properties:
a) CA (Consistency and Availability) - These systems always accept requests to view
or modify the data sent by the user and always respond with data that is
consistent across all the database nodes of a large, distributed network.
b) AP (Availability and Partition Tolerance) - These systems are distributed in nature,
ensuring that requests sent by the user to view or modify the data present in the database
nodes are not dropped and are processed even in the presence of a network partition.
c) CP (Consistency and Partition Tolerance) - These systems are distributed in nature,
ensuring that requests sent by the user to view or modify the data present in the database
nodes are dropped, rather than answered with inconsistent data, in the presence of a network
partition.
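The trade-off can be sketched in code. The toy Python class below is not a real replication protocol; it only shows the choice a replica must make during a partition: refuse the request (the CP choice) or answer with possibly stale data (the AP choice). The account-balance value is taken from the consistency example above.

# Toy illustration of the CAP trade-off during a network partition.
class Replica:
    def __init__(self, mode):
        self.mode = mode            # "CP" or "AP"
        self.value = 500            # last locally known account balance
        self.partitioned = False    # True = cannot reach the other replicas

    def read(self):
        if self.partitioned and self.mode == "CP":
            # CP choice: give up availability rather than risk stale data.
            raise RuntimeError("unavailable: cannot confirm latest value")
        # AP choice (or no partition): always answer, possibly with stale data.
        return self.value

cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True

print(ap.read())        # responds, but may show the old balance of 500
try:
    cp.read()
except RuntimeError as e:
    print(e)            # refuses to answer until the partition heals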
So in the document data model, each document stores its data as key-value pairs. Below is an example of
such a document.
{
"Name" : "Yashodhra",
"Address" : "Near Patel Nagar",
"Email" : "yahoo123@yahoo.com",
"Contact" : "12345"
}
Features:
Document Type Model: Since data is stored in documents rather than tables or
graphs, it becomes easy to map the data to objects in many programming languages.
Flexible Schema: The overall schema is very flexible; to support this statement, note
that not all documents in a collection need to have the same fields.
Distributed and Resilient: Document data models are highly distributed, which is what
enables horizontal scaling and distribution of data.
Manageable Query Language: In these data models, the query language allows developers
to perform CRUD (Create, Read, Update, Delete) operations on the data model; a minimal sketch of
these operations follows the list of examples below.
Examples of Document Data Models:
- Amazon DocumentDB
- MongoDB
- Cosmos DB
- ArangoDB
- Couchbase Server
- CouchDB
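As mentioned under the features above, a minimal sketch of CRUD operations against a document store is shown below, using MongoDB's Python driver (pymongo); the connection URL, database name, collection name, and document contents are placeholders for illustration.

# Minimal CRUD sketch against a document store (MongoDB via pymongo).
# The connection URL, database and collection names are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
students = client["school"]["students"]        # hypothetical collection

# Create: insert a document (fields need not match other documents).
students.insert_one({"Name": "Yashodhra",
                     "Address": "Near Patel Nagar",
                     "Email": "yahoo123@yahoo.com",
                     "Contact": "12345"})

# Read: find the document by one of its fields.
doc = students.find_one({"Name": "Yashodhra"})
print(doc)

# Update: change a field in the matching document.
students.update_one({"Name": "Yashodhra"},
                    {"$set": {"Contact": "67890"}})

# Delete: remove the document.
students.delete_one({"Name": "Yashodhra"})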
Advantages:
Schema-less: These are very good at retaining existing data at massive volumes because there
are absolutely no restrictions on the format and structure of data storage.
Faster creation of documents and maintenance: It is very simple to create a document, and apart
from this, maintenance requires almost no effort.
Open formats: It has a very simple build process that uses XML, JSON, and other such formats.
Built-in versioning: It has built-in versioning, which means that as documents grow in size they
may also grow in complexity; versioning decreases conflicts.
Disadvantages:
Weak Atomicity: It lacks support for multi-document ACID transactions. A change in the
document data model involving two collections will require us to run two separate queries,
one for each collection; this is where it breaks atomicity requirements.
Consistency Check Limitations: One can search collections and documents that are not
connected to an author collection, but doing this may degrade the performance of the
database.
Security: Nowadays many web applications lack security, which in turn results in the leakage
of sensitive data. So it becomes a point of concern, and one must pay attention to web
application vulnerabilities.
When a query is executed in a column-oriented data store, it retrieves only the specified columns
of data, which can result in faster query performance. Additionally, column-oriented data stores can use
compression techniques to reduce storage space and improve performance.
One potential downside of column-oriented data stores is that they may require more complex
queries to retrieve entire rows of data, as the data is spread across multiple columns. However, this can
often be mitigated by using specialized query languages and optimization techniques that are designed
for columnar data stores.
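The following Python sketch contrasts the two layouts with an invented three-row table; a real columnar engine would add compression and vectorized execution on top of this basic idea.

# Toy contrast between row-oriented and column-oriented storage.
# A query that touches only the 'salary' column scans one list in the
# columnar layout instead of visiting every row.

rows = [  # row-oriented: each record stored together
    {"id": 1, "name": "Anay",   "salary": 12000},
    {"id": 2, "name": "Bhagya", "salary": 9000},
    {"id": 3, "name": "Dilip",  "salary": 15000},
]

columns = {  # column-oriented: each attribute stored together
    "id":     [1, 2, 3],
    "name":   ["Anay", "Bhagya", "Dilip"],
    "salary": [12000, 9000, 15000],
}

# Aggregating one column: the columnar layout reads only 'salary'.
avg_row_store = sum(r["salary"] for r in rows) / len(rows)
avg_col_store = sum(columns["salary"]) / len(columns["salary"])
print(avg_row_store, avg_col_store)

# Reconstructing a full row from the columnar layout touches every column,
# which is the extra work mentioned above.
row_2 = {col: values[1] for col, values in columns.items()}
print(row_2)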
Representation:
The graph database is based on graph theory. The data is stored in the nodes of the graph, and
the relationships between the data items are represented by the edges between the nodes.
Example
We have a social network in which five friends are all connected. These friends are Anay,
Bhagya, Chaitanya, Dilip, and Erica. A graph database that will store their personal information may
look something like this:
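A minimal Python sketch of such a graph, with invented personal attributes, stores the five friends as nodes and their friendships as edges, and shows a simple traversal.

# Toy graph of the five friends: nodes hold personal information,
# edges hold the friendship relationships. Attribute values are invented.

nodes = {
    "Anay":      {"age": 25, "city": "Pune"},
    "Bhagya":    {"age": 24, "city": "Mumbai"},
    "Chaitanya": {"age": 26, "city": "Delhi"},
    "Dilip":     {"age": 23, "city": "Chennai"},
    "Erica":     {"age": 25, "city": "Bengaluru"},
}

edges = [  # undirected friendship edges
    ("Anay", "Bhagya"), ("Anay", "Chaitanya"),
    ("Bhagya", "Dilip"), ("Chaitanya", "Erica"), ("Dilip", "Erica"),
]

def friends_of(person):
    """Follow the edges touching 'person' - the basic graph traversal."""
    return [b if a == person else a for a, b in edges if person in (a, b)]

print(friends_of("Anay"))          # ['Bhagya', 'Chaitanya']
print(nodes["Erica"]["city"])      # attribute lookup on a node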
Advantages: The graph model handles frequent schema changes, growing volumes of data, real-time query
response times, and more intelligent data-activation requirements well.
Disadvantages: Note that graph databases aren't always the best solution for an application. We will
need to assess the needs of the application before deciding on the architecture.
We also need effective integrity semantics verification tools for assessment of data quality, based on many
techniques such as record linkage.
3. Database survivability
Database systems need to operate and continue their functions, even with reduced
capabilities, despite disruptive events such as information warfare attacks. A DBMS, in addition to
making every effort to prevent an attack and detecting one in the event of occurrence, should be able
to do the following:
Confinement: Take immediate action to eliminate the attacker's access to the system
and to isolate or contain the problem to prevent further spread.
Damage assessment: Determine the extent of the problem, including failed function and
corrupted data.
Recover: Recover corrupted or lost data and repair or reinstall failed function to reestablish a
normal level of operation.
Reconfiguration: Reconfigure to allow the operation to continue in a degraded mode while
recovery proceeds.
Fault treatment: To the extent possible, identify the weaknesses exploited in the attack and take
steps to prevent a recurrence.
Encryption: Data encryption is an effective way to protect sensitive data in transit and at rest.
However, it can also be a challenge to implement and manage encryption keys and ensure that
encrypted data is not compromised.
Access Control: Access control involves regulating the access to data within the database. It
can be challenging to implement access control mechanisms that allow authorized users to
access the data they need while preventing unauthorized users from accessing it.
Auditing and Logging: DBMS must maintain an audit trail of all activities in the database.
This includes monitoring who accesses the database, what data is accessed, and when it is
accessed. This can be a challenge to implement and manage, especially in large databases.
Database Design: The design of the database can also impact security. A poorly designed
database can lead to security vulnerabilities, such as SQL injection attacks, which can
compromise the confidentiality, integrity, and availability of data.
Malicious attacks: Cyberattacks such as hacking, malware, and phishing pose a significant
threat to the security of databases. DBMS must have robust security measures in place to
prevent and detect such attacks.
Physical Security: Physical security of the database is also important, as unauthorized physical
access to the server can lead to data breaches.
System privileges — A system privilege is the right to perform an activity on a specific type of
object. For example, the privilege to delete rows of any table in a database is a system privilege.
There are a total of 60 different system privileges. System privileges allow users to CREATE,
ALTER, or DROP database objects.
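As an illustration only, a system privilege and an object privilege might be granted through a Python DB-API connection as sketched below; the connection object, user name, and table name are placeholders, and the GRANT syntax shown is Oracle-style.

# Illustrative GRANT/REVOKE statements executed through a DB-API cursor.
# 'conn' is whatever connection the application already holds (e.g. from
# oracledb or psycopg2); the user 'clerk' and table 'employees' are invented.

def grant_privileges(conn):
    cur = conn.cursor()
    # System privilege: the right to perform an action on a TYPE of object.
    cur.execute("GRANT CREATE TABLE TO clerk")
    # Object privilege: the right to perform an action on one SPECIFIC object.
    cur.execute("GRANT SELECT, UPDATE ON employees TO clerk")
    # Revoking works the same way when the privilege is no longer needed.
    cur.execute("REVOKE UPDATE ON employees FROM clerk")
    conn.commit()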
For example, a security analyst can configure a firewall but can’t view customer data, while a
sales rep can see customer accounts but can’t touch firewall settings.
Benefits of RBAC
Create systematic, repeatable assignment of permissions
Easily audit user privileges and correct identified issues
Quickly add and change roles, as well as implement them across APIs
More effectively comply with regulatory and statutory requirements for confidentiality and
privacy
(ii) Role authorization: The user must be authorized to take on the role or roles they have
been assigned.
(iii) Permission authorization: Permissions or privileges are granted only to users who have
been authorized through their role assignments.
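A minimal Python sketch of these rules, with invented users, roles, and permissions, echoing the security-analyst and sales-rep example above: a user acts only through a role assigned to them, and permissions are checked against that active role.

# Toy RBAC check: role assignment/authorization plus permission authorization.
# Users, roles, and permissions are invented for illustration.

ROLE_PERMISSIONS = {
    "security_analyst": {"configure_firewall"},
    "sales_rep":        {"view_customer_accounts"},
}

USER_ROLES = {
    "alice": {"security_analyst"},
    "bob":   {"sales_rep"},
}

def can(user, active_role, permission):
    # The user may act only through a role assigned and authorized for them.
    if active_role not in USER_ROLES.get(user, set()):
        return False
    # The permission must have been granted to that role.
    return permission in ROLE_PERMISSIONS.get(active_role, set())

print(can("alice", "security_analyst", "configure_firewall"))   # True
print(can("bob",   "sales_rep",        "configure_firewall"))   # False
print(can("bob",   "security_analyst", "configure_firewall"))   # False: role not assigned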
Models of RBAC
There are four separate models for implementing RBAC, but each model begins with the same
core structure. Each successive model builds new functionality and features upon the previous model.
1) Core RBAC - Sometimes called Flat RBAC, this model is the required foundation for any
RBAC system. It follows the three basic rules of RBAC.
2) Hierarchical RBAC - This model adds role hierarchies that replicate the reporting structure of
an organization. In a role hierarchy, each role inherits the permissions of the role beneath it
and gains new permissions.
3) Constrained RBAC - This model adds capabilities for enforcing separation of duties (SOD).
Separation of duties helps prevent conflicts of interest by requiring two people to complete
certain tasks.
4) Symmetric RBAC - This model is the most advanced, flexible and comprehensive version of
RBAC. In addition to the capabilities of the previous models, it adds deeper visibility into
permissions across an enterprise.
ii. Improving compliance:- Every organization must conform to local, state, and federal
regulations. Organizations generally prefer to implement RBAC systems to meet the
regulatory and statutory requirements for confidentiality and privacy, because executives and IT
departments can more effectively manage how the information is accessed and used. This is especially
significant for financial institutions and medical organizations that manage sensitive data.
iii. Giving administrators increased visibility:- RBAC gives network administrators and
managers greater visibility and oversight into the business, while also ensuring that authorized
users and guests on the system are given access only to what they need to do their
jobs.
iv. Reducing the risk of breaches and data leakage:- Implementing RBAC means
restricting access to sensitive information, consequently reducing the potential for data breaches or
data leakage.
ii. Complexity: Implementing RBAC can be complex, requiring significant planning and
resources. Setting up roles and permissions for each user can be time-consuming and may require
ongoing maintenance to ensure that access levels remain appropriate.
iii. Lack of Granularity: RBAC can lack granularity in terms of the level of access granted to
users. This can result in situations where users have access to more resources than they need to perform
their job, which can increase the risk of data breaches and other security incidents.
iv. Overreliance on Roles: RBAC can result in overreliance on roles, which can limit the
ability to make granular access decisions based on specific user attributes, such as their level of trust or
the sensitivity of the data they are accessing.
SQL Injection (SQLi) is a security vulnerability that occurs when an attacker is able to
manipulate a web application’s database queries by inserting malicious SQL code into user input fields.
These injected queries can manipulate the underlying database to retrieve, modify, or delete sensitive
data. In some cases, attackers can even escalate privileges, gaining full control over the database or
server.
SQL Injection typically works when a web application improperly validates user input, allowing
an attacker to inject malicious SQL code. For example, if a web application takes user input (e.g., a
username or password) and directly inserts it into an SQL query without proper sanitization, an attacker
can manipulate the query to perform unintended actions.
Example
Suppose we have an application based on student records. Any student can view only his or her
records by entering a unique and private student ID.
SELECT * FROM STUDENT WHERE
STUDENT-ID = 12222345 OR 1 = 1
This SQL injection relies on the fact that 1=1 is always true. As we can see in the above example,
the injected OR 1=1 condition returns all records for which it holds true. So basically, all the student
data is compromised. The malicious user can use other SQL queries in a similar way.
Query 1:
SELECT * FROM USER WHERE
USERNAME = “” AND PASSWORD=””
Now a malicious attacker can use an always-true condition such as OR 1=1 to retrieve private and
secure user information. The following query, when executed, retrieves protected data that was not
intended to be shown to users.
Query 2:
SELECT* FROM User WHERE
(Username = “” OR 1=1) AND
(Password=”” OR 1=1).
Since '1'='1' is always true, the attacker could gain unauthorized access to the application.
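The standard defence is to keep user input out of the query text by using parameterized (prepared) queries. Below is a minimal sketch using Python's built-in sqlite3 module with an invented USER table and credentials; the same idea applies with any database driver.

# Parameterized queries keep user input as data, not as SQL text,
# so the "OR 1=1" trick above no longer changes the query's meaning.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE USER (USERNAME TEXT, PASSWORD TEXT)")
conn.execute("INSERT INTO USER VALUES ('admin', 'secret')")

def login(username, password):
    # Vulnerable version (do NOT do this): string concatenation lets the
    # attacker rewrite the query, e.g.
    #   "SELECT * FROM USER WHERE USERNAME='" + username + "' AND ..."
    # Safe version: placeholders are bound separately from the SQL text.
    cur = conn.execute(
        "SELECT * FROM USER WHERE USERNAME = ? AND PASSWORD = ?",
        (username, password),
    )
    return cur.fetchone() is not None

print(login("admin", "secret"))              # True: valid credentials
print(login("\" OR 1=1 --", "anything"))     # False: injection attempt fails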
A full treatment of the techniques used to protect the privacy of individual information is beyond the
scope of these notes. We will illustrate the problem with a very simple example involving a
PERSON relation with the attributes Name, Ssn, Income, Address, City, State, Zip, Sex, and
Last_degree.
A population is a set of tuples of a relation (table) that satisfy some selection condition. Hence,
each selection condition on the PERSON relation will specify a particular population of PERSON
tuples. For example, the condition Sex = ‘M’ specifies the male population; the condition ((Sex = ‘F’)
AND (Last_degree = ‘M.S.’ OR Last_degree = ‘Ph.D.’)) specifies the female population that has an
M.S. or Ph.D. degree as their highest degree; and the condition City = ‘Houston’ specifies the
population that lives in Houston.
Statistical queries involve applying statistical functions to a population of tuples. For example,
we may want to retrieve the number of individuals in a population or the average income in the
population. However, statistical users are not allowed to retrieve individual data, such as the income of a
specific person. Statistical database security techniques must prohibit the retrieval of individual data.
This can be achieved by prohibiting queries that retrieve attribute values and by allowing only queries
that involve statistical aggregate functions such as COUNT, SUM, MIN, MAX, AVERAGE, and
STANDARD DEVIATION. Such queries are sometimes called statistical queries.
In some cases it is possible to infer the values of individual tuples from a sequence of statistical
queries. This is particularly true when the conditions result in a population consisting of a small number
of tuples. As an illustration, consider the following statistical queries:
Q1: SELECT COUNT (*) FROM PERSON WHERE <condition>;
Q2: SELECT AVG (Income) FROM PERSON WHERE <condition>;
Now suppose that we are interested in finding the Income of Jane Smith, and we know that she
has a Ph.D. degree and that she lives in the city of Bellaire, Texas. We issue the statistical query Q1
with the following condition:
(Last_degree=‘Ph.D.’ AND Sex=‘F’ AND City=‘Bellaire’ AND State=‘Texas’)
If we get a result of 1 for this query, we can issue Q2 with the same condition and find the Income
of Jane Smith. Even if the result of Q1 on the preceding condition is not 1 but a small number, say 2
or 3, we can issue statistical queries using the functions MAX, MIN, and AVERAGE to identify the
possible range of values for the Income of Jane Smith.
The possibility of accessing individual information from statistical queries is reduced by using
the following measures –
Partitioning of the Database – This means that the records of the database must not be stored in bulk
as a single group. They must be divided into groups of some minimum size according to the
confidentiality of the records. The advantage of partitioning the database is that queries can refer to any
complete group or set of groups, but queries cannot access subsets of records within a
group. So an attacker can access at most one or two groups, which are less private.
No statistical queries are permitted whenever the number of tuples in the population specified by the
selection condition falls below some threshold.
Prohibit sequences of queries that refer repeatedly to the same population of tuples.
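A minimal Python sketch of the threshold measure above: a statistical query is answered only when the population selected by its condition contains at least some minimum number of tuples. The PERSON tuples, income values, and threshold are invented for illustration.

# Toy enforcement of a minimum query-set size for statistical queries.
# If the selection condition matches fewer than THRESHOLD tuples, the
# query is refused, so an individual's income cannot be isolated.

THRESHOLD = 5  # invented minimum population size

PERSON = [  # invented tuples: (Name, City, Last_degree, Income)
    ("Jane Smith", "Bellaire", "Ph.D.", 95000),
    ("A. Kumar",   "Houston",  "M.S.",  60000),
    ("B. Lee",     "Houston",  "Ph.D.", 72000),
    ("C. Gomez",   "Houston",  "B.S.",  48000),
    ("D. Chen",    "Houston",  "M.S.",  55000),
    ("E. Rao",     "Houston",  "Ph.D.", 81000),
]

def avg_income(condition):
    population = [p for p in PERSON if condition(p)]
    if len(population) < THRESHOLD:
        raise PermissionError("population too small - query refused")
    return sum(p[3] for p in population) / len(population)

print(avg_income(lambda p: p[1] == "Houston"))        # allowed: 5 tuples
try:
    avg_income(lambda p: p[1] == "Bellaire")          # refused: 1 tuple
except PermissionError as e:
    print(e)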
• The simple flow policy specifies two classes of information: confidential (C) and
nonconfidential (N). According to this flow policy, only the flow of information from the confidential
class to the nonconfidential class is not allowed.
Covert Channel
• A covert channel is a type of attack that creates a capability to transfer information objects
between processes that are not supposed to be allowed to communicate.
• A covert channel violates the security policy.
• A covert channel allows information to pass from a higher classification level to a lower
classification level through improper means.
• Security experts believe that one way to avoid covert channels is to ensure that programmers
do not gain access to sensitive data.
The process of encoding messages is known as encryption. The sender sends the original text,
called plaintext; the encrypted form of the plaintext is called ciphertext. This
encrypted text travels through the network. When it reaches the receiving computer, the recipient
decodes the message to extract the correct meaning out of it. This process
is called decryption.
The sender applies the encryption algorithm and the recipient applies the decryption algorithm. Both
the sender and the receiver must agree on these algorithms for any meaningful communication. The
algorithm basically takes one text as input and produces another as the output; therefore, the algorithm
contains the intelligence for transforming the message.
For example, suppose we want to send some message through e-mail and we wish that nobody
except our friend should be able to understand it. Then the message can be encoded using some
intelligence, for example if the alphabets A to Z are encoded as follows:
the last three letters are placed in reverse order and then the first three letters are taken in straight
order; continuing this logic, the letters A to Z are encoded. Now if we write the message
"SEND SOME MONEY"
it will be
QBSA QRTB TRSBN
This coded message is called cipher text.
Types of Cryptography
There are two types of encryption schemes, based on the keys used for encryption and decryption.
1. Symmetric key encryption: It is also known as secret key encryption. In this method, only one key is
used. The same key is shared by sender and receiver for encryption and decryption of messages. Hence
both parties must agree upon the key before any transmission begins and nobody else should know
about it. At the sender's end, the key is used to change the original message into an encoded form. At the
receiver's end using the same key the encoded message is decrypted and original message is obtained.
Data Encryption Standard (DES) uses this approach. The problem with this approach is that of key
agreement and distribution.
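A minimal sketch of the symmetric-key idea in Python, using the third-party cryptography package rather than DES itself (which is obsolete); the point is that the same shared key both encrypts and decrypts, and the message text is taken from the example above.

# Symmetric-key encryption sketch: the SAME key encrypts and decrypts.
# Uses the third-party 'cryptography' package (pip install cryptography).
from cryptography.fernet import Fernet

shared_key = Fernet.generate_key()   # must be agreed on by sender and receiver
f = Fernet(shared_key)

ciphertext = f.encrypt(b"SEND SOME MONEY")   # sender's side
plaintext = f.decrypt(ciphertext)            # receiver's side, same key
print(plaintext.decode())                    # SEND SOME MONEY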
2. Asymmetric key encryption: It is also known as public key encryption. In this method, different keys
are used. One key is used for encryption and other key must be used for decryption. No other key can
decrypt the message-not even the original key used for encryption. One of the two keys is known as
public key and the other is the private key. Suppose there are two users X and Y.
• X wants to send a message to Y. Then X will convey its public key to Y but the private key of
X will be known to X only.
• Y should know its own private key, and X should know Y's public key.
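A minimal sketch of public-key encryption in Python, again using the cryptography package, in the common arrangement where the receiver's public key encrypts and only the receiver's private key can decrypt; the key size and padding shown are typical choices, not requirements.

# Asymmetric (public-key) encryption sketch using RSA.
# Y's public key encrypts; only Y's private key can decrypt.
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

y_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
y_public = y_private.public_key()    # this is what Y shares with the world

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

ciphertext = y_public.encrypt(b"SEND SOME MONEY", oaep)   # sender encrypts for Y
plaintext = y_private.decrypt(ciphertext, oaep)           # only Y can decrypt
print(plaintext.decode())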
Digital Signature
A digital signature is a mathematical scheme for demonstrating the authenticity of a digital
message or document. If the recipient gets a message with digital signature then he believes that the
message was created by a known sender. Digital signatures are commonly used for software
distribution, financial transactions, and in other cases where it is important to detect forgery or
tampering.
(ii) Then X executes an algorithm on the original plaintext to calculate a Message Digest, also
known as a hash. This algorithm takes the original plaintext in binary format and applies the
hashing algorithm; as output, a small string of binary digits is created. This hashing
algorithm is public and anyone can use it. The most popular message digest algorithms are
MD5 and SHA-1. X then encrypts the message digest using its own private key.
(iii) X now combines the ciphertext and its digital signature (i.e., the encrypted message digest), and this
is sent over the network to Y.
(iv) Y receives the ciphertext and X's digital signature. Y has to decrypt both of these. Y first
decrypts ciphertext back to plaintext. For this, it uses its own private key. Thus, Y gets the
message itself in a secure manner.
(v) Now to ensure that the message has come from the intended sender Y takes X's digital
signature and decrypts it. This gives Y the message digest as was generated by X. The X had
encrypted the message digest to form a digital signature using its own private key. Therefore,
Y uses X's public key for decrypting the digital signature.
(vi) The hash algorithm used to generate the message digest is public; therefore, Y can also use it
to compute a message digest from the decrypted plaintext.
(vii) Now there are two message digests, one created by X and the other by Y. Y now
simply compares the two message digests. If the two match, Y can be sure that the message
came indeed from X and not from someone else.
Thus, with a digital signature, confidentiality, authenticity, and message integrity are all assured.
The other important feature supported by digital signatures is non-repudiation; that is, a sender cannot
deny having sent a message. Since the digital signature requires the private key of the sender, once a
message is digitally signed, it can be legally proven that the sender did indeed send the message.
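A minimal Python sketch of steps (ii) to (vii), using RSA signatures from the cryptography package (an illustrative choice of library and parameters): X signs a digest of the message with X's private key, and Y verifies it with X's public key.

# Digital-signature sketch: X hashes the message and signs the digest with
# X's PRIVATE key; Y verifies with X's PUBLIC key.
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes
from cryptography.exceptions import InvalidSignature

x_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
x_public = x_private.public_key()          # known to Y

message = b"SEND SOME MONEY"
pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                  salt_length=padding.PSS.MAX_LENGTH)

signature = x_private.sign(message, pss, hashes.SHA256())   # X's side

try:                                                        # Y's side
    x_public.verify(signature, message, pss, hashes.SHA256())
    print("signature valid: message really came from X, unmodified")
except InvalidSignature:
    print("signature invalid: message forged or tampered with")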
5.15 Challenges
Following are the challenges faced by the database security system -
(1) Data Quality
• The database community needs solutions to assess the quality of data. The quality of data
can be assessed by simple mechanisms such as quality stamps that are posted on web sites.
• The database community may need more effective techniques of integrity semantics verification
for assessing the quality of data.
• Application-level recovery techniques are also used to repair incorrect data.