
Database Management System

Sub Code : BCS403


Module -5

Prof. Navyashree K S
Assistant Professor
Dept. of CSE (Data Science)
CONCURRENCY CONTROL
Introduction:
Concurrent execution refers to the simultaneous execution of more than one
transaction. This is a common scenario in multi-user database environments where
many users or applications might be accessing or modifying the database at the same
time.
Advantages of Concurrent Execution
1. Increased System Throughput: Multiple transactions can be in progress at the same time, but at different stages.
2. Maximized Processor Utilization: If one transaction is waiting for I/O operations, another transaction can utilize the processor.
3. Decreased Wait Time: Transactions no longer have to wait for other long transactions to complete.
4. Improved Transaction Response Time: Transactions get processed faster because they can be executed in parallel.
Concurrency Control Protocols: The concurrency control protocols ensure the atomicity, consistency, isolation, durability, and serializability of the concurrent execution of database transactions.
These protocols are categorized as:
1. Lock Based Concurrency Control Protocol
2. Time Stamp Concurrency Control Protocol
3. Validation Based Concurrency Control Protocol

Purpose of Concurrency Control

• To ensure that the isolation property is maintained while allowing transactions to execute concurrently.
• To preserve database consistency by ensuring that the schedules of executing transactions are serializable.
• To resolve read-write and write-write conflicts among transactions.
TWO-PHASE LOCKING TECHNIQUES FOR CONCURRENCY CONTROL
Types of Locks and System Lock Tables
• Several types of locks are used in concurrency control. To introduce locking
concepts gradually, first we discuss binary locks, which are simple but are also
too restrictive for database concurrency control purposes and so are not used
much.
• Then we discuss shared/exclusive locks—also known as read/write locks—
which provide more general locking capabilities and are used in database locking
schemes.
Binary Locks. A binary lock can have two states or values: locked and unlocked (or 1 and 0, for
simplicity).
• A distinct lock is associated with each database item X. If the value of the lock on X is 1, item
X cannot be accessed by a database operation that requests the item.
• If the value of the lock on X is 0, the item can be accessed when requested, and the lock value
is changed to 1. We refer to the current value (or state) of the lock associated with item X as
lock(X)
• Two operations, lock_item and unlock_item, are used with binary locking. A transaction requests access to an
item X by first issuing a lock_item(X) operation.
• If LOCK(X) = 1, the transaction is forced to wait. If LOCK(X) = 0, it is set to 1 (the transaction locks the item)
and the transaction is allowed to access item X.
• When the transaction is through using the item, it issues an unlock_item(X) operation, which sets LOCK(X)
back to 0 (unlocks the item) so that X may be accessed by other transactions. Hence, a binary lock
enforces mutual exclusion on the data item. A description of the lock_item(X) and unlock_item(X) operations
is shown below
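A minimal Python sketch of these two operations, assuming a lock table keyed by item name and a condition variable that makes each operation indivisible (the class and variable names are illustrative, not from the original figure):

import threading

class BinaryLockManager:
    """Lock table sketch: one entry per currently locked item."""
    def __init__(self):
        self._cond = threading.Condition()   # makes lock/unlock indivisible
        self._locks = {}                     # item name -> id of locking transaction

    def lock_item(self, X, txn_id):
        with self._cond:
            while X in self._locks:          # LOCK(X) = 1: wait in the queue for X
                self._cond.wait()
            self._locks[X] = txn_id          # LOCK(X) <- 1: transaction locks the item

    def unlock_item(self, X, txn_id):
        with self._cond:
            if self._locks.get(X) == txn_id:
                del self._locks[X]           # LOCK(X) <- 0
                self._cond.notify_all()      # wake transactions waiting for X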
• Notice that the lock_item and unlock_item operations must be implemented as indivisible units (known as critical
sections in operating systems); that is, no interleaving should be allowed once a lock or unlock operation is started
until the operation terminates or the transaction waits.
• The wait command within the lock_item(X) operation is usually implemented by putting the transaction in a waiting queue for item X until X is unlocked and the transaction can be granted access to it. Other transactions that also want to access X are placed in the same queue. Hence, the wait command is considered to be outside the lock_item operation.
• It is quite simple to implement a binary lock; all that is needed is a binary-valued variable, LOCK, associated with each
data item X in the database. In its simplest form, each lock can be a record with three fields:
• <Data_item_name, LOCK, Locking_transaction> plus a queue for transactions that are waiting to access the
item.
• The system needs to maintain only these records for the items that are currently locked in a lock table, which could be
organized as a hash file on the item name. Items not in the lock table are considered to be unlocked. The DBMS has
a lock manager sub-system to keep track of and control access to locks.
If the simple binary locking scheme described here is used, every transaction must obey the following rules:
• A transaction T must issue the operation lock_item(X) before any read_item(X) or write_item(X) operations are performed
in T.
• A transaction T must issue the operation unlock_item(X) after all read_item(X) and write_item(X) operations are
completed in T.
• A transaction T will not issue a lock_item(X) operation if it already holds the lock on item X.
• A transaction T will not issue an unlock_item(X) operation unless it already holds the lock on item X.
Shared/Exclusive (or Read/Write) Locks.
The preceding binary locking scheme is too restrictive for database items because at
most, one transaction can hold a lock on a given item. We should allow several
transactions to access the same item X if they all access X for reading purposes only.
This is because read operations on the same item by different transactions are not conflicting. However, if a transaction is to write an item X, it must have exclusive access to X.
• For this purpose, a different type of lock called a multiple-mode lock is used. In
this scheme—called shared/exclusive or read/write locks—there are three
locking operations: read_lock(X), write_lock(X), and unlock(X).
• A lock associated with an item X, LOCK(X), now has three possible states: read-
locked, write-locked, or unlocked.
• A read-locked item is also called share-locked because other transactions are
allowed to read the item, whereas a write-locked item is called exclusive-
locked because a single transaction exclusively holds the lock on the item.
• One method for implementing the preceding operations on a read/write lock is to keep track of the
number of transactions that hold a shared (read) lock on an item in the lock table.
• Each record in the lock table will have four fields:
<Data_item_name, LOCK, No_of_reads, Locking_transaction(s)>
• Again, to save space, the system needs to maintain lock records only for locked items in the lock
table. The value (state) of LOCK is either read-locked or write-locked, suitably coded (if we assume
no records are kept in the lock table for unlocked items).
• If LOCK(X)=write-locked, the value of locking_transaction(s) is a single transaction that holds the
exclusive (write) lock on X.
• If LOCK(X)=read-locked, the value of locking_transaction(s) is a list of one or more transactions that hold the shared (read) lock on X. The three operations read_lock(X), write_lock(X), and unlock(X) are sketched after the locking rules below.
When we use the shared/exclusive locking scheme, the system must enforce the following rules:
• A transaction T must issue the operation read_lock(X) or write_lock(X) before any read_item(X)
operation is performed in T.
• A transaction T must issue the operation write_lock(X) before any write_item(X) operation is
performed in T.
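A minimal Python sketch of read_lock, write_lock, and unlock, assuming the lock-table record described above with a state, the set of reading transactions, and the writing transaction; it also supports the lock-upgrade case used later by strict 2PL (all names are illustrative):

import threading

class RWLockManager:
    """Shared/exclusive lock table sketch: item -> (state, readers, writer)."""
    def __init__(self):
        self._cond = threading.Condition()
        self._table = {}   # item -> {"state": "read" | "write", "readers": set, "writer": txn}

    def read_lock(self, X, txn):
        with self._cond:
            while X in self._table and self._table[X]["state"] == "write":
                self._cond.wait()                    # wait while X is write-locked
            rec = self._table.setdefault(X, {"state": "read", "readers": set(), "writer": None})
            rec["readers"].add(txn)                  # share-locked: no_of_reads grows

    def write_lock(self, X, txn):
        with self._cond:
            # wait until X is unlocked, or only txn itself holds the read lock (upgrade)
            while X in self._table and not (
                self._table[X]["state"] == "read" and self._table[X]["readers"] == {txn}
            ):
                self._cond.wait()
            self._table[X] = {"state": "write", "readers": set(), "writer": txn}

    def unlock(self, X, txn):
        with self._cond:
            rec = self._table.get(X)
            if rec is None:
                return
            if rec["state"] == "write" and rec["writer"] == txn:
                del self._table[X]                   # exclusive lock released
            elif rec["state"] == "read":
                rec["readers"].discard(txn)
                if not rec["readers"]:               # last reader unlocks the item
                    del self._table[X]
            self._cond.notify_all()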
2. GUARANTEEING SERIALIZABILITY BY TWO-PHASE LOCKING
• A transaction is said to follow the two-phase locking protocol if all locking operations (read_lock, write_lock)
precede the first unlock operation in the transaction.
• Such a transaction can be divided into two phases: an expanding or growing (first) phase, during which new
locks on items can be acquired but none can be released; and a shrinking (second) phase, during which existing
locks can be released but no new locks can be acquired.
• If lock conversion is allowed, then upgrading of locks (from read-locked to write-locked) must be done during the
expanding phase, and downgrading of locks (from write-locked to read-locked) must be done in the shrinking
phase. Hence, a read_lock(X) operation that downgrades an already held write lock on X can appear only in the
shrinking phase.
• Transactions T1 and T2 in Figure 22.3(a) do not follow the two-phase locking protocol because the write_lock(X) operation follows
the unlock(Y) operation in T1, and similarly the write_lock(Y) operation follows the unlock(X) operation in T2.

• If we enforce two-phase locking, the transactions can be rewritten as T1 and T2 , as shown in Figure 22.4. Now, the schedule shown in
Figure 22.3(c) is not permitted for T1 and T2 (with their modified order of locking and unlocking operations) under the rules of locking
described in Section 22.1.1 because T1 will issue its write_lock(X) before it unlocks item Y; consequently, when T2 issues
its read_lock(X), it is forced to wait until T1 releases the lock by issuing an unlock(X) in the schedule.

• It can be proved that, if every transaction in a schedule follows the two-phase locking protocol, the schedule is guaranteed to be
serializable, obviating the need to test for serializability of schedules. The locking protocol, by enforcing two-phase locking rules, also
enforces serializability.
Two-phase locking may limit the amount of concurrency that can occur in a schedule
because a transaction T may not be able to release an item X after it is through using it
if T must lock an additional item Y later; or conversely, T must lock the additional
item Y before it needs it so that it can release X.
Hence, X must remain locked by T until all items that the transaction needs to read or
write have been locked; only then can X be released by T.
Meanwhile, another transaction seeking to access X may be forced to wait, even
though T is done with X; conversely, if Y is locked earlier than it is needed, another
transaction seeking to access Y is forced to wait even though T is not using Y yet.
This is the price for guaranteeing serializability of all schedules without having to
check the schedules themselves.
Although the two-phase locking protocol guarantees serializability (that is, every
schedule that is permitted is serializable), it does not permit all possible serializable
schedules (that is, some serializable schedules will be prohibited by the protocol).
Basic, Conservative, Strict, and Rigorous Two-Phase Locking.
There are a number of variations of two-phase locking (2PL). The technique just described is known as basic
2PL.
1. A variation known as conservative 2PL (or static 2PL) requires a transaction to lock all the items it accesses before the
transaction begins execution, by predeclaring its read-set and write-set. Recall from Section 21.1.2 that the read-set of a
transaction is the set of all items that the transaction reads, and the write-set is the set of all items that it writes.
• If any of the predeclared items needed cannot be locked, the transaction does not lock any item; instead, it waits until all
the items are available for locking.
• Conservative 2PL is a deadlock-free protocol, as we will see when we discuss the deadlock problem. However, it is difficult to use in practice because of the need to predeclare the read-set and write-set, which is not possible in many situations.
2. In practice, the most popular variation of 2PL is strict 2PL, which guarantees strict schedules.
In this variation, a transaction T does not release any of its exclusive (write) locks until after it commits or aborts. Hence, no other transaction can read or write an item that is written by T unless T has committed, leading to a strict schedule for recoverability. Strict 2PL is not deadlock-free.
3. A more restrictive variation of strict 2PL is rigorous 2PL, which also guarantees strict schedules. In this variation, a transaction T does not release any of its locks (exclusive or shared) until after it commits or aborts, and so it is easier to implement than strict 2PL.
Notice the difference between conservative and rigorous 2PL: the former must lock all its items before it starts, so once the
transaction starts it is in its shrinking phase; the latter does not unlock any of its items until after it terminates (by
committing or aborting), so the transaction is in its expanding phase until it ends.
In many cases, the concurrency control subsystem itself is responsible for generating
the read_lock and write_lock requests.
• For example, suppose the system is to enforce the strict 2PL protocol. Then, whenever
transaction T issues a read_item(X), the system calls the read_lock(X) operation on behalf of T.
• If the state of LOCK(X) is write_locked by some other transaction T′, the system places T in the waiting queue for item X; otherwise, it grants the read_lock(X) request and permits the read_item(X) operation of T to execute.
• On the other hand, if transaction T issues a write_item(X), the system calls the write_lock(X) operation on behalf of T. If the state of LOCK(X) is write_locked or read_locked by some other transaction T′, the system places T in the waiting queue for item X;
• if the state of LOCK(X) is read_locked and T itself is the only transaction holding the read lock
on X, the system upgrades the lock to write_locked and permits the write_item(X) operation by T.
• Finally, if the state of LOCK(X) is unlocked, the system grants the write_lock(X) request and
permits the write_item(X) operation to execute. After each action, the system must update its lock
table appropriately.
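A hedged Python sketch of this dispatch logic, reusing the RWLockManager sketched earlier and remembering each transaction's locks so they are all released only at commit or abort (releasing shared locks at termination as well, i.e., the rigorous 2PL variant described above; names are illustrative):

class Strict2PLScheduler:
    """Issues lock requests on behalf of transactions; releases locks only at end."""
    def __init__(self, lock_mgr, db):
        self.lock_mgr = lock_mgr        # e.g., the RWLockManager sketched above
        self.db = db                    # item -> value
        self.held = {}                  # txn -> set of items it has locked

    def read_item(self, txn, X):
        self.lock_mgr.read_lock(X, txn)     # system calls read_lock(X) for T
        self.held.setdefault(txn, set()).add(X)
        return self.db.get(X)

    def write_item(self, txn, X, value):
        self.lock_mgr.write_lock(X, txn)    # upgrades T's own read lock if needed
        self.held.setdefault(txn, set()).add(X)
        self.db[X] = value

    def end_transaction(self, txn):
        # no lock is released before the transaction commits or aborts
        for X in self.held.pop(txn, set()):
            self.lock_mgr.unlock(X, txn)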
• The use of locks can cause two additional problems: deadlock and starvation.
3. Dealing with Deadlock and Starvation
Deadlock occurs when each transaction T in a set of two or more transactions is waiting for some item that is locked by some other transaction T′ in the set. Hence, each transaction in the set is in a waiting queue, waiting for one of the other transactions in the set to release the lock on an item. But because the other transaction is also waiting, it will never release the lock.
• A simple example is shown in Figure 22.5(a), where the two transactions T1 and T2 are deadlocked in a partial schedule; T1 is in the waiting queue for X, which is locked by T2, while T2 is in the waiting queue for Y, which is locked by T1. Meanwhile, neither T1 nor T2 nor any other transaction can access items X and Y.
Deadlock Prevention
• A transaction locks all the data items it refers to before it begins execution.
• This way of locking prevents deadlock, since a transaction never waits for a data item.
• Conservative two-phase locking uses this approach.
Deadlock Detection and Resolution
In this approach, deadlocks are allowed to happen. The scheduler maintains a wait-for graph for detecting cycles. If a cycle exists, one of the transactions involved in the cycle is selected as a victim and rolled back.
A wait-for graph is created using the lock table. As soon as a transaction is blocked, it is added to the graph. When a chain like "Ti waits for Tj, which waits for Tk, which waits for Ti (or Tj)" occurs, a cycle is created.
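A small Python sketch of wait-for graph maintenance with depth-first cycle detection (an illustrative structure, not any specific DBMS's implementation):

class WaitForGraph:
    """Edge Ti -> Tj means transaction Ti waits for a lock held by Tj."""
    def __init__(self):
        self.edges = {}                          # txn -> set of txns it waits for

    def add_wait(self, waiter, holder):          # called when a transaction blocks
        self.edges.setdefault(waiter, set()).add(holder)

    def remove_txn(self, txn):                   # called when txn commits or aborts
        self.edges.pop(txn, None)
        for successors in self.edges.values():
            successors.discard(txn)

    def find_cycle_from(self, start):
        """Return a list of transactions forming a cycle, or None."""
        path, seen = [], set()

        def dfs(t):
            if t in path:
                return path[path.index(t):]      # cycle found: pick a victim from it
            if t in seen:
                return None
            seen.add(t)
            path.append(t)
            for nxt in self.edges.get(t, ()):
                cycle = dfs(nxt)
                if cycle:
                    return cycle
            path.pop()
            return None

        return dfs(start)

# Example: T1 waits for T2 and T2 waits for T1 -> cycle [T1, T2]; roll one back as victim.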
Deadlock Avoidance
There are many variations of the two-phase locking algorithm. Some avoid deadlock by not letting a cycle complete. That is, as soon as the algorithm discovers that blocking a transaction is likely to create a cycle, it rolls back the transaction.
Dealing with Deadlock and Starvation:
• Starvation occurs when a particular transaction consistently waits or is restarted and never gets a chance to proceed further.
• In deadlock resolution, it is possible that the same transaction may repeatedly be selected as the victim and rolled back.
• This limitation is inherent in all priority-based scheduling mechanisms.
• In the wound-wait scheme, a younger transaction may always be wounded (aborted) by a long-running older transaction, which creates starvation.

2. Concurrency Control Based on Timestamp Ordering

• A timestamp is a monotonically increasing variable (integer) indicating the age of an operation or a transaction.
• A larger timestamp value indicates a more recent event or operation.
• Timestamp-based algorithms use timestamps to serialize the execution of concurrent transactions.
The Timestamp Ordering Algorithm
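A minimal Python sketch of basic timestamp ordering as it is usually stated, assuming read_TS(X) and write_TS(X) are kept per item and a transaction is aborted (to be restarted with a new timestamp) when its operation arrives too late; names are illustrative:

class Abort(Exception):
    """Raised when a transaction must be rolled back and restarted with a new timestamp."""

class TimestampOrdering:
    """Basic TO sketch: each item X keeps read_TS(X) and write_TS(X)."""
    def __init__(self):
        self.read_ts = {}    # X -> largest timestamp of any transaction that read X
        self.write_ts = {}   # X -> timestamp of the transaction that last wrote X
        self.db = {}

    def read_item(self, ts, X):
        if ts < self.write_ts.get(X, 0):
            raise Abort(f"T({ts}) reads {X} after a younger transaction wrote it")
        self.read_ts[X] = max(self.read_ts.get(X, 0), ts)
        return self.db.get(X)

    def write_item(self, ts, X, value):
        if ts < self.read_ts.get(X, 0) or ts < self.write_ts.get(X, 0):
            raise Abort(f"T({ts}) writes {X} too late to preserve timestamp order")
        self.db[X] = value
        self.write_ts[X] = ts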
MULTIVERSION CONCURRENCY CONTROL TECHNIQUES
• This approach maintains a number of versions of a data item and allocates the right version to a read operation of a transaction. Thus, unlike other mechanisms, a read operation in this mechanism is never rejected.
• Side effect: Significantly more storage (RAM and disk) is required to maintain multiple versions. To check unlimited growth of versions, garbage collection is run when some criterion is satisfied.
Multiversion Technique Based on Timestamp Ordering
In this method, several versions X1, X2, … , Xk of each data item X are
maintained. For each version, the value of version Xi and the following two
timestamps associated with version Xi are kept:
1. read_TS(Xi). The read timestamp of Xi is the largest of all the
timestamps of transactions that have successfully read version Xi.
2. write_TS(Xi). The write timestamp of Xi is the timestamp of the
transaction that wrote the value of version Xi.

• Whenever a transaction T is allowed to execute a write_item(X) operation, a new version Xk+1 of item X is created, with both the write_TS(Xk+1) and the read_TS(Xk+1) set to TS(T).
• Correspondingly, when a transaction T is allowed to read the value of version Xi, the value of read_TS(Xi) is set to the larger of the current read_TS(Xi) and TS(T).
• To ensure serializability, the following rules are used:
1. If transaction T issues a write_item(X) operation, and version i of X has the highest write_TS(Xi) of all versions of X that is also less than or equal to TS(T), and read_TS(Xi) > TS(T), then abort and roll back transaction T; otherwise, create a new version Xj of X with read_TS(Xj) = write_TS(Xj) = TS(T).
2. If transaction T issues a read_item(X) operation, find the version i of X that has the highest write_TS(Xi) of all versions of X that is also less than or equal to TS(T); then return the value of Xi to transaction T, and set the value of read_TS(Xi) to the larger of TS(T) and the current read_TS(Xi).
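A hedged Python sketch of these two rules, keeping the versions of each item ordered by write timestamp and reusing the Abort exception from the timestamp-ordering sketch (illustrative only):

import bisect

class MVTO:
    """Multiversion TO sketch: versions of X stored as [write_TS, value, read_TS]."""
    def __init__(self):
        self.versions = {}   # X -> list of versions sorted by write_TS

    def _latest_before(self, X, ts):
        """Version of X with the highest write_TS less than or equal to TS(T)."""
        vs = self.versions.setdefault(X, [[0, None, 0]])   # initial version of X
        i = bisect.bisect_right([v[0] for v in vs], ts) - 1
        return vs[i]

    def read_item(self, ts, X):
        v = self._latest_before(X, ts)
        v[2] = max(v[2], ts)          # rule 2: bump read_TS(Xi); reads are never rejected
        return v[1]

    def write_item(self, ts, X, value):
        v = self._latest_before(X, ts)
        if v[2] > ts:                 # rule 1: a younger transaction already read Xi
            raise Abort(f"T({ts}): write of {X} rejected")
        if v[0] == ts:
            v[1] = value              # T overwrites its own version
        else:
            bisect.insort(self.versions[X], [ts, value, ts])   # new version Xj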
Multiversion Two-Phase Locking Using Certify Locks
Validation-Based (Optimistic) Concurrency Control
Granularity of Data Items and Multiple Granularity
Locking
• All concurrency control techniques assume that the database is formed of a
number of named data items. A database item could be chosen to be one of
the following:
• A database record
• A field value of a database record
• A disk block
• A whole file
• The whole database

The granularity can affect the performance of concurrency control and


recovery. we discuss a multiple granularity locking scheme, where the
granularity level (size of the data item) may be changed dynamically.
NOSQL Databases and Big Data Storage Systems:
1.Introduction to NOSQL Systems
• NoSQL is a type of database management system (DBMS) that is designed to handle
and store large volumes of unstructured and semi-structured data. Unlike traditional
relational databases that use tables with pre-defined schemas to store data, NoSQL
databases use flexible data models that can adapt to changes in data structures and
are capable of scaling horizontally to handle growing amounts of data.
• NoSQL databases are generally classified into four main categories:
1. Document databases: These databases store data as semi-structured documents,
such as JSON or XML, and can be queried using document-oriented query
languages.
2. Key-value stores: These databases store data as key-value pairs, and are
optimized for simple and fast read/write operations.
3. Column-family stores: These databases store data as column families, which are
sets of columns that are treated as a single entity. They are optimized for fast and
efficient querying of large amounts of data.
4. Graph databases: These databases store data as nodes and edges, and are
designed to handle complex relationships between data.
Emergence of NOSQL Systems
Characteristics of NOSQL Systems
THE CAP THEOREM
• The CAP theorem, also known as Brewer's theorem, was introduced by Eric Brewer in 2000. The three letters in CAP stand for:
• C: Consistency
• A: Availability
• P: Partition Tolerance

Statement of the CAP Theorem
The CAP theorem states that it is not possible to guarantee all three of the desirable properties (consistency, availability, and partition tolerance) at the same time in a distributed system with data replication.
C: Consistency
• In a distributed system, consistency means that all nodes or replicas in the system have the same data at the same time. When a client reads data, it receives the most recent write or an error. In other words, there is no divergence in the data observed by different nodes.
• Suppose we are working on a distributed system with a client node and two database nodes, say d1 and d2. Now say we generate an update request to d1 and, at the same time, a read request at d2. If, because data is replicated between d1 and d2, the read still returns the latest data, the system is consistent.
A: Availability
• Availability refers to the system's ability to respond to client requests, even in the presence of node failures or network partitions. An available system ensures that every request eventually receives a response, though it doesn't guarantee that the response contains the most recent data.
• In short, availability ensures that the system is always available.
P: Partition Tolerance
• Partition tolerance deals with the system's ability to continue functioning even when network partitions occur. Network partitions can cause nodes to lose contact with one another, making communication and synchronization difficult.
• Let's use the earlier example to understand this better. Suppose the connection between d1 and d2 breaks down. Replication of data no longer occurs, so consistency is not maintained, but both nodes still produce output. The fact that the system keeps generating output even after the connection breakdown is partition tolerance.
The CAP theorem says that we cannot have all three properties (C, A, P) at the same time; we can have at most two at once. All possible combinations of consistency, availability, and partition tolerance are:
1. CA (consistency + availability): The complete system is consistent and always available. If we break the connection between the nodes in order to make the system partition tolerant, we lose consistency.
2. AP (availability + partition tolerance): After breaking the connection between d1 and d2, the system becomes partition tolerant and is always available, but consistency is not maintained.
3. CP (consistency + partition tolerance): To make the system consistent and partition tolerant, we have to take it down in order to re-establish the connection between d1 and d2; this makes the system unavailable for a while, and after the connection has been established the system is no longer partition tolerant.
The CAP theorem is important because it forces developers to think carefully about the trade-offs
they’re making when building a distributed system. When designing a distributed system, you have to
decide which two properties are most important for your use case.
Document-Based NOSQL Systems and MongoDB
• A document is a record in a document database. A document typically stores
information about one object and any of its related metadata.
• Documents store data in field-value pairs. The values can be a variety of types
and structures, including strings, numbers, dates, arrays, or objects.
Documents can be stored in formats like JSON, BSON, and XML.
• Collections
• A collection is a group of documents. Collections typically store documents
that have similar contents.
• Not all documents in a collection are required to have the same fields, because
document databases have flexible schemas. Note that some document
databases provide schema validation, so the schema can optionally be locked
down when needed.
• Documents can be specified in various formats, such as XML. A popular
language to specify documents in NOSQL systems is JSON (JavaScript
Object Notation).
• There are many document-based NOSQL systems, including MongoDB
and CouchDB, among many others.
MongoDB Data Model
MongoDB documents are stored in BSON (Binary JSON) format, which is a
variation of JSON with some additional data types and is more efficient for
storage than JSON. Individual documents are stored in a collection.
The createCollection operation is used to create each collection. For example, the following command can be used to create a capped collection called project to hold PROJECT objects from the COMPANY database:
db.createCollection("project", { capped: true, size: 1310720, max: 500 })
• Each document in a collection has a unique ObjectId field, called _id.
• A collection does not have a schema.
• The structure of the data is chosen based on how the documents will be accessed.
• The user can choose a normalized or denormalized design.
• Documents are created using the insert operation:
db.<collection_name>.insert(<document(s)>)
• Documents are deleted using the remove operation:
db.<collection_name>.remove(<document(s)>)
MongoDB CRUD Operations
CRUD operations
Document databases typically have an API or query language that allows
developers to execute the CRUD (create, read, update, and delete) operations.
1. Create: Documents can be created in the database. Each document has a
unique identifier.
• db.collection.insertOne()
• db.collection.insertMany()
2. Read: Documents can be read from the database. The API or query
language allows developers to query for documents using their unique
identifiers or field values. Indexes can be added to the database in order to
increase read performance.
• db.collection.find()

3. Update: Update operations modify existing documents in a collection. MongoDB provides the following methods to update documents of a collection:
• db.collection.updateOne()
• db.collection.updateMany()
• db.collection.replaceOne()
4. Delete Operations
Delete operations remove documents from a collection. MongoDB provides the following
methods to delete documents of a collection:
• db.collection.deleteOne()
• db.collection.deleteMany()
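A compact sketch of these CRUD calls through the PyMongo driver, whose method names mirror the shell's (the local server address, database name, and fields are illustrative assumptions):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local server
projects = client["company"]["project"]             # illustrative database/collection

# Create
projects.insert_one({"Pno": 8, "Pname": "WebDev", "Dno": 1})
projects.insert_many([{"Pno": 9, "Pname": "Payroll"}, {"Pno": 10, "Pname": "Inventory"}])

# Read: filter by field values; an index on Pno would speed this up
for doc in projects.find({"Pname": "WebDev"}):
    print(doc["_id"], doc["Pno"])

# Update
projects.update_one({"Pno": 8}, {"$set": {"Pname": "WebDev2"}})
projects.update_many({"Dno": 1}, {"$set": {"Dno": 2}})
projects.replace_one({"Pno": 9}, {"Pno": 9, "Pname": "PayrollV2"})

# Delete
projects.delete_one({"Pno": 10})
projects.delete_many({"Pname": "PayrollV2"})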
MongoDB Distributed Systems Characteristics
1. Sharding-distributing data across multiple servers.
Mechanism: Each shard is a separate MongoDB instance. Data is partitioned based on a shard key, which is
chosen to balance the load across shards effectively.
2. Replication-Replication involves copying data from one MongoDB server (the primary) to one or more
other servers (secondaries).
Mechanism: MongoDB uses replica sets, where a group of mongod instances maintain the same data set. One
instance is designated as the primary, and the rest as secondaries.
3. Consistency and Read/Write Concerns
Consistency: MongoDB offers tunable consistency models through read and write concerns. For example, you can configure a read concern to ensure the data read is acknowledged by a certain number of replica set members. (A driver-level sketch follows this list.)
4. Automatic Failover and Election
• Failover: In a replica set, if the primary node fails, the system automatically detects the failure and promotes
one of the secondaries to become the new primary.
• Election: MongoDB uses an election process to determine which secondary should become the new
primary. This process is managed by the replica set members themselves.
5. Horizontal Scalability- Horizontal scalability means adding more servers to handle increased load.
Mechanism: MongoDB achieves this through sharding. By distributing data and query loads across multiple servers, it can
handle larger volumes of data and higher traffic more effectively.
6. Data Distribution and Balancing
Balancer: MongoDB includes a balancer that redistributes data across shards to maintain an even load. This process helps
prevent any single shard from becoming a bottleneck.
Chunk Migration: The balancer moves chunks of data between shards to ensure that the data is evenly distributed.
7. Geospatial Distribution-Global Deployment: MongoDB supports deployment across geographically dispersed data
centers. This can improve latency for global applications by placing data closer to users.
8. Strong Consistency Options-Transactions: MongoDB supports multi-document ACID transactions, ensuring strong
consistency and isolation even in distributed environments. This is crucial for applications that require reliable and accurate
data across multiple operations.
9. Flexible Schema Design
Schema-less: MongoDB’s document model allows for flexible schema design, which can adapt to changes in data structure
over time. This flexibility is beneficial in distributed systems where different nodes might have different schema
requirements.
10. Operational Tools
Monitoring and Management: MongoDB provides tools for monitoring and managing distributed clusters, such as
MongoDB Atlas, which offers automated monitoring, backups, and scaling.
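As promised under item 3 above, a small PyMongo sketch of tunable consistency via read and write concerns (connection details are illustrative; "majority" is a standard concern level):

from pymongo import MongoClient, WriteConcern
from pymongo.read_concern import ReadConcern

client = MongoClient("mongodb://localhost:27017")   # assumed replica set member

db = client["company"]

# Writes acknowledged by a majority of replica set members
projects_w = db.get_collection("project", write_concern=WriteConcern(w="majority"))
projects_w.insert_one({"Pno": 11, "Pname": "Analytics"})

# Reads that only return data acknowledged by a majority of members
projects_r = db.get_collection("project", read_concern=ReadConcern("majority"))
print(projects_r.find_one({"Pno": 11}))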
NOSQL Key-Value Stores
• Key-value stores focus on high performance, availability, and scalability
by storing data in a distributed storage system.
• The data model used in key-value stores is relatively simple; there is no query language, but rather a set of operations that can be used by application programmers.
• The key is a unique identifier associated with a data item and is used to
locate this data item rapidly.
• The value is the data item itself, and it can have very different formats
for different key-value storage systems.
• In some cases, the value is just a string of bytes or an array of bytes, and
the application using the key-value store has to interpret the structure of
the data value.
• In other cases, some standard formatted data
is allowed; for example, structured data rows
(tuples) similar to relational data, or semi
structured data using JSON or some other
self-describing data format.
• Different key-value stores can thus store
unstructured, semi structured, or structured
data items .
• The main characteristic of key-value stores is
the fact that every value (data item) must be
associated with a unique key, and that
retrieving the value by supplying the key
must be very fast.
DynamoDB Overview
• The DynamoDB system is an Amazon product and is available as part of
Amazon’s AWS/SDK platforms (Amazon Web Services/Software Development
Kit).
• It can be used as part of Amazon’s cloud computing services, for the data storage
component.
• DynamoDB data model. The basic data model in DynamoDB uses the concepts of
tables, items, and attributes. A table in DynamoDB does not have a schema; it
holds a collection of self-describing items.
• Each item will consist of a number of (attribute, value) pairs, and attribute
values can be single-valued or multivalued. DynamoDB also allows the user to
specify the items in JSON format, and the system will convert them to the internal
storage format of DynamoDB.
• When a table is created, it is required to specify a table name and a primary
key; the primary key will be used to rapidly locate the items in the table. Thus,
the primary key is the key and the item is the value for the DynamoDB key-value
store.
• The primary key attribute must exist in every item in the table. The primary key can be one of the following two types:
• A single attribute: DynamoDB builds a hash index on the items in the table.
• A pair of attributes (A, B): DynamoDB builds a hash index on A and a range index on B.
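A hedged sketch using Amazon's boto3 Python SDK, creating a table with a hash-and-range primary key (the region, table name, key schema, and items are illustrative):

import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")  # assumed region

# Hash-and-range primary key (Dno, Pno) -- illustrative schema
table = dynamodb.create_table(
    TableName="project",
    KeySchema=[
        {"AttributeName": "Dno", "KeyType": "HASH"},
        {"AttributeName": "Pno", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "Dno", "AttributeType": "N"},
        {"AttributeName": "Pno", "AttributeType": "N"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()

# The primary key is the key; the self-describing item is the value
table.put_item(Item={"Dno": 1, "Pno": 8, "Pname": "WebDev", "Workers": ["Nitin", "Asha"]})
print(table.get_item(Key={"Dno": 1, "Pno": 8})["Item"])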
DynamoDB Distributed Characteristics
1. The mechanisms used for replication, sharding, and other distributed system concepts are illustrated below in an open source key-value system called Voldemort.
2. Voldemort is based on many of the techniques proposed for DynamoDB.
Voldemort Key-Value Distributed Data Store
• Open source system, licensed under Apache 2.0 and based on DynamoDB.
• Usually used for replication of data, scaling, and high performance.
• Uses consistent hashing (it is used for LinkedIn's data storage).
• Simple basic operations are used: s.put(k, v) and v = s.get(k) for insertion and retrieval of data items, where s is the store.
■ High-level formatted data values-The values v in the (k, v) items can be specified in JSON
(JavaScript Object Notation), and the system will convert between JSON and the internal
storage format.
• Other data object formats can also be specified if the application provides the conversion
(also known as serialization) between the user format and the storage format as a Serializer
class.
■ Consistent hashing for distributing (key, value) pairs.-A variation of the data distribution
algorithm known as consistent hashing is used in Voldemort for data distribution among the
nodes in the distributed cluster of nodes.
• A hash function h(k) is applied to the key k of each (k, v) pair, and h(k) determines where
the item will be stored.
• The method assumes that h(k) is an integer value, usually in the range 0 to Hmax = 2^n − 1, where n is chosen based on the desired range for the hash values.
• This method is best visualized by considering the range of all possible integer hash values 0 to Hmax to be evenly distributed on a circle (or ring). The nodes in the distributed system are then also located on the same ring, and each (k, v) item is stored at the node that follows h(k) on the ring.
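A minimal Python sketch of such a ring (real systems such as Voldemort add virtual nodes and replicate each item to the next few nodes on the ring):

import bisect
import hashlib

class ConsistentHashRing:
    """Maps h(k) onto a ring of node positions; an item lives on the next node clockwise."""
    def __init__(self, nodes):
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % (2 ** 32)

    def node_for(self, key):
        h = self._hash(key)
        positions = [pos for pos, _ in self.ring]
        i = bisect.bisect_right(positions, h) % len(self.ring)   # wrap around the circle
        return self.ring[i][1]

ring = ConsistentHashRing(["nodeA", "nodeB", "nodeC"])
print(ring.node_for("item-42"))   # the node that stores (k, v) pairs hashing near item-42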
■ Consistency and versioning. Voldemort uses a method similar to the one developed for DynamoDB for consistency in the presence of replicas. Basically, concurrent write operations are allowed by different processes, so two or more different values may exist for the same key at different nodes when items are replicated. Consistency is achieved when the item is read, using a technique known as versioning and read repair.
Examples of Other Key-Value Stores
• Oracle key-value store. Oracle has one of the well-known SQL relational database systems, and
Oracle also offers a system based on the key-value store concept; this system is called the Oracle
NoSQL Database.
• Redis key-value cache and store. Redis differs from the other systems discussed here because it caches its data in main memory to further improve performance. It offers master-slave replication and high availability, and it also offers persistence by backing up the cache to disk. (A short client sketch follows this list.)
• Apache Cassandra. Cassandra is a NOSQL system that is not easily categorized; it is sometimes listed in the column-based NOSQL category (see Section 24.5) or in the key-value category. It offers features from several NOSQL categories and is used by Facebook as well as many other customers.
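As referenced in the Redis entry above, a small sketch with the redis-py client (assuming a local Redis server; key names are illustrative):

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Plain key-value: the value is just a string the application interprets
r.set("project:8:name", "WebDev")
print(r.get("project:8:name"))          # -> "WebDev"

# Redis also offers richer value types, e.g. a hash of field-value pairs
r.hset("project:8", mapping={"Pname": "WebDev", "Dno": "1"})
print(r.hgetall("project:8"))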
Column-Based or Wide Column NOSQL Systems
• The Google distributed storage system for big data, known as Bigtable, is a well-
known example of this class of NOSQL systems, and it is used in many Google
applications that require large amounts of data storage, such as Gmail.
• Bigtable uses the Google File System (GFS) for data storage and distribution. An open
source system known as Apache HBase is somewhat similar to Google Bigtable, but it
typically uses HDFS (Hadoop Distributed File System) for data storage.
• Cassandra is also a column-based NOSQL system, but it can be used as a key-value store.
• HBase, a sparse, multidimensional, distributed, persistent storage system, stores map (key, value) pairs.
• The Hadoop Distributed File System provides large-scale data storage, with data stored across the network.
• Master-slave design: the master assigns work to the slaves.
HBase Data Models and Versioning
• HBase data model. The data model in HBase organizes data using the concepts of namespaces, tables, column families, column qualifiers, columns, rows, and data cells.
• A column is identified by a combination of (column family: column qualifier). Data is stored in a self-describing form
by associating columns with data values, where data values are strings.
• HBase also stores multiple versions of a data item, with a timestamp associated with each version
Define the terminology
1. Tables and Rows
2. Column Families, Column Qualifiers, and Columns
3. Versions and Timestamps.
4. Cells.
5. Namespaces
HBase CRUD Operations
HBase has low-level CRUD (create, read, update, delete) operations, as in many of the NOSQL systems.
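A hedged sketch of these operations through the happybase Python client, which talks to HBase via its Thrift gateway (the table and column family names are illustrative):

import happybase

connection = happybase.Connection("localhost")   # assumed Thrift gateway

# A column is identified as column_family:column_qualifier; values are bytes
table = connection.table("employee")
table.put(b"row-1", {b"personal:Fname": b"Nitin", b"personal:Lname": b"Sharma"})

# Read a whole row (latest version of each cell) or a single column
print(table.row(b"row-1"))
print(table.row(b"row-1", columns=[b"personal:Fname"]))

# Delete removes the named cells from the row
table.delete(b"row-1", columns=[b"personal:Lname"])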
NOSQL Graph Databases and Neo4j
• Neo4j Graph Database follows the Property Graph Model to store and manage its data.
Key features:
• The model represents data as Nodes, Relationships, and Properties
• Properties are key-value pairs
• Nodes are represented using circles, and Relationships are represented using arrows
• Relationships have directions: unidirectional and bidirectional
• Each Relationship contains a "Start Node" (or "From Node") and a "To Node" (or "End Node")
• Both Nodes and Relationships can contain properties
• Relationships connect nodes
Data Model in Neo4j: The data model in Neo4j organizes data using the concepts of nodes and relationships.
• Nodes can have labels:
• A node can have zero, one, or several labels.
• The nodes that have the same label are grouped into a collection that identifies a
subset of the nodes in the database graph for querying purposes.
• Relationships are directed; each relationship has a start node and an end node, as well as a relationship type, which serves a similar role to a node label by identifying similar relationships that have the same relationship type.
• Properties can be specified via a map pattern.
1. Creation of nodes in Neo4j, for a COMPANY database:
• CREATE (e1:EMPLOYEE {Empid: '1', Lname: 'Sharma', Fname: 'Nitin', Minit: 'B'})
• CREATE (d1:DEPARTMENT {Dno: '1', Dname: 'Business'})
• CREATE (p1:PROJECT {Pno: '8', Pname: 'WebDev'})
• CREATE (loc1:LOCATION {Name: 'Noida'})
2. Creation of relationships in Neo4j, for a COMPANY database:
• CREATE (e1)-[:WorksFor]->(d1)
• CREATE (d1)-[:LocatedIn]->(loc1)
3. Features of the Neo4j Data Model:
1. Labels and properties:
2. Relationships and relationship types: “->”
3. Paths:
4. Optional Schema:
5. Indexing and node identifiers:
The Cypher Query Language of Neo4j
Neo4j has a high-level query language, Cypher. There are declarative commands for creating nodes and relationships, as well as for finding nodes and relationships based on specified patterns. Deletion and modification of data are also possible in Cypher. A Cypher query is made up of clauses; when a query has several clauses, the result of one clause can be the input to the next clause in the query (e.g., MATCH and RETURN).
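A sketch of a multi-clause Cypher query (MATCH feeding RETURN) issued through the official Neo4j Python driver; the connection URI, credentials, and data follow the illustrative COMPANY examples above:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # assumed credentials

# MATCH finds a graph pattern; RETURN passes the matched bindings out of the query
query = """
MATCH (e:EMPLOYEE)-[:WorksFor]->(d:DEPARTMENT {Dname: 'Business'})
RETURN e.Fname AS fname, e.Lname AS lname
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["fname"], record["lname"])

driver.close()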
