Database Management Systems (BCS403)
Module – 5 Transaction Processing
Contents:
Concurrency Control in Databases
▪ Two-phase locking techniques for Concurrency control
▪ Concurrency control based on Timestamp ordering,
▪ Multiversion Concurrency control techniques
▪ Validation Concurrency control techniques
▪ Granularity of Data items and Multiple Granularity Locking.
NOSQL Databases and Big Data Storage Systems:
▪ Introduction to NOSQL Systems,
▪ The CAP Theorem
▪ Document-Based NOSQL Systems and MongoDB
▪ NOSQL Key-Value Stores
▪ Column-Based or Wide Column NOSQL Systems
▪ NOSQL Graph Databases and Neo4j
Module – 5 Introduction to NOSQL Systems
Emergence of NOSQL Systems
Many companies and organizations are faced with applications that
store vast amounts of data.
Example-1:
Consider a free e-mail application, such as Google Mail or Yahoo Mail
or other similar service—this application can have millions of users,
and each user can have thousands of e-mail messages.
Example-2:
Consider an application such as Facebook, with millions of users who
submit posts, many with images and videos; then these posts must be
displayed on pages of other users using the social media relationships
among the users.
User profiles, user relationships, and posts must all be stored in a huge
collection of data stores, and the appropriate posts must be made
available to the sets of users that have signed up to see these posts.
Some of the organizations that were faced with these data management
and storage applications decided to develop their own systems:
Google developed a proprietary NOSQL system known as BigTable,
which is used in many of Google’s applications that require vast
amounts of data storage, such as Gmail, Google Maps, and Web site
indexing.
Amazon developed a NOSQL system called DynamoDB that is
available through Amazon’s cloud services. This innovation led to the
category known as key-value data stores or sometimes key-tuple or
key-object data stores.
Facebook developed a NOSQL system called Cassandra, which is
now open source and known as Apache Cassandra. This NOSQL
system uses concepts from both key-value stores and column-based
systems.
Characteristics of NOSQL Systems
NOSQL characteristics related to distributed databases and
distributed systems.
Scalability
▪ In NOSQL systems, horizontal scalability is employed while the
system is operational, so techniques for distributing the existing data
among new nodes without interrupting system operation are
necessary.
▪ NoSQL data stores are designed to expand horizontally.
▪ Horizontal scaling means scaling out the data store by adding
more machines as data nodes (servers) to the pool of resources
(processing, memory, network connections).
▪ The design scales out using multi-utility cloud services.
Availability, Replication and Eventual Consistency
▪ In NoSQL systems, data is replicated over two or more nodes in a
transparent manner, so that if one node fails, the data is still available
on other nodes.
▪ Replication improves data availability and can also improve read
performance, because read requests can often be serviced from any
of the replicated data nodes.
▪ This ensures high availability, partition tolerance, reliability and
fault tolerance.
Replication Models
▪ Two major replication models are used in NOSQL systems:
1. Master-slave replication.
2. Master-master replication.
▪ Master-slave replication requires one copy to be the master copy;
all write operations must be applied to the master copy and then
propagated to the slave copies using eventual consistency.
▪ The master-master replication allows reads and writes at any of
the replicas but may not guarantee that reads at nodes that store
different copies see the same values. Different users may write the
same data item concurrently at different nodes of the system, so the
values of the item will be temporarily inconsistent.
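The master-slave model with eventual consistency can be sketched as follows. This is a minimal in-memory illustration, not the mechanism of any particular NOSQL product; the class name MasterSlaveStore and its methods are hypothetical.

```python
class MasterSlaveStore:
    """Toy master-slave replication: writes go to the master, slaves catch up later."""

    def __init__(self, slave_count):
        self.master = {}
        self.slaves = [{} for _ in range(slave_count)]
        self.pending = []  # writes applied to the master but not yet propagated

    def write(self, key, value):
        # All writes are applied to the master copy first.
        self.master[key] = value
        self.pending.append((key, value))

    def read_from_slave(self, index, key):
        # A slave may return a stale value (or None) until propagation happens.
        return self.slaves[index].get(key)

    def propagate(self):
        # Apply the queued writes, in order, to every slave copy.
        for key, value in self.pending:
            for slave in self.slaves:
                slave[key] = value
        self.pending.clear()


store = MasterSlaveStore(slave_count=2)
store.write("seat_12A", "booked")
stale = store.read_from_slave(0, "seat_12A")   # propagation has not run yet
store.propagate()
fresh = store.read_from_slave(0, "seat_12A")   # the slave has caught up
```

Between write and propagate, a read at a slave can miss the new value; after propagate, all copies agree. That window is what "eventual" refers to.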
Sharding of Files:
▪ Sharding (also known as horizontal partitioning) of the file records is
often employed in NOSQL systems.
▪ This serves to distribute the load of accessing the file records to
multiple nodes.
▪ The combination of sharding the file records and replicating the
shards improves the load balancing as well as data availability.
High-Performance Data Access
▪ In many NOSQL applications, it is necessary to find individual
records or objects (data items) from among the millions of data
records or objects in a file.
▪ To achieve this, most NoSQL systems use one of two techniques:
hashing or range partitioning on object keys.
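The two techniques can be sketched as follows. The node names and the split points "h" and "p" are assumptions made for illustration; real systems choose and rebalance these themselves.

```python
import bisect
import hashlib

NODES = ["node0", "node1", "node2"]  # hypothetical node names


def hash_partition(key):
    # A stable hash (not Python's per-process hash()) picks the node for a key.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]


# Range partitioning: keys below "h" go to node0, keys from "h" up to
# (but not including) "p" go to node1, and the remaining keys go to node2.
RANGE_BOUNDS = ["h", "p"]  # hypothetical split points


def range_partition(key):
    return NODES[bisect.bisect_right(RANGE_BOUNDS, key)]


owner_by_hash = hash_partition("alice")
owner_by_range = range_partition("alice")
```

Hashing spreads keys evenly but scatters neighbouring keys across nodes. Range partitioning keeps neighbouring keys together, which helps queries over a range of key values.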
NOSQL characteristics related to data models and query
languages.
Not Requiring a Schema
▪ The users can specify a partial schema in some systems to improve
storage efficiency, but it is not required to have a schema in most of
the NOSQL systems.
▪ There are various languages for describing semi-structured data,
such as JSON (JavaScript Object Notation) and XML (Extensible
Markup Language).
Less Powerful Query Languages
▪ Many applications that use NOSQL systems may not require a
powerful query language such as SQL, because search (read) queries
in these systems often locate single objects in a single file based on
their object keys.
▪ NOSQL systems typically provide a set of functions and operations
as a programming API (application programming interface), so
the programmer reads and writes data objects by calling the
appropriate operations.
Versioning
▪ Some NOSQL systems provide storage of multiple versions of the
data items, with the timestamps of when the data version was
created.
Categories of NOSQL Systems
Document-based NOSQL systems:
▪ These systems store data in the form of documents using well-known
formats, such as JSON (JavaScript Object Notation).
▪ Documents are accessible via their document id, but can also be
accessed rapidly using other indexes.
NOSQL key-value stores:
▪ These systems have a simple data model based on fast access by the
key to the value associated with the key; the value can be a record or
an object or a document or even have a more complex data structure.
Column-based or wide column NOSQL systems:
▪ These systems partition a table by column into column families,
where each column family is stored in its own files.
▪ They also allow versioning of data values.
Graph-based NOSQL systems:
▪ Data is represented as graphs, and related nodes can be found by
traversing the edges using path expressions.
Additional categories
Hybrid NOSQL systems: These systems have characteristics from two
or more of the above four categories.
Object databases
XML databases
Brewer’s CAP Theorem
The CAP theorem states that a distributed system can guarantee at most
two of three properties: consistency (C), availability (A), and partition
tolerance (P). No distributed system can guarantee all three together,
so a system is classified as CA, AP, or CP.
Consistency: All nodes observe the same data at the same time.
Example: When thousands of customers are looking to book a flight,
all updates from any client (e.g., book a flight) should be accessible by
other clients.
Availability: Every request receives a response indicating success or failure.
Partition Tolerance: The system continues to operate as a whole even
in case of message loss, node failure, or an unreachable node.
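The trade-off during a network partition can be sketched with two replicas. This is a deliberately simplified illustration; the Replica class, the write function, and the "CP"/"AP" mode flag are hypothetical and do not model any real system.

```python
class Replica:
    def __init__(self):
        self.value = None


def write(replicas, value, partitioned, mode):
    """Write `value` through replicas[0]; return True if the request is accepted.

    mode "CP": refuse the write when the other replica is unreachable.
    mode "AP": accept the write locally and let the replicas diverge.
    """
    if not partitioned:
        for replica in replicas:
            replica.value = value
        return True
    if mode == "CP":
        return False                 # stays consistent, but is not available
    replicas[0].value = value        # stays available, but replicas now disagree
    return True


cp = [Replica(), Replica()]
write(cp, "v1", partitioned=False, mode="CP")
accepted_cp = write(cp, "v2", partitioned=True, mode="CP")

ap = [Replica(), Replica()]
write(ap, "v1", partitioned=False, mode="AP")
accepted_ap = write(ap, "v2", partitioned=True, mode="AP")
```

In the CP run the second write is rejected and both replicas still hold "v1". In the AP run the write is accepted, so the two replicas hold different values until the partition heals.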
Fig: CAP Theorem in Big data Solutions.
NoSQL Document store:
➢ Document stores are high performance and flexible data stores.
➢ Scalability varies depending on the stored contents.
➢ Complexity is low compared to tabular, object and graph data stores.
Other features are
Document store supports different data formats
CSV Format
➢ CSV data store is a format for records
➢ CSV does not represent object-oriented databases or hierarchical data
records.
JSON Format
➢ JSON and XML formats represent semi-structured data, object-oriented
data, and hierarchical data records.
➢ JSON refers to a language format for semi-structured data.
XML Format
➢ XML (eXtensible Markup Language) is an extensible, simple, and scalable
language.
➢ Its self-describing format describes structure and contents in an easy to
understand format.
➢ XML is widely used. The document model consists of a root element and
its sub-elements.
➢ XML document model has a hierarchical structure, and has features of
object-oriented records.
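The difference between the three formats can be seen by writing one record in each. The sketch below uses only the Python standard library; the student record and the ";" separator used in the CSV cell are assumptions made for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical record with a nested (hierarchical) field.
student = {"id": "S1", "name": "Asha", "courses": ["DBMS", "OS"]}

# JSON keeps the nested list as it is.
as_json = json.dumps(student)

# XML expresses the hierarchy as a root element with sub-elements.
root = ET.Element("student", id=student["id"])
ET.SubElement(root, "name").text = student["name"]
courses = ET.SubElement(root, "courses")
for title in student["courses"]:
    ET.SubElement(courses, "course").text = title
as_xml = ET.tostring(root, encoding="unicode")

# CSV is flat: the list has to be squeezed into a single text cell.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "name", "courses"])
writer.writerow([student["id"], student["name"], ";".join(student["courses"])])
as_csv = buffer.getvalue()
```

JSON and XML carry the list of courses as structure, while CSV can hold it only as text that the reader must know how to split.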
MongoDB Database
• MongoDB is an open source DBMS used to create and manage databases.
• MongoDB stores data as collections of documents and provides functions
for querying and accessing the required information.
MongoDB is
(i) Non-relational and NoSQL.
(ii) Distributed and document based.
(iii) Open source and cross-platform.
(iv) Scalable, with a flexible data model.
(v) Indexed.
(vi) Replicated, with a single primary per replica set receiving all writes.
(vii) Fault tolerant.
(viii) A document data store that keeps JSON-like documents and uses
dynamic schemas.
▪ Dynamic Schema: Dynamic schema implies that documents in the
same collection do not need to have the same set of fields or
structure. Also, the same field may contain different types of data
in different documents.
▪ Replication: Replication ensures high availability in Big Data.
Multiple copies of the data are kept on different database servers,
which makes the database fault-tolerant against the failure of any
one database server.
MongoDB Replication
➢ MongoDB replicates with the help of a replica set.
➢ A replica set in MongoDB is a group of mongod (MongoDB server)
processes that store the same dataset.
➢ Replica sets provide redundancy and high availability.
➢ A replica set usually has at least three nodes.
➢ One of them is called the primary.
➢ The primary node receives all the write operations.
➢ All the other nodes are termed as secondary.
➢ The data replicates from primary to secondary nodes.
➢ A new primary node can be chosen among the secondary nodes at the time
of automatic failover or maintenance.
➢ The failed node when recovered can join the replica set as secondary node
again.
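The behaviour listed above can be sketched with a toy model. This is not MongoDB code: the ReplicaSet class is hypothetical, replication is shown as an immediate copy, and the election rule (pick the alphabetically first live node) is an assumption standing in for MongoDB's real election protocol.

```python
class ReplicaSet:
    """Toy model of a replica set: one primary, the rest secondaries."""

    def __init__(self, names):
        self.data = {name: [] for name in names}   # each node's copy of the dataset
        self.alive = set(names)
        self.primary = names[0]

    def write(self, doc):
        # Only the primary accepts writes; data then replicates to live secondaries.
        self.data[self.primary].append(doc)
        for name in self.alive - {self.primary}:
            self.data[name].append(doc)

    def fail(self, name):
        self.alive.discard(name)
        if name == self.primary:
            # Hypothetical election rule: the alphabetically first live node wins.
            self.primary = min(self.alive)

    def recover(self, name):
        # A recovered node copies the current data and rejoins as a secondary.
        self.data[name] = list(self.data[self.primary])
        self.alive.add(name)


rs = ReplicaSet(["n1", "n2", "n3"])
rs.write({"_id": 1})
rs.fail("n1")            # the primary fails, so a secondary takes over
rs.write({"_id": 2})     # writes continue on the new primary
rs.recover("n1")         # the old primary comes back as a secondary
```

After this sequence n2 is the primary, and n1 holds both documents even though it was down when the second one was written.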
Sharding in MongoDB
▪ Sharding of the documents in the collection, also known as
horizontal partitioning, divides the documents into disjoint
partitions known as shards.
▪ This allows the system to add more nodes as needed by a process
known as horizontal scaling of the distributed system and to store the
shards of the collection on different nodes to achieve load balancing.
▪ There are two ways to partition a collection into shards in
MongoDB
1. Range partitioning and
2. Hash partitioning.
▪ Both methods require that the user specify a particular document
field to be used as the basis for partitioning the documents into
shards.
▪ This partitioning field is called the shard key.
▪ Shard key in MongoDB must have two characteristics:
1. It must exist in every document in the collection,
2. It must have an index.
▪ The values of the shard key are divided into chunks either through
range partitioning or hash partitioning, and the documents are
partitioned based on the chunks of shard key values.
MongoDB CRUD Operations
CRUD stands for create, read, update, and delete.
▪ Documents can be created and inserted into their collections using
the insert operation.
General format of creating and inserting documents is:
db.<collection name>.insert(<documents>)
▪ The parameters of the insert operation can include either a single
document or an array of documents.
▪ The delete operation is called remove, and the format is:
db.<collection name>.remove(<condition>)
▪ The documents to be removed from the collection are specified by a
Boolean condition on some of the fields in the collection documents.
▪ For read queries, the main command is called find, and the format
is:
db.<collection name>.find(<condition>)
▪ There is also an update operation, which has a condition to select
certain documents, and a $set clause to specify the update.
▪ The format for updating the field values of documents is:
db.<collection name>.update(<condition>, { $set: { field1: value1,
field2: value2, ..., fieldn: valuen } })
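The four operations above can be sketched with a small in-memory stand-in. ToyCollection is hypothetical and is not the MongoDB driver or shell; it supports only equality conditions and a $set clause, which is enough to show what each operation does. The project documents are made-up sample data.

```python
class ToyCollection:
    """In-memory stand-in for a document collection (not the MongoDB driver)."""

    def __init__(self):
        self.docs = []

    def insert(self, documents):
        # Accept a single document or a list of documents.
        if isinstance(documents, dict):
            documents = [documents]
        self.docs.extend(dict(d) for d in documents)

    @staticmethod
    def _matches(doc, condition):
        # Equality-only conditions: every listed field must have the given value.
        return all(doc.get(field) == value for field, value in condition.items())

    def find(self, condition):
        return [d for d in self.docs if self._matches(d, condition)]

    def update(self, condition, change):
        # Apply the "$set" clause to every matching document.
        for d in self.find(condition):
            d.update(change["$set"])

    def remove(self, condition):
        self.docs = [d for d in self.docs if not self._matches(d, condition)]


projects = ToyCollection()
# The two documents have different fields (dynamic schema).
projects.insert([
    {"_id": "P1", "Pname": "ProductX", "Plocation": "Bellaire"},
    {"_id": "P2", "Pname": "ProductY"},
])
projects.update({"_id": "P2"}, {"$set": {"Plocation": "Houston"}})
projects.remove({"_id": "P1"})
in_houston = projects.find({"Plocation": "Houston"})
```

The update adds a Plocation field to P2 that the document did not have when it was inserted, and the remove deletes P1 by matching on its _id.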
NoSQL Key-Value Store
➢ The key is a unique identifier associated with a data item and is used
to locate this data item rapidly.
➢ The value is the data item itself, and it can have very different
formats for different key-value storage systems.
➢ A key-value store provides high performance, scalability, and flexibility.
➢ Data retrieval is fast in a key-value pair data store.
➢ A simple string called the key maps to a large data string or
BLOB (Binary Large Object).
➢ Key-value store accesses use a primary key for accessing values.
➢ Key-Value store can be easily scaled up for very large data.
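The data model described above can be sketched as a thin wrapper around a dictionary. The class KeyValueStore and its put, get, and delete methods are hypothetical names chosen for illustration; actual key-value systems each define their own API.

```python
class KeyValueStore:
    """Minimal in-memory key-value store; values are opaque to the store."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        # Insert a new pair or overwrite the value stored under an existing key.
        self._items[key] = value

    def get(self, key, default=None):
        # Look up a value by its key only; the store never inspects the value.
        return self._items.get(key, default)

    def delete(self, key):
        # Return True if the key existed and was removed.
        if key in self._items:
            del self._items[key]
            return True
        return False


kv = KeyValueStore()
kv.put("user:42:profile", '{"name": "Asha"}')    # the value is a JSON string
kv.put("user:42:avatar", b"\x89PNG")             # the value is raw bytes
profile = kv.get("user:42:profile")
removed = kv.delete("user:42:avatar")
```

The store treats both values the same way. Finding all users named Asha would require reading every value, because access is by key only.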
Fig: Example of key-value pairs in the data architectural pattern.
Key-value data stores allow a client to read and write values
as follows:
Limitations of Key-Value store
Examples of Other Key-Value Stores
Oracle key-value store
▪ Oracle has one of the well-known SQL relational database systems,
and Oracle also offers a system based on the key-value store
concept; this system is called the Oracle NoSQL Database.
Redis key-value cache and store
▪ Redis differs from the other NoSQL systems, because it caches its
data in main memory to further improve performance.
▪ It offers master-slave replication and high availability, and it also
offers persistence by backing up the cache to disk.
Apache Cassandra
▪ Cassandra is a NOSQL system that is not easily categorized into one
category; it is sometimes listed in the column-based NOSQL
category or in the key-value category.
▪ It offers features from several NOSQL categories and is used by
Facebook as well as many other customers.
Column-Based or Wide Column NOSQL Systems
▪ The Google distributed storage system for big data, known as
BigTable, is a well-known example of this class of NOSQL
systems, and it is used in many Google applications that require
large amounts of data storage, such as Gmail.
▪ BigTable uses the Google File System (GFS) for data storage and
distribution.
▪ An open source system known as Apache Hbase is similar to
Google BigTable, but it typically uses HDFS (Hadoop Distributed
File System) for data storage.
Hbase Data Model and Versioning
Hbase data model
▪ The data model in Hbase organizes data using the concepts of
namespaces, tables, column families, column qualifiers, columns,
rows, and data cells.
▪ A column is identified by a combination of (column family:column
qualifier).
▪ Data is stored in a self-describing form by associating columns with
data values, where data values are strings.
Tables and Rows.
▪ Data in Hbase is stored in tables, and each table has a table name.
▪ Data in a table is stored as self-describing rows.
▪ Each row has a unique row key
Column Families, Column Qualifiers, and Columns.
▪ A table is associated with one or more column families.
▪ Each column family will have a name, and the column families
associated with a table must be specified when the table is created
and cannot be changed later.
Versions and Timestamps
▪ Hbase can keep several versions of a data item, along with the
timestamp associated with each version.
▪ The timestamp is a long integer number that represents the system
time when the version was created, so newer versions have larger
timestamp values.
Data Cells
▪ A cell holds a basic data item in Hbase.
▪ The key (address) of a cell is specified by a combination of (table,
rowid, columnfamily, columnqualifier, timestamp).
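Cell addressing and versioning can be sketched with a dictionary keyed by the five-part cell key. This is an illustration of the idea, not the Hbase API: the put and get functions are hypothetical, the EMPLOYEE sample data is made up, and a simple counter stands in for the system time.

```python
import itertools

_clock = itertools.count(1)   # stand-in for system time: each version gets a larger number
cells = {}                    # (table, rowid, columnfamily, columnqualifier, timestamp) -> value


def put(table, row, family, qualifier, value):
    # Every put creates a new version of the cell under a fresh timestamp.
    timestamp = next(_clock)
    cells[(table, row, family, qualifier, timestamp)] = value
    return timestamp


def get(table, row, family, qualifier, timestamp=None):
    # Collect all versions of the cell, then return the requested or the newest one.
    versions = {
        key[4]: value
        for key, value in cells.items()
        if key[:4] == (table, row, family, qualifier)
    }
    if not versions:
        return None
    if timestamp is None:
        timestamp = max(versions)
    return versions.get(timestamp)


t1 = put("EMPLOYEE", "row1", "Name", "Fname", "John")
t2 = put("EMPLOYEE", "row1", "Name", "Fname", "Jon")
latest = get("EMPLOYEE", "row1", "Name", "Fname")
original = get("EMPLOYEE", "row1", "Name", "Fname", timestamp=t1)
```

Both versions stay in the store. Omitting the timestamp returns the version with the largest timestamp, which is the newest one.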
Namespaces
▪ A namespace is a collection of tables.
▪ A namespace basically specifies a collection of one or more tables
that are typically used together by user applications, and it
corresponds to a database that contains a collection of tables in
relational terminology.
Note:
▪ For the DynamoDB data model, the Voldemort key-value distributed data
store, and the Neo4j data model, refer to the Navathe textbook (Chapter 24).
Graph Databases
➢ A graph database represents data as entities, or nodes.
➢ Nodes can have properties that have further information.
➢ Nodes are connected to other nodes with edges.
➢ Edges encode the relationships between nodes.
➢ Each connection between two nodes can be labeled with properties.
➢ An example use of the graph model is a social network of connected
people.
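The social network example can be sketched with plain dictionaries. This is not Neo4j or its query language: the helper functions, the FRIEND relationship name, and the sample people are assumptions made for illustration, and edges are treated as directed.

```python
from collections import defaultdict

nodes = {}                   # node id -> properties
edges = defaultdict(list)    # node id -> list of (relationship, target id, properties)


def add_node(node_id, **properties):
    nodes[node_id] = properties


def add_edge(source, relationship, target, **properties):
    # An edge encodes a relationship and can carry its own properties.
    edges[source].append((relationship, target, properties))


def neighbours(node_id, relationship):
    return [target for rel, target, _ in edges[node_id] if rel == relationship]


def friends_of_friends(node_id):
    # Follow two FRIEND edges, excluding the start node and direct friends.
    direct = set(neighbours(node_id, "FRIEND"))
    reached = {fof for friend in direct for fof in neighbours(friend, "FRIEND")}
    return reached - direct - {node_id}


add_node("asha", name="Asha", city="Bengaluru")
add_node("ravi", name="Ravi")
add_node("meena", name="Meena")
add_edge("asha", "FRIEND", "ravi", since=2020)
add_edge("ravi", "FRIEND", "meena", since=2022)
suggestions = friends_of_friends("asha")
```

Finding friends of friends is a two-step traversal along edges. In a relational design the same question would need a self-join on a friendship table.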
Source
Text books
1. Fundamentals of Database Systems, Ramez Elmasri and Shamkant
B. Navathe, 7th Edition, 2017, Pearson.
2. Database Management Systems, Raghu Ramakrishnan and Johannes
Gehrke, 3rd Edition, 2014, McGraw Hill.