NoSQL Data Management
MODULE 5
Syllabus
➢Introduction to NoSQL
➢Four types of NoSQL Databases
➢Aggregate data models: Aggregates, Key-Value and Document Data Models
➢Relationships
➢Graph Databases
➢Schemaless Databases
➢Materialized views
➢Distribution Models: Sharding
➢Master-Slave Replication
➢Peer-to-Peer Replication
Introduction
➢NoSQL (non-SQL or non-relational) is an approach to database design that focuses on
providing a mechanism for storing and retrieving data modeled by means other than the
tabular relations used in relational databases.
➢Instead of the typical tabular structure of a relational database, NoSQL databases house
data within a single data structure.
➢Since this non-relational database design does not require a schema, it offers
rapid scalability to manage large, typically unstructured data sets.
➢NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may
support SQL-like query languages.
Features of NoSQL Databases:
➢Schema Agnostic: Unlike a traditional RDBMS, NoSQL databases do not require
any specific schema or storage structure.
➢Scalability: NoSQL databases scale horizontally. As the data outgrows the
existing hardware, commodity machines can be added while the scalability
features of the database are preserved.
➢High Availability: A traditional RDBMS relies on primary and secondary nodes
for fetching data, whereas some NoSQL databases use a masterless architecture.
➢Performance: To increase the performance of a NoSQL system, one can add
more commodity servers, which provide reliable and fast access to the database
with minimum overhead.
➢Global Availability: Data is replicated among multiple servers and clouds, so it
is accessible from anywhere, minimizing latency.
Advantages of NoSQL
➢High scalability: NoSQL databases use sharding for horizontal scaling. This
lets NoSQL handle huge amounts of data: as the data grows, the database
scales itself automatically to handle that data efficiently.
➢Flexibility: NoSQL databases are designed to handle unstructured or semi-
structured data, which means that they can accommodate dynamic changes to
the data model. This makes NoSQL databases a good fit for applications that
need to handle changing data requirements.
➢High availability: The auto-replication feature in NoSQL databases makes them
highly available because, in case of any failure, data can be restored from a
replica to the previous consistent state.
➢Scalability: NoSQL databases are highly scalable, which means that they can
handle large amounts of data and traffic with ease. This makes them a good fit
for applications that need to handle large amounts of data or traffic.
➢Performance: NoSQL databases are designed to handle large amounts of data
and traffic, which means that they can offer improved performance compared
to traditional relational databases.
➢Cost-effectiveness: NoSQL databases are often more cost-effective than
traditional relational databases, as they are typically less complex and do not
require expensive hardware or software.
Disadvantages
➢Lack of standardization: There are many different types of NoSQL databases, each
with its unique strengths and weaknesses. This lack of standardization can make it
difficult to choose the right database for a specific application.
➢Lack of ACID compliance: Many NoSQL databases are not fully ACID-compliant, which
means that they may not guarantee the consistency, integrity, and durability of data.
This can be a drawback for applications that require strong data consistency
guarantees.
➢Narrow focus: NoSQL databases have a narrow focus: they are mainly
designed for storage and provide very little other functionality. Relational databases
are a better choice than NoSQL for transaction management.
➢Open-source: Many NoSQL databases are open source, and there is no reliable
standard for NoSQL yet. In other words, two database systems are unlikely to be
equivalent.
➢Lack of support for complex queries: Many NoSQL databases are not designed to
handle complex queries, which makes them a poor fit for
applications that require complex data analysis.
➢Lack of maturity: NoSQL databases are relatively new and lack the maturity of
traditional relational databases. This can make them less reliable and less secure than
traditional databases.
➢Unavailability of GUI: GUI tools for accessing the database are not widely
available in the market.
➢Backup: Backup is a weak point for some NoSQL databases like MongoDB.
➢Large document size: Some database systems like MongoDB and CouchDB store data
in JSON format. This means that documents can be quite large, which matters for
big data volumes, network bandwidth, and speed.
SQL vs. NoSQL
SQL | NoSQL
Relational database management system (RDBMS) | Non-relational or distributed database system
Fixed, static, or predefined schema | Dynamic schema
Not suited for hierarchical data storage | Best suited for hierarchical data storage
Best suited for complex queries | Not so good for complex queries
Vertically scalable | Horizontally scalable
Follows ACID properties | Follows CAP (consistency, availability, partition tolerance)
Aggregate Data Models
➢A data model is the model through which we perceive and manipulate our data.
➢In a database, the data model describes how we interact with the data in the database.
➢This is distinct from a storage model, which describes how the database stores and
manipulates the data internally.
➢The term “data model” often means the model of the specific data in an application.
➢The dominant data model is the relational data model, which is best
visualized as a set of tables.
➢Each table has rows, with each row representing some entity of interest.
➢The entity is described through columns, each having a single value.
➢A column may refer to another row in the same or a different table, which
constitutes a relationship between those entities.
Aggregates
➢The relational model divides the information it stores into tuples (rows).
➢A tuple is a limited data structure: it captures a set of values, so you cannot nest
one tuple within another to get nested records, nor put a list of values or tuples
within another.
➢An aggregate is a collection of related data that we interact with and treat as a unit.
➢These units of data, or aggregates, form the boundaries for ACID operations.
➢Aggregate is a term that comes from Domain-Driven Design.
➢In Domain-Driven Design, an aggregate is a collection of related objects that
we wish to treat as a unit.
➢In particular, it is a unit for data manipulation and management of
consistency.
// in customers
{
  "id": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}]
}
// in orders
{
  "id": 99,
  "customerId": 1,
  "orderItems": [
    {
      "productId": 27,
      "price": 32.45,
      "productName": "NoSQL Distilled"
    }
  ],
  "shippingAddress": [{"city": "Chicago"}],
  "orderPayment": [
    {
      "ccinfo": "1000-1000-1000-1000",
      "txnId": "abelif879rft",
      "billingAddress": {"city": "Chicago"}
    }
  ]
}
➢In this model, there are two main aggregates: customer and order.
➢In UML, a black-diamond composition marker shows how data fits into the
aggregation structure.
➢The customer contains a list of billing addresses; the order contains a list of order
items, a shipping address, and payments.
➢The link between the customer and the order isn't within either aggregate; it's a
relationship between aggregates.
➢Similarly, the link from an order item would cross into a separate aggregate structure
for products.
// in customers
{
  "customer": {
    "id": 1,
    "name": "Martin",
    "billingAddress": [{"city": "Chicago"}],
    "orders": [
      {
        "id": 99,
        "customerId": 1,
        "orderItems": [
          {
            "productId": 27,
            "price": 32.45,
            "productName": "NoSQL Distilled"
          }
        ],
        "shippingAddress": [{"city": "Chicago"}],
        "orderPayment": [
          {
            "ccinfo": "1000-1000-1000-1000",
            "txnId": "abelif879rft",
            "billingAddress": {"city": "Chicago"}
          }
        ]
      }
    ]
  }
}
Consequences of Aggregate Orientation
➢Various data modeling techniques have provided ways of marking aggregate or
composite structures.
➢The problem is that modelers rarely provide any semantics for what makes an
aggregate relationship different from any other; where there are semantics, they vary.
➢When working with aggregate-oriented databases, we get clearer semantics
by focusing on the unit of interaction with the data storage.
➢Aggregation is, however, not a logical data property: it is all about how the data is
being used by applications.
➢An aggregate structure may help with some data interactions but be an
obstacle for others.
➢Aggregate orientation has an important consequence for transactions.
➢It is often said that NoSQL databases don't support ACID transactions and thus
sacrifice consistency.
➢In practice, aggregate-oriented databases support the atomic manipulation of a
single aggregate at a time, but generally not ACID transactions that span
multiple aggregates.
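The two ways of drawing aggregate boundaries shown earlier (separate customer and order aggregates versus orders nested inside the customer) can be sketched in Python. This is only an illustration using plain dictionaries; the helper names `orders_for_customer_separate` and `orders_for_customer_embedded` are hypothetical and not part of any database API.

```python
# Design 1: customer and order are separate aggregates linked by customerId.
customers = {
    1: {"id": 1, "name": "Martin", "billingAddress": [{"city": "Chicago"}]},
}
orders = {
    99: {
        "id": 99,
        "customerId": 1,
        "orderItems": [
            {"productId": 27, "price": 32.45, "productName": "NoSQL Distilled"}
        ],
        "shippingAddress": [{"city": "Chicago"}],
    },
}


def orders_for_customer_separate(customer_id):
    # The link crosses aggregate boundaries, so the orders must be scanned
    # (or indexed) to find the ones that belong to this customer.
    return [o for o in orders.values() if o["customerId"] == customer_id]


# Design 2: the orders are nested inside the customer aggregate.
customer_aggregate = {
    "customer": {
        "id": 1,
        "name": "Martin",
        "billingAddress": [{"city": "Chicago"}],
        "orders": [orders[99]],
    }
}


def orders_for_customer_embedded(aggregate):
    # One read of the aggregate returns the customer together with its orders.
    return aggregate["customer"]["orders"]
```

Which design is better depends on how the application uses the data: the second suits "show a customer with all orders", while the first suits working with one order at a time.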
Advantages
➢It can be used as a primary data source for online applications.
➢Easy Replication.
➢No single point of failure.
➢It provides fast performance and horizontal scalability.
➢It can handle structured, semi-structured, and unstructured data with equal
effort.
Disadvantages
➢No standard rules.
➢Limited query capabilities.
➢Doesn’t work well with relational data.
➢Not so popular in the enterprise.
➢When the volume of data increases, it is difficult to maintain unique values.
Types of NoSQL Databases
➢Document-based Databases
➢Key-value Stores
➢Column Family
➢Graph-based Databases
KEY-VALUE DATA MODEL
➢A key-value data model or database is also referred to as a key-value store.
➢It is a non-relational type of database.
➢Its basic structure is an associative array, in which each individual key is
linked with exactly one value in a collection.
➢Keys are unique identifiers for the values.
➢The value can be any kind of entity.
➢A collection of key-value pairs stored as separate records is called a key-value
database; it has no predefined structure.
➢How do key-value databases work?
➢ A key-value database associates a key with a value, which may be a simple
string or a complicated entity; the key is used to keep track of
the entity.
➢A key-value database resembles a map object, array, or dictionary, and is
controlled by a DBMS.
➢The key-value store uses an efficient and compact index structure
so that it can rapidly and dependably find a value using its key.
➢Examples:
➢ Couchbase
➢ Amazon DynamoDB
➢ Riak
➢ Aerospike
➢ Berkeley DB
➢Features:
➢ It is one of the simplest kinds of NoSQL data models.
➢ Key-value databases use simple functions for storing, getting, and removing
data.
➢ Key-value databases have no query language.
➢ Built-in redundancy makes this database more reliable.
➢Advantages:
➢ It is very easy to use. Because of the database's simplicity, values can be of
any kind, or even of different kinds when required.
➢ Its response time is fast because of its simplicity, provided that the
surrounding environment is well built and optimized.
➢ Key-value store databases are scalable vertically as well as horizontally.
➢ Built-in redundancy makes this database more reliable.
➢Disadvantages:
➢ As key-value databases have no query language, queries cannot be ported
from one database to another.
➢ The key-value store database is not refined. You cannot query the database
without a key.
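The store, get, and remove operations described above can be sketched with a toy in-memory store. This is a minimal illustration, not the API of any product listed; the class name `KeyValueStore` and its methods are assumptions made for this example.

```python
class KeyValueStore:
    """Toy in-memory key-value store; the value is opaque to the store."""

    def __init__(self):
        self._data = {}  # the associative array: key -> value

    def put(self, key, value):
        # Storing under an existing key replaces the previous value.
        self._data[key] = value

    def get(self, key, default=None):
        # Lookup is possible only by key; the store cannot search inside values.
        return self._data.get(key, default)

    def delete(self, key):
        # Removing a missing key is a no-op in this sketch.
        self._data.pop(key, None)


store = KeyValueStore()
store.put("user:1", {"name": "Martin", "city": "Chicago"})  # a complex entity
store.put("session:abc", "logged-in")                        # a simple string
```

Note that there is no way to ask for "all users in Chicago" without reading every value, which is the limitation listed under Disadvantages.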
DOCUMENT DATA MODEL
➢The document data model differs from other data models because it
stores data in JSON, BSON, or XML documents.
➢In this data model, documents can be nested within other documents, and
particular elements can be indexed to run queries faster.
➢Documents are often stored and retrieved in a form close to the data objects
used in applications, which means very little translation is required to use
the data in an application.
➢JSON is a native format that is often used to store and query data.
➢This is a semi-structured data model: records and their associated data are
stored in a single document, so the data is not completely unstructured.
➢The main point is that data here is stored in documents.
➢Examples :
➢ Amazon DocumentDB
➢ MongoDB
➢ Cosmos DB
➢ ArangoDB
➢ Couchbase Server
➢ CouchDB
➢Features:
➢ Document Type Model: Data is stored in documents rather than tables or
graphs, so it maps easily onto objects in many programming languages.
➢ Flexible Schema: The schema is very flexible; not all documents in a
collection need to have the same fields.
➢ Distributed and Resilient: Document data models are distributed, which
enables horizontal scaling and distribution of data.
➢ Manageable Query Language: The query language allows developers to
perform CRUD (Create, Read, Update, Delete) operations on the data model.
➢Advantages:
➢ Schema-less: These are very good at retaining existing data at massive volumes
because there are no restrictions on the format and structure of data storage.
➢ Faster creation of document and maintenance: It is very simple to create a
document, and it requires almost no maintenance.
➢ Open formats: Documents are built with a simple process that uses open formats
such as XML and JSON.
➢ Built-in versioning: As documents grow in size, they can also grow in
complexity; built-in versioning decreases conflicts.
➢Disadvantages:
➢ Weak Atomicity: Support for multi-document ACID transactions is limited. A change
involving two collections requires two separate queries, one for each collection,
which breaks atomicity requirements.
➢ Consistency Check Limitations: Documents can exist that are not connected to the
collection they should reference (for example, to an author collection). Searching
for such documents is possible, but it can hurt database performance.
➢ Security: Many web applications lack security, which results in the
leakage of sensitive data. This is a point of concern, so one must pay attention to
web app vulnerabilities.
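The flexible schema and field indexing described in this section can be sketched as follows. This is a toy model under stated assumptions: the `Collection` class, its methods, and the sample names are invented for illustration and do not reflect the API of MongoDB or any other product listed above. Indexed values must be hashable in this sketch.

```python
from collections import defaultdict


class Collection:
    """Toy document collection with optional per-field indexes."""

    def __init__(self):
        self._docs = {}     # document id -> document (a dict)
        self._indexes = {}  # field name -> {field value -> set of ids}

    def create_index(self, field):
        index = defaultdict(set)
        for doc_id, doc in self._docs.items():
            if field in doc:
                index[doc[field]].add(doc_id)
        self._indexes[field] = index

    def insert(self, doc_id, doc):
        # No schema check: each document may carry a different set of fields.
        self._docs[doc_id] = doc
        for field, index in self._indexes.items():
            if field in doc:
                index[doc[field]].add(doc_id)

    def find(self, field, value):
        if field in self._indexes:
            # Indexed field: jump straight to the matching ids.
            ids = self._indexes[field].get(value, set())
            return [self._docs[i] for i in sorted(ids)]
        # Unindexed field: fall back to scanning every document.
        return [d for d in self._docs.values() if d.get(field) == value]


people = Collection()
people.create_index("city")
people.insert(1, {"name": "Martin", "city": "Chicago"})
people.insert(2, {"name": "Pramod", "city": "Chicago", "likes": ["NoSQL"]})
people.insert(3, {"name": "Ana"})  # no "city" field at all
```

Unlike a key-value store, the database can see inside the document, which is what makes the index on `city` possible.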
COLUMN-FAMILY MODEL
➢Most databases have a row as the unit of storage, which helps write performance.
➢There are many scenarios where writes are rare, but you often need to read a
few columns of many rows at once.
➢In this situation, it is better to store groups of columns for all rows as the basic
storage unit, which is why these databases are called column stores.
➢In the columnar data model, information is organized into columns instead of
rows.
➢Apart from this, they function much the same way that tables work in relational
databases.
➢The column-family model is best thought of as a two-level aggregate structure.
➢As with key-value stores, the first key is often described as a row identifier,
picking up the aggregate of interest.
➢The difference with column-family structures is that this row aggregate is
itself formed of a map of more detailed values.
➢These second-level values are referred to as columns.
➢Examples: Hypertable, HBase, and Cassandra.
➢Column-family databases organize their columns into column families.
➢Each column has to be part of a single column family, and the column acts as
a unit for access, with the assumption that data for a particular column family
will usually be accessed together.
➢Since the database knows about these common groupings of data, it can use
this information for its storage and access behavior.
ADVANTAGES
➢Well structured
➢Flexible
➢Aggregation queries are fast
➢Scalability
DISADVANTAGES
➢Designing indexing schema
➢Suboptimal data loading
➢Security vulnerability
➢Online Transaction Processing (OLTP) is not compatible
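The two-level aggregate structure of the column-family model (row key, then column family, then columns) can be sketched with nested dictionaries. The row keys, family names, and helper functions below are hypothetical and only illustrate the shape of the data, not how HBase or Cassandra store it internally.

```python
# row key -> column family -> column name -> value
rows = {
    "customer:1": {
        "profile": {"name": "Martin", "billingCity": "Chicago"},
        "orders": {"order:99": "NoSQL Distilled"},
    },
    "customer:2": {
        "profile": {"name": "Ana"},  # rows need not have the same columns
        "orders": {},
    },
}


def read_family(row_key, family):
    # First level: the row key picks the aggregate of interest.
    # Second level: a column family is read as one unit.
    return rows[row_key][family]


def read_column_for_all_rows(family, column):
    # Reading a few columns across many rows is the access pattern
    # that column stores are designed for.
    return {
        key: families[family][column]
        for key, families in rows.items()
        if column in families.get(family, {})
    }
```

For example, `read_column_for_all_rows("profile", "name")` touches only the `profile` family of each row and never reads the order data.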
RELATIONSHIPS
➢Many databases provide ways to make relationships between aggregates visible to
the database.
➢Document stores make the content of the aggregate available to the database to
form indexes and queries.
➢Riak, a key-value store, allows you to put link information in metadata, supporting
partial retrieval and link-walking capability.
➢Aggregate-oriented databases treat the aggregate as the unit of data retrieval.
➢Atomicity is only supported within the contents of a single aggregate.
➢It is not possible to update multiple aggregates atomically; the application must
handle a failure partway through.
➢This may imply that if the data is based on lots of relationships, a
relational database is a better choice than a NoSQL store.
➢While that's true for aggregate-oriented databases, it's worth remembering
that relational databases aren't all that stellar with complex relationships
either.
GRAPH BASED DATA MODEL
➢The graph-based data model focuses on building relationships between data
elements.
➢As the name suggests, each element is stored as a node, and the associations
between these elements are known as links (edges).
➢Associations are stored directly, as they are first-class elements of the data
model. These data models give us a conceptual view of the data.
➢These data models are based on a topological network
structure.
➢Nodes: These are the instances of data that represent the objects to be
tracked.
➢Edges: Edges represent relationships between nodes.
➢Properties: These represent information associated with nodes.
➢Working of Graph Data Model:
➢ In these data models, nodes that are connected are physically linked in storage,
and the connection itself is treated as a piece of data.
➢ Connecting data in this way makes relationships easy to query.
➢ This data model reads the relationship from storage directly instead of calculating
it at query time.
➢ Like many other NoSQL databases, these data models have no fixed schema,
which keeps the model flexible and easy to edit.
➢ Examples: JanusGraph, Neo4j, Dgraph
➢Advantages
➢ Structure: The structures are very agile and workable.
➢ Explicit Representation: The portrayal of relationships between entities is
explicit.
➢ Real-time O/P Results: Queries give real-time output results.
➢Disadvantages:
➢ No standard query language: The query language depends on the platform
that is used, so there is no single standard query language.
➢ Poor fit for transactions: Graph databases are not well suited to
transaction-based systems.
➢ Small User Base: The user base is small, which makes it difficult to get
support when running into a problem.
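The node, edge, and property concepts above, and the idea of following stored relationships directly, can be sketched as follows. The sample people, the relationship names, and the `reachable` helper are invented for this illustration; real graph databases expose their own query languages instead.

```python
from collections import deque

# Nodes carry properties; the sample data is made up for illustration.
nodes = {
    "alice": {"label": "Person", "age": 30},
    "bob": {"label": "Person"},
    "carol": {"label": "Person"},
    "acme": {"label": "Company"},
}

# Edges are first-class data: (source, relationship, target).
edges = [
    ("alice", "FRIEND", "bob"),
    ("bob", "FRIEND", "carol"),
    ("carol", "WORKS_AT", "acme"),
]

# Store each node's outgoing edges next to it, so that following a
# relationship is a direct lookup rather than a join computed per query.
adjacency = {}
for source, relationship, target in edges:
    adjacency.setdefault(source, []).append((relationship, target))


def reachable(start, relationship):
    """Breadth-first traversal that follows only edges of one type."""
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        current = queue.popleft()
        for rel, neighbor in adjacency.get(current, []):
            if rel == relationship and neighbor not in seen:
                seen.add(neighbor)
                found.append(neighbor)
                queue.append(neighbor)
    return found
```

Here `reachable("alice", "FRIEND")` walks friend-of-a-friend links in a single traversal, the kind of query that needs repeated joins in a relational schema.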
SCHEMA LESS DATABASES
➢A common theme across all the forms of NoSQL databases is that they are schemaless.
➢With NoSQL databases, storing data is much more casual.
➢A key-value store allows you to store any data under a key.
➢A document database effectively does the same thing, since it makes no restrictions on
the structure of the documents you store.
➢Column-family databases allow you to store any data under any column.
➢Graph databases allow you to freely add new edges and freely add properties to
nodes and edges.
➢With a schema, we have to figure out in advance what we need to store, but that
can be hard to do.
➢Without a schema binding us, we can easily store whatever data we need.
➢This allows us to easily change our data storage as we learn more about our project.
➢We can easily add new things as we discover them.
➢Furthermore, if we find we don't need some things anymore, we can just stop
storing them, without worrying about losing old data as we would if we deleted
columns in a relational schema.
➢As well as handling changes, a schemaless store also makes it easier to deal with
nonuniform data: data where each record has a different set of fields.
➢A schema puts all rows of a table into a straitjacket, which becomes awkward if
you have different kinds of data in different rows.
➢We either end up with lots of columns that are usually null, or end up with
meaningless columns.
➢Schemalessness avoids this, allowing each record to contain just what it needs:
no more, no less.
➢Challenges
➢ Implicit schema in application code, making it hard to understand data
without digging into the code.
➢ Database can't optimize storage or enforce data validation.
➢ Multiple applications accessing the same database can lead to
inconsistencies.
➢Solutions:
➢ Encapsulate database interactions within a single application.
➢ Use web services for integration.
➢ Clearly delineate different areas for access by different applications.
➢Comparison:
➢ Schemaless Databases:
➢ Ideal for nonuniform data and rapid development; flexible within an aggregate.
➢ Relational Databases:
➢ Better for uniform data and controlled schema changes; ensures data integrity
and optimization.
➢Both types of databases have their own advantages and challenges, and the
choice depends on the specific requirements of the application.
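The "implicit schema in application code" challenge can be made concrete with a small sketch. The records and the `describe` function are made up for illustration: the store accepts records with different fields, but the reading code still has to know which fields each kind of record carries.

```python
# Nonuniform data: each record has just the fields it needs.
records = [
    {"type": "book", "title": "Book A", "pages": 120},
    {"type": "video", "title": "Video B", "minutes": 45},
    {"type": "book", "title": "Book C"},  # page count not recorded
]


def describe(record):
    # The implicit schema lives here, in application code: this function
    # assumes books may have "pages" and videos always have "minutes".
    if record["type"] == "book":
        pages = record.get("pages", "unknown")
        return f'{record["title"]} ({pages} pages)'
    if record["type"] == "video":
        return f'{record["title"]} ({record["minutes"]} min)'
    raise ValueError(f'unknown record type: {record["type"]}')
```

If a second application wrote videos without a `minutes` field, `describe` would fail, which is why the slide suggests encapsulating database access in one application or behind a web service.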
Materialized Views
➢Aggregate-Oriented Data Models:
➢ Advantage: Useful for accessing all data for an order in a single unit.
➢ Disadvantage: Difficult to answer queries like total sales of a product over time
without reading every order.
➢Relational Databases:
➢ Advantage: Lack of aggregate structure allows flexible data access.
➢Views: Defined by computation over base tables, providing a way to look at data
differently.
➢Materialized Views:
➢ Precomputed data structures that store the results of complex queries to
improve query performance.
➢ Materialized views are effective for data that is read heavily but can stand
being somewhat stale
➢ Materialized views help optimize data access and improve performance,
especially in scenarios where certain queries do not fit well with the
aggregate structure.
➢Strategies for Building Materialized Views:
➢ Eager Approach:
➢ Update materialized views simultaneously with base data updates.
➢ Suitable for frequent reads and fresh data.
➢ Batch Jobs:
➢ Update materialized views at regular intervals to reduce overhead on each
update.
➢Implementation:
➢ Outside Database:
➢ Read data, compute the view, and save it back to the database.
➢ Within Database:
➢ Provide computation, and the database executes it as needed.
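The eager and batch strategies can be sketched for the "total sales of a product" query mentioned above. This is a minimal sketch that assumes orders are plain dictionaries and that prices are kept in integer cents (the `price_cents` field is an assumption made to avoid floating-point rounding); the function names are hypothetical.

```python
from collections import defaultdict

orders = []                           # base data: one aggregate per order
sales_by_product = defaultdict(int)   # materialized view: productId -> cents


def add_order_eager(order):
    """Eager approach: update the view in the same step as the base data."""
    orders.append(order)
    for item in order["orderItems"]:
        sales_by_product[item["productId"]] += item["price_cents"]


def rebuild_view_batch():
    """Batch approach: recompute the whole view from the base data,
    for example from a job that runs at regular intervals."""
    view = defaultdict(int)
    for order in orders:
        for item in order["orderItems"]:
            view[item["productId"]] += item["price_cents"]
    return view
```

With the eager approach the view is always fresh but every write does extra work; with the batch approach writes stay cheap and readers may see totals that are stale until the next rebuild.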
DISTRIBUTION MODELS
➢The important benefits of a distribution model are:
➢Depending on the distribution model, a data store can gain the ability to
handle larger quantities of data, the ability to process greater read or write
traffic, or more availability in the face of network slowdowns or breakages.
➢There are two paths to data distribution: replication and sharding.
➢Replication takes the same data and copies it over multiple nodes.
➢Sharding puts different data on different nodes.
Single Server
➢The simplest distribution option, and the one recommended first, is no
distribution at all.
➢Run the database on a single machine that handles all the reads and writes to
the data store.
➢The advantage is that it eliminates all the complexities that the other options
introduce; it's easy for operations people to manage and easy for application
developers to reason about.
➢Although a lot of NoSQL databases are designed around the idea of running on a
cluster, it can make sense to use NoSQL with a single-server distribution model if
the data model of the NoSQL store is more suited to the application.
➢Graph databases work best in a single-server configuration.
➢If the data usage is mostly about processing aggregates, then a single-server
document or key-value store may well be worthwhile because it's easier on
application developers.
Sharding
➢Sharding is a database partitioning technique that divides data into smaller subsets,
or shards, based on a key or other criterion, and stores them on separate servers.
➢It is a core feature of many NoSQL databases, which are designed for distributed
computing and often shard data automatically.
➢Sharding is a form of horizontal partitioning.
➢Sharding allows the data to be distributed across multiple servers, which can
improve scalability, performance, availability, and load balancing.
➢How does sharding work?
➢ Sharding works by applying a sharding function or algorithm to the data, which
determines how the data is assigned to different shards.
➢ The sharding function can be based on various criteria, such as a hash value, a
range, a list, or a custom logic.
➢ The sharding function should ensure that the data is evenly distributed among
the shards, and that the shards are easy to locate and access.
➢ The sharding function also defines the sharding key, which is the attribute or the
combination of attributes that identifies the shard for a given data item.
➢Benefits : Scalability, performance, and availability.
➢ It can help scale the database horizontally by adding more servers or nodes as the data
grows, thus reducing load and bottlenecks on a single server and increasing throughput
and storage capacity.
➢ Sharding can also improve performance by reducing query latency and network traffic
since queries can be executed on smaller and more relevant subsets of data.
➢ It can enable parallel processing and caching to further enhance performance.
➢ Combined with replication, sharding can improve availability: a node failure affects
only the shards on that node, and replicas provide redundancy, fault tolerance, and
backup for data durability.
➢Drawbacks: Complexity, Consistency, and Cost.
➢ It can increase the complexity of the database design, management, and maintenance, as
well as the cost of the database.
➢ Sharding requires careful planning and implementation to avoid data imbalance, hotspots,
or fragmentation.
➢ It can compromise the consistency of the database by introducing the possibility of data
inconsistency, duplication, or loss.
➢ It can create challenges for enforcing data integrity and atomicity of transactions.
➢ It can also create issues for performing complex queries, joins, or aggregations.
➢ It can demand more skills and expertise for managing the sharded database and resolving
potential problems or conflicts.
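A hash-based sharding function, one of the criteria listed under "How does sharding work?", can be sketched as follows. The shard count of 4, the key format, and the helper names `shard_for`, `put`, and `get` are assumptions made for this example; in-memory dictionaries stand in for separate servers.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a sharding key to a shard number using a stable hash."""
    # hashlib gives the same result on every run and machine, unlike the
    # built-in hash(), which is randomized per process for strings.
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Four in-memory dictionaries stand in for four separate servers.
shards = {i: {} for i in range(4)}


def put(key, value):
    shards[shard_for(key, len(shards))][key] = value


def get(key):
    # Only the one shard that owns the key is consulted.
    return shards[shard_for(key, len(shards))].get(key)
```

A simple modulo scheme like this remaps most keys when the number of shards changes, which is one reason real systems often use range-based or consistent-hashing schemes instead.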
Master-Slave Replication
➢With master-slave distribution, data is replicated across multiple nodes.
➢One node is designated as the master, or primary.
➢This master is the authoritative source for the data and is usually responsible
for processing any updates to that data.
➢The other nodes are slaves, or secondaries.
➢A replication process synchronizes the slaves with the master.
Advantages
➢1. Scaling for Read-Intensive Datasets:
➢ Horizontal Scaling: Add more slave nodes to handle increased read requests.
➢ Limitation: The master node's ability to process updates and pass them on
limits scalability for write-heavy datasets.
➢ 2. Read Resilience:
➢ Handling Master Failure: Slaves can continue to handle read requests if the master fails.
➢ Write Limitation: No writes can be processed until the master is restored or a new
master is appointed.
➢ Quick Recovery: Slaves can be quickly appointed as a new master, speeding up recovery
➢3. Hot Backup:
➢ Single-Server Store with Hot Backup: All traffic goes to the master, with the slave acting
as a backup.
➢ Resilience: Provides greater resilience and graceful handling of server failures
➢ 4. Master Appointment:
➢ Manual Appointment: Configure one node as the master during cluster
setup.
➢ Automatic Appointment: Nodes elect a master, simplifying configuration
and reducing downtime during failures.
➢5. Read Resilience Configuration:
➢ Separate Paths: Ensure different read and write paths in the application.
➢ Testing: Conduct tests to ensure reads still occur if writes are disabled.
Drawbacks
➢1. Inconsistency:
➢ Propagation Delay: Different clients may see different values due to delayed
propagation of changes.
➢ Hot Backup Concern: Updates not passed to the backup are lost if the master fails
➢2. Master-slave replication helps with read scalability but doesn’t help with the scalability
of writes.
➢3. It provides resilience against the failure of a slave, but not of a master.
➢4. The master is a bottleneck and a single point of failure
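The read scaling and propagation-delay inconsistency described above can be sketched with a toy cluster. This is a simplified model under stated assumptions: replication is triggered by an explicit `replicate()` call rather than a background process, and the class and method names are invented for illustration.

```python
class Node:
    def __init__(self):
        self.data = {}


class MasterSlaveCluster:
    """Toy model: writes go to the master, reads are served by slaves."""

    def __init__(self, num_slaves):
        self.master = Node()
        self.slaves = [Node() for _ in range(num_slaves)]
        self._pending = []   # updates not yet propagated to the slaves
        self._next_read = 0  # round-robin counter for spreading reads

    def write(self, key, value):
        # All updates are processed by the single master.
        self.master.data[key] = value
        self._pending.append((key, value))

    def replicate(self):
        # The replication process pushes pending updates to every slave.
        for key, value in self._pending:
            for slave in self.slaves:
                slave.data[key] = value
        self._pending.clear()

    def read(self, key):
        # Reads are spread across slaves, so adding slaves adds read capacity.
        slave = self.slaves[self._next_read % len(self.slaves)]
        self._next_read += 1
        return slave.data.get(key)
```

Between `write` and `replicate`, a read returns the old value: this is the propagation delay listed under Drawbacks, and any update still in `_pending` when the master fails would be lost.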
PEER-TO-PEER REPLICATION
➢Peer-to-peer replication overcomes the problems of master-slave replication.
➢All the replicas have equal weight, and can all accept writes, and the loss of any of
them doesn't prevent access to the data store.
➢Peer-to-peer replication in NoSQL is a technique that allows multiple servers to
maintain copies of data and share them.
➢All servers are considered peers, and any server can update the same data
at the same time.
➢The servers coordinate to keep their copies of the data in sync.
➢Advantages:
➢ Node Failure Resilience: Can handle node failures without losing data access.
➢ Scalability: Easily add nodes to improve performance
➢Complications:
➢ Consistency Issues: Risk of write-write conflicts when writing to multiple places
simultaneously.
➢ Transient Read Inconsistencies: Inconsistent reads are temporary, but
inconsistent writes are permanent.
➢Dealing with Write Inconsistencies:
➢ Coordination: Ensure replicas coordinate to avoid conflicts, similar to a master-
slave setup. Requires network traffic for coordination.
➢ Majority Agreement: Only a majority of replicas need to agree on a write,
allowing survival of minority node failures.
➢ Coping with Inconsistencies: Develop policies to merge inconsistent writes,
maximizing performance by allowing writes to any replica.
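The majority-agreement idea can be sketched as a write that succeeds only when more than half of the replicas acknowledge it. This is a simplified model: replicas are plain dictionaries, and the list of responding replicas is passed in by the caller, whereas a real system would discover it through network calls and timeouts. The function name is hypothetical.

```python
def write_with_quorum(replicas, available, key, value):
    """Apply a write only if a majority of replicas can acknowledge it.

    replicas:  list of dicts, one per peer
    available: indices of the replicas that respond to this write
    """
    quorum = len(replicas) // 2 + 1  # smallest majority, e.g. 2 of 3
    if len(available) < quorum:
        return False  # too few peers agree, so the write is rejected
    for index in available:
        replicas[index][key] = value
    return True
```

With three replicas the write survives the failure of one node, since two acknowledgements are enough, and two conflicting writes cannot both obtain a majority at the same time.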
Combining Sharding and Replication
➢Replication and sharding are strategies that can be combined.
➢If both master-slave replication and sharding are used, there are multiple
masters, but each data item has only a single master.
➢Depending on the configuration, a node can be chosen to be a master for some
data and a slave for other data, or nodes can be dedicated to master or slave duties.
➢Using peer-to-peer replication and sharding is a common strategy for
column-family databases.
➢In this model, there might be tens or hundreds of nodes in a cluster with
data sharded over them.
➢A good starting point for peer-to-peer replication is to have a replication
factor of 3, so each shard is present on three nodes.
➢Should a node fail, the shards on that node will be rebuilt on the other
nodes.
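Combining sharding with a replication factor of 3 can be sketched as a placement function that assigns each shard to three nodes. The ring-style placement, the node names, and the function names are assumptions made for this example and do not describe any particular database's algorithm.

```python
def replica_nodes(shard_id, nodes, replication_factor=3):
    """Place a shard on `replication_factor` consecutive nodes in ring order."""
    start = shard_id % len(nodes)
    count = min(replication_factor, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


def placement(num_shards, nodes, replication_factor=3):
    # shard id -> list of nodes that hold a copy of that shard
    return {
        shard: replica_nodes(shard, nodes, replication_factor)
        for shard in range(num_shards)
    }
```

If a node fails, calling `placement` again with the surviving nodes gives every shard three holders once more, which corresponds to rebuilding the lost copies on other nodes. This naive scheme also moves shards that were unaffected by the failure, so it is only a sketch of the idea.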