3. NoSQL
Relational Database vs. NoSQL
Relational Database
Student id   Name     Location   Gender   College
24           Sachin   Vasai      M        VCET
25           Dinesh   Virar      M        VIT
26           Mayuri   Palghar    F        SPIT
1. Follow structure while entering the data.
2. Blank space not allowed (mark Nil; it occupies space, so space is wasted).
3. Can’t insert additional data.
Need to change the data according to the
structure or mold the structure
according to the data (time consuming and
requires effort).
NoSQL
{
“Student_id” : “24”
“Name” : “Sachin”
“Hobby” : “Reading Books”
“Branch” : “CSE DS”
}
{
“Student_id” : “25”
“Name” : “Dinesh”
“Location” : “Virar”
“Hobby” : “Singing”
}
SQL vs. NoSQL
1) Category
• SQL: Databases are categorized as Relational Database Management Systems (RDBMS).
• NoSQL: Databases are categorized as non-relational or distributed database systems.
2) Schema
• SQL: Databases have a fixed, static or predefined schema.
• NoSQL: Databases have a dynamic schema.
3) Data model
• SQL: Data is displayed in the form of tables, so these are known as table-based databases.
• NoSQL: Data is displayed as collections of key-value pairs, documents, graph databases or wide-column stores.
4) Query language
• SQL: Uses a powerful language, "Structured Query Language", to define and manipulate the data.
• NoSQL: Collections of documents are queried through database-specific interfaces, sometimes called an unstructured query language. The syntax varies from database to database.
5) Complex queries
• SQL: Best suited for complex queries (Structured Query Language).
• NoSQL: Not as good for complex queries, because the query interfaces are not as powerful as SQL (unstructured query language).
6) Hierarchical data
• SQL: Not best suited for hierarchical data storage (based on the ACID properties).
• NoSQL: Best suited for hierarchical data storage (CAP theorem).
7) Examples
• SQL: MySQL, Oracle, SQLite, PostgreSQL, MS-SQL, etc.
• NoSQL: MongoDB, BigTable, Redis, RavenDB, Cassandra, HBase, Neo4j, CouchDB, etc.
Brief History of NoSQL Databases
• 1998- Carlo Strozzi uses the term NoSQL for his lightweight, open-source relational
database
• 2000- Graph database Neo4j is launched
• 2004- Google BigTable is launched
• 2005- CouchDB is launched
• 2007- The research paper on Amazon Dynamo is released
• 2008- Facebook open-sources the Cassandra project
• 2009- The term NoSQL is reintroduced
Introduction to NoSQL
• Non-relational database (doesn't have the features that define a relational database)
• Doesn’t have a predefined schema (the structure of the data is not fixed in advance; the number of
attributes and their meaning need not be the same at every point in time)
• Doesn’t store data in tables (stores unstructured and semi-structured data)
• Generally used to store big data and real-time data (data can be of different types, e.g.
image, audio, video, which do not fit the traditional database format. It also stores
streaming data, i.e. data recorded live, which the structured format did not handle,
so NoSQL was introduced)
• Follows the CAP theorem (Consistency, Availability, Partition tolerance): a NoSQL system can’t guarantee all 3
properties at the same time; it can guarantee any 2 of them.
What is NoSQL?
• NoSQL is a next-generation database which is completely different from the traditional database.
• NoSQL refers to a non-relational database management system designed to handle large volumes of
diverse, unstructured, and semi-structured data that traditional relational databases struggle with.
• NoSQL stands for Not only SQL. SQL as well as other query languages can be used with NoSQL databases.
• NoSQL is a non-relational database and it is schema-free.
• NoSQL uses a distributed architecture and works on multiple processors to give high performance.
• NoSQL databases are horizontally scalable.
• Many open-source NoSQL databases are available. Data files can be easily replicated.
• NoSQL uses a simple API.
• NoSQL can manage huge amounts of data.
• NoSQL can be implemented on commodity hardware in which each machine has its own RAM and disk
(shared-nothing concept).
Features of NoSQL
1. Never follows the relational model (never provides tables with fixed-column
records; it is schema-free).
2. Provides a shared-nothing environment (if data is distributed across nodes, each
node receives different data and performs its task
independently).
3. Scalability (scale up, i.e. vertical scaling [upgrading the hardware configuration,
such as RAM and HDD, which is expensive] and scale out, i.e. horizontal scaling
[dividing the load or data across multiple nodes, which is low cost]).
4. Low-cost hardware as compared to relational databases.
5. Faster performance (because space is not wasted).
NoSQL Business Drivers
Volume:
• Data is being generated exponentially, which increases its volume; with this increase
in data there is a need to extend the storage capacity.
• NoSQL databases are designed to handle massive amounts of data, which is a core
characteristic of big data.
• Because of this volume, there is a need to scale up.
• Scaling is needed not only for storage but also for processing speed and
resources.
• If the amount of data is large, more processing is required, and more resources are
needed to meet that requirement.
• An RDBMS would take more time to process this huge amount of data, so
organizations shifted from serial to parallel processing (the data is divided across
the nodes of a cluster, every part is processed in parallel, and finally the outputs of
these parts are combined).
Velocity:
• Velocity means the rate at which data is generated.
• Initially, the rate at which data was generated was very low, hence
very little traffic was generated.
• The number of requests for accessing the data was small.
• As velocity grew, there were random bursts in web traffic; it was difficult for an RDBMS
to respond to all these requests in the given amount of time, which resulted in slow response
times.
• To deal with this problem, a huge number of resources was needed, which
was expensive for the organization.
Variability:
• Big data is not always structured and consistent. NoSQL databases excel at handling
diverse data types (structured, semi-structured, and unstructured) without requiring rigid
schemas.
• This flexibility is essential for accommodating the various data formats generated by social
media, IoT devices and other modern sources.(availability of data)
Agility:
• NoSQL databases offer greater agility in terms of development and deployment.
• They allow for faster iteration and adaptation to changing business needs due to their
flexible schema designs and ability to scale horizontally.
• This agility is particularly important in fast-paced environments where time to market is a
competitive advantage. (time taken to feed or retrieve data is high)
CAP Theorem
• The CAP theorem, originally introduced as the CAP principle, can be used to explain
some of the competing requirements in a distributed system with replication.
• It is a tool used to make system designers aware of the trade-offs while designing
networked shared-data systems.
• The three letters in CAP refer to three desirable properties of distributed systems
with replicated data: consistency (among replicated copies), availability (of the
system for read and write operations) and partition tolerance (in the face of the
nodes in the system being partitioned by a network fault).
• The CAP theorem states that it is not possible to guarantee all three of the desirable
properties – consistency, availability, and partition tolerance at the same time in a
distributed system with data replication.
• The theorem states that networked shared-data systems can only strongly support
two of the following three properties:
• Consistency: means that all clients see the same data at the same time, no matter
which node they connect to in a distributed system. To achieve consistency,
whenever data is written to one node, it must be instantly forwarded or replicated to
all the other nodes in the system before the write is deemed successful.
• Availability: means that every non-failing node returns a response for all read and
write requests in a reasonable amount of time, even if one or more nodes are down.
Another way to state this — all working nodes in the distributed system return a valid
response for any request, without failing or exception.
• Partition Tolerance: means that the system continues to operate despite arbitrary
message loss or failure of part of the system. In other words, even if there is a
network outage in the data center and some of the computers are unreachable, still
the system continues to perform. Distributed systems guaranteeing partition
tolerance can gracefully recover from partitions once the partition heals.
The CAP theorem categorizes systems into three categories:
• CP (Consistent and Partition Tolerant) database:
A CP database delivers consistency and partition tolerance at the expense of
availability.
When a partition occurs between any two nodes, the system has to shut down the
non-consistent node (i.e., make it unavailable) until the partition is resolved.
Partition refers to a communication break between nodes within a distributed
system. Meaning, if a node cannot receive any messages from another node in the
system, there is a partition between the two nodes.
Partition could have been because of network failure, server crash, or any other
reason.
• AP (Available and Partition Tolerant) database:
An AP database delivers availability and partition tolerance at the expense of
consistency.
When a partition occurs, all nodes remain available but those at the wrong end
of a partition might return an older version of data than others.
When the partition is resolved, the AP databases typically resync the nodes to
repair all inconsistencies in the system.
• CA (Consistent and Available) database:
A CA database delivers consistency and availability in the absence of any network
partition.
Single-node DB servers are often categorized as CA systems.
They do not need to deal with partition tolerance and are thus
considered CA systems.
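The CP and AP behaviours described above can be sketched with a toy two-replica cluster. This is only an illustration under stated assumptions: the class names (`TinyCluster`, `Replica`, `Unavailable`), the single global version counter, and the choice of `n2` as the cut-off node are all hypothetical simplifications, not how any real database is implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


class Unavailable(Exception):
    """Raised when a node refuses a request to protect consistency."""


@dataclass
class Replica:
    name: str
    # key -> (version, value)
    data: Dict[str, Tuple[int, str]] = field(default_factory=dict)


class TinyCluster:
    """Two replicas with a switchable network partition (illustrative only)."""

    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.n1 = Replica("n1")  # assumed to stay on the reachable side
        self.n2 = Replica("n2")  # assumed to be the cut-off node
        self.partitioned = False
        self._version = 0  # simplification: one global version counter

    def _check(self, node: Replica) -> None:
        # CP: the cut-off node becomes unavailable during a partition.
        if self.partitioned and self.mode == "CP" and node is self.n2:
            raise Unavailable(f"{node.name} is cut off and refuses requests")

    def write(self, node: Replica, key: str, value: str) -> None:
        self._check(node)
        self._version += 1
        node.data[key] = (self._version, value)
        if not self.partitioned:
            # Replicate to the other node only while the network is intact.
            other = self.n2 if node is self.n1 else self.n1
            other.data[key] = (self._version, value)

    def read(self, node: Replica, key: str) -> Optional[str]:
        self._check(node)
        entry = node.data.get(key)
        return entry[1] if entry else None

    def heal(self) -> None:
        # AP-style repair: the highest version wins on both replicas.
        self.partitioned = False
        for key in set(self.n1.data) | set(self.n2.data):
            newest = max(r.data[key] for r in (self.n1, self.n2) if key in r.data)
            self.n1.data[key] = newest
            self.n2.data[key] = newest
```

With `mode="AP"`, writing a key, setting `partitioned = True`, updating the key on `n1` and then reading it from `n2` returns the older value until `heal()` is called. With `mode="CP"`, the same read on `n2` raises `Unavailable` instead of returning stale data.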
NOSQL Case study
1. Amazon DynamoDB
2. Google’s BigTable
• Google’s motivation for developing BigTable was its need for
massive scalability, better performance characteristics and the ability to run on
commodity hardware.
• Each time a new service is added or the load increases, the BigTable
solution results in only a small incremental cost.
• The volume of Google’s data is generally in petabytes and is distributed over
100,000 nodes.
3. MongoDB
• MongoDB was designed by Eliot Horowitz with his team at 10gen.
• MongoDB was built based on their experience in building large-scale, high-availability,
robust systems.
• MongoDB was conceived by changing the data model of MySQL from relational to
document-based, to achieve speed, manageability, agility, schema-less databases
and easier horizontal scalability (also JOIN-free).
• Relational databases like MySQL or Oracle work well with, say, indexes, dynamic
queries and updates.
• MongoDB works in much the same way but has the option of indexing an
embedded field.
4. Neo4j
• Neo4j is an open-source graph database (source code is available on GitHub) sponsored by
Neo Technology.
• It is a NoSQL graph database implemented in Java.
• Its development started in 2003; it has been publicly available since
2007.
• Neo4j is used today by hundreds to thousands of enterprises.
• Application areas, to name a few: scientific research, routing, matchmaking, network
management, recommendations, social networks, software analytics,
organizational and project management.
Desirable features of NoSQL that drive business are listed below:
1. 24 × 7 Data availability
2. Location transparency
3. Schema-less data model
4. Modern day transaction analysis
5. Architecture that suits big data
6. Analytics and business intelligence
NoSQL Data Architectural Patterns
Types of NoSQL Data Stores
1. Key−value store.
2. Column store.
3. Document store.
4. Graph store.
1. Key Value Store Database
• Most basic data model.
• Stores the data in the form of key-value pairs.
• The key identifies the data value.
• The key can be an integer, string or any other data type but it must always be
unique.
• The value is the data associated with the key (JSON, BLOB (Binary Large
Object), string, etc.)
• The key-value pair storage databases generally store data as a hash table
where each key is unique.
• This type of pattern is usually used in shopping websites or e-commerce
applications.
Advantages:
• Can handle large amounts of data and heavy load
• Easy retrieval of data by keys.
Disadvantages:
• Complex queries may involve multiple key-value pairs which may delay performance.
• Data can involve many-to-many relationships which may collide.
Use:
• DynamoDB
• Berkeley DB
Examples of Key−Value Stores
• Redis, Amazon Dynamo, Azure Table Storage (ATS), Riak, Memcache,
etc.
Uses of Key−Value Stores
• Dictionary, image store, lookup tables, cache query, etc.
• A key−value store is similar to a dictionary, where for a word (key) all
the associated words (noun/verb forms, plural, picture, phrase in
which the word is used, etc.) and meaning (values) are given.
• External websites are stored as a key−value store in Google’s
database. Amazon S3 (Simple Storage Service) makes use of a
key−value store to save digital media content like photos, music and
videos in the cloud. In a key−value store, the static component is the
URL of the website and images. The dynamic component of the
website generated by scripts is not stored in the key−value store.
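The key-value pattern above (a unique key, an opaque value, hash-table storage) can be sketched as a small in-memory class. This is a minimal sketch, not the API of Redis, DynamoDB or any other product; the class name `KeyValueStore`, the `cart:42` key format and the cart contents are hypothetical.

```python
import json
from typing import Any, Dict, Optional


class KeyValueStore:
    """In-memory key-value store backed by a hash table (a Python dict)."""

    def __init__(self) -> None:
        self._table: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        # Keys are unique: writing an existing key overwrites its value.
        self._table[key] = value

    def get(self, key: str, default: Optional[Any] = None) -> Any:
        # Retrieval is by key only; the store never looks inside the value.
        return self._table.get(key, default)

    def delete(self, key: str) -> bool:
        if key in self._table:
            del self._table[key]
            return True
        return False


# Hypothetical shopping-cart usage: the value is an opaque JSON string.
store = KeyValueStore()
store.put("cart:42", json.dumps({"items": ["book", "pen"], "total": 350}))
cart = json.loads(store.get("cart:42"))
```

Because the store only understands keys, a question such as "which carts contain a pen?" would require reading every value, which is the complex-query disadvantage listed above.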
2. Column Store Database
• Data is stored in individual cells (comparable to an RDBMS).
• Every column is handled separately (all the columns under a particular
column family function independently).
• An individual column family can contain several columns.
Advantages:
• Readily available data
• Aggregate queries can run readily on data(SUM, AVG, COUNT , etc ).
Disadvantages:
• Not efficient with online transactional processing.
Use:
• HBase
• Cassandra
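To illustrate why aggregate queries run readily on a column store, the sketch below converts row-oriented records into a column-oriented layout and aggregates a single column. It is a simplified in-memory illustration, not how HBase or Cassandra store data; the `marks` values and all function names are hypothetical.

```python
from typing import Any, Dict, List

# Row-oriented records (the "marks" values are made up for illustration).
rows = [
    {"student_id": 24, "name": "Sachin", "marks": 80},
    {"student_id": 25, "name": "Dinesh", "marks": 70},
    {"student_id": 26, "name": "Mayuri", "marks": 90},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Store each column's values together (assumes every record has the same keys)."""
    columns: Dict[str, List[Any]] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def column_sum(columns: Dict[str, List[Any]], name: str) -> float:
    # Only the requested column is scanned; other columns are never touched.
    return sum(columns[name])


def column_count(columns: Dict[str, List[Any]], name: str) -> int:
    return len(columns[name])


def column_avg(columns: Dict[str, List[Any]], name: str) -> float:
    return column_sum(columns, name) / column_count(columns, name)


columns = to_columns(rows)
```

An aggregate such as `column_avg(columns, "marks")` reads one contiguous list, whereas updating one student touches every column list, which hints at why this layout is less efficient for transactional workloads.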
3. Document Database
• Stores the data in key-value pair but here values are termed as documents.
• Documents can be any complex data structures.
• Documents can be arrays, strings, XML, JSON, etc.
• Documents can be nested too ( Multiple documents can be inside single
document) which can increase the complexity of Storage of data.
• For example, if the root is Employee, the path can be
Employee[id=‘2300’]/Address/street/BuildingName/text()
• Though the document store tree structure is complex the search API is simple.
• Document structure uses JSON (JavaScript Object Notation) format for deep
nesting of tree structures associated with serialized objects.
• But JSON does not support document attributes such as bold, hyperlinks, etc.
• Examples include: MongoDB, CouchBase, and CouchDB.
Advantages:
• Useful for semi-structured data.
• Retrieval and management of data is easy.
Disadvantages:
• Aggregate operations may not work fine (as data is stored in the form of semi-structure data).
Use:
• CouchDB
• MongoDB
MongoDB
This scalable, high-performance, open-source NoSQL database features
document-oriented storage, full index support, replication and fast in-place updates.
It is suitable for dynamic queries and dynamic data structures, and is written in
C++.
CouchDB
Also an open-source database, it focuses on the ease of data storage in a series
of JSON documents, each with its own definition of the schema. It provides ACID
semantics at the document level using multi-version concurrency control, which
avoids locking the database files during writes, and offers eventual consistency
across replicas.
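The path-style lookup mentioned for document stores (for example Employee[id=‘2300’]/Address/street/BuildingName) can be sketched over nested Python dictionaries. This is a minimal sketch, not the query API of MongoDB or CouchDB; the helper names and all field values are hypothetical.

```python
from typing import Any, Dict, List, Optional

# Two documents in one collection; their structure differs (schema-free).
employees: List[Dict[str, Any]] = [
    {
        "id": "2300",
        "name": "Example Employee",
        "Address": {
            "city": "Vasai",
            "street": {"name": "Station Road", "BuildingName": "Sunrise Towers"},
        },
    },
    {"id": "2301", "name": "Another Employee", "Address": {"city": "Virar"}},
]


def find_by_id(collection: List[Dict[str, Any]], doc_id: str) -> Optional[Dict[str, Any]]:
    """Return the first document whose "id" field matches, else None."""
    for document in collection:
        if document.get("id") == doc_id:
            return document
    return None


def get_path(document: Dict[str, Any], path: str) -> Optional[Any]:
    """Walk a slash-separated path through nested documents."""
    current: Any = document
    for part in path.split("/"):
        if not isinstance(current, dict) or part not in current:
            return None  # the path does not exist in this document
        current = current[part]
    return current


employee = find_by_id(employees, "2300")
building = get_path(employee, "Address/street/BuildingName")
```

The second document has no street at all, so the same path simply yields `None` there; no schema change is needed to store it alongside the first.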
4. Graph Database
• Stores the data in form of graphs.
• Graphs are basic data structures that state the connections between objects.
• Objects that are connected are termed nodes (in this case objects are called
nodes, and these nodes are connected to other nodes).
• Relationships between these objects are termed edges (the connections that
are used to define the relationships between these objects or nodes are termed
edges).
• Graph stores contain a sequence of nodes and relations that form the graph. Both
nodes and relationships contain properties; relationships are labelled, for example, follows, friend, family, etc.
• So, a graph store has three fields: nodes, relationships and properties.
• Examples include: Neo4j, AllegroGraph, Teradata Aster.
Some of Neo4j features are listed below:
1. Neo4j has CQL (Cypher Query Language), which is much like SQL.
2. Neo4j supports indexes by using Apache Lucene.
3. It supports UNIQUE constraints.
4. Neo4j Data Browser is the UI to execute CQL commands.
5. It supports ACID properties of RDBMS.
6. It uses a native graph processing engine to store graphs.
7. It can export query data to JSON and XLS format.
8. It provides a REST API that can be accessed from Java, Scala, etc.
Advantages:
• Fast traversal and retrieval of data.
Disadvantages:
• Because nodes are connected to each other, a traversal can easily reach the entire graph, which
increases the traversal cost. If a wrong relationship is
established between any two nodes, a traversal may run into an infinite loop.
Use:
• Neo4J
• FlockDB
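The three fields of a graph store (nodes, relationships and properties) and the infinite-loop concern can be sketched as follows. This is a minimal in-memory sketch, not the Neo4j or FlockDB API; the class name `GraphStore`, the people and the `since` property are hypothetical. The traversal keeps a `visited` set so that a cycle cannot cause endless looping.

```python
from collections import deque
from typing import Any, Dict, List, Tuple


class GraphStore:
    """Nodes and relationships, each carrying a dictionary of properties."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Dict[str, Any]] = {}
        # source node -> list of (relationship type, target node, properties)
        self.edges: Dict[str, List[Tuple[str, str, Dict[str, Any]]]] = {}

    def add_node(self, node_id: str, **properties: Any) -> None:
        self.nodes[node_id] = properties
        self.edges.setdefault(node_id, [])

    def add_relationship(self, source: str, rel: str, target: str, **properties: Any) -> None:
        self.edges.setdefault(source, []).append((rel, target, properties))

    def reachable(self, start: str, rel: str) -> List[str]:
        """Breadth-first traversal along one relationship type."""
        visited = {start}  # guards against cycles in the graph
        queue = deque([start])
        order: List[str] = []
        while queue:
            current = queue.popleft()
            for edge_rel, target, _ in self.edges.get(current, []):
                if edge_rel == rel and target not in visited:
                    visited.add(target)
                    order.append(target)
                    queue.append(target)
        return order


graph = GraphStore()
for person in ("sachin", "dinesh", "mayuri"):
    graph.add_node(person, kind="student")
graph.add_relationship("sachin", "follows", "dinesh", since=2023)
graph.add_relationship("dinesh", "follows", "mayuri", since=2024)
graph.add_relationship("mayuri", "follows", "sachin", since=2024)  # closes a cycle
followed = graph.reachable("sachin", "follows")
```

Without the `visited` set, the cycle sachin → dinesh → mayuri → sachin would be walked forever, which is the failure mode described in the disadvantages above.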
Type, typical usage and examples:
Key-value store: A simple data storage system that uses a key to access a value
• Typical usage: Image stores; key-based file systems; object cache; systems designed to scale
• Examples: Berkeley DB, Memcache, Redis, Riak, DynamoDB
Column family store: A sparse matrix system that uses a row and a column as keys
• Typical usage: Web crawler results; big data problems that can relax consistency rules
• Examples: Apache HBase, Apache Cassandra, Hypertable, Apache Accumulo
Graph store: For relationship-intensive problems
• Typical usage: Social networks; fraud detection; relationship-heavy data
• Examples: Neo4j, AllegroGraph, Bigdata (RDF data store), InfiniteGraph (Objectivity)
Document store: Storing hierarchical data structures directly in the database
• Typical usage: High-variability data; document search; integration hubs; web content management; publishing
• Examples: MongoDB (10Gen), CouchDB, Couchbase, MarkLogic, eXist-db, Berkeley DB XML
Variation of NoSQL Architectural patterns
• Variations are the different architectural patterns that NoSQL databases follow.
• Variations exist because different problems need different storage and access
models.
• The key−value store, column family store, document store and graph store
patterns can be modified based on different aspects of the system and its
implementation.
• Database architecture could be distributed (manages single database
distributed in multiple servers located at various sites) or federated (manages
independent and heterogeneous databases at multiple sites).
1. Customization for RAM or SSD stores
2. Distributed stores
3. Grouping Items
NoSql Case Study
Case study: LiveJournal’s Memcache
• Engineers working on the blogging system LiveJournal started
to look at how their systems were using their most precious
resource: the RAM in each web server.
• LiveJournal had a problem. Their website was so popular that
the number of visitors using the site continued to increase on a
daily basis. The only way they could keep up with demand was
to continue to add more web servers, each with its own
separate RAM.
Case study: Google’s MapReduce - use commodity hardware to
create search indexes
• One of the most influential case studies in the NoSQL
movement is the Google MapReduce system. In the MapReduce paper,
Google shared their process for transforming large volumes of
web data content into search indexes using low-cost commodity
CPUs.
• Though sharing of this information was significant, the concepts
of map and reduce weren’t new. Map and reduce functions are
simply names for two stages of a data transformation, as given
in the figure.
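The two stages named above can be sketched on a single machine by building a tiny search (inverted) index. This is only an illustration of the map and reduce idea, not Google's implementation; the function names, the intermediate `shuffle` step and the sample pages are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(doc_id: str, text: str) -> List[Tuple[str, str]]:
    """Map: emit one (word, document id) pair per word in the page."""
    return [(word.lower(), doc_id) for word in text.split()]


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group all emitted values by key before reducing."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word: str, doc_ids: List[str]) -> Tuple[str, List[str]]:
    """Reduce: collapse the document ids for one word into a sorted, unique list."""
    return word, sorted(set(doc_ids))


def build_index(pages: Dict[str, str]) -> Dict[str, List[str]]:
    pairs: List[Tuple[str, str]] = []
    for doc_id, text in pages.items():
        # Each page could be mapped on a different commodity machine.
        pairs.extend(map_phase(doc_id, text))
    return dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())


pages = {"p1": "NoSQL scales out", "p2": "SQL scales up"}
index = build_index(pages)
```

Because each call to `map_phase` depends only on its own page and each call to `reduce_phase` only on its own word, both stages can be spread across many machines.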
What is a Big Data NoSQL Solution?
• A decade ago, NoSQL was deployed in companies such as Google,
Amazon, Facebook and LinkedIn.
• Nowadays, most enterprises with customer-centric,
revenue-driving applications that serve millions of consumers are
adopting this type of database.
• The move is motivated by the explosive growth of mobile devices, the IoT
and cloud infrastructure.
• Industry requirements for scalability and performance were
rising, and relational database technology was never designed to
address them.
• Thus, enterprises are turning to NoSQL to overcome these limitations. A
few of the case studies which require NoSQL kind of databases are listed
in the following subsections.
1 Recommendation
2 User Profile
3 Real-Time Data Handling
4 Content Management
5 Catalog Management
6 360-Degree Customer View
7 Mobile Applications
8 Internet of Things
9 Fraud Detection
Use case, explanation, suitable NoSQL type and examples:
1. Recommendation
• Explanation: Suggests products, movies, or content based on user history and preferences (e.g., Amazon, Netflix). Needs relationship tracking between users and items.
• Suitable NoSQL type: Graph DB / Document DB
• Examples: Neo4j, MongoDB
2. User Profile
• Explanation: Stores dynamic user info (name, preferences, activity logs). Data structure varies from user to user.
• Suitable NoSQL type: Document DB
• Examples: MongoDB, Couchbase
3. Real-Time Data Handling
• Explanation: High-speed applications (stock trading, gaming, chat apps, IoT sensors). Requires very fast read/write.
• Suitable NoSQL type: Key-Value Store / Wide-Column Store
• Examples: Redis, Cassandra
4. Content Management
• Explanation: Manages unstructured/semi-structured data (articles, videos, blogs, metadata). Needs search & scalability.
• Suitable NoSQL type: Document DB
• Examples: MongoDB, CouchDB
5. Catalog Management
• Explanation: E-commerce catalogs with flexible product attributes (clothes vs. electronics). Schema-less is needed.
• Suitable NoSQL type: Document DB
• Examples: MongoDB, Couchbase
6. 360-Degree Customer View
• Explanation: Combines CRM, transactions, social media, and customer support data for a unified view.
• Suitable NoSQL type: Graph DB + Document DB
• Examples: Neo4j + MongoDB
7. Mobile Applications
• Explanation: Apps need offline sync, fast response, and flexible JSON storage.
• Suitable NoSQL type: Document DB / Key-Value
• Examples: Firebase, Couchbase Mobile
8. Internet of Things (IoT)
• Explanation: Continuous streams of data from sensors (temperature, GPS, devices). High write throughput required.
• Suitable NoSQL type: Time-Series DB / Wide-Column
• Examples: Cassandra, InfluxDB
9. Fraud Detection
• Explanation: Real-time identification of abnormal transactions. Relies on pattern and relationship analysis.
• Suitable NoSQL type: Graph DB / Wide-Column Store
• Examples: Neo4j, Cassandra
Understanding Types of Big Data Problems
Big Data problems are categorized into two broad types based on how the data is accessed and
used:
A. Read-mostly Problems
• These problems involve data that is written once (or rarely updated) but read many times.
• Example: Logs, images, documents.
Subcategories:
1. Image - Large collections of images stored and retrieved (e.g., medical images, satellite images).
2. Event-log
• System or user activity logs.
• Two modes of processing:
• Real time → Data is processed as it arrives.
Example: Clickstream data from a website, IoT sensor data.
• Batch → Data is collected and processed later in bulk.
Example: Daily operational reports, server log analysis.
3. Documents
• Text-heavy data requiring indexing and search.
• Full-text search problems include:
• Simple text → keyword or phrase search.
• Annotations → metadata or tagging for better retrieval.
4. Graph
• Data represented as nodes and edges.
• Example: Social networks, recommendation engines.
B. Read-write Problems
• These involve frequent updates as well as reads.
• Subcategories:
• High availability
• Data must always be available with minimal downtime.
• Example: Cloud databases for e-commerce, stock exchanges.
• Transactions
• Require strong consistency and atomic updates.
• Example: Banking systems, online payments.
Some ways to classify big data problems and see how NoSQL systems
are changing the way organizations use data:
1. Read mostly
2. Log events
3. Full text documents
Analyzing Big Data with a Shared Nothing Architecture
• In a distributed computing architecture, resources can be shared in two ways,
or not shared at all.
• The RAM can be shared or the disk can be shared (by CPUs); or no resources are shared.
• These three options are known as shared RAM, shared disk and
shared-nothing.
• Each of these architectures works with different types of data to solve big data
problems.
• In shared RAM, many CPUs access a single shared RAM over a high-speed bus.
• This system is ideal for large computations and also for graph stores.
• For graph traversals to be fast, the entire graph should be in main memory.
• In the shared-disk system, processors have independent RAM but share disk space
using a storage area network (SAN).
• Big data uses commodity machines which share nothing (no shared resources).
Choosing Distribution Models: Master-Slave Versus Peer-to-Peer
• A NoSQL database makes distribution of data easier, since it has to move only the aggregate and not all
the related data that is used in the aggregation.
• There are two styles of distributing data: sharding and replication. A system may use either or both
techniques.
For example, the Riak database shards the data and also replicates it.
1. Sharding: Horizontal partitioning of a large database partitions the rows of the database. Each
partition is called a shard, meaning a small part of the whole. Each shard can be located on a
separate database server or at a separate physical location.
2. Replication: Replication copies the same data across multiple servers, so the data is
available in multiple places.
Replication comes in two forms: master−slave and peer-to-peer.
• Master−slave replication: One node has the authoritative copy that handles writes. Slaves synchronize
with the master and handle reads.
• Peer-to-peer replication: This allows writes to any node; the nodes coordinate between themselves to
synchronize their copies of the data.
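Sharding combined with master−slave replication, as described above, can be sketched in a few lines. This is a simplified in-memory sketch under stated assumptions: the class names `ReplicaSet` and `ShardedStore`, the MD5-based shard choice and the synchronous copy to the slaves are all hypothetical and are not taken from Riak or any other product.

```python
import hashlib
from itertools import cycle
from typing import Dict, List, Optional


class ReplicaSet:
    """One master that takes writes, plus slaves that serve reads."""

    def __init__(self, n_slaves: int = 2) -> None:
        self.master: Dict[str, str] = {}
        self.slaves: List[Dict[str, str]] = [{} for _ in range(n_slaves)]
        self._next_slave = cycle(range(n_slaves))

    def write(self, key: str, value: str) -> None:
        self.master[key] = value  # authoritative copy
        for slave in self.slaves:
            slave[key] = value  # simplification: slaves are updated immediately

    def read(self, key: str) -> Optional[str]:
        # Reads rotate over the slaves, which spreads the read load.
        return self.slaves[next(self._next_slave)].get(key)


class ShardedStore:
    """Each key lives on exactly one shard; each shard is a replica set."""

    def __init__(self, n_shards: int = 3) -> None:
        self.shards = [ReplicaSet() for _ in range(n_shards)]

    def _shard_for(self, key: str) -> ReplicaSet:
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def write(self, key: str, value: str) -> None:
        self._shard_for(key).write(key, value)

    def read(self, key: str) -> Optional[str]:
        return self._shard_for(key).read(key)


store = ShardedStore(n_shards=3)
store.write("student:24", "Sachin")
store.write("student:25", "Dinesh")
```

Sharding splits the data so that each server holds only part of it, while replication inside each shard keeps several copies of that part available for reads.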
Four Ways that NoSQL System Handles Big Data Problems
Every business needs to find the technology trends that have impact on its revenue.
Modern business not only needs data warehouse but also requires web/mobile
application generated data and social networking data to understand their customer’s
need.
NoSQL systems help to analyze such data.
IT executives select the right NoSQL systems and set up and configure them.
1. Manage Data Better by Moving Queries to the Data
▪ NoSQL systems use commodity hardware to store fragmented data on a
shared-nothing architecture, except for graph databases, which require specialized
processors.
▪ NoSQL databases improve performance drastically over RDBMS systems by moving
the query to each node for processing rather than transferring the huge data to a single
processor.
2. Using Consistent Hashing Data on a Cluster
▪ A server in a distributed system is identified by a key to store or
retrieve data.
▪ The most challenging problem here is when servers become
unreachable through network partitions or when a server fails.
▪ Suppose there are “n” servers to store or retrieve a value.
▪ The server is identified by hashing the value’s key modulo n.
▪ But when a server fails, it no longer fills its part of the hash space.
▪ With modulo hashing, the only option is to invalidate the cache on all servers, renumber
them, and start once again.
▪ This solution is not feasible if the system has hundreds of servers
and one or the other server fails.
▪ Consistent hashing avoids this by placing servers and keys on the same hash ring,
so that when a server fails only the keys it owned move to another server.
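The difference between modulo hashing and consistent hashing can be sketched by counting how many keys change owner when one server is removed. This is a minimal sketch under stated assumptions: the class name `HashRing`, the use of MD5, the 100 virtual points per server and the server and key names are illustrative choices, not a specific product's algorithm.

```python
import bisect
import hashlib
from typing import Callable, List


def stable_hash(value: str) -> int:
    """Deterministic hash (Python's built-in hash() varies between runs)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def modulo_owner(key: str, servers: List[str]) -> str:
    """Naive placement: hash of the key modulo the number of servers."""
    return servers[stable_hash(key) % len(servers)]


class HashRing:
    """Consistent hashing: servers and keys share one circular hash space."""

    def __init__(self, servers: List[str], vnodes: int = 100) -> None:
        # Each server is placed at several points (virtual nodes) on the ring.
        self._ring = sorted(
            (stable_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def owner(self, key: str) -> str:
        # The first server point clockwise from the key's position owns the key.
        index = bisect.bisect(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[index][1]


def moved(keys: List[str], before: Callable[[str], str], after: Callable[[str], str]) -> int:
    """Count the keys whose owner differs between two placements."""
    return sum(1 for key in keys if before(key) != after(key))


keys = [f"key{i}" for i in range(1000)]
servers = ["s0", "s1", "s2", "s3"]
survivors = servers[:-1]  # pretend that s3 has failed

ring_before, ring_after = HashRing(servers), HashRing(survivors)
moved_modulo = moved(keys, lambda k: modulo_owner(k, servers),
                     lambda k: modulo_owner(k, survivors))
moved_ring = moved(keys, ring_before.owner, ring_after.owner)
```

On the ring, only the keys that belonged to the failed server are reassigned, while modulo hashing reshuffles most keys, which is why the whole cache would have to be invalidated.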
3. Using Replication to Scale Reads
▪ Replication improves read performance and database server
availability. Replication can be used as a scale-out solution where
you want to split up database queries across multiple database
servers.
▪ Replication works by distributing the load of one master to one or
more slaves.
▪ This works best in an environment where there are high number of
reads and low number of writes or updates.
▪ Most users browse the website for reading articles, posts or view
products.
▪ Writes occur only when making a purchase (during session
management) or when adding a comment or sending message to a
forum.
4. Letting the Database Distribute Queries Evenly to DataNodes
▪ The most important strategy of NoSQL data store is moving query to
the database server and not vice versa.
▪ Every node in the cluster in shared-nothing architecture is identical;
all the nodes are peers.
▪ Data is distributed evenly across all the nodes in a cluster using
a hash function that spreads keys uniformly, so there are no bottlenecks.
▪ In an ideal scale-out architecture, the “shared-nothing” concept is used.
▪ Since no resource is shared, there is no bottleneck and all the nodes
in this architecture act as peers.
▪ Data is evenly distributed among peers through a process called
sharding.
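Moving the query to the data, rather than the data to the query, can be sketched as a scatter-gather aggregate: every node computes a small partial result over its own shard, and only those partial results are combined. This is a local simulation using threads, not a real cluster; the function names, the `category` and `amount` fields and the sample records are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def local_query(partition: List[Record], category: str) -> Tuple[float, int]:
    """Runs on one node: aggregate only that node's own records."""
    amounts = [r["amount"] for r in partition if r["category"] == category]
    return sum(amounts), len(amounts)


def scatter_gather(partitions: List[List[Record]], category: str) -> Tuple[float, int]:
    """Send the query to every partition in parallel, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        partials = list(pool.map(partial(local_query, category=category), partitions))
    total = sum(subtotal for subtotal, _ in partials)
    count = sum(matches for _, matches in partials)
    return total, count


# Three shards, as if held by three shared-nothing nodes.
partitions = [
    [{"category": "books", "amount": 10}, {"category": "toys", "amount": 5}],
    [{"category": "books", "amount": 20}],
    [],
]
total, count = scatter_gather(partitions, "books")
```

Only one small tuple per node crosses the network in this scheme, instead of every matching record being shipped to a single processor.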