
Chapter 3

NoSQL
-Ahlam A.
NoSQL
Contents
• 3.1 Introduction to NoSQL, NoSQL Business Drivers
• 3.2 NoSQL Data Architecture Patterns: Key-value stores, Graph stores, Column family (Bigtable) stores, Document stores, Variations of NoSQL architectural patterns, NoSQL Case Study
• 3.3 NoSQL solution for big data: Understanding the types of big data problems; Analyzing big data with a shared-nothing architecture; Choosing distribution models: master-slave versus peer-to-peer; NoSQL systems to handle big data problems.
What is NoSQL?
• When people use the term “NoSQL database”, they typically use it to refer to any non-
relational database.
• Some say the term “NoSQL” stands for “non SQL” while others say it stands for “not only
SQL.”
• NoSQL databases are databases that store data in a format other than relational tables.
• A common misconception is that NoSQL databases or non-relational databases don’t
store relationship data well.
• NoSQL databases can store relationship data; they just store it differently than
relational databases do. In fact, many find modeling relationship data in NoSQL
databases easier than in SQL databases, because related data doesn't have to be
split between tables.
• NoSQL data models allow related data to be nested within a single data structure.
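For example, a hypothetical blog post and its comments can live in one nested record instead of being split across a posts table and a comments table (a minimal Python sketch; all field names are illustrative):

# Illustrative only: a blog post and its comments kept in one nested structure,
# rather than split across a "posts" table and a "comments" table.
post = {
    "post_id": "p42",
    "title": "Why NoSQL?",
    "comments": [  # related data nested inside the parent record
        {"user": "jane_doe", "text": "Great overview!"},
        {"user": "sam_smith", "text": "What about graph stores?"},
    ],
}

# One lookup returns the post together with all of its comments; no JOIN needed.
print(post["title"], "has", len(post["comments"]), "comments")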
Why NoSQL?

Traditional (Relational) Databases
● Best for: Predictable, structured data
● Strengths:
  ○ Strong ACID properties (Atomicity, Consistency, Isolation, Durability)
  ○ Structured Query Language (SQL)
  ○ Ideal for transactions and complex queries
● Limitations:
  ○ Less agile
  ○ Limited scalability

NoSQL Databases
● Best for: Unstructured, unpredictable data
● Strengths:
  ○ High performance & high availability
  ○ Rich query languages
  ○ Easy scalability
  ○ Distributed & fault-tolerant architecture
● Properties:
  ○ BASE (Basically Available, Soft state, Eventual consistency)
  ○ May not guarantee ACID properties

Why NoSQL?
● Agile system for dynamic data
● Supports modern application needs
SQL vs NoSQL
ACID properties
1. Atomicity:

This ensures that a transaction is treated as a single unit, which means it either
completes fully or doesn't complete at all. If any part of the transaction fails, the
entire transaction is rolled back, and the system remains unchanged.
ACID properties
2. Consistency:

Consistency ensures that a transaction moves the database from one valid state
to another valid state, maintaining the integrity of the data. This means that after a
transaction completes, the data must follow all the defined rules, such as
constraints, cascades, triggers, etc.
ACID properties
3. Isolation:

Isolation ensures that multiple transactions occurring simultaneously do not
interfere with each other. Each transaction is executed as if it were the only
transaction being processed by the system at that time. This prevents conflicts or
inconsistencies when multiple users are interacting with the system.
ACID properties
4. Durability:

Durability ensures that once a transaction is committed, it remains permanent,
even if the system crashes immediately afterward. The data is saved to persistent
storage, so it will survive any failures.
ACID in a Banking Transaction

1. Atomicity: The transfer of $100 either completes fully (money is both
deducted and added), or nothing happens.
2. Consistency: The total balance across accounts remains correct and
consistent with the system rules.
3. Isolation: Multiple transfers happening at the same time don’t interfere
with each other; each transaction behaves as if it’s the only one
happening.
4. Durability: Once the transaction is completed, it’s permanently
recorded, even in the event of a system crash.
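A minimal sketch of the same $100 transfer using Python's built-in sqlite3 module (the accounts table and balances are made up for illustration): the transaction block gives atomicity, the balance check enforces the consistency rule, and the committed result is what durability would preserve on real persistent storage.

import sqlite3

conn = sqlite3.connect(":memory:")  # illustrative in-memory database
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 500), ("bob", 200)])
conn.commit()

def transfer(src, dst, amount):
    try:
        with conn:  # one transaction: commits on success, rolls back on any error (atomicity)
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
            # Consistency rule: no account may go negative.
            (balance,) = conn.execute("SELECT balance FROM accounts WHERE id = ?", (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
    except ValueError:
        pass  # the whole transfer was rolled back; neither balance changed

transfer("alice", "bob", 100)    # completes fully: alice 400, bob 300
transfer("alice", "bob", 1000)   # fails and rolls back: balances stay 400 / 300
print(conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall())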
BASE (Basically Available, Soft state, Eventual consistency)

1. Basically Available:
This means the system is guaranteed to be available for serving requests, but it might not
guarantee immediate consistency of data across all nodes.

Example:

Consider a NoSQL system like Cassandra being used by a social media platform. During a
massive spike in traffic (e.g., a viral event or celebrity live stream), the system might
experience high load, but it will still be "basically available" to users. Even if the system is
under strain, users can still post messages, share photos, or interact with others. However,
the data might not be consistent across all the nodes immediately, but the system remains
operational and responsive.
BASE (Basically Available, Soft state, Eventual consistency)

2. Soft State:
The state of the system can change over time, even without new inputs, because the system does
not have strict consistency rules like in ACID systems. Data can be replicated across multiple
nodes, and the state might temporarily differ between nodes as updates are propagated across
the system.

Example: Imagine a NoSQL database that stores product information for an e-commerce website.
If a product's price is updated in one region's server, it might take some time for this update to
propagate across all servers globally. During this propagation window, the system is in a "soft
state" where the product's price might be different depending on which server a user’s request is
handled by. This state is temporary until all nodes eventually synchronize the updated price.
BASE (Basically Available, Soft state, Eventual consistency)

3. Eventual Consistency:
Instead of immediate consistency (as in ACID systems), BASE systems provide eventual
consistency. This means that, given enough time, all updates will eventually propagate to all
nodes, and the system will become consistent again, but not immediately.

Example: A NoSQL system like Amazon DynamoDB is used to manage inventory for a
global e-commerce site. If an item’s stock count is updated in one part of the world (e.g., a
warehouse in the U.S.), this update may not immediately be reflected in another region’s
database (e.g., Europe or Asia). However, over time, the system will propagate the stock
count update across all regions, and the system will eventually reach a consistent state
where every node reflects the updated stock count.
Global Food Delivery App
Imagine a global food delivery app using a NoSQL database like Couchbase, where users
can order food from various restaurants in different locations:

● Basically Available: No matter the load, users can always place orders or browse
menus. The system remains available even during peak traffic.
● Soft State: A restaurant updates its menu with new dishes. Some users might see the
updated menu right away, while others might see the old menu for a short period of
time. This reflects a "soft state" where not all nodes are synchronized yet.
● Eventual Consistency: Over time, all users will see the updated menu, as the
changes propagate across the distributed nodes. Eventually, all data will become
consistent across the system, but there’s no guarantee that consistency will happen
immediately.
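A toy simulation of the three BASE properties (no real database involved; the replica names and the propagate step are invented purely to illustrate the idea):

# Three replica nodes, each holding a copy of the restaurant menu.
replicas = {"us": {"menu": ["pizza"]},
            "eu": {"menu": ["pizza"]},
            "asia": {"menu": ["pizza"]}}
pending = []  # updates waiting to be copied to the other nodes

def write(node, key, value):
    """Basically Available: the local node accepts the write immediately."""
    replicas[node][key] = value
    pending.append((key, value))

def read(node, key):
    """Soft state: a read may return stale data until propagation finishes."""
    return replicas[node][key]

def propagate():
    """Eventual consistency: pending updates reach every replica, eventually."""
    while pending:
        key, value = pending.pop(0)
        for node in replicas:
            replicas[node][key] = value

write("us", "menu", ["pizza", "sushi"])        # a restaurant updates its menu in one region
print(read("us", "menu"), read("eu", "menu"))  # us sees the new menu, eu still sees the old one
propagate()                                    # replication catches up
print(read("eu", "menu"))                      # now every node agrees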
CAP Theorem
Consistency, Availability, Partition tolerance (CAP)
theorem, also called Brewer's theorem, states
that it is not possible for a distributed system to
provide all three of the following guarantees
simultaneously:
1. Consistency guarantees that all storage nodes and
their replicas have the same data at the same
time.
2. Availability means every request is guaranteed to
receive a success or failure response.
3. Partition tolerance guarantees that the system
continues to operate in spite of arbitrary partitioning
due to network failures.
Consistency
Example:

Suppose a customer buys the last item of a particular product from an e-commerce
website. If the database is consistent, all data nodes across the globe
will reflect that the product is out of stock at the same time. So if another customer
in a different country tries to purchase the same product immediately after, the
system will deny the purchase because the consistent data across the nodes
already reflects that the item is sold out.
Availability
Example:

Let’s say there’s a massive sale happening, and thousands of users are making
purchases simultaneously. Availability ensures that the system will respond to
every single user's transaction request, whether the transaction is successful (e.g.,
order processed) or failed (e.g., out of stock). Even during high-load periods, the
system will always provide a response, never leaving users wondering if their
action was completed.
Partition tolerance
Example:

Imagine that there is a network failure between the company’s North American
and European data centers due to a fiber cut. Partition tolerance guarantees that
both the North American and European data centers can continue to operate
independently, processing transactions and updating records. For example,
customers in North America can still make purchases, and customers in Europe
can do the same, even though the two data centers can't communicate with each
other at the moment. Once the network issue is fixed, the system will synchronize
the data, resolving any inconsistencies that might have arisen during the partition.
The figure represents which database systems prioritize specific properties at a given time.
NoSQL Business Drivers
The Need for Scalable Data Storage

Modern businesses demand:
● Reliable and scalable data storage
● Rapid data access and analysis
● Ability to handle vast amounts of diverse data
● Real-time insights for informed decision-making

Traditional systems (single CPU, RDBMS) are struggling to keep pace.


Continued..
Pressure on the RDBMS from the business drivers velocity, volume, variability, and agility necessitates the emergence of NoSQL solutions. All of these drivers apply pressure to the single-CPU relational model and eventually make the system less stable.
Why NoSQL for Big Data?
1. Velocity: NoSQL databases can handle the high speed at which data is generated and ingested,
making them suitable for real-time processing and analytics.
2. Volume: NoSQL's ability to scale horizontally across multiple nodes makes it ideal for managing the
massive volumes of data typical in Big Data environments.
3. Variability: NoSQL databases provide flexible data models that can accommodate diverse data
formats without requiring strict schema definitions, unlike relational databases.
4. Agility: NoSQL solutions allow for rapid development and evolution of applications, as they support
dynamic schema changes, making them well-suited for fast-moving businesses.

These business drivers are why NoSQL solutions have become essential in modern Big Data
applications. They help organizations manage the speed, scale, diversity, and evolution of data
effectively, where traditional relational databases often fall short.
1. Velocity
Velocity refers to the speed at which data is generated, processed, and analyzed. In today’s world,
data streams in continuously and at high speeds, requiring databases that can handle rapid
ingestion, real-time processing, and fast responses.

Example: Consider a real-time analytics platform for an online stock trading system. The platform
continuously receives price updates, trades, and market data from various sources. These
updates need to be processed in real-time to provide traders with current market conditions.
Traditional relational databases struggle with this level of velocity because of their complexity and
constraints. NoSQL databases such as Cassandra, often paired with streaming platforms like Apache
Kafka, are designed for high throughput, enabling the real-time ingestion and analysis of vast amounts
of rapidly incoming data.

Why NoSQL? NoSQL databases can efficiently handle high-velocity data streams by distributing
workloads across multiple nodes and processing data concurrently, ensuring that data ingestion
and response times remain fast.
2. Volume
Volume refers to the sheer scale of data that organizations collect and store, often in
terabytes or petabytes. Big Data requires systems that can store and manage massive
amounts of unstructured, semi-structured, and structured data.

Example: Social media platforms like Facebook or Twitter generate vast amounts of data
daily, from user posts and comments to likes, shares, and multimedia uploads. This data is
not only massive in scale but also varied in structure. Traditional databases are not designed
to handle such enormous volumes of data efficiently.

Why NoSQL? NoSQL databases, such as HBase or MongoDB, are built to scale
horizontally, meaning they can handle the growing volume of data by adding more servers
rather than upgrading the existing ones. This makes them ideal for handling large-scale data
volumes, providing flexible data models to accommodate the various forms of data being
stored.
3. Variability
Variability refers to the different forms and formats of data, ranging from structured tables to
unstructured text, videos, images, logs, and more. Modern data sources produce data that
does not fit neatly into the rows and columns of traditional relational databases.

Example: An e-commerce platform, such as Amazon, deals with a wide variety of data
formats. It collects structured data (e.g., product details, prices) alongside unstructured data
(e.g., user reviews, images, videos), semi-structured data (e.g., JSON or XML), and
machine-generated data (e.g., server logs, clickstreams). Traditional relational databases
would struggle to adapt to these varying data formats without significant overhead.

Why NoSQL? NoSQL databases, such as Couchbase or Elasticsearch, provide flexible
data models (e.g., document-oriented, key-value, graph-based) that allow for the storage
and processing of different types of data without requiring predefined schemas. This makes
them ideal for handling diverse and variable data formats.
4. Agility
Agility refers to the need for businesses to quickly adapt to changing requirements, deploy new
applications, and innovate rapidly. The rigidity of traditional relational databases often limits
flexibility when business models evolve or data structures change.

Example: Consider a startup developing a personalized recommendation engine for users based
on their behavior on a streaming platform (like Netflix). The recommendation engine needs to
adapt quickly as new features are added (e.g., new types of content, user preferences, watch
histories). The startup needs a database system that allows for rapid development, schema
changes, and continuous iteration without being locked into a rigid structure.

Why NoSQL? NoSQL solutions, such as DynamoDB or Neo4j, allow for schema flexibility,
enabling developers to make changes to the data model without significant downtime or
restructuring. This agility helps businesses iterate quickly, deploy new features, and respond to
market changes without being held back by database constraints.
NoSQL Data Architecture Patterns / NoSQL Solutions for Big Data
Types
1. Key-Value Store: Simple key-value pairs for quick lookups (e.g., Redis,
DynamoDB).
2. Column Store: Organizes data by columns, ideal for analytics (e.g.,
Cassandra, HBase).
3. Document Store: Stores flexible, JSON-like documents for complex and
dynamic data (e.g., MongoDB, Couchbase).
4. Graph Store: Optimized for storing and querying complex relationships
between entities (e.g., Neo4j, Amazon Neptune).

These NoSQL data stores are designed to handle the diverse needs of modern
applications, offering flexibility, scalability, and performance advantages over
traditional relational databases.
1. Key-Value Store
A Key-Value Store is the simplest type of NoSQL database. It stores data as a collection of key-
value pairs, where the key is a unique identifier, and the value can be anything from a simple
string to a more complex data structure like JSON.

● How it works: Think of it like a dictionary or a hash table where you look up a value using a
unique key.
● Example: Consider a caching system for a shopping website:
○ Key: user123_cart
○ Value: { "item1": "Laptop", "item2": "Phone", "item3": "Headphones" }
● Here, the key is the unique identifier for the user’s cart, and the value is a list of items they
have added to their cart.
● Use Case: Key-Value stores are great for applications requiring fast lookups of data by a
unique identifier, such as session management, caching, or storing user preferences.
● Popular Databases:
○ Redis (in-memory key-value store, often used for caching)
○ DynamoDB (AWS-managed key-value store)
○ Riak
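A minimal sketch of the shopping-cart example with the redis-py client, assuming a Redis server is running locally on the default port (the key name and cart contents are illustrative):

import json
import redis  # third-party client: pip install redis

# Assumes a Redis server running locally on the default port.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Key: unique identifier for the user's cart; Value: an opaque blob (here, JSON).
cart = {"item1": "Laptop", "item2": "Phone", "item3": "Headphones"}
r.set("user123_cart", json.dumps(cart))

# Lookup by key is a single operation: no query planning, no joins.
print(json.loads(r.get("user123_cart"))["item2"])   # -> "Phone"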
2. Column Store
A Column Store organizes data into columns rather than rows, as in traditional relational databases. This
allows for fast retrieval of data for analytical queries, as entire columns can be accessed independently of
other columns.

● How it works: Instead of storing data by rows, a column store organizes data in columns. It groups
together data from the same column, which is highly efficient for read-heavy workloads.
● Example: Suppose you are storing data about customers:
○ Column 1: Customer Name ("John", "Jane", "Sam")
○ Column 2: Age (30, 25, 40)
○ Column 3: Country ("USA", "UK", "Canada")
● Instead of storing this data row by row, each column is stored together, making it easy to query
specific columns, like getting the ages of all customers.
● Use Case: Column stores are highly optimized for read-heavy operations and are ideal for
analytical applications where queries involve large data sets (e.g., data warehouses).
● Popular Databases:
○ Apache HBase (built on top of Hadoop)
○ Cassandra (a distributed column store often used for large-scale applications)
○ Google Bigtable
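The same customer data laid out column-wise, sketched with plain Python lists (purely illustrative; real column stores such as Cassandra or HBase add partitioning, compression, and on-disk column families on top of this idea):

# Row-oriented layout: one record per customer.
rows = [
    {"name": "John", "age": 30, "country": "USA"},
    {"name": "Jane", "age": 25, "country": "UK"},
    {"name": "Sam",  "age": 40, "country": "Canada"},
]

# Column-oriented layout: each column stored together.
columns = {
    "name":    ["John", "Jane", "Sam"],
    "age":     [30, 25, 40],
    "country": ["USA", "UK", "Canada"],
}

# An analytical query ("average age of all customers") touches exactly one column,
# instead of scanning every field of every row.
print(sum(columns["age"]) / len(columns["age"]))   # -> 31.66...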
3. Document Store
A Document Store stores data in the form of documents, typically using formats like JSON, BSON, or XML. Each
document is self-contained and can contain structured or semi-structured data. Document stores are known for their
flexibility and schema-less nature.

● How it works: Each document is like a record that can contain nested structures, arrays, and various data types.
The database allows querying and indexing on the document's content.
● Example: Consider a collection of user profiles for a social media app: Document 1

{
"username": "john_doe",
"age": 30,
"friends": ["jane_doe", "sam_smith"]
}
Each document contains a complete record of the user’s information, which can be easily queried or updated without
affecting other documents.
Use Case: Document stores are widely used for content management systems, catalogs, and applications that need
flexible schema and complex data structures.
Popular Databases:

● MongoDB (stores data as JSON-like documents)


● Couchbase (a distributed document store)
● RethinkDB
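A short sketch of the user-profile example with the pymongo driver, assuming a MongoDB instance on localhost (the database and collection names are made up):

from pymongo import MongoClient  # third-party driver: pip install pymongo

# Assumes MongoDB running locally; "social" / "profiles" are illustrative names.
client = MongoClient("mongodb://localhost:27017")
profiles = client["social"]["profiles"]

# Each document is a self-contained record; no schema has to be declared first.
profiles.insert_one({
    "username": "john_doe",
    "age": 30,
    "friends": ["jane_doe", "sam_smith"],
})

# Query and update by the document's content.
doc = profiles.find_one({"username": "john_doe"})
print(doc["friends"])

profiles.update_one({"username": "john_doe"},
                    {"$push": {"friends": "new_friend"}})  # append to the nested array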
4. Graph Store
A Graph Store is designed to store and query graph-based data structures, such as nodes (entities) and edges
(relationships between entities). Graph stores are used when relationships between data points are as important
as the data itself.

● How it works: Graph databases store data as nodes (representing entities, like people or objects) and
edges (representing relationships between those entities). These databases excel at querying relationships
between entities, which makes them ideal for complex queries involving networks of data.
● Example: Consider a social network where users are connected to each other:
○ Node 1: john_doe
○ Node 2: jane_doe
○ Edge (relationship): john_doe FRIENDS_WITH jane_doe
● The graph database allows you to traverse relationships efficiently, such as finding all the friends of
john_doe or finding the shortest path between two users.
● Use Case: Graph stores are perfect for social networks, recommendation engines, fraud detection, and any
application where understanding and querying relationships between entities is critical.
● Popular Databases:
○ Neo4j (popular graph database for complex relationships)
○ Amazon Neptune (AWS-managed graph database)
○ ArangoDB (a multi-model database supporting graph, document, and key-value models)
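The social-network example reduced to nodes and edges in plain Python, with a breadth-first search for the shortest path between two users (illustrative only; a real graph store such as Neo4j answers such queries natively through its query language and index-free adjacency):

from collections import deque

# Nodes are users; edges are FRIENDS_WITH relationships (stored in both directions).
friends = {
    "john_doe": ["jane_doe", "sam_smith"],
    "jane_doe": ["john_doe", "alex_lee"],
    "sam_smith": ["john_doe"],
    "alex_lee": ["jane_doe"],
}

def shortest_path(start, goal):
    """Breadth-first traversal over the relationship edges."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in friends.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(friends["john_doe"])                      # all friends of john_doe
print(shortest_path("sam_smith", "alex_lee"))   # -> ['sam_smith', 'john_doe', 'jane_doe', 'alex_lee']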
NoSQL Case Studies
1. Amazon DynamoDB,
2. Google’s BigTable,
3. MongoDB, and
4. Neo4j
Amazon DynamoDB
Use Case:
Airbnb, a global leader in the vacation rental space, needed a scalable database solution to store and manage
millions of listings, reservations, and user data. Their infrastructure had to handle high availability and scalability to
accommodate the vast and fluctuating demand of their global user base.
Challenges:
● Airbnb handles millions of transactions daily, with spikes in activity during peak times such as holidays or special
events.
● The need for low-latency responses when users search for listings, make bookings, or update their profiles.
Solution:
● Airbnb adopted Amazon DynamoDB as its primary database for handling user sessions, activity feeds, and
real-time data. DynamoDB's ability to automatically scale and distribute data across multiple servers globally
allowed Airbnb to maintain high availability and low-latency responses regardless of traffic spikes.
● DynamoDB Streams helped Airbnb implement real-time notifications and updates for users by capturing and
processing data changes as they occurred.
Results:
● Airbnb achieved the necessary scalability to handle millions of users across the world while ensuring a
seamless experience. The real-time nature of DynamoDB improved Airbnb’s ability to handle session data,
leading to better performance and user satisfaction.
Google’s BigTable
Use Case:
Snapchat, a popular social media platform, generates enormous amounts of data daily, including messages, media, and
user interactions. Managing this data at scale while ensuring rapid access and storage required a database that could
handle immense workloads and deliver real-time analytics.

Challenges:

● Snapchat's system generates over a billion pieces of content daily, including ephemeral messages and media.
● They needed a solution that could offer low-latency access and real-time analytics while scaling effortlessly to meet
growing demand.

Solution:

● Google BigTable was chosen as the primary database for storing user profiles, messages, and multimedia content.
BigTable’s distributed architecture allowed Snapchat to manage huge datasets across multiple locations while
delivering fast read/write capabilities.
● The database’s integration with Google Cloud enabled real-time data processing, allowing Snapchat to analyze
user interactions, deliver content recommendations, and manage messaging.

Results:

● Snapchat could scale to handle its rapid user growth and high data demands while maintaining real-time messaging
and low-latency performance. BigTable’s efficiency in managing large-scale time-series data contributed to
improved user experiences on the platform.
MongoDB
Use Case:
eBay, one of the world's largest online marketplaces, needed a flexible and scalable database system for handling its huge product
catalog. With millions of users listing products, making purchases, and searching the platform, eBay required a system that could
handle unstructured and semi-structured data efficiently.

Challenges:

● eBay had to deal with a vast and constantly changing product catalog, including user-generated content such as reviews and
feedback.
● The system needed to support multiple types of data formats (text, images, metadata) while maintaining high performance and
availability.

Solution:

● eBay adopted MongoDB as a core component of its product catalog management system. MongoDB’s flexible schema
allowed eBay to store product listings, images, and customer reviews in a scalable, document-oriented model.
● The use of sharding in MongoDB allowed eBay to distribute data across multiple servers, ensuring continuous availability and
performance as the user base grew.

Results:

● MongoDB’s flexibility enabled eBay to handle a wide range of data formats, making it easier to expand its product catalog
without being constrained by a rigid schema. This led to faster product updates and an improved user experience in searching
and browsing.
Neo4j
Use Case:
LinkedIn, the professional networking platform, required a database system optimized for handling complex relationships between
users, companies, job listings, and content. LinkedIn’s recommendation engine, which connects users with people and
opportunities, relies heavily on understanding and querying relationships.

Challenges:

● LinkedIn needed a database that could efficiently handle large, complex graphs of relationships, with millions of users and
billions of connections.
● Traditional relational databases struggled with efficiently traversing and analyzing these interconnected relationships at
scale.

Solution:

● Neo4j, a leading graph database, was selected for LinkedIn’s recommendation and connection services. Neo4j’s graph-
based model allowed LinkedIn to represent users, jobs, skills, and companies as nodes, with relationships (e.g.,
connections, recommendations, endorsements) as edges.
● Neo4j’s query language, Cypher, allowed LinkedIn to efficiently query relationships and suggest new connections or job
opportunities to users based on their existing networks (a sketch of such a query follows this case study).

Results:

● Neo4j enabled LinkedIn to improve the performance of its recommendation engine, making it faster and more accurate in
suggesting relevant connections and content to users. The graph database’s ability to traverse complex relationships quickly
led to more personalized user experiences.
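For illustration, the kind of "friend of a friend" Cypher query such a recommendation service might run, issued through the official neo4j Python driver; the connection details, node labels, and property names below are assumptions, not LinkedIn's actual schema:

from neo4j import GraphDatabase  # official driver: pip install neo4j

# Assumed connection details for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Suggest people connected to my connections whom I am not yet connected to myself.
CYPHER = """
MATCH (me:Person {name: $name})-[:CONNECTED_TO]->(c:Person)-[:CONNECTED_TO]->(candidate:Person)
WHERE NOT (me)-[:CONNECTED_TO]->(candidate) AND candidate <> me
RETURN candidate.name AS suggestion, count(*) AS mutual
ORDER BY mutual DESC LIMIT 5
"""

with driver.session() as session:
    for record in session.run(CYPHER, name="john_doe"):
        print(record["suggestion"], record["mutual"])

driver.close()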
Scalable Database Architectures in IoT and System Design

● Database architectures, such as key-value stores, column family stores, document
stores, and graph stores, can be adapted to fit system needs, whether distributed
(across multiple servers) or federated (managing independent databases).
● In IoT, a virtual sensor integrates multiple real sensor data streams into one, with
results stored temporarily or permanently depending on application requirements,
using data-centric IoT middleware.
● Scalable architectural patterns, such as load balancers with shared-nothing
architecture or distributed hash tables (e.g., Cassandra), support the creation of new
scalable systems.
● System architecture can be tailored based on needs like agility, availability,
intelligence, scalability, collaboration, and low latency.
● Technologies supporting these requirements include virtualization, cloud computing,
Big Data platforms, in-memory databases, and machine learning.
Using NoSQL to Manage Big Data

● What is a Big Data NoSQL Solution?
● Understanding Types of Big Data Problems
● Analyzing Big Data with a Shared-Nothing Architecture
● Choosing Distribution Models
● Four Ways that NoSQL Systems Handle Big Data Problems
What is a Big Data NoSQL Solution?
A decade ago, companies like Google, Amazon, Facebook, and LinkedIn began deploying NoSQL databases to address the growing
demands of mobile devices, IoT, and cloud infrastructure. Traditional relational databases struggled with scalability and performance, driving
enterprises toward NoSQL solutions for their flexibility and ability to handle large-scale, dynamic data.

Case Studies Highlighting NoSQL Use Cases:

1. Recommendation Systems: NoSQL’s flexible data models and low-latency capabilities make it ideal for real-time personalization in e-
commerce, media, and travel industries.
2. User Profile Management: As user data grows in complexity and volume, NoSQL scales easily, offering fast read/write performance
that RDBMS cannot match.
3. Real-Time Data Handling: NoSQL, when combined with Hadoop, provides a solution for real-time operational data processing,
enhancing efficiency and revenue.
4. Content Management: NoSQL excels in managing diverse content types, such as images, videos, and user-generated content,
which RDBMS cannot handle effectively.
5. Catalog Management: NoSQL simplifies the aggregation of catalog data across multiple applications, managing complexity with its
flexible data models.
6. 360-Degree Customer View: By integrating structured and unstructured data from various sources, NoSQL enables a
comprehensive, up-to-date customer view, supporting improved service and sales opportunities.
7. Mobile Applications: NoSQL supports the scalability, performance, and availability needed for mobile apps, enabling rapid
deployment and expansion.
8. Internet of Things (IoT): NoSQL handles the volume, variety, and velocity of IoT data, offering scalability and real-time data access for
connected devices.
9. Fraud Detection: NoSQL’s low-latency and integrated cache mechanisms enable rapid data processing for real-time fraud detection,
which is critical in financial services.
Understanding Types of Big Data Problems
Polyglot Persistence and NoSQL in Modern Data Management

With the rise of e-commerce, a single database no longer suffices for all web applications. Polyglot persistence adopts a hybrid
approach, leveraging the strengths of multiple data stores to select the best option for each data type and purpose. This approach is
particularly relevant for big data, which varies in type and requires different NoSQL systems.

Big data can be classified as either "Read (mostly)" or "Read-Write." "Read-Write" data, often transactional and requiring high
availability, aligns with ACID properties, while "Read (mostly)" data, rarely changed after being read, is typical for data warehouse
applications and involves large datasets like clickstreams, sensor data, and documents.

NoSQL databases enable the cost-effective storage and analysis of large volumes of data, such as log files that record events with
timestamps. Unlike traditional RDBMS, which is more suited for ERP systems, NoSQL databases, such as key-value, document, and
column-family types, are aggregate-oriented, allowing for easy distribution across clusters and efficient retrieval.

In document databases, entire documents can be queried, similar to SQL rows, enabling complex reports that combine both traditional
and NoSQL data. For instance, a single query might extract all authors of presentations on a topic and then retrieve additional details
like skills and contact information from an RDBMS.
Analyzing Big Data with a Shared-Nothing Architecture
Resource Sharing in Distributed Computing Architectures

In distributed computing, resources can be shared in three main architectures: shared RAM, shared disk,
and shared-nothing. Each architecture suits different big data problems:

● Shared RAM: Multiple CPUs access a single shared RAM via a high-speed bus, ideal for large
computations and graph stores. Fast graph traversal requires the entire graph to be in main
memory.
● Shared Disk: Processors have independent RAM but share disk space via a Storage Area Network
(SAN).
● Shared-Nothing: This architecture, where no resources are shared, is commonly used in big data
with commodity machines.

Among architectural patterns, only key-value stores and document stores are cache-friendly. BigTable,
which scales well, uses a shared-nothing architecture and has row-column identifiers similar to key-value
stores. In contrast, row stores and graph stores are not cache-friendly as they cannot be referenced by
short keys.
Choosing Distribution Models
Data Distribution in NoSQL Databases: Sharding and Replication

NoSQL databases simplify data distribution by moving only aggregate data, not all related data. There are two
primary data distribution methods: Sharding and Replication. Systems may use one or both techniques.

1. Sharding: Involves horizontal partitioning of a large database into smaller pieces called shards, each
containing a portion of the data. Shards can be distributed across different servers or physical locations (see the routing sketch after this slide).
2. Replication: Copies entire data across multiple servers, ensuring data availability in multiple locations.
Replication comes in two forms:
○ Master-Slave Replication: A single master node handles writes, while slave nodes synchronize with
the master and handle reads. This reduces update conflicts but can lead to a single point of failure
(SPOF) if the master node crashes. Solutions include using RAID drives or a standby master.
○ Peer-to-Peer Replication: Allows any node to handle writes, with nodes coordinating to synchronize
data. This model avoids SPOF but increases communication overhead due to the need for nodes to
stay synchronized.

The choice of distribution model depends on business requirements. For batch processing, master-slave is
preferred, while peer-to-peer is suitable for high availability. For example, Hadoop's initial version used a master-
slave architecture, while Cassandra uses a peer-to-peer model.
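As a toy illustration of the sharding half of this picture, a hash-based router that deterministically sends each aggregate key to one of several shards (the shard names are made up; real systems layer replication and rebalancing on top of this):

import hashlib

SHARDS = ["shard-a", "shard-b", "shard-c"]   # e.g. three servers or physical locations

def shard_for(key: str) -> str:
    """Horizontal partitioning: each key deterministically maps to one shard."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

for user in ["user123_cart", "user456_cart", "user789_cart"]:
    print(user, "->", shard_for(user))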
Four Ways that NoSQL Systems Handle Big Data Problems
Key Concepts in Managing Big Data with NoSQL

Modern businesses require advanced data management systems like NoSQL to handle the vast amounts of data generated by web, mobile
applications, and social networking platforms. Below are key strategies employed by NoSQL systems:

1. Moving Queries to the Data: Unlike traditional RDBMS, which transfer large amounts of data to a central processor, NoSQL improves
performance by moving queries to each node where data resides. This reduces network load, as only the query and results traverse the
network.
2. Consistent Hashing in Clusters: Consistent hashing helps manage data across a distributed system. Servers and data keys are hashed to
distribute data evenly across the system. When a server fails or is added, only a portion of the data is reassigned, minimizing disruption (see the ring sketch after this list).
3. Replication for Scaling Reads: Replication enhances read performance by distributing the load across multiple servers. It is particularly
useful in environments with high read and low write operations. However, challenges include replication lag and potential data inconsistency
between reads and writes.
4. Query Distribution in Shared-Nothing Architecture: NoSQL leverages a shared-nothing architecture where all nodes in a cluster are
peers, and data is evenly distributed using sharding. This approach avoids bottlenecks and allows the system to continue functioning even if
a node fails. Auto-sharding automates data distribution, making it more efficient than manual sharding.
5. Querying Large Datasets: NoSQL document databases use key access patterns to navigate related documents without requiring JOIN
operations. For instance, an employee record can be accessed using EmployeeId, and related information like department details can be
retrieved using DepartmentId.
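A compact sketch of the consistent-hashing idea from point 2 above (the node names are invented; production systems add virtual nodes and replication): servers and keys are hashed onto the same ring, so adding a server reassigns only the keys that fall between it and its predecessor rather than rehashing everything.

import bisect
import hashlib

def h(value: str) -> int:
    """Hash both servers and data keys onto the same integer ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self._points = sorted((h(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's position to the first server."""
        hashes = [p for p, _ in self._points]
        i = bisect.bisect(hashes, h(key)) % len(self._points)
        return self._points[i][1]

    def add(self, node):
        bisect.insort(self._points, (h(node), node))

keys = [f"user{i}" for i in range(1000)]
ring = Ring(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add("node-d")                       # a server joins the cluster
after = {k: ring.node_for(k) for k in keys}
moved = sum(before[k] != after[k] for k in keys)
print(f"{moved} of {len(keys)} keys reassigned")  # only a fraction move, not all of them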
Group Discussion
1. NoSQL solution for big data,
2. Understanding the types of big data problems;
3. Analyzing big data with a shared-nothing
architecture;
4. Choosing distribution models: master-slave versus
peer-to-peer;
5. NoSQL systems to handle big data problems.
