Distributed Storage Systems
Cloud Computing
Spring 2025
Introduction
• In cloud computing, storage is not confined to a single server or
location.
• Distributed storage systems enable reliable, scalable, and high-
performance data storage across a network of machines.
• These systems underpin many cloud services and are fundamental to
supporting modern applications that require access to large-scale,
highly available data.
• This chapter explores the various facets of distributed storage in the
cloud, from fundamental storage services to advanced architectures
for real-time processing and disaster recovery.
Cloud Storage Services
Cloud providers offer highly scalable and durable storage solutions for
unstructured data. Key services include:
• Amazon S3
• Google Cloud Storage
• Azure Blob Storage
Amazon S3
• Amazon Simple Storage Service (S3) is an object storage service that
offers high scalability, availability, and security, and is designed for
99.999999999% (11 nines) durability.
• S3 organizes data into buckets and allows users to store and retrieve
any amount of data at any time.
• Key features include versioning, lifecycle policies, cross-region
replication, encryption, and fine-grained access control.
• Integrates with AWS analytics and compute services.
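To put the durability figure in perspective, the following back-of-the-envelope sketch treats 11 nines as an annual per-object survival probability and assumes that object losses are independent, which is a simplification. The function name and the object count are illustrative, not part of any AWS API.

```python
def expected_annual_loss(num_objects: int, nines: int = 11) -> float:
    """Expected number of objects lost per year.

    Assumes each object is lost in a given year with probability
    10**-nines and that losses are independent (a simplification).
    """
    annual_loss_probability = 10.0 ** -nines
    return num_objects * annual_loss_probability


# Illustrative workload: ten million stored objects.
loss = expected_annual_loss(10_000_000)
print(f"Expected objects lost per year: {loss:.4f}")
print(f"About one object every {1 / loss:,.0f} years")
```

Under these assumptions, ten million objects work out to roughly one lost object every 10,000 years.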
Google Cloud Storage
• Google Cloud Storage offers unified object storage for developers and
enterprises.
• It provides multiple storage classes (Standard, Nearline, Coldline,
Archive) designed for different access frequencies.
• Features include strong consistency, automatic redundancy across
zones or regions (depending on the bucket's location type), and
integration with other Google Cloud services such as BigQuery and
AI/ML tools.
Azure Blob Storage
• Azure Blob Storage is Microsoft’s object storage solution for the
cloud.
• It is optimized for storing massive amounts of unstructured data such
as text and binary data.
• Blob Storage offers access tiers including hot, cool, and archive,
enabling cost-effective storage based on usage patterns.
• Supports block blobs, append blobs, and page blobs.
• Integrates with Azure Data Lake Storage for analytics.
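The idea of matching a tier to usage patterns can be sketched as a simple rule on how long an object has been idle. This is a minimal illustration, not Azure's lifecycle management API: the function name and the 30-day and 180-day thresholds are assumptions chosen for the example.

```python
from datetime import date


def choose_access_tier(last_accessed: date, today: date,
                       cool_after_days: int = 30,
                       archive_after_days: int = 180) -> str:
    """Pick a tier from idle time. Thresholds are hypothetical defaults."""
    idle_days = (today - last_accessed).days
    if idle_days >= archive_after_days:
        return "archive"   # rarely read, cheapest to store
    if idle_days >= cool_after_days:
        return "cool"      # infrequently read
    return "hot"           # actively used


print(choose_access_tier(date(2025, 1, 1), date(2025, 3, 1)))
```

In practice the thresholds would be tuned against the provider's storage, retrieval, and early-deletion charges.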
Distributed File Systems
Distributed file systems enable large-scale data storage across clusters.
Key systems include:
• Hadoop Distributed File System (HDFS)
• Ceph
• Lustre
Hadoop Distributed File System (HDFS)
• HDFS is a scalable, fault-tolerant distributed file system designed to
run on commodity hardware.
• It divides large files into blocks and distributes them across nodes in a
cluster.
• Each block is replicated across nodes (three copies by default) to
ensure fault tolerance, durability, and availability.
• Designed for batch processing frameworks such as MapReduce and
optimized for large, sequential reads.
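The block splitting and replication described above can be sketched as follows. The 128 MiB block size and replication factor of 3 are the HDFS defaults in recent Hadoop releases; the round-robin placement is a simplifying assumption for illustration and is not HDFS's actual rack-aware placement policy. The function and node names are hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MiB, HDFS default block size
REPLICATION = 3                 # HDFS default replication factor


def plan_blocks(file_size: int, nodes: list[str],
                block_size: int = BLOCK_SIZE,
                replication: int = REPLICATION) -> list[dict]:
    """Split a file into blocks and assign each block to distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = (file_size + block_size - 1) // block_size  # ceiling division
    plan = []
    for i in range(num_blocks):
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - i * block_size)
        # Simplified round-robin placement (NOT the real HDFS policy):
        # consecutive nodes guarantee distinct replicas for one block.
        replicas = [nodes[(i + r) % len(nodes)] for r in range(replication)]
        plan.append({"block": i, "bytes": length, "replicas": replicas})
    return plan


for entry in plan_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"]):
    print(entry)
```

Because every block lives on several nodes, losing a single node leaves at least two copies of each of its blocks available for re-replication.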
Ceph
• Ceph is a unified, distributed storage system designed for excellent
performance, reliability, and scalability.
• It provides object, block, and file system storage in a single platform.
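• Ceph uses the CRUSH (Controlled Replication Under Scalable Hashing)
algorithm to compute data placement, so clients can locate objects
without consulting a central lookup table. (The CephFS file system
layer still uses metadata servers for its namespace.)
• Highly scalable, with self-healing capabilities.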
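The core idea of computing placement from a hash, rather than looking it up in a central table, can be illustrated with rendezvous (highest-random-weight) hashing. This is a stand-in technique, not the CRUSH algorithm itself, which additionally accounts for device weights and failure-domain hierarchies. The function and node names are hypothetical.

```python
import hashlib


def _score(object_name: str, node: str) -> int:
    """Deterministic pseudo-random score for an (object, node) pair."""
    digest = hashlib.sha256(f"{object_name}:{node}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(object_name: str, nodes: list[str], replicas: int = 3) -> list[str]:
    """Return the nodes that should hold the object's replicas.

    Any client with the same node list computes the same answer,
    so no central placement table is needed.
    """
    ranked = sorted(nodes, key=lambda n: _score(object_name, n), reverse=True)
    return ranked[:replicas]


cluster = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"]
print(place("photo.jpg", cluster))
```

A useful property of this scheme is that removing a node only moves the objects that were placed on that node; all other placements stay the same.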
Lustre
• Lustre is a high-performance distributed file system commonly used
in large-scale cluster computing.
• POSIX-compliant (Portable Operating System Interface), for
compatibility with existing applications.
• It is widely deployed in supercomputing environments where
performance and throughput are critical.
• Used in scientific computing and financial modeling.
NoSQL Databases in the Cloud
NoSQL databases provide flexible schemas and horizontal scalability for
cloud applications.
• Amazon DynamoDB
• Apache Cassandra
• MongoDB Atlas
Amazon DynamoDB
• DynamoDB is a fully managed NoSQL database service that supports
key-value and document data models.
• Delivers single-digit millisecond latency, with auto scaling and
on-demand capacity.
• Supports ACID (atomicity, consistency, isolation, durability)
transactions.
• Designed for low-latency, high-throughput applications, with features
such as DAX (DynamoDB Accelerator), an in-memory cache, and global
tables for multi-region replication.
Apache Cassandra
• Cassandra is a highly scalable NoSQL database designed for handling
large amounts of data across multiple commodity servers with no
single point of failure.
• It is a decentralized, wide-column store with a peer-to-peer
architecture.
• Consistency is tunable per operation, ranging from eventual to strong.
• Scales near-linearly and can span multiple data centers.
• Used by Netflix, Apple, and other operators of large-scale applications.
MongoDB Atlas
• MongoDB Atlas is a fully managed cloud version of MongoDB, a
document-based NoSQL database.
• Atlas supports multi-region deployments, automated backups, and
integrated monitoring tools.
• Stores data as JSON-like documents with a flexible schema.
• Supports sharding for horizontal scaling.
Data Consistency Models and Replication
Strategies
Distributed storage systems face trade-offs between consistency,
availability, and partition tolerance. The CAP theorem states that when
a network partition occurs, a system must give up either consistency or
availability. Various consistency models are used to balance these
trade-offs:
• Strong Consistency: Every read returns the most recent committed
write, so all clients see the same data.
• Eventual Consistency: Updates will eventually propagate through the
system, but immediate consistency is not guaranteed.
• Causal Consistency: Ensures that causally related updates are seen by
all nodes in the same order; unrelated (concurrent) updates may be
seen in different orders.
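One common way to decide whether two updates are causally related is to tag each with a vector clock. The slides do not prescribe a mechanism, so the sketch below is an assumption for illustration: clocks are dictionaries from a (hypothetical) node name to a counter, and a missing entry counts as zero.

```python
Clock = dict[str, int]


def _leq(a: Clock, b: Clock) -> bool:
    """True if every counter in a is <= the matching counter in b."""
    return all(a.get(k, 0) <= b.get(k, 0) for k in a.keys() | b.keys())


def happened_before(a: Clock, b: Clock) -> bool:
    """a causally precedes b: a <= b everywhere and the clocks differ."""
    return _leq(a, b) and not _leq(b, a)


def concurrent(a: Clock, b: Clock) -> bool:
    """Neither update causally precedes the other."""
    return not _leq(a, b) and not _leq(b, a)


w1 = {"A": 1}            # write issued on node A
w2 = {"A": 1, "B": 1}    # node B wrote after seeing w1
w3 = {"A": 2}            # node A wrote again without seeing w2

print(happened_before(w1, w2))  # w1 must be applied before w2 everywhere
print(concurrent(w2, w3))       # w2 and w3 may be applied in either order
```

A causally consistent store must apply w1 before w2 on every node, while w2 and w3 carry no ordering obligation relative to each other.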
Replication strategies include:
• Master-slave (primary-replica) replication: One node handles writes;
the others replicate its data.
• Multi-master replication: Multiple nodes accept writes, which requires
conflict resolution.
• Quorum-based replication: Reads and writes must each be
acknowledged by a quorum of nodes, balancing consistency and
availability (e.g., Dynamo-style systems).
• Synchronous replication: A write is acknowledged only after replicas
confirm it, which ensures consistency but increases latency.
• Asynchronous replication: Replicas are updated after the write is
acknowledged, which lowers latency but risks data loss if the primary
fails.
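For quorum-based replication, a commonly used rule in Dynamo-style systems is that with N replicas, a read quorum of R and a write quorum of W overlap whenever R + W > N, so every read contacts at least one replica that holds the latest write. The sketch below checks that rule by brute force; the function names are illustrative.

```python
from itertools import combinations


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """The arithmetic rule: read and write quorums overlap if R + W > N."""
    return r + w > n


def always_intersect(n: int, r: int, w: int) -> bool:
    """Brute-force check: does every R-node read set share at least one
    node with every W-node write set?"""
    nodes = range(n)
    return all(set(write_set) & set(read_set)
               for write_set in combinations(nodes, w)
               for read_set in combinations(nodes, r))


for r, w in [(2, 2), (1, 3), (1, 1)]:
    print(f"N=3 R={r} W={w}: rule={quorums_overlap(3, r, w)}, "
          f"brute force={always_intersect(3, r, w)}")
```

Lowering R or W reduces latency and improves availability, at the cost of reads that may miss the most recent write.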
Cloud-Based Data Warehousing
Modern cloud data warehouses enable large-scale analytics, often with
serverless or fully managed architectures.
• Google BigQuery
• Snowflake
• Amazon Redshift
Google BigQuery
• BigQuery is a serverless, highly scalable data warehouse that allows
users to run SQL queries on large datasets.
• It supports real-time analytics on streaming data and integrates with
various data ingestion tools.
• It also integrates with machine learning models (for example, through
BigQuery ML).
Snowflake
• Snowflake offers a cloud-native data warehouse with separate
compute and storage, enabling elastic scalability and concurrent
workloads.
• Its architecture supports structured and semi-structured data.
Amazon Redshift
• Redshift is a fully managed data warehouse that uses columnar
storage and parallel processing to deliver high performance for
analytical queries.
• It integrates with S3 for data lakes and supports Redshift Spectrum for
querying data directly from S3.
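Why columnar storage helps analytical queries can be seen in a toy example: an aggregate over one field only needs to read that field's values, not whole rows. The sketch below is purely illustrative and does not reflect Redshift's actual on-disk format; the table contents and function names are made up.

```python
# Row-oriented layout: each record stores all of its fields together.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]


def to_columns(records: list[dict]) -> dict[str, list]:
    """Pivot row records into one contiguous list per column."""
    columns: dict[str, list] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def column_total(columns: dict[str, list], name: str) -> float:
    """Aggregate a single column without touching the others."""
    return sum(columns[name])


columns = to_columns(rows)
print(column_total(columns, "amount"))
```

Storing each column contiguously also tends to compress well, because neighboring values share a type and often a similar range.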
Data Streaming and Real-Time Processing
Real-time data processing is crucial for applications such as fraud
detection, log analysis, and recommendation systems.
Cloud-based streaming services include:
• Apache Kafka: A distributed event streaming platform that enables
real-time data feeds.
• Amazon Kinesis and Azure Event Hubs: Managed streaming services for
real-time data ingestion and analytics, supporting ingestion from IoT
devices, logs, and transactions.
• Google Cloud Dataflow: A serverless data processing service for
stream and batch data, built on the Apache Beam SDK.
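A typical operation in these platforms is a windowed aggregation over an event stream. The following is a minimal in-memory sketch of a tumbling (fixed, non-overlapping) window count; it does not use the Kafka, Kinesis, Event Hubs, or Beam APIs, and the event format of (timestamp in seconds, key) is an assumption.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds: int = 60) -> dict:
    """Count events per (window start, key).

    `events` is an iterable of (timestamp_seconds, key) pairs. Each event
    falls into exactly one window of length `window_seconds`.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


stream = [(0, "login"), (30, "login"), (61, "login"), (65, "purchase")]
for (start, key), count in sorted(tumbling_window_counts(stream).items()):
    print(f"window starting at {start}s: {key} x{count}")
```

Real streaming engines add what this sketch omits, such as handling late or out-of-order events and emitting results incrementally as windows close.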
Backup, Disaster Recovery, and Storage
Security
• Backup and Disaster Recovery
• Storage Security
Backup and Disaster Recovery
Cloud providers offer automated backup services with options for
versioning and point-in-time recovery.
Disaster recovery strategies include:
• Cold standby: Delayed recovery using periodically updated backups.
• Warm standby: Partially active infrastructure that can be quickly
scaled.
• Hot standby: Fully active and redundant systems across regions.
Storage Security
Security in cloud storage involves:
• Encryption: Both in transit (TLS) and at rest (AES-256).
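• Access control: Fine-grained IAM policies, access control lists (ACLs),
and access logs.
• Immutable storage: Write-once, read-many (WORM) retention that
prevents data from being altered or deleted, limiting the impact of
ransomware attacks.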
• Compliance: Adherence to standards like GDPR, HIPAA, and SOC 2.
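The immutable storage idea can be sketched as a store that rejects overwrites and deletes until a retention deadline has passed. This is an in-memory illustration only, not any provider's object-lock API; the class name, method names, and the use of plain numbers for time are assumptions.

```python
class WormStore:
    """Write-once store: objects are locked until their retention deadline."""

    def __init__(self) -> None:
        # key -> (data, retain_until)
        self._objects: dict[str, tuple[bytes, float]] = {}

    def _check_unlocked(self, key: str, now: float) -> None:
        entry = self._objects.get(key)
        if entry is not None and now < entry[1]:
            raise PermissionError(f"{key!r} is locked until t={entry[1]}")

    def put(self, key: str, data: bytes, retain_until: float, now: float) -> None:
        """Store an object; fails if an existing copy is still locked."""
        self._check_unlocked(key, now)
        self._objects[key] = (data, retain_until)

    def delete(self, key: str, now: float) -> None:
        """Delete an object; fails while the retention period is active."""
        self._check_unlocked(key, now)
        self._objects.pop(key, None)

    def get(self, key: str) -> bytes:
        return self._objects[key][0]


store = WormStore()
store.put("backup-01", b"snapshot", retain_until=100, now=0)
try:
    store.put("backup-01", b"encrypted by attacker", retain_until=200, now=50)
except PermissionError as err:
    print("overwrite rejected:", err)
print(store.get("backup-01"))
```

Because the lock is enforced by the store rather than by the caller's credentials, even a compromised account cannot tamper with a backup during its retention period.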
Conclusion
• Distributed storage systems are foundational to the reliability,
performance, and scalability of cloud-based solutions.
• From object storage services and distributed file systems to NoSQL
databases and real-time processing platforms, understanding these
systems is essential for architects and developers building cloud-
native applications.
• Moreover, robust replication strategies, consistency models, and
security mechanisms ensure the integrity and availability of data in a
distributed environment.