Cloud Computing – Unit 3
1. Data in the Cloud
Cloud platforms provide storage and management of data without the need for on-premises
infrastructure. Data can be structured, semi-structured, or unstructured.
1.1 Relational Databases in the Cloud
Relational databases store data in tables (rows and columns), with relationships defined
between tables. SQL is used to query relational databases.
Cloud Examples:
- Amazon RDS
- Google Cloud SQL
- Microsoft Azure SQL Database
Features:
- Automatic backups
- Multi-zone replication
- Automatic failover
- Horizontal and vertical scaling
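In application code, a cloud-hosted relational database behaves like any other SQL endpoint. A minimal JDBC sketch, assuming a MySQL-compatible Amazon RDS instance; the hostname, database name, and credentials are hypothetical, and the MySQL JDBC driver is assumed to be on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RdsQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint for a MySQL-compatible Amazon RDS instance.
        String url = "jdbc:mysql://mydb.example.us-east-1.rds.amazonaws.com:3306/shop";
        try (Connection conn = DriverManager.getConnection(url, "admin", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name FROM customers")) {
            while (rs.next()) {
                System.out.println(rs.getInt("id") + " " + rs.getString("name"));
            }
        }
    }
}
```

The point is that backups, replication, and failover from the feature list above happen behind this endpoint; the application code does not change.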
2. Cloud File Systems
Traditional file systems cannot efficiently scale to petabytes of data. Cloud platforms
therefore use distributed file systems to manage large data sets across clusters of machines.
2.1 Google File System (GFS)
A proprietary distributed file system developed by Google, designed to store huge files reliably across clusters of commodity hardware.
Architecture:
- Master Node: metadata and chunk locations
- Chunk Servers: store data blocks (chunks)
- Default chunk size: 64 MB, replicated 3 times
Advantages:
- Fault-tolerant
- Scalable
- Optimized for large, sequential files
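As a rough illustration of this architecture, the sketch below shows the master's two metadata tables and how they resolve a client read. All names are made up, since GFS is proprietary and has no public API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of the GFS master's metadata tables (illustrative names only).
public class GfsMasterSketch {
    static final long CHUNK_SIZE = 64L * 1024 * 1024; // default 64 MB chunks

    // file path -> ordered list of chunk handles
    private final Map<String, List<Long>> fileToChunks = new HashMap<>();
    // chunk handle -> chunk servers holding a replica (3 by default)
    private final Map<Long, List<String>> chunkReplicas = new HashMap<>();

    // A client asks the master only for chunk locations (assumes the file
    // exists), then streams the actual bytes directly from a chunk server,
    // keeping the master off the data path.
    public List<String> replicasFor(String path, long byteOffset) {
        long chunkIndex = byteOffset / CHUNK_SIZE;
        long handle = fileToChunks.get(path).get((int) chunkIndex);
        return chunkReplicas.get(handle);
    }
}
```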
2.2 Hadoop Distributed File System (HDFS)
An open-source distributed file system inspired by GFS; it is the storage layer of Apache Hadoop and is widely used for big data workloads.
Architecture:
- NameNode (Master): manages metadata
- DataNodes: store actual blocks
Characteristics:
- Block size: 128 MB (default)
- Fault-tolerance via replication (3 copies)
- Write-once, read-many optimized
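Applications reach HDFS through Hadoop's FileSystem API: the NameNode resolves paths to blocks, and DataNodes store and replicate the bytes. A minimal write sketch, assuming a NameNode at the hypothetical address hdfs://namenode:9000:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/events.log"))) {
            // The client writes a stream; HDFS splits it into blocks and the
            // DataNodes replicate each block (3 copies by default).
            out.writeUTF("hello hdfs");
        }
    }
}
```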
2.3 Comparison: GFS vs HDFS
| Feature | GFS | HDFS |
|------------------|-----------|-----------|
| Origin | Google | Apache |
| Open Source | No | Yes |
| Block Size | 64 MB | 128 MB |
| Language | C++ | Java |
| Replication | 3 copies | 3 copies |
3. NoSQL Databases in the Cloud
NoSQL databases were created to address the scalability limitations of relational databases, trading strict relational guarantees for horizontal scale and availability.
3.1 Google Bigtable
A distributed storage system for structured data, built on top of GFS.
Structure:
- Each cell value is addressed by a (row key, column key, timestamp) triple
- Multiple timestamped versions of a cell are retained (versioning)
Used in:
- Google Search Index
- Google Earth and Maps
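Conceptually, Bigtable is a sparse, sorted, multi-dimensional map from (row key, column key, timestamp) to a value. A toy in-memory sketch of that data model, not the real Bigtable API:

```java
import java.util.Comparator;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy sketch of Bigtable's data model: row -> column -> timestamp -> value.
public class BigtableModelSketch {
    // Timestamps are kept newest-first so the latest version is cheap to find.
    private final SortedMap<String, SortedMap<String, SortedMap<Long, byte[]>>> cells =
            new TreeMap<>();

    public void put(String row, String column, long timestamp, byte[] value) {
        cells.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(column, c -> new TreeMap<>(Comparator.reverseOrder()))
             .put(timestamp, value);
    }

    // Versioning: older values stay addressable by timestamp; this returns
    // the newest one (assumes the cell exists).
    public byte[] getLatest(String row, String column) {
        SortedMap<Long, byte[]> versions = cells.get(row).get(column);
        return versions.get(versions.firstKey()); // firstKey() = newest timestamp
    }
}
```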
3.2 Apache HBase
An open-source implementation of the Bigtable model that runs on top of HDFS.
Features:
- Column-family oriented
- Real-time read/write
- Schema-less (only column families are defined up front)
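A minimal sketch using the standard HBase Java client; it assumes a reachable cluster and an existing table users with a column family info, both hypothetical:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetExample {
    public static void main(String[] args) throws Exception {
        // Assumes a reachable HBase cluster and a pre-created table 'users'
        // with column family 'info' (both hypothetical).
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
            Put put = new Put(Bytes.toBytes("user42"));          // row key
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                          Bytes.toBytes("Alice"));
            table.put(put);                                      // real-time write

            Get get = new Get(Bytes.toBytes("user42"));
            Result result = table.get(get);                      // real-time read
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```

Note that columns inside the info family are created on the fly, which is what "schema-less" means here.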
3.3 Amazon Dynamo
A highly available key-value store developed by Amazon.
Design:
- Eventual Consistency
- Decentralized P2P architecture
- Consistent Hashing, Vector Clocks
Use Cases:
- Shopping carts, Session storage
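Consistent hashing is what lets Dynamo add or remove nodes while remapping only neighboring keys. A minimal sketch of the ring; real Dynamo additionally uses virtual nodes, replication preference lists, and vector clocks:

```java
import java.util.TreeMap;
import java.util.Map;

// Minimal consistent-hashing ring in the spirit of Dynamo (illustrative only).
public class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String node) {
        ring.put(node.hashCode(), node);
    }

    // A key belongs to the first node clockwise from its hash position, so
    // adding or removing a node only remaps keys in the adjacent arc.
    public String nodeFor(String key) {
        Map.Entry<Integer, String> e = ring.ceilingEntry(key.hashCode());
        return (e != null) ? e.getValue() : ring.firstEntry().getValue();
    }

    public static void main(String[] args) {
        ConsistentHashRing r = new ConsistentHashRing();
        r.addNode("node-a");
        r.addNode("node-b");
        r.addNode("node-c");
        System.out.println(r.nodeFor("cart:user42")); // owner of this key
    }
}
```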
4. MapReduce and Extensions
A programming model for processing large datasets in parallel across clusters of machines.
4.1 Concept of MapReduce
Two phases:
1. Map: process input records and emit intermediate key-value pairs
2. Reduce: merge all values that share the same key
Example: Word Count
Map: emit <word, 1> for every word in the input
Reduce: sum the counts per word to get its total frequency
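The word count above translates directly into Hadoop's Java MapReduce API. A standard sketch: input and output paths come from the command line, and the reducer doubles as a combiner so counts are pre-aggregated locally before the shuffle:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE); // emit <word, 1>
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum)); // total count per word
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class); // local pre-aggregation cuts shuffle traffic
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```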
4.2 Parallel Computing in MapReduce
- Combines data parallelism (input splits processed by many mappers) and task parallelism
- Intermediate results are shuffled and sorted by key between the two phases
Efficient if:
- Processing runs where the data is stored (data locality)
- Data movement across the network is minimized
4.3 Relational Operations Using MapReduce
- Selection: filter rows in the Map phase (see the sketch after this list)
- Projection: emit only the required columns in the Map phase
- Group By: rows with the same key meet at one reducer, which aggregates them
- Join: Map tags each record with its source table; Reduce matches records that share the join key
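Selection and projection need only a Map phase, as the sketch below shows; the CSV input format, column positions, and filter value are all hypothetical:

```java
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Selection + projection in one Mapper over CSV rows: keep rows whose third
// column (hypothetically "country") equals "IN", and emit only the first two
// columns (hypothetically "id" and "name"). No Reduce phase is needed.
public class SelectProjectMapper extends Mapper<Object, Text, Text, NullWritable> {
    @Override
    protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] cols = value.toString().split(",");
        if (cols.length > 2 && "IN".equals(cols[2])) {        // selection (WHERE)
            ctx.write(new Text(cols[0] + "," + cols[1]),       // projection (SELECT)
                      NullWritable.get());
        }
    }
}
```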
4.4 Enterprise Batch Processing
Used for large data transformation jobs.
Tools: Hadoop, Spark
Use Cases:
- Monthly reports
- ETL pipelines
4.5 Real-World Applications of MapReduce
- Log analysis
- Social graphs
- Recommendation engines
- Bioinformatics
- Fraud detection
Summary Table
| Concept | Type | Use Case | Key Feature |
|------------|----------|--------------------------|---------------------------|
| RDBMS | SQL DB | Structured data | ACID transactions |
| GFS/HDFS | File Sys | Big data storage | Distributed, fault-tolerant |
| Bigtable | NoSQL | Google-scale storage | Column family, timestamps |
| HBase | NoSQL | Real-time Hadoop access | Built on HDFS |
| Dynamo | NoSQL | Key-value store | Decentralized, eventual consistency |
| MapReduce | Process | Batch processing | Map & Reduce model |