Detailed Exam Notes for Big Data and Hadoop
Unit I: Introduction to Big Data and Hadoop
1. What is Big Data?
- Big Data refers to datasets that are too large and complex to be processed by traditional
data-processing tools.
- Characteristics (3Vs): Volume (scale of data), Velocity (speed at which data is generated and processed), Variety (structured, semi-structured, and unstructured formats).
- Example: Data generated by social media platforms like Facebook, Twitter.
2. Hadoop Ecosystem:
- Framework for distributed storage and processing of Big Data.
- Core Components:
a. HDFS (Hadoop Distributed File System): Stores data as large blocks (typically 128 MB) replicated across multiple nodes for fault tolerance.
b. MapReduce: Processes data in parallel across the cluster in map and reduce phases (see the word-count sketch below).
c. Other Tools: Hive (SQL-like queries), Pig (data transformation), HBase (NoSQL database).
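- Example (a minimal sketch of the classic word-count job using the Hadoop MapReduce Java API; class names are illustrative and the job/driver setup is omitted):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: emits (word, 1) for every word in an input line.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: sums the emitted counts for each word.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }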
3. IBM Big Data Strategy and InfoSphere BigInsights:
- IBM InfoSphere BigInsights builds on Apache Hadoop and adds tools for Big Data analysis, such as BigSheets for spreadsheet-style exploration of large datasets.
Diagram: Big Data Analytics flow (Collection -> Storage -> Processing -> Insights)
Unit II: Hadoop Distributed File System (HDFS)
1. Architecture of HDFS:
- HDFS is a distributed file system that splits large files into blocks, distributes them across nodes, and replicates each block (default replication factor 3) for fault tolerance.
- Components:
a. NameNode: Master node that manages the file-system namespace and block metadata.
b. DataNodes: Worker nodes that store the actual data blocks and report to the NameNode (see the Java client sketch below).
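- Example (a minimal sketch of writing a file to HDFS with the Hadoop FileSystem Java API; the NameNode URI and file path are assumptions for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            // Assumed NameNode address; in practice this comes from core-site.xml.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000");

            FileSystem fs = FileSystem.get(conf);

            // The client asks the NameNode where to place blocks; the bytes
            // themselves are streamed to DataNodes.
            Path path = new Path("/user/student/notes.txt");
            try (FSDataOutputStream out = fs.create(path)) {
                out.writeUTF("Hello HDFS");
            }
            fs.close();
        }
    }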
2. Data Ingestion:
- Flume: Collects and moves streaming log/event data into HDFS in near real time (see the client sketch below).
- Sqoop: Transfers structured data between relational databases (RDBMS) and HDFS (import and export).
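- Example (a minimal sketch using the Flume client SDK to send one event to a running Flume agent, which would then deliver it to HDFS via its configured sink; the host, port, and message are assumptions for illustration):

    import java.nio.charset.StandardCharsets;
    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class FlumeClientExample {
        public static void main(String[] args) throws EventDeliveryException {
            // Assumed agent location: a Flume agent with an Avro source on port 41414.
            RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
            try {
                // Build a single log event and hand it to the agent; the agent's
                // configured channel and HDFS sink do the rest.
                Event event = EventBuilder.withBody("sample log line", StandardCharsets.UTF_8);
                client.append(event);
            } finally {
                client.close();
            }
        }
    }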
3. Hadoop I/O:
- Compression: Reduces data size on disk and over the network (e.g., gzip, Snappy), speeding up I/O-bound jobs.
- Serialization: Converts data into a compact byte format for storage and transfer (e.g., Hadoop Writables, Avro); see the Avro sketch below.
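- Example (a minimal sketch of Avro serialization that writes one record to an Avro container file; the schema and file name are assumptions for illustration):

    import java.io.File;
    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroWriteExample {
        public static void main(String[] args) throws IOException {
            // Assumed schema: a simple "User" record with a name and an age.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}");

            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "Alice");
            user.put("age", 30);

            // Serialize the record into an Avro container file; the same file
            // could later be stored in HDFS and read back with a DataFileReader.
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
                writer.create(schema, new File("users.avro"));
                writer.append(user);
            }
        }
    }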
Diagram: HDFS Architecture with NameNode and DataNodes
Additional Units and Diagrams
The content and diagrams for Units III, IV, and V will follow similar patterns.