Exam-Oriented Notes: Big Data and Hadoop
Unit I: Introduction to Big Data and Hadoop
1. Big Data Analytics:
- Processing large, complex datasets to extract useful patterns and insights.
- Types: Structured, Unstructured, Semi-structured.
2. History of Hadoop:
- Created by Doug Cutting and Mike Cafarella; named after Cutting's son's toy elephant.
- Inspired by Google's MapReduce and GFS (Google File System) papers.
3. Hadoop Ecosystem:
- Tools like HDFS, MapReduce, Pig, Hive, HBase, Sqoop, Flume, and Oozie.
4. IBM Big Data Strategy:
- Integrates Hadoop with IBM InfoSphere BigInsights for enterprise data management.
Unit II: HDFS (Hadoop Distributed File System)
1. HDFS Concepts:
- Distributed, fault-tolerant storage system for very large datasets.
- Files are divided into fixed-size blocks (128 MB by default in Hadoop 2+), replicated (3 copies by default) across DataNodes; the NameNode tracks block locations.
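The block-splitting and replication idea above can be sketched in Python. This is a simplified in-memory model: the block size, node names, and round-robin placement policy are illustrative stand-ins, not real HDFS behavior.

```python
# Toy model of HDFS-style block splitting and replica placement.
# BLOCK_SIZE and NODES are illustrative; real HDFS defaults to 128 MB blocks.
BLOCK_SIZE = 4
NODES = ["node1", "node2", "node3"]
REPLICATION = 2

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut a byte string into fixed-size blocks, as an HDFS client does with a file."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block to `replication` nodes round-robin (toy placement policy)."""
    return {idx: [nodes[(idx + r) % len(nodes)] for r in range(replication)]
            for idx in range(len(blocks))}

blocks = split_into_blocks(b"hello hdfs!")
print(blocks)                 # [b'hell', b'o hd', b'fs!']
print(place_blocks(blocks))   # each block stored on 2 different nodes
```

Note how the last block is smaller than the block size: HDFS likewise does not pad the final block of a file.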
2. Data Ingestion (Flume and Sqoop):
- Flume: Collects and moves large volumes of streaming log/event data into HDFS.
- Sqoop: Transfers structured data between HDFS and relational databases (import and export).
3. Hadoop I/O:
- Compression: Reduces storage footprint and network I/O (codecs such as gzip and Snappy).
- Serialization: Converts in-memory objects into byte streams for storage and transfer (e.g., Writable, Avro).
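Both ideas can be demonstrated with Python's standard library: json stands in for serialization and gzip for compression. Hadoop itself uses formats like Writable/Avro and codecs like Snappy, which are not shown here.

```python
import gzip
import json

# Serialization: convert an in-memory record into a storable byte format.
record = {"user": "alice", "clicks": 42}
serialized = json.dumps(record).encode("utf-8")

# Compression: shrink the serialized bytes before writing them out.
compressed = gzip.compress(serialized)

# Round-trip: decompress, then deserialize back into the original object.
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))
assert restored == record
print(len(serialized), len(compressed))
```

On tiny inputs like this the gzip header can outweigh the savings; compression pays off on the large files Hadoop typically handles.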
Unit III: MapReduce
1. Anatomy of MapReduce Job:
- The client submits a job; the input is divided into splits, map tasks process the splits in parallel, and reduce tasks combine the intermediate results into the final output.
2. Shuffle and Sort:
- Transfers map output to the reducers, grouping values by key and sorting the keys before the reduce phase.
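The map → shuffle/sort → reduce flow described above can be simulated in a few lines of Python using word count, the canonical example. This is a single-process sketch; real Hadoop runs each phase across many machines.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input split.
    return [(word, 1) for word in line.split()]

def shuffle_and_sort(pairs):
    # Shuffle: group all values by key; sort keys, as Hadoop does before reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    # Reduce: combine the grouped values into a final count.
    return (key, sum(values))

lines = ["big data big hadoop", "hadoop big"]
pairs = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle_and_sort(pairs))
print(result)  # {'big': 3, 'data': 1, 'hadoop': 2}
```

Each reducer sees one key with all of its values already grouped and sorted; that guarantee is exactly what the shuffle and sort stage provides.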
3. Job Scheduling:
- Ensures efficient task execution using schedulers such as FIFO (the default), Fair Scheduler, and Capacity Scheduler.
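FIFO, the simplest policy, can be sketched with a plain queue. This is a toy model: real Hadoop/YARN schedulers also account for cluster resources, queues/pools, and priorities.

```python
from collections import deque

class FifoScheduler:
    """Toy FIFO job scheduler: jobs run strictly in submission order."""
    def __init__(self):
        self.queue = deque()

    def submit(self, job):
        self.queue.append(job)

    def run_all(self):
        completed = []
        while self.queue:
            completed.append(self.queue.popleft())  # earliest submission first
        return completed

sched = FifoScheduler()
for job in ["job-A", "job-B", "job-C"]:
    sched.submit(job)
print(sched.run_all())  # ['job-A', 'job-B', 'job-C']
```

The weakness visible even here: a long job at the head of the queue blocks everything behind it, which is what the Fair and Capacity Schedulers were designed to avoid.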
Unit IV: Hadoop Ecosystem Tools
1. Pig:
- High-level platform for analyzing large datasets.
- Uses the Pig Latin scripting language, which needs far fewer lines of code than the equivalent Java MapReduce.
2. Hive:
- Query data using HiveQL (SQL-like language).
- Used for data warehousing and querying.
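HiveQL closely resembles standard SQL, so the flavor of a typical query can be illustrated with Python's built-in sqlite3 module. The table name and data here are made up, and sqlite3 is only a stand-in for the Hive engine; the query shown happens to be valid in both dialects.

```python
import sqlite3

# In-memory database standing in for a Hive warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("home", 120), ("docs", 80), ("home", 30)])

# A HiveQL-style aggregation: group, sum, and order the results.
rows = conn.execute(
    "SELECT page, SUM(views) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # [('docs', 80), ('home', 150)]
```

In Hive, the same statement would be compiled into MapReduce (or Tez/Spark) jobs over files in HDFS rather than executed against a local database.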
3. HBase:
- Column-oriented NoSQL database on top of HDFS, modeled after Google's Bigtable.
- Supports low-latency random reads and writes by key, unlike batch-oriented MapReduce; suited to real-time access patterns rather than a general replacement for an RDBMS.
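HBase stores data as a sparse, sorted map of row key → column family → column qualifier → value. A toy in-memory model with nested dicts (illustrative only; real HBase persists to HDFS and distributes rows across region servers):

```python
# Toy model of HBase's logical layout: row key -> column family -> qualifier -> value.
table = {}

def put(row, family, qualifier, value):
    """Write a single cell, creating the row and family on first use (sparse storage)."""
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

def get(row, family, qualifier):
    """Random read by key: the access pattern HBase serves with low latency."""
    return table.get(row, {}).get(family, {}).get(qualifier)

put("user#1", "info", "name", "alice")
put("user#1", "info", "city", "Pune")
print(get("user#1", "info", "name"))  # alice
print(get("user#2", "info", "name"))  # None - missing cells cost nothing to store
```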
Unit V: Data Analytics with R and Machine Learning
1. Supervised Learning:
- Uses labeled data to train models.
- Examples: Regression, Classification.
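A minimal supervised example: fitting simple linear regression by least squares in pure Python. The training data is made up (it lies exactly on y = 2x + 1); in practice R's lm() or a library routine would be used.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x from labeled (x, y) training pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# Labeled training data: the "answers" (ys) are known, which makes this supervised.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a, b = fit_line(xs, ys)
print(a, b)  # 1.0 2.0
```

The labels ys are what distinguish this from the unsupervised setting below: the model is trained to reproduce known outputs.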
2. Unsupervised Learning:
- Works on unlabeled data to find patterns.
- Examples: Clustering, Dimensionality Reduction.
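A minimal unsupervised example: 1-D k-means clustering in pure Python. The data points and initial centers are toy values chosen for illustration; no labels are given to the algorithm.

```python
def kmeans_1d(points, centers, iterations=10):
    """Tiny 1-D k-means: assign points to the nearest center, then recompute centers."""
    for _ in range(iterations):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # New center = mean of the points assigned to it (drop empty clusters).
        centers = [sum(ps) / len(ps) for ps in clusters.values() if ps]
    return sorted(centers)

# Two obvious groups around 1 and 10; the algorithm discovers them unaided.
points = [0.9, 1.0, 1.1, 9.9, 10.0, 10.1]
print(kmeans_1d(points, centers=[0.0, 5.0]))  # centers settle near 1.0 and 10.0
```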
3. Collaborative Filtering:
- Recommender-system technique: predicts a user's preference for an item from the preferences of similar users (user-based) or similar items (item-based).
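A tiny user-based collaborative filtering sketch: score an item the target user has not rated, using a similarity-weighted average of other users' ratings. The ratings data is made up, and cosine similarity over co-rated items is one common choice among several.

```python
import math

# Made-up user -> item -> rating data.
ratings = {
    "alice": {"inception": 5, "matrix": 4},
    "bob":   {"inception": 5, "matrix": 4, "avatar": 2},
    "carol": {"inception": 1, "matrix": 2, "avatar": 5},
}

def cosine(u, v):
    """Cosine similarity computed over the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(u[i] ** 2 for i in common))
    nv = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    sims = [(cosine(ratings[user], r), r[item])
            for other, r in ratings.items()
            if other != user and item in r]
    total = sum(s for s, _ in sims)
    return sum(s * rating for s, rating in sims) / total if total else None

print(predict("alice", "avatar"))  # a score between bob's 2 and carol's 5
```

Users most similar to alice pull the prediction toward their own ratings; at scale this is the kind of computation MapReduce and tools like Mahout were built to distribute.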