BDA

Module 1

1) What is Big Data? Explain evolution of big data & characteristics.

What is Big Data?


Big Data refers to extremely large datasets that cannot be efficiently
processed, managed, or analyzed using traditional data management tools
and applications. It encompasses high-volume, high-velocity, and high-
variety information assets that require new forms of processing to enable
enhanced decision-making, insight discovery, and process optimization.

Definitions:
 "Big Data is a collection of data sets so large or complex that traditional
data processing applications are inadequate."
 "It is data of a very large size, typically to the extent that its manipulation
and management present significant logistical challenges."
 "Big Data refers to data sets whose size is beyond the ability of typical
database software tools to capture, store, manage, and analyze."

Evolution of Big Data


The rapid advancement in technology has led to the exponential growth
of data. Initially, smaller data units such as megabytes were used, but
with increasing data production, we have moved to processing and
analyzing petabytes of data. Conventional storage and analysis systems
face challenges due to:
 Significant growth in the volume of data.
 Diverse data formats and variety.
 Increased complexity of data structures.
 Faster data generation and the need for real-time processing and analysis.

Characteristics of Big Data


Big Data is primarily defined by the 4 Vs:
1. Volume:
o Refers to the size or quantity of data generated.
o Applications produce a vast amount of data that traditional systems
struggle to handle.
2. Velocity:
o Represents the speed at which data is generated and processed.
o Fast data generation is critical for real-time analytics and decision-
making.
3. Variety:
o Indicates the diverse forms and formats of data, including
structured, semi-structured, and unstructured data.
o Data comes from multiple sources such as sensors, web servers,
and enterprise systems, adding complexity to its management and
analysis.
4. Veracity:
o Refers to the quality and accuracy of the data collected.
o Poor data quality can affect the reliability of insights derived from
Big Data analytics.
These characteristics highlight the challenges and opportunities presented
by Big Data, necessitating advanced tools and techniques for effective
management and utilization.

2) What is Cloud Computing? Explain different services of Cloud.

What is Cloud Computing?


Cloud computing is an internet-based computing model that provides
shared resources and services, such as processing power and data storage,
on demand. It enables users to access computing infrastructure,
platforms, and software services anytime and anywhere over the Internet.
It is one of the best approaches for data processing, allowing parallel and
distributed computing in a cloud environment. Cloud services can be
hosted on platforms like Amazon Web Services (AWS), Microsoft Azure,
or Apache CloudStack. For example, Amazon Simple Storage Service
(S3) offers a simple interface to store and retrieve unlimited data over the
web.
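As an illustration of the kind of simple storage interface described above, here is a minimal Python sketch assuming the boto3 library (not mentioned in the source); the bucket name and object key are hypothetical placeholders.

import boto3  # AWS SDK for Python (assumed available and configured with credentials)

s3 = boto3.client("s3")

# Store an object; bucket and key names below are placeholders
s3.put_object(Bucket="example-bucket", Key="reports/sales.csv", Body=b"id,amount\n1,250\n")

# Retrieve the same object and read its contents
response = s3.get_object(Bucket="example-bucket", Key="reports/sales.csv")
print(response["Body"].read().decode("utf-8"))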
Key Features of Cloud Computing:
1. On-demand service.
2. Resource pooling.
3. Scalability.
4. Accountability.
5. Broad network access.
Types of Cloud Services
1. Infrastructure as a Service (IaaS):
o IaaS provides access to physical and virtual resources such as hard
disks, network connections, storage, data centers, and virtual
servers.
o Example: Tata CloudStack, which helps manage virtual machines
and provides scalable infrastructure.
2. Platform as a Service (PaaS):
o PaaS offers a runtime environment where developers can build,
test, and deploy applications. It supports services like storage,
networking, and application hosting.
o Examples: Hadoop cloud services like IBM BigInsights, Microsoft
Azure HDInsight, and Oracle Big Data Cloud Services.
3. Software as a Service (SaaS):
o SaaS delivers software applications over the internet to end users.
These applications are hosted by service providers and accessed via
the web.
o Examples: SQL services like GoogleSQL, IBM BigSQL, Microsoft
Polybase, and Oracle Big Data SQL.

Summary
Cloud computing provides flexible and scalable solutions for data
storage, software, and computing resources. It offers three main types of
services—IaaS, PaaS, and SaaS—to meet various user needs, enabling
efficient and cost-effective computing solutions over the internet.

3) Explain the following terms. i. Scalability & Parallel Processing ii. Grid & Cluster Computing.

i. Scalability & Parallel Processing


Scalability refers to the ability of a system to increase or decrease its
capacity for data storage, processing, and analytics as per workload
demands. It ensures that a system can handle larger workloads effectively
by scaling up or scaling out.
 Scaling Up (Vertical Scalability): Involves increasing the system's
resources, such as CPUs, RAM, or storage, to improve analytics,
reporting, and visualization capabilities. Efficient algorithm design helps
utilize these resources effectively.
 Scaling Out (Horizontal Scalability): Involves increasing the number of
systems working together to distribute tasks across multiple systems in
parallel. It is used for processing different datasets of large data in a
distributed manner.

Parallel Processing is a method of breaking down a computational problem
into sub-tasks that can be processed simultaneously across multiple
compute resources. It enhances performance and reduces processing time.
 Tasks can be distributed at various levels:
1. Onto separate threads on the same CPU.
2. Onto separate CPUs within the same system.
3. Onto separate computers in a network.
Massively Parallel Processing (MPP) platforms utilize multiple compute
resources for parallel processing, enabling faster execution of large
programs by executing sub-tasks concurrently.
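A minimal Python sketch of the idea, assuming the standard multiprocessing module: a single problem (summing a large list) is broken into sub-tasks that separate worker processes handle in parallel.

from multiprocessing import Pool

def partial_sum(chunk):
    # Sub-task: each worker process sums one chunk of the data
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Split the problem into 4 sub-tasks (one per compute resource)
    n = len(data) // 4
    chunks = [data[i:i + n] for i in range(0, len(data), n)]
    with Pool(processes=4) as pool:
        # Sub-tasks run concurrently on separate CPU cores
        results = pool.map(partial_sum, chunks)
    print(sum(results))  # combine partial results into the final answer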
ii. Grid & Cluster Computing
Grid Computing is a distributed computing approach where a group of
computers from different locations work together to achieve a common
goal.
 Key Features:
o It provides large-scale, secure, and flexible resource sharing among
users.
o Ideal for data-intensive storage rather than small data objects.
o Scalable and forms a distributed network for resource integration.
Advantages:
 Enables sharing resources among many users, reducing infrastructure
costs.
 Useful for applications requiring large data distribution across grid nodes.
Drawbacks:
 Vulnerable to failure if any participating node underperforms or fails.

Cluster Computing involves a group of interconnected computers working
together to accomplish a shared task.
 Primarily used for load balancing, where processes are shifted between
nodes to maintain an even workload.
 Commonly used in architectures like Hadoop for distributing tasks
efficiently.
Both grid and cluster computing enhance system efficiency through
resource sharing and parallel processing but differ in their
implementation and purpose.

4) Explain any two different Big Data Applications.


Big Data Analytics Applications (Simplified)
1. Marketing and Sales
o Understands customer needs and improves experiences
using Customer Value Analytics (CVA).
o Helps in creating effective content, improving customer
relationships, reducing costs, and increasing value.
o Uses browsing history for targeted ads (contextual
marketing).
2. Fraud Detection
o Prevents financial losses by identifying fake data and
poor-quality products.
o Combines data from social media, websites, and emails for
better fraud prediction.
3. Risks in Big Data
o Big data may have errors or inaccurate information.
o Companies need strong risk management to ensure
accurate results.
4. Credit Risk Management
o Helps banks identify high-risk sectors, businesses, and
individuals before lending.
o Detects risks like loan defaults and money shortages.
5. Healthcare
o Uses medical records and diagnosis logs for better
treatment and monitoring.
o Prevents fraud, reduces costs, and improves patient
outcomes in real-time.
6. Medicine
o Combines DNA and wearable device data for disease
research and patient risk profiling.
o Creates patterns to improve medical understanding and
treatments.
7. Advertising/Telecommunication
o Analyzes trends for better-targeted ads on social media,
emails, and more.
o Helps discover new opportunities and personalize
marketing campaigns.
These applications simplify processes, save costs, and drive
innovation in various industries.
MODULE 3
1) What is NOSQL? Explain CAP Theorem.

NoSQL (Not Only SQL) refers to non-relational databases designed to store and
process large volumes of distributed data without requiring a fixed table
schema; they trade strict relational guarantees for scalability and flexibility.

CAP Theorem:
The CAP Theorem states that in any distributed database system, only two out
of the following three properties can be guaranteed simultaneously:
1. Consistency (C):
o All copies of data in the system must show the same data at the
same time.
o For example, if a sales record is updated in one part of the
database, the same change should reflect everywhere instantly.

2. Availability (A):
o The system must respond to every request, even during partial
failures.
o This means the system ensures that users can always access their
data.
3. Partition Tolerance (P):
o The system should continue to function even if there is a network
partition (e.g., a part of the network becomes unreachable).
o It ensures the database keeps running despite message losses or
network failures.

Key Insight:
 A distributed database cannot achieve all three properties (C, A, P) at the
same time.
 Developers must choose two out of the three, depending on the
requirements of their application.

Examples of CAP Trade-offs:


1. Consistency + Availability (CA):
o Suitable when data accuracy is critical, and the system cannot
tolerate partitions.
o Examples: RDBMS (MySQL, PostgreSQL).
2. Availability + Partition Tolerance (AP):
o Suitable for applications needing fast responses, even if some data
might be outdated.
o Examples: CouchDB, Cassandra.
3. Consistency + Partition Tolerance (CP):
o Suitable when consistent data is required, but availability can be
temporarily compromised.
o Examples: HBase, MongoDB.

Brewer’s Theorem:
 Brewer’s CAP Theorem further explains that during a network failure, a
system has to decide:
1. Provide stale (possibly outdated) data to maintain Availability (AP trade-off).
2. Refuse to provide any data until the latest copy is available
(Consistency priority, CP trade-off).

Conclusion:
The CAP Theorem helps in designing distributed systems by understanding the
trade-offs between Consistency, Availability, and Partition Tolerance. Each
application’s priorities (like speed or accuracy) guide the choice between CA,
AP, or CP models.

2)Explain NOSQL Data Architecture Patterns.


NoSQL Data Architecture Patterns
NoSQL databases are designed for high scalability, flexibility, and performance. Here are the
key architecture patterns:

1. Key-Value Pair Data Stores


 Definition: Schema-less database where data is stored as key-value pairs.
 Features:
o High performance, scalability, and flexibility.
o Values can store any data type (text, images, etc.).
 Functions:
o Get(key): Retrieve value by key.
o Put(key, value): Insert/update value.
o Delete(key): Remove key-value pair.
 Uses:
o Image/document storage, lookup tables, and caches.
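A minimal sketch of the Get/Put/Delete interface described above, implemented here with an in-memory Python dictionary purely for illustration (real key-value stores such as Redis or Riak expose the same operations over a network):

class KeyValueStore:
    def __init__(self):
        self._data = {}  # keys are strings; values can be any object (text, bytes, etc.)

    def put(self, key, value):
        # Insert a new pair or update the value for an existing key
        self._data[key] = value

    def get(self, key):
        # Retrieve the value by key; returns None if the key is absent
        return self._data.get(key)

    def delete(self, key):
        # Remove the key-value pair if it exists
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:101", {"name": "Asha", "city": "Pune"})
print(store.get("user:101"))
store.delete("user:101")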

2. Document Stores
 Definition: Store unstructured or semi-structured data as documents (e.g., JSON,
XML).
 Features:
o Schema-less and hierarchical tree structures.
o Supports flexible schema changes.
 Examples: MongoDB, CouchDB.
 Uses:
o Storing office documents, forms, and inventory data.

3. CSV and JSON File Formats


 CSV: Stores flat, tabular data without a hierarchy.
 JSON:
o Supports hierarchical structures and is developer-friendly.
o Easier parsing for modern applications.

4. Columnar Data Stores


 Definition: Stores data in columns instead of rows for high-performance analytics.
 Features:
o High scalability and partitioning.
o Efficient querying and replication.
 Examples: HBase, Cassandra, BigTable.
 Uses:
o Web crawling and handling sparse datasets.

5. BigTable Data Stores


 Features:
o Massively scalable and supports petabytes of data.
o Integrates with Hadoop and MapReduce.
o Handles millions of operations per second.
 Uses:
o High-throughput applications like analytics and global-scale services.

6. Object Data Stores


 Definition: Store data as objects (files, images, etc.) with metadata.
 Features:
o APIs for querying, indexing, and lifecycle management.
 Example: Amazon S3.
 Uses:
o Web hosting, image storage, and backups.

7. Graph Databases
 Definition: Store data as interconnected nodes (objects) and edges (relationships).
 Features:
o High flexibility for relationship-heavy data.
o Specialized query languages (e.g., SPARQL).
 Examples: Neo4j, HyperGraph.
 Uses:
o Social networks, pattern matching, and relationship-based queries.

Summary
NoSQL databases provide different patterns like key-value, document, columnar, and graph-
based storage, each suited for specific use cases. They ensure scalability, flexibility, and
efficiency, making them ideal for modern data-driven applications.
3) Explain Shared Nothing Architecture for Big Data tasks.
Shared-Nothing Architecture for Big Data Tasks
Definition:
 Shared-Nothing (SN) is a cluster architecture where nodes do not share data with
each other.
 It is used in distributed computing to connect independent nodes via a network.

Key Points:
1. Independence:
o Each node works independently without sharing memory or data, making it
self-sufficient.
2. Partitioning:
o Big Data is divided into shards (smaller pieces).
o Each shard is processed by a different node, enabling parallel processing.
3. Self-Healing:
o If a node or link fails, the system creates a new link to maintain operations.
4. No Network Contention:
o Nodes do not compete for shared resources, ensuring better performance.
5. Data Management:
o Each node maintains its own copy of data using a coordination protocol.

Examples:
 Hadoop, Flink, Spark

Advantages:
 High Scalability: Nodes can be added easily.
 Fault Tolerance: System continues to function even if a node fails.
 Efficient Parallel Processing: Multiple queries run simultaneously.

Summary:
The Shared-Nothing architecture is ideal for Big Data tasks as it distributes data
across nodes, ensuring independence, fault tolerance, and scalability. It’s
widely used in tools like Hadoop and Spark for handling large-scale data
efficiently.

Choosing Distribution Models for Big Data


Big Data systems distribute data across multiple nodes for scalability and high
performance. Below are the main distribution models:

1. Single Server Model (SSD)


o All data is stored and processed on a single server.
o Best for: Small applications or graph databases with sequential processing.
o Limitation: Doesn’t scale for large datasets or high traffic.
o Example: Small graph database.

2. Sharding Very Large Databases


o Splits a large database into smaller pieces called shards, distributed across
multiple servers.
o Benefits:
 Improves performance by parallel processing.
 Fault-tolerant: Shards can move to another node if one fails.
o Example: Customer records split across four servers (see the sharding sketch after the summary below).
3. Master-Slave Distribution Model (MSD)
o Master node handles writes; slave nodes replicate data for reads.
o Benefits:
 Optimized read performance.
 Ensures consistency with replication.
o Challenges:
 Latency in replication.
 Master node failure affects write operations.
o Example: MongoDB.
4. Peer-to-Peer Distribution Model (PPD)
o All nodes are equal and can handle both read and write operations
independently.
o Benefits:
 High availability (tolerates node failures).
 Consistent data across nodes.
o Challenges: Complex management since every node handles all operations.
o Example: Cassandra.

Summary:
Choose the model based on scale, performance, and fault tolerance needs:
 SSD for small systems.
 Sharding for parallel processing.
 Master-Slave for optimized reads.
 Peer-to-Peer for high availability.
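A minimal sketch of the sharding model above, assuming simple hash-based placement: each hypothetical customer record key is hashed and mapped onto one of four illustrative servers.

import hashlib

SHARDS = ["server-1", "server-2", "server-3", "server-4"]  # hypothetical nodes

def shard_for(key: str) -> str:
    # Hash the key and map it onto one of the shards
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    index = int(digest, 16) % len(SHARDS)
    return SHARDS[index]

for customer_id in ["C1001", "C1002", "C1003", "C1004"]:
    print(customer_id, "->", shard_for(customer_id))

Because the same key always hashes to the same shard, reads and writes for one customer go to one server, while different customers are processed in parallel on different servers.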

4) Explain MONGO DATABASE.


MongoDB Database (For 10 Marks)
MongoDB is a popular open-source NoSQL database designed for handling large amounts of data in
a flexible and distributed way. It was initially developed by 10gen (now MongoDB Inc.). It works
well for unstructured and semi-structured data.

Key Characteristics of MongoDB:


1. Non-relational: It doesn’t follow traditional SQL-based models.
2. NoSQL: Flexible and can manage large data across many servers.
3. Distributed: Data is spread across multiple machines for scalability.
4. Document-based: Stores data in JSON-like documents for flexibility.
5. Cross-platform: Works on multiple operating systems.
6. Scalable: Handles growing data by adding more servers (horizontal scaling).
7. Fault Tolerant: Ensures high availability with replication and redundancy.

Features of MongoDB:
1. Database Structure:
o A database contains collections (like tables in SQL).

o In the mongo shell, db refers to the current database; the main server process is mongod and the command-line client is mongo.

2. Collections:
o Stores documents (like rows in SQL).

o Schema-less: Documents in the same collection can have different structures.

3. Document Model:
o Data is stored in BSON (Binary JSON), which is flexible.

4. JSON-Like Storage:
o Stores data in a format similar to JSON, allowing complex data structures.

5. Flexible Data Storage:


o No need for a predefined schema; the structure can evolve over time.

6. Querying and Indexing:


o MongoDB supports dynamic querying and indexing to speed up queries.

o Queries are expressed as JSON-like documents rather than SQL statements, and are designed for document storage.

7. No Complex Joins:
o Doesn’t require complex joins, making it faster for big datasets.

8. Distributed Architecture:
o Data is spread across servers for scalability and high availability.

9. Real-Time Aggregation:
o Supports grouping, filtering, and analyzing data in real time.

Why Use MongoDB?


 Easy to scale for large datasets.
 Handles unstructured data efficiently.
 Ideal for modern applications that need flexibility and speed.
 Example Use Cases: E-commerce websites, real-time analytics, and content management
systems.
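A minimal sketch of working with MongoDB from Python, assuming the pymongo driver and a local mongod instance; the database and collection names are placeholders:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # connect to a local mongod
db = client["shop"]            # database (created lazily on first write)
products = db["products"]      # collection, analogous to a table

# Documents in the same collection may have different structures (schema-less)
products.insert_one({"name": "laptop", "price": 55000, "tags": ["electronics"]})
products.insert_one({"name": "pen", "price": 10})

# Dynamic querying: find all products cheaper than 1000
for doc in products.find({"price": {"$lt": 1000}}):
    print(doc)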

MODULE 4
1) Explain Map Reduce Execution steps with neat diagram.

MapReduce Execution Steps (For 10 Marks)


MapReduce is a framework in Hadoop for processing large datasets in a distributed and
parallel manner. The execution involves the following steps:

1. Input Split
 Purpose: Divides large input data into smaller chunks for parallel processing.
 How It Works: Data is split into fixed-size blocks (e.g., 64 MB). Each block is
processed by a mapper.
 Example: A log file is divided into chunks like lines 1–1000, 1001–2000, etc.

2. Record Reader
 Purpose: Converts raw input data into key-value pairs for processing.
 How It Works: Converts each line of the input split into (key, value) format.
 Example:
o Line 5: Key = 5, Value = This is an example line.
3. Mapping Phase (MAP)
 Purpose: Processes the key-value pairs and generates intermediate key-value pairs.
 How It Works: Applies the user-defined map() function.
 Example:
o Input: (line_number, "apple banana apple")
o Output: ("apple", 1), ("banana", 1), ("apple", 1)

4. Combine Phase (Optional)


 Purpose: Reduces data transfer by aggregating results locally.
 How It Works: Acts as a “mini-reducer” at the mapper node.
 Example:
o Input: ("apple", 1), ("apple", 1), ("banana", 1)
o Output: ("apple", 2), ("banana", 1)

5. Shuffle and Sort


 Purpose: Groups and organizes data for reducers.
 How It Works:
o Shuffle: Transfers intermediate key-value pairs to reducers.
o Sort: Groups and orders keys before reducing.
 Example:
o Input from mappers:
 Mapper 1: ("apple", 2), ("banana", 1)
 Mapper 2: ("apple", 1), ("orange", 1)
o After shuffle and sort:
 Reducer 1: ("apple", [2, 1])
 Reducer 2: ("banana", [1]), ("orange", [1])

6. Reducing Phase (REDUCE)


 Purpose: Aggregates grouped data to produce final output.
 How It Works: Applies the user-defined reduce() function.
 Example:
o Input: ("apple", [2, 1])
o Output: ("apple", 3)

7. Output
 Purpose: Writes final key-value pairs to Hadoop Distributed File System (HDFS).
 How It Works: Each reducer writes results to separate HDFS files.
 Example:
o Reducer 1: ("apple", 3)
o Reducer 2: ("banana", 2), ("orange", 1)

Overall Example: Word Count


Input Text:
"apple banana apple orange banana apple"
1. Input Split:
o Split into chunks: (1, "apple banana apple"), (2, "orange banana apple")
2. Mapper:
o Output: ("apple", 1), ("banana", 1), ("apple", 1), ...
3. Shuffle and Sort:
o Grouped: ("apple", [1, 1, 1]), ("banana", [1, 1]), ("orange", [1])
4. Reducer:
o Output: ("apple", 3), ("banana", 2), ("orange", 1)
5. Output:
o Written to HDFS as a file.
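The same word-count logic can be expressed as two small Python scripts and run with Hadoop Streaming (a sketch under that assumption; the file names mapper.py and reducer.py are illustrative):

# mapper.py: reads input lines from standard input and emits (word, 1) pairs
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py: receives lines already sorted by key and sums the counts per word
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

When submitted through the Hadoop Streaming jar, the framework pipes each input split through mapper.py and the sorted intermediate pairs through reducer.py, mirroring the phases described above.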

2) What is HIVE? Explain HIVE Architecture.


Hive was created by Facebook. It is a data warehousing tool and also a data store on top of
Hadoop. An enterprise uses a data warehouse as a large data repository designed to enable
tracking, managing, and analyzing data.
HIVE Architecture:
Hive architecture consists of several components that work together to provide the functionality of
data querying and management:
1. Hive Server (Thrift):
o The Hive Server is an optional service that allows remote clients to interact with
Hive.
o It exposes a simple API to execute HiveQL statements.

o Clients can send requests in different programming languages and retrieve results.

2. Hive CLI (Command Line Interface):


o This is the most commonly used interface to interact with Hive.

o It allows users to run HiveQL commands directly.

o Hive CLI can also run in "local mode," in which queries use local storage
instead of the Hadoop Distributed File System (HDFS); otherwise queries run against the Hadoop cluster.
3. Web Interface (HWI Server):
o Hive can also be accessed using a web browser.

o The HWI server must be running on a designated node to enable web access.

o You can use the URL http://hadoop:<port number>/hwi to access Hive through the
web interface.
4. Metastore:
o The Metastore is the system catalog in Hive.

o It stores important metadata like table schemas, database structures, columns, data
types, and mappings to HDFS locations.
o Every component of Hive interacts with the Metastore to get or store information.
5. Hive Driver:
o The Hive Driver manages the lifecycle of a HiveQL statement.

o It handles tasks like compilation, optimization, and execution of HiveQL queries.

Hive Installation:
To install Hive, the following software packages are required:
1. Java Development Kit (JDK): Needed for compiling and running Java code.
2. Hadoop: Hive runs on top of Hadoop, so Hadoop must be installed.
3. Hive (Compatible version): Make sure to use a version of Hive that is compatible with the
JDK. For instance, Hive 1.2 and onward supports Java 1.7 or newer.
In summary, Hive provides a simple and efficient way to query large datasets stored in Hadoop. It is
an essential tool for working with Big Data in a scalable and user-friendly way.
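To illustrate the Thrift-based Hive Server mentioned above, here is a minimal Python sketch assuming the third-party PyHive library and a HiveServer2 instance on its default port; the host, table, and column names are placeholders:

from pyhive import hive  # third-party client for the Hive Thrift service (assumed installed)

conn = hive.connect(host="localhost", port=10000)  # HiveServer2 default port
cursor = conn.cursor()

# Execute a HiveQL statement remotely; table and column names are illustrative
cursor.execute("SELECT word, COUNT(*) AS cnt FROM words GROUP BY word")
for row in cursor.fetchall():
    print(row)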

3) Explain Pig architecture for scripts dataflow and processing

Pig Architecture for Script Dataflow and Processing:


Pig is a platform built on top of Hadoop that allows users to process large data sets using a
language called Pig Latin. The architecture of Pig handles the execution of scripts and
optimizes the data processing flow to make it efficient. Let's break down the architecture and
the flow of data in simple terms.

Ways to Execute Pig Scripts:


There are three main ways to execute Pig scripts:
1. Grunt Shell:
o It is an interactive shell used to run Pig commands directly.
o You type Pig Latin commands, and Pig executes them immediately in an
interactive environment.
2. Script File:
o You can write a series of Pig commands in a script file.
o These commands are executed on the Pig Server, which processes the script
step by step.
3. Embedded Script:
o Sometimes, you need to use functions that are not built-in with Pig.
o You can create your own functions (called UDFs, or User Defined Functions)
in other programming languages (like Java or Python).
o These UDFs can then be embedded within the Pig Latin script.

Execution Flow in Pig:


1. Parser:
o The first step when you run a Pig script is parsing the script.
o The parser checks for any syntax errors or type errors in the script.
o It then generates a Directed Acyclic Graph (DAG), which is a graphical
representation of the logic in the script.
 Acyclic means that the data flows in one direction without any cycles.
 Nodes in this graph represent the operations (like filters or joins), and
edges represent the data flowing between these operations.
2. Optimizer:
o Once the DAG is generated, the next step is optimization.
o The optimizer improves the efficiency of the operations to reduce the data
flow and make the script run faster.
o Some key optimization techniques are:
 Push Up Filter: Moves filter operations earlier to reduce the amount
of data being processed.
 Push Down For Each Flatten: Delays certain operations to keep data
smaller.
 Column Pruner: Removes unnecessary columns to reduce the size of
the data.
 Map Key Pruner: Removes unnecessary map keys to optimize
storage.
 Limit Optimizer: If there is a limit operation, it reduces the number of
records early to save processing time.
3. Compiler:
o After optimization, the next step is compilation.
o The compiler takes the optimized DAG and translates it into MapReduce
jobs.
o These MapReduce jobs correspond to the logical steps of the Pig script and
prepare them for execution.
4. Execution Engine:
o Finally, the execution engine submits the generated MapReduce jobs for
execution on the Hadoop cluster.
o These jobs run across the cluster and perform the actual computation on the
data.
o Once the jobs complete, the final result is generated.

Flow Summary:
To summarize the entire flow of executing a Pig script:
1. Parse the Script: Check for errors and generate the DAG (data flow graph).
2. Optimize the DAG: Apply optimization techniques to reduce unnecessary processing
and data flow.
3. Compile the Jobs: Convert the optimized DAG into MapReduce jobs.
4. Execute the Jobs: The execution engine runs the MapReduce jobs on the cluster, and
the result is produced.

In simple terms, Pig takes a script, checks and optimizes it, then runs it on a cluster to process
large amounts of data efficiently. This process is designed to minimize unnecessary work and
speed up the execution by using techniques like filtering early and removing unnecessary
data.
4) Explain Key Value pairing in Map Reduce.

Key-Value Pairing in MapReduce


In the MapReduce programming model, key-value pairs are the fundamental data structures used to
process and generate data. The key-value pairing mechanism is crucial for the functioning of both the
Map and Reduce phases. Here’s an in-depth explanation:
1. Input Data: The input data for a MapReduce job is typically stored in the Hadoop Distributed
File System (HDFS). This data is divided into smaller chunks called Input Splits.
2. Input Split: Each Input Split is processed by a single Mapper. The Input Format defines how
the input data is split into these Input Splits.
3. Record Reader: The Record Reader is responsible for reading the data from the Input Split
and converting it into key-value pairs. The key is usually a unique identifier for the data
record, and the value is the actual data content.
4. Map Task: The Mapper processes each key-value pair independently. The Map function
takes a single key-value pair as input and produces zero or more intermediate key-value pairs.
The output of the Map function is a set of intermediate key-value pairs.
5. Shuffle and Sort: The intermediate key-value pairs are then shuffled and sorted by the
framework. The shuffle phase ensures that all values associated with the same key are sent to
the same Reducer.
6. Reduce Task: The Reducer processes the sorted key-value pairs. The Reduce function takes a
key and a list of values associated with that key as input and produces zero or more final key-
value pairs as output.
7. Output: The final key-value pairs are written to the output files, which are stored in HDFS.
Data flow summary (as typically shown in a MapReduce data-flow diagram):
1. Input data stored on HDFS: the raw data in the Hadoop Distributed File System.
2. Input Format: defines how the input data is split into Input Splits.
3. Input Split: each split of the input data, processed by one Mapper.
4. Record Reader: reads data from an Input Split and converts it into key-value pairs.
5. Key-Value: the key-value pairs generated by the Record Reader.
6. Map Task: processes the key-value pairs and produces intermediate key-value pairs, which are then shuffled, sorted, and passed to the Reducers.
MODULE 5
1) What is Machine Learning? Explain different types of Regression
Analysis.
Machine Learning (ML) is a subset of artificial intelligence (AI) that involves developing algorithms
and statistical models that enable computers to perform tasks without being explicitly programmed.
These systems learn from data and make predictions or decisions based on it. ML is widely used in
fields such as natural language processing, computer vision, healthcare, and finance. ML can be
classified into supervised learning, unsupervised learning, and reinforcement learning.
Types of Regression Analysis:
1. Simple Linear Regression: models the relationship between one independent variable and a continuous dependent variable with a straight line, y = b0 + b1x.
2. Multiple Linear Regression: extends linear regression to two or more independent variables.
3. Polynomial (Non-linear) Regression: fits a curve when the relationship between the variables is not a straight line.
4. Logistic Regression: predicts a categorical outcome (e.g., yes/no) by estimating the probability of each class.
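A minimal sketch of simple linear regression, assuming scikit-learn (NumPy alone would also work); the sample points are illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: X is the independent variable, y the dependent variable
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

model = LinearRegression().fit(X, y)          # learn slope and intercept
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction for x=6:", model.predict([[6]])[0])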
Euclidean Distance: the straight-line distance between two points, d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2); it is the distance measure most commonly used in algorithms such as K-means clustering.
2) Explain with neat diagram K-means clustering.

K-Means Clustering Algorithm (Simplified Explanation)


K-Means Clustering is an unsupervised learning algorithm used to group data into clusters
based on similarities.
It is a centroid-based algorithm, where each cluster is associated with a centroid.

How It Works (Simple Steps):


1. Decide the number of clusters (K):
Example: If K=2, divide data into 2 groups.
2. Pick random points (centroids):
These are the starting centers for the clusters.
3. Assign data points to the closest centroid:
Every data point goes to the group with the nearest centroid.
4. Recalculate centroids:
Find the average position of all data points in each cluster to get new centroids.
5. Repeat steps 3 and 4 until the centroids stop moving (no changes).
6. Done: Your data is now divided into K meaningful clusters.

Key Points:
 It groups similar data points together.
 The process adjusts until the clusters are as good as possible.
 You choose the number of clusters (K) before starting.

Example Use Case:


Organizing customers into groups based on their buying habits.
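A minimal sketch of the algorithm in practice, assuming scikit-learn's KMeans implementation; the two-dimensional points below stand in for customer buying-habit features:

import numpy as np
from sklearn.cluster import KMeans

# Each row is one customer described by two features (e.g., visits, spend)
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster labels:", kmeans.labels_)        # which cluster each point belongs to
print("centroids:", kmeans.cluster_centers_)    # final centroid positions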

3) Explain Naïve Bayes Theorem with example.

Naïve Bayes is a classification technique based on Bayes' theorem, which relates the probability of a class given the observed data to the probability of the data given the class:
P(class | data) = [ P(data | class) × P(class) ] / P(data)
 P(class) is the prior probability, P(data | class) is the likelihood, and P(class | data) is the posterior probability.
 It is called "naïve" because it assumes all features are independent of each other given the class, so the likelihood can be computed as a simple product of per-feature probabilities.
 Example (spam filtering): to classify an email, multiply the prior probability of spam, P(spam), by the probability of each word in the email appearing in spam messages, P(word | spam); do the same for "not spam" and choose the class with the higher resulting probability.

4) Explain the five phases in a text mining process pipeline.

Five phases in the text mining process:

1. Text Pre-processing
This is the first step to clean and organize the text so that it can be analyzed easily.
 Clean Up: Fix typos, remove irrelevant parts like extra spaces or special characters.
 Tokenization: Break sentences into words.
 POS Tagging: Label words as nouns, verbs, etc.
 Word Sense Disambiguation: Decide a word's meaning based on its context (e.g.,
"bank" as a river or a financial institution).
 Parsing: Understand sentence structure (who is doing what).

2. Feature Generation
Turn text into a format that machines can understand.
 Bag of Words: Count how many times each word appears.
 Stemming: Simplify words to their root (e.g., "running" → "run").
 Remove Stop Words: Get rid of common words like "the" or "and."
 Vector Space Model (VSM) with TF-IDF (Term Frequency-Inverse Document
Frequency): highlights important words in a document by considering their
uniqueness across the collection.
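A minimal sketch of the bag-of-words / TF-IDF feature generation step above, assuming scikit-learn's TfidfVectorizer; the three short documents are illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "big data needs big storage",
    "data mining finds patterns in data",
    "storage and mining of text data",
]

# Builds the vocabulary, removes English stop words, and weights terms by TF-IDF
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs)    # sparse document-term matrix
print(vectorizer.get_feature_names_out())  # the feature words
print(matrix.toarray().round(2))           # TF-IDF weight of each word per document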
3. Feature Selection
Pick only the most useful parts of the text to save time and improve results.
 Dimensionality Reduction: Remove repetitive or unimportant features.
 N-grams: Look at combinations of words (e.g., "good morning").
 Noise Removal: Eliminate unnecessary or irrelevant data.

4. Data Mining Techniques


Use methods to find patterns and insights in the text.
 Clustering (Unsupervised): Group similar texts together without knowing their type
beforehand.
 Classification (Supervised): Categorize text (e.g., spam or not spam).
 Trend Analysis: Find changes or patterns over time (e.g., how topics in news
change).

5. Analyzing Results
Check and use the output for decision-making.
 Evaluate: See if the results are useful.
 Interpret: Understand what the results mean.
 Visualize: Create graphs or charts to make it easy to explain.
 Apply Insights: Use the findings to improve things like strategies or processes.


5) Explain Web Usage Mining.


Web Usage Mining – Simplified Explanation
Web usage mining helps us understand how users interact with websites by analyzing click
patterns (how they navigate). It collects and studies the data generated by user activities, like
visiting pages or clicking links.
It has three main phases:
1. Pre-processing
o Prepares raw data collected from websites (e.g., server logs) to make it ready
for analysis.
o Identifies users (using cookies or logins), sessions (all pages visited by one
user), and page references (specific pages visited).
2. Pattern Discovery
o Applies smart techniques (like statistics, machine learning, and data mining) to
find hidden patterns in user behavior.
o Common techniques include:
 Statistics: Finds the most visited pages, time spent on a page, and
helps make marketing decisions.
 Association Rules: Finds pages frequently visited together, even if not
linked.
 Clustering: Groups users or pages with similar behaviors (e.g.,
showing similar browsing habits).
 Classification: Groups users into predefined categories (e.g., users
aged 18–23 watching certain movies).
 Sequential Patterns: Analyzes the order of pages visited (e.g., trails
followed by users).
3. Pattern Analysis
o Filters out unimportant patterns from the discovered ones.
o Methods include:
 Query Mechanisms: Like SQL for extracting patterns.
 Data Cubes: For multi-dimensional views of data.
 Visualization: Using graphs or color coding to see trends.
Why is it Useful?
 Helps websites improve user experience.
 Supports marketing strategies by analyzing popular products or pages.
 Helps create personalized content for users.
 Optimizes website structure by understanding user navigation.
By following these three phases, businesses can better understand users and make their
websites more efficient and user-friendly!

MODULE 2
1) What is Hadoop? Explain Hadoop eco-system with neat diagram
Hadoop
Hadoop is an Apache open-source framework written in Java that allows distributed
processing of large datasets across clusters of computers using simple programming models.
The Hadoop framework application works in an environment that provides
distributed storage and computation across clusters of computers. Hadoop is designed to
scale up from single server to thousands of machines, each offering local computation and
storage.
Key features of Hadoop:
 Scalability: Can handle increasing amounts of data by adding more nodes.
 Fault Tolerance: Automatically handles node failures.
 Distributed Storage and Processing: Splits data and stores it across multiple nodes.
 Open Source: Freely available for use and modification.

Hadoop Ecosystem
The Hadoop Ecosystem consists of several components that work together for efficient big
data storage, processing, and analysis.

Core Components of Hadoop:


1. HDFS (Hadoop Distributed File System):
o Provides distributed storage.
o Stores large files across multiple nodes in blocks (default: 128 MB).
2. YARN (Yet Another Resource Negotiator):
o Handles resource allocation and job scheduling for applications running on
Hadoop.
3. MapReduce:
o Programming model for processing large datasets in parallel across clusters.
o Processes data in two stages: Map (filtering and sorting) and Reduce
(aggregating results).

Supporting Components in the Hadoop Ecosystem:


1. Hive:
o A SQL-like query language for structured data.
2. Pig:
o A high-level platform for writing scripts to process data.
3. HBase:
o A NoSQL database for real-time data.
4. Sqoop:
o Transfers data between Hadoop and relational databases.
5. Flume:
o Collects and transfers streaming data into Hadoop.
6. Oozie:
o A workflow scheduler for managing Hadoop jobs.
7. Zookeeper:
o Manages and coordinates distributed systems.
8. Mahout:
o Provides machine learning algorithms for clustering and classification.

Conclusion:
Hadoop, with its ecosystem, is a powerful solution for managing big data. It supports various
tasks, from storage to real-time processing and machine learning. The seamless integration of
its components allows for flexible and efficient handling of massive datasets, making it a
cornerstone in modern big data analytics.

2) Explain with neat diagram HDFS Components.


HDFS (Hadoop Distributed File System) Components – Simplified
HDFS is a core part of Hadoop used to store huge amounts of data across many machines. It
uses a master-slave structure with the following key components:

1. NameNode (Master)
 Acts like a manager of the file system.
 Stores metadata (details about files, such as names, locations, and permissions).
 Tracks where data blocks are stored and manages replication.
 Handles tasks like opening, closing, and renaming files.
 Detects failures in Data Nodes and ensures data is still accessible.

2. Data Nodes (Slaves)


 These are the machines that actually store the data blocks.
 Perform tasks like reading and writing data as instructed by the NameNode.
 Regularly send heartbeat signals to the NameNode to confirm they are working fine.

3. Client
 The user interface to interact with HDFS.
 Requests the NameNode for metadata (e.g., where to read/write data).
 Sends data directly to DataNodes for storage or retrieval.

4. Replication
 To ensure fault tolerance, data blocks are replicated (copied) to multiple DataNodes.
 The default replication factor is 3, meaning each block is stored on three different
machines.
 Replicas are spread across different racks (groups of machines) to reduce data loss
during hardware failures.

How Data is Written in HDFS


1. File Creation:
o The client tells the NameNode to create a file.
o The Name Node decides how many blocks the file will need and assigns Data
Nodes for storage.
2. Block Writing:
o Data is divided into blocks and sent to the chosen DataNodes.
o Replicas of each block are created on other DataNodes for reliability.
3. Acknowledgment:
o After replication, DataNodes confirm the process is complete.
o The client informs the NameNode that the file is successfully stored.

Summary
HDFS is like a giant digital library where the NameNode is the librarian keeping track of
books (data blocks), and the DataNodes are the shelves that store the books. It ensures data is
always available, even if some shelves fail, by making multiple copies of each book. This
setup allows HDFS to handle huge amounts of data efficiently and reliably.
3) Write short note on Apache hive.
What is Apache Hive?
Apache Hive is a tool that works on Hadoop to make analyzing Big Data easy. Instead of
writing complicated programs, Hive lets you use a language like SQL (called HiveQL) to
query and manage data stored in Hadoop’s HDFS.

Key Features of Hive


1. SQL-Like Interface:
o Use HiveQL, which is similar to SQL, to query and analyze data without
writing MapReduce programs.
2. Scalable:
o Can process huge amounts of data spread across many computers.
3. Custom Functions:
o You can create your own functions (UDFs) for specific tasks.
4. Data Support:
o Works with structured and semi-structured data stored in Hadoop (e.g.,
formats like ORC, Parquet, Avro).
5. Integration:
o Works well with other Hadoop tools like Spark and Pig.

How is Hive Used?


 Data Analysis: For business intelligence and decision-making.
 ETL (Extract, Transform, Load): Process large amounts of data for various
pipelines.
 Reporting: Generate useful insights with HiveQL queries.

How Does Hive Work?


1. Accessing Hive:
o Type hive in the terminal to start. You’ll see the hive> prompt to run
commands.
2. Example Commands:
o Create a table: CREATE TABLE pokes (foo INT, bar STRING);
o List tables: SHOW TABLES;
o Drop a table: DROP TABLE pokes;
o Commands must end with a semicolon (;).
3. Execution:
o When you run a Hive query, it converts it into MapReduce or Tez jobs to
process the data in Hadoop.

Why Use Hive?


Hive is like a translator for Big Data. It converts SQL-like queries into Hadoop processes,
making it easier for non-programmers to work with data. This is especially useful for
businesses handling huge datasets for analysis and reporting.

4) Explain Apache Sqoop Import and Export methods.


5) Explain Apache Oozie with neat diagram.
6) Explain YARN application framework.
Classification of Data: Structured, Semi-Structured, Multi-Structured, and
Unstructured
Data can be organized into four main types:

1. Structured Data
 What it is: Data organized in rows and columns like tables in a database.
 Features:
o Easy to add, delete, or modify.
o Supports indexing, which makes searching faster.
o Can scale (grow or shrink) based on need.
o Provides security through encryption.
 Examples: Bank records, spreadsheets, customer details in a database.

2. Semi-Structured Data
 What it is: Data that has some structure, but not as rigid as rows and columns.
 Features:
o Uses tags or markers to organize the data.
o Does not follow strict rules like databases.
o Examples: XML files, JSON files (commonly used in web development).

3. Multi-Structured Data
 What it is: A mix of different types of data (structured, semi-structured, and
unstructured).
 Features:
o Found in complex systems, such as streaming data or data from multiple
sensors.
o Examples include social media interactions or web server logs.

4. Unstructured Data
 What it is: Data without any specific format or structure.
 Features:
o Not in tables or databases.
o Stored in files like TXT or CSV.
o Sometimes contains internal structure, like email content.
o Needs extra effort to understand relationships or patterns.
 Examples: Images, videos, emails, or plain text files.

In simple terms:
 Structured: Well-organized, like a table.
 Semi-Structured: Partially organized, with tags or markers.
 Multi-Structured: A mix of all types.
 Unstructured: No specific organization.

1. Key-Value Pair Data Stores


 Definition:
A schema-less database where data is stored as key-value pairs.
Example: A dictionary where a key fetches a corresponding value.
 Features:
o High performance, scalability, and flexibility.
o Keys are simple strings; values can store any data (e.g., text, images, files).
 Functions:
o Get(key): Fetches value for a given key.
o Put(key, value): Adds or updates a key-value pair.
o Delete(key): Removes a key-value pair.
 Advantages:
1. Supports any type of data in the value field.
2. Highly scalable and reliable.
3. Low operational cost.
 Limitations:
o No advanced queries like searching or indexing within values.
o Managing unique keys can be challenging.
 Uses:
o Image/document storage, lookup tables, and query caches.

Significance of Web Graphs


A web graph is a representation of the World Wide Web as a graph where:
 Nodes represent web pages.
 Edges represent hyperlinks connecting these web pages.
Significance:
1. Efficient Search Engine Functioning:
o Search engines like Google use web graphs to crawl and index web pages.
o Algorithms like PageRank rank pages based on their importance in the graph.
2. Understanding Relationships:
o Web graphs show how different web pages are interlinked, revealing the
structure of the web.
o Helps in identifying clusters of related websites or communities.
3. Improved Navigation:
o Identifies the shortest or most relevant paths between web pages.
o Enhances user experience by suggesting related content.
4. Detecting Spam or Malware:
o Analyzing unusual patterns in the graph helps detect malicious websites or
spam links.
5. Efficient Resource Allocation:
o Helps in optimizing web servers and content delivery networks (CDNs) based
on web traffic patterns.
6. Big Data Analysis:
o Useful in social media analysis, recommendation systems, and finding trends
in interconnected data.

In simple terms, web graphs are essential for making the web more organized, searchable,
and user-friendly while also enabling advanced web analytics and security.
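A minimal sketch of ranking pages in a small web graph, assuming the NetworkX library; the node names are hypothetical pages and each edge is a hyperlink:

import networkx as nx

# Directed web graph: an edge A -> B means page A links to page B
G = nx.DiGraph()
G.add_edges_from([
    ("home", "products"), ("home", "blog"),
    ("blog", "products"), ("products", "checkout"),
    ("checkout", "home"),
])

# PageRank scores pages by their importance in the link structure
scores = nx.pagerank(G, alpha=0.85)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))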
