BDA
Definitions:
"Big Data is a collection of data sets so large or complex that traditional
data processing applications are inadequate."
"It is data of a very large size, typically to the extent that its manipulation
and management present significant logistical challenges."
"Big Data refers to data sets whose size is beyond the ability of typical
database software tools to capture, store, manage, and analyze."
Summary
Cloud computing provides flexible and scalable solutions for data
storage, software, and computing resources. It offers three main types of
services—IaaS, PaaS, and SaaS—to meet various user needs, enabling
efficient and cost-effective computing solutions over the internet.
Here's an explanation of the CAP Theorem in simple terms:
CAP Theorem:
The CAP Theorem states that in any distributed database system, only two out
of the following three properties can be guaranteed simultaneously:
1. Consistency (C):
o All copies of data in the system must show the same data at the
same time.
o For example, if a sales record is updated in one part of the
database, the same change should reflect everywhere instantly.
2. Availability (A):
o The system must respond to every request, even during partial
failures.
o This means the system ensures that users can always access their
data.
3. Partition Tolerance (P):
o The system should continue to function even if there is a network
partition (e.g., a part of the network becomes unreachable).
o It ensures the database keeps running despite message losses or
network failures.
Key Insight:
A distributed database cannot achieve all three properties (C, A, P) at the
same time.
Developers must choose two out of the three, depending on the
requirements of their application.
Brewer’s Theorem:
Brewer’s CAP Theorem further explains that during a network failure, a
system has to decide:
1. Serve possibly stale (out-of-date) data so the system stays responsive (Availability priority, AP trade-off).
2. Refuse to answer until the latest copy of the data is available (Consistency priority, CP trade-off).
Conclusion:
The CAP Theorem helps in designing distributed systems by understanding the
trade-offs between Consistency, Availability, and Partition Tolerance. Each
application’s priorities (like speed or accuracy) guide the choice between CA,
AP, or CP models.
2. Document Stores
Definition: Store unstructured or semi-structured data as documents (e.g., JSON,
XML).
Features:
o Schema-less and hierarchical tree structures.
o Supports flexible schema changes.
Examples: MongoDB, CouchDB.
Uses:
o Storing office documents, forms, and inventory data.
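For instance, a single inventory record might be stored as a document like the one below. This is only an illustrative sketch written as a Python dictionary mirroring JSON; the field names are assumptions, not taken from MongoDB or CouchDB documentation.
# Illustrative document for one inventory item (field names are assumptions):
item = {
    "_id": "inv-1001",               # document identifier
    "name": "Stapler",
    "qty": 25,
    "tags": ["office", "supplies"],  # arrays are allowed without schema changes
    "supplier": {                    # nested sub-document (hierarchical tree structure)
        "name": "ACME Corp",
        "city": "Pune"
    }
}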
7. Graph Databases
Definition: Store data as interconnected nodes (objects) and edges (relationships).
Features:
o High flexibility for relationship-heavy data.
o Specialized query languages (e.g., SPARQL).
Examples: Neo4j, HyperGraph.
Uses:
o Social networks, pattern matching, and relationship-based queries.
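As a rough illustration of nodes and edges (not tied to Neo4j or SPARQL), a tiny social graph and a relationship-based query can be sketched in plain Python; the names and relationships below are made up.
# People as nodes, "friend" relationships as edges.
edges = {("alice", "bob"), ("bob", "carol")}

def friends_of(person):
    # Everyone directly connected to `person`, in either direction.
    return {b for a, b in edges if a == person} | {a for a, b in edges if b == person}

# Relationship-heavy query: friends-of-friends of Alice (excluding Alice herself).
print({f for friend in friends_of("alice") for f in friends_of(friend)} - {"alice"})
# -> {'carol'}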
Summary
NoSQL databases provide different patterns like key-value, document, columnar, and graph-
based storage, each suited for specific use cases. They ensure scalability, flexibility, and
efficiency, making them ideal for modern data-driven applications.
3) Explain Shared Nothing Architecture for Big Data tasks.
Shared-Nothing Architecture for Big Data Tasks
Definition:
Shared-Nothing (SN) is a cluster architecture where nodes do not share data with
each other.
It is used in distributed computing to connect independent nodes via a network.
Key Points:
1. Independence:
o Each node works independently without sharing memory or data, making it
self-sufficient.
2. Partitioning:
o Big Data is divided into shards (smaller pieces).
o Each shard is processed by a different node, enabling parallel processing.
3. Self-Healing:
o If a node or link fails, the system creates a new link to maintain operations.
4. No Network Contention:
o Nodes do not compete for shared resources, ensuring better performance.
5. Data Management:
o Each node maintains its own copy of data using a coordination protocol.
Examples:
Hadoop, Flink, Spark
Advantages:
High Scalability: Nodes can be added easily.
Fault Tolerance: System continues to function even if a node fails.
Efficient Parallel Processing: Multiple queries run simultaneously.
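A minimal sketch of the partitioning idea, assuming simple hash-based sharding; the node count and record names are illustrative and not taken from Hadoop or Spark.
# Hash-based sharding: each record lands on exactly one independent node.
NUM_NODES = 3
shards = {node: [] for node in range(NUM_NODES)}

records = ["user:1", "user:2", "user:3", "user:4", "user:5"]
for rec in records:
    shards[hash(rec) % NUM_NODES].append(rec)   # exact distribution varies per run

# Each node processes only its own shard, with no shared memory or data.
for node, shard in shards.items():
    print(f"node {node} independently processes {len(shard)} record(s)")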
Summary:
The Shared-Nothing architecture is ideal for Big Data tasks as it distributes data
across nodes, ensuring independence, fault tolerance, and scalability. It’s
widely used in tools like Hadoop and Spark for handling large-scale data
efficiently.
Summary:
Choose the model based on scale, performance, and fault tolerance needs:
Single-server deployment (SSD) for small systems.
Sharding for parallel processing.
Master-Slave for optimized reads.
Peer-to-Peer for high availability.
Features of MongoDB:
1. Database Structure:
o A database contains collections (like tables in SQL).
2. Collections:
o Stores documents (like rows in SQL).
3. Document Model:
o Data is stored in BSON (Binary JSON), which is flexible.
4. JSON-Like Storage:
o Stores data in a format similar to JSON, allowing complex data structures.
7. No Complex Joins:
o Doesn’t require complex joins, making it faster for big datasets.
8. Distributed Architecture:
o Data is spread across servers for scalability and high availability.
9. Real-Time Aggregation:
o Supports grouping, filtering, and analyzing data in real time.
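A minimal pymongo sketch of these ideas; it assumes a MongoDB server running on localhost, and the database, collection, and field names are illustrative.
# Requires: a running MongoDB instance and the pymongo package.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["shop"]            # database
orders = db["orders"]          # collection (like a table)

# Documents are stored as BSON; no fixed schema or joins are required.
orders.insert_many([
    {"item": "apple", "qty": 3, "city": "Pune"},
    {"item": "apple", "qty": 2, "city": "Mumbai"},
    {"item": "banana", "qty": 5, "city": "Pune"},
])

# Real-time aggregation: total quantity per item.
for row in orders.aggregate([{"$group": {"_id": "$item", "total": {"$sum": "$qty"}}}]):
    print(row)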
MODULE 4
1) Explain MapReduce execution steps with a neat diagram.
1. Input Split
Purpose: Divides large input data into smaller chunks for parallel processing.
How It Works: Data is split into fixed-size blocks (e.g., 64 MB). Each block is
processed by a mapper.
Example: A log file is divided into chunks like lines 1–1000, 1001–2000, etc.
2. Record Reader
Purpose: Converts raw input data into key-value pairs for processing.
How It Works: Converts each line of the input split into (key, value) format.
Example:
o Line 5: Key = 5, Value = This is an example line.
3. Mapping Phase (MAP)
Purpose: Processes the key-value pairs and generates intermediate key-value pairs.
How It Works: Applies the user-defined map() function.
Example:
o Input: (line_number, "apple banana apple")
o Output: ("apple", 1), ("banana", 1), ("apple", 1)
7. Output
Purpose: Writes final key-value pairs to Hadoop Distributed File System (HDFS).
How It Works: Each reducer writes results to separate HDFS files.
Example:
o Reducer 1: ("apple", 3)
o Reducer 2: ("banana", 2), ("orange", 1)
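The steps above can be mimicked in a few lines of plain Python. This is only an in-memory sketch of the map-shuffle-reduce flow for the word-count example, not actual Hadoop code.
from collections import defaultdict

splits = ["apple banana apple", "apple banana orange"]   # input splits (illustrative)

# MAP: emit an intermediate (word, 1) pair for every word.
intermediate = [(word, 1) for line in splits for word in line.split()]

# SHUFFLE/SORT: group the values by key.
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

# REDUCE: sum the counts for each word and write the final pairs.
for key in sorted(groups):
    print(key, sum(groups[key]))    # apple 3, banana 2, orange 1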
o Clients can send requests in different programming languages and retrieve results.
o When Hadoop itself runs in local (standalone) mode rather than on a cluster, the Hive CLI also operates in "local mode," using local storage instead of the Hadoop Distributed File System (HDFS).
3. Web Interface (HWI Server):
o Hive can also be accessed using a web browser.
o The HWI server must be running on a designated node to enable web access.
o You can use the URL http://hadoop:<port number>/hwi to access Hive through the
web interface.
4. Metastore:
o The Metastore is the system catalog in Hive.
o It stores important metadata like table schemas, database structures, columns, data
types, and mappings to HDFS locations.
o Every component of Hive interacts with the Metastore to get or store information.
5. Hive Driver:
o The Hive Driver manages the lifecycle of a HiveQL statement.
Hive Installation:
To install Hive, the following software packages are required:
1. Java Development Kit (JDK): Needed for compiling and running Java code.
2. Hadoop: Hive runs on top of Hadoop, so Hadoop must be installed.
3. Hive (Compatible version): Make sure to use a version of Hive that is compatible with the
JDK. For instance, Hive 1.2 and onward supports Java 1.7 or newer.
In summary, Hive provides a simple and efficient way to query large datasets stored in Hadoop. It is
an essential tool for working with Big Data in a scalable and user-friendly way.
Flow Summary:
To summarize the entire flow of executing a Pig script:
1. Parse the Script: Check for errors and generate the DAG (data flow graph).
2. Optimize the DAG: Apply optimization techniques to reduce unnecessary processing
and data flow.
3. Compile the Jobs: Convert the optimized DAG into MapReduce jobs.
4. Execute the Jobs: The execution engine runs the MapReduce jobs on the cluster, and
the result is produced.
In simple terms, Pig takes a script, checks and optimizes it, then runs it on a cluster to process
large amounts of data efficiently. This process is designed to minimize unnecessary work and
speed up the execution by using techniques like filtering early and removing unnecessary
data.
4) Explain key-value pairing in MapReduce.
Key Points:
It groups similar data points together.
The process adjusts until the clusters are as good as possible.
You choose the number of clusters (K) before starting.
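The points above describe the K-means clustering idea; a minimal one-dimensional sketch in plain Python could look like this. The data points, the value of K, and the number of iterations are all illustrative.
# K = 2 clusters, chosen before starting; points and initial centres are made up.
points = [1.0, 1.2, 0.8, 8.0, 8.5, 9.1]
centers = [0.0, 10.0]

for _ in range(10):                                  # repeat until the clusters settle
    clusters = [[], []]
    for p in points:                                 # assign each point to its nearest centre
        clusters[min((0, 1), key=lambda i: abs(p - centers[i]))].append(p)
    centers = [sum(c) / len(c) if c else centers[i]  # move each centre to its cluster mean
               for i, c in enumerate(clusters)]

print(centers)                                       # roughly [1.0, 8.5]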
1. Text Pre-processing
This is the first step to clean and organize the text so that it can be analyzed easily.
Clean Up: Fix typos, remove irrelevant parts like extra spaces or special characters.
Tokenization: Break sentences into words.
POS Tagging: Label words as nouns, verbs, etc.
Word Sense Disambiguation: Decide a word's meaning based on its context (e.g.,
"bank" as a river or a financial institution).
Parsing: Understand sentence structure (who is doing what).
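A minimal sketch of the clean-up and tokenization steps in plain Python; the sample sentence is made up, and POS tagging, word sense disambiguation, and parsing would normally need an NLP library, so they are omitted here.
import re

raw = "The   bank  approved the loan!! "
cleaned = re.sub(r"[^a-z\s]", "", raw.lower())    # fix case, drop special characters
cleaned = re.sub(r"\s+", " ", cleaned).strip()    # remove extra spaces
tokens = cleaned.split()                          # tokenization: break the sentence into words
print(tokens)                                     # ['the', 'bank', 'approved', 'the', 'loan']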
2. Feature Generation
Turn text into a format that machines can understand.
Bag of Words: Count how many times each word appears.
Stemming: Simplify words to their root (e.g., "running" → "run").
Remove Stop Words: Get rid of common words like "the" or "and."
VSM (Vector Space Model) with TF-IDF (Term Frequency - Inverse Document Frequency): highlight important words in a document by considering their uniqueness across documents.
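A minimal sketch of bag-of-words, stop-word removal, and one common TF-IDF formulation; the two toy documents and the stop-word list are illustrative.
import math
from collections import Counter

docs = [["the", "bank", "approved", "the", "loan"],
        ["the", "river", "bank", "flooded"]]
stop_words = {"the", "and"}

# Bag of words: count each remaining word per document.
bags = [Counter(w for w in doc if w not in stop_words) for doc in docs]

def tf_idf(word, bag):
    tf = bag[word] / sum(bag.values())             # how often the word appears in this document
    df = sum(1 for b in bags if word in b)         # how many documents contain the word
    return tf * math.log(len(bags) / df)           # rare words get higher weight

print(tf_idf("loan", bags[0]), tf_idf("bank", bags[0]))   # 'loan' outweighs the common 'bank'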
3. Feature Selection
Pick only the most useful parts of the text to save time and improve results.
Dimensionality Reduction: Remove repetitive or unimportant features.
N-grams: Look at combinations of words (e.g., "good morning").
Noise Removal: Eliminate unnecessary or irrelevant data.
5. Analyzing Results
Check and use the output for decision-making.
Evaluate: See if the results are useful.
Interpret: Understand what the results mean.
Visualize: Create graphs or charts to make it easy to explain.
Apply Insights: Use the findings to improve things like strategies or processes.
MODULE 2
1) What is Hadoop? Explain the Hadoop ecosystem with a neat diagram.
Hadoop
Hadoop is an Apache open-source framework, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models. The Hadoop framework works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Key features of Hadoop:
Scalability: Can handle increasing amounts of data by adding more nodes.
Fault Tolerance: Automatically handles node failures.
Distributed Storage and Processing: Splits data and stores it across multiple nodes.
Open Source: Freely available for use and modification.
Hadoop Ecosystem
The Hadoop Ecosystem consists of several components that work together for efficient big
data storage, processing, and analysis.
Conclusion:
Hadoop, with its ecosystem, is a powerful solution for managing big data. It supports various
tasks, from storage to real-time processing and machine learning. The seamless integration of
its components allows for flexible and efficient handling of massive datasets, making it a
cornerstone in modern big data analytics.
1. NameNode (Master)
Acts like a manager of the file system.
Stores metadata (details about files, such as names, locations, and permissions).
Tracks where data blocks are stored and manages replication.
Handles tasks like opening, closing, and renaming files.
Detects failures in Data Nodes and ensures data is still accessible.
3. Client
The user interface to interact with HDFS.
Requests the NameNode for metadata (e.g., where to read/write data).
Sends data directly to DataNodes for storage or retrieval.
4. Replication
To ensure fault tolerance, data blocks are replicated (copied) to multiple DataNodes.
The default replication factor is 3, meaning each block is stored on three different
machines.
Replicas are spread across different racks (groups of machines) to reduce data loss
during hardware failures.
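A toy sketch of the rack-aware placement idea; the node and rack names are made up, and the real HDFS placement policy is more involved than this.
import itertools

REPLICATION_FACTOR = 3
racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}   # DataNodes grouped by rack

def place_replicas(block_id):
    # Alternate between racks so no single rack holds every copy of the block.
    ordered = [dn for pair in itertools.zip_longest(*racks.values()) for dn in pair if dn]
    return ordered[:REPLICATION_FACTOR]

print(place_replicas("blk_001"))   # ['dn1', 'dn3', 'dn2'] - copies land on both racks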
Summary
HDFS is like a giant digital library where the NameNode is the librarian keeping track of
books (data blocks), and the DataNodes are the shelves that store the books. It ensures data is
always available, even if some shelves fail, by making multiple copies of each book. This
setup allows HDFS to handle huge amounts of data efficiently and reliably.
3) Write a short note on Apache Hive.
What is Apache Hive?
Apache Hive is a tool that works on Hadoop to make analyzing Big Data easy. Instead of
writing complicated programs, Hive lets you use a language like SQL (called HiveQL) to
query and manage data stored in Hadoop’s HDFS.
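For example, a HiveQL query over a hypothetical sales table might look like the string below; the table and column names are assumptions, not from the Hive documentation. Such a query is typically submitted through the Hive CLI (for instance with hive -e) or another Hive client.
# Illustrative HiveQL held in a Python string; 'sales', 'city', 'year' are made-up names.
query = """
SELECT city, COUNT(*) AS total_orders
FROM sales
WHERE year = 2024
GROUP BY city
"""
print(query)   # in practice this text would be passed to the Hive CLI, e.g. hive -e "<query>"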
1. Structured Data
What it is: Data organized in rows and columns like tables in a database.
Features:
o Easy to add, delete, or modify.
o Supports indexing, which makes searching faster.
o Can scale (grow or shrink) based on need.
o Provides security through encryption.
Examples: Bank records, spreadsheets, customer details in a database.
2. Semi-Structured Data
What it is: Data that has some structure, but not as rigid as rows and columns.
Features:
o Uses tags or markers to organize the data.
o Does not follow strict rules like databases.
o Examples: XML files, JSON files (commonly used in web development).
3. Multi-Structured Data
What it is: A mix of different types of data (structured, semi-structured, and
unstructured).
Features:
o Found in complex systems, such as streaming data or data from multiple
sensors.
o Examples include social media interactions or web server logs.
4. Unstructured Data
What it is: Data without any specific format or structure.
Features:
o Not in tables or databases.
o Stored in files like TXT or CSV.
o Sometimes contains internal structure, like email content.
o Needs extra effort to understand relationships or patterns.
Examples: Images, videos, emails, or plain text files.
In simple terms:
Structured: Well-organized, like a table.
Semi-Structured: Partially organized, with tags or markers.
Multi-Structured: A mix of all types.
Unstructured: No specific organization.
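As a small illustration (all values are made up), the same customer information could appear in structured, semi-structured, and unstructured form:
structured = ("C101", "Asha", 25000.0)   # fixed columns: id, name, account balance
semi_structured = '{"id": "C101", "name": "Asha", "tags": ["premium"]}'   # JSON with tags
unstructured = "Asha emailed support on Monday asking about her account balance."   # free text

print(structured)
print(semi_structured)
print(unstructured)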
In simple terms, web graphs are essential for making the web more organized, searchable,
and user-friendly while also enabling advanced web analytics and security.