Cloud Computing Unit4

Unit 4: Cloud Programming Models (12 Hrs.)
• Thread programming
• Task programming
• MapReduce programming
• Parallel efficiency of MapReduce
• Enterprise batch processing using MapReduce
• Comparisons between Thread, Task and MapReduce
Concurrent Computing: Thread Programming

Throughput computing focuses on delivering high volumes of computation in the form of transactions.
Throughput computing is realized by means of multiprocessing and multithreading.
Multiprocessing is the execution of multiple programs on a single machine.
Multithreading relates to the possibility of multiple instruction streams within the same program.
Parallelism for single-machine computation
• Multiprocessing: the use of multiple processing units within a single machine.
• Asymmetric multiprocessing: involves the concurrent use of different processing units that are specialized to perform different functions.
• Non-uniform memory access (NUMA): defines a specific architecture for accessing a shared memory between processors.
• Clustered multiprocessing: the use of multiple computers joined together as a single virtual computer.
• Multiprocessor and especially multicore technologies are now of fundamental importance because of the physical constraints imposed on frequency scaling.
Programming applications with threads

• Threading serves as a support for the design of parallel and distributed algorithms.
• What is a thread? A thread is the smallest unit of execution that can be scheduled within a process.
• A context switch between threads is cheaper than one between processes, so switching between threads is the preferred practice.
Thread APIs
Portable Operating System Interface for Unix (POSIX) threads:
- The POSIX 1003.1c standard (IEEE Std 1003.1c-1995) defines the thread interfaces that should be available to application programmers for developing portable multithreaded applications.
- Operations: creation of threads with attributes, termination of a thread, and waiting for thread completion (the join operation).
- Semaphores, condition variables, and reader-writer locks support synchronization among threads.
POSIX Threads: Programming point of view
• A thread identifies a logical sequence of
instructions.
• A thread is mapped to a function that contains
the sequence of instructions to execute.
• A thread can be created, terminated, or joined.
• A thread has a state that determines its current condition: executing, stopped, terminated, or waiting for I/O.
• Threads share the memory of the process.
• The API is declared in the pthread.h header.
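The create/join life cycle described above can be sketched in Python's threading module, which on most platforms is implemented on top of POSIX threads. This is an illustrative sketch, not pthread C code; the `worker` function and its computation are made up for the example.

```python
import threading

# Each thread is mapped to a function containing the instructions to execute.
def worker(name, results):
    # Threads share the memory of the process, so all workers
    # can write into the same dictionary.
    results[name] = sum(i * i for i in range(1000))

results = {}
threads = [threading.Thread(target=worker, args=(f"t{i}", results))
           for i in range(4)]
for t in threads:
    t.start()   # creation/start of the thread
for t in threads:
    t.join()    # waiting for thread completion (the join operation)

print(len(results))  # prints 4: every thread wrote its result
```

The same pattern in C would use `pthread_create` and `pthread_join` from pthread.h.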
Threading support in java and .NET
• Operations performed on threads: start, stop, suspend, resume, abort, sleep, join, and interrupt.
• Java: the java.util.concurrent package.
• .NET: the Parallel Extensions framework.
Techniques For parallel Computation with Threads

1. Decomposition: a useful technique that aids in understanding whether a problem can be divided into components (or tasks) that can be executed concurrently.
2. The two main decomposition/partitioning techniques are domain and functional decomposition.
Domain Decomposition
• Domain decomposition is the process of identifying patterns of functionally repetitive, but independent, computation on data.
• The master-slave model:
- The system is divided into two major code components.
- One code segment contains the decomposition and coordination logic.
- Another code segment contains the repetitive computation to perform.
- A master thread executes the first code segment.
- As a result of the master thread's execution, as many slave threads as needed are created to execute the repetitive computation.
- The collection of the results from each of the slave threads and the eventual composition of the final result are performed by the master thread.
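The master-slave model above can be sketched in Python: the master decomposes the work, the slaves run the repetitive computation, and the master collects and composes the final result. The function names, the squaring computation, and the number of slaves are all illustrative choices.

```python
import threading
import queue

def slave(tasks, results):
    # Code segment with the repetitive computation to perform.
    while True:
        item = tasks.get()
        if item is None:          # sentinel: no more work
            break
        results.put(item * item)  # the repetitive computation

def master(data, n_slaves=3):
    # Code segment with the decomposition and coordination logic.
    tasks, results = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=slave, args=(tasks, results))
               for _ in range(n_slaves)]
    for w in workers:
        w.start()
    for item in data:             # decomposition: one task per datum
        tasks.put(item)
    for _ in workers:
        tasks.put(None)           # tell each slave to stop
    for w in workers:
        w.join()
    # collection and composition of the final result by the master
    return sorted(results.get() for _ in range(len(data)))

print(master([1, 2, 3, 4]))  # prints [1, 4, 9, 16]
```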
Functional Decomposition
• Functional Decomposition is the process of
identifying functionally distinct but independent
computations.
• The focus is on the type of computation rather
than on the data manipulated by the
computation.
• F(x) = sin(x) + cos(x) + tan(x)
• The program computes the sine, cosine, and tangent functions in three separate threads and aggregates the results.
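The F(x) example above can be sketched directly: each functionally distinct computation (sine, cosine, tangent) runs in its own thread, and the results are aggregated afterward. The helper names are illustrative.

```python
import math
import threading

def compute(fn, x, out, key):
    # One functionally distinct computation per thread.
    out[key] = fn(x)

def F(x):
    out = {}
    threads = [
        threading.Thread(target=compute, args=(math.sin, x, out, "sin")),
        threading.Thread(target=compute, args=(math.cos, x, out, "cos")),
        threading.Thread(target=compute, args=(math.tan, x, out, "tan")),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Aggregation of the three partial results.
    return out["sin"] + out["cos"] + out["tan"]
```

Note the decomposition here follows the *type* of computation, not the data: all three threads receive the same x.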
Task programming
• The Task Programming Model is a high-level multithreaded
programming model.
• It is designed to allow Maple code to be written that takes advantage
of multiple processors, while avoiding much of the complexity of
traditional multithreaded programming.
• In computing, a task is a unit of execution or a unit of work. The term
is ambiguous; precise alternative terms include process, light-weight
process, thread (for execution), step, request, or query (for work).
• In the adjacent diagram, there are queues of incoming work to do and
outgoing completed work, and a thread pool of threads to perform this
work.
• Either the work units themselves or the threads that perform the work
can be referred to as "tasks", and these can be referred to respectively
as requests/responses/threads, incoming tasks/completed tasks/threads
(as illustrated), or requests/responses/tasks.
• When you work directly with threads, a Thread serves as both a
unit of work and the mechanism for executing it. In the executor
framework, the unit of work and the execution mechanism are
separate. The key abstraction is the unit of work, which is called
a task.
Here are a few advantages of using the Task Programming
Model:
• No explicit threading, users create Tasks, not Threads.
• Maple schedules the Tasks to the processors so that the code
scales to the number of available processors.
• Multiple algorithms written using the Task Programming
Model can run at the same time without significant
performance impact.
• Complex problems can be solved without requiring
traditional synchronization tools such as Mutexes and
Condition Variables.
• When such synchronization tools are not used, the code cannot deadlock.
• The Task functions are simple and the model mirrors
conventional function calling.
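The separation of the unit of work (the task) from the execution mechanism can be sketched with Python's executor framework: the user submits tasks, and the pool's scheduler maps them onto its worker threads. This is an analogy to the model described above, not Maple or Java code; the `task` function is illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def task(n):
    # The unit of work: no explicit threading appears here.
    return n * n

# The execution mechanism: a pool of worker threads managed by the runtime.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(task, n) for n in range(8)]  # create tasks, not threads
    results = [f.result() for f in futures]             # mirrors function calling

print(results)  # prints [0, 1, 4, 9, 16, 25, 36, 49]
```

Because each task is an independent function call with its own inputs and output, no mutexes or condition variables are needed.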
Map reduce Programming
Big Data:
• Big data refers to data sets that are too large or
complex to be dealt with by traditional data-
processing application software.
• Data with many entries (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate.
• Big data analysis challenges include capturing
data, data storage, data analysis,
search, sharing, transfer, visualization, querying,
updating, information privacy, and data source.
• Big data was originally associated with three key
concepts: volume, variety, and velocity.
• The analysis of big data presents challenges in sampling; traditionally, analysis could rely only on observations drawn from samples rather than on the full data set.
• Thus a fourth concept, veracity, refers to the quality or insightfulness of the data. Without sufficient investment in expertise for big data veracity, the volume and variety of data can produce costs and risks that exceed an organization's capacity to create and capture value from big data.
What is an Example of Big Data?
Following are some of the Big Data examples-
• The New York Stock Exchange is an example
of Big Data that generates about one
terabyte of new trade data per day.
• Social Media
The statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day.
This data is mainly generated through photo and video uploads, message exchanges, comments, etc.
• A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
Types Of Big Data
• Following are the types of Big Data:
 Structured
 Unstructured
 Semi-structured
1. Structured
• Any data that can be stored, accessed, and processed in the form of a fixed format is termed 'structured' data.
• Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and deriving value from it.
• However, we now foresee issues when the size of such data grows to a huge extent; typical sizes are in the range of multiple zettabytes.
2. Unstructured
• Any data with unknown form or structure is classified as unstructured data.
• In addition to its huge size, unstructured data poses multiple challenges in terms of processing it to derive value.
• A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc.
• Nowadays organizations have a wealth of data available to them, but unfortunately they don't know how to derive value from it, since this data is in its raw, unstructured form.
3. Semi-structured
• Semi-structured data can contain both forms of data.
• Semi-structured data appears structured in form, but it is not actually defined by, e.g., a table definition in a relational DBMS.
• An example of semi-structured data is data represented in an XML file.
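The XML example can be made concrete: the document carries its own structure through tags, but no schema fixes the fields in advance, so records may differ. The student data below is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Semi-structured: tags describe the data, but there is no fixed
# relational schema, and fields may be missing from some records.
doc = """
<students>
  <student><name>Asha</name><age>21</age></student>
  <student><name>Bibek</name></student>
</students>
"""

records = [
    {child.tag: child.text for child in student}
    for student in ET.fromstring(doc)
]
print(records)  # prints [{'name': 'Asha', 'age': '21'}, {'name': 'Bibek'}]
```

A relational table would reject the second record or force a NULL; the XML simply omits the field.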
Characteristics Of Big Data
Big data can be described by the following
characteristics:
• Volume
• Variety
• Velocity
• Variability
(i) Volume –
• The name Big Data itself is related to a size which
is enormous.
• Size of data plays a very crucial role in
determining value out of data.
• Also, whether a particular data can actually be
considered as a Big Data or not, is dependent
upon the volume of data.
• Hence, ‘Volume’ is one characteristic which
needs to be considered while dealing with Big
Data solutions.
(ii) Variety –
• The next aspect of Big Data is its variety.
• Variety refers to heterogeneous sources and the
nature of data, both structured and unstructured.
• During earlier days, spreadsheets and databases
were the only sources of data considered by most
of the applications.
• Nowadays, data in the form of emails, photos,
videos, monitoring devices, PDFs, audio, etc. are
also being considered in the analysis applications.
• This variety of unstructured data poses certain
issues for storage, mining and analyzing data.
(iii) Velocity –
• The term ‘velocity’ refers to the speed of generation
of data. How fast the data is generated and processed
to meet the demands, determines real potential in the
data.
• Big Data Velocity deals with the speed at which data
flows in from sources like business processes,
application logs, networks, and social media sites,
sensors, Mobile devices, etc.
• The flow of data is massive and continuous.
(iv) Variability –
This refers to the inconsistency which can be shown by
the data at times, thus hampering the process of being
able to handle and manage the data effectively.
Map-reduce programming
• MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
• A MapReduce program is composed of a map procedure,
which performs filtering and sorting (such as sorting students
by first name into queues, one queue for each name), and
a reduce method, which performs a summary operation (such
as counting the number of students in each queue, yielding
name frequencies).
• The "MapReduce System" (also called "infrastructure" or
"framework") orchestrates the processing by marshalling the
distributed servers, running the various tasks in parallel,
managing all communications and data transfers between the
various parts of the system, and providing
for redundancy and fault tolerance.
Hadoop as Map reduce programming
model for big data.
• Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to maintain and store big data.
• Hadoop works on the MapReduce programming algorithm that was introduced by Google.
• Today lots of big-brand companies use Hadoop in their organizations to deal with big data, e.g., Facebook, Yahoo, Netflix, eBay, etc.
Hadoop Architecture Mainly consists of 4
components.
• MapReduce
• HDFS(Hadoop distributed File System)
• YARN (Yet Another Resource Negotiator)
• Common Utilities or Hadoop Common
Explanation
1. MapReduce
• MapReduce is a processing model that runs on top of the YARN framework.
• The major feature of MapReduce is performing distributed processing in parallel in a Hadoop cluster, which makes Hadoop work so fast.
• When you are dealing with big data, serial processing is no longer practical.
• MapReduce has mainly 2 tasks which are divided phase-wise: in the first phase Map is utilized, and in the next phase Reduce is utilized.
• The Map() function breaks the data blocks into tuples, which are nothing but key-value pairs.
• These key-value pairs are then sent as input to Reduce(). The Reduce() function combines these tuples based on their key value, forms a set of tuples, and performs some operation such as sorting, a summation-type job, etc., which is then sent to the final output node. Finally, the output is obtained.
• The data processing is always done in the Reducer, depending upon the business requirement of that industry.
• This is how first Map() and then Reduce() are utilized, one after the other.
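The Map-then-Reduce flow above can be sketched as a single-process word count, the classic MapReduce example. This is a simulation of the model, not Hadoop code: the grouping step stands in for the framework's shuffle, and the function names are illustrative.

```python
from collections import defaultdict

def map_fn(line):
    # Map(): break the input into tuples, i.e. (key, value) pairs.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce(): a summation-type job over all values for one key.
    return (key, sum(values))

def mapreduce(lines):
    grouped = defaultdict(list)
    for line in lines:                  # map phase
        for key, value in map_fn(line):
            grouped[key].append(value)  # shuffle: group pairs by key
    return dict(reduce_fn(k, v) for k, v in sorted(grouped.items()))

print(mapreduce(["big data big cluster", "big data"]))
# prints {'big': 3, 'cluster': 1, 'data': 2}
```

In Hadoop the same map and reduce functions would run as parallel tasks over data blocks on many machines.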
Hadoop HDFS
• The Hadoop Distributed File System (HDFS) is Hadoop's storage layer. Housed on multiple servers, data is divided into blocks based on file size.
• These blocks are then distributed and stored across slave machines.
• HDFS in the Hadoop architecture divides large data into different blocks.
• Each block contains 128 MB of data by default and is replicated three times.
• Replications operate under two rules:
• Two identical blocks cannot be placed on the same
DataNode
• When a cluster is rack aware, all the replicas of a block
cannot be placed on the same rack
There are three components of the Hadoop
Distributed File System:
• NameNode (a.k.a. master node): contains metadata in RAM and on disk
• Secondary NameNode: contains a copy of the NameNode's metadata on disk
• DataNode (a.k.a. slave node): contains the actual data in the form of blocks
Hadoop YARN
• Hadoop YARN (Yet Another Resource
Negotiator) is the cluster resource management
layer of Hadoop and is responsible for resource
allocation and job scheduling.
• Introduced in the Hadoop 2.0 version, YARN is
the middle layer between HDFS and MapReduce
in the Hadoop architecture.
The elements of YARN include:
• ResourceManager (one per cluster)
• ApplicationMaster (one per application)
• NodeManagers (one per node)
Resource Manager
• The Resource Manager manages resource allocation in the cluster and is responsible for tracking how many resources are available in the cluster and each node manager's contribution. It has two main components:
1. Scheduler: allocates resources to the various running applications and schedules resources based on the applications' requirements; it doesn't monitor or track the status of the applications
2. Application Manager: accepts job submissions from the client, and monitors and restarts application masters in case of failure
Application Master
• Application Master manages the resource needs
of individual applications and interacts with the
scheduler to acquire the required resources.
• It connects with the node manager to execute and
monitor tasks.
Node Manager
• Node Manager tracks running jobs and sends
signals (or heartbeats) to the resource manager to
relay the status of a node.
• It also monitors each container's resource utilization.
MapReduce Job Execution
• The input data is stored in the HDFS and read using an input
format.
• The file is split into multiple chunks based on the size of the file and
the input format.
• The default chunk size is 128 MB but can be customized.
• The record reader reads the data from the input splits and forwards
this information to the mapper.
• The mapper breaks the records in every chunk into a list of data
elements (or key-value pairs).
• The combiner works on the intermediate data created by the map
tasks and acts as a mini reducer to reduce the data.
• The partitioner decides how many reduce tasks will be required to
aggregate the data.
• The data is then sorted and shuffled based on their key-value pairs
and sent to the reduce function.
• Based on the output format decided by the reduce function, the
output data is then stored on the HDFS.
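The splitting step above is simple arithmetic: a file is cut into chunks of the split size, with a final partial chunk if the size does not divide evenly. A small sketch, assuming the default 128 MB chunk size mentioned above; the function name is illustrative.

```python
import math

def num_splits(file_size_mb, split_mb=128):
    # Each full 128 MB produces one split; any remainder adds one more.
    return max(1, math.ceil(file_size_mb / split_mb))

print(num_splits(1000))  # prints 8: seven full 128 MB splits plus one partial
print(num_splits(50))    # prints 1: small files still occupy one split
```

The number of splits in turn determines the number of map tasks, as described in the task-allocation step below.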
Execution Flow
• Here are the 7 steps of the MapReduce execution. Some of the steps are optional, depending on the type of problem being solved as well as the implementation used.
• Input splitting
• Task allocation
• Map phase
• Combiner phase
• Partition phase
• Shuffle & Sort phase
• Reduce phase
1. Input splitting
• Input data should be BIG to benefit from MR processing
• In order to parallelize data processing, the input data (a file or files) is split into large chunks, called "splits"
• Split size is a user-defined value; you can choose your own split size based on your volume of data
2. Task allocation
• The number of resulting splits will determine
the number of Mappers needed to process your
data
• MR will create that number of instances of
your Map code, and will assign splits for each
Map job
• One split can be processed by one Map job
only
3. Map Phase
• Each Map job reads its assigned split of data
• It parses it as a list of <key, value> pairs
• For each <key, value> pair it invokes its map
function (your code)
• The output of the map function is another <key,
value> pair
• The output of map function is called
“intermediate results” and is buffered in the
Mapper memory
4. Combiner Phase
• Combiner is optional
• It can pre-aggregate results of the Map
function
• This can drastically reduce the amount of data
sent over the network to reducers
• Sorting also can be done in this phase — on
the in-memory buffer data only
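The combiner's effect can be shown on a small in-memory buffer: pre-aggregating the mapper's (key, 1) pairs shrinks the data before it is sent to the reducers. A minimal sketch; the buffer contents are invented for the example.

```python
from collections import Counter

# Intermediate results buffered in the mapper's memory: one pair per word.
buffer = [("big", 1), ("data", 1), ("big", 1), ("big", 1), ("data", 1)]

# Combiner: pre-aggregate locally, exactly like a mini reducer.
# Counting keys works here because every buffered value is 1.
combined = list(Counter(k for k, _ in buffer).items())

print(len(buffer), "->", len(combined))  # prints 5 -> 2
```

Only two pairs, ("big", 3) and ("data", 2), now need to cross the network instead of five.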
5. Partitioning Phase
• If there are more than one Reducer — Partitioning has to happen
• Partitioner is invoked when:
○ Mapper's memory buffer is full or
○ when Mapper is done and
○ after Combiner potentially pre-aggregated the intermediate
results
○ before the <key, value> pairs from memory are written onto
the local disk
• Partitioner calculates hash values of each <key> — either using
default or your own logic — based on the number of partitions
used
• The total number of partitions is the same as the number of
reduce tasks for the job. Thus, this controls which reducer each
key is sent to
• It then groups and writes all <key, value> into separate locations
on disk — separate per each partition
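The hash-based routing described above can be sketched as follows. A deterministic checksum (CRC32) stands in for the partitioner's hash function so the example is reproducible; real frameworks use their own hash logic, and the names here are illustrative.

```python
import zlib

def partition(key, num_reducers):
    # Hash of the key modulo the number of reduce tasks: this decides
    # which reducer every occurrence of the key is sent to.
    return zlib.crc32(key.encode()) % num_reducers

pairs = [("big", 1), ("data", 1), ("big", 1), ("cluster", 1)]
by_reducer = {}
for key, value in pairs:
    by_reducer.setdefault(partition(key, 2), []).append((key, value))
# All pairs sharing a key end up in the same partition, so each
# reducer sees every value for the keys assigned to it.
```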
6. Shuffle/Sort Phase
• Up until now all data processing was local to the
Map jobs — all writes were to local disks only
• Shuffle and Sort phase is where the actual data
sending over the network is done (via HTTP,
usually)
• The work here is done on all results from the Map
job — which can be multiple local files generated
from each in-memory buffer
• All these local files are merged, and another
round of sorting is performed, per each partition
• Then each partition data is sent over the network
to the corresponding Reducer
7. Reduce Phase
• Data from all Mappers will come to one or more Reducers, based on the partitioning
• There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition
• Since there may be multiple [locally] sorted files for the same keys arriving from multiple Mappers, an additional sort and merge has to happen before the actual Reduce function is invoked
• If all data fits into the Reducer's memory, this sort and merge can happen in memory; if not, it has to happen on disk (local to the Reducer)
• Once done, the final list of <key, value> pairs is passed to the Reduce function (your code)
• The results are written out
• When determining an optimal number of Reducers, one rule of thumb is to aim for reducers that each run for five minutes or so and produce a large block of output. Too many reducers lead to lots of small files, which is very inefficient
Homework
• Explain the parallel efficiency of MapReduce.
• Compare thread, task, and MapReduce programming.
