
EXERCISE-9

Perform an efficient semi-join in MapReduce. Hint: perform the semi-join by having the mappers load a Bloom filter from the Distributed Cache, and then filter results from the actual MapReduce data source by performing membership queries against the Bloom filter to determine which data source records should be emitted to the reducers.
Description:
A semi-join is a join operation that returns only the rows from one table that have matching records in another table. In MapReduce, one efficient way to perform a semi-join is the map-side join technique.

The map-side join technique distributes the smaller table to every Map task (typically through the Distributed Cache) and holds it in memory, so that each mapper performs the join locally. This reduces the amount of data that must be shuffled across the network, and therefore the overall processing time.
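As a hedged sketch of this technique (separate from the Bloom-filter solution developed below), a mapper might preload the smaller table in setup() and probe it for every input record. The file name small_table.csv, the comma-separated key,payload layout, and the assumption that the file was shipped via the Distributed Cache with a symlink of that name are all illustrative:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    // In-memory copy of the smaller table, keyed by the join attribute
    private final Map<String, String> smallTable = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // "small_table.csv" is assumed to have been added to the Distributed
        // Cache and symlinked into the task's working directory
        try (BufferedReader reader = new BufferedReader(new FileReader("small_table.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",", 2);
                if (fields.length == 2) {
                    smallTable.put(fields[0], fields[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Join locally: emit only records whose key appears in the small table
        String[] fields = value.toString().split(",", 2);
        if (smallTable.containsKey(fields[0])) {
            context.write(new Text(fields[0]), value);
        }
    }
}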

Another efficient way to perform a semi-join in MapReduce is with a Bloom filter. A Bloom filter is a compact probabilistic data structure used to test whether an element is a member of a set. By loading a Bloom filter in the mapper phase, we can discard non-matching records early, so that far fewer records ever reach the reduce phase.
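For intuition, here is a minimal, self-contained sketch using Hadoop's bundled org.apache.hadoop.util.bloom.BloomFilter class; the 10,000-bit size, 5 hash functions, and the example keys are arbitrary choices for the demo:

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class BloomFilterDemo {
    public static void main(String[] args) {
        // 10,000-bit vector with 5 hash functions (arbitrary demo sizes)
        BloomFilter filter = new BloomFilter(10_000, 5, Hash.MURMUR_HASH);
        filter.add(new Key("user42".getBytes()));

        // Guaranteed true: Bloom filters never produce false negatives
        System.out.println(filter.membershipTest(new Key("user42".getBytes())));

        // Usually false, but may be true: false positives are possible
        System.out.println(filter.membershipTest(new Key("user99".getBytes())));
    }
}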

Here are the steps involved in performing a semi-join using a Bloom filter:

1. Create a Bloom filter over the join attribute of the smaller dataset, i.e., the dataset whose keys determine which records survive (a build sketch follows this list).
2. Load the Bloom filter in the mapper phase using the Distributed Cache.
3. Map each record of the other dataset and test whether its join attribute exists in the Bloom filter.
4. If the join attribute is present in the Bloom filter, emit the record to the reducer phase.
5. In the reducer phase, process only the records from the first dataset and ignore the records from the second dataset.
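The exercise does not prescribe how the filter in step 1 is built. One possibility, sketched below under the assumption that the smaller dataset is a comma-separated text file on HDFS whose first field is the join key, is a small standalone driver that populates the filter and serializes it to HDFS, so that step 2 can ship it through the Distributed Cache:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class BloomFilterBuilder {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path smallDataset = new Path(args[0]); // the dataset whose keys define the join
        Path filterFile = new Path(args[1]);   // where the serialized filter is written

        // Sizes are illustrative; see the sizing discussion at the end of this exercise
        BloomFilter filter = new BloomFilter(1_000_000, 5, Hash.MURMUR_HASH);

        // Add the join attribute (first comma-separated field) of every record
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(smallDataset)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                filter.add(new Key(line.split(",")[0].getBytes()));
            }
        }

        // BloomFilter implements Writable, so it can be serialized directly
        try (FSDataOutputStream out = fs.create(filterFile)) {
            filter.write(out);
        }
    }
}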

Here is the pseudo-code to perform a semi-join using a Bloom filter:

Setup:
    // Load the Bloom filter from the Distributed Cache
    bloomFilter = loadBloomFilter()

Map(key, value):
    // value is a record from the dataset
    joinAttribute = extractJoinAttribute(value)
    if bloomFilter.contains(joinAttribute):
        emit(joinAttribute, value)

Reduce(key, values):
    // values is the list of records grouped under this join key
    for each value in values:
        // Check whether the record is from the first dataset
        if value is from the first dataset:
            // Process the record
            ...

Cleanup:
    // Release the resources used by the Bloom filter
    bloomFilter.cleanup()

By performing the semi-join in this way, we can significantly reduce the amount of data that needs to be shuffled and sorted, leading to faster processing times. However, Bloom filters have a nonzero false-positive probability, which means that some non-matching records may be erroneously included in the output. Bloom filters should therefore be used with care, and the false-positive rate should be controlled, by sizing the bit vector and hash count appropriately, to ensure the correctness of the output.
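The standard sizing formulas make this trade-off concrete: for n expected keys and a target false-positive rate p, a Bloom filter needs roughly m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. A small helper sketch, where n and p are illustrative inputs:

public class BloomSizing {
    // Standard Bloom filter sizing: given n expected keys and a target
    // false-positive rate p, compute the bit-vector size m and hash count k.
    public static void main(String[] args) {
        long n = 1_000_000; // expected number of join keys (assumption)
        double p = 0.01;    // target false-positive rate (assumption)

        long m = (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
        int k = (int) Math.round((double) m / n * Math.log(2));

        System.out.printf("bits m = %d (%.1f MB), hash functions k = %d%n",
                m, m / 8.0 / 1_000_000, k);
    }
}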

Program:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

import java.io.IOException;
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SemiJoinMapReduce {

    // The size of the Bloom filter's bit vector
    private static final int BLOOM_FILTER_SIZE = 1000000;

    // The number of hash functions to use in the Bloom filter
    private static final int NUM_HASH_FUNCTIONS = 5;

    public static class BloomFilterMapper extends Mapper<LongWritable, Text, Text, Text> {

        // Initial sizes are superseded when the serialized filter is read in setup()
        private final BloomFilter bloomFilter =
                new BloomFilter(BLOOM_FILTER_SIZE, NUM_HASH_FUNCTIONS, Hash.MURMUR_HASH);

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Load the serialized Bloom filter from the Distributed Cache
            URI[] cacheFiles = context.getCacheFiles();
            if (cacheFiles == null || cacheFiles.length == 0) {
                throw new IOException("No Bloom filter file found in the Distributed Cache");
            }
            FileSystem fs = FileSystem.get(cacheFiles[0], context.getConfiguration());
            try (FSDataInputStream in = fs.open(new Path(cacheFiles[0]))) {
                bloomFilter.readFields(in);
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the input record into comma-separated fields
            String[] fields = value.toString().split(",");

            // The join attribute is assumed to be the first field
            String joinAttribute = fields[0];

            // Emit the record only if the join attribute passes the
            // Bloom filter's membership test
            if (bloomFilter.membershipTest(new Key(joinAttribute.getBytes()))) {
                context.write(new Text(joinAttribute), value);
            }
        }
    }

    public static class SemiJoinReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Collect the payload fields (everything after the join attribute);
            // records with more than one field are taken to be from the first
            // dataset, and the set removes duplicate payloads
            Set<String> firstDataset = new HashSet<>();
            for (Text value : values) {
                String[] fields = value.toString().split(",");
                if (fields.length > 1) {
                    firstDataset.add(Arrays.toString(Arrays.copyOfRange(fields, 1, fields.length)));
                }
            }

            // Emit one output record per distinct payload
            for (String value : firstDataset) {
                context.write(key, new Text(value));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Semi Join MapReduce");
        job.setJarByClass(SemiJoinMapReduce.class);
        job.setMapperClass(BloomFilterMapper.class);
        job.setReducerClass(SemiJoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Ship the pre-built, serialized Bloom filter to every mapper
        // through the Distributed Cache
        job.addCacheFile(new Path("/path/to/bloomfilter/file").toUri());

        // Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job to the cluster and wait for it to finish
        boolean success = job.waitForCompletion(true);

        // Print a message indicating whether the job was successful or not
        if (success) {
            System.out.println("Semi Join MapReduce job completed successfully!");
        } else {
            System.out.println("Semi Join MapReduce job failed!");
        }
    }
}
