ch02 Mapreduce

The document provides information about using slides from a course on mining massive datasets and requests that users include an attribution message if using a significant portion of the slides. Specifically, it states that other teachers are welcome to use the slides verbatim or modify them for their own needs. It requests that if a significant portion of the slides is used, the user should include a message attributing the source or linking to the course website.


Note to other teachers and users of these slides: We would be delighted if you found our
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org

Mining of Massive Datasets

Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
• Much of the course will be devoted to large-scale computing for data mining
• Challenges:
  ◦ How to distribute computation?
  ◦ Distributed/parallel programming is hard
• Map-reduce addresses all of the above
  ◦ Google's computational/data manipulation model
  ◦ Elegant way to work with big data

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
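[Figure: a single machine with CPU, memory, and disk. Machine learning and statistics are labelled at the CPU/memory level; "classical" data mining is labelled at the disk level.]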

• 20+ billion web pages x 20 KB = 400+ TB
• 1 computer reads 30-35 MB/sec from disk
  ◦ ~4 months to read the web
• ~1,000 hard drives to store the web
• Takes even more to do something useful with the data!
• Today, a standard architecture for such problems is emerging:
  ◦ Cluster of commodity Linux nodes
  ◦ Commodity network (Ethernet) to connect them
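
The storage and read-time figures above can be checked with a little arithmetic. The following is only a back-of-the-envelope sketch: it assumes decimal units (1 KB = 1,000 bytes), a single machine reading sequentially without interruption, and the helper name `days_to_read` is made up for this illustration.

```python
# Back-of-the-envelope check of the slide's numbers.
# Assumption: decimal units (1 KB = 10**3 bytes, 1 TB = 10**12 bytes).
KB = 10**3
MB = 10**6
TB = 10**12

pages = 20 * 10**9        # 20 billion web pages
page_size = 20 * KB       # 20 KB per page
total_bytes = pages * page_size


def days_to_read(num_bytes, mb_per_sec):
    """Days one machine needs to read num_bytes at a constant disk rate."""
    seconds = num_bytes / (mb_per_sec * MB)
    return seconds / 86_400  # seconds per day


print(f"Total size: {total_bytes / TB:.0f} TB")
for rate in (30, 35):
    print(f"At {rate} MB/sec: about {days_to_read(total_bytes, rate):.0f} days")

# Spreading the data over 1,000 drives gives the per-drive capacity needed.
print(f"Per drive: {total_bytes / 1_000 / TB:.1f} TB")
```

Under these assumptions the read time comes out between roughly 132 and 154 days, which matches the slide's "~4 months" to within rounding, and 1,000 drives would each need to hold about 0.4 TB.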
[Figure: cluster architecture. Each rack contains 16-64 nodes, each with its own CPU, memory, and disk, connected by a switch. There is 1 Gbps between any pair of nodes in a rack, and a 2-10 Gbps backbone between racks.]

In 2011 it was guesstimated that Google had about 1M machines (http://bit.ly/Shh0RO).

• Large-scale computing for data mining problems on commodity hardware
• Challenges:
  ◦ How do you distribute computation?
  ◦ How can we make it easy to write distributed programs?
  ◦ Machines fail:
    ◦ One server may stay up 3 years (1,000 days)
    ◦ If you have 1,000 servers, expect to lose 1 per day
    ◦ People estimated Google had ~1M machines in 2011
    ◦ At that scale, about 1,000 machines fail every day!
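
The failure estimate above follows from a simple rate calculation. The sketch below assumes, as a simplification not stated on the slide, that every server fails on average once per 1,000 days and that failures are spread evenly over time; the function name `expected_failures_per_day` is hypothetical.

```python
def expected_failures_per_day(num_servers, mean_lifetime_days=1_000):
    """Expected number of server failures per day.

    Assumption: each server fails on average once every
    mean_lifetime_days, so its daily failure rate is 1 / mean_lifetime_days.
    """
    return num_servers / mean_lifetime_days


for n in (1, 1_000, 1_000_000):
    print(f"{n:>9,} servers: {expected_failures_per_day(n):g} failures/day")
```

With 1,000 servers this gives one failure per day, and with the estimated 1M machines it gives about 1,000 failures per day, which is why failure handling has to be built into the programming model rather than treated as an exception.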