[go: up one dir, main page]

0% found this document useful (0 votes)
25 views7 pages

ch02 Mapreduce

The document provides information about using slides from a course on mining massive datasets and requests that users include an attribution message if using a significant portion of the slides. Specifically, it states that other teachers are welcome to use the slides verbatim or modify them for their own needs. It requests that if a significant portion of the slides is used, the user should include a message attributing the source or linking to the course website.

Uploaded by

jefferyleclerc
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
25 views7 pages

ch02 Mapreduce

The document provides information about using slides from a course on mining massive datasets and requests that users include an attribution message if using a significant portion of the slides. Specifically, it states that other teachers are welcome to use the slides verbatim or modify them for their own needs. It requests that if a significant portion of the slides is used, the user should include a message attributing the source or linking to the course website.

Uploaded by

jefferyleclerc
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 7

Note to other teachers and users of these slides: We would be delighted if you found this our

material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org

Mining  of  Massive  Datasets  


Jure  Leskovec,  Anand  Rajaraman,  Jeff  Ullman  
Stanford  University  
http://www.mmds.org    
¡ Much  of  the  course  will  be  devoted  to    
large  scale  compu-ng  for  data  mining  
¡ Challenges:  
§ How  to  distribute  computa6on?  
§ Distributed/parallel  programming  is  hard  

¡ Map-­‐reduce  addresses  all  of  the  above  


§ Google’s  computa6onal/data  manipula6on  model  
§ Elegant  way  to  work  with  big  data  

J.  Leskovec,  A.  Rajaraman,  J.  Ullman:  Mining  of  Massive  Datasets,  hJp://www.mmds.org   2  
CPU  
Machine  Learning,  Statistics  
Memory  

“Classical”  Data  Mining  


Disk  

J.  Leskovec,  A.  Rajaraman,  J.  Ullman:  Mining  of  Massive  Datasets,  hJp://www.mmds.org   3  
¡ 20+  billion  web  pages  x  20KB  =  400+  TB  
¡ 1  computer  reads  30-­‐35  MB/sec  from  disk  
§ ~4  months  to  read  the  web  
¡ ~1,000  hard  drives  to  store  the  web  
¡ Takes  even  more  to  do  something  useful    
with  the  data!  
¡ Today,  a  standard  architecture  for  such  
problems  is  emerging:  
§ Cluster  of  commodity  Linux  nodes  
§ Commodity  network  (ethernet)  to  connect  them  
J.  Leskovec,  A.  Rajaraman,  J.  Ullman:  Mining  of  Massive  Datasets,  hJp://www.mmds.org   4  
2-­‐10  Gbps  backbone  between  racks  
1  Gbps  between     Switch  
any  pair  of  nodes  
in  a  rack  
Switch   Switch  

CPU   CPU   CPU   CPU  

Mem   …   Mem   Mem   …   Mem  

Disk   Disk   Disk   Disk  

Each  rack  contains  16-­‐64  nodes  

In 2011 it was guestimated that Google had 1M machines, http://bit.ly/Shh0RO


J.  Leskovec,  A.  Rajaraman,  J.  Ullman:  Mining  of  Massive  Datasets,  hJp://www.mmds.org   5  
J.  Leskovec,  A.  Rajaraman,  J.  Ullman:  Mining  of  Massive  Datasets,  hJp://www.mmds.org   6  
¡ Large-­‐scale  compu-ng  for  data  mining    
problems  on  commodity  hardware  
¡ Challenges:  
§ How  do  you  distribute  computa-on?  
§ How  can  we  make  it  easy  to  write  distributed  
programs?  
§ Machines  fail:  
§ One  server  may  stay  up  3  years  (1,000  days)  
§ If  you  have  1,000  servers,  expect  to  loose  1/day  
§ People  es6mated  Google  had  ~1M  machines  in  2011  
§ 1,000  machines  fail  every  day!  
J.  Leskovec,  A.  Rajaraman,  J.  Ullman:  Mining  of  Massive  Datasets,  hJp://www.mmds.org   7  

You might also like