Note to other teachers and users of these slides: We would be delighted if you found this our
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org
Mining
of
Massive
Datasets
Jure
Leskovec,
Anand
Rajaraman,
Jeff
Ullman
Stanford
University
http://www.mmds.org
¡ Much
of
the
course
will
be
devoted
to
large
scale
compu-ng
for
data
mining
¡ Challenges:
§ How
to
distribute
computa6on?
§ Distributed/parallel
programming
is
hard
¡ Map-‐reduce
addresses
all
of
the
above
§ Google’s
computa6onal/data
manipula6on
model
§ Elegant
way
to
work
with
big
data
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
hJp://www.mmds.org
2
CPU
Machine
Learning,
Statistics
Memory
“Classical”
Data
Mining
Disk
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
hJp://www.mmds.org
3
¡ 20+
billion
web
pages
x
20KB
=
400+
TB
¡ 1
computer
reads
30-‐35
MB/sec
from
disk
§ ~4
months
to
read
the
web
¡ ~1,000
hard
drives
to
store
the
web
¡ Takes
even
more
to
do
something
useful
with
the
data!
¡ Today,
a
standard
architecture
for
such
problems
is
emerging:
§ Cluster
of
commodity
Linux
nodes
§ Commodity
network
(ethernet)
to
connect
them
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
hJp://www.mmds.org
4
2-‐10
Gbps
backbone
between
racks
1
Gbps
between
Switch
any
pair
of
nodes
in
a
rack
Switch
Switch
CPU
CPU
CPU
CPU
Mem
…
Mem
Mem
…
Mem
Disk
Disk
Disk
Disk
Each
rack
contains
16-‐64
nodes
In 2011 it was guestimated that Google had 1M machines, http://bit.ly/Shh0RO
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
hJp://www.mmds.org
5
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
hJp://www.mmds.org
6
¡ Large-‐scale
compu-ng
for
data
mining
problems
on
commodity
hardware
¡ Challenges:
§ How
do
you
distribute
computa-on?
§ How
can
we
make
it
easy
to
write
distributed
programs?
§ Machines
fail:
§ One
server
may
stay
up
3
years
(1,000
days)
§ If
you
have
1,000
servers,
expect
to
loose
1/day
§ People
es6mated
Google
had
~1M
machines
in
2011
§ 1,000
machines
fail
every
day!
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
hJp://www.mmds.org
7