Lecture1 Introduction V0
Lecture 1 Introduction
to Big Data Analytics
HUANG Xiao
xiaohuang@comp.polyu.edu.hk
Arrangement
§ Prerequisites:
§ Basic statistics, probability, linear algebra
§ Basic Computer Science fundamentals
§ programming (Python)
§ data structures
§ algorithms
§ database systems
COMP4434 2
New Jersey Institute of Technology
What is big data?
Definition of big data
4Vs Characteristics of big data
§ Volume, Velocity, Variety, Veracity
Volume: scale of data
§ Data volume is increasing exponentially
§ Generated by a huge number of devices and sensors
§ The number of smartphone mobile network subscriptions
worldwide reached almost 6.4 billion in 2022
[Figure: Most popular social networks worldwide as of January 2023, ranked by number of active users (in millions). Platforms shown include Facebook, YouTube, WhatsApp, Instagram, WeChat, TikTok, Douyin, Telegram, Snapchat, Kuaishou, Sina Weibo, QQ, Twitter, and Pinterest. Bar values legible in the source: 2,958; 2,000; 2,000; 1,309; 1,051; 931; 715; 700; 635; 626; 584; 574; 556; 445; …]
Velocity: speed of data generation
§ Data is generated fast
§ Data needs to be processed fast
§ Online Data Analytics: late decisions mean missing opportunities
§ Example 1: based on your current location and your purchase history,
send promotions right now for the store next to you
§ Example 2: sensors monitor your activities and body and notify you if
there are abnormal measurements
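The sensor example above can be sketched as a small streaming check. This is a minimal illustration, not a method prescribed by the lecture: the function name `detect_abnormal`, the window size, the z-score threshold, and the heart-rate values are all assumptions made up for the example.

```python
from collections import deque
from statistics import mean, stdev


def detect_abnormal(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the recent window mean.

    Hypothetical helper: `window` and `threshold` are illustrative choices.
    """
    recent = deque(maxlen=window)  # keeps only the latest `window` readings
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            sigma = stdev(recent)
            # Flag the reading if it lies more than `threshold` standard
            # deviations away from the mean of the recent window.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
                continue  # keep the outlier out of the reference window
        recent.append(value)


# Made-up heart-rate stream (beats per minute) with one abnormal spike.
heart_rate = [72, 74, 71, 73, 75, 72, 140, 74, 73]
alerts = list(detect_abnormal(heart_rate))
print(alerts)
```

Each reading is handled as it arrives and only a fixed-size window is kept in memory, which is what makes this kind of check usable on fast data streams.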
Media usage in an internet minute as of April 2022
Activity                                 Amount per minute
Emails sent                              231,400,000
Cryptocurrency purchased (USD)           90,200,000
Texts sent                               16,000,000
Searches conducted on Google             5,900,000
Snaps shared on Snapchat                 2,430,000
Pieces of content shared on Facebook     1,700,000
Swipes on Tinder                         1,100,000
Hours streamed                           1,000,000
USD spent on Amazon                      443,000
USD sent on Venmo                        437,600
Tweets shared on Twitter                 347,200
Hours spent in Zoom meetings             104,600
USD spent on DoorDash                    76,400
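To get a feel for these rates, the per-minute figures can be converted to per-second and per-day rates. The snippet below is a simple arithmetic sketch using three of the values listed above; the dictionary and function names are illustrative assumptions.

```python
# Per-minute figures copied from the April 2022 list above.
PER_MINUTE = {
    "Emails sent": 231_400_000,
    "Searches conducted on Google": 5_900_000,
    "Tweets shared on Twitter": 347_200,
}


def per_second(amount_per_minute):
    """Convert a per-minute amount to a per-second rate."""
    return amount_per_minute / 60


def per_day(amount_per_minute):
    """Convert a per-minute amount to a per-day total (60 min x 24 h)."""
    return amount_per_minute * 60 * 24


for activity, amount in PER_MINUTE.items():
    print(f"{activity}: {per_second(amount):,.0f} per second, "
          f"{per_day(amount):,} per day")
```

For instance, 231,400,000 emails per minute corresponds to 333,216,000,000 emails per day.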
Variety: data in many forms
§ A single application may generate/collect many types
of data, e.g., the types of data stored in emails
§ Tabular data: attributes like subject, to, from
§ Text (in email body)
§ Image (in attachment)
§ Hyperlinks
§ Types of data
§ Relational Data (e.g., Tables)
§ Text Data (e.g., comments)
§ Semi-structured Data (e.g., XML)
§ Graph Data (e.g., social network)
§ What else?
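As a minimal sketch of how one record can carry several data types, the following Python code parses a made-up email stored as semi-structured XML and separates it into tabular attributes, body text, hyperlinks, and image attachments. The XML layout, the tag names, and the helper `split_email` are assumptions for illustration only.

```python
import re
import xml.etree.ElementTree as ET

# Made-up semi-structured record for one email (illustrative only).
EMAIL_XML = """
<email>
  <subject>Project meeting</subject>
  <from>alice@example.com</from>
  <to>bob@example.com</to>
  <body>See the agenda at https://example.com/agenda before Friday.</body>
  <attachment type="image">whiteboard.png</attachment>
</email>
"""


def split_email(xml_text):
    """Split one XML email into the data types listed on the slide."""
    root = ET.fromstring(xml_text.strip())
    # Tabular data: fixed attributes that fit naturally into table columns.
    tabular = {tag: root.findtext(tag) for tag in ("subject", "from", "to")}
    # Text data: the free-form email body.
    text = root.findtext("body", default="")
    # Hyperlinks: URLs found inside the body text.
    links = re.findall(r"https?://\S+", text)
    # Image data: attachments marked with type="image".
    images = [a.text for a in root.findall("attachment")
              if a.get("type") == "image"]
    return {"tabular": tabular, "text": text,
            "hyperlinks": links, "images": images}


parts = split_email(EMAIL_XML)
print(parts["tabular"])
print(parts["hyperlinks"], parts["images"])
```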
Data in Many Forms
[Figure: four example data forms and their representations: Text, Image, Graph, and Signal (Voice, Audio)]
§ Text: documents
§ 1: John also likes to watch football games.
§ 2: John likes to watch movies. Mary likes movies too.
§ Image: a grid of numeric pixel values
§ Graph: 4 nodes, stored as an adjacency matrix 𝑨 or an edge list 𝑬
𝑨    1    2    3    4
1    0    1.2  4.3  0
2    1.2  0    0.8  0
3    4.3  0.8  0    2.6
4    0    0    2.6  0
𝑬    Node1  Node2  Weight
1    1      2      1.2
2    1      3      4.3
3    2      3      0.8
4    3      4      2.6
§ Signal: each row is a tuple
signal1  signal2
13.58    7.24
12.11    12.50
13.49    8.66
11.25    10.98
14.57    13.75
13.22    9.02
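The edge list 𝑬 and the adjacency matrix 𝑨 on this slide describe the same weighted graph. The sketch below converts one into the other, assuming the graph is undirected (consistent with the symmetric matrix on the slide); the function name `adjacency_matrix` is an illustrative choice.

```python
# Edge list E from the slide: (node1, node2, weight), nodes numbered 1..4.
EDGES = [(1, 2, 1.2), (1, 3, 4.3), (2, 3, 0.8), (3, 4, 2.6)]


def adjacency_matrix(edges, num_nodes):
    """Build a symmetric weighted adjacency matrix from an edge list."""
    matrix = [[0.0] * num_nodes for _ in range(num_nodes)]
    for node1, node2, weight in edges:
        # Nodes are 1-based on the slide, list indices are 0-based.
        matrix[node1 - 1][node2 - 1] = weight
        matrix[node2 - 1][node1 - 1] = weight  # undirected: mirror the entry
    return matrix


A = adjacency_matrix(EDGES, num_nodes=4)
for row in A:
    print(row)
```

The edge list stores only the four existing edges, while the matrix stores all 16 entries; for large sparse graphs that difference in storage becomes significant.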
Veracity: uncertainty of data
§ Is the data accurate?
§ Measurement error
§ Human errors like typos in names/addresses
§ Does the data come from a reliable source?
§ What if data from different sources are not consistent?
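One way to probe these questions is to compare the same records across two sources and separate exact matches, probable typos, and genuine conflicts. The following is a minimal sketch using Python's standard `difflib`; the two sources, the customer names, the helper names, and the 0.85 similarity threshold are all made-up assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def compare_sources(source_a, source_b, threshold=0.85):
    """Label each shared key as 'match', 'likely typo', or 'conflict'."""
    report = {}
    for key in source_a.keys() & source_b.keys():
        a, b = source_a[key], source_b[key]
        if a == b:
            report[key] = "match"
        elif similarity(a, b) >= threshold:
            # Nearly identical strings: probably a human typing error.
            report[key] = "likely typo"
        else:
            # Substantially different values: the sources disagree.
            report[key] = "conflict"
    return report


# Two made-up sources holding customer names under the same IDs.
crm = {"c1": "Jonathan Smith", "c2": "Mary Lee", "c3": "Wei Zhang"}
billing = {"c1": "Jonathon Smith", "c2": "Mary Lee", "c3": "Peter Chan"}

report = compare_sources(crm, billing)
print(report)
```

A report like this does not say which source is right; it only narrows down where a reliability check is needed.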
Applications: Artificial Intelligence with big data
§ Artificial Intelligence: the theory and development of computer
systems able to perform tasks normally requiring human
intelligence
§ Before the Age of “big data”
§ ELIZA is an early natural language
processing computer program created
from 1964 to 1967 at MIT
§ On May 11, 1997, the IBM computer
Deep Blue beat the reigning world chess
champion, Garry Kasparov, in a six-game match
§ Big data has changed AI: “AI Would Be Nothing Without Big Data”
§ “Data is the new oil”
History of AI
Recent breakthroughs in AI
§ At the 2017 Future of Go Summit, the Master version of AlphaGo beat Ke Jie, the
number one ranked player in the world at the time, in a three-game match, after
which AlphaGo was awarded professional 9-dan by the Chinese Weiqi Association
What Big Data Can Help
§ Recommendation
§ e.g., Amazon, YouTube, Netflix, ...
§ Which items should be recommended to which people?
§ How to improve users’ satisfaction?
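One common way to approach the item-to-person question is user-based collaborative filtering: recommend items that similar users rated highly. The sketch below is a toy illustration of that idea, not the method used by any of the services named above; the ratings, user names, and function names are made up.

```python
from math import sqrt

# Made-up ratings (1-5); a missing item means the user has not rated it.
RATINGS = {
    "alice": {"Movie A": 5, "Movie B": 4, "Movie C": 1},
    "bob": {"Movie A": 4, "Movie B": 5, "Movie D": 5},
    "carol": {"Movie C": 5, "Movie D": 1, "Movie E": 4},
}


def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    shared = u.keys() & v.keys()
    if not shared:
        return 0.0
    dot = sum(u[item] * v[item] for item in shared)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)


def recommend(user, ratings, top_n=1):
    """Rank unseen items by similarity-weighted ratings of other users."""
    scores = {}
    for other, their_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their_ratings)
        for item, rating in their_ratings.items():
            if item not in ratings[user]:
                # Ratings from more similar users count more.
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend("alice", RATINGS))
```

Here alice's tastes are closest to bob's, so the item bob rated highly that alice has not seen ranks first.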
What Big Data Can Help
§ More recommendations:
§ News feed
§ Music feed
§ Twitter feed
What Big Data Can Help
§ Disease Treatment: Joint research between Google
DeepMind and Moorfields Eye Hospital
§ Eyecare professionals diagnose eye conditions by using
optical coherence tomography (OCT) scans (over 1,000 a
day at Moorfields alone)
§ Achieved an error rate of 5.5%, comparable to the two
best retina specialists (6.7% and 6.8% error rates)
Demo 1: ChatGPT
§ Write essays:
https://youtu.be/oLjZva6JvLI
§ Programming:
https://youtu.be/TIDA6pvjEE0
https://youtu.be/B3yuK2XHmvM
§ Conversation:
https://youtu.be/GYeJC31JcM0
Demo 2: Autonomous driving
§ https://www.notateslaapp.com/news/1579/musk-live-steams-fsd-v12-and-it-s-too-human-why-that-s-a-problem-video
Demo 3: AlphaFold
We can do more
Big Data Analysis Procedure
Relations among big data analytics and AI
[Figure: course overview mind map connecting big data characteristics, analytics of big data, large-scale data analytics systems, and AI applications. Legible topics:]
§ Big Data Characteristics: Volume, Velocity, Variety (image, text, graph, time-series, tabular), Veracity
§ Large-scale data analytics systems: Hadoop, MapReduce
§ Analytics of big data: basic statistical analysis, machine learning
§ Supervised learning: classification, regression, linear regression, logistic regression, support vector machine
§ Unsupervised learning: clustering (K-means), dimensionality reduction (autoencoder, SVD)
§ Deep learning: convolutional neural network (image), recurrent neural network (text, time-series), backpropagation
§ Evaluation: metrics (precision, recall, F1 score), cross validation, overfitting
§ Graph: adjacency matrix, PageRank, web search
§ Recommender systems: content-based recommendation, collaborative filtering, user-item interaction matrix, matrix factorization (SVD)
§ Applications: AI (ChatGPT, AlphaFold, autonomous driving)