Lecture1 Introduction V0

Uploaded by

taiiq zhou
Copyright
© All Rights Reserved

COMP4434 Big Data Analytics

Lecture 1 Introduction
to Big Data Analytics
HUANG Xiao
xiaohuang@comp.polyu.edu.hk
Arrangement

§ Prerequisites:
§ Basic statistics, probability, linear algebra
§ Basic Computer Science fundamentals
§ programming (Python)
§ data structures
§ algorithms
§ database systems

§ 26 hours Lectures + 13 hours Lab


§ Four lab sessions

New Jersey Institute of Technology
What is big data?

§ Posts on social media sites


§ Purchase transaction records
§ Digital pictures and videos
§ Software logs
§ Microphone & camera recordings
§ Cell phone GPS signals
§ Sensing data
§ Scans of government documents
§ Traffic data
§ ...

Definition of big data

§ Definition 1: a combination of structured, semi-structured and
unstructured data collected by organizations that can be mined
for information and used in machine learning projects,
predictive modeling and other advanced analytics applications.
§ Definition 2: “Big data is high-volume, high-velocity and
high-variety information assets that demand cost-effective,
innovative forms of information processing for enhanced
insight and decision making.” -- Gartner
§ 4Vs Characteristics of big data
§ Volume, Velocity, Variety, Veracity

Volume: scale of data
§ Data volume is increasing exponentially
§ Generated by a huge number of devices and sensors
§ The number of smartphone mobile network subscriptions
worldwide reached almost 6.4 billion in 2022

[Figure: bar chart. Most popular social networks worldwide as of January 2023, ranked by number of monthly active users (in millions). Bar values in descending order: 2,958; 2,514; 2,000; 2,000; 1,309; 1,051; 931; 715; 700; 635; 626; 584; 574; 556; 445.]
Velocity: speed of data generation
§ Data is generated fast
§ Data needs to be processed fast
§ Online data analytics: late decisions mean missed opportunities
§ Example 1: based on your current location and your purchase history,
send promotions right now for the store next to you
§ Example 2: sensors monitor your activities and body and notify you if
there are abnormal measurements
Media usage in an internet minute as of April 2022

Activity                                Amount per minute
Emails sent                             231,400,000
Cryptocurrency purchased (USD)          90,200,000
Texts sent                              16,000,000
Searches conducted on Google            5,900,000
Snaps shared on Snapchat                2,430,000
Pieces of content shared on Facebook    1,700,000
Swipes on Tinder                        1,100,000
Hours streamed                          1,000,000
USD spent on Amazon                     443,000
USD sent on Venmo                       437,600
Tweets shared on Twitter                347,200
Hours spent in Zoom meetings            104,600
USD spent on DoorDash                   76,400
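To get a feel for velocity, the per-minute amounts on this slide can be converted into per-second and per-day rates. The following is a minimal Python sketch; the choice of rows and the helper names `per_second`, `per_day` and `rates` are illustrative assumptions, not part of the lecture.

```python
# Per-minute amounts copied from this slide (April 2022); only a few rows are used.
PER_MINUTE = {
    "Emails sent": 231_400_000,
    "Texts sent": 16_000_000,
    "Searches conducted on Google": 5_900_000,
    "Tweets shared on Twitter": 347_200,
}


def per_second(amount_per_minute):
    """Average rate per second for a given per-minute amount."""
    return amount_per_minute / 60


def per_day(amount_per_minute):
    """Total per day, assuming the per-minute amount holds all day (an assumption)."""
    return amount_per_minute * 60 * 24


def rates(table):
    """Map each activity to a (per-second, per-day) pair."""
    return {name: (per_second(v), per_day(v)) for name, v in table.items()}


if __name__ == "__main__":
    for name, (sec, day) in rates(PER_MINUTE).items():
        print(f"{name}: {sec:,.0f} per second, {day:,.0f} per day")
```

Under the constant-rate assumption, 231,400,000 emails per minute corresponds to roughly 3.9 million emails per second, which is the scale a system must sustain if it is to process such data as it arrives.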
Variety: data in many forms
§ A single application may generate or collect many types
of data, e.g., an email stores several types of data:
§ Tabular data: attributes like subject, to, from
§ Text (in email body)
§ Image (in attachment)
§ Hyperlinks
§ Types of data
§ Relational Data (e.g., Tables)
§ Text Data (e.g., comments)
§ Semi-structured Data (e.g., XML)
§ Graph Data (e.g., social network)
§ What else?
Data in Many Forms

§ Text, e.g., two short documents:
1. John also likes to watch football games.
2. John likes to watch movies. Mary likes movies too.
§ Image
§ Signal (Voice, Audio), e.g., two sampled signals:

signal1  signal2
13.58    7.24
12.11    12.50
13.49    8.66
11.25    10.98
14.57    13.75
13.22    9.02

§ Tuple
§ Graph with four nodes (1 to 4), stored as an adjacency matrix 𝑨 or as an edge list 𝑬:

𝑨    1    2    3    4
1    0    1.2  4.3  0
2    1.2  0    0.8  0
3    4.3  0.8  0    2.6
4    0    0    2.6  0

𝑬    Node1  Node2  Weight
1    1      2      1.2
2    1      3      4.3
3    2      3      0.8
4    3      4      2.6
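The adjacency matrix 𝑨 and the edge list 𝑬 on this slide describe the same weighted graph. A minimal sketch of converting between the two forms, assuming the graph is undirected and nodes are numbered from 1 (the function names are hypothetical):

```python
# Weighted edges (node1, node2, weight), matching the edge list E on the slide.
EDGES = [(1, 2, 1.2), (1, 3, 4.3), (2, 3, 0.8), (3, 4, 2.6)]


def edges_to_adjacency(edges, num_nodes):
    """Build a symmetric num_nodes x num_nodes matrix from 1-based weighted edges."""
    matrix = [[0.0] * num_nodes for _ in range(num_nodes)]
    for u, v, w in edges:
        matrix[u - 1][v - 1] = w
        matrix[v - 1][u - 1] = w  # undirected: mirror the weight
    return matrix


def adjacency_to_edges(matrix):
    """Recover 1-based edges from the upper triangle of a symmetric matrix."""
    edges = []
    for i in range(len(matrix)):
        for j in range(i + 1, len(matrix)):
            if matrix[i][j] != 0:
                edges.append((i + 1, j + 1, matrix[i][j]))
    return edges


if __name__ == "__main__":
    A = edges_to_adjacency(EDGES, 4)
    for row in A:
        print(row)
    print(adjacency_to_edges(A))
```

The matrix needs n × n entries regardless of how many edges exist, while the edge list grows only with the number of edges, which matters for large, sparse graphs such as social networks.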

Veracity: uncertainty of data
§ Is the data accurate?
§ Measurement error
§ Human errors like typos in names/addresses
§ Does the data come from a reliable source?
§ What if data from different sources are not consistent?
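Handling sources that disagree starts with finding the disagreements. A small sketch, assuming two hypothetical sources keyed by a shared record id; the record contents, the `normalize` rule and the function names are invented for illustration:

```python
def normalize(value):
    """Ignore case and repeated whitespace so formatting differences are not flagged."""
    return " ".join(str(value).split()).casefold()


def find_conflicts(source_a, source_b):
    """Return {record_id: {field: (value_a, value_b)}} for shared records that disagree."""
    conflicts = {}
    for key in source_a.keys() & source_b.keys():
        rec_a, rec_b = source_a[key], source_b[key]
        diff = {
            field: (rec_a[field], rec_b[field])
            for field in rec_a.keys() & rec_b.keys()
            if normalize(rec_a[field]) != normalize(rec_b[field])
        }
        if diff:
            conflicts[key] = diff
    return conflicts


# Hypothetical records from two systems describing the same customers.
CRM = {
    "c1": {"name": "Mary Smith", "city": "Hong Kong"},
    "c2": {"name": "John Lee", "city": "Kowloon"},
}
BILLING = {
    "c1": {"name": "mary  smith", "city": "Hong Kong"},
    "c2": {"name": "Jhon Lee", "city": "Kowloon"},  # typo in the name
    "c3": {"name": "Ann Wong", "city": "Sha Tin"},
}

if __name__ == "__main__":
    print(find_conflicts(CRM, BILLING))
```

Here only record `c2` is reported, because the name typo survives normalization; deciding which source to trust is a separate question that this sketch does not answer.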

Fake, Paid-For Reviews on Amazon

Applications: Artificial Intelligence with big data
§ Artificial Intelligence: the theory and development of computer
systems able to perform tasks normally requiring human
intelligence
§ Before the Age of “big data”
§ ELIZA is an early natural language
processing computer program created
from 1964 to 1967 at MIT
§ On May 11, 1997, an IBM computer
called IBM Deep Blue beat the world
chess champion after a six-game match

§ Big data has changed AI: “AI Would Be Nothing Without Big Data”
§ “Data is the new oil”

History of AI

Recent breakthroughs in AI

§ At the 2017 Future of Go Summit, the Master version of AlphaGo beat Ke Jie, the
number one ranked player in the world at the time, in a three-game match, after
which AlphaGo was awarded professional 9-dan by the Chinese Weiqi Association
What Big Data Can Help With

§ Recommendation
§ e.g., Amazon, YouTube, Netflix, ...
§ Which items should be recommended to which people?
§ How can users’ satisfaction be improved?
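One simple way to approach the question of what to recommend to whom is user-based collaborative filtering: score the items a user has not seen by the ratings of similar users. The sketch below is only an illustration under assumed data; the ratings, user names and function names are hypothetical, and the services named above use far more elaborate methods.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
RATINGS = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "D": 5},
    "carol": {"C": 5, "D": 2, "E": 4},
}


def cosine(u, v):
    """Cosine similarity of two sparse rating vectors (missing ratings count as 0)."""
    common = u.keys() & v.keys()
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)


def recommend(user, ratings, top_n=3):
    """Rank items the user has not rated by similarity-weighted ratings of other users."""
    seen = ratings[user]
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(seen, other_ratings)
        for item, rating in other_ratings.items():
            if item not in seen:
                scores[item] = scores.get(item, 0.0) + sim * rating
    # Highest score first; ties are broken alphabetically for a stable result.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


if __name__ == "__main__":
    print(recommend("alice", RATINGS))  # ['D', 'E']
```

In this toy data, alice's tastes are closest to bob's, so bob's highly rated item D ranks ahead of E, which only the less similar carol has rated.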

What Big Data Can Help With
§ More recommendations:
§ News feed
§ Music feed
§ Twitter feed

What Big Data Can Help With

§ Web search, image search


§ Chatbot
§ Virtual assistant
§ High-frequency trading

What Big Data Can Help With
§ Disease Treatment: Joint research between Google
DeepMind and Moorfields Eye Hospital
§ Eyecare professionals diagnose eye conditions by using
optical coherence tomography (OCT) scans (over 1,000 a
day at Moorfields alone)
§ The system achieves an expert-level error rate of 5.5%, comparable
to the two best retina specialists (6.7% and 6.8% error rates)

Demo 1: ChatGPT

§ Write essays:
https://youtu.be/oLjZva6JvLI

§ Programming:
https://youtu.be/TIDA6pvjEE0
https://youtu.be/B3yuK2XHmvM

§ Conversation:
https://youtu.be/GYeJC31JcM0

Demo 2: Autonomous driving
§ https://www.notateslaapp.com/news/1579/musk-live-steams-fsd-v12-and-it-s-too-human-why-that-s-a-problem-video

Demo 3: AlphaFold

§ A solution to a 50-year-old grand challenge in biology
An example in drug design

We can do more

Big Data Analysis Procedure

Relations among big data analytics and AI

[Figure: overview diagram. Legible labels include: Big Data Characteristics (Volume, Velocity, Variety, Veracity); Variety (tabular, text, time-series, image, graph); Large-scale data analytics systems; Hadoop; MapReduce; Analytics of big data; Basic statistical analysis; Machine learning; Supervised learning; Unsupervised learning; Deep learning; Clustering: K-means; Dimensionality reduction (autoencoder, SVD); Graph analytics; PageRank; Factorization (SVD); Applications: AI with big data; ChatGPT; AlphaGo; AlphaFold; Web search.]
