DAC ML Tutorial Final Deck
Source: https://m.xkcd.com/1838/
A Few Quotes
• “A breakthrough in machine learning would be
worth ten Microsofts” (Bill Gates, Chairman,
Microsoft)
• “Machine learning is the next Internet”
(Tony Tether, Director, DARPA)
• “Machine learning is the hot new thing”
(John Hennessy, President, Stanford)
• “Machine learning will bring about not just a new era of civilization, but a new stage in the evolution of life on Earth” (Pedro Domingos, Professor, UW)
• “It is a renaissance, it is a golden age. We are
solving problems with machine learning and
artificial intelligence that were in the realm of
science fiction for the last several decades. Natural
language understanding, machine vision problems,
it really is an amazing renaissance.” (Jeff Bezos,
Amazon CEO)
What is Machine Learning?
• Algorithms that can improve performance using training data
• Typically, a large number of parameter values learned from data
• Applicable to situations where it is challenging to define rules manually
How many variables are we talking about?
• From tens to millions of variables
• Learn a complex multi-dimensional function that captures a solution to the problem
Machine Learning Example
[Figure: a 500-pixel character image is fed into a machine learning model that outputs one of the characters a–z]
Example Details Cont’d
• Each character is represented by 20x25 = 500 pixels: x ∈ R^500
• Character recognition machine learning task:
Find a classifier y(x) such that y : x → {a, b, c, …, z}, e.g. y( [image of “v”] ) = v
[Figure: the 500-dimensional pixel input x is fed into the machine learning model, which outputs one of the 26 characters a–z]
A 13026-parameter function models the mapping of pixels to characters:
Wx + b = y,  where W is 26x500, x is 500x1, b is 26x1, and y is 26x1
Learning Details Continued
Wx + b = y,  where W is 26x500, x is 500x1, b is 26x1, and y is 26x1
• Training
– Start with random weights W[26x500] and bias b[26x1]
– For a given “training data set”, determine the error at the output y
– Using pixels of known characters, test how “far” the results are from the actual values
– Adjust the weights and bias to minimize the error
• Validation
– Test on data that the model has not been trained on
Training: Solving for W and b
Given input x and its associated one-hot label L = [0,0,0,…,0,1,0,0,0,0]:
▪ Compute y = Wx + b
▪ Compute the softmax S(y)
▪ Cross entropy is D(S, L) = −Σ_i L_i log(S_i)
▪ Loss function: Loss = (1/N) Σ_i D(S(Wx_i + b), L_i)
▪ Compute the derivative of the Loss w.r.t. W and b: ∇_W
▪ Adjust W and b: W = W − ∇_W · step_size
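A minimal NumPy sketch of one such training step (illustrative only, not from the tutorial; the shapes follow the 26x500 example above, and the step size is an arbitrary assumption):

import numpy as np

def softmax(y):
    e = np.exp(y - y.max(axis=0))           # subtract max for numerical stability
    return e / e.sum(axis=0)

def train_step(W, b, X, L, step_size=0.1):
    """One gradient-descent step on a batch.
    W: 26x500 weights, b: 26x1 bias,
    X: 500xN batch of pixel vectors, L: 26xN one-hot labels."""
    N = X.shape[1]
    S = softmax(W @ X + b)                   # predictions, 26xN
    loss = -np.sum(L * np.log(S)) / N        # average cross entropy D(S, L)
    dY = (S - L) / N                         # gradient of the loss w.r.t. (Wx + b)
    dW = dY @ X.T                            # gradient w.r.t. W
    db = dY.sum(axis=1, keepdims=True)       # gradient w.r.t. b
    W -= step_size * dW                      # adjust W and b
    b -= step_size * db
    return loss

# start from random weights and zero bias, as on the training slide
W = np.random.randn(26, 500) * 0.01
b = np.zeros((26, 1))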
Gradient Descent
[Figure: gradient descent trajectory over a loss surface in weight space (w1, w2)]
Linear Regression
The predicted value of y is given by:
ŷ = β̂_0 + Σ_{j=1..n} X_j β̂_j
If X_0 = 1, this can be written as a matrix product with X a row vector:
ŷ = X β̂
We get this by writing all of the input samples in a single matrix X:
X = [ X_11 … X_1n ; … ; X_m1 … X_mn ]
i.e. rows of X are distinct observations, columns of X are input features.
Least Squares Solution
The most common measure of fit between the line and the data is
the least-squares fit.
• It’s very scalable and can be very fast to train. It’s used for
– Spam filtering
– News message classification
– Web site classification
– Product classification
– Most classification problems with large, sparse feature sets.
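A minimal sketch of a least-squares fit with NumPy (illustrative only; the toy data here is randomly generated and not from the tutorial):

import numpy as np

# toy data: m observations, n features
m, n = 100, 5
X = np.hstack([np.ones((m, 1)), np.random.randn(m, n)])   # prepend the X0 = 1 column
true_beta = np.random.randn(n + 1)
y = X @ true_beta + 0.1 * np.random.randn(m)               # noisy targets

# least-squares solution: beta_hat minimizes ||X beta - y||^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat                                        # predicted values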
Logistic Regression
• Logistic regression maps the “regression” value −Xβ in (−∞, ∞) to the range [0, 1] using a “logistic” function:
p(X) = 1 / (1 + exp(−Xβ))
• i.e. the logistic function maps any value on the real line to a probability in the range [0, 1]
Logistic Training
For training, we start with a collection of input values X_i and corresponding output labels y_i ∈ {0, 1}. Let p_i be the predicted output on input X_i, so
p_i = 1 / (1 + exp(−X_i β))
The accuracy on an input X_i is
A_i = y_i p_i + (1 − y_i)(1 − p_i)
Logistic regression maximizes either the sum of the log accuracy, or the total accuracy, e.g.
A = Σ_{i=1..N} log A_i
Logistic Regression Training - SGD
• A very efficient way to train logistic models is with Stochastic
Gradient Descent (SGD) – we keep updating the model with
gradients from small blocks (mini-batches) of input data.
• One challenge with training on power law data (i.e. most data) is that the terms in the gradient can have very different strengths (because of the power law distribution). We’ll discuss this soon…
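A minimal mini-batch SGD sketch for logistic regression (illustrative only; the batch size and learning rate are arbitrary assumptions):

import numpy as np

def sgd_logistic(X, y, epochs=10, batch_size=32, lr=0.1):
    """Train beta so that sigmoid(X @ beta) approximates y in {0,1}.
    X: (m, n) inputs, y: (m,) labels."""
    m, n = X.shape
    beta = np.zeros(n)
    for _ in range(epochs):
        order = np.random.permutation(m)             # shuffle examples each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            p = 1.0 / (1.0 + np.exp(-Xb @ beta))     # predicted probabilities
            grad = Xb.T @ (p - yb) / len(idx)        # gradient of the log loss on this mini-batch
            beta -= lr * grad                         # update the model with this mini-batch
    return beta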
Support Vector Machines
• A Support Vector Machine (SVM) is a classifier that tries to
maximize the margin between training data and the
classification boundary (the plane defined by 𝑋𝛽 = 0)
Support Vector Machines
• The idea is that maximizing the margin maximizes the chance
that classification will be correct on new data. We assume
the new data of each class is near the training data of that
type.
SVM Training
SVMs can be trained using SGD.
The SVM gradient can be defined as (here p_i = X_i β):
dA/dβ = Σ_{i=1..N} (if p_i y_i < 1 then y_i X_i else 0)
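A minimal SGD sketch of that margin-based update (illustrative only; labels are assumed to be ±1 and the learning rate is an arbitrary assumption):

import numpy as np

def sgd_svm(X, y, epochs=10, lr=0.01):
    """Train beta so that sign(X @ beta) predicts y in {-1, +1}."""
    m, n = X.shape
    beta = np.zeros(n)
    for _ in range(epochs):
        for i in np.random.permutation(m):
            p = X[i] @ beta
            if p * y[i] < 1:                  # inside the margin: move the boundary away
                beta += lr * y[i] * X[i]
            # otherwise this example contributes zero gradient
    return beta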
Unsupervised Learning
[Figures: example clusterings of the same data with N = 2, 3, and 4 clusters]
K-means clustering
• K-means is a partitional clustering algorithm
• Let the set of data points (or instances) D be {x1, x2, …, xn}, where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the data.
• The k-means algorithm partitions the given data into k clusters.
– Each cluster has a cluster center, called centroid.
– k is specified by the user
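A minimal k-means sketch in NumPy (illustrative only; initialization by random sampling and a fixed iteration count are simplifying assumptions):

import numpy as np

def kmeans(D, k, iters=100):
    """Partition the points in D (n x r array) into k clusters."""
    # pick k data points as the initial centroids
    centroids = D[np.random.choice(len(D), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each point to its nearest centroid
        dist = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = D[labels == j].mean(axis=0)
    return labels, centroids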
Improving k-Clustering Results
• If I don’t know which k to use, let k range from 2 to N and pick the result that makes the most sense
– Smaller overall error
– Visual inspection
• Remove outliers from centroid computation as they skew
centroid position
• Run algorithm several times with a different initial set of
clusters and pick best result w.r.t. error
Other Approaches to k-Means
• Probabilistic methods
– LDA
– Autoencoders
–…
Going Deep with Machine Learning
The Machine Learning Framework
y = f(x)
y: output (prediction); f: prediction function; x: image feature
f( [image] ) = ”Border Collie”
f( [image] ) = ”Whippet”
Things Can Get a Little More Complicated
Norwich terrier
Features
• Histograms
• GIST descriptors
• Image gradients
• Shape context
• …
Other Feature Examples
• Text structure of sentence
• Syllables of speech
It Took You Several Years to Learn How to Correctly
Understand the World
Featureless Learning System
• No features
• Tagged data
Non-linearity
[Figure: a single neuron with inputs x1 … xn, weights w1j … wnj, and bias bj]
s_j = Σ_{i=0..n} w_ij x_i + b_j
y_j = f(s_j) = 1 / (1 + e^(−s_j))
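A minimal sketch of that single neuron (illustrative only):

import numpy as np

def neuron(x, w, b):
    """Single neuron: weighted sum of inputs followed by a sigmoid non-linearity."""
    s = np.dot(w, x) + b                 # s_j = sum_i w_ij * x_i + b_j
    return 1.0 / (1.0 + np.exp(-s))      # y_j = f(s_j)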
Deep Learning
Feed raw data (or minimally processed data) into the network
Make network deep to replicate feature extraction algorithm
Intermediate Layers Learn Features
Deep Learning
[Figure: forward and backward passes through the network (shared weights)]
Building a Whole Application Using DNN
CaffeNet/AlexNet
[Figure: CaffeNet classifying a 3x227x227 input image as “Norwich terrier”; layer types: Input, Normalization, Convolution, Fully connected (Inner product), ReLU, Softmax, MaxPooling]
Normalization
• It simulates ”lateral inhibition” by normalizing over local input regions
• According to Alex Krizhevsky, it aids generalization
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Recurrent Neural Networks
• Examples of Recurrent Neural Networks
• Each rectangle is a vector and
arrows represent functions (e.g.
matrix multiply).
• Input vectors are in red, output
vectors are in blue and green
vectors hold the RNN's state
[Figure: five RNN input/output configurations, numbered 1 to 5 below]
(1) Standard mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification).
(2) Sequence output (e.g. image captioning takes an image and outputs a sentence of words).
(3) Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative
sentiment).
(4) Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a
sentence in French).
(5) Synced sequence input and output (e.g. video classification where we wish to label each frame of the video).
[Hsinchun Chen, University of Arizona]
A Closer Look at RNNs
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
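As a rough illustration of what such a recurrent cell computes, here is a minimal vanilla-RNN step in NumPy (a sketch only; the matrix names and the tanh non-linearity are assumptions, not taken from the figures above):

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of a vanilla RNN: the new state mixes the current input
    with the previous hidden state through a non-linearity."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

def rnn_forward(xs, h0, W_xh, W_hh, b_h):
    """Unroll the same cell (same weights) over a whole input sequence."""
    h = h0
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states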
A Closer Look at RNNs
• Word2Vec
Representation of Words
[Figure: a binary matrix with words as columns and sentences as rows]
Word2Vec
• Matrix representation is very inefficient
– Vocabulary of 100k words
– Representation of a word is a one-hot encoding
– Representation of ”the dog is on the table”:
• R(”dog”) | R(”is”) | R(”on”) | R(”the”) | R(”table”)
• R : words → {0,1}^|words|
• Can we find a representation that is denser than one-hot encoding?
Word2Vec
• W : words → {0,1}^N, N << |words|
• |W(w1)| = 1
• cos(u, v) = (u · v) / (|u| |v|),  u = W(w1), v = W(w2)
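A minimal sketch of comparing two word embeddings by cosine similarity (illustrative only; the embedding lookup table here is a placeholder assumption):

import numpy as np

def cosine(u, v):
    """cos(u, v) = u.v / (|u| |v|); close to 1 for similar embeddings."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# hypothetical dense embedding table W: word -> N-dimensional vector
W = {"dog": np.random.randn(128), "puppy": np.random.randn(128)}
similarity = cosine(W["dog"], W["puppy"])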
Compute Acceleration
• GPUs
• SoCs
• FPGAs
Source: Building a datacenter for the post mobile era, Steve Brown, OCP Summit 17 (https://www.youtube.com/watch?v=zcw-Hm1p7ys)
Anatomy of an ML/DL Accelerator
[Figure: building blocks of an ML/DL accelerator: data transport, matrix multiplication, non-linear functions, normalization, pooling, and processing elements]
Processor Based Acceleration
• All ML and DL algorithms start and end in the machine subsystem
• But you end up using all of the machine’s resources for a single user
GPU Based Acceleration
• Programming using CUDA or OpenCL
http://www.nvidia.com/content/events/geoInt2015/LBrown_DL.pdf
GPU Based Acceleration
https://devblogs.nvidia.com/parallelforall/inside-volta/
GPU Based Acceleration
GPUs for Deep Learning
cuDNN and GPUs
SoC Based Acceleration
https://www.nextplatform.com/2017/04/05/first-depth-look-googles-tpu-architecture/
Systolic Processing Unit Inside TPU
FPGA Based Acceleration
https://www.wired.com/2016/09/microsoft-bets-future-chip-reprogram-fly/
FPGA Based Acceleration
• HLS/OpenCL-based flow gaining acceptance
[Charts: FPGA performance vs. GPU and performance/watt vs. GPU]
Fault Tolerance
RDDs track lineage info to rebuild lost data
• file.map(record => (record.type, 1))
.reduceByKey((x, y) => x + y)
.filter((type, count) => count > 10)
[Zaharia et al. 2013]
data = spark.textFile(...).map(readPoint).cache()
w = Vector.random(D)
// iterative gradient-descent updates to w elided on the slide
println("Final w: " + w)
[Zaharia et al. 2013]
[Chart: running time (s) vs. number of iterations (1 to 30) for Hadoop and Spark; Hadoop takes about 110 s per iteration, while Spark takes 80 s for the first iteration and about 1 s for further iterations]
Applications to EDA
• ML/DL for image based applications
• Debugging
SAT Solver Parameter Tuning and Solver Selection
for Formal Verification
• SAT is NP-complete
– Little hope we will find an efficient solver that fits all problems
Best practices
• 80-90% of the questions can be analyzed even with simple (even rule-based) chatbots
• The difficulty may lie in how to connect the knowledge base to the system
When you have thousands of signals and millions of cycles, it is like searching for a needle in a haystack
Using ML/DL for Debugging
• Some ideas
– Look at coding of sentences
[Figure: binary sentences × words matrix, as in the word2vec slides]
Using ML/DL for Debugging
• Some ideas
– Look at coding of sentences
– Using word2vec to code waveforms (see the sketch below)
[Figure: binary cycles × signals matrix, analogous to the sentences × words matrix]
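A rough sketch of that idea, treating each clock cycle’s signal vector as a “word” and a waveform as a “sentence” (purely illustrative; the use of gensim’s Word2Vec and the toy bit-string tokens are assumptions, not from the tutorial):

from gensim.models import Word2Vec

# each waveform becomes a "sentence": one token per cycle,
# where a token is the concatenated signal values in that cycle
waveforms = [
    ["00110111", "00000000", "11001011"],   # toy trace 1
    ["00110111", "11001011", "11001011"],   # toy trace 2
]

# learn dense embeddings for cycle patterns, analogous to word2vec for words
model = Word2Vec(waveforms, vector_size=16, window=2, min_count=1, sg=1)
vec = model.wv["00110111"]                  # dense code for this cycle pattern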
Using ML/DL for Debugging
[Figure: cycles × signals matrices at times t and t−1; example relation A + B − C = D]
Can be used to extract relations/properties between events for any number of cycles
Using RNNs to Help Debug
Things to Look for
• Deep Learning frameworks
– Caffe/Caffe2
– MXNet
– TensorFlow
• Embeddings
– word2vec
Conclusions
• Singularity is coming
– There is tremendous potential in using ML/DL for EDA
– New ideas and algorithms are starting to emerge
– The EDA industry needs to be aware that this may change how algorithms are implemented, or even how it relates to its customers
• Can a support manager have a number of robots working for him to support
customers?
Thank You