
MACHINE LEARNING AND SYSTEMS FOR BUILDING THE NEXT GENERATION EDA TOOLS


Manish Pandey - Synopsys, Inc., Mountain View, CA

Claudionor Coelho - NVXL Technology, Fremont, CA

June 19, 2017


Outline
• Introduction to machine learning
• Basics – linear regression, logistic regression, support vector machines, and decision trees
• k-means clustering for unsupervised learning
• Deep network training and simple convolutional neural networks
• Common neural net architectures, including Recurrent Neural Networks (RNNs)
• Performing machine learning at scale on large data sets – parallel machine learning with Apache Spark and MLlib
• Hardware-assisted machine learning speedup with GPUs and purpose-built processors
• Applications of machine learning to common optimization and classification problems encountered in traditional CAD flows
A Machine Learning System

Source: https://m.xkcd.com/1838/
A Few Quotes
• "A breakthrough in machine learning would be worth ten Microsofts" (Bill Gates, Chairman, Microsoft)
• "Machine learning is the next Internet" (Tony Tether, Director, DARPA)
• "Machine learning is the hot new thing" (John Hennessy, President, Stanford)
• "Machine learning will bring about not just a new era of civilization, but a new stage in the evolution of life on Earth" (Pedro Domingos, Professor, UW)
• "It is a renaissance, it is a golden age. We are solving problems with machine learning and artificial intelligence that were in the realm of science fiction for the last several decades. Natural language understanding, machine vision problems, it really is an amazing renaissance." (Jeff Bezos, Amazon CEO)
What is Machine Learning?
• Algorithms that can improve performance using training data
• Typically, a large number of parameter values learned from data
• Applicable to situations where it is challenging to define rules manually

How many variables are we talking about?
• From tens to millions of variables
• Learn a complex multi-dimensional function that captures a solution to the problem
Machine Learning Example
• Each character is represented by 20×25 pixels: x ∈ R^500
• Character recognition machine learning task: find a classifier y(x) such that
  y : x → {a, b, c, …, z}

Example Details
• Each character is represented by 20×25 pixels: x ∈ R^500
• Character recognition machine learning task: find a classifier y(x) such that
  y : x → {a, b, c, …, z}, e.g. y(image of "v") = v

[Diagram: a 500-dimensional pixel vector feeds a machine learning model that outputs one of the 26 classes a–z.]
Example Details Cont'd
• Each character is represented by 20×25 pixels: x ∈ R^500
• Character recognition machine learning task: find a classifier y(x) such that y : x → {a, b, c, …, z}

A linear model maps the 500-dimensional input x to 26 class scores y:

  y = Wx + b,   with W ∈ R^(26×500), x ∈ R^(500×1), b ∈ R^(26×1), y ∈ R^(26×1)

This is a 13,026-variable function (26×500 weights plus 26 biases) modeling the mapping of pixels to characters.
Learning Details Continued

  y = Wx + b,   with W ∈ R^(26×500), x ∈ R^(500×1), b ∈ R^(26×1), y ∈ R^(26×1)

• Training
  – Start with random weights W [26×500] and biases b [26×1]
  – For a given training data set, determine the error at the output y: given pixels for known characters, test how "far" the results are from the actual values
  – Adjust the weights and biases to minimize the error
• Validation
  – Test on data that the model has not been trained on
Training: Solving for W and b
Given input x and its associated one-hot label L (e.g. L = [0,0,0,…,0,1,0,0,0,0]):
▪ Compute the scores y = Wx + b
▪ Compute the softmax S(y), where S(y)_i = exp(y_i) / Σ_j exp(y_j)
▪ The cross entropy is
  D(S, L) = − Σ_i L_i log(S_i)
▪ The loss function averages the cross entropy over the N training examples:
  Loss = (1/N) Σ_i D(S(W x_i + b), L_i)
▪ Compute the gradient ∇_W (and ∇_b) of the loss with respect to W and b
▪ Adjust W and b:
  W = W − ∇_W · step_size
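The update above is easy to state concretely. Below is a minimal NumPy sketch (illustrative, not the tutorial's code) of softmax, the averaged cross entropy, and one gradient step on W and b, using the 26-class, 500-pixel shapes from the example; the data is random.

import numpy as np

def softmax(y):
    # Column-wise softmax; subtract the max for numerical stability.
    e = np.exp(y - y.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

# Toy dimensions from the slides: 26 classes, 500 pixels, N examples.
rng = np.random.default_rng(0)
N, D, C = 32, 500, 26
X = rng.random((D, N))                   # columns are input vectors x_i
labels = rng.integers(0, C, size=N)
L = np.eye(C)[labels].T                  # one-hot labels, shape (26, N)

W = rng.normal(scale=0.01, size=(C, D))  # random initial weights
b = np.zeros((C, 1))                     # initial biases
step_size = 0.5

for _ in range(100):
    S = softmax(W @ X + b)                           # class probabilities
    loss = -np.mean(np.sum(L * np.log(S), axis=0))   # average cross entropy
    # Gradient of softmax + cross entropy w.r.t. the scores is (S - L) / N.
    G = (S - L) / N
    W -= step_size * (G @ X.T)           # W = W - grad_W * step_size
    b -= step_size * G.sum(axis=1, keepdims=True)

print(f"final loss: {loss:.4f}")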
Gradient Descent

[Plot: the loss surface L(w1, w2); each step moves the weights by −α∇L(w1, w2) toward a minimum.]

All operations take place in this 13,026-variable space.

Machine Learning Problems

             Supervised Learning                 Unsupervised Learning
Discrete     classification or categorization    clustering
Continuous   regression                          dimensionality reduction

Linear Regression
We want to find the "best" line (linear function y = f(X)) to explain the data.

[Scatter plot: data points in the (X, y) plane with a fitted line.]
Linear Regression
The predicted value of y is given by:

  ŷ = β̂₀ + Σ_{j=1}^{p} X_j β̂_j

The vector of coefficients β̂ is the regression model.

Linear Regression
The regression formula

  ŷ = β̂₀ + Σ_{j=1}^{p} X_j β̂_j

if X₀ = 1, can be written as a matrix product with X a row vector:

  ŷ = X β̂

We get this by writing all of the input samples in a single matrix X:

  X = [ X₁₁ ⋯ X₁ₙ ]
      [  ⋮  ⋱  ⋮  ]
      [ Xₘ₁ ⋯ Xₘₙ ]

i.e. rows of X are distinct observations, columns of X are input features.
Least Squares Solution
The most common measure of fit between the line and the data is the least-squares fit.

There is a good reason for this: if the points are generated by an ideal line with additive Gaussian noise, the least-squares solution is the maximum likelihood solution.

The probability of a point y_j is

  Pr(y_j) ∝ exp( −(y_j − X_j β)² / (2σ²) )

and the probability of all points is the product over j of Pr(y_j).

We can easily maximize the log of this expression, −(y_j − X_j β)² / (2σ²), for one point, or the sum of this expression over all points — which is exactly minimizing the sum of squared errors.
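As a concrete illustration (not from the deck), the least-squares / maximum-likelihood solution can be computed directly with NumPy; the design matrix, true coefficients, and noise level below are invented:

import numpy as np

rng = np.random.default_rng(1)
m, p = 200, 3
X = np.hstack([np.ones((m, 1)), rng.random((m, p))])  # X_0 = 1 absorbs the intercept
beta_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ beta_true + rng.normal(scale=0.1, size=m)     # ideal line + Gaussian noise

# Least-squares solution: minimize ||y - X beta||^2.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to beta_true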
Logistic Regression
• Note the difference between regression (predicting a real value) and classification (predicting a discrete value).
• Logistic regression is designed as a binary classifier (output say {0,1}) but actually outputs the probability that the input instance is in the "1" class.
• A logistic classifier has the form:

  p(X) = 1 / (1 + exp(−Xβ))

where X = (X₁, …, Xₙ) is a vector of features.
Logistic Regression
• Logistic regression is probably the most widely used general-purpose classifier.
• It is very scalable and can be very fast to train. It's used for:
  – Spam filtering
  – News message classification
  – Web site classification
  – Product classification
  – Most classification problems with large, sparse feature sets
Logistic Regression
• Logistic regression maps the "regression" value Xβ in (−∞, ∞) to the range [0,1] using the "logistic" function:

  p(X) = 1 / (1 + exp(−Xβ))

• i.e. the logistic function maps any value on the real line to a probability in the range [0,1]
Logistic Training
For training, we start with a collection of input values X^i and corresponding output labels y^i ∈ {0,1}. Let p^i be the predicted output on input X^i, so

  p^i = 1 / (1 + exp(−X^i β))

The accuracy on an input X^i is

  A^i = y^i p^i + (1 − y^i)(1 − p^i)

Logistic regression maximizes either the sum of the log accuracies, or the total accuracy, e.g.

  A = Σ_{i=1}^{N} log A^i
Logistic Regression Training – SGD
• A very efficient way to train logistic models is with Stochastic Gradient Descent (SGD) – we keep updating the model with gradients from small blocks (mini-batches) of input data.
• One challenge with training on power-law data (i.e. most data) is that the terms in the gradient can have very different strengths (because of the power-law distribution). We'll discuss this soon…
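A minimal sketch of mini-batch SGD for the log-accuracy objective above, on synthetic data (names, batch size, and learning rate are illustrative):

import numpy as np

rng = np.random.default_rng(2)
N, n = 1000, 5
X = rng.normal(size=(N, n))
beta_true = rng.normal(size=n)
y = (X @ beta_true > 0).astype(float)   # labels in {0,1}

beta = np.zeros(n)
lr, batch = 0.1, 32
for epoch in range(20):
    order = rng.permutation(N)
    for start in range(0, N, batch):
        idx = order[start:start + batch]
        p = 1 / (1 + np.exp(-X[idx] @ beta))
        # Gradient ascent on sum_i log A^i (the log-likelihood):
        # d/dbeta = X^T (y - p), averaged over the mini-batch.
        beta += lr * X[idx].T @ (y[idx] - p) / len(idx)

pred = (1 / (1 + np.exp(-X @ beta)) > 0.5)
print("training accuracy:", np.mean(pred == y))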
Support Vector Machines
• A Support Vector Machine (SVM) is a classifier that tries to maximize the margin between the training data and the classification boundary (the plane defined by Xβ = 0).

Support Vector Machines
• The idea is that maximizing the margin maximizes the chance that classification will be correct on new data. We assume the new data of each class is near the training data of that type.
SVM Training
SVMs can be trained using SGD. The SVM gradient can be defined as (here p^i = X^i β, with labels y^i ∈ {−1, +1}):

  dA/dβ = Σ_{i=1}^{N} ( if p^i y^i < 1 then y^i X^i else 0 )

The test p^i y^i < 1 checks whether the point X^i is nearer than the margin; if so, the point is added with sign y^i. This "nudges" the model to push the boundary further out next time. Other points are ignored.
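A minimal sketch of this rule as SGD on the hinge condition (synthetic data, no regularization term, labels in {−1, +1}):

import numpy as np

rng = np.random.default_rng(3)
N, n = 500, 2
X = rng.normal(size=(N, n))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)  # labels in {-1, +1}

beta = np.zeros(n)
lr = 0.01
for epoch in range(50):
    for i in rng.permutation(N):
        p = X[i] @ beta
        # Only points inside the margin (p * y < 1) push beta,
        # with sign y, exactly as in the gradient above.
        if p * y[i] < 1:
            beta += lr * y[i] * X[i]

print("training accuracy:", np.mean(np.sign(X @ beta) == y))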
Decision Trees
• Walk from the root to a class-labeled leaf.
• At each node, branch based on the value of some feature.

Example: should we wait for a table at this restaurant?

[Decision-tree figure. Note you can use the same attribute more than once.]
Decision Tree Learning
• If there are k features, a decision tree might have up to 2^k nodes. This is usually much too big in practice.
• We want to find "efficient" (smaller) trees.
• We can do this in a greedy manner by recursively choosing the best split feature at each node.
Choosing an Attribute
• Idea: a good feature splits the examples into subsets that are (ideally) "all positive" or "all negative".
• Patrons or Type? After splitting on Type, to wait or not to wait is still at 50% in every subset, so it is a poor choice; information gain quantifies this, as sketched below.
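A minimal Python sketch of the usual split criterion, information gain (entropy reduction); the toy restaurant-style data below is invented for illustration:

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits.
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, labels, feature):
    # Entropy reduction from splitting the examples on `feature`.
    subsets = {}
    for ex, lab in zip(examples, labels):
        subsets.setdefault(ex[feature], []).append(lab)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Toy data: 'type' splits poorly, 'patrons' splits well.
examples = [
    {"patrons": "full", "type": "thai"},   {"patrons": "some", "type": "thai"},
    {"patrons": "none", "type": "french"}, {"patrons": "some", "type": "french"},
    {"patrons": "full", "type": "burger"}, {"patrons": "some", "type": "burger"},
]
labels = ["no", "yes", "no", "yes", "no", "yes"]
print(information_gain(examples, labels, "patrons"))  # 1.0 (pure subsets)
print(information_gain(examples, labels, "type"))     # 0.0 (still 50/50)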


Unsupervised Learning
k-Means Clustering

[Scatter-plot slides: how many clusters can we find in the same data? Plausible groupings with N=2, N=3, and N=4 clusters are shown.]
k-Means Clustering
• k-means is a partitional clustering algorithm
• Let the set of data points (or instances) D be {x1, x2, …, xn}, where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the data
• The k-means algorithm partitions the given data into k clusters
  – Each cluster has a cluster center, called the centroid
  – k is specified by the user

[k-means slides adapted from CS583, Bing Liu, UIC]


k-Means Algorithm
• Given k, the k-means algorithm works as follows:
  1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers)
  2) Assign each data point to the closest centroid
  3) Re-compute the centroids using the current cluster memberships
  4) If the convergence criterion is not met, go to 2)

[Figure: the assignment and re-computation steps on example data.]

Stopping/Convergence Criterion
1. No (or minimal) re-assignment of data points to different clusters,
2. no (or minimal) change of centroids, or
3. minimal decrease in the sum of squared error (SSE),

  SSE = Σ_{j=1}^{k} Σ_{x ∈ C_j} dist(x, m_j)²    (1)

where C_j is the jth cluster, m_j is the centroid of cluster C_j (the mean vector of all the data points in C_j), and dist(x, m_j) is the distance between data point x and centroid m_j.


An Example Distance Function
For numeric data the usual choice is Euclidean distance: dist(x, m_j) = ‖x − m_j‖ = √( Σ_l (x_l − m_jl)² ). A k-means sketch using it follows.
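A compact NumPy sketch of the algorithm above (illustrative; data, seeding, and convergence test are synthetic choices):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    # Plain k-means: X is (n, r); returns (centroids, assignments).
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random seeds
    for _ in range(iters):
        # Assign each point to the closest centroid (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Re-compute centroids from the current cluster memberships.
        new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # convergence: centroids stop moving
            break
        centroids = new
    return centroids, assign

# Three synthetic blobs; k-means should recover them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in ([0, 0], [3, 3], [0, 3])])
centroids, assign = kmeans(X, k=3)
sse = sum(np.sum((X[assign == j] - centroids[j]) ** 2) for j in range(3))
print("SSE:", sse)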


Strengths of k-Means
• Strengths:
  – Simple: easy to understand and to implement
  – Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations
  – Since both k and t are small, k-means is considered a linear algorithm
• k-means is the most popular clustering algorithm.
• Note that it terminates at a local optimum if SSE is used. The global optimum is hard to find due to complexity.


Weaknesses of k-Means
• The algorithm is only applicable if the mean is defined.
  – For categorical data, use k-modes, where the centroid is represented by the most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to outliers.
  – Outliers are data points that are very far away from other data points.
  – Outliers could be errors in the data recording or special data points with very different values.


[Figure slides: an outlier skews the computed centroids and the resulting clusters C1 and C2.]
Improving k-Clustering Results
• If you don't know which k to use, let k range from 2 to N and pick the result that makes the most sense:
  – Smaller overall error
  – Visual inspection
• Remove outliers from the centroid computation, as they skew centroid positions.
• Run the algorithm several times with different initial sets of centroids and pick the best result w.r.t. error.
Other Approaches to k-Means
• Probabilistic methods
– LDA
– Autoencoders
–…
Going Deep with Machine Learning
The Machine Learning Framework

  y = f(x)
  where y is the output, f is the prediction function, and x is the image features

  f([photo of a dog]) = "Border Collie"
  f([photo of a dog]) = "Whippet"
Things Can Get a Little More Complicated

[Photos: This dog? Or that dog? Is this a leg or the dog's ears?]
What's the Difference Between Machine Learning and Deep Learning?

Traditional Machine Learning Algorithms
• Where is the object I want to detect?
• What's the orientation of the object I want to detect?

[Pipeline: input image → feature extraction and preprocessing → machine learning algorithm → "Norwich terrier"]
Features
• Histograms

• GIST descriptors

• Image gradients

• Shape context

• …
Other Feature Examples
• Text: the structure of a sentence
• Speech: syllables
It Took You Several Years to Learn How to Correctly Understand the World

Featureless Learning System
• No hand-engineered features
• Tagged data
• Sometimes untagged data, but with sensory feedback

Generalization

Slide credit: L. Lazebnik

To train you need lots of data… but sometimes you do not have enough tagged data.
Dataset Augmentation
• Mirrored data
• Distorted images
• Blurred images
• Color changes
Artificial Neuron

[Diagram: inputs x1…xn with weights w1j…wnj and bias bj feed a summing node followed by a non-linearity.]

  s_j = Σ_{i=0}^{n} w_ij x_i + b_j
  y_j = f(s_j) = 1 / (1 + e^(−s_j))
Deep Learning

Feed raw data (or minimally processed data) into the network. Make the network deep to replicate the feature extraction algorithm.

Intermediate Layers Learn Features

Deep Learning

[Diagram: the forward pass computes activations; the backward pass propagates errors.] We need to learn all of the w's and b's — a sketch follows.
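A minimal NumPy sketch of one forward and one backward pass through a two-layer sigmoid network (layer sizes, data, and learning rate are illustrative, not from the deck):

import numpy as np

def sigmoid(s):
    return 1 / (1 + np.exp(-s))

rng = np.random.default_rng(4)
x = rng.random((8, 1))        # input
t = rng.random((3, 1))        # target output
W1, b1 = rng.normal(size=(5, 8)) * 0.1, np.zeros((5, 1))
W2, b2 = rng.normal(size=(3, 5)) * 0.1, np.zeros((3, 1))

# Forward: each layer computes y = f(Wx + b).
h = sigmoid(W1 @ x + b1)
y = sigmoid(W2 @ h + b2)
loss = 0.5 * np.sum((y - t) ** 2)

# Backward: propagate the error through each layer (chain rule),
# using sigmoid'(s) = y (1 - y).
d2 = (y - t) * y * (1 - y)        # dloss/ds2
d1 = (W2.T @ d2) * h * (1 - h)    # dloss/ds1

lr = 0.5
W2 -= lr * d2 @ h.T; b2 -= lr * d2
W1 -= lr * d1 @ x.T; b1 -= lr * d1
print(f"loss: {loss:.4f}")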


What Types of Layers Are We Talking About?
• Inner product (NN)
• Convolutional layer
• Recurrent Neural Network (RNN)
• Restricted Boltzmann Machines (RBM)
Layers in Caffe

Data Layers: Image Data, Database, HDF5 Input, HDF5 Output, Input, Window Data, Memory Data, Dummy Data

Vision Layers: Convolution, Pooling (Max, Avg, Random), Spatial Pyramid Pooling, Crop, Deconvolution, Im2Col

Recurrent Layers: Recurrent, RNN, LSTM

Common Layers: Inner Product, Dropout, Embed

Normalization Layers: Local Response Normalization (LRN), Mean Variance Normalization (MVN), Batch Normalization

Activation Layers: ReLU/Rectified-Linear and Leaky-ReLU, PReLU, ELU, Sigmoid, TanH, Absolute Value, Power, Exp, Log, BNLL, Threshold, Bias, Scale

Loss Layers: Multinomial Logistic Loss, Infogain Loss, Softmax with Loss, Sum-of-Squares / Euclidean, Hinge / Margin, Sigmoid Cross-Entropy Loss, Accuracy / Top-k, Contrastive Loss

Utility Layers: Flatten, Reshape, Batch Reindex, Split, Concat, Slicing, Eltwise, Filter/Mask, Parameter, Reduction, Silence, ArgMax, Softmax
What Types of Layers Are We Talking About?
• Inner product (NN)
• Convolutional layer
• Recurrent Neural Network (RNN)
• Restricted Boltzmann Machines (RBM)
Convolutional Layers
• Usually used for 2-D representations (images)
  – Character recognition
  – Natural images
• Can find edges, corners, endpoints, etc.
• Each output utilizes only a subset (window) of the inputs
• Each node is then passed through a non-linear activation function
• Provides a translation-invariance property by sliding the same weight window over the image (a sketch follows)

http://cs231n.github.io/convolutional-networks/
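A minimal sketch of a single convolutional window slid over an image (strictly this computes cross-correlation, the convention most deep learning frameworks use; the Sobel kernel is a classic hand-built edge detector, used here only for illustration):

import numpy as np

def conv2d(image, kernel):
    # "Valid" 2-D convolution: slide the same weight window over the image.
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.zeros((8, 8))
image[:, 4:] = 1.0                  # vertical edge down the middle
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])    # vertical-edge detector
print(conv2d(image, sobel_x))       # responds strongly along the edge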
Convolutional Layers
Pooling Layer
• Provides dimensionality reduction
• Usually interleaved with convolutional layers
• Max, average, or random pooling
• Provides immunity against noise
• At later layers, pooling can make the network invariant to more than just translation (a sketch follows)
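A minimal sketch of non-overlapping max pooling (illustrative):

import numpy as np

def max_pool(x, size=2):
    # Non-overlapping size x size max pooling (dims must divide by size).
    h, w = x.shape
    x = x.reshape(h // size, size, w // size, size)
    return x.max(axis=(1, 3))

act = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(act))  # 4x4 -> 2x2, keeping the strongest response per window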
Pooling Layer
Putting It All Together

[Diagram: stacked convolution + pooling layers with shared weights.]

Building a Whole Application Using DNNs
CaffeNet/AlexNet

[Pipeline: a 3×227×227 input image passes through the network and is classified as "Norwich terrier".]
CaffeNet

[Network diagram legend: Input, Normalization, Convolution, ReLU, MaxPooling, Fully connected (Inner product), Softmax]
Normalization
• It simulates "lateral inhibition" by normalizing over local input regions
• According to Alex Krizhevsky, it aids generalization

Softmax
• For classification problems it is advantageous, and increasingly popular, to use softmax as the output layer
• It provides a probabilistic interpretation
• Often used with some form of normalization
What Types of Layers Are We Talking About?
• Inner product (NN)
• Convolutional layer
• Recurrent Neural Network (RNN)
• Restricted Boltzmann Machines (RBM)
We Need to Remember the History…

Limitations of CNN-based networks:
• We only look at the present to make a decision
• Examples have fixed length

Examples where history is important:
• Video processing
• Audio
• Text analysis
Recurrent Neural Networks
• RNNs remember the past by feeding memory back into the next iteration
• The implementation is really an unrolling of the network

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Recurrent Neural Networks
• Examples of recurrent neural network configurations
• Each rectangle is a vector and arrows represent functions (e.g. matrix multiply). Input vectors are in red, output vectors are in blue, and green vectors hold the RNN's state.

(1) Standard mode of processing without an RNN, from fixed-sized input to fixed-sized output (e.g. image classification).
(2) Sequence output (e.g. image captioning takes an image and outputs a sentence of words).
(3) Sequence input (e.g. sentiment analysis, where a given sentence is classified as expressing positive or negative sentiment).
(4) Sequence input and sequence output (e.g. machine translation: an RNN reads a sentence in English and then outputs a sentence in French).
(5) Synced sequence input and output (e.g. video classification, where we wish to label each frame of the video).

[Hsinchun Chen, University of Arizona]
A Closer Look at RNNs

http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

A Closer Look at RNNs
• Repeated multiplication through time drives gradients toward 0.0 (vanishing memory) or toward infinity (exploding gradient)
• LSTMs (Long Short-Term Memory) and Gated Recurrent Units (GRUs) are designed to mitigate this (a minimal RNN step is sketched below)
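A minimal sketch of the recurrence itself — the same weights applied at every time step, which is what "unrolling" makes explicit (all sizes and data are illustrative):

import numpy as np

rng = np.random.default_rng(5)
n_in, n_hid, steps = 4, 6, 10
Wxh = rng.normal(scale=0.1, size=(n_hid, n_in))   # input-to-hidden weights
Whh = rng.normal(scale=0.1, size=(n_hid, n_hid))  # hidden-to-hidden ("memory")
bh = np.zeros(n_hid)

h = np.zeros(n_hid)                               # initial state
for t in range(steps):
    x_t = rng.random(n_in)                        # input at time t
    # The same weights are applied at every step; unrolling this loop
    # gives the feed-forward view of the network.
    h = np.tanh(Wxh @ x_t + Whh @ h + bh)
print(h)  # final state summarizes the whole sequence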
RNN Applications: Captioning and Tagging Movies
Autoencoders
• A neural network is trained to reproduce the input at the output
• The waist (bottleneck) of the network represents a compression of the sampling space
• These 30 numbers (the code layer in the diagram) can be used to classify the input

Diagram from (Hinton and Salakhutdinov, 2006)

Autoencoders and RBMs

Diagram from (Hinton and Salakhutdinov, 2006)

What's Left in this Tutorial for Deep Learning?
• Representation of sentences in NLP
  – Set of words
  – Bag of words
  – Probabilities
• Word2Vec

Representation of Words

[Figure: a sentences × words matrix with 0/1 entries — each row marks which words appear in a sentence.]
Word2Vec
• The matrix representation is very inefficient
  – Vocabulary of 100k words
  – The representation of a word is a one-hot encoding, R : words → {0,1}^|words|
  – The representation of "the dog is on the table" is R("dog") | R("is") | R("on") | R("the") | R("table") …
• Can we find a representation that is denser than one-hot encoding?
Word2Vec
• W : words → R^N, with N ≪ |words|
• ‖W(w1)‖ = 1 (embeddings are normalized to unit length)
• cos(u, v) = u·v / (‖u‖ ‖v‖), with u = W(w1), v = W(w2), measures word similarity
• king − queen + woman = man (a sketch of this arithmetic follows)
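A toy sketch of the cosine-similarity and analogy arithmetic; the four vectors below are invented for illustration (real embeddings are learned by word2vec from large text corpora):

import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

# Hypothetical toy embeddings; real ones are learned, not hand-written.
vec = {
    "king":  unit(np.array([0.9, 0.8, 0.1])),
    "queen": unit(np.array([0.9, 0.1, 0.8])),
    "man":   unit(np.array([0.1, 0.9, 0.2])),
    "woman": unit(np.array([0.1, 0.2, 0.9])),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# king - queen + woman should land nearest to "man".
target = vec["king"] - vec["queen"] + vec["woman"]
best = max(vec, key=lambda w: cosine(vec[w], target))
print(best)  # -> man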


Hardware Assisted Acceleration
Acceleration of Deep Learning
• Processor based acceleration

• GPUs

• SoCs

• FPGAs
Compute Acceleration

[Photos: large server cluster, GPU, Google TPU, IBM TrueNorth, and traditional FPGA card]
Next Gen Datacenter Requirements

Source: Building a datacenter for the post mobile era, Steve Brown, OCP Summit 17 (https://www.youtube.com/watch?v=zcw-Hm1p7ys)
Anatomy of an ML/DL Accelerator

[Diagram: an ML/DL accelerator combines matrix multiplication, non-linear functions, normalization, and pooling, implemented on an array of processing elements and tied together by data transport.]
Processor Based Acceleration
• All ML and DL algorithms start and end in the processor subsystem
• Multi-core/multi-threaded systems are common nowadays
• Processors have high-bandwidth access to the memory subsystem
• The NumPy math library is hard to beat
• But you end up using all of the machine's resources for a single user
GPU Based Acceleration
• Programming using CUDA or OpenCL
• Pre-defined libraries for common tasks (cuDNN and cuBLAS):
  – Convolution
  – Pooling
  – Activation
• Depending on data size, data transport may influence results
• Power efficiency may limit applicability

http://www.nvidia.com/content/events/geoInt2015/LBrown_DL.pdf
GPU Based Acceleration

https://devblogs.nvidia.com/parallelforall/inside-volta/
GPU Based Acceleration
GPUs for Deep Learning
cuDNN and GPUs
SoC Based Acceleration

https://www.nextplatform.com/2017/04/05/first-depth-look-googles-tpu-architecture/
Systolic Processing Unit Inside TPU
FPGA Based Acceleration

https://www.wired.com/2016/09/microsoft-bets-future-chip-reprogram-fly/

FPGA Based Acceleration
• HLS/OpenCL-based flows are gaining acceptance
• Uses the DSP blocks commonly found in FPGAs
• Being fully programmable means users can recompile new binaries to address new architectural challenges
• Experiments have shown FPGAs to be very energy efficient
Example of FPGA Based Acceleration
• Can adapt to multiple requirements, whether logic or computational
• A shell provides common infrastructure (including communication) to applications
• The user has to worry only about the application implementation

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/Catapult_ISCA_2014.pdf
NVXL Cloud Datacenter

[Diagram: applications run on client servers, connected over a PCIe/RDMA network to disaggregated acceleration/NVMeF clusters.]

o True software-defined datacenter, with dynamic provisioning
o Extreme performance on demand
  ▪ Tens of petaFLOPs of computing power
  ▪ Billions of IOPs of SSD performance
o Elastic scaling
  ▪ 1000s of FPGAs and ASICs
o Dramatic reduction in TCO

NVXL DS-1 Universal Acceleration Server

Data Optimizations
• Quantization
  – Moving from fp32 to fp16
  – Using fixed-point integer arithmetic
  – Using fully quantized values for weights:
    • {-1, +1}
    • {-1, 0, +1}
    • {-2, -1, 0, +1, +2}
    • …
  – This requires retraining of the network (a small sketch follows)
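A toy sketch of ternary weight quantization; the threshold-and-scale rule here is an illustrative simplification (in practice the network is retrained with the quantized weights, as the slide notes):

import numpy as np

def quantize_ternary(w, threshold=0.05):
    # Map each weight to {-1, 0, +1} with a simple threshold rule.
    q = np.zeros_like(w)
    q[w > threshold] = 1.0
    q[w < -threshold] = -1.0
    # A per-tensor scale keeps scale*q close to the original weights.
    nonzero = q != 0
    scale = np.abs(w[nonzero]).mean() if nonzero.any() else 1.0
    return q, scale

w = np.random.default_rng(6).normal(scale=0.1, size=(4, 4))
q, scale = quantize_ternary(w)
print(q)                                   # entries in {-1, 0, +1}
print("reconstruction error:", np.abs(w - scale * q).mean())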


FPGAs Can Be Optimized with New Architectures

DNN Type        Computation                  Performance vs. GPU   Performance/Watt vs. GPU
Classic DNNs    FP32 matrix multiply         0.98x                 1.4x
Sparse DNNs     Sparse matrix multiply       1.5x                  2.4x
Compact         Compact (reduced-bitwidth)   1.5x                  2x
(bitwidth) DNNs matrix multiply
Ternary DNNs    Sign flipping or negation    3.5x                  4.3x
Binary DNNs     XNOR and bit-counting        12x                   10x

Source: Intel Corp. paper, presented at FPGA'17 in Monterey, CA
INFRASTRUCTURE FOR LARGE SCALE MACHINE LEARNING

Machine Learning at Scale
• Off-line and on-line machine learning
  – Data volume
  – Learning speed
  – Prediction speed
• Managing data at scale is hard
  – Distributed data storage
  – Distributed computation
  – Deployment and operational considerations
Apache Spark
• Distributed in-memory computation platform
• Underlying distributed storage (HDFS or another distributed store)
• Key idea – compute pipelines with Apache Spark:
  – Parallel computation model
  – In-memory parallelization support
  – Checkpointing
• MLlib – a parallel machine learning library that implements the most common ML algorithms

[Zaharia et al. 2013]
Apache Spark for In-Memory Computation at Scale

Fault Tolerance
RDDs track lineage info to rebuild lost data:

// Count records of each type, keep the types seen more than 10 times.
// recordType stands in for the slide's "record.type" (type is a Scala keyword).
file.map(record => (record.recordType, 1))
    .reduceByKey((x, y) => x + y)
    .filter { case (_, count) => count > 10 }

[Pipeline: Input file → map → reduce → filter]

[Zaharia et al. 2013]

MLlib Example: Logistic Regression

Goal: find the best line separating two sets of points (the slide shows a random initial line converging to the target).

// Scala: batch gradient descent for logistic regression
val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to iterations) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
  ).reduce((a, b) => a + b)
  w -= gradient
}

println("Final w: " + w)

[Zaharia et al. 2013]

Logistic Regression Results

[Chart: running time (s) vs. number of iterations (1–30) for Hadoop and Spark. Hadoop takes ~110 s per iteration; Spark takes 80 s for the first iteration and ~1 s for each further iteration.]
Applications to EDA
• ML/DL for image-based applications
• SAT solver parameter tuning and solver selection for formal verification
• Chatbots for support
• Debugging
SAT Solver Parameter Tuning and Solver Selection for Formal Verification
• SAT is NP-complete
  – There is little hope we will find one efficient solver that fits all problems
• Different solvers have different strengths and weaknesses
  – MiniSat, MarchSAT, …
• Each solver has a number of parameters that can make it perform well on certain types of problems
Parameter Tuning and SAT Solver Selection

We know the problem is NP-complete, but different engines may be affected differently by these features, some polynomially and some exponentially. We attempt to optimize how many instances we can run to reduce the risk of a property not being proven. A solver-selection classifier over these features is sketched after the list.

Features:
1. Property circuit level
2. SCOAP cycles
3. Number of flops uninitialized after RESET
4. Circuit Testability Index
5. Property Testability Index
6. SCOAP adjusted flops
7. SCMax
8. Number of flops
9. Number of gate bits
10. Number of free variables
11. Number of bits directly affected by constraints
12. Number of counter flops
13. Number of FSM flops
14. Number of memory array flops

Penido et al., STTT 2010
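One way to frame solver selection is as supervised classification over the features above. The sketch below is hypothetical — the feature values, labels, and training set are invented, and scikit-learn's decision tree merely stands in for whatever model the cited work used:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)

# Hypothetical training set: one row of the 14 structural features per
# verification problem, labeled with the solver that proved it fastest.
n_problems, n_features = 200, 14
X = rng.random((n_problems, n_features))
solvers = np.array(["MiniSat", "MarchSAT", "other"])
y = solvers[rng.integers(0, len(solvers), size=n_problems)]

clf = DecisionTreeClassifier(max_depth=4).fit(X, y)

# For a new property/circuit, extract the same features and ask the
# model which solver (or parameter setting) to try first.
new_problem = rng.random((1, n_features))
print(clf.predict(new_problem))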


Chatbots for Support

AI to generate automatic programs:

User: "Can you show me FFT code in CUDA?"
A search tool from Google used StackOverflow to search for snippets of code.
Chatbots for Support

User: "Why is my synthesis not giving good frequency?" → Search → Best practices

A natural language interface sits on top of the design, dynamic data, and documentation, and performs analysis of the design, traversals, logs, etc.

The same exchange in another language (shown in Korean on the original slide):
User: "Why is my synthesis not giving good frequency?" → Search → Best practices
Chatbot: "There is a combinational path that hurts performance. Try pipelining it or adding parallelism."
Chatbots for Support
• Experience:
  – 80–90% of the questions can be handled even with simple (even rule-based) chatbots
    • The difficulty may reside in how to connect the knowledge base to the system
  – 10–20% of the questions require more complex chatbots or in-person support
Using ML/DL for Debugging
• Debugging is based on 0's and 1's
• Why do I need a probabilistic view for debugging?

Using ML/DL for Debugging

When you have thousands of signals and millions of cycles, debugging is like searching for a needle in a haystack.
Using ML/DL for Debugging
• Some ideas:
  – Look at the encoding of sentences in NLP: a sentences × words matrix of 0/1 entries
  – Treat a waveform the same way: a cycles × signals matrix of 0/1 entries, where each cycle is a "sentence" over the signal values
  – Use word2vec to encode waveforms: relate the encoding of signals(t) to the encoding of signals(t−1), so that an event such as A + B − C = D becomes a relation between consecutive cycle encodings

This can be used to extract relations/properties between events across any number of cycles. A toy encoding sketch follows.
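A toy sketch of the first step — turning each cycle's signal bit-vector into a token so a waveform becomes a "sentence" that word2vec-style models can consume; the waveform data is random and the tokenization rule is an illustrative assumption:

import numpy as np

rng = np.random.default_rng(8)
cycles, signals = 12, 8
wave = rng.integers(0, 2, size=(cycles, signals))  # cycles x signals 0/1 matrix

# Treat each cycle's bit-vector as a "word": identical cycles share a token.
vocab = {}
tokens = []
for t in range(cycles):
    word = "".join(map(str, wave[t]))      # e.g. "00110111"
    tokens.append(vocab.setdefault(word, len(vocab)))

# (token(t-1), token(t)) pairs are the word2vec-style training contexts
# from which relations between consecutive events could be learned.
contexts = list(zip(tokens[:-1], tokens[1:]))
print("vocab size:", len(vocab))
print("context pairs:", contexts[:5])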
Using RNNs to Help Debug

[Diagram: a set of signals over cycles feeds an RNN, which produces a classification of the behaviors and suggests fixes/expected behaviors.]
Things to Look for
• Deep Learning frameworks
– Caffe/Caffe2
– MXNet
– TensorFlow
• Embeddings
– word2vec
Conclusions
• Singularity is coming
  – There is tremendous potential in using ML/DL for EDA
  – New ideas and algorithms are starting to emerge
  – The EDA industry needs to be aware that this may change the way algorithms are implemented, or even its relationships with customers
    • Can a support manager have a number of robots working for them to support customers?
Thank You
