DL 2021 Tensorflow and Deep Learning

The document provides an overview of using TensorFlow and Keras for deep learning, specifically focusing on the MNIST handwritten digit classification task. It details the architecture of a simple softmax classification model, the implementation in TensorFlow, and various techniques to improve model performance such as dropout and learning rate decay. Additionally, it discusses the transition to convolutional neural networks and the importance of managing overfitting in deep learning models.


>TensorFlow, Keras and deep learning_

without a PhD

deep deep
Science ! Code ...

#Tensorflow @martin_gorner
Hello World: handwritten digits classification - MNIST

?
MNIST = Modified National Institute of Standards and Technology - Download the dataset at http://yann.lecun.com/exdb/mnist/
Very simple model: softmax classification

(diagram: a 28x28-pixel image, 784 pixels in all; each of the 10 neurons computes a weighted sum of all pixels + a bias; a softmax turns the 10 neuron outputs into probabilities for digits 0, 1, 2 ... 9)
@martin_gorner
In matrix notation, 100 images at a time
X: 100 images, one per line, flattened to 784 pixels
W: 784 lines of weights x 10 columns (w0,0 ... w783,9)
b: the same 10 biases (b0 ... b9), broadcast on all lines

Softmax, on a batch of images:

    Predictions    Images        Weights      Biases
    Y[100, 10] = softmax( X[100, 784] . W[784, 10] + b[10] )

(matrix multiply, biases broadcast on all lines, softmax applied line by line; tensor shapes in [ ])

@martin_gorner
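The same computation in plain NumPy, as a minimal sketch (not slide code; the random data is just for shape-checking):

import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)     # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

X = np.random.rand(100, 784)                 # 100 images, one per line, flattened
W = np.zeros((784, 10))                      # weights
b = np.zeros(10)                             # biases, broadcast on all lines
Y = softmax(X @ W + b)                       # Y[100, 10]: one probability row per image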
Now in TensorFlow (Python)

tensor shapes: X[100, 784]  W[784, 10]  b[10]

Y = tf.nn.softmax(tf.matmul(X, W) + b)

(matrix multiply, broadcast on all lines)

@martin_gorner
Success ?

actual probabilities, "one-hot" encoded ("this is a 6"):
 0   1   2   3   4   5   6   7   8   9
 0   0   0   0   0   0   1   0   0   0

computed probabilities:
.01 .03 .00 .04 .03 .05 .80 .02 .01 .01

Cross entropy: - Σ ( actual_i x log(computed_i) )

@martin_gorner
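A quick NumPy check of the cross-entropy value for the example above (a sketch, not slide code; the small epsilon is an assumption to avoid log(0)):

import numpy as np

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0], dtype=np.float32)    # "this is a 6", one-hot
y_pred = np.array([.01, .03, .00, .04, .03, .05, .80, .02, .01, .01])  # computed probabilities

eps = 1e-9
cross_entropy = -np.sum(y_true * np.log(y_pred + eps))
print(cross_entropy)   # ~0.22: small, because the model is confident and correct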
Demo
92%
TensorFlow - initialisation

import tensorflow as tf

X = tf.placeholder(tf.float32, [None, 28, 28, 1])   # 28 x 28 grayscale images; "None" will become the batch size, 100
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

init = tf.initialize_all_variables()

Training = computing variables W and b

@martin_gorner
TensorFlow - success metrics
# model (tf.reshape flattens the images)
Y = tf.nn.softmax(tf.matmul(tf.reshape(X, [-1, 784]), W) + b)

# placeholder for correct answers ("one-hot" encoded)
Y_ = tf.placeholder(tf.float32, [None, 10])

# loss function
cross_entropy = -tf.reduce_sum(Y_ * tf.log(Y))

# % of correct answers found in batch ("one-hot" decoding)
is_correct = tf.equal(tf.argmax(Y, 1), tf.argmax(Y_, 1))
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))

@martin_gorner
TensorFlow - training

optimizer = tf.train.GradientDescentOptimizer(0.003)   # 0.003 is the learning rate
train_step = optimizer.minimize(cross_entropy)          # cross_entropy is the loss function
@martin_gorner
TensorFlow - run !
sess = tf.Session()
sess.run(init)

# running a TensorFlow computation, feeding placeholders
for i in range(1000):
    # load batch of images and correct answers
    batch_X, batch_Y = mnist.train.next_batch(100)
    train_data = {X: batch_X, Y_: batch_Y}

    # train
    sess.run(train_step, feed_dict=train_data)

    # success ? (tip: do this every 100 iterations only)
    a, c = sess.run([accuracy, cross_entropy], feed_dict=train_data)

    # success on test data ?
    test_data = {X: mnist.test.images, Y_: mnist.test.labels}
    a, c = sess.run([accuracy, cross_entropy], feed_dict=test_data)
TensorFlow - full python code

import tensorflow as tf

# initialisation
X = tf.placeholder(tf.float32, [None, 28, 28, 1])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
init = tf.initialize_all_variables()

# model
Y = tf.nn.softmax(tf.matmul(tf.reshape(X, [-1, 784]), W) + b)

# placeholder for correct answers
Y_ = tf.placeholder(tf.float32, [None, 10])

# loss function
cross_entropy = -tf.reduce_sum(Y_ * tf.log(Y))

# success metrics: % of correct answers found in batch
is_correct = tf.equal(tf.argmax(Y, 1), tf.argmax(Y_, 1))
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))

# training step
optimizer = tf.train.GradientDescentOptimizer(0.003)
train_step = optimizer.minimize(cross_entropy)

# run
sess = tf.Session()
sess.run(init)

for i in range(10000):
    # load batch of images and correct answers
    batch_X, batch_Y = mnist.train.next_batch(100)
    train_data = {X: batch_X, Y_: batch_Y}

    # train
    sess.run(train_step, feed_dict=train_data)

    # success ? (add code to print it)
    a, c = sess.run([accuracy, cross_entropy], feed_dict=train_data)

    # success on test data ?
    test_data = {X: mnist.test.images, Y_: mnist.test.labels}
    a, c = sess.run([accuracy, cross_entropy], feed_dict=test_data)
@martin_gorner
Cookbook

Softmax
Cross-entropy
Mini-batch
Let’s try 5 fully-connected layers !
overkill !

784 (input) → 200 → 100 → 60 → 30 → 10
sigmoid activation on the hidden layers, softmax on the last layer (digits 0 1 2 ... 9)

@martin_gorner
TensorFlow - initialisation

K = 200
L = 100
M = 60
N = 30
# weights initialised with random values

W1 = tf.Variable(tf.truncated_normal([28*28, K], stddev=0.1))
B1 = tf.Variable(tf.zeros([K]))
W2 = tf.Variable(tf.truncated_normal([K, L], stddev=0.1))
B2 = tf.Variable(tf.zeros([L]))
W3 = tf.Variable(tf.truncated_normal([L, M], stddev=0.1))
B3 = tf.Variable(tf.zeros([M]))
W4 = tf.Variable(tf.truncated_normal([M, N], stddev=0.1))
B4 = tf.Variable(tf.zeros([N]))
W5 = tf.Variable(tf.truncated_normal([N, 10], stddev=0.1))
B5 = tf.Variable(tf.zeros([10]))

@martin_gorner
TensorFlow - the model
weights and biases

X = tf.reshape(X, [-1, 28*28])

Y1 = tf.nn.sigmoid(tf.matmul(X, W1) + B1)


Y2 = tf.nn.sigmoid(tf.matmul(Y1, W2) + B2)
Y3 = tf.nn.sigmoid(tf.matmul(Y2, W3) + B3)
Y4 = tf.nn.sigmoid(tf.matmul(Y3, W4) + B4)
Y = tf.nn.softmax(tf.matmul(Y4, W5) + B5)

@martin_gorner
Demo - slow start ?

@martin_gorner
Relu !
RELU
RELU = Rectified Linear Unit

Y = tf.nn.relu(tf.matmul(X, W) + b)
@martin_gorner
RELU

@martin_gorner
Demo - noisy accuracy curve ?

yuck!
Slow down . . .
Learning
rate decay
Demo
98%
Learning rate decay

Learning rate 0.003 at start then dropping exponentially to 0.0001

@martin_gorner
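One common way to implement this schedule in plain Python (an assumption; the exact decay curve is not shown on the slide):

import math

lr_max, lr_min, decay_speed = 0.003, 0.0001, 2000.0
for i in range(10001):
    learning_rate = lr_min + (lr_max - lr_min) * math.exp(-i / decay_speed)
    if i % 2000 == 0:
        print(i, round(learning_rate, 5))   # 0.003 at the start, dropping towards 0.0001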
Overfitting
Cross-entropy loss

Overfitting ?

@martin_gorner
Dropout

pkeep = tf.placeholder(tf.float32)   # TRAINING: pkeep < 1 (drop rate e.g. 0.5)   EVALUATION: pkeep = 1 (drop rate 0)

Yf = tf.nn.relu(tf.matmul(X, W) + B)
Y = tf.nn.dropout(Yf, pkeep)
@martin_gorner
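A hedged usage sketch: since pkeep is a placeholder, the keep probability is simply fed at run time; 0.75 for training and 1.0 for evaluation follow the later slides, the other names are assumed from the earlier training loop.

# TRAINING: drop some activations
sess.run(train_step, feed_dict={X: batch_X, Y_: batch_Y, pkeep: 0.75})

# EVALUATION: keep everything
a, c = sess.run([accuracy, cross_entropy],
                feed_dict={X: mnist.test.images, Y_: mnist.test.labels, pkeep: 1.0})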
All the party tricks

97.9% sustained, 98.2% peak

Sigmoid, learning rate 0.003  →  RELU, learning rate 0.003  →  decaying learning rate 0.003 -> 0.0001, and dropout (pkeep 0.75)

@martin_gorner
Overfitting
Cross-entropy loss

Overfitting

@martin_gorner
Overfitting ?!?  Too many neurons, a BAD network, or not enough DATA
Convolutional layer
convolutional layer (+padding, subsampling by striding)

one filter: W1[4, 4, 3], a second filter: W2[4, 4, 3]  =>  together W[4, 4, 3, 2]
(filter size 4x4, 3 input channels, 2 output channels)

@martin_gorner
Hacker’s tip

ALL convolutional
Convolutional neural network (+ biases on all layers)

28x28x1   input image
28x28x4   convolutional layer, 4 channels    W1[5, 5, 1, 4]   stride 1
14x14x8   convolutional layer, 8 channels    W2[4, 4, 4, 8]   stride 2
7x7x12    convolutional layer, 12 channels   W3[4, 4, 8, 12]  stride 2
200       fully connected layer              W4[7x7x12, 200]
10        softmax readout layer              W5[200, 10]
Tensorflow - initialisation

K = 4    # output channels, first convolutional layer
L = 8    # output channels, second convolutional layer
M = 12   # output channels, third convolutional layer
N = 200  # neurons in the fully connected layer
# weights initialised with random values
# convolutional weight shapes: [filter size, filter size, input channels, output channels]

W1 = tf.Variable(tf.truncated_normal([5, 5, 1, K], stddev=0.1))
B1 = tf.Variable(tf.ones([K])/10)
W2 = tf.Variable(tf.truncated_normal([5, 5, K, L], stddev=0.1))
B2 = tf.Variable(tf.ones([L])/10)
W3 = tf.Variable(tf.truncated_normal([4, 4, L, M], stddev=0.1))
B3 = tf.Variable(tf.ones([M])/10)
W4 = tf.Variable(tf.truncated_normal([7*7*M, N], stddev=0.1))
B4 = tf.Variable(tf.ones([N])/10)
W5 = tf.Variable(tf.truncated_normal([N, 10], stddev=0.1))
B5 = tf.Variable(tf.zeros([10]))
Tensorflow - the model

# input image batch X[100, 28, 28, 1]; weights, strides and biases as initialised above

Y1 = tf.nn.relu(tf.nn.conv2d(X, W1, strides=[1, 1, 1, 1], padding='SAME') + B1)
Y2 = tf.nn.relu(tf.nn.conv2d(Y1, W2, strides=[1, 2, 2, 1], padding='SAME') + B2)
Y3 = tf.nn.relu(tf.nn.conv2d(Y2, W3, strides=[1, 2, 2, 1], padding='SAME') + B3)

# flatten all values for the fully connected layer: Y3 [100, 7, 7, 12] -> YY [100, 7x7x12]
YY = tf.reshape(Y3, shape=[-1, 7 * 7 * M])

Y4 = tf.nn.relu(tf.matmul(YY, W4) + B4)
Y = tf.nn.softmax(tf.matmul(Y4, W5) + B5)

@martin_gorner
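Not on the slide, but a quick sanity check of the spatial sizes, assuming 'SAME' padding (output size = ceil(input size / stride)):

import math

size = 28
for stride in [1, 2, 2]:          # the three convolutional layers above
    size = math.ceil(size / stride)
    print(size)                   # 28, 14, 7  -> hence the 7*7*M flatten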
Demo
98.9%

@martin_gorner
WTFH ???

???

@martin_gorner
Bigger convolutional network + dropout (+ biases on all layers)

28x28x1   input image
28x28x6   convolutional layer, 6 channels    W1[6, 6, 1, 6]    stride 1
14x14x12  convolutional layer, 12 channels   W2[5, 5, 6, 12]   stride 2
7x7x24    convolutional layer, 24 channels   W3[4, 4, 12, 24]  stride 2
200       fully connected layer              W4[7x7x24, 200]   + DROPOUT p=0.75
10        softmax readout layer              W5[200, 10]
Demo
99.3%

@martin_gorner
YEAH !

with dropout

@martin_gorner
Relu !   Learning rate decay   Dropout

Softmax
Cross-entropy
Mini-batch
ALL convolutional

Overfitting ?!?  Too many neurons, a BAD network, or not enough DATA
Cartoon images copyright: alexpokusay / 123RF stock photos

Have fun !
Martin Görner, Google Developer relations, @martin_gorner

Videos, slides, code: github.com/GoogleCloudPlatform/tensorflow-without-a-phd

Cloud ML Engine: your TensorFlow models trained in Google's cloud
Cloud Auto ML Vision ALPHA: just bring your data
Cloud TPU: ML supercomputing
Pre-trained models: Cloud Vision API, Cloud Speech API, Natural Language API,
Google Translate API, Video Intelligence API, Cloud Jobs API BETA

That's all folks...
@martin_gorner
Tensorflow and
deep learning
without a PhD

1
neurons
Workshop
Keyboard shortcuts for the
visualisation GUI:

1 ......... display 1st graph only


2 ......... display 2nd graph only
3 ......... display 3rd graph only
4 ......... display 4th graph only
5 ......... display 5th graph only
6 ......... display 6th graph only
7 ......... display graphs 1 and 2
8 ......... display graphs 4 and 5
9 ......... display graphs 3 and 6
ESC or 0 .. back to displaying all graphs

SPACE ..... pause/resume


O ......... box zoom mode (then use mouse)
H ......... reset all zooms
Ctrl-S .... save current image
If training speed is an issue (it can happen in VirtualBox), consider displaying graph 3
while you wait instead of all the six graphs. You can also disable the visualisations
altogether. There are instructions for that purpose at the end of each code sample.

@martin_gorner
Workshop
Self-paced code lab (summary below ↓): goo.gl/mVZloU
Code: github.com/martin-gorner/tensorflow-mnist-tutorial

1-5. Theory (install, then sit back and listen or read)
     Neural networks 101: softmax, cross-entropy, mini-batching, gradient descent,
     hidden layers, sigmoids, and how to implement them in Tensorflow

6. Practice (full instructions for this step)
   Open file: mnist_1.0_softmax.py
   Run it, play with the visualisations (keyboard shortcuts on the previous slide),
   read and understand the code as well as the basic structure of a Tensorflow program.

7. Practice (full instructions for this step)
   Start from the file mnist_1.0_softmax.py and add one or two hidden layers.
   Solution in: mnist_2.0_five_layers_sigmoid.py

8. Practice (full instructions for this step)
   Special care for deep neural networks: use RELU activation functions, use a better
   optimiser, initialise weights with random values and beware of log(0).

9-10. Practice (full instructions for this step)
   Use a decaying learning rate and then add dropout.
   Solution in: mnist_2.2_five_layers_relu_lrdecay_dropout.py

11. Theory (sit back and listen or read)
    Convolutional networks

12. Practice (full instructions for this step)
    Replace your model with a convolutional network, without dropout.
    Solution in: mnist_3.0_convolutional.py

13. Challenge (full instructions for this step)
    Try a bigger neural network (good hyperparameters on slide 43) and add dropout
    on the last layer to get >99%.
    Solution in: mnist_3.0_convolutional_bigger_dropout.py


>TensorFlow, deep learning and \
recurrent neural networks
without a PhD_

deep deep
Science ! Code ...

#Tensorflow @martin_gorner
The superpower: batch normalisation
Data “whitening”

Data: large values, different scales, skewed, correlated

O’REILLY TensorfFlow World @martin_gorner


Data “whitening”

Subtract average
Modified data: centered around zero, rescaled... Divide by std dev

O’REILLY TensorfFlow World @martin_gorner


Data “whitening”

(A+B)/2

A-B

Modified data: … and decorrelated (that was almost a Principal Component Analysis)

O’REILLY TensorfFlow World @martin_gorner


Data “whitening”
Scale & rotate, then shift:

    [newA  newB] = [A  B] . W + b        e.g.  W = [[0.05, 0.12], [0.61, -1.23]],  b = [-1.45, 0.12]

A network layer can do this !

O’REILLY TensorfFlow World @martin_gorner
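A small NumPy sketch of the whitening steps above (an assumption, not slide code; the synthetic data is just for illustration):

import numpy as np

data = np.random.randn(1000, 2) * [5.0, 0.3] + [10.0, -2.0]   # large values, different scales

centered = data - data.mean(axis=0)           # subtract average
rescaled = centered / data.std(axis=0)        # divide by std dev
A, B = rescaled[:, 0], rescaled[:, 1]
decorrelated = np.stack([(A + B) / 2, A - B], axis=1)   # (A+B)/2 and A-B, as on the slide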


Fully connected network
784

OK
200
OK ?
100
OK ???
60

30 OK ???

softmax OK ???
10
0 1 2 ... 9

O’REILLY TensorfFlow World @martin_gorner


Without batch normalisation

My distribution
of inputs

sigmoid

boo-hoo

O’REILLY TensorfFlow World @martin_gorner


Batch normalisation

"logit" = weighted sum + bias

Compute the average and variance on the mini-batch (one of each per neuron), then center
and re-scale the logits before the activation function (decorrelate ? no, too complex):

    BN(x) = α . (x - avg(x)) / stdev(x) + β

Add a learnable scale α and offset β for each logit so as to restore expressiveness.
Try α = stdev(x) and β = avg(x) and you have BN(x) = x.

O’REILLY TensorfFlow World @martin_gorner
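A minimal NumPy sketch of the formula above (the epsilon is an assumption, added to avoid dividing by zero):

import numpy as np

def batch_norm(x, alpha, beta, eps=1e-5):
    mean = x.mean(axis=0)                     # average over the mini-batch, per neuron
    var = x.var(axis=0)                       # variance over the mini-batch, per neuron
    x_hat = (x - mean) / np.sqrt(var + eps)   # center and re-scale the logits
    return alpha * x_hat + beta               # learnable scale and offset

logits = np.random.randn(100, 200)            # batch of 100, 200 neurons
alpha, beta = np.ones(200), np.zeros(200)
bn = batch_norm(logits, alpha, beta)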


Batch normalisation

x = weighted sum + bias: depends on the weights, the biases and one image
avg and stdev: depend on the same weights and biases, and on all the images in the mini-batch
α, β: only one of each per neuron

x = weighted sum + bias  →  Batch-norm (α, β)  →  activation fn

=> BN is differentiable with respect to the weights, the biases, α and β.
It can be used as a layer in the network; gradient calculations will still work.

O’REILLY TensorfFlow World @martin_gorner


With batch normalisation (sigmoid)

distribution of sigmoid
neuron output

Batch norm

O’REILLY TensorfFlow World @martin_gorner


With batch normalisation (RELU)

My distribution
of inputs

RELU

O’REILLY TensorfFlow World @martin_gorner


Batch normalisation done right

Per neuron:  x = weighted sum (without bias)  →  BN  →  activation fn

Biases are no longer useful: the BN offset β plays their role.
When the activation fn is RELU, the scale α is not useful either: it does not modify the output distribution.
So: RELU → BN with β only;  sigmoid → BN with α and β.

+ You can go faster: use a higher learning rate
+ BN also regularises: lower or remove dropout
O’REILLY TensorfFlow World @martin_gorner


Convolutional batch normalisation

Each neuron (filter) produces a value:
● per image in the batch
● per x position
● per y position

=> compute avg and stdev across all batchsize x width x height values

Still one bias, scale or offset per neuron: b1, α1, β1 for W1[4, 4, 3]; b2, α2, β2 for W2[4, 4, 3]

O’REILLY TensorfFlow World @martin_gorner


Batch normalisation at test time

Stats on what ?
● Last batch: no
● all images: yes (but not practical)
● => Exponential moving average during training

O’REILLY TensorfFlow World @martin_gorner
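A sketch of the exponential moving average idea in plain Python (an assumption; the TensorFlow version on the next slide uses tf.train.ExponentialMovingAverage):

decay = 0.9999
moving_mean = None
for batch_mean in batch_means_seen_during_training:   # hypothetical iterable of per-batch means
    if moving_mean is None:
        moving_mean = batch_mean
    else:
        moving_mean = decay * moving_mean + (1 - decay) * batch_mean
# at test time, use moving_mean instead of the last batch statistics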


Batch normalisation with Tensorflow
# define one offset and/or scale per neuron
def batchnorm_layer(Ylogits, is_test, Offset, Scale, iteration, convolutional=False):
    exp_moving_avg = tf.train.ExponentialMovingAverage(0.9999, iteration)
    if convolutional:  # avg across batch, width, height
        mean, variance = tf.nn.moments(Ylogits, [0, 1, 2])
    else:
        mean, variance = tf.nn.moments(Ylogits, [0])
    update_moving_averages = exp_moving_avg.apply([mean, variance])
    m = tf.cond(is_test, lambda: exp_moving_avg.average(mean), lambda: mean)
    v = tf.cond(is_test, lambda: exp_moving_avg.average(variance), lambda: variance)
    Ybn = tf.nn.batch_normalization(Ylogits, m, v, Offset, Scale, variance_epsilon=1e-5)
    return Ybn, update_moving_averages

# apply the activation fn on Ybn
# don't forget to execute update_moving_averages (sess.run)

The code is on GitHub: goo.gl/DEOe7Z
O'REILLY TensorFlow World @martin_gorner
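A hedged usage sketch for the function above; the surrounding variable names (tst, iter_, offset, K, X, W1) are assumptions borrowed from the earlier convolutional slides, not code shown here:

tst = tf.placeholder(tf.bool)                      # is_test: True at evaluation time
iter_ = tf.placeholder(tf.int32)                   # training iteration counter
offset = tf.Variable(tf.zeros([K]))                # one offset per neuron (RELU case: no scale)
Y1l = tf.nn.conv2d(X, W1, strides=[1, 1, 1, 1], padding='SAME')
Y1bn, update_ema1 = batchnorm_layer(Y1l, tst, offset, None, iter_, convolutional=True)
Y1 = tf.nn.relu(Y1bn)                              # apply the activation fn on Ybn
# ... and remember to sess.run(update_ema1) alongside the training step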
Demo
99.5%

O’REILLY TensorfFlow World @martin_gorner


More superpowers

high level API

O’REILLY TensorfFlow World @martin_gorner


Layers

from tensorflow.contrib import layers

# this
Y = layers.relu(X, 200)

# instead of this
W = tf.Variable(tf.zeros([784, 200]))
b = tf.Variable(tf.zeros([200]))
Y = tf.nn.relu(tf.matmul(X,W) + b)

Sample: goo.gl/y1SSFy
O’REILLY TensorfFlow World @martin_gorner
Model function
from tensorflow.contrib import learn, layers, metrics
“features” and “targets“
def model_fn(X, Y_, mode):
Yn = … # model layers TRAIN, EVAL
prob = tf.nn.softmax(Yn) or INFER
digi = tf.argmax(prob, 1)

predictions = {"probabilities": prob, "digits": digi} #free-form


evaluations = {'accuracy': metrics.accuracy(digi, Y_)} #free-form

loss = tf.nn.softmax_cross_entropy_with_logits(…) learning rate


train = layers.optimize_loss(loss,framework.get_global_step(), 0.003,"Adam")

return learn.ModelFnOps(mode, predictions,loss,train,evaluations)


Sample: goo.gl/y1SSFy
O’REILLY TensorfFlow World @martin_gorner
Estimator

estimator = learn.Estimator(model_fn=model_fn)

estimator.fit(input_fn=… , steps=10000)

estimator.evaluate(input_fn=…, steps=1)
# => {'accuracy': … }

estimator.predict(input_fn=…)
# => {"probabilities":…, "digits":…}

# input_fn: feeds in batches of features and targets

Sample: goo.gl/y1SSFy
O’REILLY TensorfFlow World @martin_gorner
Convolutional network
def conv_model(X, Y_, mode):
XX = tf.reshape(X, [-1, 28, 28, 1])
Y1 = layers.conv2d(XX, num_outputs=6, kernel_size=[6, 6])
Y2 = layers.conv2d(Y1, num_outputs=12, kernel_size=[5, 5], stride=2)
Y3 = layers.conv2d(Y2, num_outputs=24, kernel_size=[4, 4], stride=2)
Y4 = layers.flatten(Y3)
Y5 = layers.relu(Y4, 200)
Ylogits = layers.linear(Y5, 10)
prob = tf.nn.softmax(Ylogits)

digi = tf.cast(tf.argmax(prob, 1), tf.uint8)


predictions = {"probabilities": prob, "digits": digi} #free-form
evaluations = {'accuracy': metrics.accuracy(digi, Y_)} #free-form
loss = tf.nn.softmax_cross_entropy_with_logits(Ylogits, tf.one_hot(Y_, 10))
train = layers.optimize_loss(loss, framework.get_global_step(), 0.003, "Adam")
return learn.ModelFnOps(mode, predictions, loss, train, evaluations)

estimator = learn.Estimator(model_fn=conv_model)
Sample: goo.gl/y1SSFy
O’REILLY TensorfFlow World @martin_gorner
Recurrent Neural Networks
>TensorFlow, Keras and \
recurrent neural networks
without a PhD_

deep deep
Science ! Code ...

bit.ly/keras-rnn-codelab

#Tensorflow @martin_gorner
Neural network 101 (reminder)
20x20x3

1200

200

20

O’REILLY TensorfFlow World @martin_gorner


Activation functions (reminder)

sigmoid, tanh, relu: activation functions for the hidden layers (applied to the weighted sum + bias of the inputs)

On the last layer:
softmax (classification)
nothing (regression)
O’REILLY TensorfFlow World @martin_gorner
RNN
RNN cell: inputs Xt, outputs Yt, internal state H of size N
(tanh inside the cell, softmax on the output)

O’REILLY TensorfFlow World @martin_gorner


RNN
X  = Xt | Ht-1           (concatenation)
Ht = tanh(X.WH + bH)     (RNN cell)
Yt = softmax(Ht.W + b)

O’REILLY TensorfFlow World @martin_gorner
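A NumPy sketch of one RNN cell step following the equations above (the sizes p, n, m and random weights are assumptions, for illustration only):

import numpy as np

p, n, m = 8, 16, 4
Wh, bh = np.random.randn(p + n, n) * 0.1, np.zeros(n)
W, b = np.random.randn(n, m) * 0.1, np.zeros(m)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(xt, h_prev):
    x = np.concatenate([xt, h_prev])   # X = Xt | Ht-1
    h = np.tanh(x @ Wh + bh)           # Ht = tanh(X.WH + bH)
    y = softmax(h @ W + b)             # Yt = softmax(Ht.W + b)
    return y, h

h = np.zeros(n)
for xt in np.random.randn(5, p):       # unroll over a sequence of 5 inputs
    y, h = rnn_step(xt, h)             # the same weights are shared across iterations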


RNN training

X0 X1 X2 X3 X4 X5

H-1 H0 H1 H2 H3 H4 H5
cell cell cell cell cell cell

Y0 Y1 Y2 Y3 Y4 Y5

The same weights and biases shared across iterations


O’REILLY TensorfFlow World @martin_gorner
Deep RNN
L: number of layers

X0 X1 X2 X3 X4 X5

0 H0 H1 H2 H3 H4 H5
cell cell cell cell cell cell

0 H’0 H’1 H’2 H’3 H’4 H’5


cell cell cell cell cell cell

Y0 Y1 Y2 Y3 Y4 Y5

O’REILLY TensorfFlow World @martin_gorner


Long term dependencies: a problem

Michel C. was born in Paris, France. He is married and has three children. He received a M.S.
in neurosciences from the University Pierre & Marie Curie and the Ecole Normale Supérieure in 1987,
and and then spent most of his career in Switzerland, at the Ecole Polytechnique de Lausanne. He
specialized in child and adolescent psychiatry and his first field of research was severe mood disorders
in adolescent, topic of his PhD in neurosciences (2002). His mother tongue is ? ? ? ? ?
Short context
Miche
l
C. was born in … English,
German,
Hn-1 … Hn Russian,
French …
Frenc
h Long context Problems…

O’REILLY TensorfFlow World @martin_gorner


RNN cell types

Simple RNN cell      GRU cell ("Gated Recurrent Unit")      LSTM cell ("Long Short Term Memory")

(cell diagrams: each cell combines Xt with the previous state Ht-1 through sigmoid (σ) and tanh gates
and element-wise × and + operations; the LSTM also carries a separate cell state Ct)

O’REILLY TensorfFlow World @martin_gorner


LSTM
LSTM = Long Short Term Memory

X  = Xt | Ht-1          (concatenation)
f  = σ(X.Wf + bf)
u  = σ(X.Wu + bu)
r  = σ(X.Wr + br)
X' = tanh(X.Wc + bc)
Ct = f * Ct-1 + u * X'
Ht = r * tanh(Ct)
Yt = softmax(Ht.W + b)

σ and tanh boxes: neural net layers;  × and + : element-wise operations

O’REILLY TensorfFlow World @martin_gorner
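A NumPy sketch of one LSTM step, directly transcribing the equations above (the weight shapes and sigmoid helper are assumptions; the softmax readout is omitted):

import numpy as np

def sigma(z): return 1.0 / (1.0 + np.exp(-z))

p, n = 8, 16                                      # input size, internal size
Wf, Wu, Wr, Wc = (np.random.randn(p + n, n) * 0.1 for _ in range(4))
bf = bu = br = bc = np.zeros(n)

def lstm_step(xt, h_prev, c_prev):
    x = np.concatenate([xt, h_prev])              # X = Xt | Ht-1
    f = sigma(x @ Wf + bf)                        # forget gate
    u = sigma(x @ Wu + bu)                        # update gate
    r = sigma(x @ Wr + br)                        # result gate
    x_new = np.tanh(x @ Wc + bc)                  # input
    c = f * c_prev + u * x_new                    # new C
    h = r * np.tanh(c)                            # new H
    return h, c

h, c = np.zeros(n), np.zeros(n)
h, c = lstm_step(np.random.randn(p), h, c)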


LSTM
                                          vector sizes
concatenate : X  = Xt | Ht-1              p+n
forget gate : f  = σ(X.Wf + bf)           n
update gate : u  = σ(X.Wu + bu)           n
result gate : r  = σ(X.Wr + br)           n
input       : X' = tanh(X.Wc + bc)        n
new C       : Ct = f * Ct-1 + u * X'      n
new H       : Ht = r * tanh(Ct)           n
output      : Yt = softmax(Ht.W + b)      m

O’REILLY TensorfFlow World @martin_gorner


Gru !
GRU
GRU = Gated Recurrent Unit
2 gates instead of 3 => cheaper
                                          vector sizes
concatenate : X  = Xt | Ht-1              p+n
update gate : z  = σ(X.Wz + bz)           n
reset gate  : r  = σ(X.Wr + br)           n
            : X' = Xt | r * Ht-1          p+n
            : X" = tanh(X'.Wc + bc)       n
new H       : Ht = (1-z) * Ht-1 + z * X"  n
output      : Yt = softmax(Ht.W + b)      m

O’REILLY TensorfFlow World @martin_gorner
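A NumPy sketch of one GRU step, following the equations above (shapes and random weights are assumptions, for illustration):

import numpy as np

def sigma(z): return 1.0 / (1.0 + np.exp(-z))

p, n = 8, 16
Wz, Wr, Wc = (np.random.randn(p + n, n) * 0.1 for _ in range(3))
bz = br = bc = np.zeros(n)

def gru_step(xt, h_prev):
    x = np.concatenate([xt, h_prev])              # X = Xt | Ht-1
    z = sigma(x @ Wz + bz)                        # update gate
    r = sigma(x @ Wr + br)                        # reset gate
    x_reset = np.concatenate([xt, r * h_prev])    # X' = Xt | r * Ht-1
    x_new = np.tanh(x_reset @ Wc + bc)            # X" = tanh(X'.Wc + bc)
    return (1 - z) * h_prev + z * x_new           # Ht = (1-z) * Ht-1 + z * X"

h = gru_step(np.random.randn(p), np.zeros(n))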


Language model in Tensorflow
Characters,
one-hot encoded

S t _ J o h
0 H5
character-
based

t _ J o h n

O’REILLY TensorfFlow World @martin_gorner


Language model in Tensorflow
defines weights and
biases internally
cells = [tf.nn.rnn_cell.GRUCell(CELLSIZE) for i in range(NLAYERS)]
mcell = tf.nn.rnn_cell.MultiRNNCell(cells, state_is_tuple=False)

Hr, H = tf.nn.dynamic_rnn(mcell, X, initial_state=Hin)


(diagram: 3 stacked GRU layers, unrolled over the input sequence X0 ... X8; each step passes its
states H, H', H" to the next step; the initial state Hin is fed in, the final state H is read out)

ALPHASIZE = 98
CELLSIZE = 512
NLAYERS = 3
SEQLEN = 30

O’REILLY TensorfFlow World @martin_gorner


Softmax readout layer
# Hr [ BATCHSIZE, SEQLEN, CELLSIZE ]
Hf = tf.reshape(Hr, [-1, CELLSIZE]) [ BATCHSIZE x SEQLEN, CELLSIZE ]
Ylogits = tf.layers.dense(Hf, ALPHASIZE) [ BATCHSIZE x SEQLEN, ALPHASIZE ]
Y = tf.nn.softmax(Ylogits) [ BATCHSIZE x SEQLEN, ALPHASIZE ]
(diagram: the unrolled GRU stack produces an output for every character position; the softmax
readout layer is applied to all of them at once)

Tip: handle sequence elements and batch elements the same way.

loss = tf.nn.softmax_cross_entropy_with_logits(Ylogits, Y_)

ALPHASIZE = 98, CELLSIZE = 512, NLAYERS = 3, SEQLEN = 30

O’REILLY TensorfFlow World @martin_gorner


Inputs and outputs

input characters  "S t _ A n d r e"  [ BATCHSIZE, SEQLEN ], one-hot encoded to [ BATCHSIZE, SEQLEN, ALPHASIZE ]
internal state H                     [ BATCHSIZE, CELLSIZE x NLAYERS ]
output characters "t _ A n d r e w"  [ BATCHSIZE, SEQLEN, ALPHASIZE ]

ALPHASIZE = 98, CELLSIZE = 512, NLAYERS = 3, SEQLEN = 30

O’REILLY TensorfFlow World @martin_gorner


Placeholders, and the rest...

Xd  = tf.placeholder(tf.uint8, [None, None])                  # [ BATCHSIZE, SEQLEN ]
X   = tf.one_hot(Xd, ALPHASIZE, 1.0, 0.0)                     # [ BATCHSIZE, SEQLEN, ALPHASIZE ]
Yd_ = tf.placeholder(tf.uint8, [None, None])                  # [ BATCHSIZE, SEQLEN ]
Y_  = tf.one_hot(Yd_, ALPHASIZE, 1.0, 0.0)                    # [ BATCHSIZE, SEQLEN, ALPHASIZE ]
Hin = tf.placeholder(tf.float32, [None, CELLSIZE*NLAYERS])    # [ BATCHSIZE, CELLSIZE x NLAYERS ]

# Y, loss, Hout = my_model(X, Y_, Hin)                        # Y: [ BATCHSIZE x SEQLEN, ALPHASIZE ]

predictions = tf.argmax(Y, 1)                                 # [ BATCHSIZE x SEQLEN ]
predictions = tf.reshape(predictions, [batchsize, -1])        # [ BATCHSIZE, SEQLEN ]

train_step = tf.train.AdamOptimizer(1e-3).minimize(loss)

# ALPHASIZE = 98, CELLSIZE = 512, NLAYERS = 3, SEQLEN = 30

O’REILLY TensorfFlow World @martin_gorner


Bitchin’ batchin’
        Batch 1     Batch 2     Batch 3
start   The quic    k brown     fox jump
later   seventh     heaven o    f typogr
later   Mr. Herm    ann Zapf    was the

The internal state Ht is carried over from one batch to the next (Ht-1 → Ht → Ht+1 → Ht+2).

for x, y_ in utils.rnn_minibatch_sequencer(codetext, BATCHSIZE, SEQLEN, nb_epochs=10):
    ...
O’REILLY TensorfFlow World @martin_gorner
Language model in Tensorflow

Xd = tf.placeholder(tf.uint8, [None, None])
X = tf.one_hot(Xd, ALPHASIZE, 1.0, 0.0)
Yd_ = tf.placeholder(tf.uint8, [None, None])
Y_ = tf.one_hot(Yd_, ALPHASIZE, 1.0, 0.0)
Hin = tf.placeholder(tf.float32, [None, CELLSIZE*NLAYERS])

# the model
cells = [tf.nn.rnn_cell.GRUCell(CELLSIZE) for i in range(NLAYERS)]
mcell = tf.nn.rnn_cell.MultiRNNCell(cells, state_is_tuple=False)
Hr, H = tf.nn.dynamic_rnn(mcell, X, initial_state=Hin)

# softmax output layer
Hf = tf.reshape(Hr, [-1, CELLSIZE])
Ylogits = layers.linear(Hf, ALPHASIZE)
Y = tf.nn.softmax(Ylogits)
Yp = tf.argmax(Y, 1)
Yp = tf.reshape(Yp, [batchsize, -1])

# loss and training step (optimizer)
loss = tf.nn.softmax_cross_entropy_with_logits(Ylogits, Y_)
train_step = tf.train.AdamOptimizer(1e-3).minimize(loss)

# training loop
for epoch in range(20):
    inH = np.zeros([BATCHSIZE, INTERNALSIZE*NLAYERS])
    for x, y_ in utils.rnn_minibatch_sequencer(codetext, BATCHSIZE, SEQLEN, nb_epochs=30):
        dic = {Xd: x, Yd_: y_, Hin: inH}
        _, y, outH = sess.run([train_step, Yp, H], feed_dict=dic)
        inH = outH

The code is on GitHub: github.com/martin-gorner/tensorflow-rnn-shakespeare
ALPHASIZE = 98, CELLSIZE = 512, NLAYERS = 3, SEQLEN = 30
O’REILLY TensorfFlow World @martin_gorner
Shakespeare
0.03 ee o no nonnaoter s ee seih iae r t i r io i ro s
epochs sierota tsohoreroneo rsa esia anehereeo hensh
rho etnrhhs iti saoitns t et rsearh tshseoeh ta
oirhroren e eaetetnesnareeeoaraihss nshtano eter
e oooaoaeee nonn is heh easren ieson httn
nihensont t e n a ooe oerhi neaeehteriseat tiet i i
ntsh
orhi e ohhsiea e aht ohr er ra eeo oeeitrot
hethisesaaei o saeii straieiteoeresorh e ooeri
e ninesh sort a es h rs hattnteseato sonoanr
sniaase s rshninsasi na sntennn oti r etnsnrse oh n C1

r e tiathhnaeeano
O’REILLY TensorfFlow World trrr hhohooon rrt eernre e rnoh
@martin_gorner
Shakespeare
0.1 II WERENI
epochs Are I I wos the wheer boaer.
Tin thim mh cals sate bauut site tar oue tinl an
bsisonetoal yer an fimireeren.

L[IO SI Hns oret bsllssts aaau ton hete me toer


frurtor sheus aed trat

A faler bis tote oadt tou than male, tel mou ce


an cime. ais fauto ws cien whus yas. Ande fert te a
ut wond aal sinr be at saar C3

O’REILLY TensorfFlow World @martin_gorner


Shakespeare
0.2 BERENS Hall hat in she the hir meres.
epochs
Perstr in ame not of heard, me thin hild of shear and
ant on of mare. I lore wes lour.

DOCHES The chaster'd on not fenst


The laldoos more.

[Ixeln thrish] Stage directions ?

And tho priines sith of hamdeling the san wind C5

O’REILLY TensorfFlow World @martin_gorner


Shakespeare
1 KING LEAR Alas, I am not forsworn both to bod!
epoch And let the firm I have to'st trainoured.

Invented KING HENRY VIII I love not my father.


names !

PORDIA He tash you will have it.

HENRY BLUTIUS Work, thou lovest my son here,


thy father's fath!

CLIOND Why, then, would say, the beasts are C6

O’REILLY TensorfFlow World @martin_gorner


Shakespeare
30 TITUS ANDRONICUS
epochs
ACT I

SCENE III An ante-chamber. The COUNT's


palace.

[Enter CLEOMENES, with the Lord SAY]

Chamberlain Let me see your worshing in my hands.


B10

LUCETTA
O’REILLY TensorfFlow World I am a sign of me, and sorrow sounds
@martin_gorner
Shakespeare
30 And sorrow far into the stars of men,
epochs Without a second tears to seek the best and
bed,
With a strange service, and the foul prince of
Rome

[Exeunt MARK ANTONY and LEPIDUS]

Well said, my lord,--

MENENIUS I do not say so. B10

O’REILLY TensorfFlow World Well, I will@martin_gorner


not have no better ways;
Python code
0.03 diassts_= =tlns==eti.s=tessn_((
epochs sie_s_nts_ens= dondtnenroe dnar taonte
srst anttntoilonttiteaen

detrtstinsenoaolsesnesoairt(

arssserleeeerltrdlesssoeeslslrlslie(e
drnnaleeretteaelreesioe niennoarens
dssnstssaorns sreeoeslrteasntotnnai(ar
dsopelntederlalesdanserl
lts(sitae(e) A1

O’REILLY TensorfFlow World @martin_gorner


Python code
0.1 with
epochs self.essors_sigeater(output_dits_allss,
self._train.
Python for sampated to than ubtexsormations.
keywords

expeddions = np.randim(natched_collection,
ranger, mang_ops, samplering)

def assestErrorume_gens(assignex) as
and(sampled_veases):
eved. A2

O’REILLY TensorfFlow World @martin_gorner


Python code
def testGiddenSelfBeShareMecress(self):
0.4 with self.test_session() as sess:
tat = tf.contrib.matrix.cast_column_variable([1, 1], [0, 1, 1], [1, 7]],
epochs [[1, 1, 1]].file(file, line_state_will_file))
with self.test_session():
self.assertAllEqual(1, l.ex6)
self.assertEqual(output_graph_def is_output_tensors_op(
tf.pro_context_name.sqrt(sess) Wrong
Correct ([])
def test_shape(self):
use of res = values=value_rns[0].eval()) nesting
colons:
def tempDimpleSeriesGredicsIothasedWouthAverageData(self):
self._testDirector(self):
self._test_inv3_size = 5
with tf.train.ConvolutioBailLors_startswith("save_dir_context.PutIsprint().eval())
Hallucinated return tf.contrib.learn.RUCISLCCS:
function # Check the orfloating so that the nimesting object mumputable othersifier.
names # dense_keys.tokens_prefix/statch_size of the input1 tensors.
A3
@property

O’REILLY TensorfFlow World @martin_gorner


Python code
# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
12 #
# Licensed under the Apache License, Version 2.0 (the "License");
epochs # you may not use this file except in compliance with the License. Recites
# You may obtain a copy of the License at Apache
#
# http://www.apache.org/licenses/LICENSE-2.0 license
#
# Unless required by applicable law or agreed to in [0.1, 2.0, 3.0]]

def __init__(self, expected): Correct triple ([]) nesting


Tensorflow return np.array([[0, 0, 0], [0, 0, 0]])
tips! self.assertAllEqual(tf.placeholder(tf.float32, shape=(3, 3)),(shape, prior.pack(),
tf.float32))
for keys in tensor_list:
return np.array([[0, 0, 0]]).astype(np.float32)

# Check that we have both scalar tensor for being invalid to a vector of 1 indicating
# the total loss of the same shape as the shape of the tensor.
sharded_weights = [[0.0, 1.0]]
# Create the string op to apply gradient terms that also batch.
B10
# The original any operation as a code when we should alw infer to the session case.

O’REILLY TensorfFlow World @martin_gorner


...and more
Credit to Andrej
Karpathy’s blog:

The Unreasonable
Effectiveness of
Recurrent Neural
Networks

O’REILLY TensorfFlow World @martin_gorner


Tensorflow: save, restore
=> Save variables in file_200, the graph in file_200.meta
saver = tf.train.Saver(keep_checkpoint_every_n_hours=0.1, max_to_keep=5)
with tf.Session() as sess:
# ... training loop ...
saver.save(sess, 'file_' , global_step=iter)

=> Restore graph and variable values


with tf.Session() as sess:
resto = tf.train.import_meta_graph('file_200.meta')
resto.restore(sess, 'file_200')

Must name variables explicitly !!!


# when saving # when using restored graph
X = tf.placeholder(tf.uint8, name='X') y,h = sess.run(['Y:0', 'H:0'],
Y = tf.nn.softmax(Ylogits, name='Y') feed_dict={'X:0': y} )

O’REILLY TensorfFlow World @martin_gorner


Shakespeare generation
with tf.Session() as sess:                     # generate one character at a time
    resto = tf.train.import_meta_graph('shake_200.meta')
    resto.restore(sess, 'shake_200')

    # initial values
    x = np.array([[0]])  # [BATCHSIZE, SEQLEN] with BATCHSIZE=1 and SEQLEN=1
    h = np.zeros([1, INTERNALSIZE * NLAYERS], dtype=np.float32)

    for i in range(100000):
        dic = {'X:0': x, 'Hin:0': h, 'batchsize:0': 1}
        y, h = sess.run(['Y:0', 'H:0'], feed_dict=dic)
        c = my_txtutils.sample_from_probabilities(y, topn=5)
        x = np.array([[c]])  # shape [BATCHSIZE, SEQLEN] with BATCHSIZE=1 and SEQLEN=1
        print(chr(my_txtutils.convert_to_ascii(c)), end="")

O’REILLY TensorfFlow World @martin_gorner


Tensorboard

Tip: use time in


logdir name
summary_writer = tf.train.SummaryWriter("log/train_" + time)
loss_summary = tf.scalar_summary("batch_loss", loss)
# in training loop: Tip: use a second
smm = sess.run(summaries, feed_dict=dic) SummaryWriter for
summary_writer.add_summary(smm, iteration) validation results

O’REILLY TensorfFlow World @martin_gorner


RNN shapes
Characters,
one-hot encoded

S t _ J o h
0 H5
character-
based

t _ J o h n

O’REILLY TensorfFlow World @martin_gorner


RNN shapes

Words encoded as vectors: "embeddings" (trained, or constant => see Word2Vec)

embeddings = tf.Variable(tf.random_uniform([vocab_size, embed_size]))
X = tf.nn.embedding_lookup(embeddings, train_inputs)

Text classification: "The USA and China have agreed ..." → geopolitics

Tensorflow sample: goo.gl/m41mNp

O’REILLY TensorfFlow World @martin_gorner


Bitchin’ batchin’
seq
len
China and the USA have agreed to a new round of talks 12
The quick brown fox jumps over the lazy dog . ∅ ∅ 10
Boys will be boys . ∅ ∅ ∅ ∅ ∅ ∅ ∅ 5
Tom , get your coat . We are going out . ∅ 11
Math rules the world . Men rule math . ∅ ∅ ∅ 9

0 Hn

geopolitics
Hr, H = tf.nn.dynamic_rnn(mcell, X, initial_state=Hin, sequence_length=slen)

O’REILLY TensorfFlow World @martin_gorner


RNN shapes

Words encoded as vectors: text translation

encoder (slow): "The red cat ate the mouse ∅"
decoder (fast): "Le chat rouge a mangé la souris ∅"     tf.nn.sampled_softmax_loss(…)

Tensorflow sample: goo.gl/KyKLDv

O’REILLY TensorfFlow World @martin_gorner


RNN shapes

Images encoded as vectors (for example the output of a convolutional network or auto-encoder):
image captioning (simplified)

input:  image vector, then ∅
output: "A man on a beach flying a kite ∅"

Google's neural net for image captioning: goo.gl/VgZUQZ
O’REILLY TensorfFlow World @martin_gorner
Image captioning

"A herd of elephants walking across a dry grass field."
"A person riding a motorcycle on a dirt road."

Google’s neural net for image captioning: goo.gl/VgZUQZ


O’REILLY TensorfFlow World @martin_gorner
Image captioning

"A yellow school bus parked in a parking lot."
"A refrigerator filled with lots of food and drinks."

Google’s neural net for image captioning: goo.gl/VgZUQZ


O’REILLY TensorfFlow World @martin_gorner
Cloud Machine Learning Engine

O’REILLY TensorfFlow World @martin_gorner


Data-parallel distributed training
asynchronous
parameter servers
updates
W’ = W + ∆W

I ♡ noise

model
replicas

data

O’REILLY TensorfFlow World @martin_gorner


TF high level API

from tensorflow.contrib import learn


“features” and “targets
def model_fn(X, Y_, mode):
Yn = … # model layers

predictions = {"probabilities": …, "digits": …} #free-form


evaluations = {'accuracy': metrics.accuracy(…)} #free-form
loss = …
train = layers.optimize_loss(loss, …)

return learn.ModelFnOps(mode, predictions,loss,train,evaluations)

Samples: goo.gl/F3i3bf, goo.gl/CofxFM


O’REILLY TensorfFlow World @martin_gorner
Estimator, Experiment, learn_runner
from tensorflow.contrib.learn.python.learn.utils import saved_model_export_utils

def experiment_fn(job_dir):
return learn.Experiment(
estimator=learn.Estimator(model_fn, model_dir=job_dir,
config=learn.RunConfig(save_checkpoints_secs=None,
save_checkpoints_steps=1000)),
train_input_fn=…, # data feed
trainingInput: eval_input_fn=…, # data feed
scaleTier: STANDARD_1
train_steps=10000,
eval_steps=1,
Free stuff !!! export_strategies=make_export_strategy(export_input_fn=
Tensorboard graphs serving_input_fn))

Resume on fail def main(argv=None):


Parallel data feeds job_dir = # parse argument --job-dir
Serving model export learn_runner.run(experiment_fn, job_dir)
Distributed training if __name__ == '__main__': main()

Samples: goo.gl/F3i3bf, goo.gl/CofxFM


O’REILLY TensorfFlow World @martin_gorner
Data queues for distributed training

# dummy implementation for data that fits in memory


def train_data_input_fn(mnist):
images = tf.constant(mnist.train.images)
batch size
labels = tf.constant(mnist.train.labels)
return tf.train.shuffle_batch([images, labels], 100,
1100, 1000, enqueue_many=True)
trainingInput: Inserts queue nodes
scaleTier: STANDARD_1
Into TF graph

# dummy implementation for data that fits in memory


def eval_data_input_fn(mnist):
return tf.constant(mnist.test.images),
tf.constant(mnist.test.labels)

For practical data


queuing use the
TF Records format

Samples: goo.gl/F3i3bf, goo.gl/CofxFM


O’REILLY TensorfFlow World @martin_gorner
Serving input function
Batch of images
For MNIST
# Online predictions on Cloud ML Engine
def serving_input_fn():

trainingInput: # Placeholder for data deserialised from JSON


scaleTier: STANDARD_1 inputs = {'A': tf.placeholder(tf.uint8, [None, 28, 28])}

# Transform the data as needed


features = [tf.cast(inputs['A'], tf.float32)]

return input_fn_utils.InputFnOps(features, None, inputs)

Samples: goo.gl/F3i3bf, goo.gl/CofxFM


O’REILLY TensorfFlow World @martin_gorner
Run it
gcloud ml-engine jobs submit training job22
--job-dir=gs://mybucket/job22
--package-path=trainer tensorboard
--module-name=trainer.task summaries
--config=config.yaml model checkpoints
trainingInput:
--
scaleTier: STANDARD_1 --<custom model arguments here>

Deploy trained model to prod = click click click


autoscaled gcloud ml-engine predict
serving --model <model_name>
--json-instances mydigits.json

Samples: goo.gl/F3i3bf, goo.gl/CofxFM


O’REILLY TensorfFlow World @martin_gorner
Demo: aucnet

Retrain Inception yourself: goo.gl/Z9eNek


O’REILLY TensorfFlow World @martin_gorner
Have fun !
Martin Görner, Google Developer relations, @martin_gorner

Videos, slides, code: github.com/GoogleCloudPlatform/tensorflow-without-a-phd

Cloud ML Engine: your TensorFlow models trained in Google's cloud
Cloud Auto ML Vision ALPHA: just bring your data
Cloud TPU BETA: ML supercomputing
Pre-trained models: Cloud Vision API, Cloud Speech API, Natural Language API,
Google Translate API, Video Intelligence API, Cloud Jobs API PRIVATE BETA

That's all folks...
O’REILLY TensorfFlow World @martin_gorner
Tensorflow and
deep learning
without a PhD

1
neurons
O’REILLY TensorfFlow World @martin_gorner
>TensorFlow, deep learning and \
modern convolutional neural nets
without a PhD_

#Tensorflow #GoogleCloud @martin_gorner


Kaggle

O’Reilly AI @martin_gorner
Fully-connected layers

# input: batch of 20x20x3 image tiles (batch size is implicit)
X = tf.reshape(images, [-1, 20*20*3])                    # 1200 values per image

Y1 = tf.layers.dense(X, 200, activation=tf.nn.relu)      # 200 neurons
Y2 = tf.layers.dense(Y1, 20, activation=tf.nn.relu)      # 20 neurons
Ylogits = tf.layers.dense(Y2, 2)                         # 2 outputs

# correct answer: plane [1,0], not plane [0,1]
loss = tf.losses.softmax_cross_entropy(tf.one_hot(is_plane, 2), Ylogits)

train_op = tf.train.AdamOptimizer(0.001).minimize(loss)  # 0.001 is the learning rate
O’Reilly AI @martin_gorner
Activation functions
Hidden layers: relu (applied to the weighted sum + bias of the inputs)
Classification head (last layer): softmax

O’Reilly AI @martin_gorner
Cookbook

Relu, softmax
Cross-entropy
Convolutional layer
+padding

weights W[4, 4, 3, 4]: filter size 4x4, 3 input channels, 4 filters (nb of filters = output channels)

O’Reilly AI @martin_gorner
Convolutional networks

# W1[3, 3, 4, 6]
Y1 = tf.layers.conv2d(Y0, filters=6, kernel_size=3, strides=1, padding="same",
                      activation=tf.nn.relu)

# W2[2, 2, 6, 10], stride 2
Y2 = tf.layers.conv2d(Y1, filters=10, kernel_size=2, strides=2, padding="same",
                      activation=tf.nn.relu)

# W3[1, 1, 10, ...]: a "one by one" convolution ?
Y3 = tf.layers.conv2d(Y2, filters=…, kernel_size=1, strides=1, padding="same",
                      activation=tf.nn.relu)

# can also use pooling to reduce the x,y size
tf.layers.max_pooling2d(pool_size=2, strides=2)

O’Reilly AI @martin_gorner
Tensorflow - the model

# input image batch X[100, 20, 20, 3] (20x20 x 3)
Y1 = tf.layers.conv2d(X, filters=8, kernel_size=4, strides=1,
                      padding="same", activation=tf.nn.relu)       # 20x20 x 8
Y2 = tf.layers.conv2d(Y1, filters=16, kernel_size=3, strides=2,
                      padding="same", activation=tf.nn.relu)       # 10x10 x 16
Y3 = tf.layers.conv2d(Y2, filters=32, kernel_size=2, strides=2,
                      padding="same", activation=tf.nn.relu)       # 5x5 x 32

# flatten all values (5x5x32 = 800) for the fully connected layer
YY = tf.reshape(Y3, shape=[-1, 5*5*32])
Y4 = tf.layers.dense(YY, 100, activation=tf.nn.relu)
Yl = tf.layers.dense(Y4, 2, activation=None)                       # 2 outputs
Y = tf.nn.softmax(Yl)

loss = tf.losses.softmax_cross_entropy(targets, Yl)

O’Reilly AI @martin_gorner
Dropout, batch norm, learning rate decay

tf.train.exponential_decay        # decaying learning rate
tf.layers.dropout                 # between layers
tf.layers.batch_normalization     # between a layer and its activation
O’Reilly AI @martin_gorner
Cloud Machine Learning Engine

TensorBoard

AI
Platform
O’Reilly AI @martin_gorner
Hyperparameter tuning

Grid search | Random search | Bayesian optimisation
(each explored over one important parameter and one useless parameter)

Google blog post on Bayesian optimization; hyperparameter tuning is available on ML Engine
O’Reilly AI @martin_gorner
Hyperparam tuning on ML Engine

gcloud ml-engine jobs submit training plane09 \
  --job-dir gs://ml1-demo-martin/jobs/plane09/ \
  --config config-hptune.yaml \
  --project cloudml-demo-martin \
  --region us-central1 \
  --module-name trainer.train \
  --package-path trainer

# file: config-hptune.yaml
trainingInput:
  scaleTier: BASIC_GPU
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: accuracy   # "metric" defined in your Tensorflow model
    maxTrials: 50
    maxParallelTrials: 10
    params:                             # command line params of your Python module
      - parameterName: hp-lr2
        type: INTEGER
        minValue: 800
        maxValue: 30000
        scaleType: UNIT_LINEAR_SCALE
      - parameterName: hp-filter-sizes
        type: CATEGORICAL
        categoricalValues: ['S', 'M', 'L']
O’Reilly AI @martin_gorner
Making progress

#01  Dataset: 32K tiles, 15 epochs; Conv: 5x5 x 8, 4x4 x 16, 3x3 x 32; Dense: 100 neurons;
     learning rate 0.001; test acc. 94%
#03  Dataset: 32K tiles, 15 epochs; Conv: 5x5 x 8, 4x4 x 16, 3x3 x 32; Dense: 100 neurons;
     learning rate 0.01 -> 0.002; batch normalisation + dropout; test acc. 98%
#08  Dataset: 32K tiles, 15 epochs; Conv: 4x4 x 16, 3x3 x 32, 2x2 x 64; Dense: 43 neurons;
     learning rate 0.01 -> 0.0001; batch normalisation + dropout; test acc. 99%
#64  Dataset: 465K tiles, 21 epochs; Conv: 4x4 x 16, 3x3 x 32, 2x2 x 64; Dense: 80 neurons;
     learning rate 0.01 -> 0.0001; batch normalisation + dropout; test acc. 99.6%

Aerial imagery: U.S. Geological Survey
O’Reilly AI @martin_gorner
Estimator

# Free stuff !!! Tensorboard graphs, resume on fail, parallel data feeds,
# serving model export, distributed training.

estimator = tf.estimator.Estimator(model_fn=model_fn)

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=1000)

export_latest = tf.estimator.LatestExporter(serving_input_receiver_fn=serving_input_fn)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn, exporters=export_latest)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)


O’Reilly AI @martin_gorner
Estimator model function

def model_fn(feature_dic, labels, mode):

Yn = … # layers TRAIN, EVAL


# layers or INFER
# layers
# layers …
predictions = {"is_plane": ...} #free-form
loss = tf.losses.softmax_cross_entropy(…)
train_op = tf.training.create_train_op(loss, tf.train.AdamOptimizer)
evals = {'accuracy': tf.metrics.accuracy(...)} #free-form

return tf.estimator.EstimatorSpec(mode, predictions, loss, train_op, evals)

O’Reilly AI @martin_gorner
Dataset API

def train_input_fn(directory):
    filenames = gcsfile.get_matching_files(directory + "/*")
    dataset = tf.contrib.data.Dataset.from_tensor_slices((filenames,))

    def load(filename):
        bytes = tf.read_file(filename)
        # ... decode images and corresponding labels from the file ...
        return tf.contrib.data.Dataset.from_tensor_slices((images, labels))

    dataset = dataset.flat_map(load)
    dataset = dataset.shuffle(2000)
    dataset = dataset.batch(100)
    dataset = dataset.repeat()  # indefinitely

    return dataset

O’Reilly AI @martin_gorner
Estimator serving input function
def serving_input_fn(): # expected input is: list of rgb 20x20 images
input = {'images': tf.placeholder(tf.uint8, shape=[None, 20,20,3] )}
feature_dic = {'images': input['images']}
return tf.estimator.export.ServingInputReceiver(feature_dic, input)
pass-through

def serving_input_fn(): # expected input: list of 256x256 jpegs complex


inputs = {'jpeg_images': tf.placeholder(tf.string)}
jpegs = inputs['jpeg_images']

boxes256x256 = my_genboxes(256, 20, tile_step = 5, zoom_step = 1.3) # ~5000 tiles


box_indices = tf.constant(np.zeros(len(boxes256x256)))

def jpeg_to_bytes(jpeg):
pixels = tf.image.decode_jpeg(jpeg, channels=3)
pixels = tf.image.crop_and_resize(tf.expand_dims(pixels,0), boxes256x256, box_indices, [20, 20])
return tf.cast(pixels, dtype=tf.uint8)

images = tf.map_fn(jpeg_to_bytes, jpegs, dtype=tf.uint8)


feature_dic = {'image': images}
return tf.estimator.export.ServingInputReceiver(feature_dic, inputs)

O’Reilly AI @martin_gorner
ConvNet architectures and detection papers
Aerial imagery: U.S. Geological Survey
Inception

(Inception module: parallel branches of 1x1 convolutions, 1x1 followed by 3x3 convolutions,
and max pooling, concatenated into the output)

arXiv: 1409.4842, Szegedy & al 2014
arXiv: 1512.00567, Szegedy & al 2015
O’Reilly AI @martin_gorner
Filter factorisation
One 5x5 filter: W[5,5,1,1] = 25 weights

Two stacked 3x3 filters: W1[3,3,1,1] + W2[3,3,1,1] = 9+9 = 18 weights   (30% cheaper)
O’Reilly AI @martin_gorner
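A quick check of the weight counts quoted above (single input and output channel):

w_5x5 = 5 * 5                  # 25 weights
w_two_3x3 = 3 * 3 + 3 * 3      # 18 weights
print(1 - w_two_3x3 / w_5x5)   # 0.28 -> roughly "30% cheaper"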
1x1 convolution ?

10
W [1, 1, 10, 5]

5 cheap

O’Reilly AI @martin_gorner
Last layer: dense layer vs. global avg. pool

dense: flatten the 7x7 x 5 feature maps to 245 values, then W[245, 5] + softmax = 1225 weights
global average pooling: average each 7x7 channel, then softmax = 0 weights (cheaper. Yay cheapskate!)
O’Reilly AI @martin_gorner
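The weight counts behind the slide, plus a tiny global-average-pooling sketch in NumPy (an assumption, for illustration):

import numpy as np

dense_weights = (7 * 7 * 5) * 5           # flatten to 245, then W[245, 5] = 1225 weights
gap_weights = 0                           # global average pooling adds no weights
print(dense_weights, gap_weights)

feature_maps = np.random.rand(7, 7, 5)    # last convolutional layer, 5 channels
pooled = feature_maps.mean(axis=(0, 1))   # one value per channel, fed directly to the softmax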
Squeezenet

"fire module": a squeeze layer (1x1 conv) followed by an expand layer (1x1 conv and 3x3 conv in parallel)
The network stacks "fire" modules, with maxpool layers in between.

arXiv: 1602.07360, Forrest Iandola & al 2016
O’Reilly AI @martin_gorner
Darknet vs. squeezenet

Darknet-like: input 256x256 x 3, alternating 3x3 and 1x1 convolutions with maxpool downsampling
(256x256 -> 128x128 -> 64x64 -> 32x32 -> 16x16), YOLO head 16x16 x 10: 12 layers, 136K weights

Squeezenet-like: input 256x256 x 3, "fire"-style 1x1 squeeze + 1x1/3x3 expand blocks with maxpool
downsampling (256x256 -> ... -> 16x16), YOLO head 16x16 x 10: 12 layers, 60K weights
O’Reilly AI @martin_gorner
YOLO

4x4 grid over a 256x256 image, 2 boxes per grid cell.
Each box predicts: x, y position in [-1,1], w size in [0,1], C confidence in [0,1];
the loss compares these predictions to the ground-truth boxes.

Aerial imagery: U.S. Geological Survey
arXiv: 1506.02640, Redmon & al 2015
O’Reilly AI @martin_gorner
YOLO last layer

NxN grid; the last-layer values are split in 4 (or 8, 12, …) groups per grid cell:
average then tanh for x and y (position in [-1,1]),
average then sigmoid for w (size in [0,1]) and C (confidence in [0,1]).
O’Reilly AI @martin_gorner
Making progress
Intersection Over Union, eval dataset (higher = better):

12 layers, YOLO grid 4x4x1: not enough boxes
+ shuffle: shuffle your data!
8x8x1: more boxes
16x16x2 with swarm-optimized box assignment
+ random hue + rotations: data augmentation!
+ loss weights: useful in a composite loss
17 layers: best

YOLO grids tried: 4x4x1, 8x8x1, 16x16x1, 16x16x2

Aerial imagery: U.S. Geological Survey


O’Reilly AI @martin_gorner
Making progress
Dataset: 32K tiles, 15 epochs
3 conv layers, 1 dense
Test accuracy 94% #01
previous +
Batch norm, lr decay, dropout
Test accuracy 99% #08
previous +
Dataset: 465K tiles, 21 epochs
Test accuracy 99.6% #64

YOLO detector 8x8x1


Squeezenet CNN, 16 layers #171

YOLO detector 16x16x1


Squeezenet CNN, 16 layers #201
YOLO detector 16x16x2
Squeezenet CNN, 16 layers
Optimized for swarm detection #222

Yay
Aerial imagery: U.S. Geological Survey cherrypicker !
O’Reilly AI @martin_gorner
Now with Cloud TPUs
(Tensor Processing Units)

O’Reilly AI @martin_gorner
Training hardware options on ML Engine
Training time Cost

GPU - P100 5h50 $15

# config.yaml
trainingInput:
scaleTier: CUSTOM
masterType: standard_p100

# Tensorflow code
tf.Estimator

O’Reilly AI @martin_gorner
Training hardware options on ML Engine
Training time Cost

GPU - P100 5h50 $15

GPU - V100 4h30 $18

# config.yaml
trainingInput:
scaleTier: CUSTOM
masterType: standard_v100

# Tensorflow code
tf.Estimator

O’Reilly AI @martin_gorner
Training hardware options on ML Engine
Training time Cost

GPU - P100 5h50 $15

GPU - V100 4h30 $18

Cluster
1h35 $16
4 GPUs - P100

# config.yaml
trainingInput:
scaleTier: CUSTOM
masterType: standard_p100
parameterServerType: standard
workerType: standard_p100
# Tensorflow code
paramServerCount: 1
tf.Estimator
workerCount: 4

O’Reilly AI @martin_gorner
Training hardware options on ML Engine
Training time Cost

GPU - P100 5h50 $15

GPU - V100 4h30 $18

Cluster
1h35 $16
4 GPUs - P100

Cluster
1h15 $18
4 GPUs - v100

# config.yaml
trainingInput:
scaleTier: CUSTOM
masterType: standard_v100
parameterServerType: standard
workerType: standard_v100
# Tensorflow code
paramServerCount: 1
tf.Estimator
workerCount: 4

O’Reilly AI @martin_gorner
Training hardware options on ML Engine
Training time Cost

GPU - P100 5h50 $15

GPU - V100 4h30 $18

Cluster
1h35 $16
4 GPUs - P100

Cluster
1h15 $18
4 GPUs - v100

VM GPUx4 P100
MirroredStrategyALPH 2h15 $21
A

# config.yaml
trainingInput:
scaleTier: CUSTOM
masterType: complex_model_m_p100 # Tensorflow code
tf.Estimator +
MirroredStrategy
O’Reilly AI @martin_gorner
Training hardware options on ML Engine
Training time Cost
Cloud TPUs v2
GPU - P100 5h50 $15 (Tensor Processing Units)
now available
GPU - V100 4h30 $18

Cluster
1h35 $16
4 GPUs - P100

Cluster
1h15 $18
4 GPUs - v100

VM GPUx4 P100
MirroredStrategyALPH 2h15 $21
A

TPUv2 1h00 $7
4 chips - 8 cores
# config.yaml
# Tensorflow code
trainingInput:
tf.TPUEstimator
scaleTier: BASIC_TPU

O’Reilly AI @martin_gorner
Training hardware options on ML Engine
Training time Cost
Cloud TPUs V3ALPHA
GPU - P100 5h50 $15

GPU - V100 4h30 $18

Cluster
1h35 $16
4 GPUs - P100

Cluster
1h15 $18
4 GPUs - v100

VM GPUx4 P100
MirroredStrategyALPH 2h15 $21
A

TPUv2 1h00 $7

TPU v3 ALPHA available soon


# Tensorflow code
tf.TPUEstimator

O’Reilly AI @martin_gorner
Training hardware options on ML Engine

                                          Training time   Cost
GPU - P100                                5h50            $15
GPU - V100                                4h30            $18
Cluster, 4 GPUs - P100                    1h35            $16
Cluster, 4 GPUs - V100                    1h15            $18
VM GPUx4 P100, MirroredStrategy ALPHA     2h15            $21
TPUv2, 4 chips - 8 cores                  1h00            $7
TPU v3 ALPHA                              available soon
TPU pods ALPHA (up to 512 cores)          available soon; try them now (alpha)

# Tensorflow code: tf.TPUEstimator
O’Reilly AI @martin_gorner
Have fun !
Martin Görner, Google Developer relations, @martin_gorner

Videos, slides, code: github.com/GoogleCloudPlatform/tensorflow-without-a-phd

Cloud ML Engine: your TensorFlow models trained in Google's cloud
Cloud Auto ML Vision: just bring your data
Cloud TPU: ML supercomputing
Pre-trained models: Cloud Vision API, Cloud Speech API, Natural Language API,
Google Translate API, Video Intelligence API, Cloud Jobs API BETA

That's all folks...
O’Reilly AI @martin_gorner
Tensorflow and
deep learning
without a PhD

1
neurons
O’Reilly AI @martin_gorner
end

O’Reilly AI @martin_gorner
Depth-separable convolutions (+padding)

Phase 1: convolutions, one filter per input channel: weights W1[4, 4, 3]
         (filter size 4x4, nb of filters = nb of input channels = 3)
Phase 2: dot products across channels: weights W2[3, 5]
         (3 input channels, 5 output channels)
Postcard from...
Generative Adversarial Network (GAN)

noise

fake
/ real

O’Reilly AI @martin_gorner
GANs

O’Reilly AI @martin_gorner
GANs

arXiv: 1511.06434, Radford & al 2016


O’Reilly AI @martin_gorner
GANs

O’Reilly AI @martin_gorner
Nvidia
Research,
Karras &
al. 2017
150,000h
learning Tensorflow

“Tensorflow and
deep learning”

900K 1
views neurons “Without a PhD”
on YouTube
O’Reilly AI @martin_gorner
Learn to slap layers’n’shit together.
[...] that’s what everyone is
already doing in deep learning.

O’Reilly AI @martin_gorner
>TensorFlow, deep learning and \
modern RNN architectures
without a PhD_

EMBED ENCODE ATTEND PREDICT


Google translate 2017

Translate

Devoxx est une conférence indépendante Devoxx is an independent conference


organisée par des passionnés de organized by IT enthusiasts that takes
l'informatique qui a lieu à Kinepolis en place at Kinepolis in November and which
novembre et qui réunit plus de 3000 brings together more than 3000
développeurs à Anvers tous les ans depuis developers in Antwerp every year since
2001. Devoxx est un conférence sous la 2001. Devoxx is a conference under the
houlette du comité pour le code de qualité. leadership of the Quality Code
Committee.

@martin_gorner
RNN
N: internal size
Xt X: inputs
H

RNN cell tanh


H: internal
H softmax state
Yt Y: outputs

@martin_gorner
RNN training

X0 X1 X2 X3 X4 X5

H-1 H0 H1 H2 H3 H4 H5
cell cell cell cell cell cell

Y0 Y1 Y2 Y3 Y4 Y5

The same weights and biases shared across iterations


@martin_gorner
Deep RNN
L: number of layers

X0 X1 X2 X3 X4 X5

0 H5
cell cell cell cell cell cell

0 H’5
cell cell cell cell cell cell

Y0 Y1 Y2 Y3 Y3 Y5

@martin_gorner
RNN cell types

Simple RNN cell GRU cell LSTM cell


“Gated Recurrent Unit” “Long Short Term Memory”

Xt Xt Xt

Ht-1 Ht Ht-1 σ σ 1-
× × Ht Ht-1 σ σ tanh σ Ht
tanh
× tanh × ×
+
Ct-1 × +
tanh
Ct
Ht Ht Ht
Yt Yt Yt

tf.nn.rnn_cell.BasicRNNCell(SIZE) tf.nn.rnn_cell.GRUCell(SIZE) tf.nn.rnn_cell.BasicLSTMCell(SIZE)


@martin_gorner
Language model in Tensorflow
Characters,
one-hot encoded

A l p h a b
0 H5
character-
based

l p h a b e

@martin_gorner
Toxic comment detection
Fuck off, you idiot. Thanks for your help editing this. You’re such
an asshole. But thanks anyway. I'm going to shoot you! Oh
shoot. Well alright. God damn it! First of all who the fuck died
and made you the god. Gosh darn it! Get the hell out of here
you jerk. You're not that smart are you? Fuck off, you idiot.
Thanks for your help editing this. You’re such an asshole. But
thanks anyway. I'm going to shoot you! Oh shoot. Well alright.
God damn it! First of all who the fuck died and made you the
god. Gosh darn it! Get the hell out of here you jerk. You're not
that smart are you? Fuck off, you idiot. Thanks for your help
the god.this.
editing Gosh darn such
You’re it! Getanthe hell outBut
asshole. of thanks
here you jerk. I'm
anyway.
You're
going tonot that you!
shoot smartOhare you?Well
shoot. Fuckalright.
off, you idiot.
God damnThanks
it! First of
for your help editing this.
all who the fuck died and made you You’re such an asshole. But
thanks Well
shoot. anyway. I'm God
alright. goingdamnto shoot you!ofOh
it! First all who the fuck died
and made you the god. Gosh darn it! Get the hell out of here
you you?
are jerk. You're
Fuck off,notyou
thatidiot.
smart
Thanks for your help editing this.
You’re such
an asshole. But thanks anyway. I'm going to shoot you! Oh shoot.
Well
alright. God damn it! First of all who the fuck died and made you
@martin_gorner
Modern RNN architectures

EMBED ENCODE ATTEND PREDICT


Word embeddings

one-hot encoding ? ~ 100,000 words

"Absorb" => 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 … 0        (one-hot vector, vocab_size wide)
EMBED    => a short dense vector of floats                 (its embedding)

embeddings = tf.Variable(tf.random_uniform([vocab_size, embed_size]))
X = tf.nn.embedding_lookup(embeddings, sentences)

@martin_gorner
Word based language model EMBED

Words -> embeddings


Words that are close in embedding space form clusters:
FRANCE, ITALY, AUSTRIA, BELGIUM, GREECE · JESUS, GOD, SATAN, CHRIST, VISHNU · XBOX, PS3, SEGA, AMIGA, CAPCOM · MEGABIT, MB/S, BAUD, OCTETS, MHZ

Diagram: a word-based language model fed “Roses are red, violets are” predicts “blue”. Yay!

Options for the embedding matrix:
● Trained embeddings
● Pre-trained embeddings (Word2Vec, GloVe, ...)
● Trained embeddings from pre-trained initial values

@martin_gorner
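A minimal sketch of the third option, assuming the pre-trained vectors have been exported to a numpy file (glove_vectors.npy is a hypothetical path; sentences is the batch of word ids):

import numpy as np
glove = np.load('glove_vectors.npy')                                # shape [vocab_size, embed_size]
embeddings = tf.Variable(glove, dtype=tf.float32, trainable=True)   # trainable=False would freeze them
X = tf.nn.embedding_lookup(embeddings, sentences)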
Classification with an RNN ENCODE

Diagram: the words of “I like you very much but …” are fed one per time step; the RNN's final state goes through a softmax classifier that outputs Toxic / non-toxic.

Tensorflow sample: goo.gl/m41mNp

@martin_gorner
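A minimal sketch of this encoder, assuming word_vectors holds the embedded sentence batch and RNN_CELL_SIZE is defined as in the hyperparameters later in the deck:

cell = tf.contrib.rnn.GRUCell(RNN_CELL_SIZE)
Yr, final_state = tf.nn.dynamic_rnn(cell, word_vectors, dtype=tf.float32)
logits = tf.layers.dense(final_state, 2)    # two classes: toxic / non-toxic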
Bidirectional RNN ENCODE

Diagram: two RNNs read “I like you very much but …”, one left-to-right (final output Y1) and one right-to-left (final output Y2), both starting from state 0; the two outputs are concatenated (Y1|Y2) and fed to the softmax classifier that outputs Toxic / non-toxic.
@martin_gorner
Attention ATTEND

Diagram: each word of “I like you very much but …” produces a state Hi. A small network scores every state and the scores are normalized with a softmax, α = softmax(scores), giving one weight αi per word. The encoded sentence is the weighted sum

α1H1 + α2H2 + α3H3 + α4H4 + α5H5 + α6H6

which goes through the softmax classifier that outputs Toxic / non-toxic.
@martin_gorner
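The same thing in formulas (f is the small scoring network, n the number of words; the TensorFlow implementation is the my_attention function a few slides further):

\alpha_i = \frac{\exp f(H_i)}{\sum_{j=1}^{n} \exp f(H_j)}, \qquad \text{encoded} = \sum_{i=1}^{n} \alpha_i H_i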
Toxicity detector
Diagram: the full model, stage by stage.
EMBED: the words of “I like you very much but …” are turned into embedding vectors.
ENCODE: a bidirectional RNN (both directions starting from state 0, outputs concatenated) produces one state H1…H6 per word.
ATTEND: attention weights α1…α6 combine the states into a single sentence vector.
PREDICT: a dense + softmax layer outputs Toxic / Not.

@martin_gorner
Bitchin’ batchin’
Sentences in a batch are padded to a common length with a null token ∅; their true lengths are kept alongside:

China and the USA have agreed to a new round of talks        seq len 12
The quick brown fox jumps over the lazy dog . ∅ ∅             seq len 10
Boys will be boys . ∅ ∅ ∅ ∅ ∅ ∅ ∅                              seq len 5
Tom , get your coat . We are going out . ∅                     seq len 11
Math rules the world . Men rule math . ∅ ∅ ∅                   seq len 9

Diagram: the batch runs through the unrolled RNN from state 0 to Hn; the first sentence, for example, ends up classified as “geopolitics”.
Hout, H =
tf.nn.dynamic_rnn(cell, X, initial_state=Hin, sequence_length=slen)

@martin_gorner
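A minimal padding helper, assuming the sentences are already lists of word ids and id 0 is the ∅ token (pad_batch is an illustrative name, not part of the sample code):

import numpy as np

def pad_batch(sequences, pad_id=0):
    slen = np.array([len(s) for s in sequences])                   # true sequence lengths
    batch = np.full((len(sequences), slen.max()), pad_id, dtype=np.int32)
    for i, s in enumerate(sequences):
        batch[i, :len(s)] = s                                      # copy the words, the rest stays ∅
    return batch, slen

The slen vector is what goes into the sequence_length argument of tf.nn.dynamic_rnn above, so that each sentence's state stops being updated past its real end.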
Toxicity detector
word_vectors = tf.nn.embedding_lookup(embeddings, features['words'])                  # EMBED

rnn_fw_cell = tf.contrib.rnn.GRUCell(RNN_CELL_SIZE)                                   # ENCODE
rnn_bw_cell = tf.contrib.rnn.GRUCell(RNN_CELL_SIZE)
outputs, _ = tf.nn.bidirectional_dynamic_rnn(rnn_fw_cell, rnn_bw_cell, word_vectors)
outputs = tf.concat(outputs, axis=2)

encoded, alphas = my_attention(outputs, HIDDEN_LAYER_SIZE)                            # ATTEND

logits = tf.layers.dense(encoded, 2, activation=None)                                 # PREDICT
prediction = tf.argmax(logits, 1)
loss = tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=logits)

(Same model as on the previous slide: “I like you very much but …” → H1…H6 → α1…α6 → Toxic / Not.)
@martin_gorner
Tensorflow code: attention

def my_attention(inputs, hidden_layer_size):
    X = tf.reshape(inputs, [-1, 2*RNN_CELL_SIZE])
    Y = tf.layers.dense(X, hidden_layer_size, activation=tf.nn.relu)
    logits = tf.layers.dense(Y, 1, activation=None)
    logits = tf.reshape(logits, [-1, sequence_length, 1])
    alphas = tf.nn.softmax(logits, dim=1)
    encoded_sentence = tf.reduce_sum(inputs * alphas, axis=1)
    return encoded_sentence, alphas

Model hyperparameters:
MAX_DOCUMENT_LENGTH = 60
EMBEDDING_SIZE = 50
PRE_TRAINED = True
RNN_CELL_SIZE = 128
BATCH_SIZE = 128
HIDDEN_LAYER_SIZE = 32
@martin_gorner
Toxicity detector demo

www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/

github.com/conversationai/conversationai-models/blob/nthain-initial/attention-codelab

@martin_gorner
Sequence 2 sequence
Training time: text translation with an encoder and a decoder.

Diagram: the encoder reads “The cat ate the mouse ∅”; its final state initializes the decoder, which is fed “GO Le chat a mangé la souris” and must predict the same sentence shifted by one word: “Le chat a mangé la souris ∅”.

A full softmax over the output vocabulary is slow; fast: tf.nn.sampled_softmax_loss(…)

@martin_gorner
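A minimal sketch of the fast loss, under assumptions: decoder_states is the flattened [batch*time, cell_size] tensor of decoder outputs, target_ids the corresponding [batch*time, 1] word ids, and output_w / output_b are the output projection variables (all names illustrative):

output_w = tf.get_variable('out_w', [vocab_size, RNN_CELL_SIZE])   # note the [num_classes, dim] shape
output_b = tf.get_variable('out_b', [vocab_size])
loss = tf.nn.sampled_softmax_loss(weights=output_w, biases=output_b,
                                  labels=target_ids, inputs=decoder_states,
                                  num_sampled=64, num_classes=vocab_size)
# only 64 sampled words (plus the true targets) enter the softmax instead of the whole vocabulary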
Sequence 2 sequence PREDICT

Prediction time

Diagram: the decoder generates the translation one word at a time. Fed “GO”, it outputs a probability distribution over the vocabulary (Le, la, les, …); the chosen word is fed back in as the next input, and so on until “Le chat a mangé la souris ∅” is produced. One step looks like:

X = tf.nn.embedding_lookup(embeddings, inWord)
H, Hout = tf.nn.dynamic_rnn(cell, X, initial_state=Hin)
Y = tf.layers.dense(H, vocab_size)
P = tf.nn.softmax(Y)
outWord = tf.argmax(P)          # greedy decoding: Bad idea™
                                # => Use tf.contrib.seq2seq.BeamSearchDecoder

@martin_gorner
seq2seq + attention

Diagram: the encoder reads “The black cat ate the mouse ∅”, producing one state per word (H1…H6). The decoder is fed “GO chat noir a mangé la souris” and outputs “chat noir a mangé la souris ∅”. At every decoder step d1…d7, an attention layer computes a weighted combination of the encoder states,

α1H1 + α2H2 + α3H3 + α4H4 + α5H5 + α6H6

and feeds it to the decoder cell together with its regular input, so that each output word can “look back” at the relevant input words. The attention layer itself can be anything (typically a small dense network).

Tensorflow blog: github.com/tensorflow/nmt


@martin_gorner
Translation with seq2seq
x = tf.nn.embedding_lookup(embeddings, sentences)                                     # EMBED

encoder_cell = tf.rnn.GRUCell(encoding_dimension)                                     # ENCODE
wrapped_cell = tf.rnn.DropoutWrapper(encoder_cell, input_keep_prob=p)
encoded_sentences, encoder_state = tf.nn.dynamic_rnn(wrapped_cell, x)

decoder_cell = tf.rnn.GRUCell(encoding_dimension)                                     # PREDICT
decoder = tf.seq2seq.BeamSearchDecoder(decoder_cell, embeddings,
                                       sos_tokens, eos_token, encoder_state, beam_width)
outputs, final_state, _ = tf.seq2seq.dynamic_decode(decoder, maximum_iterations=max_length)

@martin_gorner
Translation with attention
ATTEND

inattentive_decoder_cell = tf.rnn.GRUCell(encoding_dimension)

attention_mechanism = tf.seq2seq.LuongAttention(encoding_dimension, encoded_sentences)

decoder_cell = tf.seq2seq.AttentionWrapper(inattentive_decoder_cell,
attention_mechanism)

@martin_gorner
Postcard from...
Demo

arXiv:1506.05869v1, Oriol Vinyals, Quoc V. Le 2015


@martin_gorner
Text comprehension

Q: XXXXX dedicated their fall fashion show to moms


A: Dolce & Gabana (correct!)
arXiv:1506.03340v3, Hermann & al. 2015
@martin_gorner
Text comprehension

Q: U.S. Navy identifies deceased sailor as XXXX, who leaves behind a wife
A: Jason Kortz (correct!)
arXiv:1506.03340v3, Hermann & al. 2015
@martin_gorner
Have fun !
Martin Görner, Google Developer relations, @martin_gorner
Nithum Thain, Jigsaw Research manager, @nithum
Neeraj Kashyap, Google Developer relations, nkash@google.com

Videos, slides, code:
github.com/GoogleCloudPlatform/tensorflow-without-a-phd

Cloud ML Engine: your TensorFlow models trained in Google’s cloud.
Cloud Auto ML Vision (ALPHA): just bring your data
Cloud TPU (BETA): ML supercomputing
Pre-trained models: Cloud Vision API, Cloud Speech API, Natural Language API, Google Translate API, Video Intelligence API, Cloud Jobs API (PRIVATE BETA)

That’s all folks...
@martin_gorner
Tensorflow and
deep learning
without a PhD

1
neurons
>TensorFlow and \
deep reinforcement learning
without a PhD_
deep deep
Deep Reinforcement
Science !
Code ...

deep
Code...

#Tensorflow @martin_gorner
Neural network 101
Diagram: a dense network for 20x20x3-pixel images: 1200 input values, then hidden layers of 200 and 20 neurons.

@martin_gorner
Activation functions
Diagram: each neuron computes a weighted sum of its inputs plus a bias, then an activation function: relu on the hidden layers, softmax on the classification head.

@martin_gorner
Success ?
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 1 0 0 0

actual probabilities, “one-hot” encoded

Cross entropy:
this is a “6”
computed probabilities

.01 .03 .00 .04 .03 .05 0.8 .02 .01 .01
0 1 2 3 4 5 6 7 8 9

@martin_gorner
Cookbook

Relu, softmax
Cross-entropy
Tensorflow 101
Input: 20x20x3 pixels (1200 values) → 200 → 20 → 2 outputs (plane: [1,0], not plane: [0,1]).

Y1 = tf.layers.dense(X, 200, activation=tf.nn.relu)
Y2 = tf.layers.dense(Y1, 20, activation=tf.nn.relu)
Ylogits = tf.layers.dense(Y2, 2)

loss = tf.losses.softmax_cross_entropy(tf.one_hot(is_plane, 2), Ylogits)   # tf.one_hot(is_plane, 2): the correct answer
train_op = tf.train.AdamOptimizer(0.001).minimize(loss)                    # 0.001: learning rate
@martin_gorner
Pong ?

Diagram: the “policy network”. The input is the difference between two consecutive frames (Δ frames); two dense layers (W1, W2) and a softmax output the probabilities of the three moves. If we knew the “correct move” (e.g. 1 0 0) we could train with a plain cross-entropy.
Adapted from A. Karpathy’s “Pong from Pixels” post
@martin_gorner
Pong ?

Diagram: there is no “correct move”; instead, a move (UP, STILL or DOWN) is sampled from the probabilities output by the “policy network”, and we keep playing.

@martin_gorner
Policy gradients

Diagram: at the end of a game the reward R is +1 for a WIN and -1 for a LOSE. For each move of that game, the loss is the cross-entropy between the move that was actually sampled (e.g. 1 0 0) and the predicted probabilities, multiplied by R: moves from winning games are reinforced, moves from losing games are discouraged.

@martin_gorner
Policy gradient refinements

Two refinements to the reward R applied to each move #i (whichever move was chosen):
① discounted rewards: the final +1 (WIN) or -1 (LOSE) is propagated backwards through the game with a discount factor, so moves close to the outcome weigh more;
② rewards normalized across a “batch” of moves.

@martin_gorner
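A minimal numpy sketch of the two refinements (the function names match the training loop later in the deck; the reset on non-zero rewards assumes Pong-style scoring where each point ends a rally):

import numpy as np

def discount_rewards(rewards, gamma=0.99):
    discounted = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for i in reversed(range(len(rewards))):
        if rewards[i] != 0:
            running = 0.0           # a point was scored: restart the backward propagation
        running = running * gamma + rewards[i]
        discounted[i] = running
    return discounted

def normalize_rewards(rewards):
    rewards = np.asarray(rewards, dtype=np.float32)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # zero mean, unit variance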
Training data
move #   [UP, STILL, DOWN]   Policy network probabilities   Discounted rewards
0        [1, 0, 0]           [PUP, PSTILL, PDOWN]           +1*d7
1        [0, 1, 0]           [PUP, PSTILL, PDOWN]           +1*d6
2        [0, 1, 0]           [PUP, PSTILL, PDOWN]           +1*d5
3        [0, 0, 1]           [PUP, PSTILL, PDOWN]           +1*d4
4        [0, 0, 1]           [PUP, PSTILL, PDOWN]           +1*d3
5        [0, 0, 1]           [PUP, PSTILL, PDOWN]           +1*d2
6        [0, 1, 0]           [PUP, PSTILL, PDOWN]           +1*d
7        [1, 0, 0]           [PUP, PSTILL, PDOWN]           +1      WIN! +1
8        [0, 0, 1]           [PUP, PSTILL, PDOWN]           -1*d7
9        [0, 0, 1]           [PUP, PSTILL, PDOWN]           -1*d6
10       [0, 0, 1]           [PUP, PSTILL, PDOWN]           -1*d5
11       [0, 0, 1]           [PUP, PSTILL, PDOWN]           -1*d4
12       [0, 1, 0]           [PUP, PSTILL, PDOWN]           -1*d3
13       [1, 0, 0]           [PUP, PSTILL, PDOWN]           -1*d2
14       [0, 0, 1]           [PUP, PSTILL, PDOWN]           -1*d
15       [0, 1, 0]           [PUP, PSTILL, PDOWN]           -1      LOSE -1

Hyperparams:
Batch size: ~20k moves
Discount factor d=0.95~0.99
Optimizer=tf.train.RMSPropOptimizer
- Learning rate lr=0.0001~0.005
- Decay=0.95~0.99
1 hidden layer with 200 units
Beta (“laziness”) = 0.01~0.02
@martin_gorner
Policy network

observations = tf.placeholder(tf.float32, shape=[None, 80*80])   # pixels of the Δ frame, flattened
actions = tf.placeholder(tf.int32, shape=[None])                  # 0, 1, 2 for UP, STILL, DOWN
rewards = tf.placeholder(tf.float32, shape=[None])                # +1, -1, with discounts

# model
Y = tf.layers.dense(observations, 200, activation=tf.nn.relu)
Ylogits = tf.layers.dense(Y, 3)

# sample an action from predicted probabilities
sample_op = tf.multinomial(logits=Ylogits, num_samples=1)

# loss (gradient descent on reward-weighted cross-entropy)
cross_entropies = tf.losses.softmax_cross_entropy(
    onehot_labels=tf.one_hot(actions, 3), logits=Ylogits,
    reduction=tf.losses.Reduction.NONE)            # keep one cross-entropy value per move
loss = tf.reduce_sum(rewards * cross_entropies)

# training operation
optimizer = tf.train.RMSPropOptimizer(learning_rate=0.001, decay=0.99)
train_op = optimizer.minimize(loss)
Playing a game
with tf.Session() as sess:
    ...  # reset everything
    while not done:  # play a game in 21 points
        current_pix = read_pixels(game_state)       # get pixels
        observation = current_pix - previous_pix    # Δ frame
        previous_pix = current_pix
        # decide what move to play: UP, STILL, DOWN (through NN model)
        action = sess.run(sample_op, feed_dict={observations: [observation]})
        # play it (through the OpenAI gym pong simulator)
        game_state, reward, done, info = pong_sim.step(action)
        # collect results
        observations.append(observation)
        actions.append(action)
        rewards.append(reward)
Training loop
with tf.Session() as sess:
    while len(observations) < BATCH_SIZE:
        ...  # play game in 21 points, many moves, collect ...
        ...  # observations, actions, rewards (from previous slide) ...

    # Process the rewards after each game
    processed_rewards = discount_rewards(rewards, args.gamma)
    processed_rewards = normalize_rewards(processed_rewards)

    sess.run(train_op, feed_dict={observations: observations,
                                  actions: actions,
                                  rewards: processed_rewards})
Pong!
Learned weights

@martin_gorner
Postcard from...
Neural architecture search

Diagram: an RNN “controller” samples a candidate architecture from its policy (e.g. Layer conv 4x4x16 relu, Layer conv 2x2x32 relu, Layer conv 2x2x64 relu, Layer maxpool 2x2, Layer conv 1x1x16 relu, Layer dense 400 relu). The sampled network is trained, its accuracy becomes the reward R, and a policy gradient updates the controller.

arXiv:1611.01578v2, Barret Zoph, Quoc V. Le, May 2017


@martin_gorner
Tools: Cloud ML Engine

AI Platform TensorBoard

Auto ML
Just bring your data

Cloud TPU
ML supercomputing

@martin_gorner
Have fun !
Martin Görner, Google Developer relations, @martin_gorner
Yu-Han Liu, Google Developer relations, yuhanliu@google.com
Neeraj Kashyap, Google Developer relations, nkash@google.com

Videos, slides, code:
github.com/GoogleCloudPlatform/tensorflow-without-a-phd

AI Platform: your TensorFlow models trained in Google’s cloud.
Auto ML: just bring your data
Cloud TPU: ML supercomputing
Pre-trained models: Cloud Vision API, Cloud Speech API, Natural Language API, Google Translate API, Video Intelligence API, Cloud Jobs API (PRIVATE BETA)

That’s all folks...
@martin_gorner
Tensorflow and
deep learning
without a PhD

1
neurons
