
ADVANCED DATA ANALYTICS

Lecture 9

Simon Scheidegger – simon.scheidegger@unil.ch

University of Lausanne, Department of Economics


ROAD-MAP
 This lecture:

Deep Learning cont’d

More advanced topics: Recurrent neural networks and beyond.

 Throughout lectures – hands-on:

Basics on TensorFlow & Keras

Examples related to the day’s topics in TensorFlow
KERAS & TENSORFLOW BASICS
 tensorflow.org

 Keras API:
https://www.tensorflow.org/guide/keras/sequential_model

 Fun data sets to play with: https://www.kaggle.com/datasets

 Some “clean” data to play with: https://archive.ics.uci.edu/ml/index.php

 Help for debugging – TensorBoard: https://www.tensorflow.org/tensorboard


A GENTLE FIRST EXAMPLE
 Let’s look at the notebook: demo/03_Gentle_DNN.ipynb.

 This notebook contains all the basic functionality from a theoretical point of
view.

 Two simple examples: one regression and one classification.
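
As a minimal sketch of what the regression part of such a notebook looks like (not the notebook's exact code; the target function, architecture, and hyperparameters below are illustrative assumptions):

import numpy as np
import tensorflow as tf
from tensorflow import keras

# Toy regression data (illustrative, not the notebook's data):
# learn f(x) = sin(2*pi*x) on [0, 1] from noisy samples.
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=(512, 1))
y_train = np.sin(2 * np.pi * x_train) + 0.05 * rng.standard_normal((512, 1))

# A small fully-connected network built with the Keras Sequential API.
model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(1,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1),  # linear output for regression
])

model.compile(optimizer="adam", loss="mse")
model.fit(x_train, y_train, epochs=100, batch_size=32, verbose=0)

# Evaluate on fresh test points.
x_test = rng.uniform(0.0, 1.0, size=(100, 1))
print(model.evaluate(x_test, np.sin(2 * np.pi * x_test), verbose=0))

The classification case only differs in the output layer (e.g. a softmax) and the loss (e.g. cross-entropy).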


ACTION REQUIRED
Look at the test functions below*. Pick three of those test functions (from Genz 1987).

Approximate a 2-dimensional function stated below with neural nets based on 10, 50, 100, and 500 points randomly sampled
from [0, 1]². Compute the average and maximum error.

The errors should be computed by generating 1,000 uniformly distributed random test points from
within the computational domain.

Plot the maximum and average error as a function of the number of sample points.

Repeat the same for 5-dimensional and 10-dimensional functions. Is there anything particular
you observe?

*Choose the parameters w and c in meaningful ways (a possible starting point is sketched below).
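
A possible starting point for this exercise: the Genz "oscillatory" test function is written out explicitly, while the choice of w and c, the architecture, and the training settings are illustrative assumptions.

import numpy as np
import tensorflow as tf
from tensorflow import keras

def genz_oscillatory(x, w, c):
    # Genz (1987) "oscillatory" test function: cos(2*pi*w_1 + sum_i c_i * x_i).
    return np.cos(2 * np.pi * w[0] + x @ c)

d = 2                                               # repeat with d = 5 and d = 10
rng = np.random.default_rng(42)
w, c = rng.uniform(size=d), rng.uniform(size=d)     # illustrative choice of w and c

x_test = rng.uniform(size=(1000, d))                # 1,000 uniform random test points
y_test = genz_oscillatory(x_test, w, c)

for n in (10, 50, 100, 500):
    x_train = rng.uniform(size=(n, d))
    y_train = genz_oscillatory(x_train, w, c)

    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(d,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(x_train, y_train, epochs=500, batch_size=min(n, 32), verbose=0)

    err = np.abs(model.predict(x_test, verbose=0).ravel() - y_test)
    print(f"n={n}: average error {err.mean():.4f}, maximum error {err.max():.4f}")

Collect the average and maximum errors over n and plot them (e.g. with matplotlib) to see the convergence behaviour, then repeat for the 5- and 10-dimensional case.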


ACTION REQUIRED (II)
 Play with the architecture (see the sketch after this list).

Number of hidden layers.

Activation functions.

Choice of the stochastic gradient descent algorithm.

Monitor the performance with respect to the architecture.
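
A minimal sketch of such a sweep, assuming the training data (x_train, y_train) from the gentle example above; the grids of depths, activations, and optimizers are arbitrary illustrative choices:

from tensorflow import keras

def build_model(n_hidden, activation):
    layers = [keras.layers.Input(shape=(1,))]
    layers += [keras.layers.Dense(32, activation=activation) for _ in range(n_hidden)]
    layers += [keras.layers.Dense(1)]
    return keras.Sequential(layers)

for n_hidden in (1, 2, 4):
    for activation in ("relu", "tanh", "swish"):
        for opt in (keras.optimizers.SGD(0.01), keras.optimizers.RMSprop(), keras.optimizers.Adam()):
            model = build_model(n_hidden, activation)
            model.compile(optimizer=opt, loss="mse")
            hist = model.fit(x_train, y_train, validation_split=0.2,
                             epochs=50, verbose=0)
            # Monitor performance with respect to the architecture.
            print(n_hidden, activation, opt.__class__.__name__,
                  hist.history["val_loss"][-1])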
A SEMI-COMPREHENSIVE TF TOUR
 demo/04_TF_tour.ipynb

 5 examples (incl. Kaggle data set from Lending Club)

 TensorBoard

On Nuvolos: the in-cell TensorBoard display won't work
in JupyterLab.

Once you've run all the cells, go to the launcher, click TensorBoard, and
you should be good to go.

Right after this, a new TensorBoard tab should show up that contains the
expected output.
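
For reference, logging a Keras training run to TensorBoard usually amounts to adding a callback along these lines (the log directory name is an arbitrary choice, and model, x_train, y_train are assumed to come from one of the notebook's examples):

import datetime
from tensorflow import keras

# Log training metrics so they can be inspected in TensorBoard.
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tb_callback = keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

model.fit(x_train, y_train, epochs=20, validation_split=0.2,
          callbacks=[tb_callback])

# Then launch TensorBoard on the log directory (or use the Launcher on Nuvolos):
#   tensorboard --logdir logs/fit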
ACTION REQUIRED
 Focus on the example with the Kaggle data set.

 Play with the architecture.



Number of hidden layers.

Activation functions.

Choice of the stochastic gradient descent algorithm.

Monitor the performance with respect to the architecture.

 Try to use TensorBoard.


SOME PERSONAL TAKE-AWAY
 Swish activation is the “best” choice if you need a smooth and deep network.

 Use layer widths that are a multiple of 2 (training speed).

 Smaller learning rate with deeper networks.

 Batch normalization for speed.

 Glorot initialization.

 Custom layers for custom models (a model combining these take-aways is sketched below)


(https://www.tensorflow.org/tutorials/customization/custom_layers).
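
A small model combining these take-aways might look as follows (a sketch; the input dimension and all layer sizes are illustrative, and Glorot uniform is in fact the Keras default initializer):

from tensorflow import keras

# Swish activations, power-of-two layer widths, batch normalization,
# Glorot initialization, and a smaller learning rate.
model = keras.Sequential([
    keras.layers.Input(shape=(16,)),
    keras.layers.Dense(128, activation="swish",
                       kernel_initializer="glorot_uniform"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(64, activation="swish",
                       kernel_initializer="glorot_uniform"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(1),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),  # smaller LR for a deeper net
              loss="mse")

For custom layers, the linked tutorial shows how to subclass keras.layers.Layer.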
A BREAK

[Figure: mountains]
BEYOND VANILLA DNN
 Not all applications can be handled with plain-vanilla deep neural nets.

 There exist situations where more intricate architectures are needed.

 Examples:

Time-series comparisons, such as estimating how closely related two documents or two stock
tickers are.

Sequence-to-sequence learning, such as decoding an English sentence into French.

Sentiment analysis, such as classifying the sentiment of tweets or movie reviews as positive or
negative.

Time-series forecasting, such as predicting the future weather at a certain location, given recent
weather data.
EXAMPLE

Given a picture of a ball, can we predict where it will go?


A SEQUENCE MODELING PROBLEM:
PREDICT THE NEXT WORD

“Today, we are having a class on deep ___”


A SEQUENCE MODELING PROBLEM:
PREDICT THE NEXT WORD

“Today, we are having a class on deep learning”

given these words, predict the next word


A SEQUENCE MODELING PROBLEM:
AUDIO
THE PERCEPTRON REVISITED

[Figure: the perceptron – inputs, weights, weighted sum, non-linearity (e.g. the sigmoid activation function), output]

→ The bias term allows you to shift your activation function to the left or the right.
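
A minimal sketch of a single perceptron as described above (the inputs, weights, and bias are arbitrary illustrative numbers):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b):
    # Weighted sum of the inputs plus bias, passed through the non-linearity.
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs (illustrative)
w = np.array([0.1, 0.4, -0.3])   # weights
b = 0.2                          # bias: shifts the activation left or right
print(perceptron(x, w, b))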
FEED-FORWARD NETS REVISITED

RECURRENT NEURAL NETS
 To model sequences, we need to

Handle variable-length sequences.

Track long-term dependencies.

Maintain information about the order.

Share parameters across the sequence.

 Recurrent Neural Networks (RNNs) are an approach to sequence modeling
problems (Rumelhart et al. (1986)).
 More specifically, given an observation sequence x = {x1, x2, ..., xT} and its
corresponding label y = {y1, y2,..., yT}, we want to learn a map f : x → y.

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RNN
 RNNs are a family of neural networks for processing sequential data.
 An RNN is a neural network that is specialized for
processing a sequence of values x(1), ..., x(τ).
 Unfold the computational graph of a dynamical system:

Fig. from Goodfellow et al. (2016)


PREVIEW ON RNN
 Recurrent layers use their own output as input.

In the figure: A is the recurrent cell.

 They introduce history, or time dependence, into NNs.

 The only way to efficiently train them is to unroll them.

ht = fW(ht-1, xt):  new cell state = function fW (parameterized by W) of the old state ht-1 and the input vector xt at time t.

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
FEED-FORWARD NETS REVISITED
HANDLING INDIVIDUAL TIME STEPS

[Figure: a feed-forward net maps an input vector to an output vector at a single time step]
NEURONS WITH RECURRENCE

[Figure: a recurrent neuron – the output at time t depends on the input at time t and on past memory (the previous state)]



RECURRENT NEURAL NETWORKS
Apply a recurrence relation at every time step to process a sequence:

ht = fW(ht-1, xt)

(cell state = function fW with weights W, applied to the input xt and the old state ht-1)

Note: the same function and set of parameters are used at every time step.

RNNs have a cell state that is updated at each time step as the sequence is processed.
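
The recurrence relation above can be written out in a few lines. This is a sketch of a "vanilla" RNN step with a tanh non-linearity; the weights are random (untrained) and all dimensions are illustrative:

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16

# Weights, shared across all time steps (random here, learned in practice).
W_xh = rng.standard_normal((hidden_dim, input_dim)) * 0.1
W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
b_h  = np.zeros(hidden_dim)

def rnn_step(h_prev, x_t):
    # h_t = f_W(h_{t-1}, x_t): here f_W is a tanh of a linear map (vanilla RNN).
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

x_sequence = rng.standard_normal((10, input_dim))   # a toy sequence of length 10
h = np.zeros(hidden_dim)                             # initial cell state
for x_t in x_sequence:
    h = rnn_step(h, x_t)                             # same function and weights at every step
print(h.shape)   # (16,) – the final hidden state summarizes the sequence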
RNN INTUITION

[Figure: the RNN unrolled step by step – at each time step the cell receives an input vector and the previous state, and produces an output vector]
RNN STATE UPDATE AND OUTPUT
[Figure: RNN cell – the input vector xt and the previous hidden state ht-1 are combined to update the hidden state ht, which produces the output vector ŷt]
RNN – IN ONE SLIDE
 An RNN models a dynamic system, where the hidden (cell) state ht depends not only on the current
observation xt, but also on the previous hidden state ht-1.

 More specifically, we can represent ht as ht = f (ht-1, xt ) (Eq. 1)


where f is a nonlinear (time-invariant) mapping.

 Thus, ht contains information about the
whole sequence, which can be inferred from
the recursive definition in Eq. 1.

 In other words, an RNN can use the hidden variables as a memory
to capture long-term information from a sequence.

 Prediction at time step t: zt.
Fig. from G. Chen (2016)
RNN: COMPUTATION GRAPH ACROSS TIME

RNN → represented as a computational graph unrolled across time.
BACK-PROPAGATION THROUGH TIME

Re-use the same weight matrices at every time step!

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RNN FROM SCRATCH & TENSORFLOW

[Figure: the RNN recurrent cell – it maps an input vector to an output vector while feeding its state back into itself]
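
On the TensorFlow side, the Keras counterpart of the from-scratch recurrence is the SimpleRNN layer. A minimal sketch (not any notebook's exact code; the sequence length, feature dimension, state size, and the many-to-one head are illustrative assumptions):

import tensorflow as tf
from tensorflow import keras

# SimpleRNN keeps a hidden state and re-uses the same weights at every step.
# Input shape: (batch, time steps, features) – here 10 steps of 8 features.
model = keras.Sequential([
    keras.layers.Input(shape=(10, 8)),
    keras.layers.SimpleRNN(16),      # returns the final hidden state h_T
    keras.layers.Dense(1),           # e.g. a many-to-one prediction
])
model.compile(optimizer="adam", loss="mse")
model.summary()

# For many-to-many tasks, return the full sequence of hidden states instead:
# keras.layers.SimpleRNN(16, return_sequences=True)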
RNN INTUITION
One-to-one: ordinary DNN, e.g., classification.
Many-to-one: e.g., sentiment classification.
One-to-many: e.g., text generation.
Many-to-many: e.g., translation & forecasting.
SEQUENCE MODELING – DESIGN CRITERIA

Recall: to model sequences, we need to:

Handle variable-length sequences.

Track long-term dependencies.

Maintain information about order.

Share parameters across the sequence.

→ Recurrent Neural Networks (RNNs) meet these
sequence modeling design criteria.
HANDLE VARIABLE SEQUENCE LENGTHS

The food was great.

vs.

We visited a Pizzeria for lunch.

vs.

We were hungry because we went for sport before eating.
BACK-PROPAGATION THROUGH TIME

Computing the gradient w.r.t. h0 involves many factors
of Whh + repeated gradient computation!
BACKPROPAGATION THROUGH TIME

Many values > 1: exploding gradients.

Many values < 1: vanishing gradients.
RECALL: RNN HARD TO TRAIN
 Recurrent blocks suffer from two problems:

Long-term dependencies are not handled well.

Difficult to connect two distant parts of the input.

The magnitude of the signal can get amplified at each recurrent
connection.

At every time step, the gradient can either vanish or
explode.

This makes them very hard to train.

“I grew up in England… and I speak fluent ___ “

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
LONG SHORT-TERM MEMORY (LSTM)
http://www.bioinf.jku.at/publications/older/2604.pdf

 Hochreiter & Schmidhuber (1997)


 LSTM layers are improved versions of the recurrent layers.

They rely on a gated cell to track information throughout many time steps.

They can learn long-term dependencies.

They can forget.

 They have an internal state and a structure which is composed of
four actual layers.

Layers labeled with σ are
gates which can block or let
information flow.

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
LONG SHORT-TERM MEMORY (LSTM)
 The core of LSTM is a memory unit (or cell) ct
which encodes the information of the inputs
that have been observed up to that step.
 The memory cell ct has the same inputs
(ht−1 and xt) and outputs ht as a normal
recurrent network, but has more gating units
which control the information flow.
 The input gate and output gate respectively
control the information input to the memory
unit and the information output from the unit.
 More specifically, the output ht of the LSTM cell can be shut off via the output gate.

Fig. from G. Chen (2016)


LSTM FORGET GATE
http://www.bioinf.jku.at/publications/older/2604.pdf

 LSTMs follow two paths



They update their internal state.

They give an output based on
the internal state and the input.

 A gate layer σ decides whether we should forget an old part of the internal
state – something which has to be replaced by new information.
1. Forget
2. Store
3. Update
4. Output
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
LSTM NEW STATE
http://www.bioinf.jku.at/publications/older/2604.pdf
1. Forget
2. Store
3. Update
4. Output
 Once the layer has decided what to
forget, it computes
 What has to replace it: the input gate value it, based
on the input and the old state.

What is used to replace it:
the candidate values C̃t.
 The new state Ct can then be computed
based on this new information.

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
LSTM OUTPUT
http://www.bioinf.jku.at/publications/older/2604.pdf
1. Forget
2. Store
3. Update
4. Output
 Based on the new state and the input,
the layer can produce a result.

This is the output.

The same value is also passed to the next iteration.

 Why is this so important?



Many translation algorithms and
voice interpreters are based on
small variations of this layer.

 Action required: demo/05_RNN_intro.ipynb


(see also https://www.tensorflow.org/guide/keras/rnn)

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
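
Putting the four steps together, the standard LSTM updates (as in Hochreiter & Schmidhuber (1997) and the colah post linked above; σ is the sigmoid gate, * the element-wise product, [ht-1, xt] the concatenation of old state and input) read:

ft = σ(Wf · [ht-1, xt] + bf)        (1. forget gate)
it = σ(Wi · [ht-1, xt] + bi)        (2. store: input gate)
C̃t = tanh(WC · [ht-1, xt] + bC)     (2. store: candidate values)
Ct = ft * Ct-1 + it * C̃t            (3. update of the cell state)
ot = σ(Wo · [ht-1, xt] + bo)        (4. output gate)
ht = ot * tanh(Ct)                  (4. output / new hidden state)

In Keras, this gating machinery is wrapped in a single layer. A minimal many-to-one sketch (not the code of demo/05_RNN_intro.ipynb; the sequence length, feature dimension, and state size are illustrative assumptions):

from tensorflow import keras

# Read a sequence of 20 steps with 8 features each and produce one prediction.
# The forget/store/update/output gating happens inside keras.layers.LSTM.
model = keras.Sequential([
    keras.layers.Input(shape=(20, 8)),
    keras.layers.LSTM(32),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.summary()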
ACTION REQUIRED
 There is a weather data set from the Max Planck Institute for Biogeochemistry:
https://www.bgc-jena.mpg.de/wetter/.

 Open the notebook demo/05b_Weather_data.ipynb.

 Given this time series (temperature as a function of time), try to make
predictions over various time intervals into the future.
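
A generic sketch of how such a forecasting setup can look (not the notebook's code; the window length, forecast horizon, architecture, and the placeholder series below are illustrative assumptions):

import numpy as np
from tensorflow import keras

def make_windows(series, window=72, horizon=12):
    # Predict the value `horizon` steps ahead from the previous `window` steps.
    X, y = [], []
    for i in range(len(series) - window - horizon):
        X.append(series[i:i + window])
        y.append(series[i + window + horizon - 1])
    return np.array(X)[..., None], np.array(y)   # add a feature axis for the LSTM

# Placeholder series standing in for the Jena temperature data.
temperature = np.sin(np.linspace(0, 60, 5000)) + 0.1 * np.random.randn(5000)
X, y = make_windows(temperature)

model = keras.Sequential([
    keras.layers.Input(shape=(72, 1)),
    keras.layers.LSTM(32),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mae")
model.fit(X, y, epochs=5, validation_split=0.2, verbose=0)

Varying the horizon lets you compare short-range against long-range predictions.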
CONVOLUTIONAL NEURAL NETS
 Possibly the most successful type of network.
 They use sequences of convolutional layers.
 Can be interleaved with pooling operations or
fully-connected layers.
 Train faster than MLPs.
 Can be used for 2D, 3D or higher-dimensional data
(though 2D is the most common).
 Used for image recognition, object detection,
sound analysis, etc.
 There exist more intricate architectures.
 Yann LeCun (1998)
http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf
https://cs231n.github.io/convolutional-networks/
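
A minimal convolutional network in Keras, in the spirit of LeNet (LeCun, 1998); the input shape (28×28 grayscale images), the number of classes, and all layer sizes are illustrative assumptions:

from tensorflow import keras

# Convolution layers interleaved with pooling, followed by fully-connected layers.
model = keras.Sequential([
    keras.layers.Input(shape=(28, 28, 1)),
    keras.layers.Conv2D(16, kernel_size=3, activation="relu"),
    keras.layers.MaxPooling2D(pool_size=2),
    keras.layers.Conv2D(32, kernel_size=3, activation="relu"),
    keras.layers.MaxPooling2D(pool_size=2),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()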
CONVOLUTIONAL NEURAL NETS
GENERATIVE ADVERSARIAL NETS
 GANs were introduced by Goodfellow et al. (2014)
 https://arxiv.org/abs/1406.2661.
 The idea is to train a network to generate samples which are
indistinguishable from real ones (from the training set).

The input is a random noise sample (latent space).
 Another network is trained at the same time to distinguish
between real and fake samples.

See https://arxiv.org/abs/1701.00160 for a tutorial. Right fig. from Goodfellow (2018).
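
A toy sketch of the two-network idea on 1-D data; everything here (the "real" data distribution, network sizes, and training loop) is an illustrative assumption, not the setup of the paper:

import numpy as np
import tensorflow as tf
from tensorflow import keras

batch_size, latent_dim = 64, 8

# Generator: maps random noise (latent space) to a fake sample.
generator = keras.Sequential([
    keras.layers.Input(shape=(latent_dim,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1),
])

# Discriminator: classifies samples as real (1) or fake (0).
discriminator = keras.Sequential([
    keras.layers.Input(shape=(1,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

bce = keras.losses.BinaryCrossentropy()
g_opt = keras.optimizers.Adam(1e-3)
d_opt = keras.optimizers.Adam(1e-3)

@tf.function
def train_step(real_batch):
    noise = tf.random.normal((batch_size, latent_dim))
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_batch = generator(noise, training=True)
        real_pred = discriminator(real_batch, training=True)
        fake_pred = discriminator(fake_batch, training=True)
        # Discriminator: tell real from fake; generator: fool the discriminator.
        d_loss = bce(tf.ones_like(real_pred), real_pred) + bce(tf.zeros_like(fake_pred), fake_pred)
        g_loss = bce(tf.ones_like(fake_pred), fake_pred)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))

# "Real" data: samples from N(3, 0.5) – a toy stand-in for a training set.
for _ in range(1000):
    train_step(np.random.normal(3.0, 0.5, size=(batch_size, 1)).astype("float32"))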


QUESTIONS?
