The main loop gives Nans when in a computation expression · Issue #8 · DiffSharp/DiffSharp · GitHub

Closed

mrakgr opened this issue Nov 27, 2015 · 7 comments

Labels: bug

Comments

mrakgr (Contributor) commented Nov 27, 2015

In the sequence recall program below, when I uncomment the 'let train =' line and the related code at the end, I get NaN numbers after around 2000-3000 iterations. When I leave it as it is, it optimizes just fine through the whole 10k iterations.

This looks like a bug to me, as there is nothing to indicate that the results of the two runs should differ.

Also, although not related to this issue, one thing that sticks out to me is the lack of a map function for DM matrices. It would really help in implementing various activation functions, not to mention the clipping function for the final sigmoid layer. Would that be difficult to add to the AD library?


#I @"C:\Users\Marko\Documents\Visual Studio 2015\Projects\Automatic Differentiation\packages\DiffSharp.0.7.4\lib\net46"
#r @"DiffSharp.dll"

#I @"C:\Users\Marko\Documents\Visual Studio 2015\Projects\Automatic Differentiation\packages\FSharp.Quotations.Evaluator.1.0.6\lib\net40"
#r @"FSharp.Quotations.Evaluator.dll"

//#I @"C:\Users\Marko\Documents\Visual Studio 2015\Projects\Automatic Differentiation\packages\FSharp.Charting.0.90.13\lib\net40"
//#r "FSharp.Charting.dll" 
//#r @"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.6\System.Windows.Forms.DataVisualization.dll"

//open FSharp.Charting

open DiffSharp.AD.Float32
open DiffSharp.Util

open System.IO

let rng = System.Random()

// A layer of neurons
type Layer =
    {mutable W:DM  // Input weight matrix
     mutable U:DM  // Recurrent weight matrix
     mutable b:DV  // Bias vector
     a:DM->DM}     // Activation function

let createRandomLayer hidden_size input_size act =
    {
    W = DM.init hidden_size input_size (fun _ _ -> (rng.NextDouble()-0.5) / sqrt(float hidden_size) |> float32)
    U = DM.init hidden_size input_size (fun _ _ -> (rng.NextDouble()-0.5) / sqrt(float hidden_size) |> float32)
    b = DV.init hidden_size (fun _ -> (rng.NextDouble()-0.5) / sqrt(float hidden_size) |> float32)
    a = act
    }

// A feedforward network of neuron layers
type Network =
    {layers:Layer[]} // The layers forming this network

// For the section with no previous hidden state.
let runLayerNoH (x:DM) (l:Layer) =
    l.W * x + l.b |> l.a

// For the section with no input
let runLayerNoI (y:DM) (l:Layer) =
    l.U * y + l.b |> l.a

// For the section with previous hidden state
let runLayer (x:DM) (y:DM) (l:Layer) =
    l.W * x + l.U * y + l.b |> l.a

// To me these two problems look roughly similar but to the network they are worlds apart it seems.
let sequence_recall_data batch_size seq_length =
    [|
    for k = 1 to batch_size do
        let t = [|for i=1 to 7 do yield if rng.NextDouble() > 0.5 then 1.0f else 0.0f|]
        yield t
        for i=2 to seq_length-1 do
            let t = [|for i=1 to 7 do yield if rng.NextDouble() > 0.5 then 1.0f else 0.0f|]
            yield t
        yield t |]

let target_length = 3
let batch_size = 50
let training_data = sequence_recall_data batch_size target_length
let training_data_transposed =
    [|
    for i=0 to target_length-1 do
        let t = 
            [|
            for k=0 to batch_size-1 do
                let ind = k*target_length+i
                yield training_data.[ind] |] |> Array.map Array.toSeq |> Array.toSeq |> toDM
        yield t |]

let hidden_size = 10
let input_size = 7
let l1 = createRandomLayer hidden_size input_size DM.Tanh
let l2 = createRandomLayer input_size hidden_size DM.Sigmoid

let layers = [|l1;l2|]

let learning_rate = 0.1f / float32 batch_size


//let train =
    //[|
for i=1 to 10000 do
    let tag = DiffSharp.Util.GlobalTagger.Next
    for l in layers do
        l.W <- l.W |> makeReverse tag
        l.U <- l.U |> makeReverse tag
        l.b <- l.b |> makeReverse tag

    let a1 = runLayerNoH training_data_transposed.[0] l1
    let a2 = runLayer training_data_transposed.[1] a1 l1
    let a3 = runLayerNoI a2 l1
    let b3 = runLayerNoH a3 l2
    //let cost = -(training_data_transposed.[2] * log b3 + (1.0f-training_data_transposed.[2]) * log (1.0f-b3)) |> DM.Sum // Does not work. Probably because I have not clipped the outputs.
    let cost = b3 .* b3 |> DM.sum

    cost |> reverseProp (D 1.0f)

    for l in layers do
        l.W <- l.W.P - learning_rate*l.W.A
        l.U <- l.U.P - learning_rate*l.U.A
        l.b <- l.b.P - learning_rate*l.b.A

    let t = float32 cost

    printfn "The cost at iteration %i is %f" i t
        //yield 0.0f |]

//(Chart.Line train).ShowChart()
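
A quick way to pinpoint where the blow-up starts (a sketch, not part of the code above; System.Single.IsNaN is standard .NET and nothing DiffSharp-specific) is to check the printed cost inside the loop:

// Hypothetical helper: raise as soon as a cost value turns into a NaN so the
// first bad iteration is visible. Call it right after `let t = float32 cost`
// as `checkNaN i t`.
let checkNaN (iter:int) (cost:float32) =
    if System.Single.IsNaN cost then
        failwithf "Cost became NaN at iteration %i" iter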
mrakgr (Contributor, Author) commented Nov 27, 2015

I am not 100% sure, but I think the above might be a memory corruption bug: I have not transposed my data properly, so the matrix dimensions do not match. I also forgot to subtract the targets from the outputs in the cost variable.
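
For completeness, the corrected cost (a sketch; it mirrors the squared-error form used further down in this thread, with the names from the first post) would subtract the targets first:

let cost =
    let r = b3 - training_data_transposed.[2]
    r .* r |> DM.sum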

Right now I've just gotten the OR example to work, and I've noticed that DiffSharp seems to have no bounds checks on matrix dimensions anywhere! Even more so than map, I would say bounds checking is essential to a library like this. Is it really not included?

Edit: I also cannot get the logistic cross entropy cost function to work for some reason. It took me a bit to realize that I was inadvertently using the matrix product operator * instead of the .* operator, but even after I fixed that, the function does not behave the way one would expect.

The code below is mostly the same as in the neural networks tutorial. The squared error cost function works fine.

Edit (2 days later): I still have not gotten the cross entropy error shown below to work. The failure in the example above is surely due to a memory bug (though it won't converge properly even after I fixed it), but why the code below will not work is a complete mystery to me. I am not sure whether this is a bug in the library or some misunderstanding on my part of how the library works.

Please advise.


#I @"C:\Users\Marko\Documents\Visual Studio 2015\Projects\Automatic Differentiation\packages\DiffSharp.0.7.4\lib\net46"
#r @"DiffSharp.dll"

#I @"C:\Users\Marko\Documents\Visual Studio 2015\Projects\Automatic Differentiation\packages\FSharp.Quotations.Evaluator.1.0.6\lib\net40"
#r @"FSharp.Quotations.Evaluator.dll"

#I @"C:\Users\Marko\Documents\Visual Studio 2015\Projects\Automatic Differentiation\packages\FSharp.Charting.0.90.13\lib\net40"
#r "FSharp.Charting.dll" 
#r @"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.6\System.Windows.Forms.DataVisualization.dll"

open DiffSharp.AD.Float32
open DiffSharp.Util

open FSharp.Charting

open System.IO

let rnd = System.Random()

// A layer of neurons
type Layer' =
    {mutable W:DM  // Weight matrix
     mutable b:DV  // Bias vector
     a:DM->DM}     // Activation function

// A feedforward network of neuron layers
type Network' =
    {layers:Layer'[]} // The layers forming this network

let runLayer' (x:DM) (l:Layer') =
    l.W * x + (DM.createCols x.Cols l.b) |> l.a


let runNetwork' (x:DM) (n:Network') =
    Array.fold runLayer' x n.layers

// Backpropagation with SGD and minibatches
// n: network
// eta: learning rate
// epochs: number of training epochs
// mbsize: minibatch size
// loss: loss function
// x: training input matrix
// y: training target matrix
let backprop' (n:Network') (eta:float32) epochs mbsize loss (x:DM) (y:DM) =
    [|
    let i = DiffSharp.Util.GlobalTagger.Next
    let mutable b = 0
    let batches = x.Cols / mbsize
    let mutable j = 0
    while j < epochs do
        b <- 0
        while b < batches do
            let mbX = x.[*, (b * mbsize)..((b + 1) * mbsize - 1)]
            let mbY = y.[*, (b * mbsize)..((b + 1) * mbsize - 1)]

            for l in n.layers do
                l.W <- l.W |> makeReverse i
                l.b <- l.b |> makeReverse i

            let L:D = loss (runNetwork' mbX n) mbY
            L |> reverseProp (D 1.0f)

            for l in n.layers do
                l.W <- (l.W.P - eta * l.W.A)
                l.b <- (l.b.P - eta * l.b.A)

            printfn "Epoch %i, minibatch %i, loss %f" j b (float32 L)
            b <- b + 1
            yield float32 L
        j <- j + 1|]

let createNetwork (l:int[]) =
    {layers = Array.init (l.Length - 1) (fun i ->
        {W = DM.init l.[i + 1] l.[i] (fun _ _ -> -0.5 + rnd.NextDouble() |> float32)
         b = DV.init l.[i + 1] (fun _ -> -0.5 + rnd.NextDouble() |> float32)
         a = sigmoid})}


let net1 = createNetwork [|2; 3; 1|]

let softmaxCrossEntropy (x:DM) (y:DM) =
    -(x |> DM.toCols |> Seq.mapi (fun i v -> 
        (DV.standardBasis v.Length (int (float32 y.[0, i]))) * log v) |> Seq.sum) / x.Cols

let logisticCrossEntropy (x:DM) (y:DM) =
    -((y .* (DM.Log x) + (1.0f-y) .* DM.Log (1.0f-x)) |> DM.Sum)

let squareSum (x:DM) (y:DM) =
    let r = x - y
    (r .* r |> DM.Sum)

let ORx =  [|[0.; 0.]
             [0.; 1.]
             [1.; 0.]
             [1.; 1.]|] |> Array.map List.toSeq |> Array.toSeq |> toDM |> DM.Transpose

let ORy =  [|[0.]
             [1.]
             [1.]
             [1.]|] |> Array.map List.toSeq |> Array.toSeq |> toDM |> DM.Transpose

let train = backprop' net1 0.005f 1000 4 logisticCrossEntropy ORx ORy

(Chart.Line train).ShowChart()

mrakgr (Contributor, Author) commented Dec 3, 2015

I've managed to find the error in the cross entropy function. It turns out that it was a library bug. Take a look at this and you will see what I mean at around line 140.
The two forms

let neg_a2 = -a2
let neg_a2_plus_one = neg_a2 + 1.0f

and

let neg_a2_plus_one = 1.0f - a2

do not give the same results.

I'll take a shot at fixing it directly.

Edit: I managed to find the errors.

On line 2958:
| Sub_DCons_D(b) -> pushRec ((bx -d.A b) :: t)
On line 3025:
| Sub_DCons_DV(b) -> pushRec ((bx d.A b) :: t)
On line 3172:
| Sub_DCons_DM(b) -> pushRec ((bx d.A b) :: t)

Based on symmetry, I would say that both Sub_DCons_DV and Sub_DCons_DM are wrong.
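
For anyone who wants to reproduce this, here is a minimal sketch that uses only calls already appearing in this thread (toDM, makeReverse, reverseProp, DM.Sum, the .A adjoint property). Both gradients should be matrices filled with -1, but with the wrong Sub_DCons_DM adjoint the second one comes out different:

open DiffSharp.AD.Float32

// Compute the adjoint (gradient) of `a` for a given scalar loss f(a).
let gradOf (f : DM -> D) =
    let tag = DiffSharp.Util.GlobalTagger.Next
    let a = toDM [[0.2f; 0.7f]] |> makeReverse tag
    f a |> reverseProp (D 1.0f)
    a.A

let g1 = gradOf (fun a -> DM.Sum (-a + 1.0f))  // negate, then add the constant
let g2 = gradOf (fun a -> DM.Sum (1.0f - a))   // constant-minus-matrix: the case reported above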

gbaydin (Member) commented Dec 3, 2015

Hi Marko, thank you for reporting this.

I'm sorry for the delay. I've just seen the issue you opened here. I will look at it and let you know as soon as I understand what the problem is.

mrakgr (Contributor, Author) commented Dec 4, 2015

Got your mail. It happened to me too just recently that I did not get a GitHub notification. The only thing I have to add to this thread, for the sake of completeness, is that the code in my first post had one other error besides using matrix multiplication instead of the Hadamard product.

W = DM.init hidden_size input_size (fun _ _ -> (rng.NextDouble()-0.5) / sqrt(float hidden_size) |> float32)
U = DM.init hidden_size input_size (fun _ _ -> (rng.NextDouble()-0.5) / sqrt(float hidden_size) |> float32)
b = DV.init hidden_size (fun _ -> (rng.NextDouble()-0.5) / sqrt(float hidden_size) |> float32)

U = DM.init hidden_size input_size should be U = DM.init hidden_size hidden_size
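
Written out, the corrected initialization would be (a sketch; only the U dimensions change relative to the first post):

let createRandomLayer hidden_size input_size act =
    {
    W = DM.init hidden_size input_size (fun _ _ -> (rng.NextDouble()-0.5) / sqrt(float hidden_size) |> float32)
    U = DM.init hidden_size hidden_size (fun _ _ -> (rng.NextDouble()-0.5) / sqrt(float hidden_size) |> float32)
    b = DV.init hidden_size (fun _ -> (rng.NextDouble()-0.5) / sqrt(float hidden_size) |> float32)
    a = act
    }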

I am so used to bounds checks bailing me out of these kinds of bugs that I did not see it at first. That cross entropy bug was also pretty nasty.

mrakgr closed this as completed on Dec 4, 2015
gbaydin added the bug label on Dec 5, 2015
gbaydin (Member) commented Dec 5, 2015

I can confirm that this was a bug in the reverse AD code of some scalar-vector and scalar-matrix operations. The Sub_D_DV, Sub_D_DM, Sub_DCons_DV and Sub_DCons_DM code is now fixed.

Seeing that you are implementing recurrent neural networks, I would suggest having a look at the RNN, LSTM, and GRU code here: https://github.com/hypelib/Hype/blob/master/src/Hype/Neural.fs

You can perhaps use or modify that code to suit your needs. The training code there also works quite well and implements several gradient-based optimization methods such as RMSProp.

Let me also try to briefly answer the questions you asked:

Map operations
A map operation for matrices is not implemented in DiffSharp, and this is by design: we want the user to use the standard operations (e.g., exp, sin) on whole matrices, which runs significantly faster for both forward and reverse evaluation and makes use of BLAS functionality as much as possible. If a function is mapped over all entries of a matrix, the matrix has to be converted into a 2D array of AD scalars of type D, and each scalar has to hold its own trace for reverse AD, which is hugely inefficient. This was actually the case before version 0.7 of the library, and it was one of the factors that significantly limited performance.

If you really need this type of mapping, you can still achieve it by converting the matrix to a 2D array and mapping the function over that. For example:

let m = toDM [[1; 2]; [3; 4]];;
let m1 = m |> DM.toArray2D |> Array2D.map (fun (v:D) -> sin (v * v))

val m : DM = DM [[1.0f; 2.0f]
                 [3.0f; 4.0f]]

val m1 : D [,] = [[D 0.841470957f; D -0.756802499f]
                  [D 0.412118495f; D -0.287903309f]]

The suggested and faster way of doing this without map would be:

let m = toDM [[1; 2]; [3; 4]];;
let m1 = sin (m .* m)

val m : DM = DM [[1.0f; 2.0f]
                 [3.0f; 4.0f]]

val m1 : DM = DM [[0.841470957f; -0.756802499f]
                  [0.412118495f; -0.287903309f]]

Bounds checking
Bounds checking is not implemented in many places of the library for performance reasons. Most of these cases will nevertheless throw exceptions such as IndexOutOfRangeException from the underlying implementation, and because these are not caught by the library, they end up reaching the user. We can improve this behavior and introduce better checks in coming versions.
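
In the meantime, a small user-side guard is easy to add. For example, a hypothetical wrapper around the runLayer' function from the code above (it only assumes the Rows property on DM, alongside the Cols property already used in this thread):

// Fail with a clear message when the weight matrix and the input batch do
// not line up, instead of relying on an exception from deeper inside.
let runLayerChecked (x:DM) (l:Layer') =
    if l.W.Cols <> x.Rows then
        failwithf "Layer expects an input with %i rows but got %i" l.W.Cols x.Rows
    runLayer' x l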

Some suggestions for usage

I can give you a few quick suggestions to make your life a bit easier when using the library. :)

Instead of

let XORx = [|[0.; 0.]
             [0.; 1.]
             [1.; 0.]
             [1.; 1.]
             |] |> Array.map List.toSeq |> Array.toSeq |> toDM |> DM.Transpose

you can just write

let XORx = toDM [[0;0;1;1];[0;1;0;1]]

which gives the same result. This is because lists and arrays can be passed as sequences. It's the reason we use seq in some parts of the API.

Instead of

W = [|[|-0.55f;-0.4f;-0.25f|]|] |> Array.map Array.toSeq |> Array.toSeq |> toDM

you can write

W = toDM [[-0.55f;-0.4f;-0.25f]]

Again, thank you very much for catching the bug! :)

gbaydin (Member) commented Dec 5, 2015

One more thing! :) The API is still evolving and we would be very happy to hear if you have any comments or suggestions regarding that!

gbaydin (Member) commented Dec 6, 2015

The bug is fixed in version 0.7.5.
