The goals of this exercise session are to:
Don’t forget to add your results to the score board!
We will use Brad from exercise 3.2 as a starting point to see how dropout may help us avoid overfitting. If you didn’t get as far as defining him, you can use this code as a starting point for Brad:
library(keras)  # load keras (if not already loaded from the earlier exercises)

# Define Brad and compile him
brad <- keras_model_sequential()

# Build model structure
brad %>%
  layer_dense(units = 91, activation = 'sigmoid',
              input_shape = 91) %>%
  layer_dense(units = 91, activation = 'sigmoid') %>%
  layer_dense(units = 91, activation = 'sigmoid') %>%
  layer_dense(units = 91, activation = 'sigmoid') %>%
  layer_dense(units = 91, activation = 'sigmoid') %>%
  layer_dense(units = 2, activation = "softmax")

# Compile: choose settings for how he will be trained
brad %>% compile(
  loss = "binary_crossentropy",
  optimizer = "rmsprop",
  metrics = c("accuracy")
)
Note that you can insert a dropout layer between any two layers like this:
model %>%
  layer_dense(units = 10, activation = "sigmoid", input_shape = 20) %>%
  layer_dropout(0.3) %>%
  layer_dense(units = 2, activation = "softmax")
This means that during each training update, each node in the first (and only) hidden layer has a 30% chance of being dropped, i.e. its output is temporarily set to zero (which has the same effect as setting all its outgoing weights to zero for that update).
Between each pair of layers of Brad, introduce a dropout layer with a 15% dropout rate. Compare the performance of this model with the performance of Brad without dropout.
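One possible way to set up this dropout version of Brad is sketched below (here called brad2, the name also used in the fit() call further down); the layer sizes and compile settings are simply copied from Brad, with layer_dropout(0.15) inserted between each pair of layers:
# Brad with 15% dropout between each pair of layers (a sketch)
brad2 <- keras_model_sequential()

brad2 %>%
  layer_dense(units = 91, activation = 'sigmoid',
              input_shape = 91) %>%
  layer_dropout(0.15) %>%
  layer_dense(units = 91, activation = 'sigmoid') %>%
  layer_dropout(0.15) %>%
  layer_dense(units = 91, activation = 'sigmoid') %>%
  layer_dropout(0.15) %>%
  layer_dense(units = 91, activation = 'sigmoid') %>%
  layer_dropout(0.15) %>%
  layer_dense(units = 91, activation = 'sigmoid') %>%
  layer_dropout(0.15) %>%
  layer_dense(units = 2, activation = "softmax")

brad2 %>% compile(
  loss = "binary_crossentropy",
  optimizer = "rmsprop",
  metrics = c("accuracy")
)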
We will now systematically compare different models in order to tune the dropout rate parameter. We will look at the following possible values for the dropout rate: \(\phi \in \{0, 0.05, 0.10, \ldots, 0.90, 0.95\}\)
A first idea here could be to simply run Brad with all these different choices of dropout rate and then choose the dropout rate that results in the largest accuracy on the test data. But then we would be using the test data to make model decisions - i.e. we would learn from the test data - and thus we would be breaking the two rules of machine learning (and would very likely overfit to the test data).
Instead, we will split our training data into two parts and measure the performance on this “new” test data. This can be done by using the validation_split argument in the fit() function:
brad2_history <- brad2 %>% fit(x = NN_traindata_x,
                               y = NN_traindata_DEATH2YRS,
                               epochs = 20,
                               batch_size = 10,
                               validation_split = 0.2)
Setting validation_split = 0.2 means that only the first 80% of the data (i.e. observations 1 to 962) are used for training, while the remaining 20% are used only for testing.
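After fitting, the per-epoch validation accuracy is stored in the history object. Note that depending on your keras version the metric may be named val_acc (as used in the loop below) or val_accuracy. For example:
# validation accuracy for each of the 20 epochs
brad2_history$metrics$val_acc

# or simply plot the whole training history
plot(brad2_history)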
# initialize two vectors where we will store the accuracies and the
# dropout rates
accuracies <- numeric(20)
dops <- seq(0, 0.95, 0.05)

# for-loop: everything inside the curly brackets {} is repeated 20 times
# the first time, i is set to 1, the second time i is set to 2, ...
for (i in 1:20) {
  # choose the ith value of the dropout rates
  dop <- dops[i]

  # build a Brad with this dropout rate
  thisBrad <- keras_model_sequential()
  thisBrad %>%
    layer_dense(units = 91, activation = 'sigmoid',
                input_shape = 91) %>%
    layer_dropout(dop) %>%
    layer_dense(units = 91, activation = 'sigmoid') %>%
    layer_dropout(dop) %>%
    layer_dense(units = 91, activation = 'sigmoid') %>%
    layer_dropout(dop) %>%
    layer_dense(units = 91, activation = 'sigmoid') %>%
    layer_dropout(dop) %>%
    layer_dense(units = 91, activation = 'sigmoid') %>%
    layer_dropout(dop) %>%
    layer_dense(units = 2, activation = "softmax")
  thisBrad %>% compile(
    loss = "binary_crossentropy",
    optimizer = "rmsprop",
    metrics = c("accuracy")
  )

  # train this Brad on the training data
  # note: verbose = 0 turns off information being printed and plots
  # being made
  thisBrad_history <- thisBrad %>% fit(x = NN_traindata_x,
                                       y = NN_traindata_DEATH2YRS,
                                       epochs = 20,
                                       batch_size = 10,
                                       validation_split = 0.2,
                                       verbose = 0)

  # take the validation accuracy from this Brad's 20th (i.e. last) epoch and save it
  accuracies[i] <- thisBrad_history$metrics$val_acc[20]

  # print the result to the screen
  print(paste("Dropout rate", dop, "resulted in an accuracy of", accuracies[i]))
}
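When the loop is done, you could for example look up which dropout rate gave the highest validation accuracy and plot accuracy against dropout rate. A small sketch, using only the accuracies and dops vectors defined above:
# the dropout rate with the highest validation accuracy
dops[which.max(accuracies)]

# plot validation accuracy against dropout rate
plot(dops, accuracies, type = "b",
     xlab = "dropout rate", ylab = "validation accuracy")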
Here is your last chance to build the best NN you can. Try out the ideas you have picked up throughout the day and see if you can beat your previous best model. There are no rules here, except that you have to leave the test data alone while choosing how to build your model.