Overview

The goal of this exercise session is to get hands-on experience with defining simple prediction machines in R, training them on a training dataset, and evaluating how well they classify the observations in a test dataset.

Don’t forget to add your results to the score board!

Loading the data

The following code will then load the data into R (it assumes the file andata.rda has been placed in a data subfolder of your working directory):

load("./data/andata.rda")

You will now have six objects available in your workspace:

  • traindata_x and testdata_x
  • traindata_DEATH2YRS and testdata_DEATH2YRS
  • traindata_DISCONT and testdata_DISCONT

The traindata_x and testdata_x datasets contain the same variables but different observations. You are not allowed to use testdata_x for training your machines: this dataset should be used exclusively for testing their performance on new, unknown data.
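
If you want to check that everything loaded correctly, you can list the contents of your workspace and verify that the outcome vectors match the x datasets (just a quick sanity check):

#List the objects now available in the workspace
ls()

#The outcome vectors should have one entry per row of the corresponding x dataset
nrow(traindata_x); length(traindata_DEATH2YRS)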

We have two binary outcome variables:

  • DEATH2YRS: whether the patient died within two years
  • DISCONT: whether the patient discontinued treatment

I recommend that you focus on trying to predict DEATH2YRS in the following, but if you want more of a challenge, see what you can do with DISCONT.

A tip before you start: the ECOG performance status has been split into two dummy variables, ECOG_1 and ECOG_2. A patient with ECOG score 1 has ECOG_1 = 1, a patient with ECOG score 2 has ECOG_2 = 1, and a patient with ECOG score 0 has zeros in both. You can see their distributions in the training data like this:

table(ECOG1 = traindata_x$ECOG_1); table(ECOG2 = traindata_x$ECOG_2)
## ECOG1
##   0   1 
## 617 586
## ECOG2
##    0    1 
## 1148   55

1.1. Try exploring the dataset a bit.

You can look at the codebook to get an explanation of what each variable contains, and you can explore the data further using e.g. the dim(), table() and hist() functions. I provide a few examples below that you may run for inspiration. Don’t spend too much time on this! We will have the machines learn from the data for us today.

#How many observations and how many variables are there?
dim(traindata_x)

#Look at table for primary outcome: DEATH2YRS:
table(traindata_DEATH2YRS)

#Histograms for "HB" (most commonly selected predictor in the Seyednasrollah paper),
#stratified by DEATH2YRS
hist(traindata_x$HB[traindata_DEATH2YRS == 0], main = "HB distribution for 2YR survivors",
     xlab = "HB")
hist(traindata_x$HB[traindata_DEATH2YRS == 1], main = "HB distribution for 2YR non-survivors",
     xlab = "HB")

#Cross tabulation of ECOG_1 and DEATH2YRS
table(ECOG1 = traindata_x$ECOG_1, traindata_DEATH2YRS)

#Cross tabulation of ECOG_2 and DEATH2YRS
table(ECOG2 = traindata_x$ECOG_2, traindata_DEATH2YRS)

1.2. Time to train your first machine: Susan

Below, we define a machine to predict DEATH2YRS. Let’s call her Susan. Susan knows that patients with more severe disease have higher ECOG scores, and she wants to use this information to form a guess for a new patient. If someone has ECOG score 0, she will guess that they survive. If they have ECOG score 1, she will guess that they die (guess 1) with probability \(p_1\) and that they survive (guess 0) with probability \(1 - p_1\), where \(p_1 = P(Y = 1 \, | \, ECOG = 1)\), i.e. the probability of dying among patients with ECOG value 1. Similarly, if they have ECOG score 2, she will guess 1 with probability \(p_2\) and 0 with probability \(1 - p_2\), where \(p_2 = P(Y = 1 \, | \, ECOG = 2)\), i.e. the probability of dying among patients with ECOG value 2.

1.2.1. Run the code and make sure you understand roughly what happens in each line.

#Define Susan. She needs to take the following inputs: training x values, training y values.
#And she outputs a function that takes new x-values and outputs guesses for y-values for each
#of them

susan <- function(data_x, y) {
  #STEP 1: Learn from data
  
  #Compute an estimate of p1 and p2
 
  #Define subsets of y for those with ECOG 1 and ECOG 2, respectively
  ECOG1_y <- subset(y, data_x$ECOG_1 == 1)
  ECOG2_y <- subset(y, data_x$ECOG_2 == 1)

  #Compute the proportion of y values equal to 1 among those with ECOG 1 and 2, respectively
  #Note: mean() sums the number of 1s (deaths) and divides by the number of observations
  #=> proportion dead
  p1 <- mean(ECOG1_y)
  p2 <- mean(ECOG2_y)
  
  #STEP 2: Return a function that makes predictions for new observations
  predictFunction <- function(newdata) {
    #Make a vector for predictions with one entry per observation (row) in newdata,
    #filled with zeros (i.e. the default guess is survival).
    ys <- rep(0, nrow(newdata))
    
    #Count how many observations in newdata have ECOG value 1 and 2, respectively:
    n1 <- sum(newdata$ECOG_1 == 1) 
    n2 <- sum(newdata$ECOG_2 == 1) 

    #replace the values in ys where newdata$ECOG_1 == 1 with a randomly drawn value.
    #There should be probability p1 of that value being 1, probability 1-p1 of that 
    #value being 0.
    ys[newdata$ECOG_1 == 1] <- sample(c(1,0), size = n1, replace = TRUE, 
                                     prob = c(p1, 1-p1)) 
    
    #replace the values in ys where newdata$ECOG_2 == 1 with a randomly drawn value.
    #There should be probability p2 of that value being 1, probability 1-p2 of that 
    #value being 0.
    ys[newdata$ECOG_2 == 1] <- sample(c(1,0), size = n2, replace = TRUE, 
                                     prob = c(p2, 1-p2)) 
    
    #return Susan's guesses
    return(ys)
  }
   
  #return the prediction function
  return(predictFunction)
}
#Train Susan by running her on the traindata. Remember to save her output (which is a function):
susan_predict <- susan(traindata_x, traindata_DEATH2YRS) 
#Use Susan's prediction function on the testdata set:
susan_guesses <- susan_predict(testdata_x)
#Compare Susan's guesses with the actual values (confusion matrix)
table(susan_guesses, testdata_DEATH2YRS)

#Compute the accuracy (proportion of correctly classified observations)
sum(susan_guesses == testdata_DEATH2YRS)/length(susan_guesses)
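
Since susan_guesses == testdata_DEATH2YRS is a logical vector, the same accuracy can equivalently be computed with mean():

mean(susan_guesses == testdata_DEATH2YRS)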

1.2.2. Reflect:

We find that Susan guesses correctly for about \(60\%\) of the observations in the testdata. So she’s better than a random coinflip! Go Susan!

  • Every time you run susan_predict(testdata_x) you get a slightly different accuracy. Why is that? (See also the sketch below.)
  • Susan is better than a random coinflip. Can you come up with a more reasonable accuracy to demand of her? Does she live up to that criterion?
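
To see how much Susan's accuracy varies between runs, here is a small sketch that repeats her prediction step many times and summarises the resulting accuracies (it assumes susan_predict and the test data objects from above are still in your workspace):

#Repeat Susan's prediction 1000 times and compute the accuracy each time
susan_accuracies <- replicate(1000, {
  guesses <- susan_predict(testdata_x)
  mean(guesses == testdata_DEATH2YRS)
})

#Summarise and plot the distribution of accuracies
summary(susan_accuracies)
hist(susan_accuracies, main = "Susan's accuracy across 1000 runs", xlab = "Accuracy")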

1.3. Defining another machine: Lazy Joe.

Below, you will define a machine on your own. He is called Joe, and he is a very lazy machine. All he does is guess that every observation should be labeled 0, since this is the most common label in the traindata:

table(traindata_DEATH2YRS)
## traindata_DEATH2YRS
##   0   1 
## 769 434

1.3.1. Define Joe

Fill in the missing parts in the template below.

#Define the Lazy Joe machine. He takes the training x values and training y values as inputs
#and outputs a function that classifies a new dataset (newdata)
joe <- function(data_x, y) {
  #STEP 1: Train your machine
    #[INSERT CODE HERE]
  
  #STEP 2: Return a function that makes predictions for new observations
  predictFunction <- function(newdata) {
    #[INSERT CODE HERE]
  }
   
  #return the prediction function
  return(predictFunction)
}

Hint: This may be easier than you think. STEP 1 is where you look at the training data. Is Joe going to look at the data?

1.3.2. Train Joe and use him to predict values for the testdata

As with Susan, train Joe on the training data. Remember to save the prediction function he outputs. Then use this prediction function to predict values for the test data and save these predictions.
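
As a sketch, the calls mirror the ones used for Susan (this assumes you have filled in the joe() template above):

#Train Joe and save his prediction function
joe_predict <- joe(traindata_x, traindata_DEATH2YRS)

#Use Joe's prediction function on the testdata set and save his guesses
joe_guesses <- joe_predict(testdata_x)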

1.3.3. Evaluate Joe

Make a confusion matrix and measure the accuracy of Joe. Did he do better than Susan?
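
Again mirroring the evaluation of Susan (assuming joe_guesses was saved in 1.3.2):

#Confusion matrix for Joe
table(joe_guesses, testdata_DEATH2YRS)

#Accuracy for Joe
mean(joe_guesses == testdata_DEATH2YRS)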

1.3.4. Reflect:

  • What is the most reasonable lower bound for accuracy when classifying new patients according to DEATH2YRS?

1.4. Machine freestyling

Use the machine-defining template again to define your own machine and see if you can obtain a higher accuracy than Joe’s. Don’t forget to add your machine (and its accuracy) to the score board.
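
If you need inspiration, below is a minimal sketch of one possible machine; let's call her Glenda. The name, the choice of predictors (HB and the two ECOG dummies) and the 0.5 cutoff are just assumptions for illustration, not a recommended model. Glenda fits a logistic regression of the outcome on the chosen predictors and guesses 1 (death) whenever the estimated probability exceeds 0.5; observations with missing predictor values simply get the majority guess, 0. You can very likely do better with other predictors, cutoffs or methods.

#Define Glenda. Like Susan and Joe, she takes training x values and training y values as
#inputs and outputs a prediction function
glenda <- function(data_x, y) {
  #STEP 1: Learn from data - fit a logistic regression of y on HB, ECOG_1 and ECOG_2
  traindata <- data.frame(y = y, HB = data_x$HB,
                          ECOG_1 = data_x$ECOG_1, ECOG_2 = data_x$ECOG_2)
  model <- glm(y ~ HB + ECOG_1 + ECOG_2, family = binomial, data = traindata)
  
  #STEP 2: Return a function that makes predictions for new observations
  predictFunction <- function(newdata) {
    #Estimated probability of death for each new observation
    probs <- predict(model, newdata = newdata, type = "response")
    
    #Guess 1 (death) when the estimated probability exceeds 0.5, otherwise 0.
    #Observations with missing predictor values (NA probabilities) get the guess 0.
    ys <- ifelse(!is.na(probs) & probs > 0.5, 1, 0)
    return(ys)
  }
  
  #return the prediction function
  return(predictFunction)
}

#Train and evaluate Glenda in the same way as before
glenda_predict <- glenda(traindata_x, traindata_DEATH2YRS)
glenda_guesses <- glenda_predict(testdata_x)
table(glenda_guesses, testdata_DEATH2YRS)
mean(glenda_guesses == testdata_DEATH2YRS)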