Overview

The goal of this exercise session is to:

Don’t forget to add your results to the score board!

2.1. A simple logistic regression machine: Pablo

Below, we define a machine that uses logistic regression in its training step, and we name him Pablo.

Pablo wants to use a logistic regression model to predict labels for new observations and he will use the ECOG_1 and ECOG_2 variables as well as the HB and AST variables in his model. That means that he will fit the following model on the training data:

\[\log \left( \frac{P(\text{DEATH2YRS} = 1)}{1 - P(\text{DEATH2YRS} = 1)} \right) = \alpha + \beta_1 \cdot \text{ECOG_1} + \beta_2 \cdot \text{ECOG_2} + \beta_3 \cdot \text{HB} + \beta_4 \cdot \text{AST}\]

and use this model to make guesses for labels by choosing

\[\hat{Y} = \begin{cases} 1 & \text{if } \hat{P}(\text{DEATH2YRS} = 1) > 0.5 \\ 0 & \text{otherwise} \end{cases}\]

In words: we assign the label 1 (death) if the predicted probability of death within two years, based on information about ECOG, HB and AST, is greater than \(0.5\), and otherwise, we assign the label 0 (survival).

2.1.1. Go through the definition of the machine

Look at the code line by line and make sure you understand roughly what is happening.

pablo <- function(data_x, y) {
  #STEP 1: Learn from data
  model <- glm(y ~ ECOG_1 + ECOG_2 + HB + AST, data  = data_x, 
               family = "binomial")
  
  #STEP 2: Return a function that makes predictions for new observations
  predictFunction <- function(test_x) {
    
    #choose the variables from test_x that we will use:
    usedata <- test_x[, c("ECOG_1", "ECOG_2", "HB", "AST")]
    
    #Use the prediction function for glm models:
    probabilities <- predict(model, newdata = usedata, type = "response")
  
    #make a vector of zeros, one for each observation in test_x
    ys <- rep(0, nrow(test_x)) 
    
    #fill in a 1 in ys in all the places where probabilities > 0.5
    ys[probabilities > 0.5] <- 1
    
    #return Pablo's guesses
    return(ys)
  }
   
  #return the prediction function
  return(predictFunction)
}

Hint: predict() used on glm() models will return the probability of each observation having the output 1 when we use the option type = "response". So predict(model, newdata = newdata, type = "response") returns the estimated probabilities \(\hat{P}(\text{DEATH2YRS} = 1 \, | \, \text{ECOG_1}, \text{ECOG_2}, \text{HB}, \text{AST})\) for each observation in newdata.

2.1.2. Train and evaluate Pablo

  • Train Pablo using the training data.
  • Make guesses for each of the observations in testdata.
  • Evaluate Pablo’s performance in terms of accuracy.
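The three steps above could look like the sketch below. The names `traindata`, `traindata_DEATH2YRS` and `testdata` are assumptions about how your data objects are called in this session; `testdata_DEATH2YRS` is the vector of true test labels, as in the AUC code further below.

```r
#train Pablo: he learns from the training data and returns his
#prediction function (data object names are assumptions)
pablo_predict <- pablo(traindata, traindata_DEATH2YRS)

#make guesses for each of the observations in testdata
pablo_guesses <- pablo_predict(testdata)

#accuracy: the proportion of guesses that match the true test labels
mean(pablo_guesses == testdata_DEATH2YRS)
```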

2.1.3. Use other performance measures on Pablo: AUC

As of now, Pablo returns guesses for labels for each observation in the testdata. This allows us to compute the accuracy when using a specific threshold for deciding which label should be used (we used 0.5 as a threshold). However, it might sometimes be more informative to look at the tradeoff between specificity (true negative rate) and sensitivity (true positive rate) for varying values of the threshold. In order to do so, we need Pablo to return his probability estimates for the testdata rather than the labels.

  • Change Pablo so his predict function returns probabilities rather than labels.
  • Train Pablo again and compute predictions for all the observations in the testdata. Look at the results to make sure that you actually output predicted probabilities for each observation in the test data.
  • Draw the ROC curve for Pablo’s probabilities and compute the AUC (area under the ROC curve) using the code provided below.
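One way to make the change in the first bullet is to simply skip the thresholding step in Pablo's prediction function and return the probabilities directly, for example:

```r
pablo <- function(data_x, y) {
  #STEP 1: Learn from data (unchanged)
  model <- glm(y ~ ECOG_1 + ECOG_2 + HB + AST, data = data_x,
               family = "binomial")
  
  #STEP 2: Return a function that now outputs probabilities
  predictFunction <- function(test_x) {
    #choose the variables from test_x that we will use:
    usedata <- test_x[, c("ECOG_1", "ECOG_2", "HB", "AST")]
    
    #return the estimated probabilities instead of 0/1 labels
    predict(model, newdata = usedata, type = "response")
  }
  
  return(predictFunction)
}
```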
#Open the pROC package. Install it if you don't have it:
#install.packages("pROC")
library(pROC)

#ROC curve and AUC:

#compute the AUC (area under the ROC curve)
#pablo_probs should be the estimated probabilities on the testdata
#from Pablo.
pablo_roc <- roc(testdata_DEATH2YRS, pablo_probs)
pablo_roc

#Draw ROC curve
plot(pablo_roc)

2.2. Utilize logistic regression even further

Build your own machine using logistic regression and see if you can beat Pablo in terms of accuracy and AUC.

You can consider the following questions for inspiration:

More advanced options could include:

Below, I provide some information you may use as inspiration for building your model. You can also look at the Seyednasrollah et al. paper for ideas.

The 20 predictors that are the most correlated with the outcome (DEATH2YRS) in the training data:

  absolute correlation
MHNEOPLA 0.22906
MHSOCIAL 0.17704
HB 0.16611
AST 0.15705
CORTICOSTEROID 0.15087
ALP 0.14482
TARGET 0.12008
LIVER 0.11833
BMI 0.10719
CA 0.10458
ECOG_2 0.10075
NON_TARGET 0.09617
BONE 0.09011
MHSURG 0.08060
GONADOTROPIN 0.08025
BILATERAL_ORCHIDECTOMY 0.07911
GLUCOCORTICOID 0.07443
ANALGESICS 0.07274
ADRENAL 0.07257
MHCARD 0.07227
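As a starting point, a machine in the same style as Pablo could include some of the highly correlated predictors from the table above. The particular variables chosen below are just an illustration of the pattern, not a recommendation:

```r
my_machine <- function(data_x, y) {
  #learn: logistic regression using a few of the top-correlated
  #predictors (variable choice here is only an example)
  model <- glm(y ~ MHNEOPLA + MHSOCIAL + HB + AST + CORTICOSTEROID + ALP,
               data = data_x, family = "binomial")
  
  #return probabilities so that both accuracy (after thresholding)
  #and AUC can be computed on the testdata
  predictFunction <- function(test_x) {
    predict(model, newdata = test_x, type = "response")
  }
  return(predictFunction)
}
```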