The goal of this exercise session is to:
Don’t forget to add your results to the score board!
Below, we define a machine that uses logistic regression in its training step, and we name him Pablo.
Pablo wants to use a logistic regression model to predict labels for new observations, and he will use the ECOG_1 and ECOG_2 variables as well as the HB and AST variables in his model. That means that he will fit the following model on the training data:
\[\log \left( \frac{P(\text{DEATH2YRS} = 1)}{1 - P(\text{DEATH2YRS} = 1)} \right) = \alpha + \beta_1 \cdot \text{ECOG_1} + \beta_2 \cdot \text{ECOG_2} + \beta_3 \cdot \text{HB} + \beta_4 \cdot \text{AST}\]
and use this model to make guesses for labels by choosing
\[\hat{y} = \begin{cases} 1 & \text{if } \hat{P}(\text{DEATH2YRS} = 1) > 0.5 \\ 0 & \text{otherwise.} \end{cases}\]
In words: we assign the label 1 (death) if the predicted probability of death within two years, based on information about ECOG, HB and AST, is greater than \(0.5\), and otherwise, we assign the label 0 (survival).
Look at the code line by line and make sure you understand roughly what is happening.
pablo <- function(data_x, y) {
  # STEP 1: Learn from data
  model <- glm(y ~ ECOG_1 + ECOG_2 + HB + AST, data = data_x,
               family = "binomial")
  # STEP 2: Return a function that makes predictions for new observations
  predictFunction <- function(test_x) {
    # choose the variables from test_x that we will use:
    usedata <- test_x[, c("ECOG_1", "ECOG_2", "HB", "AST")]
    # use the prediction function for glm models:
    probabilities <- predict(model, newdata = usedata, type = "response")
    # make a vector of zeros, one for each observation in test_x
    ys <- rep(0, nrow(test_x))
    # fill in a 1 in ys in all the places where probabilities > 0.5
    ys[probabilities > 0.5] <- 1
    # return Pablo's guesses
    return(ys)
  }
  # return the prediction function
  return(predictFunction)
}
Hint: predict() used on glm() models will return the probability of each observation having the output 1 when we use the option type = "response". So predict(model, newdata = newdata, type = "response") returns the estimated probabilities \(\hat{P}(\text{DEATH2YRS} = 1 \, | \, \text{ECOG_1}, \text{ECOG_2}, \text{HB}, \text{AST})\) for each observation in newdata.
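To see the whole pipeline in action, here is a minimal usage sketch, assuming you have already run the definition of pablo() above. The simulated data frame and the train/test split below are purely illustrative (the real exercise uses the course data, not these made-up values):

```r
# simulate a small illustrative data set with the four variables Pablo uses
set.seed(1)
n <- 200
simdata <- data.frame(
  ECOG_1 = rbinom(n, 1, 0.3),
  ECOG_2 = rbinom(n, 1, 0.2),
  HB     = rnorm(n, 13, 2),
  AST    = rnorm(n, 30, 10)
)
# simulate a binary outcome that depends on the covariates
y <- rbinom(n, 1, plogis(-1 + simdata$ECOG_1 + 0.5 * simdata$ECOG_2 -
                           0.1 * simdata$HB))

# train Pablo on the first 150 observations ...
pablo_predict <- pablo(simdata[1:150, ], y[1:150])
# ... and let him guess labels for the remaining 50
guesses <- pablo_predict(simdata[151:200, ])
# guesses is a vector of 0s and 1s, one per test observation
table(guesses)
```

Note that pablo() returns a function, so training and prediction are two separate calls: you train once and can then reuse pablo_predict on any new data frame containing the four variables.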
As of now, Pablo returns guesses for labels for each observation in the testdata. This allows us to compute the accuracy when using a specific threshold for deciding which label should be used (we used 0.5 as a threshold). However, it might sometimes be more informative to look at the tradeoff between specificity (true negative rate) and sensitivity (true positive rate) for varying values of the threshold. In order to do so, we need Pablo to return his probability estimates for the testdata rather than the labels.
#Open the pROC package. Install it if you don't have it:
#install.packages("pROC")
library(pROC)
#ROC curve and AUC:
#compute the AUC (area under ROC curve)
#pablo_probs should be the estimated probabilities on the testdata
#from Pablo.
pablo_roc <- roc(testdata_DEATH2YRS, pablo_probs)
pablo_roc
#Draw ROC curve
plot(pablo_roc)
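As a sanity check on the AUC that pROC reports, recall its interpretation: the AUC is the probability that a randomly chosen positive observation (label 1) receives a higher estimated probability than a randomly chosen negative one (label 0), counting ties as one half. A base-R sketch of this computation (the function name manual_auc is our own, not from pROC):

```r
# AUC as the Mann-Whitney statistic: compare every positive to every negative
manual_auc <- function(labels, probs) {
  pos <- probs[labels == 1]
  neg <- probs[labels == 0]
  # a win counts 1, a tie counts 0.5, a loss counts 0
  wins <- outer(pos, neg, FUN = ">") + 0.5 * outer(pos, neg, FUN = "==")
  mean(wins)
}

# small worked example: one of the four positive-negative pairs is misordered
manual_auc(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))  # gives 0.75
```

Running manual_auc(testdata_DEATH2YRS, pablo_probs) should agree with the AUC printed by pablo_roc above.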
Build your own machine using logistic regression and see if you can beat Pablo in terms of accuracy and AUC.
You can consider the following questions for inspiration:
More advanced options could include:
Below, I provide some information you may use as inspiration for building your model. You can also look at the Seyednasrollah et al. paper for ideas.