Exercise 2 - Machine Learning and Prediction

Exercise summary

This exercise covers using regression as a predictive tool, both for linear and logisitic regression, as well as simple machine learning using the Naive Bayes and kNN algorithms. You will use the same dataset as from Exercise 1, the dail2002.dta data.

Please submit this exercise by email prior to the start of class Monday, March 9.

Partitioning the dataset into “folds”. For this exercise, we will be fitting a model from a subset of the data, and using the fitted model to predict the outcomes from the “left out” set and using that to evaluate RMSE or accuracy.
1. To start, familiarize yourself with the sample() command to draw a random sample of one-fifth of the observations in the dail2002.dta dataset. Use a .Random.seed and read about this function to see what it does. Why should you use this?
This fixes the seed for the random number generator, and ensures that replications of your code produce the same answer.
1. For the categorical predictions, we will also assess predictive ability based on “leave-one-out” testing and using “folds”, which are groups of observations that have been left out of the training set, and then attempting to predict them with accuracy. You will need to use indexing to partition the dataset to carry this out. You might want to write a loop for this. For instance, to partition a group of 20 observations into four sets, you could use the following for loop:
```
n <- 20
k <- 4
data <- data.frame(myIndex = 1:n, letter = LETTERS[1:n])
size <- n/k
if (n %% k)
    stop("n not divisible by k")
for (i in 1:k) {
    startIndex <- 1 + (i-1)*size
    endIndex <- startIndex + size - 1
    cat(startIndex, endIndex, "\n")
    print(data[startIndex : endIndex, ])
}
```
```
## 1 5
##   myIndex letter
## 1       1      A
## 2       2      B
## 3       3      C
## 4       4      D
## 5       5      E
## 6 10
##    myIndex letter
## 6        6      F
## 7        7      G
## 8        8      H
## 9        9      I
## 10      10      J
## 11 15
##    myIndex letter
## 11      11      K
## 12      12      L
## 13      13      M
## 14      14      N
## 15      15      O
## 16 20
##    myIndex letter
## 16      16      P
## 17      17      Q
## 18      18      R
## 19      19      S
## 20      20      T
```
What is the purpose of the line if (n %% 4)?

The %% is a modulo operator, which yields the remainder following division of the first argument by the second. The operation here ensures that the sample size is divisible by k, since the if statement will only evaluate to true if (n %% k) evaluates to a non-zero value.

OLS regression for prediction.

Fit a regression from the dataset to predict votes1st. You may use any combination of regressors that you wish. Save the model object to reg2_1.

require(foreign)

## Loading required package: foreign

dail2002 <- read.dta("https://www.kenbenoit.net/files/dail2002.dta")
reg2_1 <- lm(votes1st ~ incumb:spend_total + senator + gender + electorate, data=dail2002)

Predict the votes1st from the same sample to which you fitted the regression. What is the Root Mean Squared Error (RMSE) and how would you interpret this?

length(predict(reg2_1))   # shortened to 462 because of the two missing cases

## [1] 462

# refit model with different behaviour for na.action
reg2_1 <- lm(votes1st ~ incumb*spend_total + senator + gender + electorate,
             data=dail2002, na.action=na.exclude)
reg2_1pred <- predict(reg2_1)
length(reg2_1pred)  # now includes missing cases

## [1] 464

sum(is.na(reg2_1pred))

## [1] 2

# function to compute RMSE
computeRMSE <- function(lmobject) {
    sqrt(sum(residuals(lmobject)^2, na.rm=TRUE) / df.residual(lmobject))
}
computeRMSE(reg2_1)

## [1] 1799.031

# compare to output from summary()
summary(reg2_1)

##
## Call:
## lm(formula = votes1st ~ incumb * spend_total + senator + gender +
##     electorate, data = dail2002, na.action = na.exclude)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5769.3  -960.1  -189.1   891.2  7071.7
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)         5.326e+02  4.115e+02   1.294   0.1962
## incumb              4.419e+03  4.782e+02   9.241  < 2e-16 ***
## spend_total         2.075e-01  1.166e-02  17.797  < 2e-16 ***
## senator            -5.473e+02  6.519e+02  -0.840   0.4016
## genderf            -5.023e+02  2.193e+02  -2.290   0.0225 *
## electorate          1.407e-04  5.279e-03   0.027   0.9787
## incumb:spend_total -1.074e-01  2.250e-02  -4.773 2.45e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1799 on 455 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.6675, Adjusted R-squared:  0.6631
## F-statistic: 152.2 on 6 and 455 DF,  p-value: < 2.2e-16

Drop the incumbency variable – that you hopefully included in your answer to 2.1! – and repeat steps 2.1–2.2. Compute a new RMSE and compare this to the previous one. Which is a better predictor?
```
reg2_3 <- lm(votes1st ~ spend_total + senator + gender + electorate,
             data=dail2002, na.action=na.exclude)
computeRMSE(reg2_3)
```
```
## [1] 2061.288
```
```
summary(reg2_3)$r.squared
```
```
## [1] 0.5615657
```
The model with incumbency is a better predictor, with a lower RMSE and a higher $R^2$.

Logistic regression for prediction.

Fit a logistic regression (hint: use glm()) to predict the outcome variable wonseat. Use any specification that you think provides a good prediction.
```
reg3_1 <- glm(wonseat ~ incumb:spend_total + senator + gender + electorate, data=dail2002, family=binomial, na.action=na.exclude)
```

For the full sample, compute:

a table of actual wonseat by predicted wonseat

wonseat_predicted <- ifelse(predict(reg3_1) > 0, 1, 0)
(predtable <- table(wonseat_predicted, dail2002$wonseat))

##
## wonseat_predicted   0   1
##                 0 267  65
##                 1  30 100

percent correctly predicted

sum(diag(prop.table(predtable))) * 100

## [1] 79.43723

precision
recall

# define a general function to compute precision & recall
precrecall <- function(mytable, verbose=TRUE) {
    truePositives <- mytable[1,1]
    falsePositives <- sum(mytable[1,]) - truePositives
    falseNegatives <- sum(mytable[,1]) - truePositives
    precision <- truePositives / (truePositives + falsePositives)
    recall <- truePositives / (truePositives + falseNegatives)
    if (verbose) {
        print(mytable)
        cat("\n precision =", round(precision, 2),
            "\n    recall =", round(recall, 2), "\n")
    }
    invisible(c(precision, recall))
}
# compute precision of wonseat (not lost seat!)
precrecall(predtable[2:1, 2:1])

##
## wonseat_predicted   1   0
##                 1 100  30
##                 0  65 267
##
##  precision = 0.77
##     recall = 0.61

Comparing two models.

Compute an 8-fold validation, where for 8 different training sets consisting of 7/8 of the observations, you predict the other held-out 1/8 and compare the actual to predicted for the 1/8 test set. Compute an average F1 score for the 8 models.

# here it helps to know that n=464 and 464 %% 8 == 0
foldIndex <- rep(1:8, each = nrow(dail2002)/8)
# initialize vector to store F1 scores
f1result <- c()
# loop over folds
for (i in 1:8) {
    glmresult <- glm(wonseat ~ incumb:spend_total + senator + gender + electorate,
                     na.action = na.exclude, family = "binomial",
                     data = subset(dail2002, foldIndex != i))
    wonseatPred <- ifelse(predict(glmresult, newdata = subset(dail2002, foldIndex == i)) > 0, 1, 0)
    predtable <- table(wonseatPred, subset(dail2002, foldIndex == i)$wonseat)
    prrec <- precrecall(predtable[2:1, 2:1], verbose=FALSE)
    f1result[i] <- 2 * prod(prrec) / sum(prrec)
}
f1result

## [1] 0.7619048 0.6315789 0.7222222 0.5945946 0.7567568 0.6470588 0.6486486
## [8] 0.6666667

mean(f1result)

## [1] 0.6786789

Now drop a variable or two, and repeat the previous step to compare the average F1 score for this model.

f1result <- c()
# loop over folds
for (i in 1:8) {
    glmresult <- glm(wonseat ~ spend_total + senator + gender + electorate,
                     na.action = na.exclude, family = "binomial",
                     data = subset(dail2002, foldIndex != i))
    wonseatPred <- ifelse(predict(glmresult, newdata = subset(dail2002, foldIndex == i)) > 0, 1, 0)
    predtable <- table(wonseatPred, subset(dail2002, foldIndex == i)$wonseat)
    prrec <- precrecall(predtable[2:1, 2:1], verbose=FALSE)
    f1result[i] <- 2 * prod(prrec) / sum(prrec)
}
f1result

## [1] 0.5500000 0.6363636 0.6875000 0.7234043 0.7179487 0.6500000 0.5000000
## [8] 0.7317073

mean(f1result)

## [1] 0.6496155

Why is it valuable to use the different folds here, rather than simply comparing the F1 score for the predicted outcome of the entire sample, fit to the entire sample? Because fitting on the whole sample can lead to overfitting. For calibrating a predictive model we need to test it out of sample.

kNN prediction.

Fit a $k=1$ kNN model to predict the same specification as in your logistic regression. Compare the percent correctly predicted. Which model worked better? (Note: You can use the whole sample here, so that this will compare the results to those from 3.1).

# get the variables we want
dail2002sub <- subset(dail2002, select = c(spend_total, incumb, gender, electorate))
# convert the factor to numeric 0 or 1
dail2002sub$gender <- unclass(dail2002sub$gender) - 1
# figure out which are complete
completeCasesIndex <- complete.cases(dail2002sub)
# set the training and test datasets to be identical
train <- test <- dail2002sub[completeCasesIndex, ]
require(class)

## Loading required package: class

# make the true class a factor
wonseatf <- factor(dail2002$wonseat[completeCasesIndex], labels = c("No", "Yes"))
knnPred <- knn(train, test, wonseatf, k=1)
precrecall(table(knnPred, wonseatf)[2:1, 2:1])

##        wonseatf
## knnPred Yes  No
##     Yes 165   0
##     No    0 297
##
##  precision = 1
##     recall = 1

Experiment with two more settings of $k$ to see how this affects prediction, reporting the percent correctly predicted.

# k = 2
precrecall(table(knnPred <- knn(train, test, wonseatf, k=2), wonseatf)[2:1, 2:1])

##      wonseatf
##       Yes  No
##   Yes 130  37
##   No   35 260
##
##  precision = 0.78
##     recall = 0.79

# k = 3
precrecall(table(knnPred <- knn(train, test, wonseatf, k=2), wonseatf)[2:1, 2:1])

##      wonseatf
##       Yes  No
##   Yes 127  31
##   No   38 266
##
##  precision = 0.8
##     recall = 0.77

# k = 4
precrecall(table(knnPred <- knn(train, test, wonseatf, k=2), wonseatf)[2:1, 2:1])

##      wonseatf
##       Yes  No
##   Yes 133  34
##   No   32 263
##
##  precision = 0.8
##     recall = 0.81

# k = 4
precrecall(table(knnPred <- knn(train, test, wonseatf, k=2), wonseatf)[2:1, 2:1])

##      wonseatf
##       Yes  No
##   Yes 125  42
##   No   40 255
##
##  precision = 0.75
##     recall = 0.76

We can also plot precision by recall to get an ROC curve-type plot, to evaluate performance at each level of k. Note that this will not produce “curves” because we are not fitting to different folds or datasets, just once to recover the predictions from the whole training set.

maxk <- 10
prresults <- data.frame(precision = NA, recall = NA)
for (k in 1:maxk) {
    prresults[k, ] <- precrecall(table(knnPred <- knn(train, test, wonseatf, k=k), wonseatf)[2:1, 2:1],
                                 verbose=FALSE)
}
with(prresults, plot(recall, precision, type="n", xlim=c(.5, 1), ylim=c(.5, 1)))
with(prresults, text(recall, precision, paste0("k=", 1:maxk)))

precision

Exercise 2 - Machine Learning and Prediction

Ken Benoit

4 March 2015

Exercise summary