This exercise covers using regression as a predictive tool, both for linear and logistic regression, as well as simple machine learning using the Naive Bayes and kNN algorithms. You will use the same dataset as in Exercise 1, the dail2002.dta data.
Please submit this exercise by email prior to the start of class Monday, March 9.
Partitioning the dataset into “folds”. For this exercise, we will fit a model to a subset of the data, use the fitted model to predict the outcomes for the “left-out” set, and use those predictions to evaluate RMSE or accuracy.
Use the sample() command to draw a random sample of one-fifth of the observations in the dail2002.dta dataset. First use set.seed() (which controls the .Random.seed state) and read about this function to see what it does. Why should you use it? It fixes the seed for the random number generator, which ensures that replications of your code produce the same answer.
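A minimal sketch of this step (assuming the dail2002 data frame has already been loaded, as in the read.dta() call below; the seed value 1234 is arbitrary):
set.seed(1234)  # fix the random number stream so the draw is reproducible
sampleIndex <- sample(nrow(dail2002), size = round(nrow(dail2002) / 5))
dail2002sample <- dail2002[sampleIndex, ]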
The logic of splitting the data into equal folds can be demonstrated with a for loop:
n <- 20
k <- 4
data <- data.frame(myIndex = 1:n, letter = LETTERS[1:n])
size <- n/k                       # observations per fold
if (n %% k)                       # a non-zero remainder coerces to TRUE
    stop("n not divisible by k")
for (i in 1:k) {
    startIndex <- 1 + (i - 1) * size
    endIndex <- startIndex + size - 1
    cat(startIndex, endIndex, "\n")
    print(data[startIndex:endIndex, ])
}
## 1 5
## myIndex letter
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
## 6 10
## myIndex letter
## 6 6 F
## 7 7 G
## 8 8 H
## 9 9 I
## 10 10 J
## 11 15
## myIndex letter
## 11 11 K
## 12 12 L
## 13 13 M
## 14 14 N
## 15 15 O
## 16 20
## myIndex letter
## 16 16 P
## 17 17 Q
## 18 18 R
## 19 19 S
## 20 20 T
What is the purpose of the line if (n %% k)? The %% is the modulo operator, which yields the remainder after dividing the first argument by the second. The check ensures that the sample size is divisible by k: the if condition is coerced to TRUE only when (n %% k) evaluates to a non-zero value, in which case stop() halts execution with an error.
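A quick illustration of the coercion:
20 %% 4  # 0, coerced to FALSE, so stop() is not triggered
20 %% 3  # 2, non-zero, coerced to TRUE, so stop() would halt with an error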
OLS regression for prediction.
Fit a regression from the dataset to predict votes1st. You may use any combination of regressors that you wish. Save the model object to reg2_1.
require(foreign)
## Loading required package: foreign
dail2002 <- read.dta("https://www.kenbenoit.net/files/dail2002.dta")
reg2_1 <- lm(votes1st ~ incumb*spend_total + senator + gender + electorate, data=dail2002)
Predict votes1st from the same sample to which you fitted the regression. What is the Root Mean Squared Error (RMSE), and how would you interpret it?
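As a reminder, the RMSE used here divides the sum of squared residuals by the residual degrees of freedom, so the computeRMSE() function defined below matches the “residual standard error” reported by summary():
\[ \textrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} \hat{e}_i^2}{n - p}} \]
where \(\hat{e}_i\) are the residuals and \(n - p\) is the residual degrees of freedom.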
length(predict(reg2_1)) # shortened to 462 because of the two missing cases
## [1] 462
# refit model with different behaviour for na.action
reg2_1 <- lm(votes1st ~ incumb*spend_total + senator + gender + electorate,
             data=dail2002, na.action=na.exclude)
reg2_1pred <- predict(reg2_1)
length(reg2_1pred) # now includes missing cases
## [1] 464
sum(is.na(reg2_1pred))
## [1] 2
# function to compute RMSE
computeRMSE <- function(lmobject) {
    sqrt(sum(residuals(lmobject)^2, na.rm = TRUE) / df.residual(lmobject))
}
computeRMSE(reg2_1)
## [1] 1799.031
# compare to output from summary()
summary(reg2_1)
##
## Call:
## lm(formula = votes1st ~ incumb * spend_total + senator + gender +
## electorate, data = dail2002, na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5769.3 -960.1 -189.1 891.2 7071.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.326e+02 4.115e+02 1.294 0.1962
## incumb 4.419e+03 4.782e+02 9.241 < 2e-16 ***
## spend_total 2.075e-01 1.166e-02 17.797 < 2e-16 ***
## senator -5.473e+02 6.519e+02 -0.840 0.4016
## genderf -5.023e+02 2.193e+02 -2.290 0.0225 *
## electorate 1.407e-04 5.279e-03 0.027 0.9787
## incumb:spend_total -1.074e-01 2.250e-02 -4.773 2.45e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1799 on 455 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.6675, Adjusted R-squared: 0.6631
## F-statistic: 152.2 on 6 and 455 DF, p-value: < 2.2e-16
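The RMSE of about 1799 can be interpreted as the typical size of the prediction error: on average, the model's predictions of votes1st miss the actual first-preference vote totals by roughly 1,800 votes.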
Drop the incumbency variable (which you hopefully included in your answer to 2.1!) and repeat steps 2.1–2.2. Compute a new RMSE and compare it to the previous one. Which model is the better predictor?
reg2_3 <- lm(votes1st ~ spend_total + senator + gender + electorate,
             data=dail2002, na.action=na.exclude)
computeRMSE(reg2_3)
## [1] 2061.288
summary(reg2_3)$r.squared
## [1] 0.5615657
The model with incumbency is a better predictor, with a lower RMSE and a higher \(R^2\).
Logistic regression for prediction.
Fit a logistic regression (hint: use glm()) to predict the outcome variable wonseat. Use any specification that you think provides a good prediction.
reg3_1 <- glm(wonseat ~ incumb:spend_total + senator + gender + electorate, data=dail2002, family=binomial, na.action=na.exclude)
For the full sample, compute: a table of actual wonseat by predicted wonseat, the percent correctly predicted, and the precision and recall.
# classify as a predicted winner when the linear predictor (log-odds) is positive, i.e. fitted probability > 0.5
wonseat_predicted <- ifelse(predict(reg3_1) > 0, 1, 0)
(predtable <- table(wonseat_predicted, dail2002$wonseat))
##
## wonseat_predicted 0 1
## 0 267 65
## 1 30 100
# percent correctly predicted: the sum of the diagonal of the proportion table
sum(diag(prop.table(predtable))) * 100
## [1] 79.43723
# define a general function to compute precision & recall
precrecall <- function(mytable, verbose = TRUE) {
    truePositives <- mytable[1, 1]
    falsePositives <- sum(mytable[1, ]) - truePositives
    falseNegatives <- sum(mytable[, 1]) - truePositives
    precision <- truePositives / (truePositives + falsePositives)
    recall <- truePositives / (truePositives + falseNegatives)
    if (verbose) {
        print(mytable)
        cat("\n precision =", round(precision, 2),
            "\n recall =", round(recall, 2), "\n")
    }
    invisible(c(precision, recall))
}
# compute precision and recall for winning a seat (not losing one!):
# reorder the table so the "won" cell is in the top-left [1,1] position
precrecall(predtable[2:1, 2:1])
##
## wonseat_predicted 1 0
## 1 100 30
## 0 65 267
##
## precision = 0.77
## recall = 0.61
Comparing two models.
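The loops below compare the two specifications using 8-fold cross-validation, scoring each held-out fold with the \(F_1\) measure, the harmonic mean of precision and recall (computed in the loop as 2 * prod(prrec) / sum(prrec)):
\[ F_1 = \frac{2 \cdot \textrm{precision} \cdot \textrm{recall}}{\textrm{precision} + \textrm{recall}} \]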
# here it helps to know that n = 464 and 464 %% 8 == 0
# (folds are assigned in row order; sample() could be used to randomize the assignment)
foldIndex <- rep(1:8, each = nrow(dail2002)/8)
# initialize vector to store F1 scores
f1result <- c()
# loop over folds
for (i in 1:8) {
    # fit on all folds except fold i
    glmresult <- glm(wonseat ~ incumb:spend_total + senator + gender + electorate,
                     na.action = na.exclude, family = "binomial",
                     data = subset(dail2002, foldIndex != i))
    # predict the held-out fold i and score it with F1
    wonseatPred <- ifelse(predict(glmresult, newdata = subset(dail2002, foldIndex == i)) > 0, 1, 0)
    predtable <- table(wonseatPred, subset(dail2002, foldIndex == i)$wonseat)
    prrec <- precrecall(predtable[2:1, 2:1], verbose = FALSE)
    f1result[i] <- 2 * prod(prrec) / sum(prrec)
}
f1result
## [1] 0.7619048 0.6315789 0.7222222 0.5945946 0.7567568 0.6470588 0.6486486
## [8] 0.6666667
mean(f1result)
## [1] 0.6786789
# repeat the cross-validation for the model without incumbency
f1result <- c()
# loop over folds
for (i in 1:8) {
    glmresult <- glm(wonseat ~ spend_total + senator + gender + electorate,
                     na.action = na.exclude, family = "binomial",
                     data = subset(dail2002, foldIndex != i))
    wonseatPred <- ifelse(predict(glmresult, newdata = subset(dail2002, foldIndex == i)) > 0, 1, 0)
    predtable <- table(wonseatPred, subset(dail2002, foldIndex == i)$wonseat)
    prrec <- precrecall(predtable[2:1, 2:1], verbose = FALSE)
    f1result[i] <- 2 * prod(prrec) / sum(prrec)
}
f1result
## [1] 0.5500000 0.6363636 0.6875000 0.7234043 0.7179487 0.6500000 0.5000000
## [8] 0.7317073
mean(f1result)
## [1] 0.6496155
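The specification with incumbency again performs better, with a mean \(F_1\) of 0.679 against 0.650 for the model without it.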
kNN prediction.
Fit a \(k=1\) kNN model using the same specification as in your logistic regression. Compare the percent correctly predicted. Which model worked better? (Note: you can use the whole sample here, so that the results are comparable to those from 3.1.)
# get the variables we want
dail2002sub <- subset(dail2002, select = c(spend_total, incumb, gender, electorate))
# convert the factor to numeric 0 or 1
dail2002sub$gender <- unclass(dail2002sub$gender) - 1
# figure out which are complete
completeCasesIndex <- complete.cases(dail2002sub)
# set the training and test datasets to be identical
train <- test <- dail2002sub[completeCasesIndex, ]
require(class)
## Loading required package: class
# make the true class a factor
wonseatf <- factor(dail2002$wonseat[completeCasesIndex], labels = c("No", "Yes"))
knnPred <- knn(train, test, wonseatf, k=1)
precrecall(table(knnPred, wonseatf)[2:1, 2:1])
## wonseatf
## knnPred Yes No
## Yes 165 0
## No 0 297
##
## precision = 1
## recall = 1
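Note that with \(k=1\) and identical training and test sets, each observation's nearest neighbour is itself, so the perfect precision and recall reflect pure memorization of the training data rather than evidence that kNN predicts better than the logistic regression.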
Experiment with two more settings of \(k\) to see how this affects prediction, reporting the percent correctly predicted.
# k = 2
precrecall(table(knnPred <- knn(train, test, wonseatf, k=2), wonseatf)[2:1, 2:1])
## wonseatf
## Yes No
## Yes 130 37
## No 35 260
##
## precision = 0.78
## recall = 0.79
# k = 3
precrecall(table(knnPred <- knn(train, test, wonseatf, k=3), wonseatf)[2:1, 2:1])
## wonseatf
## Yes No
## Yes 127 31
## No 38 266
##
## precision = 0.8
## recall = 0.77
# k = 4
precrecall(table(knnPred <- knn(train, test, wonseatf, k=4), wonseatf)[2:1, 2:1])
## wonseatf
## Yes No
## Yes 133 34
## No 32 263
##
## precision = 0.8
## recall = 0.81
# k = 4 again: knn() breaks ties at random, so repeated runs can differ
precrecall(table(knnPred <- knn(train, test, wonseatf, k=4), wonseatf)[2:1, 2:1])
## wonseatf
## Yes No
## Yes 125 42
## No 40 255
##
## precision = 0.75
## recall = 0.76
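The percent correctly predicted can also be computed directly from the kNN output; a minimal sketch, reusing the train, test, and wonseatf objects from above (results may vary slightly, since knn() breaks ties at random):
knnPred <- knn(train, test, wonseatf, k = 3)
mean(knnPred == wonseatf) * 100  # percent correctly predicted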
We can also plot precision against recall, ROC-curve style, to evaluate performance at each level of k. Note that this will not produce “curves”, because we are not fitting to different folds or datasets, just fitting once and recovering the predictions from the whole training set.
maxk <- 10
prresults <- data.frame(precision = NA, recall = NA)
for (k in 1:maxk) {
    prresults[k, ] <- precrecall(table(knn(train, test, wonseatf, k = k), wonseatf)[2:1, 2:1],
                                 verbose = FALSE)
}
with(prresults, plot(recall, precision, type="n", xlim=c(.5, 1), ylim=c(.5, 1)))
with(prresults, text(recall, precision, paste0("k=", 1:maxk)))