This exercise covers using regression as a predictive tool, both for linear and logisitic regression, as well as simple machine learning using the Naive Bayes and kNN algorithms. You will use the same dataset as from Exercise 1, the `dail2002.dta`

data.

Please **submit this exercise by email** prior to the **start of class Monday, March 9**.

**Partitioning the dataset into “folds”.**For this exercise, we will be fitting a model from a subset of the data, and using the fitted model to predict the outcomes from the “left out” set and using that to evaluate RMSE or accuracy.- To start, familiarize yourself with the
`sample()`

command to draw a random sample of one-fifth of the observations in the`dail2002.dta`

dataset. Use a`.Random.seed`

and read about this function to see what it does. Why should you use this?

**This fixes the seed for the random number generator, and ensures that replications of your code produce the same answer.**- For the categorical predictions, we will also assess predictive ability based on “leave-one-out” testing and using “folds”, which are groups of observations that have been left out of the training set, and then attempting to predict them with accuracy. You will need to use indexing to partition the dataset to carry this out. You might want to write a loop for this. For instance, to partition a group of 20 observations into four sets, you could use the following
`for`

loop:

`n <- 20 k <- 4 data <- data.frame(myIndex = 1:n, letter = LETTERS[1:n]) size <- n/k if (n %% k) stop("n not divisible by k") for (i in 1:k) { startIndex <- 1 + (i-1)*size endIndex <- startIndex + size - 1 cat(startIndex, endIndex, "\n") print(data[startIndex : endIndex, ]) }`

`## 1 5 ## myIndex letter ## 1 1 A ## 2 2 B ## 3 3 C ## 4 4 D ## 5 5 E ## 6 10 ## myIndex letter ## 6 6 F ## 7 7 G ## 8 8 H ## 9 9 I ## 10 10 J ## 11 15 ## myIndex letter ## 11 11 K ## 12 12 L ## 13 13 M ## 14 14 N ## 15 15 O ## 16 20 ## myIndex letter ## 16 16 P ## 17 17 Q ## 18 18 R ## 19 19 S ## 20 20 T`

What is the purpose of the line

`if (n %% 4)`

?**The**`%%`

is a modulo operator, which yields the remainder following division of the first argument by the second. The operation here ensures that the sample size is divisible by`k`

, since the`if`

statement will only evaluate to true if`(n %% k)`

evaluates to a non-zero value.- To start, familiarize yourself with the
**OLS regression for prediction.**Fit a regression from the dataset to predict

`votes1st`

. You may use any combination of regressors that you wish. Save the model object to`reg2_1`

.`require(foreign)`

`## Loading required package: foreign`

`dail2002 <- read.dta("http://www.kenbenoit.net/files/dail2002.dta") reg2_1 <- lm(votes1st ~ incumb:spend_total + senator + gender + electorate, data=dail2002)`

Predict the

`votes1st`

from the same sample to which you fitted the regression. What is the Root Mean Squared Error (RMSE) and how would you interpret this?`length(predict(reg2_1)) # shortened to 462 because of the two missing cases`

`## [1] 462`

`# refit model with different behaviour for na.action reg2_1 <- lm(votes1st ~ incumb*spend_total + senator + gender + electorate, data=dail2002, na.action=na.exclude) reg2_1pred <- predict(reg2_1) length(reg2_1pred) # now includes missing cases`

`## [1] 464`

`sum(is.na(reg2_1pred))`

`## [1] 2`

`# function to compute RMSE computeRMSE <- function(lmobject) { sqrt(sum(residuals(lmobject)^2, na.rm=TRUE) / df.residual(lmobject)) } computeRMSE(reg2_1)`

`## [1] 1799.031`

`# compare to output from summary() summary(reg2_1)`

`## ## Call: ## lm(formula = votes1st ~ incumb * spend_total + senator + gender + ## electorate, data = dail2002, na.action = na.exclude) ## ## Residuals: ## Min 1Q Median 3Q Max ## -5769.3 -960.1 -189.1 891.2 7071.7 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 5.326e+02 4.115e+02 1.294 0.1962 ## incumb 4.419e+03 4.782e+02 9.241 < 2e-16 *** ## spend_total 2.075e-01 1.166e-02 17.797 < 2e-16 *** ## senator -5.473e+02 6.519e+02 -0.840 0.4016 ## genderf -5.023e+02 2.193e+02 -2.290 0.0225 * ## electorate 1.407e-04 5.279e-03 0.027 0.9787 ## incumb:spend_total -1.074e-01 2.250e-02 -4.773 2.45e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1799 on 455 degrees of freedom ## (2 observations deleted due to missingness) ## Multiple R-squared: 0.6675, Adjusted R-squared: 0.6631 ## F-statistic: 152.2 on 6 and 455 DF, p-value: < 2.2e-16`

Drop the incumbency variable – that you hopefully included in your answer to 2.1! – and repeat steps 2.1–2.2. Compute a new RMSE and compare this to the previous one. Which is a better predictor?

`reg2_3 <- lm(votes1st ~ spend_total + senator + gender + electorate, data=dail2002, na.action=na.exclude) computeRMSE(reg2_3)`

`## [1] 2061.288`

`summary(reg2_3)$r.squared`

`## [1] 0.5615657`

**The model with incumbency is a better predictor, with a lower RMSE and a higher \(R^2\).**

**Logistic regression for prediction**.Fit a logistic regression (hint: use

`glm()`

) to predict the outcome variable`wonseat`

. Use any specification that you think provides a good prediction.`reg3_1 <- glm(wonseat ~ incumb:spend_total + senator + gender + electorate, data=dail2002, family=binomial, na.action=na.exclude)`

For the full sample, compute:

- a table of actual
`wonseat`

by predicted`wonseat`

`wonseat_predicted <- ifelse(predict(reg3_1) > 0, 1, 0) (predtable <- table(wonseat_predicted, dail2002$wonseat))`

`## ## wonseat_predicted 0 1 ## 0 267 65 ## 1 30 100`

- percent correctly predicted

`sum(diag(prop.table(predtable))) * 100`

`## [1] 79.43723`

- precision
- recall

`# define a general function to compute precision & recall precrecall <- function(mytable, verbose=TRUE) { truePositives <- mytable[1,1] falsePositives <- sum(mytable[1,]) - truePositives falseNegatives <- sum(mytable[,1]) - truePositives precision <- truePositives / (truePositives + falsePositives) recall <- truePositives / (truePositives + falseNegatives) if (verbose) { print(mytable) cat("\n precision =", round(precision, 2), "\n recall =", round(recall, 2), "\n") } invisible(c(precision, recall)) } # compute precision of wonseat (not lost seat!) precrecall(predtable[2:1, 2:1])`

`## ## wonseat_predicted 1 0 ## 1 100 30 ## 0 65 267 ## ## precision = 0.77 ## recall = 0.61`

- a table of actual
Comparing two models.

- Compute an 8-fold validation, where for 8 different training sets consisting of 7/8 of the observations, you predict the other held-out 1/8 and compare the actual to predicted for the 1/8 test set. Compute an average F1 score for the 8 models.

`# here it helps to know that n=464 and 464 %% 8 == 0 foldIndex <- rep(1:8, each = nrow(dail2002)/8) # initialize vector to store F1 scores f1result <- c() # loop over folds for (i in 1:8) { glmresult <- glm(wonseat ~ incumb:spend_total + senator + gender + electorate, na.action = na.exclude, family = "binomial", data = subset(dail2002, foldIndex != i)) wonseatPred <- ifelse(predict(glmresult, newdata = subset(dail2002, foldIndex == i)) > 0, 1, 0) predtable <- table(wonseatPred, subset(dail2002, foldIndex == i)$wonseat) prrec <- precrecall(predtable[2:1, 2:1], verbose=FALSE) f1result[i] <- 2 * prod(prrec) / sum(prrec) } f1result`

`## [1] 0.7619048 0.6315789 0.7222222 0.5945946 0.7567568 0.6470588 0.6486486 ## [8] 0.6666667`

`mean(f1result)`

`## [1] 0.6786789`

- Now drop a variable or two, and repeat the previous step to compare the average F1 score for this model.

`f1result <- c() # loop over folds for (i in 1:8) { glmresult <- glm(wonseat ~ spend_total + senator + gender + electorate, na.action = na.exclude, family = "binomial", data = subset(dail2002, foldIndex != i)) wonseatPred <- ifelse(predict(glmresult, newdata = subset(dail2002, foldIndex == i)) > 0, 1, 0) predtable <- table(wonseatPred, subset(dail2002, foldIndex == i)$wonseat) prrec <- precrecall(predtable[2:1, 2:1], verbose=FALSE) f1result[i] <- 2 * prod(prrec) / sum(prrec) } f1result`

`## [1] 0.5500000 0.6363636 0.6875000 0.7234043 0.7179487 0.6500000 0.5000000 ## [8] 0.7317073`

`mean(f1result)`

`## [1] 0.6496155`

- Why is it valuable to use the different folds here, rather than simply comparing the F1 score for the predicted outcome of the entire sample, fit to the entire sample?
**Because fitting on the whole sample can lead to overfitting. For calibrating a predictive model we need to test it out of sample.**

**kNN prediction.**Fit a \(k=1\) kNN model to predict the same specification as in your logistic regression. Compare the percent correctly predicted. Which model worked better? (Note: You can use the whole sample here, so that this will compare the results to those from 3.1).

`# get the variables we want dail2002sub <- subset(dail2002, select = c(spend_total, incumb, gender, electorate)) # convert the factor to numeric 0 or 1 dail2002sub$gender <- unclass(dail2002sub$gender) - 1 # figure out which are complete completeCasesIndex <- complete.cases(dail2002sub) # set the training and test datasets to be identical train <- test <- dail2002sub[completeCasesIndex, ] require(class)`

`## Loading required package: class`

`# make the true class a factor wonseatf <- factor(dail2002$wonseat[completeCasesIndex], labels = c("No", "Yes")) knnPred <- knn(train, test, wonseatf, k=1) precrecall(table(knnPred, wonseatf)[2:1, 2:1])`

`## wonseatf ## knnPred Yes No ## Yes 165 0 ## No 0 297 ## ## precision = 1 ## recall = 1`

Experiment with two more settings of \(k\) to see how this affects prediction, reporting the percent correctly predicted.

`# k = 2 precrecall(table(knnPred <- knn(train, test, wonseatf, k=2), wonseatf)[2:1, 2:1])`

`## wonseatf ## Yes No ## Yes 130 37 ## No 35 260 ## ## precision = 0.78 ## recall = 0.79`

`# k = 3 precrecall(table(knnPred <- knn(train, test, wonseatf, k=2), wonseatf)[2:1, 2:1])`

`## wonseatf ## Yes No ## Yes 127 31 ## No 38 266 ## ## precision = 0.8 ## recall = 0.77`

`# k = 4 precrecall(table(knnPred <- knn(train, test, wonseatf, k=2), wonseatf)[2:1, 2:1])`

`## wonseatf ## Yes No ## Yes 133 34 ## No 32 263 ## ## precision = 0.8 ## recall = 0.81`

`# k = 4 precrecall(table(knnPred <- knn(train, test, wonseatf, k=2), wonseatf)[2:1, 2:1])`

`## wonseatf ## Yes No ## Yes 125 42 ## No 40 255 ## ## precision = 0.75 ## recall = 0.76`

We can also plot precision by recall to get an ROC curve-type plot, to evaluate performance at each level of

`k`

. Note that this will not produce “curves” because we are not fitting to different folds or datasets, just once to recover the predictions from the whole training set.`maxk <- 10 prresults <- data.frame(precision = NA, recall = NA) for (k in 1:maxk) { prresults[k, ] <- precrecall(table(knnPred <- knn(train, test, wonseatf, k=k), wonseatf)[2:1, 2:1], verbose=FALSE) } with(prresults, plot(recall, precision, type="n", xlim=c(.5, 1), ylim=c(.5, 1))) with(prresults, text(recall, precision, paste0("k=", 1:maxk)))`