This exercise covers using regression as a predictive tool, for both linear and logistic regression, as well as simple machine learning using the Naive Bayes and kNN algorithms. You will use the same dataset as in Exercise 1, the `dail2002.dta` data.
Please submit this exercise by email prior to the start of class Monday, March 9.
Partitioning the dataset into “folds”. For this exercise, we will fit a model to a subset of the data, use the fitted model to predict the outcomes for the “left out” set, and use those predictions to evaluate RMSE or accuracy.
To start, familiarize yourself with the `sample()` command and use it to draw a random sample of one-fifth of the observations in the `dail2002.dta` dataset. Before sampling, call `set.seed()` (which initializes `.Random.seed`) and read about this function to see what it does. Why should you use this?
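A minimal sketch of this step, assuming the data are read into an object called `dail` (the object name, file path, and seed value are all illustrative):

library(foreign)                 # read.dta() reads Stata .dta files
dail <- read.dta("dail2002.dta") # adjust the path as needed
set.seed(1234)                   # fixing the seed makes the random draw reproducible
testIndex <- sample(nrow(dail), size = round(nrow(dail) / 5))
test  <- dail[testIndex, ]       # the one-fifth held-out sample
train <- dail[-testIndex, ]      # the remaining four-fifths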
For the categorical predictions, we will also assess predictive ability based on “leave-one-out” testing using “folds”: groups of observations that are left out of the training set, which we then attempt to predict accurately. You will need to use indexing to partition the dataset to carry this out. You might want to write a loop for this. For instance, to partition a group of 20 observations into four sets, you could use the following for loop:
n <- 20                                 # number of observations
k <- 4                                  # number of folds
data <- data.frame(myIndex = 1:n, letter = LETTERS[1:n])
size <- n / k                           # observations per fold
if (n %% 4)
    stop("n not divisible by k")
for (i in 1:k) {
    startIndex <- 1 + (i - 1) * size    # first row of fold i
    endIndex <- startIndex + size - 1   # last row of fold i
    cat(startIndex, endIndex, "\n")
    print(data[startIndex:endIndex, ])
}
## 1 5
## myIndex letter
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
## 6 10
## myIndex letter
## 6 6 F
## 7 7 G
## 8 8 H
## 9 9 I
## 10 10 J
## 11 15
## myIndex letter
## 11 11 K
## 12 12 L
## 13 13 M
## 14 14 N
## 15 15 O
## 16 20
## myIndex letter
## 16 16 P
## 17 17 Q
## 18 18 R
## 19 19 S
## 20 20 T
What is the purpose of the line `if (n %% 4)`?
OLS regression for prediction.
Fit a regression from the dataset to predict `votes1st`. You may use any combination of regressors that you wish. Save the model object to `reg2_1`.
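A minimal sketch of such a fit, treating the regressors `spend_total` and `incumb` as placeholders for whatever specification you choose:

reg2_1 <- lm(votes1st ~ spend_total + incumb, data = dail)  # illustrative regressors
summary(reg2_1)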
Predict `votes1st` from the same sample to which you fitted the regression. What is the Root Mean Squared Error (RMSE), and how would you interpret it?
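One way to compute the in-sample RMSE, continuing the sketch above, is from the model's own residuals:

yhat <- predict(reg2_1)                  # predictions for the fitted sample
rmse <- sqrt(mean(residuals(reg2_1)^2))  # in-sample root mean squared error
rmse

Since the RMSE is in the same units as `votes1st`, it can be read as a typical prediction error in first-preference votes.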
Drop the incumbency variable (which you hopefully included in your answer to 2.1!) and repeat steps 2.1–2.2. Compute a new RMSE and compare it to the previous one. Which model is the better predictor?
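A convenient way to drop a term and refit, assuming the incumbency variable is named `incumb` (check the actual name with `names(dail)`):

reg2_2 <- update(reg2_1, . ~ . - incumb)  # same model minus incumbency
sqrt(mean(residuals(reg2_2)^2))           # RMSE for the reduced model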
Logistic regression for prediction.
Fit a logistic regression (hint: use `glm()`) to predict the outcome variable `wonseat`. Use any specification that you think provides a good prediction.
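A minimal sketch, again with placeholder regressors:

logit3_1 <- glm(wonseat ~ spend_total + incumb, data = dail, family = binomial)
summary(logit3_1)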
For the full sample, compute a table (confusion matrix) of actual `wonseat` by predicted `wonseat`.
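A sketch of that table, classifying a predicted probability above 0.5 as a predicted seat win:

probs <- predict(logit3_1, type = "response")    # fitted probabilities
predWon <- as.numeric(probs > 0.5)               # 1 = predicted to win a seat
table(actual = logit3_1$y, predicted = predWon)  # confusion matrix

(Using `logit3_1$y` rather than `dail$wonseat` avoids a length mismatch if any rows were dropped for missing values.)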
Comparing two models.
Compute an 8-fold validation: for each of 8 different training sets consisting of 7/8 of the observations, predict the held-out 1/8 and compare actual to predicted outcomes for that test set, as sketched below. Compute the average F1 score across the 8 models.
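A sketch of the mechanics, keeping the placeholder specification from above and restricting to complete cases so every fold can be predicted (the variable names are assumptions):

vars <- c("wonseat", "spend_total", "incumb")    # placeholder variable names
cc <- na.omit(dail[, vars])                      # complete cases only
k <- 8
folds <- sample(rep(1:k, length.out = nrow(cc))) # random fold assignment
f1 <- numeric(k)
for (i in 1:k) {
    fit <- glm(wonseat ~ spend_total + incumb, data = cc[folds != i, ],
               family = binomial)
    probs <- predict(fit, newdata = cc[folds == i, ], type = "response")
    pred <- as.numeric(probs > 0.5)
    actual <- cc$wonseat[folds == i]
    precision <- sum(pred == 1 & actual == 1) / sum(pred == 1)
    recall <- sum(pred == 1 & actual == 1) / sum(actual == 1)
    f1[i] <- 2 * precision * recall / (precision + recall)
}
mean(f1)                                         # average F1 across the 8 folds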
Now drop a variable or two, and repeat the previous step to compare the average F1 score for this model.
Why is it valuable to use the different folds here, rather than simply comparing the F1 score for the predicted outcome of the entire sample, fit to the entire sample?
kNN prediction.
Fit a \(k=1\) kNN model using the same specification as in your logistic regression. Compare the percent correctly predicted. Which model worked better? (Note: you can use the whole sample here, so that the results are comparable to those from 3.1.)
Experiment with two more settings of \(k\) to see how this affects prediction, reporting the percent correctly predicted.
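A sketch using `knn()` from the class package, with the same placeholder predictors (standardizing them first, since kNN is distance-based):

library(class)
X <- scale(cc[, c("spend_total", "incumb")])  # placeholder predictors, standardized
pred1 <- knn(train = X, test = X, cl = factor(cc$wonseat), k = 1)
mean(pred1 == cc$wonseat)                     # percent correct; with test == train,
                                              # k = 1 simply memorizes the sample
for (kval in c(3, 7)) {                       # two illustrative alternative settings
    predk <- knn(X, X, cl = factor(cc$wonseat), k = kval)
    cat("k =", kval, ": percent correct =", mean(predk == cc$wonseat), "\n")
}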