Data Mining and Statistical Learning
2015, Trinity College, Dublin, Department of Political Science
Instructor: Prof Kenneth Benoit, LSE
Details: Class meets on Mondays in February and March, 14:00–16:30, with one exception on Day 2 (see below).
Rooms: See specific dates below.
Note: As the class proceeds, I will add resources (slides, R code, text datasets, problem sets) to each session below.
Main Texts:

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Springer Science & Business Media.

Lantz, Brett. 2013. Machine Learning with R. Packt Publishing Ltd.

Zumel, Nina, and John Mount. 2014. Practical Data Science with R. Manning.
Detailed Schedule
Day 1 Working with data and data structures
(Mon 9 Feb, 14:00–16:30, Room 201 Phoenix House)
 Datasets, databases, data formats, transforming and organizing data. Review of R data structures, SQL and alternatives.
 Required readings: Lantz Ch. 2; Zumel and Mount Ch. 2
 Recommended readings:
 Exercise 1, due Wed Mar 25. Answer key here.
Day 2 Rethinking regression as a predictive tool (Wed 25 Mar, 10:00–12:30, Arts Block 3025)

Revisiting prediction for the classical regression model, including logistic regression. Prediction v. association and causation.
 Required Readings
 James et al, Chs 3–4
 Lantz, Ch. 6
 Recommended readings:
 Conway, Drew, and John White. 2012. Machine Learning for Hackers. O’Reilly. Chapter 5, “Regression: Predicting Page Views”.
 Zumel and Mount, Ch. 7
 Exercises: None. Because of the short week, prediction methods will be rolled into the exercise for week 3.
Day 3 Introduction to machine learning (Mon 2 Mar, 14:00–16:30, 206 Phoenix House)

Naive Bayes classifier, k-Nearest Neighbour, Support Vector Machines.
 Readings:
 James et al, Ch 4
 Lantz, Chs 3–4
 Recommended Readings:
 Manning, Raghavan and Schütze (2008, Ch. 13)
 Evans et al. (2007)
 Statsoft, “Naive Bayes Classifier Introductory Overview,” http://www.statsoft.com/textbook/naive-bayes-classifier/.
 Exercise 2: Machine Learning and Prediction. Answer set here.
Day 4 Shrinkage methods (Mon 9 Mar, 14:00–16:30, 206 Phoenix House)

Ridge regression, the Lasso
 Readings:
 James et al, Ch 6
 Recommended readings:
 Conway, Drew, and John White. 2012. Machine Learning for Hackers. O’Reilly. Chapter 6, “Regularization: Text Regression”.
 Exercise 3:
 Using the dail2002.dta dataset, select a random subset of 80% of the candidates, and then use stepwise methods to find a model that maximizes the variation explained in this training dataset. Then use that model to predict the outcome for the 20% that you left out, and report the RMSE.
 Following the worked examples from James et al Ch. 6, do Problem 9 from p. 263 using the College dataset. You can get this from the “ISLR” package.
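 The hold-out step in the first part of Exercise 3 (random 80/20 split, fit on the training part, RMSE on the rest) could be sketched roughly as below. This is only an illustration under loud assumptions: it is written in Python rather than the R used in the course, it uses synthetic data instead of dail2002.dta, and it fits a single-predictor least-squares line rather than a stepwise-selected model. All function names are hypothetical.

```python
import math
import random


def train_test_split(rows, train_frac=0.8, seed=42):
    """Shuffle a copy of rows and split it into training and held-out parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(train_frac * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def fit_simple_ols(rows):
    """Least-squares intercept and slope for a list of (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def rmse(rows, intercept, slope):
    """Root mean squared prediction error on a list of (x, y) pairs."""
    squared_errors = [(y - (intercept + slope * x)) ** 2 for x, y in rows]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Synthetic stand-in data (an assumption, NOT dail2002.dta): y = 2 + 3x + noise
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    data.append((x, 2.0 + 3.0 * x + rng.gauss(0, 1)))

train, test = train_test_split(data)        # 80% training, 20% held out
intercept, slope = fit_simple_ols(train)    # fit on the training part only
test_rmse = rmse(test, intercept, slope)    # evaluate on the held-out part
print(f"intercept={intercept:.2f} slope={slope:.2f} held-out RMSE={test_rmse:.2f}")
```

 The point of the design is that the model never sees the held-out 20% during fitting, so the reported RMSE estimates out-of-sample error rather than in-sample fit.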
Day 5 Unsupervised learning (Mon 16 Mar, 14:00–16:30, Aras an Phiarsaigh, Room 2.04)

Principal components, clustering methods.

Review the last part of James et al, Ch 6, on principal components regression

James et al Ch 10

Bond, Robert, and Solomon Messing. 2015. “Quantifying Social Media’s Political Space: Estimating Ideology From Publicly Revealed Preferences on Facebook.” American Political Science Review 109(01): 62–78.

Weller, Susan C, and A Kimball Romney. 1990. Metric Scaling: Correspondence Analysis. Sage.
Day 6 Working with text (Mon 23 Mar, 14:00–16:30, Room 201 Phoenix House)

Applications to textual data, with a focus on social media text mining from Twitter.
 Readings:
 Grimmer, J, and B M Stewart. 2013. “Text as Data: the Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21(3): 267–97.
 Benoit, Kenneth and Alexander Herzog. In press. “Text Analysis: Estimating Policy Preferences From Written and Spoken Words.” In Analytics, Policy and Governance, eds. Jennifer Bachner, Kathryn Wagner Hill, and Benjamin Ginsberg.
 Exercise 4: Working with Textual Data.