# Data Mining and Statistical Learning

2015, Trinity College, Dublin, Department of Political Science

**Instructor:**
Prof Kenneth Benoit, LSE

**Details:** Class meets MONDAYS in Feb-March from 14:00 – 16:30, with one exception on Day 2 (see below)

**Rooms:** See specific dates below.

**Note:** As the class proceeds, I will add resources (slides, R code, text datasets, problem sets) to each session below.

**Main Texts:**

- James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. *An Introduction to Statistical Learning*. Springer Science & Business Media.
- Lantz, Brett. 2013. *Machine Learning with R*. Packt Publishing Ltd.
- Zumel, N., & Mount, J. 2014. *Practical data science with R*. Shelter Island, NY: Manning.

**Detailed Schedule**

**Day 1
Working with data and data structures**

(Mon 9 Feb, 14:00-16:30, Room 201 Phoenix House)

- Datasets, databases, data formats, transforming and organizing data. Review of R data structures, SQL and alternatives.
- Required readings: Lantz Ch. 2; Zumel and Mount Ch. 2
- Recommended readings:
  - Introduction to SQL
  - Introduction to reshape2
  - A non-programmer’s introduction to JSON

- Exercise 1, due Wed Mar 25. Answer key here.

**Day 2
Rethinking regression as a predictive tool**
(Wed 25 Mar, 10:00-12:30, Arts Block 3025)

Revisiting prediction for the classical regression model, including logistic regression. Prediction v. association and causation.

Required Readings

- James et al, Chs 3-4
- Lantz, Ch. 6
- Recommended readings:
  - Conway, Drew, and John White. 2012. *Machine Learning for Hackers*. O’Reilly. Chapter 5, “Regression: Predicting Page Views”.
  - Zumel and Mount, Ch. 7

Exercises: None. Because of the short week, prediction methods will be rolled into the exercise for week 3.

**Day 3
Introduction to machine learning**
(Mon 2 Mar, 14:00-16:30, 206 Phoenix House)

Naive Bayes classifier, k-Nearest Neighbour, Support Vector Machines.

Readings:

- James et al, Ch 4
- Lantz, Chs 3-4
- Recommended readings:
  - Manning, Raghavan and Schütze (2008, Ch. 13)
  - Evans et al. (2007)
  - Statsoft, “Naive Bayes Classifier Introductory Overview,” http://www.statsoft.com/textbook/naive-bayes-classifier/.

Exercise 2: Machine Learning and Prediction. Answer set here.

**Day 4
Shrinkage methods**
(Mon 9 Mar, 14:00-16:30, 206 Phoenix House)

Ridge regression, the Lasso

Readings:

- James et al, Ch 6
- Recommended readings:
  - Conway, Drew, and John White. 2012. *Machine Learning for Hackers*. O’Reilly. Chapter 6, “Regularization: Text Regression”.

Exercise 3:

- Using the dail2002.dta dataset, select a random subset of 80% of the candidates, and then use stepwise methods to find a model that maximizes the variation explained in this training dataset. Then use that model to predict the outcome for the 20% that you left out, and report the root mean squared error (RMSE).
- Following the worked examples from James et al Ch. 6, do Problem 9 on p. 263 using the College dataset. You can get this from the “ISLR” package.
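
The hold-out workflow in the first part of Exercise 3 (random 80/20 split, fit on the training portion, report RMSE on the rest) can be sketched as follows. This is only an illustration under stated assumptions, not a solution: it uses synthetic data rather than dail2002.dta, it fits a single-predictor least-squares line instead of running stepwise selection, and it is written in plain Python rather than the R used in class. All function and variable names are hypothetical.

```python
import math
import random


def train_test_split(rows, train_frac=0.8, seed=42):
    """Shuffle a copy of `rows` and split it into training and held-out parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(train_frac * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def fit_simple_ols(train):
    """Return (intercept, slope) of the least-squares line through (x, y) pairs."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(model, rows):
    """Root mean squared prediction error of `model` on (x, y) pairs."""
    intercept, slope = model
    squared_errors = [(y - (intercept + slope * x)) ** 2 for x, y in rows]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Synthetic stand-in data (an assumption): y = 2 + 3x + standard normal noise.
noise = random.Random(0)
data = []
for _ in range(200):
    x = noise.uniform(0, 10)
    data.append((x, 2.0 + 3.0 * x + noise.gauss(0, 1)))

train, test = train_test_split(data)   # 80% to fit, 20% held out
model = fit_simple_ols(train)          # estimated on the training rows only
test_rmse = rmse(model, test)          # error measured on unseen rows

print(f"intercept={model[0]:.2f} slope={model[1]:.2f} held-out RMSE={test_rmse:.2f}")
```

The point of the design is that the held-out rows never influence the fit, so the reported RMSE estimates out-of-sample error rather than in-sample fit. In R the same steps would typically use `sample()` for the split, `lm()` and `step()` for fitting and selection, and `predict()` on the held-out rows.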

**Day 5
Unsupervised learning**
(Mon 16 Mar, 14:00-16:30, Áras an Phiarsaigh Room 2.04)

Principal components, clustering methods.

Readings:

- Review the last part of James et al, Ch 6 on principal components regression
- James et al, Ch 10
- Bond, Robert, and Solomon Messing. 2015. “Quantifying Social Media’s Political Space: Estimating Ideology From Publicly Revealed Preferences on Facebook.” *American Political Science Review* 109(01): 62–78.
- Weller, Susan C, and A Kimball Romney. 1990. *Metric Scaling: Correspondence Analysis*. Sage.

**Day 6
Working with text**
(Mon 23 Mar, 14:00-16:30, Room 201 Phoenix House)

Applications to textual data, with a focus on social media text mining from Twitter.

Readings:

- Grimmer, J, and B M Stewart. 2013. “Text as Data: the Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” *Political Analysis* 21(3): 267–97.
- Benoit, Kenneth and Alexander Herzog. In press. “Text Analysis: Estimating Policy Preferences From Written and Spoken Words.” In *Analytics, Policy and Governance*, eds. Jennifer Bachner, Kathryn Wagner Hill, and Benjamin Ginsberg.