The Quantitative Analysis of Textual Data (NYU Fall 2014)
Sponsored by: NYU Department of Politics 2014
Instructor: Prof Kenneth Benoit, LSE
Details: Class meets TUESDAYS 10:00 – 11:50 in Room 217
Note: As the class proceeds, I will add resources (slides, R code, text datasets, problem sets) to each session below.
Day 0 (16 Sept): Course overview and introduction to the quanteda R package
Day 1 (23 Sept): Quantitative text analysis fundamentals
- example code (revised 26 September)
- Exercise 1
Day 2 (30 Sept): Descriptive Statistical Methods for Texts
- example code, plus text data for the Russian texts, the Spanish tweets, and the Excel example.
- Exercise 2 (on the way)
- Additional recommended readings about lexical diversity (see the Dropbox link for the file, or get it from the link here if you have access):
- Labbé, Cyril, Dominique Labbé, and Pierre Hubert. 2004. “Automatic Segmentation of Texts and Corpora.” Journal of Quantitative Linguistics 11(3): 193–213. I wanted to recreate their Figure 8 and perform some tests, so I located the corpus files at the Oxford Text Archive, http://www.ota.ox.ac.uk/desc/2466 (kudos to Labbé et al. for making these available, and with such good documentation). I wrote make_deGaulle.R, which uses files from the corpus as noted, plus a dataset I created from their notes, deGaulleData.csv. If you are looking for a tutorial on how to construct a corpus, this is a good example, and I have commented the code extensively.
- Additional recommended readings about collocations:
- Manning, Christopher D, and Hinrich Schütze. 2000. Foundations of Statistical Natural Language Processing. Cambridge, Mass: MIT Press. Ch. 5, “Collocations”.
- Bautin, Mikhail, and Michael Hart. “Significant Phrases Detection.”
- Pecina, Pavel. 2005. “An Extensive Empirical Study of Collocation Extraction Methods.” In Association for Computational Linguistics.
Day 3 (7 Oct): Quantitative methods for comparing texts
Day 4 (21 Oct): Dictionary Methods
- A useful regular expressions “cheat sheet”
- Additional recommended readings about dictionaries:
- Graham, Jesse, Jonathan Haidt, and Brian A Nosek. 2009. “Liberals and Conservatives Rely on Different Sets of Moral Foundations.” Journal of Personality and Social Psychology 96(5): 1029–46.
- The LIWC-formatted dictionary for this article is available here.
Day 5 (28 Oct): Classifiers and supervised scaling
- example code
- Exercise pending
- Additional recommended readings about supervised scaling methods:
- Lowe, Will. 2013. “There’s (Basically) Only One Way to Do it.” August 30. Available at SSRN: http://ssrn.com/abstract=2318543 or http://dx.doi.org/10.2139/ssrn.2318543
Day 6 (4 Nov): Unsupervised scaling models for text
Day 7 (18 Nov): Clustering and topic models
Day 8 (2 Dec): Mining social media
- Demonstration of text cleaning and crowd-sourcing (Ken Benoit)
- Database structures (Jonathan Ronen)
- SQL notes (Jonathan Ronen)
- SQL dump of some tweets (Jonathan Ronen)
- Social media (Pablo Barberá)
- code: Social media (Pablo Barberá)