Exercise 4: Supervised and Unsupervised Scaling for Text

Note: There is a bug in the CRAN version (quanteda 0.9.4) of the wordfish textmodel() code that causes a harmless warning message to be printed. You are encouraged (as always!) to install the GitHub version, where this is now fixed.

Reproduce the Laver, Benoit and Garry (2003) scores of the UK manifestos, using Wordscores. You can load and select them in the same way as the code used to make the populism dictionary.
1. [12 points] Score the reference text words using textmodel() to implement Wordscores, with the reference value set as in Figure 1 of LBG (2003) for the three 1992 texts. Extract the word scores for “drugs”, “secure”, and “poverty”, and compare these to the values in Figure 1 of LBG (2003). Are the scores the same? Hint: Use str() to examine the wordscores fitted object, and address the object inside using this
2. [8 points] Now get the “text scores” for the 1997 texts, using the LBG transformation. Compare the results to the article. Did you get the same values?
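
The Wordscores steps used in the two questions above can be sketched in base R on a toy count matrix. This is a minimal illustration of the algorithm as described in LBG (2003), not quanteda's textmodel() code: the matrices, text names, and reference values are made up, and using sd() (the sample standard deviation) in the rescaling step is an assumption.

```r
# Toy counts (hypothetical data): rows are texts, columns are words.
ref <- matrix(c(10, 5, 0,
                 2, 5, 8),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("refA", "refB"), c("w1", "w2", "w3")))
ref_scores <- c(refA = -1, refB = 1)   # a priori reference positions

# Relative frequency of each word within each reference text
F_wr <- ref / rowSums(ref)

# Probability of reading each reference text, given the word
P_wr <- sweep(F_wr, 2, colSums(F_wr), "/")

# Word score: reference positions weighted by those probabilities
S_w <- drop(ref_scores %*% P_wr)

# "Virgin" texts to be scored (here they share the reference vocabulary;
# words absent from the reference texts would have to be dropped first)
virgin <- matrix(c(6, 3, 1,
                   1, 3, 6),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("v1", "v2"), c("w1", "w2", "w3")))
F_wv <- virgin / rowSums(virgin)

# Raw text score: mean word score, weighted by relative word frequency
raw <- drop(F_wv %*% S_w)

# LBG transformation: stretch the raw scores so that their spread
# matches the spread of the reference scores, keeping their mean
rescaled <- (raw - mean(raw)) * (sd(ref_scores) / sd(raw)) + mean(raw)

print(S_w)
print(raw)
print(rescaled)
```

The raw text scores bunch together near the middle of the scale because most words appear in both reference texts; the rescaling step exists to undo that shrinkage.
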
Scaling movie reviews. Here we will return to the movie reviews from Exercise 5.
1. [5 points] Load the movies dataset from quantedaData.
2. [5 points] Taking a random sample of 500 of the movie reviews as your “reference” texts, set the ones that are positive to a reference value of +1, and the negative reviews to a value of -1.
3. [10 points] Score the remaining movie reviews, and predict their “positive-negative” rating using Wordscores.
4. [10 points] From the results of step 3, compare the values using boxplot() for the categories of their rater-assigned positivity or negativity. Describe the resulting pattern.
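
The sampling and plotting steps above might look roughly like this in base R. Everything here is a stand-in: the label vector, the corpus size of 2000, and the "predicted" scores (pure random noise, carrying no information) are assumptions made only to show the mechanics, and marking unscored texts with NA is one possible convention rather than a documented requirement.

```r
set.seed(123)   # make the random sample reproducible

# Hypothetical rater-assigned labels for 2000 reviews
n_docs <- 2000
sentiment <- rep(c("neg", "pos"), each = n_docs / 2)

# Draw 500 reference texts; every other text stays NA (unscored)
ref_idx <- sample(n_docs, 500)
ref_scores <- rep(NA_real_, n_docs)
ref_scores[ref_idx] <- ifelse(sentiment[ref_idx] == "pos", 1, -1)

# Placeholder for the predicted text scores of the remaining reviews;
# in the exercise these come from the fitted Wordscores model
virgin <- is.na(ref_scores)
pred <- rnorm(sum(virgin))

# Compare the score distributions across the rater-assigned categories
group <- factor(sentiment[virgin])
boxplot(pred ~ group,
        xlab = "Rater-assigned category",
        ylab = "Predicted text score")
```
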
2010 Irish budget corpus scaling. Use the ie2010Corpus in quanteda for this.
1. [6 points] Score all of the texts (including the reference texts) using two reference texts: the first set to +1, and the second set to -1. This involves first fitting the wordscores model, then predicting the text score for all texts (including the reference texts). Print the object and show this output. Assign the fitted object to wsFit, and the predicted object to wsPred.
2. [6 points] Fit a wordfish model to the same texts, and call this wfFit. Print the object and show the output.
3. [6 points] Plot the results of the Wordscores “text score” for each text against the wordfish theta-hat for each text. Describe the differences.
4. [6 points] Plot the results of the Wordscores “word score” for each word (stored inside wsFit) and compare them to the beta-hat values in wfFit. How different are they?
5. [6 points] Plot the wordfish “Eiffel Tower” plot (as in Figure 2 of Slapin and Proksch 2008), from the objects in wfFit. (3 points extra credit if you can plot the words instead of points.)
6. [5 points] Plot the log of the length in tokens of each text against the alpha-hat from wfFit. You can get the length using ntoken() on the dfm, to ensure that you get the same values as in your fitted textmodel. What does the relationship indicate?
7. Extra credit (5 points): Plot the log of the frequency of the top 1000 words against the psi-hat values from wfFit, and describe the relationship.
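
For reference, the parameters named in the questions above come from the wordfish model of Slapin and Proksch (2008), written here with theta for the document position, as this exercise does. The count of word j in document i is modelled as

$$
y_{ij} \sim \text{Poisson}(\lambda_{ij}), \qquad \log \lambda_{ij} = \alpha_i + \psi_j + \beta_j \theta_i
$$

where alpha_i is a document fixed effect, psi_j a word fixed effect, beta_j a word weight, and theta_i the document position. The layout of the word-level plot, with word labels drawn in place of points, can be sketched in base R as below. The beta and psi vectors are random stand-ins, not estimates; in the exercise they would be extracted from wfFit.

```r
set.seed(1)

# Stand-in word parameters (hypothetical values, not fitted estimates)
words <- paste0("word", 1:50)
beta  <- rnorm(50)   # plays the role of the beta-hat word weights
psi   <- rnorm(50)   # plays the role of the psi-hat word fixed effects

# Set up empty axes, then draw each word at its (beta, psi) position
plot(beta, psi, type = "n",
     xlab = "Word weight (beta-hat)",
     ylab = "Word fixed effect (psi-hat)")
text(beta, psi, labels = words, cex = 0.7)
```
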
Fit the correspondence analysis model to the ie2010Corpus dfm.

[15 points] Call the fitted model caFit, and compare its word scaled values from the first dimension to the beta-hats from wfFit. Hint: you can get the CA word scaled values from the first column of caFit$colcoord.
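
To see where first-dimension word values of this kind come from, here is a minimal base R sketch of simple correspondence analysis computed through a singular value decomposition. It uses a made-up count matrix and is not the code behind caFit; treating the word values as "standard coordinates" is an assumption about the scaling convention.

```r
# Toy document-by-word counts (hypothetical data)
N <- matrix(c(10, 5, 0,
               2, 5, 8,
               6, 3, 1,
               1, 3, 6),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4), c("w1", "w2", "w3")))

P  <- N / sum(N)      # correspondence matrix
r  <- rowSums(P)      # row (document) masses
cm <- colSums(P)      # column (word) masses

# Standardized residuals from the independence model
S <- diag(1 / sqrt(r)) %*% (P - outer(r, cm)) %*% diag(1 / sqrt(cm))

# The SVD gives the dimensions; rescale the right singular vectors
# by the word masses to get standard column coordinates
dec <- svd(S)
col_std <- diag(1 / sqrt(cm)) %*% dec$v

# First-dimension value for each word
word_dim1 <- col_std[, 1]
names(word_dim1) <- colnames(N)
print(word_dim1)
```

The sign of each dimension from an SVD is arbitrary, so a scale like this may come out flipped relative to another scaling of the same words.
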
Not graded, but appreciated if you have interest and time:

What extractor functions would you like to have available, in order to make it easier to get the word and document values from fitted and predicted textmodel objects? For a hint, explore some of the methods defined for lm and summary.lm objects, using e.g. methods(class = "lm").

One of the tricky design issues here is that these do not all have the same format. For instance, word parameters for a CA model have multiple dimensions, while those for Wordscores and wordfish models have only one. Some also have standard errors, while others do not.
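
As a starting point for that exploration, the snippet below fits a small linear model to R's built-in cars data and calls a few of the standard extractors. It only illustrates the lm conventions mentioned above; which of these would suit textmodel objects is left open.

```r
# Fit a simple linear model on a dataset that ships with R
fit <- lm(dist ~ speed, data = cars)

# List the methods defined for objects of class "lm"
print(methods(class = "lm"))

# Common extractors: parameters, fitted values, residuals
print(coef(fit))
print(head(fitted(fit)))
print(head(residuals(fit)))

# The summary object has its own coef() method, which returns the
# estimates together with their standard errors in one matrix
print(coef(summary(fit)))
```
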

Ken Benoit

18 March 2016