---
title: "Exercise 3"
author: "Ken Benoit"
date: "11 March 2016"
output: html_document
---


### Working with dictionaries

This exercise covers the material from Day 3, for working with dictionaries using [**quanteda**](http://github.com/kbenoit/quanteda). 

1.  **Getting used to dictionaries**

    a.  Creating a simple dictionary.
    
        Dictionaries are named lists, consisting of a "key" and a set of entries defining
        the equivalence class for the given key.  To create a simple dictionary of parts of
        speech, for instance we could define a dictionary consisting of articles and conjunctions,
        using the `dictionary()` constructor
        ```{r}
        require(quanteda, warn.conflicts = FALSE, quietly = TRUE)
        posDict <- dictionary(list(articles = c("the", "a", "an"),
                                   conjunctions = c("and", "but", "or", "nor", "for", "yet", "so")))
        ```
        
        You can examine this dictionary by invoking its `print` method, simply by typing the name of the object 
        and pressing Enter.  Try that now.
        
        What is the structure of this object?  (Hint: use `str()`.)
        
        To let this define a set of features, we can use this dictionary when we create a `dfm`, 
        for instance:
        ```{r}
        posDfm <- dfm(inaugCorpus, dictionary = posDict)
        head(posDfm)
        ```
        
        Weight the `posDfm` by relative term frequency, and plot the values of articles
        and conjunctions (actually, here just the coordinating conjunctions) across the speeches.
        (**Hint:** you can use `docvars(inaugCorpus, "Year"))` for the *x*-axis.)
        
        Is the distribution of normalized articles and conjunctions relatively constant across
        years, as you would expect?
        
    b.  Hierarchical dictionaries.
    
        Dictionaries may also be hierarchical, where a top-level key can consist of subordinate keys, 
        each a list of its own.  For instance, 
        `list(articles = list(definite="the", indefinite=c("a", "and"))` defines a valid
        list for articles.  Make a dictionary of articles and conjunctions where you define two
        levels, one for definite and indefinite articles, and one for coordinating and 
        subordinating conjunctions.  (A sufficient list for your purposes of  subordinating 
        conjuctions is "although", "because", "since", "unless".)
            
        Now apply this to the `inaugCorpus` object, and examine the resulting features.  What happened to the hierarchies, to make them
        into "features"?  Do the subcategories sum to the two categories from the previous question?
        
2.  **Getting used to thesauruses**

    A "thesaurus" is a list of feature equivalencies specified in the same list format as a dictionary, 
    but which---unlike a dictionary---returns all the features *not* specified as entries in the
    thesaurus.  
    
    If we wanted to count pronouns as equivalent, for instance, we could use the thesaurus argument
    to `dfm` in order to group all listed prounouns into a single feature labelled "PRONOUN".
    ```{r}
    mytexts <- c("We are not schizophrenic, but I am.", "I bought myself a new car.")
    myThes <- dictionary(list(pronouns = list(firstp=c("I", "me", "my", "mine", "myself", "we", "us", "our", "ours"))))
    myDfm <- dfm(mytexts, thesaurus = myThes)
    myDfm
    ```
    Notice how the thesaurus key has been made into uppercase --- this is to identify it as 
    a key, as opposed to a word feature from the original text.
    
    Try running the articles and conjunctions dictionary from the previous exercise on 
    as a thesaurus, and compare the results.
        
3.  **More than one way to skin a cat.**

    When you call `dfm()` with a `dictionary = ` or `thesaurus = ` argument, then what `dfm()` does internally is actually
    to first constructing the entire dfm, and then select features using a call to `applyDictionary()`.
    
    Try creating a dfm object using the first five inaugural speeches, with no dictionary applied.  Then apply the `posDict` 
    from the first question to select features a) in a way that replicates the `dictionary` argument to `dfm()`, 
    and b) in a way that replicates the `thesaurus` argument to `dfm()`. 


4.  **Populism dictionary.**
    
    Here we will create and implement the populism dictionary from Rooduijn, Matthijs, and Teun 
    Pauwels. 2011. "Measuring Populism: Comparing Two Methods of Content Analysis." *West European
    Politics* 34(6): 1272–83.  Appendix B of that paper provides the term entries for a dictionary
    key for the concept *populism*.  Implement this as a dictionary, and apply it to the same
    UK manifestos as in the article.
    
    Hint: You can get a corpus of the UK manifestos from their article using the following:
    ```{r}
    data(ukManifestos, package = "quantedaData")
    ukPopCorpus <- subset(ukManifestos, (Year %in% c(1992, 2001, 2005) & 
                                        (Party %in% c("Lab", "LD", "Con", "BNP", "UKIP"))))
    summary(ukPopCorpus)
    ```
    
    Create a dfm of the populism dictionary on the UK manifestos.  Use this dfm to reproduce
    the *x*-axis for the UK-based parties from Figure 1 in the article.  Suggestion: Use 
    `dotchart()`.  You will need to normalize the values first by term frequency within 
    document.  Hint: Use `tf(youDfmName, "prop")` on the dfm.
    
    You can explore some of these terms within the corpus to see whether you think they are 
    appropriate measures of populism.  How can you search the corpus for the regular expression
    `politici*` as a "keyword in context"?
        
2.  **Laver and Garry (2000) ideology dictionary.**
    
    Here, we will apply the dictionary of Laver, Michael, and John Garry. 2000. "Estimating Policy
    Positions From Political Texts." *American Journal of Political Science* 44(3): 619–34. 
    Using the pre-built Laver and Garry (2000) dictionary file,
    which is distributed by Provalis Research for use with its 
    [Wordstat software package from Provalis](http://provalisresearch.com/products/content-analysis-software/wordstat-dictionary/laver-garry-dictionary-of-policy-position/), 
    we will apply this to the same manifestos from the UK manifesto set.  
    To do this, you will need to:
    
    *   download and save the Wordstat-formatted dictionary file
        [LaverGarry.cat](/assets/courses/essex2014qta/LaverGarry.cat);
        
    *   load this into a dictionary list using `dictionary(file = "LaverGarry.cat", format = "wordstat")`;
    
    *   build a dfm for the corpus subset for the Labour, Liberal Democrat, and Conservative
        Party manifestos from 1992 and 1997; and
        
    *   try to replicate their measures from the "Computer" column of Table 2, for Economic 
        Policy.  (Not as easy as you thought---any ideas as to why?)
        
3.  **Fun with the Regressive Imagery Dictionary.**

    Try analyzing the inaugural speeches from 1980 onward using the Regressive Imagery Dictionary, 
    from Martindale, C. (1975)
    *Romantic progression: The psychology of literary history.* Washington, D.C.: Hemisphere.
    You can download the dictionary from http://www.provalisresearch.com/Download/RID.ZIP, 
    formatted for WordStat.  Compare the Presidents based on the level of "Icarian Imagery."
    Which president is the most Icarian?