Working with dictionaries

This exercise covers the material from Day 3 on working with dictionaries using quanteda.

  1. Getting used to dictionaries

    1. Creating a simple dictionary.

      Dictionaries are named lists, consisting of a “key” and a set of entries defining the equivalence class for the given key. To create a simple dictionary of parts of speech, for instance, we could define a dictionary consisting of articles and conjunctions, using the dictionary() constructor:

      require(quanteda, warn.conflicts = FALSE, quietly = TRUE)
      posDict <- dictionary(list(articles = c("the", "a", "an"),
                                 conjunctions = c("and", "but", "or", "nor", "for", "yet", "so")))

      You can examine this dictionary by invoking its print method, simply by typing the name of the object and pressing Enter. Try that now.

      What is the structure of this object? (Hint: use str().)
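
      As a point of comparison for the str() output, here is a minimal base R sketch of the plain named list that was passed to dictionary() above. It only illustrates the "key plus entries" idea; the dictionary object that quanteda builds may carry extra attributes, and posList and key_for are hypothetical names introduced for this illustration.

```r
# A plain named list with the same shape as the one passed to dictionary() above.
posList <- list(articles = c("the", "a", "an"),
                conjunctions = c("and", "but", "or", "nor", "for", "yet", "so"))

names(posList)     # the keys
lengths(posList)   # how many entries each key holds
str(posList)       # compact view of the whole structure

# Hypothetical helper (not part of quanteda): which key, if any, lists a word?
key_for <- function(word, dict) {
  hits <- names(dict)[vapply(dict, function(entries) word %in% entries, logical(1))]
  if (length(hits) > 0) hits[1] else NA_character_
}

key_for("but", posList)      # a word listed under one of the keys
key_for("people", posList)   # a word listed under no key
```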

      To let this define a set of features, we can use this dictionary when we create a dfm, for instance:

      posDfm <- dfm(inaugCorpus, dictionary = posDict)
      ## Creating a dfm from a corpus ...
      ##    ... lowercasing
      ##    ... tokenizing
      ##    ... indexing documents: 57 documents
      ##    ... indexing features: 9,215 feature types
      ##    ... applying a dictionary consisting of 2 keys
      ##    ... created a 57 x 2 sparse dfm
      ##    ... complete. 
      ## Elapsed time: 0.295 seconds.
      head(posDfm)
      ## Document-feature matrix of: 57 documents, 2 features.
      ## (showing first 6 documents and first 2 features)
      ##                  features
      ## docs              articles conjunctions
      ##   1789-Washington      140           73
      ##   1793-Washington       14            4
      ##   1797-Adams           232          192
      ##   1801-Jefferson       154          109
      ##   1805-Jefferson       168          126
      ##   1809-Madison         128           63

      Weight the posDfm by relative term frequency, and plot the values of articles and conjunctions (actually, here just the coordinating conjunctions) across the speeches. (Hint: you can use docvars(inaugCorpus, "Year") for the x-axis.)

      Is the distribution of normalized articles and conjunctions relatively constant across years, as you would expect?
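
      To make "relative term frequency" concrete, the sketch below row-normalizes the six rows shown in the head(posDfm) output, using base R only. It assumes the weighting works within each row of the two-feature matrix, so the two shares in a row sum to 1; normalizing by the total number of tokens in each speech would need the full token counts instead. The object names are hypothetical.

```r
# Counts copied from the head(posDfm) output above.
counts <- matrix(c(140,  73,
                    14,   4,
                   232, 192,
                   154, 109,
                   168, 126,
                   128,  63),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(docs = c("1789-Washington", "1793-Washington",
                                          "1797-Adams", "1801-Jefferson",
                                          "1805-Jefferson", "1809-Madison"),
                                 features = c("articles", "conjunctions")))

# Relative frequency within each row: divide every count by its row total.
relFreq <- counts / rowSums(counts)
round(relFreq, 3)

# x-axis values parsed from the document names (an assumption made for this
# sketch; the exercise suggests docvars(inaugCorpus, "Year") instead).
years <- as.integer(substr(rownames(counts), 1, 4))

# One possible plot, left commented out so the sketch has no side effects:
# matplot(years, relFreq, type = "b", pch = 1:2, xlab = "Year", ylab = "Share")
```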

    2. Hierarchical dictionaries.

      Dictionaries may also be hierarchical, where a top-level key can consist of subordinate keys, each a list of its own. For instance, list(articles = list(definite = "the", indefinite = c("a", "an"))) defines a valid list for articles. Make a dictionary of articles and conjunctions with two levels: definite and indefinite articles under one key, and coordinating and subordinating conjunctions under the other. (A sufficient list of subordinating conjunctions for your purposes is “although”, “because”, “since”, “unless”.)

      Now apply this to the inaugCorpus object, and examine the resulting features. What happened to the hierarchies, to make them into “features”? Do the subcategories sum to the two categories from the previous question?
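
      One way to picture what can happen to a hierarchy is to flatten the nested list by one level in base R, which yields one entry vector per "key.subkey" pair. This is only a sketch of the idea: whether quanteda flattens nested keys in exactly this way is an assumption (the PRONOUNS.FIRSTP feature in the thesaurus output later in this exercise is at least consistent with it), and posHier and flat are hypothetical names.

```r
# A two-level list of the kind the exercise asks for.
posHier <- list(
  articles = list(definite = "the",
                  indefinite = c("a", "an")),
  conjunctions = list(coordinating = c("and", "but", "or", "nor", "for", "yet", "so"),
                      subordinating = c("although", "because", "since", "unless"))
)

# Flatten one level only: each subkey becomes its own entry vector,
# named "key.subkey" by unlist().
flat <- unlist(posHier, recursive = FALSE)

names(flat)      # the flattened feature names
lengths(flat)    # entries per flattened key
```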

  2. Getting used to thesauruses

    A “thesaurus” is a list of feature equivalencies specified in the same list format as a dictionary, but which, unlike a dictionary, also returns all the features not specified as entries in the thesaurus.

    If we wanted to count pronouns as equivalent, for instance, we could use the thesaurus argument to dfm in order to group all listed pronouns into a single feature labelled “PRONOUNS.FIRSTP”.

    mytexts <- c("We are not schizophrenic, but I am.", "I bought myself a new car.")
    myThes <- dictionary(list(pronouns = list(firstp=c("I", "me", "my", "mine", "myself", "we", "us", "our", "ours"))))
    myDfm <- dfm(mytexts, thesaurus = myThes)
    ## 
    ##    ... lowercasing
    ##    ... tokenizing
    ##    ... indexing documents: 2 documents
    ##    ... indexing features: 12 feature types
    ##    ... applying a dictionary consisting of 1 key
    ##    ... created a 2 x 10 sparse dfm
    ##    ... complete. 
    ## Elapsed time: 0.016 seconds.
    myDfm
    ## Document-feature matrix of: 2 documents, 10 features.
    ## 2 x 10 sparse Matrix of class "dfmSparse"
    ##       are not schizophrenic but am bought a new car PRONOUNS.FIRSTP
    ## text1   1   1             1   1  1      0 0   0   0               2
    ## text2   0   0             0   0  0      1 1   1   1               2

    Notice how the thesaurus key has been made into uppercase — this is to identify it as a key, as opposed to a word feature from the original text.
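
    The thesaurus behaviour can be mimicked at the token level in base R: matched tokens are replaced by an uppercase label and everything else is kept. This is a rough sketch, not quanteda's implementation; tokenize_simple and apply_thesaurus are hypothetical helpers, and the tokenizer (lowercase, split on anything that is not a letter) is a simplifying assumption.

```r
mytexts <- c(text1 = "We are not schizophrenic, but I am.",
             text2 = "I bought myself a new car.")
firstp <- c("i", "me", "my", "mine", "myself", "we", "us", "our", "ours")

# Hypothetical helper: lowercase, split on non-letters, drop empty strings.
tokenize_simple <- function(x) {
  toks <- strsplit(tolower(x), "[^a-z]+")[[1]]
  toks[nzchar(toks)]
}

# Hypothetical helper: replace listed tokens by the label, keep all others,
# then count every resulting feature.
apply_thesaurus <- function(x, entries, label) {
  toks <- tokenize_simple(x)
  toks[toks %in% entries] <- label
  table(toks)
}

res <- lapply(mytexts, apply_thesaurus,
              entries = firstp, label = "PRONOUNS.FIRSTP")
res
```

    With these two texts the label is counted twice in each document, which agrees with the PRONOUNS.FIRSTP column in the dfm output above.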

    Try running the articles and conjunctions dictionary from the previous exercise as a thesaurus, and compare the results.

  3. More than one way to skin a cat.

    When you call dfm() with a dictionary = or thesaurus = argument, dfm() internally first constructs the entire dfm and then selects features using a call to applyDictionary().

    Try creating a dfm object using the first five inaugural speeches, with no dictionary applied. Then apply the posDict from the first question to select features a) in a way that replicates the dictionary argument to dfm(), and b) in a way that replicates the thesaurus argument to dfm().
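
    The difference between the two selections can be sketched on a small count matrix in base R. This is not applyDictionary() itself, only an analogue under stated assumptions: the counts are made up for illustration, and lookup, dictCounts, and thesCounts are hypothetical names.

```r
posList <- list(articles = c("the", "a", "an"),
                conjunctions = c("and", "but", "or", "nor", "for", "yet", "so"))

# Made-up document-by-feature counts, standing in for a dfm built with no dictionary.
counts <- matrix(c(3, 1, 2, 0,
                   1, 2, 0, 4),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"),
                                 c("the", "a", "and", "people")))

# Lookup vector from entry to key; features not in the dictionary map to NA.
lookup <- setNames(rep(names(posList), lengths(posList)),
                   unlist(posList, use.names = FALSE))
keys <- unname(lookup[colnames(counts)])
matched <- !is.na(keys)

# a) dictionary style: keep only matched features, summed by key.
dictCounts <- t(rowsum(t(counts[, matched, drop = FALSE]), keys[matched]))

# b) thesaurus style: sum matched features under an uppercase key label,
#    and keep every unmatched feature as it is.
labels <- ifelse(matched, toupper(keys), colnames(counts))
thesCounts <- t(rowsum(t(counts), labels))

dictCounts
thesCounts
```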

  4. Populism dictionary.

    Here we will create and implement the populism dictionary from Rooduijn, Matthijs, and Teun Pauwels. 2011. “Measuring Populism: Comparing Two Methods of Content Analysis.” West European Politics 34(6): 1272–83. Appendix B of that paper provides the term entries for a dictionary key for the concept populism. Implement this as a dictionary, and apply it to the same UK manifestos as in the article.

    Hint: You can get a corpus of the UK manifestos from their article using the following:

    data(ukManifestos, package = "quantedaData")
    ukPopCorpus <- subset(ukManifestos, (Year %in% c(1992, 2001, 2005) & 
                                        (Party %in% c("Lab", "LD", "Con", "BNP", "UKIP"))))
    summary(ukPopCorpus)
    ## Corpus consisting of 11 documents.
    ## 
    ##                  Text Types Tokens Sentences Country Type Year Language Party
    ##   UK_natl_1992_en_Con  4598  33506      1811      UK natl 1992       en   Con
    ##   UK_natl_1992_en_Lab  2338  12777      1751      UK natl 1992       en   Lab
    ##    UK_natl_1992_en_LD  3055  19894      2586      UK natl 1992       en    LD
    ##   UK_natl_2001_en_Con  2821  14693       931      UK natl 2001       en   Con
    ##   UK_natl_2001_en_Lab  4097  33024      4994      UK natl 2001       en   Lab
    ##    UK_natl_2001_en_LD  3867  24144      1586      UK natl 2001       en    LD
    ##   UK_natl_2005_en_BNP  4831  28788      1266      UK natl 2005       en   BNP
    ##   UK_natl_2005_en_Con  2089   8615       487      UK natl 2005       en   Con
    ##   UK_natl_2005_en_Lab  4025  27346      1355      UK natl 2005       en   Lab
    ##    UK_natl_2005_en_LD  3240  18316      1857      UK natl 2005       en    LD
    ##  UK_natl_2005_en_UKIP  2467  10138       509      UK natl 2005       en  UKIP
    ## 
    ## Source:  /Users/kbenoit/Dropbox/QUANTESS/quantedaData_kenlocal_gh/* on x86_64 by kbenoit
    ## Created: Sat Nov 15 18:43:36 2014
    ## Notes:

    Create a dfm of the populism dictionary on the UK manifestos. Use this dfm to reproduce the x-axis for the UK-based parties from Figure 1 in the article. Suggestion: Use dotchart(). You will need to normalize the values first by term frequency within document. Hint: Use tf(yourDfmName, "prop") on the dfm.
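
    The normalization step amounts to computing, for each document, the share of its tokens that match any dictionary entry. The base R sketch below shows that calculation on two made-up texts. It does not reproduce Appendix B: politici* is the only pattern taken from this exercise, "elit*" is a placeholder for the real entries, and the texts, the tokenizer, and all object names are assumptions made for illustration.

```r
# Made-up texts standing in for manifestos.
toyTexts <- c(partyA = "The elite and the politicians ignore the people.",
              partyB = "We will invest in schools and hospitals.")

# Wildcard entries: "politici*" appears in this exercise; "elit*" is a
# placeholder, to be replaced by the entries from Appendix B of the article.
popEntries <- c("politici*", "elit*")

# Convert the wildcards to regular expressions and combine them into one.
popRegex <- paste(utils::glob2rx(popEntries), collapse = "|")

# Share of each text's tokens that match at least one entry.
share <- vapply(toyTexts, function(x) {
  toks <- strsplit(tolower(x), "[^a-z]+")[[1]]
  toks <- toks[nzchar(toks)]
  mean(grepl(popRegex, toks))
}, numeric(1))

share

# A dot chart of the shares, left commented out so the sketch has no side effects:
# dotchart(sort(share), xlab = "Share of tokens matching the dictionary")
```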

    You can explore some of these terms within the corpus to see whether you think they are appropriate measures of populism. How can you search the corpus for the regular expression politici* as a “keyword in context”?
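
    As a rough picture of what a keyword-in-context search returns, here is a base R sketch with a hypothetical helper, kwic_simple, which is not quanteda's kwic(). Note that politici* can be read in two ways: as a wildcard it means words starting with "politici", whereas as a regular expression it means "politic" followed by zero or more "i", which also matches words such as "politics". The sketch uses the anchored regular expression ^politici for the wildcard reading; the example sentence is made up.

```r
# Hypothetical helper: list each matching token with up to `window`
# tokens of context on either side.
kwic_simple <- function(text, pattern, window = 3) {
  toks <- strsplit(tolower(text), "[^a-z]+")[[1]]
  toks <- toks[nzchar(toks)]
  hits <- grep(pattern, toks)
  context <- function(i, side) {
    if (side == "pre") {
      n <- min(window, i - 1)
      idx <- (i - n) + seq_len(n) - 1   # the n tokens before position i
    } else {
      n <- min(window, length(toks) - i)
      idx <- i + seq_len(n)             # the n tokens after position i
    }
    paste(toks[idx], collapse = " ")
  }
  data.frame(pre = vapply(hits, context, character(1), side = "pre"),
             keyword = toks[hits],
             post = vapply(hits, context, character(1), side = "post"),
             stringsAsFactors = FALSE)
}

sentence <- "The people distrust politicians and their political games."
kw <- kwic_simple(sentence, "^politici")
kw

# The two readings of politici* differ on a word like "politics":
grepl("politici*", "politics")   # regular-expression reading
grepl("^politici", "politics")   # wildcard reading, written as a regex
```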

  5. Laver and Garry (2000) ideology dictionary.

    Here, we will apply the dictionary of Laver, Michael, and John Garry. 2000. “Estimating Policy Positions From Political Texts.” American Journal of Political Science 44(3): 619–34. Using the pre-built Laver and Garry (2000) dictionary file, which Provalis Research distributes for use with its WordStat software package, we will apply this dictionary to the same manifestos from the UK manifesto set.
    To do this, you will need to:

    • download and save the WordStat-formatted dictionary file LaverGarry.cat;

    • load this into a dictionary list using dictionary(file = "LaverGarry.cat", format = "wordstat");

    • build a dfm for the corpus subset for the Labour, Liberal Democrat, and Conservative Party manifestos from 1992 and 1997; and

    • try to replicate their measures from the “Computer” column of Table 2, for Economic Policy. (Not as easy as you thought—any ideas as to why?)

  6. Fun with the Regressive Imagery Dictionary.

    Try analyzing the inaugural speeches from 1980 onward using the Regressive Imagery Dictionary, from Martindale, C. (1975) Romantic progression: The psychology of literary history. Washington, D.C.: Hemisphere. You can download the dictionary from http://www.provalisresearch.com/Download/RID.ZIP, formatted for WordStat. Compare the Presidents based on the level of “Icarian Imagery.” Which president is the most Icarian?