Working with dictionaries

This exercise covers the material from Day 4 on working with dictionaries in quanteda. We will also be using the quantedaData package, which contains some additional corpora not in the base quanteda package.

Always remember to install a fresh copy of both packages, using the dev branch for quanteda:

if (!require(devtools)) install.packages("devtools", dependencies=TRUE)
devtools::install_github("kbenoit/quanteda", dependencies=TRUE, ref="dev")
devtools::install_github("kbenoit/quantedaData")
  1. Getting used to dictionaries

    1. Creating a simple dictionary.

      Dictionaries are named lists, consisting of a “key” and a set of entries defining the equivalence class for the given key. To create a simple dictionary of parts of speech, for instance, we could define a dictionary consisting of articles and conjunctions, using:

      posDict <- list(articles = c("the", "a", "an"),
                      conjunctions = c("and", "but", "or", "nor", "for", "yet", "so"))

      To have this dictionary define a set of features, we can supply it when we create a dfm, for instance:

      require(quanteda)
      ## Loading required package: quanteda
      ## Loading required package: data.table
      posDfm <- dfm(inaugCorpus, dictionary=posDict)
      ## Creating dfm from a corpus: ... done.
      posDfm[1:10,]
      ##                  features
      ## docs              articles conjunctions Non_Dictionary
      ##   1789-Washington      178           73           1179
      ##   1793-Washington       15            4            116
      ##   1797-Adams           344          192           1782
      ##   1801-Jefferson       232          109           1385
      ##   1805-Jefferson       256          126           1784
      ##   1809-Madison         166           63            946
      ##   1813-Madison         169           63            978
      ##   1817-Monroe          458          174           2738
      ##   1821-Monroe          577          195           3687
      ##   1825-Adams           448          150           2317

      Weight the posDfm by term frequency using tf(), and plot the values of articles and conjunctions (actually, here just the coordinating conjunctions) across the speeches. (Hint: you can use docvars(inaugCorpus, "Year") for the x-axis.)

      Is the distribution of normalized articles and conjunctions relatively constant across years, as you would expect?
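      One way to start, as a sketch assuming that tf() with its default settings converts counts to within-document proportions:

      posDfmProp <- tf(posDfm)  # normalize counts to proportions within each document
      years <- docvars(inaugCorpus, "Year")
      plot(years, as.numeric(posDfmProp[, "articles"]), type = "b",
           xlab = "Year", ylab = "Proportion of tokens",
           ylim = range(as.matrix(posDfmProp[, c("articles", "conjunctions")])))
      lines(years, as.numeric(posDfmProp[, "conjunctions"]), type = "b", lty = 2)
      legend("topleft", legend = c("articles", "conjunctions"), lty = 1:2)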

    2. Hierarchical dictionaries.

      Dictionaries may also be hierarchical, where a top-level key can consist of subordinate keys, each a list of its own. For instance, list(articles = list(definite="the", indefinite=c("a", "an"))) defines a valid list for articles. Make a dictionary of articles and conjunctions where you define two levels, one for definite and indefinite articles, and one for coordinating and subordinating conjunctions. (A sufficient list for your purposes of subordinating conjunctions is “although”, “because”, “since”, “unless”.)
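      One possible construction, as a sketch (the key names are illustrative; only the word lists are given in the text):

      posDict2 <- list(articles = list(definite = "the",
                                       indefinite = c("a", "an")),
                       conjunctions = list(coordinating = c("and", "but", "or", "nor",
                                                            "for", "yet", "so"),
                                           subordinating = c("although", "because",
                                                             "since", "unless")))
      posDfm2 <- dfm(inaugCorpus, dictionary=posDict2)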

      Output the results and examine them. What happened to the hierarchies, to make them into “features”?

  2. Getting used to thesauruses

    A “thesaurus” is a list of feature equivalencies specified in the same list format as a dictionary, but which, unlike a dictionary, also retains all the features not matched by any thesaurus entry.

    If we wanted to count pronouns as equivalent, for instance, we could use the thesaurus argument to dfm in order to group all listed pronouns into a single feature labelled “PRONOUNS.FIRSTP”.

    mytexts <- c("We are not schizophrenic, but I am.", "I bought myself a new car.")
    myThes <- list(pronouns = list(firstp=c("I", "me", "my", "mine", "myself", "we", "us", "our", "ours")))
    myDfm <- dfm(mytexts, thesaurus=myThes)
    ## Creating dfm from character vector ... done.
    myDfm[1:2,]
    ##        features
    ## docs    PRONOUNS.FIRSTP are not schizophrenic but am bought a new car
    ##   text1               2   1   1             1   1  1      0 0   0   0
    ##   text2               2   0   0             0   0  0      1 1   1   1

    Notice how the thesaurus key has been converted to uppercase: this identifies it as a key, as opposed to a word feature from the original text.

    Try running the articles and conjunctions dictionary from the previous exercise as a thesaurus, and compare the results.
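    For instance, with posDict as defined in the first exercise, something along these lines should work:

    posThesDfm <- dfm(inaugCorpus, thesaurus=posDict)
    posThesDfm[1:5, 1:8]  # dictionary keys appear uppercased alongside the remaining word features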

  3. Using regular expressions

    Regular expressions are an essential tool in text processing, as they allow searching and matching of text strings based on symbolic representations. For the dictionary and thesaurus features, we can define equivalency classes in terms of regular expressions. There is an excellent tutorial on regular expressions at http://www.regular-expressions.info.

    This provides an easy way to recover morphological variations on specific words, without relying on a stemmer. For instance, we could construct a tax dictionary as follows:

    econDict <- list(tax=c("tax", "charge"), cuts=c("cut", "^aust", "budget"))
    econTexts <- c("The new budget raises income taxes and introduces a cuteness tax.",
                   "The new budgetary era of austerity is upon us, taxing us with surcharges.",
                   "We live in a new caustic era of taxes and charges.")
    econDfm <- dfm(econTexts, dictionary=econDict)
    ## Creating dfm from character vector ... done.
    econDfm[1:3,]
    ##        features
    ## docs    tax cuts Non_Dictionary
    ##   text1   1    1              9
    ##   text2   0    0             13
    ##   text3   0    0             11

    Observe the results. Why are some words picked up, and others not? Contrast this with the results when dfm() is rerun with the dictionary_regex=TRUE option. Is “caustic” in text3 picked up under the “cuts” key, and if not, why not? How could the regular expression be modified in order to pick it up?

    Why does “tax” as a regular expression pick up all of the tax words?

    How could we modify the terms in the cuts key to not count cuteness?
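    As a starting point for these questions, here is a sketch that reruns the analysis with regular-expression matching; the anchored pattern for the cut terms is one illustrative way to avoid counting “cuteness”:

    econDictRegex <- list(tax = c("tax", "charge"),
                          cuts = c("^cuts?$", "^aust", "budget"))  # "^cuts?$" matches only "cut" or "cuts"
    econDfmRegex <- dfm(econTexts, dictionary=econDictRegex, dictionary_regex=TRUE)
    econDfmRegex[1:3,]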

  4. Replicating a published dictionary analysis

    1. Populism dictionary.

      Here we will create and implement the populism dictionary from Rooduijn, Matthijs, and Teun Pauwels. 2011. “Measuring Populism: Comparing Two Methods of Content Analysis.” West European Politics 34(6): 1272–83. Appendix B of that paper provides the term entries for a dictionary key for the concept populism. Implement this as a dictionary, and apply it to the same UK manifestos as in the article.

      Hint: You can get a corpus of the UK manifestos from their article using the following:

      require(quantedaData)
      ## Loading required package: quantedaData
      data(ukManifestos)
      ukPopCorpus <- subset(ukManifestos, (Year %in% c(1992, 2001, 2005) & 
                                          (Party %in% c("Lab", "LD", "Con", "BNP", "UKIP"))))
      summary(ukPopCorpus)
      ## Corpus consisting of 11 documents.
      ## 
      ##                  Text Types Tokens Sentences Country Type Year Language Party
      ##   UK_natl_1992_en_Con  3829  29560      1605      UK natl 1992       en   Con
      ##   UK_natl_1992_en_Lab  2295  11355       623      UK natl 1992       en   Lab
      ##    UK_natl_1992_en_LD  2979  17381       939      UK natl 1992       en    LD
      ##   UK_natl_2001_en_Con  2473  13196       721      UK natl 2001       en   Con
      ##   UK_natl_2001_en_Lab  3523  28711      1602      UK natl 2001       en   Lab
      ##    UK_natl_2001_en_LD  3263  21177      1232      UK natl 2001       en    LD
      ##   UK_natl_2005_en_BNP  4575  25214      1058      UK natl 2005       en   BNP
      ##   UK_natl_2005_en_Con  1842   7687       420      UK natl 2005       en   Con
      ##   UK_natl_2005_en_Lab  3512  23806      1557      UK natl 2005       en   Lab
      ##    UK_natl_2005_en_LD  2832  16081       840      UK natl 2005       en    LD
      ##  UK_natl_2005_en_UKIP  2201   8882       425      UK natl 2005       en  UKIP
      ## 
      ## Source:  /home/paul/Dropbox/code/quantedaData/* on x86_64 by paul.
      ## Created: Tue Sep 16 16:17:33 2014.
      ## Notes:   .

      Create a dfm of the populism dictionary on the UK manifestos. Use this dfm to reproduce the x-axis for the UK-based parties from Figure 1 in the article. Suggestion: Use dotchart(). You will need to normalize the values first by term frequency within document. Hint: Use tf() on the dfm.
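      A rough sketch of the mechanics; the two stems below are illustrative samples only, and the full term list must be taken from Appendix B:

      popDict <- list(populism = c("elit", "politici"))  # sample stems only; complete from Appendix B
      popDfm <- dfm(ukPopCorpus, dictionary=popDict, dictionary_regex=TRUE)
      popDfmProp <- tf(popDfm)  # normalize by term frequency within document
      dotchart(as.numeric(popDfmProp[, "populism"]),
               labels = docnames(ukPopCorpus), xlab = "Populism score")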

      You can explore some of these terms within the corpus to see whether you think they are appropriate measures of populism. How can you search the corpus for the regular expression politici* as a “keyword in context”?
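      One approach, as a sketch assuming kwic() accepts a pattern and a window size (here five words of context):

      kwic(ukPopCorpus, "politici", window = 5)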

    2. Laver and Garry (2000) ideology dictionary.

      Here we will apply the ideology dictionary of Laver, Michael, and John Garry. 2000. “Estimating Policy Positions From Political Texts.” American Journal of Political Science 44(3): 619–34. Using the pre-built Laver and Garry (2000) dictionary file, distributed by Provalis Research for use with its WordStat software, we will analyze the same UK manifesto set.
      To do this, you will need to (a sketch of the first three steps follows this list):

      • download and save the WordStat-formatted dictionary file LaverGarry.cat;

      • load this into a dictionary list using readWStatDict();

      • build a dfm for the corpus subset for the Labour, Liberal Democrat, and Conservative Party manifestos from 1992 and 1997; and

      • try to replicate their measures from the “Computer” column of Table 2, for Economic Policy. (Not as easy as you thought—any ideas as to why?)
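      A minimal sketch of the first three steps, assuming LaverGarry.cat has been saved to the working directory:

      lgDict <- readWStatDict("LaverGarry.cat")
      lgCorpus <- subset(ukManifestos, Year %in% c(1992, 1997) &
                                       Party %in% c("Con", "Lab", "LD"))
      lgDfm <- dfm(lgCorpus, dictionary=lgDict)
      lgDfmProp <- tf(lgDfm)  # normalized values, for comparison against Table 2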

    3. Fun with the Regressive Imagery Dictionary and the LIWC.

      • Try analyzing the inaugural speeches from 1980 onward using the Linguistic Inquiry and Word Count (LIWC) dictionary. It is available in the file LIWC2001_English.dic from the course Readings folder (see the Dropbox link distributed by email). To load it, you can use the quanteda function readLIWCdict(), which reads LIWC-formatted dictionaries. Note: in the current implementation, creating the dfm takes a while (up to a minute), because of the size and complexity of the LIWC dictionary. Compare the speeches based on the (normalized) Anger measure. (A sketch follows this list.)

      • Try the same thing but using the Regressive Imagery Dictionary, from Martindale, C. (1975) Romantic progression: The psychology of literary history. Washington, D.C.: Hemisphere. You can download the dictionary from http://www.provalisresearch.com/Download/RID.ZIP, formatted for WordStat. Compare the Presidents based on the level of “Icarian Imagery.” Which president is the most Icarian?
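      A sketch for the LIWC task, assuming LIWC2001_English.dic has been saved to the working directory and that the anger category is keyed “Anger” in the dictionary file:

      liwcDict <- readLIWCdict("LIWC2001_English.dic")
      recentCorpus <- subset(inaugCorpus, Year >= 1980)
      liwcDfm <- dfm(recentCorpus, dictionary=liwcDict)  # may take up to a minute
      liwcProp <- tf(liwcDfm)  # normalize within document
      angerScores <- as.numeric(liwcProp[, "Anger"])
      names(angerScores) <- docnames(recentCorpus)
      sort(angerScores, decreasing = TRUE)

      The Regressive Imagery Dictionary task follows the same pattern, with readWStatDict() in place of readLIWCdict().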

  5. Mega-extra credit

    (and possibly leading to me hiring you as a post-doc) Examine the WordStat sentiment dictionary, which includes a series of relative rules, available here: http://provalisresearch.com/Download/WSD.zip. How could these rules be implemented for dictionaries and dfm creation in quanteda?