**Due: April 3, 2015**

Working with Textual Data

This exercise is designed to get you working with quanteda. The focus will be on exploring the package and getting some texts into the corpus object format. The quanteda package has several functions for creating a corpus of texts, which we will use in this exercise.

  1. Getting Started.

    You can use R or RStudio for these exercises. You will first need to install the package,
    using the commands below. Also see the installation instructions on the dev branch page of http://github.com/kbenoit/quanteda.

    # needs the devtools package for this to work
    if (!require(devtools)) install.packages("devtools", dependencies=TRUE)
    # be sure to install the latest version from GitHub, using dev branch
    # (the repository is given as "username/repo"; the separate username
    # argument is deprecated in devtools):
    devtools::install_github("kbenoit/quanteda", dependencies=TRUE, ref="dev")
    # if this fails remove the ref="dev" part
    # and quantedaData
    devtools::install_github("kbenoit/quantedaData")

  2. Exploring quanteda functions.

    You can try running demo(quanteda), and also use the example() function for any function in the package, to run the examples and see how the function works. Of course you should also browse the documentation, especially ?corpus to see the structure and operations of how to construct a corpus.

  3. Making a corpus and corpus structure

    1. From a vector of texts already in memory.

      The simplest way to create a corpus is to use a vector of texts already present in R’s global environment. Some text and corpus objects are built into the package, for example inaugTexts is the UTF-8 encoded set of 57 presidential inaugural addresses. Try using corpus() on this set of texts to create a corpus.

      Once you have constructed this corpus, use the summary() method to see a brief description of the corpus. The names of the character vector inaugTexts should have become the document names.
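
      As a rough illustration of what summary() reports, the base-R sketch below counts types and tokens for a small named character vector. The texts and the whitespace tokenizer are made up for illustration and are not quanteda's own implementation; the point is only that the names of the vector carry through as document names.

```r
# Toy named character vector standing in for inaugTexts (hypothetical texts)
toyTexts <- c(doc1 = "We the people of the United States",
              doc2 = "The people must govern the people")

# Naive tokenizer: lower-case, then split on whitespace
toyTokens <- lapply(toyTexts, function(txt) strsplit(tolower(txt), "\\s+")[[1]])

# Tokens = number of words; Types = number of distinct words
toySummary <- data.frame(
  Text   = names(toyTokens),
  Types  = sapply(toyTokens, function(tok) length(unique(tok))),
  Tokens = sapply(toyTokens, length),
  row.names = NULL
)
print(toySummary)
```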

    2. From a directory of text files.

      The corpus() function can take as its main argument the name of a directory, if you wrap the path to the directory within a directory() call. (See ?directory for an example.) If you call directory() with no arguments, then it should allow you to choose the directory interactively (you will need to have installed the tcltk2 package first though.)

      Here you are encouraged to select any directory of plain text files of your own.
      How did it work? Try using docvars() to assign a set of document-level variables.

      Note that if you have document-level metadata in your filenames, then this can be automatically parsed by corpus.directory() into docvars.

      require(quanteda)
      ## Loading required package: quanteda
      mydir <- textfile("~/Dropbox/QUANTESS/corpora/ukManRenamed/*.txt")
      mycorpus <- corpus(mydir)
      summary(mycorpus, 5)
      ## Corpus consisting of 101 documents, showing 5 documents.
      ## 
      ##                     Text Types Tokens Sentences
      ##  UK_natl_1945_en_Con.txt  1578   6095       275
      ##  UK_natl_1945_en_Lab.txt  1258   4975       241
      ##  UK_natl_1945_en_Lib.txt  1061   3377       158
      ##  UK_natl_1950_en_Con.txt  1806   7411       381
      ##  UK_natl_1950_en_Lab.txt  1342   4879       275
      ## 
      ## Source:  /Users/kbenoit/Dropbox/Classes/Trinity/Data Mining 2015/Notes/Day 6 - Text/* on x86_64 by kbenoit.
      ## Created: Mon Mar 23 22:11:21 2015.
      ## Notes:   .

    3. From .csv or .json files: see the documentation with ?textfile.
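
      The filename parsing mentioned in item 2 can be approximated in base R. This is only a sketch of the idea, not how quanteda does it: the column names (country, level, year, language, party) are assumed labels for the underscore-separated fields of filenames like those shown in the summary output above.

```r
# Filenames following the pattern shown in the corpus summary
fnames <- c("UK_natl_1945_en_Con.txt", "UK_natl_1945_en_Lab.txt",
            "UK_natl_1950_en_Con.txt")

# Drop the extension, then split each name on the underscore separator
parts <- strsplit(sub("\\.txt$", "", fnames), "_", fixed = TRUE)

# Bind the pieces into a data frame of document-level variables
# (column names are assumed labels, not taken from quanteda)
docvarsDf <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(docvarsDf) <- c("country", "level", "year", "language", "party")
docvarsDf$year <- as.integer(docvarsDf$year)
rownames(docvarsDf) <- fnames
print(docvarsDf)
```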

  4. Explore some phrases in the text.

    You can do this using the kwic function (for “keywords in context”) to explore a specific word or phrase.

    kwic(inaugCorpus, "terror", 3)
    ##                                                  preword      word
    ##    [1797-Adams, 1183]                    or violence, by   terror,
    ## [1933-Roosevelt, 100] nameless, unreasoning, unjustified    terror
    ## [1941-Roosevelt, 252]                    by a fatalistic   terror,
    ##   [1961-Kennedy, 763]               uncertain balance of   terror 
    ##   [1961-Kennedy, 872]                     instead of its  terrors.
    ##    [1981-Reagan, 691]                 Americans from the  terror  
    ##   [1981-Reagan, 1891]                 those who practice terrorism
    ##   [1997-Clinton, 929]                  the fanaticism of   terror.
    ##  [1997-Clinton, 1462]             strong defense against   terror 
    ##    [2009-Obama, 1433]                   aims by inducing    terror
    ##                                          postword
    ##    [1797-Adams, 1183] intrigue, or venality,     
    ## [1933-Roosevelt, 100] which paralyzes needed     
    ## [1941-Roosevelt, 252] we proved that             
    ##   [1961-Kennedy, 763] that stays the             
    ##   [1961-Kennedy, 872] Together let us            
    ##    [1981-Reagan, 691] of runaway living          
    ##   [1981-Reagan, 1891] and prey upon              
    ##   [1997-Clinton, 929] And they torment           
    ##  [1997-Clinton, 1462] and destruction. Our       
    ##    [2009-Obama, 1433] and slaughtering innocents,

    Try substituting your own search terms, or working with your own corpus.
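
    To make the idea behind kwic concrete, here is a minimal base-R sketch of a keyword-in-context search. It assumes a naive whitespace tokenizer, an invented example sentence, and a made-up helper name (simpleKwic); quanteda's own kwic is more capable, and this is not its implementation.

```r
# Hypothetical helper: return the words around each match of `keyword`
simpleKwic <- function(txt, keyword, window = 3) {
  tokens <- strsplit(txt, "\\s+")[[1]]
  n <- length(tokens)
  # Match the keyword at the start of a token, ignoring case, so that
  # "terror" and "terrorism" are both found when searching for "terror"
  hits <- grep(paste0("^", keyword), tokens, ignore.case = TRUE)
  # Collect up to `window` tokens before and after each hit
  pre <- vapply(hits, function(i) {
    idx <- seq(max(1, i - window), length.out = min(window, i - 1))
    paste(tokens[idx], collapse = " ")
  }, character(1))
  post <- vapply(hits, function(i) {
    idx <- seq(i + 1, length.out = min(window, n - i))
    paste(tokens[idx], collapse = " ")
  }, character(1))
  data.frame(position = hits, preword = pre, word = tokens[hits],
             postword = post, stringsAsFactors = FALSE)
}

txt <- "We must stand against terror and those who practice terrorism and prey upon their neighbors"
result <- simpleKwic(txt, "terror")
print(result)
```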

  5. Create a document-feature matrix, using dfm. First, read the documentation using ?dfm to see the available options.

    mydfm <- dfm(inaugCorpus, ignoredFeatures=stopwords("english"))
    ## Creating a dfm from a corpus ...
    ##    ... indexing 57 documents
    ##    ... tokenizing texts, found 134,142 total tokens
    ##    ... cleaning the tokens, 461 removed entirely
    ##    ... ignoring 174 feature types, discarding 69,005 total features (51.6%)
    ##    ... summing tokens by document
    ##    ... indexing 9,085 feature types
    ##    ... building sparse matrix
    ##    ... created a 57 x 9085 sparse dfm
    ##    ... complete. Elapsed time: 3.366 seconds.
    dim(mydfm)
    ## [1]   57 9085
    topfeatures(mydfm, 20)
    ##       will     people government         us        can       upon 
    ##        871        564        561        476        470        371 
    ##       must        may      great     states      shall      world 
    ##        363        338        334        331        314        305 
    ##    country      every     nation      peace        one        new 
    ##        294        291        287        253        244        241 
    ##      power     public 
    ##        232        223

    Experiment with different dfm options, such as stem=TRUE. The function trim() allows you to reduce the size of the dfm following its construction.
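
    Conceptually, a document-feature matrix is a table of token counts with one row per document and one column per feature. The sketch below builds one in base R from made-up texts, with a tiny hand-picked stopword list; it is a dense toy version and says nothing about how dfm() is implemented.

```r
toyTexts <- c(d1 = "the people and the government",
              d2 = "the government of the people",
              d3 = "peace for every nation")
toyStopwords <- c("the", "and", "of", "for")   # assumed, illustrative list

# Tokenize on whitespace, then drop stopwords
tokens <- lapply(toyTexts, function(txt) strsplit(tolower(txt), "\\s+")[[1]])
tokens <- lapply(tokens, function(tok) tok[!tok %in% toyStopwords])

# Count each feature within each document
features <- sort(unique(unlist(tokens)))
toyDfm <- t(sapply(tokens, function(tok)
  tabulate(match(tok, features), nbins = length(features))))
colnames(toyDfm) <- features
print(toyDfm)

# Total frequency of each feature, most frequent first (compare topfeatures)
print(sort(colSums(toyDfm), decreasing = TRUE))
```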

    Grouping on a variable is an excellent feature of dfm(), in fact one of my favorites.
    For instance, if you want to aggregate all speeches by presidential name, you can execute

    mydfm <- dfm(inaugCorpus, groups="President")
    ## Creating a dfm from a corpus ...
    ##    ... grouping texts by variable: President
    ##    ... indexing 34 documents
    ##    ... tokenizing texts, found 134,142 total tokens
    ##    ... cleaning the tokens, 461 removed entirely
    ##    ... summing tokens by document
    ##    ... indexing 9,208 feature types
    ##    ... building sparse matrix
    ##    ... created a 34 x 9208 sparse dfm
    ##    ... complete. Elapsed time: 1.246 seconds.
    dim(mydfm)
    ## [1]   34 9208
    docnames(mydfm)
    ##  [1] "Adams"      "Buchanan"   "Bush"       "Carter"     "Cleveland" 
    ##  [6] "Clinton"    "Coolidge"   "Eisenhower" "Garfield"   "Grant"     
    ## [11] "Harding"    "Harrison"   "Hayes"      "Hoover"     "Jackson"   
    ## [16] "Jefferson"  "Johnson"    "Kennedy"    "Lincoln"    "Madison"   
    ## [21] "McKinley"   "Monroe"     "Nixon"      "Obama"      "Pierce"    
    ## [26] "Polk"       "Reagan"     "Roosevelt"  "Taft"       "Taylor"    
    ## [31] "Truman"     "VanBuren"   "Washington" "Wilson"

    Note that this groups Theodore and Franklin D. Roosevelt together – to separate them we would have needed to add a firstname variable using docvars() and grouped on that as well.
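
    Grouping amounts to summing the rows of the matrix that share a value of the grouping variable. Here is a minimal base-R sketch using rowsum(); the counts are invented and the firstname vector is a hypothetical stand-in for the extra document variable described above.

```r
# Invented counts: four speeches, three features
counts <- matrix(c(3, 1, 0,
                   2, 0, 1,
                   4, 2, 2,
                   1, 1, 5),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("1905-Roosevelt", "1933-Roosevelt",
                                   "1937-Roosevelt", "1961-Kennedy"),
                                 c("people", "peace", "world")))

# Grouping on surname alone merges both Roosevelts
president <- c("Roosevelt", "Roosevelt", "Roosevelt", "Kennedy")
grouped <- rowsum(counts, group = president)
print(grouped)

# Adding a first name separates Theodore from Franklin D. Roosevelt
firstname <- c("Theodore", "Franklin", "Franklin", "John")
groupedFull <- rowsum(counts, group = paste(firstname, president))
print(groupedFull)
```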

  6. Explore the ability to subset a corpus.

    There is a subset() method defined for a corpus, which works just like R’s normal subset() command. This provides an easy method to send specific documents to downstream functions, like dfm(), which will be a useful workaround until I implement a full set of subsetting and indexing features for the dfm class object.

    For instance, if you want a wordcloud of just Obama’s two inaugural addresses, you would need to subset the corpus first:

    obamadfm <- dfm(subset(inaugCorpus, President=="Obama"), stopwords=TRUE)
    ## Creating a dfm from a corpus ...
    ##    ... indexing 2 documents
    ##    ... tokenizing texts, found 4,525 total tokens
    ##    ... cleaning the tokens, 43 removed entirely
    ##    ... summing tokens by document
    ##    ... indexing 1,333 feature types
    ##    ... building sparse matrix
    ##    ... created a 2 x 1333 sparse dfm
    ##    ... complete. Elapsed time: 0.05 seconds.
    plot(obamadfm)
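
    For comparison, this is how R’s normal subset() behaves on a data frame, which is the behavior the corpus method is described as mirroring. The data frame below is an invented stand-in for a table of document variables.

```r
# Invented document variables for three speeches
docs <- data.frame(President = c("Bush", "Obama", "Obama"),
                   Year = c(2005, 2009, 2013),
                   stringsAsFactors = FALSE)

# The condition is evaluated using the columns of the data frame
obamaDocs <- subset(docs, President == "Obama")
print(obamaDocs)
```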

  7. Preparing and pre-processing texts

    1. “Cleaning” texts

      It is common to “clean” texts before processing, usually by removing punctuation and digits and converting to lower case. Look at the documentation for quanteda’s clean command (?clean) and use the command on the exampleString text (you can load this from quantedaData using data(exampleString)). Can you think of cases where cleaning could introduce homonymy?
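
      For intuition, the kind of cleaning described here can be imitated with base-R string functions. This sketch is an assumption about what cleaning typically involves, not a reproduction of clean(), and the example sentence is invented. Note how lower-casing makes the country abbreviation indistinguishable from the pronoun.

```r
txt <- "In 2015, the US asked us: is it 1 policy or 2?"

cleaned <- tolower(txt)                        # convert to lower case
cleaned <- gsub("[[:punct:]]", "", cleaned)    # remove punctuation
cleaned <- gsub("[[:digit:]]", "", cleaned)    # remove digits
cleaned <- gsub("\\s+", " ", trimws(cleaned))  # collapse leftover whitespace

print(cleaned)
```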

    2. Tokenizing texts

      In order to count word frequencies, we first need to split the text into words through a process known as tokenization. Look at the documentation for quanteda’s tokenize command using the built in help function (? before any object/command). Use the tokenize command on exampleString, and examine the results. Are there cases where it is unclear where the boundary between two words lies? You can experiment with the options to tokenize.

      Try reshaping the text of exampleString into sentences, using segmentSentence. What sort of object is returned if you tokenize the segmented sentence object?
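
      A naive base-R version of both steps shows where the difficulties arise. The regular expressions below are simplifying assumptions (split sentences after ., ! or ? when followed by whitespace; split words on whitespace), and the example text is invented; quanteda's tokenize and segmentSentence are not implemented this way.

```r
txt <- "Dr. Smith's state-of-the-art model works. Does it cost $5.50? Yes!"

# Sentence segmentation: split on whitespace that follows ., ! or ?
sentences <- strsplit(txt, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
print(sentences)   # note the spurious split after the abbreviation "Dr."

# Tokenize each sentence on whitespace; the result is a list with one
# character vector per sentence
tokensBySentence <- lapply(sentences, function(s) strsplit(s, "\\s+")[[1]])
str(tokensBySentence)
```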

    3. Stemming.

      Stemming removes the suffixes using the Porter stemmer, found in the SnowballC library. The quanteda function to invoke the stemmer is wordstem. Apply stemming to the exampleString and examine the results. Why does it not appear to work, and what do you need to do to make it work? How would you apply this to the sentence-segmented vector?

    4. Applying pre-processing to the creation of a dfm.

      quanteda’s dfm() function makes it easy to pass the cleaning arguments to clean, which are executed as part of the tokenization implemented by dfm(). Compare the steps required in a similar text preparation package, tm:

      require(tm)
      ## Loading required package: tm
      ## Loading required package: NLP
      ## 
      ## Attaching package: 'tm'
      ## 
      ## The following objects are masked from 'package:quanteda':
      ## 
      ##     as.DocumentTermMatrix, stopwords
      data("crude")
      crude <- tm_map(crude, content_transformer(tolower))
      crude <- tm_map(crude, removePunctuation)
      crude <- tm_map(crude, removeNumbers)
      crude <- tm_map(crude, stemDocument)
      tdm <- TermDocumentMatrix(crude)
      
      # same in quanteda
      require(quanteda)
      crudeCorpus <- corpus(crude)
      crudeDfm <- dfm(crudeCorpus)
      ## Creating a dfm from a corpus ...
      ##    ... indexing 20 documents
      ##    ... tokenizing texts, found 3,863 total tokens
      ##    ... cleaning the tokens, 0 removed entirely
      ##    ... summing tokens by document
      ##    ... indexing 973 feature types
      ##    ... building sparse matrix
      ##    ... created a 20 x 973 sparse dfm
      ##    ... complete. Elapsed time: 0.047 seconds.

      Inspect the dimensions of the resulting objects, including the names of the words extracted as features. It is also worth comparing the structure of the document-feature matrices returned by each package. tm uses the slam simple triplet matrix format for representing a sparse matrix.

      It is also – in fact almost always – useful to inspect the structure of this object:

      str(tdm)
      ## List of 6
      ##  $ i       : int [1:1954] 49 86 110 148 166 167 178 183 184 195 ...
      ##  $ j       : int [1:1954] 1 1 1 1 1 1 1 1 1 1 ...
      ##  $ v       : num [1:1954] 1 2 1 1 1 1 2 1 1 2 ...
      ##  $ nrow    : int 943
      ##  $ ncol    : int 20
      ##  $ dimnames:List of 2
      ##   ..$ Terms: chr [1:943] "abdulaziz" "abil" "abl" "about" ...
      ##   ..$ Docs : chr [1:20] "127" "144" "191" "194" ...
      ##  - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
      ##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"

      This indicates that we can extract the names of the words from the tm TermDocumentMatrix object by taking the Terms element of its dimnames, which holds the row names of the tdm:

      head(tdm$dimnames$Terms, 20)
      ##  [1] "abdulaziz" "abil"      "abl"       "about"     "abov"     
      ##  [6] "above"     "abroad"    "accept"    "accord"    "across"   
      ## [11] "act"       "activity"  "add"       "added"     "address"  
      ## [16] "adher"     "advantag"  "advisers"  "after"     "again"

      Compare this to the results of the same operations from quanteda. To get the “words” from a quanteda object, you can use the features() function:

      features_quanteda <- features(crudeDfm)
      head(features_quanteda, 20)
      ##  [1] "a"         "abdulaziz" "abil"      "abl"       "about"    
      ##  [6] "abov"      "above"     "abroad"    "accept"    "accord"   
      ## [11] "across"    "act"       "activity"  "ad"        "add"      
      ## [16] "added"     "address"   "adher"     "advantag"  "advisers"
      str(crudeDfm)
      ## Formal class 'dfmSparse' [package "quanteda"] with 9 slots
      ##   ..@ settings :List of 1
      ##   .. ..$ : NULL
      ##   ..@ weighting: chr "frequency"
      ##   ..@ smooth   : num 0
      ##   ..@ Dim      : int [1:2] 20 973
      ##   ..@ Dimnames :List of 2
      ##   .. ..$ docs    : chr [1:20] "reut-00001.xml" "reut-00002.xml" "reut-00004.xml" "reut-00005.xml" ...
      ##   .. ..$ features: chr [1:973] "a" "abdulaziz" "abil" "abl" ...
      ##   ..@ i        : int [1:2172] 0 1 2 3 4 5 6 7 8 9 ...
      ##   ..@ p        : int [1:974] 0 18 19 21 23 31 35 37 38 39 ...
      ##   ..@ x        : num [1:2172] 5 7 2 3 1 8 10 2 4 4 ...
      ##   ..@ factors  : list()

      What proportion of the crudeDfm are zeros? Compare the sizes of tdm and crudeDfm using the object.size() function.
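
      The share of zeros follows directly from a sparse representation: only non-zero cells are stored, so sparsity is one minus the number of stored values divided by the total number of cells. As a worked illustration, the sketch below plugs in the dimensions and the length of v reported by str(tdm) above, then checks the same formula on a small made-up dense matrix.

```r
# Values read off the str(tdm) output: 943 terms x 20 documents,
# 1954 stored (non-zero) cells
nTerms <- 943
nDocs <- 20
nNonZero <- 1954
tdmSparsity <- 1 - nNonZero / (nTerms * nDocs)
print(round(tdmSparsity, 3))

# The same quantity computed directly on a small dense matrix (invented values)
m <- matrix(c(0, 2, 0,
              1, 0, 0), nrow = 2, byrow = TRUE)
denseSparsity <- mean(m == 0)
print(denseSparsity)
```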

  8. Keywords-in-context

    1. quanteda provides a keyword-in-context function that is easily usable and configurable to explore texts in a descriptive way. Type ?kwic to view the documentation.

    2. Load the Irish budget debate speeches for the year 2010 using

      require(quantedaData)
      ## Loading required package: quantedaData
      data(ie2010Corpus)

      and experiment with the kwic function, following the syntax specified on the help page for kwic. kwic can be used either on a character vector or a corpus object. What class of object is returned? Try assigning the return value from kwic to a new object and then examine the object by clicking on it in the environment pane in RStudio (or using the inspection method of your choice).

    3. Use the kwic function to discover the context of the word “clean”. Is this associated with environmental policy?

    4. Examine the context of words related to “disaster”. Hint: you can use the stem of the word along with setting the regex argument to TRUE.
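
      Matching on a stem with a regular expression works the same way in base R, which may help when deciding what pattern to pass to kwic. A small sketch on an invented token vector, using grepl():

```r
tokens <- c("a", "disaster", "for", "the", "economy", "and", "a",
            "disastrous", "budget", "with", "no", "aster", "in", "sight")

# "^disast" anchors the stem at the start of the token, so it matches
# "disaster" and "disastrous" but not "aster"
matched <- tokens[grepl("^disast", tokens)]
print(matched)
```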

  9. Descriptive statistics

    1. We can extract basic descriptive statistics about a corpus from its document-feature matrix. Make a dfm from the 2010 Irish budget speeches corpus.

    2. Examine the most frequent word features using topfeatures. What are the five most frequent words in the corpus?

    3. quanteda provides a function to count syllables in a word, called syllables. Try the function at the prompt. The code below will apply this function to all the words in the corpus, to give you a count of the total syllables in the corpus.

      # count syllables from texts in the 2010 speech corpus 
      textSyls <- syllables(texts(ie2010Corpus))
      # sum the syllable counts 
      totalSyls <- sum(textSyls)                           
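
      To see roughly what a syllable counter does, here is a crude base-R heuristic that counts runs of consecutive vowels in each word. It is an assumed stand-in for illustration only and not the method used by syllables(); it miscounts words with silent vowels (for example, "make" is counted as two syllables).

```r
# Hypothetical helper: count vowel groups as a rough proxy for syllables
countVowelGroups <- function(words) {
  lowered <- tolower(words)
  sapply(regmatches(lowered, gregexpr("[aeiouy]+", lowered)), length)
}

words <- c("budget", "economy", "make", "the")
sylCounts <- countVowelGroups(words)
names(sylCounts) <- words
print(sylCounts)

# Sum the per-word counts to get a total
print(sum(sylCounts))
```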

  10. Lexical Diversity over Time

    1. We can plot the type-token ratio of the Irish budget speeches over time. To do this, begin by extracting a subset of iebudgetsCorpus that contains only the first speaker from each year:

      data(iebudgetsCorpus, package="quantedaData")
      finMins <- subset(iebudgetsCorpus, number=="01")
      tokeninfo <- summary(finMins)
      ## Corpus consisting of 6 documents.
      ## 
      ##                                Text Types Tokens Sentences year    debate
      ##       2008_BUDGET_01_Brian_Cowen_FF  1705   8659       417 2008    BUDGET
      ##     2009_BUDGET_01_Brian_Lenihan_FF  1653   7593       418 2009    BUDGET
      ##  2009_BUDGETSUP_01_Brian_Lenihan_FF  1639   7500       410 2009 BUDGETSUP
      ##     2010_BUDGET_01_Brian_Lenihan_FF  1649   7719       390 2010    BUDGET
      ##     2011_BUDGET_01_Brian_Lenihan_FF  1539   7049       371 2011    BUDGET
      ##    2012_BUDGET_01_Michael_Noonan_FG  1521   6412       294 2012    BUDGET
      ##  number namefirst namelast party
      ##      01     Brian    Cowen    FF
      ##      01     Brian  Lenihan    FF
      ##      01     Brian  Lenihan    FF
      ##      01     Brian  Lenihan    FF
      ##      01     Brian  Lenihan    FF
      ##      01   Michael   Noonan    FG
      ## 
      ## Source:  /Users/kbenoit/Dropbox/QUANTESS/quantedaData_kenlocal_gh/* on x86_64 by kbenoit.
      ## Created: Sat Nov 15 18:32:54 2014.
      ## Notes:   .

      Note the quotation marks around the value for number. Why are these required here?

    2. Get the type-token ratio for each text from this subset, and plot the resulting vector of TTRs as a function of the year.

    3. Now compare the results from the lexdiv function applied to the texts. Are the results the same?
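
      The type-token ratio is the number of types divided by the number of tokens. As a sketch, the values below are copied by hand from the summary output above rather than extracted programmatically, so that the example runs without the corpus; the shortened speech labels are illustrative. With the corpus loaded you would take the Types and Tokens columns from tokeninfo instead, assuming summary() returns them in a data frame as the assignment above suggests.

```r
# Types and Tokens copied from the summary(finMins) output
speech <- c("2008_BUDGET", "2009_BUDGET", "2009_BUDGETSUP",
            "2010_BUDGET", "2011_BUDGET", "2012_BUDGET")
year   <- c(2008, 2009, 2009, 2010, 2011, 2012)
types  <- c(1705, 1653, 1639, 1649, 1539, 1521)
tokens <- c(8659, 7593, 7500, 7719, 7049, 6412)

# Type-token ratio per speech
ttr <- types / tokens
names(ttr) <- speech
print(round(ttr, 3))

# To plot TTR against year (2009 appears twice because of the
# supplementary budget), uncomment:
# plot(year, ttr, type = "b", xlab = "Year", ylab = "Type-token ratio")
```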

  11. Document and word associations

    1. Load the presidential inauguration corpus, selecting the speeches from 1900-1950, and create a dfm from this corpus.

    2. Measure the document similarities using similarity(). Compare the results for Euclidean, Euclidean on the term frequency standardized dfm, cosine, and Jaccard.

    3. Measure the term similarities for the following words: economy, health, women.
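
      The measures named in item 2 can be written out in a few lines of base R, which makes it easier to see what is being compared. The sketch uses two invented count vectors; the helper names are hypothetical, and Jaccard is computed here on feature presence or absence, which is one common definition and only an assumption about what similarity() does.

```r
# Two invented documents as vectors of feature counts
x <- c(economy = 4, health = 0, women = 2, peace = 1)
y <- c(economy = 2, health = 3, women = 0, peace = 1)

euclidean  <- function(a, b) sqrt(sum((a - b)^2))
cosineSim  <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
jaccardSim <- function(a, b) sum(a > 0 & b > 0) / sum(a > 0 | b > 0)

print(euclidean(x, y))                     # distance on raw counts
print(euclidean(x / sum(x), y / sum(y)))   # distance on relative frequencies
print(cosineSim(x, y))                     # insensitive to document length
print(jaccardSim(x, y))                    # shared features / all features
```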

  12. Working with dictionaries

    1. Creating a simple dictionary.

      Dictionaries are named lists, consisting of a “key” and a set of entries defining the equivalence class for the given key. To create a simple dictionary of parts of speech, for instance we could define a dictionary consisting of articles and conjunctions, using:

      posDict <- dictionary(list(articles = c("the", "a", "and"),
                                 conjunctions = c("and", "but", "or", "nor", "for", "yet", "so")))

      To let this define a set of features, we can use this dictionary when we create a dfm, for instance:

      posDfm <- dfm(inaugCorpus, dictionary=posDict)
      ## Creating a dfm from a corpus ...
      ##    ... indexing 57 documents
      ##    ... tokenizing texts, found 134,142 total tokens
      ##    ... cleaning the tokens, 461 removed entirely
      ##    ... applying a dictionary consisting of 2 key entries
      ##    ... created a 57 x 3 sparse dfm
      ##    ... complete. Elapsed time: 1.847 seconds.
      posDfm[1:10,]
      ## Document-feature matrix of: 10 documents, 3 features.
      ## 10 x 3 sparse Matrix of class "dfmSparse"
      ##                 articles conjunctions Non_Dictionary
      ## 1789-Washington      178           73           1178
      ## 1793-Washington       15            4            116
      ## 1797-Adams           344          192           1782
      ## 1801-Jefferson       232          109           1385
      ## 1805-Jefferson       256          126           1784
      ## 1809-Madison         166           63            946
      ## 1813-Madison         169           63            978
      ## 1817-Monroe          458          174           2738
      ## 1821-Monroe          577          195           3685
      ## 1825-Adams           448          150           2317

      Weight the posDfm by term frequency using tf(), and plot the values of articles and conjunctions (actually, here just the coordinating conjunctions) across the speeches. (Hint: you can use docvars(inaugCorpus, "Year") for the x-axis.)

      Is the distribution of normalized articles and conjunctions relatively constant across years, as you would expect?
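
      What a dictionary does to a text can be mimicked in base R: each token is counted under every key whose entries contain it, and everything else is left over. The sketch below uses a simplified toy dictionary (not posDict itself) on an invented token vector, then normalizes the counts to proportions, which is the idea behind term-frequency weighting. It is illustrative only and not quanteda's implementation.

```r
# Toy dictionary: a named list of keys, each with its entries
toyDict <- list(articles = c("the", "a", "an"),
                conjunctions = c("and", "but", "or"))
tokens <- c("the", "people", "and", "the", "government", "or", "a", "nation")

# Count the tokens that fall under each dictionary key
keyCounts <- sapply(toyDict, function(entries) sum(tokens %in% entries))
# Tokens matching no key at all
keyCounts["Non_Dictionary"] <- sum(!tokens %in% unlist(toyDict))
print(keyCounts)

# Normalize by the document's total token count (relative frequency)
keyProps <- keyCounts / length(tokens)
print(round(keyProps, 3))
```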

    2. Replicating a published dictionary analysis

      Here we will create and implement the populism dictionary from Rooduijn, Matthijs, and Teun Pauwels. 2011. “Measuring Populism: Comparing Two Methods of Content Analysis.” West European Politics 34(6): 1272–83. Appendix B of that paper provides the term entries for a dictionary key for the concept populism. Implement this as a dictionary, and apply it to the same UK manifestos as in the article.

      Hint: You can get a corpus of the UK manifestos from their article using the following:

      data(ukManifestos, package="quantedaData")
      ukPopCorpus <- subset(ukManifestos, (Year %in% c(1992, 2001, 2005) & 
                                          (Party %in% c("Lab", "LD", "Con", "BNP", "UKIP"))))
      summary(ukPopCorpus)
      ## Corpus consisting of 11 documents.
      ## 
      ##                  Text Types Tokens Sentences Country Type Year Language
      ##   UK_natl_1992_en_Con  3886  29560      1605      UK natl 1992       en
      ##   UK_natl_1992_en_Lab  2313  11355       623      UK natl 1992       en
      ##    UK_natl_1992_en_LD  3004  17381       939      UK natl 1992       en
      ##   UK_natl_2001_en_Con  2517  13196       721      UK natl 2001       en
      ##   UK_natl_2001_en_Lab  3600  28704      1602      UK natl 2001       en
      ##    UK_natl_2001_en_LD  3291  21174      1232      UK natl 2001       en
      ##   UK_natl_2005_en_BNP  4444  25112      1058      UK natl 2005       en
      ##   UK_natl_2005_en_Con  1860   7685       420      UK natl 2005       en
      ##   UK_natl_2005_en_Lab  3579  23800      1557      UK natl 2005       en
      ##    UK_natl_2005_en_LD  2859  16081       840      UK natl 2005       en
      ##  UK_natl_2005_en_UKIP  2185   8856       425      UK natl 2005       en
      ##  Party
      ##    Con
      ##    Lab
      ##     LD
      ##    Con
      ##    Lab
      ##     LD
      ##    BNP
      ##    Con
      ##    Lab
      ##     LD
      ##   UKIP
      ## 
      ## Source:  /Users/kbenoit/Dropbox/QUANTESS/quantedaData_kenlocal_gh/* on x86_64 by kbenoit.
      ## Created: Sat Nov 15 18:43:36 2014.
      ## Notes:   .

      Create a dfm of the populism dictionary on the UK manifestos. Use this dfm to reproduce the x-axis for the UK-based parties from Figure 1 in the article. Suggestion: Use dotchart(). You will need to normalize the values first by term frequency within document. Hint: Use weight(x, "relFreq") on the dfm.

      You can explore some of these terms within the corpus to see whether you think they are appropriate measures of populism. How can you search the corpus for the regular expression politici* as a “keyword in context”?