Instructions

Work your way through the examples, studying each to understand what it is doing. Where questions are asked, include your answer when you write this up.

Ways to prepare your answer:

Naming your file: Please use the following convention:

Exercise1_Lastname_FirstName.pdf (or whatever extension is appropriate)

Submitting your answers: Can be done by email to kbenoit@tcd.ie.

Exercise 1

  1. Preliminaries: Installation

    1. Install the package

    First, you need to have quanteda installed. You can do this from inside RStudio, from the Tools…Install Packages menu, or simply using

    install.packages("quanteda")

    (Optional) You can install some additional corpus data from quantedaData using

    ## the devtools package is required to install quanteda from Github
    devtools::install_github("kbenoit/quantedaData")

    Note that on Windows platforms, it is also (highly) recommended that you install the RTools suite, and for OS X, that you install XCode from the App Store.

    2. Test your setup

    Before you can execute the quanteda commands in this file, you will need to attach its functions using a require() or library() call.

    require(quanteda)
    ## Loading required package: quanteda
    ##
    ## Attaching package: 'quanteda'
    ##
    ## The following object is masked from 'package:stats':
    ##
    ##     df
    ##
    ## The following object is masked from 'package:base':
    ##
    ##     sample

    Now summarize some texts in the Irish 2010 budget speech corpus:

    summary(ie2010Corpus)
    ## Corpus consisting of 14 documents.
    ##
    ##                                   Text Types Tokens Sentences year debate
    ##        2010_BUDGET_01_Brian_Lenihan_FF  1754   7916       404 2010 BUDGET
    ##       2010_BUDGET_02_Richard_Bruton_FG   995   4086       217 2010 BUDGET
    ##         2010_BUDGET_03_Joan_Burton_LAB  1521   5790       309 2010 BUDGET
    ##        2010_BUDGET_04_Arthur_Morgan_SF  1499   6510       345 2010 BUDGET
    ##          2010_BUDGET_05_Brian_Cowen_FF  1544   5964       252 2010 BUDGET
    ##           2010_BUDGET_06_Enda_Kenny_FG  1087   3896       155 2010 BUDGET
    ##      2010_BUDGET_07_Kieran_ODonnell_FG   638   2086       133 2010 BUDGET
    ##       2010_BUDGET_08_Eamon_Gilmore_LAB  1123   3807       202 2010 BUDGET
    ##     2010_BUDGET_09_Michael_Higgins_LAB   457   1149        44 2010 BUDGET
    ##        2010_BUDGET_10_Ruairi_Quinn_LAB   415   1181        60 2010 BUDGET
    ##      2010_BUDGET_11_John_Gormley_Green   381    939        50 2010 BUDGET
    ##        2010_BUDGET_12_Eamon_Ryan_Green   486   1519        90 2010 BUDGET
    ##      2010_BUDGET_13_Ciaran_Cuffe_Green   426   1144        45 2010 BUDGET
    ##  2010_BUDGET_14_Caoimhghin_OCaolain_SF  1110   3699       177 2010 BUDGET
    ##  number      foren     name party
    ##      01      Brian  Lenihan    FF
    ##      02    Richard   Bruton    FG
    ##      03       Joan   Burton   LAB
    ##      04     Arthur   Morgan    SF
    ##      05      Brian    Cowen    FF
    ##      06       Enda    Kenny    FG
    ##      07     Kieran ODonnell    FG
    ##      08      Eamon  Gilmore   LAB
    ##      09    Michael  Higgins   LAB
    ##      10     Ruairi    Quinn   LAB
    ##      11       John  Gormley Green
    ##      12      Eamon     Ryan Green
    ##      13     Ciaran    Cuffe Green
    ##      14 Caoimhghin OCaolain    SF
    ##
    ## Source:  /home/paul/Dropbox/code/quantedaData/* on x86_64 by paul
    ## Created: Tue Sep 16 15:58:21 2014
    ## Notes:

    Create a document-feature matrix from this corpus, removing stop words:

    ieDfm <- dfm(ie2010Corpus, ignoredFeatures = c(stopwords("english"), "will"), stem = TRUE)
    ## Creating a dfm from a corpus ...
    ##    ... lowercasing
    ##    ... tokenizing
    ##    ... indexing documents: 14 documents
    ##    ... indexing features: 4,881 feature types
    ##    ... removed 118 features, from 175 supplied (glob) feature types
    ##    ... stemming features (English), trimmed 1510 feature variants
    ##    ... created a 14 x 3253 sparse dfm
    ##    ... complete.
    ## Elapsed time: 0.102 seconds.

    Look at the top occurring features:

    topfeatures(ieDfm)
    ##  budget   peopl  govern    year  minist     tax  public economi     cut
    ##     271     266     242     198     197     195     179     172     172
    ##     job
    ##     148

    Make a word cloud:

    plot(ieDfm, min.freq=25, random.order=FALSE)

    [word cloud plot]

    Did you get the same output?

  2. Basic string manipulation functions in R

    There are several useful string manipulation functions in the R base library. In addition, we will look at the stringr package which provides an additional interface for simple text manipulation.

    The fundamental type (or mode) in which R stores text is the character vector. The simplest case is a character vector of length one. The nchar() function returns the number of characters in a character vector.

    require(quanteda)
    s1 <- 'my example text'
    length(s1)
    ## [1] 1
    nchar(s1)
    ## [1] 15
    1. Counting characters.

    The nchar function is vectorized, meaning that when called on a vector it returns a value for each element of the vector.

    s2 <- c('This is', 'my example text.', 'So imaginative.')
    length(s2)
    ## [1] 3
    nchar(s2)
    ## [1]  7 16 15
    sum(nchar(s2))
    ## [1] 38

    We can use this to answer some simple questions about the inaugural addresses.

    Which were the longest and shortest speeches? We can query this using nchar() together with which.max() and which.min().

    which.max(nchar(inaugTexts))
    ## 1841-Harrison
    ##            14
    which.min(nchar(inaugTexts))
    ## 1793-Washington
    ##               2
    2. Extracting characters.

    Unlike in some other programming languages, you cannot use [ to index into a string in R to extract characters: the [ operator selects elements of the vector, not characters within a string:

    s1 <- 'This file contains many fascinating example sentences.'
    s1[6:9]
    ## [1] NA NA NA NA

    To extract a substring, instead we use the substr() function. Using the help page from ?substr, execute a call to substr() to return characters 6 through 9 of s1 below.

    s1 <- 'This file contains many fascinating example sentences.'
    substr(s1, 6, 9)
    ## [1] "file"

    A note for you C programmers: R counts from 1, not 0.
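
    As an aside, substr() is also vectorized: applied to a character vector, it extracts the given character range from each element. A quick base-R illustration:

    ```r
    # substr() extracts the same character positions from every element
    s2 <- c("This is", "my example text.", "So imaginative.")
    substr(s2, 1, 4)
    ## [1] "This" "my e" "So i"
    ```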

    3. Splitting texts and using lists.

    Often we would like to split character vectors to extract a term of interest. This is possible using the strsplit function. Consider the names of the inaugural texts:

    names(inaugTexts)
    ##  [1] "1789-Washington" "1793-Washington" "1797-Adams"
    ##  [4] "1801-Jefferson"  "1805-Jefferson"  "1809-Madison"
    ##  [7] "1813-Madison"    "1817-Monroe"     "1821-Monroe"
    ## [10] "1825-Adams"      "1829-Jackson"    "1833-Jackson"
    ## [13] "1837-VanBuren"   "1841-Harrison"   "1845-Polk"
    ## [16] "1849-Taylor"     "1853-Pierce"     "1857-Buchanan"
    ## [19] "1861-Lincoln"    "1865-Lincoln"    "1869-Grant"
    ## [22] "1873-Grant"      "1877-Hayes"      "1881-Garfield"
    ## [25] "1885-Cleveland"  "1889-Harrison"   "1893-Cleveland"
    ## [28] "1897-McKinley"   "1901-McKinley"   "1905-Roosevelt"
    ## [31] "1909-Taft"       "1913-Wilson"     "1917-Wilson"
    ## [34] "1921-Harding"    "1925-Coolidge"   "1929-Hoover"
    ## [37] "1933-Roosevelt"  "1937-Roosevelt"  "1941-Roosevelt"
    ## [40] "1945-Roosevelt"  "1949-Truman"     "1953-Eisenhower"
    ## [43] "1957-Eisenhower" "1961-Kennedy"    "1965-Johnson"
    ## [46] "1969-Nixon"      "1973-Nixon"      "1977-Carter"
    ## [49] "1981-Reagan"     "1985-Reagan"     "1989-Bush"
    ## [52] "1993-Clinton"    "1997-Clinton"    "2001-Bush"
    ## [55] "2005-Bush"       "2009-Obama"      "2013-Obama"
    # returns a list of parts
    parts <- strsplit(names(inaugTexts), '-')
    years <- sapply(parts, function(x) x[1])
    pres <-  sapply(parts, function(x) x[2])

    Examine the previous code carefully, as it uses list data types in R, which are fundamentally important to understand. In quanteda, the tokenizedTexts class of object – created when you call tokenize() on a character object or corpus – is a type of list. Try it:

    toks <- tokenize("This is a sentence containing some caractères Français.")

    Now examine the “structure” of that object – assigned to “toks” – using str(). What does it indicate?

    Try printing toks to the console, by simply typing its name and pressing Enter. Can you explain why it looks the way that it does? Hint: You can examine all available “methods” for an object class using the methods() function. Try methods(class = "tokenizedTexts"), and use the help function ?methods to explain what you see.
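
    Because tokenized texts are lists, it helps to be clear about the difference between [ and [[ when indexing a list. A quick base-R illustration, using the strsplit() result from earlier:

    ```r
    parts <- strsplit(c("1789-Washington", "1793-Washington"), "-")
    parts[[1]]    # the first element itself: a character vector
    ## [1] "1789"       "Washington"
    class(parts[1])  # single bracket returns a list of length one
    ## [1] "list"
    sapply(parts, function(x) x[1])  # extract the first part of every element
    ## [1] "1789" "1793"
    ```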

    4. Joining character objects together.

    The paste function is used to join character vectors together. The way in which the elements are combined depends on the values of the sep and collapse arguments:

    paste('one','two','three')
    ## [1] "one two three"
    paste('one','two','three', sep='_')
    ## [1] "one_two_three"
    paste(years, pres, sep='-')
    ##  [1] "1789-Washington" "1793-Washington" "1797-Adams"
    ##  [4] "1801-Jefferson"  "1805-Jefferson"  "1809-Madison"
    ##  [7] "1813-Madison"    "1817-Monroe"     "1821-Monroe"
    ## [10] "1825-Adams"      "1829-Jackson"    "1833-Jackson"
    ## [13] "1837-VanBuren"   "1841-Harrison"   "1845-Polk"
    ## [16] "1849-Taylor"     "1853-Pierce"     "1857-Buchanan"
    ## [19] "1861-Lincoln"    "1865-Lincoln"    "1869-Grant"
    ## [22] "1873-Grant"      "1877-Hayes"      "1881-Garfield"
    ## [25] "1885-Cleveland"  "1889-Harrison"   "1893-Cleveland"
    ## [28] "1897-McKinley"   "1901-McKinley"   "1905-Roosevelt"
    ## [31] "1909-Taft"       "1913-Wilson"     "1917-Wilson"
    ## [34] "1921-Harding"    "1925-Coolidge"   "1929-Hoover"
    ## [37] "1933-Roosevelt"  "1937-Roosevelt"  "1941-Roosevelt"
    ## [40] "1945-Roosevelt"  "1949-Truman"     "1953-Eisenhower"
    ## [43] "1957-Eisenhower" "1961-Kennedy"    "1965-Johnson"
    ## [46] "1969-Nixon"      "1973-Nixon"      "1977-Carter"
    ## [49] "1981-Reagan"     "1985-Reagan"     "1989-Bush"
    ## [52] "1993-Clinton"    "1997-Clinton"    "2001-Bush"
    ## [55] "2005-Bush"       "2009-Obama"      "2013-Obama"
    paste(years, pres, collapse='-')
    ## [1] "1789 Washington-1793 Washington-1797 Adams-1801 Jefferson-1805 Jefferson-1809 Madison-1813 Madison-1817 Monroe-1821 Monroe-1825 Adams-1829 Jackson-1833 Jackson-1837 VanBuren-1841 Harrison-1845 Polk-1849 Taylor-1853 Pierce-1857 Buchanan-1861 Lincoln-1865 Lincoln-1869 Grant-1873 Grant-1877 Hayes-1881 Garfield-1885 Cleveland-1889 Harrison-1893 Cleveland-1897 McKinley-1901 McKinley-1905 Roosevelt-1909 Taft-1913 Wilson-1917 Wilson-1921 Harding-1925 Coolidge-1929 Hoover-1933 Roosevelt-1937 Roosevelt-1941 Roosevelt-1945 Roosevelt-1949 Truman-1953 Eisenhower-1957 Eisenhower-1961 Kennedy-1965 Johnson-1969 Nixon-1973 Nixon-1977 Carter-1981 Reagan-1985 Reagan-1989 Bush-1993 Clinton-1997 Clinton-2001 Bush-2005 Bush-2009 Obama-2013 Obama"
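
    To see how sep and collapse interact: sep joins the vectors element-wise, and collapse then joins the resulting elements into a single string. A minimal illustration:

    ```r
    # sep acts element-wise; collapse then flattens the result to length one
    paste(c("a", "b"), c("1", "2"), sep = "-", collapse = "; ")
    ## [1] "a-1; b-2"
    ```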
    5. Manipulating case.

    tolower and toupper change the case of character objects:

    tolower(s1)
    ## [1] "this file contains many fascinating example sentences."
    toupper(s1)
    ## [1] "THIS FILE CONTAINS MANY FASCINATING EXAMPLE SENTENCES."

    These are also examples of “vectorized” functions: they work on whole vectors of values, rather than just single values. Try these functions on the character vectors below:

    sVec <- c("Quanteda is the Best Text Package Ever, approved by NATO!",
              "Quanteda является лучший текст пакет тех, утвержденной НАТО!")

    Try running tolower() on that vector. What results?

    quanteda has its own, smarter lowercase function, called toLower(). Try it on sVec. There is an option to preserve the acronym – try it a second time while preserving the acronym NATO as uppercase. To find out how, read the fine manual (RTFM): ?toLower.

  3. Counting and comparing objects.

    1. Comparing character objects.

    Character vectors can be compared using the == and %in% operators:

    tolower(s1) == toupper(s1)
    ## [1] FALSE
    'apples'=='oranges'
    ## [1] FALSE
    tolower(s1) == tolower(s1)
    ## [1] TRUE
    'pears' == 'pears'
    ## [1] TRUE
    c1 <- c('apples', 'oranges', 'pears')
    'pears' %in% c1
    ## [1] TRUE
    c2 <- c('bananas', 'pears')
    c2 %in% c1
    ## [1] FALSE  TRUE

    Extra credit: Try using this with the length() function to figure out how many times new occurs in the tokenized text of the 57th inaugural speech, which you can access as a quanteda built-in object as inaugTexts[57]. Hint: use %in% to return a logical vector, and then call sum() on the result, which coerces the logical values to 1s and 0s.
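
    The counting pattern the hint describes can be illustrated on a made-up token vector (not the inaugural text itself):

    ```r
    # sum() coerces the logical vector returned by %in% to 1s and 0s
    toyToks <- c("new", "year", "new", "start")
    sum(toyToks %in% "new")
    ## [1] 2
    ```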

    2. Pattern matching

    The base functions for searching and replacing within text are similar to the familiar grep and gsub commands from other text manipulation environments. The grep manual page provides an overview of these functions.

    The grep command searches a character vector for a pattern, returning the indices of the elements in which it occurs:

    grep('orange', 'these are oranges')
    ## [1] 1
    grep('pear', 'these are oranges')
    ## integer(0)
    grep('orange', c('apples', 'oranges', 'pears'))
    ## [1] 2
    grep('pears', c('apples', 'oranges', 'pears'))
    ## [1] 3
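
    A close relative of grep() is grepl(), which returns a logical vector rather than indices; this is often handier for filtering:

    ```r
    # one TRUE/FALSE per element, in the same order as the input
    grepl("orange", c("apples", "oranges", "pears"))
    ## [1] FALSE  TRUE FALSE
    ```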

    The gsub command substitutes a replacement string for all occurrences of a pattern within a string:

    gsub('oranges', 'apples', 'these are oranges')
    ## [1] "these are apples"
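
    Note also the difference between sub(), which replaces only the first occurrence of the pattern, and gsub(), which replaces every occurrence:

    ```r
    sub("s", "S", "sassy")   # first match only
    ## [1] "Sassy"
    gsub("s", "S", "sassy")  # all matches
    ## [1] "SaSSy"
    ```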
  4. Making a corpus and corpus structure

    1. From a vector of texts already in memory.

      The simplest way to create a corpus is to use a vector of texts already present in R’s global environment. Some text and corpus objects are built into the package, for example inaugTexts is the UTF-8 encoded set of 57 presidential inaugural addresses. Try using corpus() on this set of texts to create a corpus.

      Once you have constructed this corpus, use the summary() method to see a brief description of the corpus. The names of the character vector inaugTexts should have become the document names.

    2. From a directory of text files.

      The corpus() function can take as its main argument the name of a directory, if you wrap the path to the directory within a directory() call. (See ?directory for an example.) If you call directory() with no arguments, then it should allow you to choose the directory interactively (you will need to have installed the tcltk2 package first, though).

      Here you are encouraged to select any directory of plain text files of your own.
      How did it work? Try using docvars() to assign a set of document-level variables.

      Note that if you include document-level metadata in your filenames, then this can be automatically parsed by corpus.directory() into docvars.

      # mytf <- textfile("~/Dropbox/QUANTESS/corpora/home_office_animals/txts/*.txt", encoding = "UTF-8")
      mytf <- textfile("~/Dropbox/QUANTESS/corpora/amicus/all/*.txt")
      mycorpus <- corpus(mytf)
      summary(mycorpus, 5)
      ## Corpus consisting of 102 documents, showing 5 documents.
      ##
      ##       Text Types Tokens Sentences
      ##  sAP01.txt  1660   6441       256
      ##  sAP02.txt  1913   6645       393
      ##  sAP03.txt  1958   8123       475
      ##  sAP04.txt  1258   4922       232
      ##  sAP05.txt  2031   7375       372
      ##
      ## Source:  /Users/kbenoit/Dropbox/Classes/Trinity/Text Analysis 2016/Exercises/Exercise 1/* on x86_64 by kbenoit
      ## Created: Sun Feb  7 18:41:50 2016
      ## Notes:
    3. There are many other ways to create a corpus, most using the intermediate function textfile() to read texts into R. Explore these ways by studying ?textfile. Can you reproduce the examples?