Revised: 19 October 2014

### Describing and comparing texts

This exercise covers the material from Days 2 and 3, for describing and comparing texts using quanteda. We will also be using the quantedaData package, which contains some additional corpora not in the base quanteda package.

Always remember to install a fresh copy of both packages, and from the dev branch for quanteda:

if (!require(devtools)) install.packages("devtools", dependencies=TRUE)
devtools::install_github("kbenoit/quanteda", dependencies=TRUE, ref="dev")

1. Preparing and pre-processing texts

1. “Cleaning” texts

It is common to “clean” texts before processing, usually by removing punctuation and digits and converting to lower case. Look at the documentation for quanteda’s clean command (?clean) and use the command on the exampleString text (you can load this from quantedaData using data(exampleString)). Can you think of cases where cleaning could introduce homonymy?
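As a rough illustration of what cleaning involves, the base R sketch below lowercases a string and strips digits and punctuation. The helper name simple_clean and the example sentence are made up for this illustration; quanteda's own clean() may differ in its details and options.

```r
# Hypothetical helper: a bare-bones version of text cleaning in base R.
# This is a sketch of the idea, not quanteda's clean().
simple_clean <- function(x) {
  x <- tolower(x)                        # convert to lower case
  x <- gsub("[[:digit:]]+", "", x)       # remove digits
  x <- gsub("[[:punct:]]+", "", x)       # remove punctuation
  gsub("[[:space:]]+", " ", trimws(x))   # tidy up leftover whitespace
}

# Made-up example sentence
txt <- "The US gave us 20 apples in 2014. Polish workers polish them!"
cleaned <- simple_clean(txt)
cleaned
```

After cleaning, "US" and "us" (and "Polish" and "polish") can no longer be told apart, which is one way that cleaning can introduce homonymy.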

2. Tokenizing texts

In order to count word frequencies, we first need to split the text into words through a process known as tokenization. Look at the documentation for quanteda’s tokenize command using the built-in help function (? before any object/command). Use the tokenize command on exampleString, and examine the results. Are there cases where it is unclear where the boundary between two words lies? You can experiment with the options to tokenize.

Try segmenting the text of exampleString into sentences, using segmentSentence. What sort of object is returned if you tokenize the segmented sentence object?
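To see why word boundaries can be ambiguous, it may help to compare two naive tokenizers written in base R. This is only a sketch with a made-up sentence; it does not reproduce quanteda's tokenize() or any of its options.

```r
# Made-up example sentence with hyphens, apostrophes and a formatted number
txt <- "Mr. O'Neill's state-of-the-art plan didn't cost 1,000 euro."

# (a) Split on runs of whitespace: punctuation stays attached to the words
tokens_ws <- strsplit(txt, "[[:space:]]+")[[1]]

# (b) Split on anything that is not a letter or digit: hyphens, apostrophes
#     and the comma inside the number now break tokens apart
tokens_alnum <- strsplit(txt, "[^[:alnum:]]+")[[1]]

tokens_ws
tokens_alnum

# The two rules disagree on how many "words" the sentence contains
length(tokens_ws)
length(tokens_alnum)
```

Neither rule is obviously right: the first keeps "euro." with its full stop, while the second turns "didn't" into "didn" and "t".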

3. Stemming.

Stemming strips suffixes from words so that related forms reduce to a common stem. quanteda uses the Porter stemmer, found in the SnowballC package, and the quanteda function to invoke the stemmer is wordstem. Apply stemming to the exampleString and examine the results. Why does it not appear to work, and what do you need to do to make it work? How would you apply this to the sentence-segmented vector?

4. Applying pre-processing to the creation of a dfm.

quanteda’s dfm() function makes it easy to pass the cleaning arguments to clean, which are executed as part of the tokenization implemented by dfm(). Compare the steps required in a similar text preparation package, tm:

require(tm)
data("crude")
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, stemDocument)
tdm <- TermDocumentMatrix(crude)

# same in quanteda
require(quanteda)
crudeCorpus <- corpus(crude)
crudeDfm <- dfm(crudeCorpus)
## Creating dfm from a corpus: ... done.

Inspect the dimensions of the resulting objects, including the names of the words extracted as features. It is also worth comparing the structure of the document-feature matrices returned by each package. tm uses the simple triplet matrix format from the slam package to represent a sparse matrix.

It is also – in fact almost always – useful to inspect the structure of this object:

str(tdm)
## List of 6
##  $ i       : int [1:1954] 49 86 110 148 166 167 178 183 184 195 ...
##  $ j       : int [1:1954] 1 1 1 1 1 1 1 1 1 1 ...
##  $ v       : num [1:1954] 1 2 1 1 1 1 2 1 1 2 ...
##  $ nrow    : int 943
##  $ ncol    : int 20
##  $ dimnames:List of 2
##   ..$ Terms: chr [1:943] "abdulaziz" "abil" "abl" "about" ...
##   ..$ Docs : chr [1:20] "127" "144" "191" "194" ...
##  - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"

This indicates that we can extract the words from the tm TermDocumentMatrix object by taking the Terms element of its dimnames, which holds the row names of the tdm:

head(tdm$dimnames$Terms, 20)
##  [1] "abdulaziz" "abil"      "abl"       "about"     "abov"
##  [6] "above"     "abroad"    "accept"    "accord"    "across"

Compare this to the results of the same operations from quanteda. To get the “words” from a quanteda object, you can use the features() function:

features_quanteda <- features(crudeDfm)
head(features_quanteda, 20)
##  [1] "a"         "abdulaziz" "abil"      "abl"       "about"
##  [6] "abov"      "above"     "abroad"    "accept"    "accord"
str(crudeDfm)
##  int [1:20, 1:973] 5 7 2 3 1 8 10 2 4 4 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ docs    : chr [1:20] "reut-00001.xml" "reut-00002.xml" "reut-00004.xml" "reut-00005.xml" ...
##   ..$ features: chr [1:973] "a" "abdulaziz" "abil" "abl" ...
##  - attr(*, "class")= chr [1:2] "dfm" "matrix"
##  - attr(*, "settings")=List of 13
##   ..$ stopwords          : NULL
##   ..$ collocations       : NULL
##   ..$ dictionary         : NULL
##   ..$ dictionary_regex   : logi FALSE
##   ..$ stem               : logi FALSE
##   ..$ delimiter_word     : chr " "
##   ..$ delimiter_sentence : chr ".!?"
##   ..$ delimiter_paragraph: chr "\n\n"
##   ..$ clean_tolower      : logi TRUE
##   ..$ clean_removeDigits : logi TRUE
##   ..$ clean_removePunct  : logi TRUE
##   ..$ units              : chr "documents"
##   ..$ unitsoriginal      : chr "documents"

What proportion of the cells in crudeDfm are zeros? Compare the sizes of tdm and crudeDfm using the object.size() function.
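The idea behind both questions can be tried out on a toy matrix first. The sketch below uses made-up counts and base R only: it computes the share of zero cells in a dense matrix and then lists the non-zero cells as (i, j, v) triplets, which is the kind of information a simple triplet matrix stores.

```r
# Toy document-feature matrix: 3 documents x 5 features (made-up counts)
toy <- matrix(c(2, 0, 0, 1, 0,
                0, 0, 3, 0, 0,
                0, 1, 0, 0, 4),
              nrow = 3, byrow = TRUE,
              dimnames = list(docs = paste0("doc", 1:3),
                              features = c("oil", "price", "opec",
                                           "barrel", "crude")))

# Proportion of cells equal to zero (the sparsity of the matrix)
sparsity <- mean(toy == 0)
sparsity

# The non-zero cells as (row, column, value) triplets
nz <- which(toy != 0, arr.ind = TRUE)
triplets <- data.frame(i = nz[, 1], j = nz[, 2], v = toy[nz])
triplets

# Memory used by the dense representation
object.size(toy)
```

The same mean(x == 0) expression should work on any object that behaves like an ordinary numeric matrix; whether crudeDfm does so directly depends on the quanteda version in use.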

2. Keywords-in-context

1. quanteda provides a keyword-in-context function that is easily usable and configurable to explore texts in a descriptive way. Type ?kwic to view the documentation.

2. Load the Irish budget debate speeches for the year 2010 using

data(ie2010Corpus)

and experiment with the kwic function, following the syntax specified on the help page for kwic. kwic can be used either on a character vector or a corpus object. What class of object is returned? Try assigning the return value from kwic to a new object and then examine the object by clicking on it in the environment pane in RStudio (or using the inspection method of your choice).

3. Use the kwic function to discover the context of the word “clean”. Is this associated with environmental policy?

4. Examine the context of words related to “disaster”. Hint: you can use the stem of the word along with setting the regex argument to TRUE.
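To make the idea of a keyword-in-context display concrete, here is a minimal base R sketch. The function simple_kwic, its arguments, and the example sentence are all hypothetical and are not part of quanteda; the real kwic function has its own arguments and return format, described on its help page.

```r
# Hypothetical helper: a minimal keyword-in-context function in base R.
# A sketch of the idea only, not a reimplementation of quanteda's kwic().
simple_kwic <- function(txt, pattern, window = 3) {
  # crude tokenization: lower-case, split on non-alphanumeric characters
  tokens <- strsplit(tolower(txt), "[^[:alnum:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  n <- length(tokens)
  pos <- seq_len(n)
  hits <- grep(pattern, tokens)   # regular-expression match on each token
  pre <- vapply(hits, function(i) {
    paste(tokens[pos >= i - window & pos < i], collapse = " ")
  }, character(1))
  post <- vapply(hits, function(i) {
    paste(tokens[pos > i & pos <= i + window], collapse = " ")
  }, character(1))
  data.frame(position = hits, pre = pre, keyword = tokens[hits],
             post = post, stringsAsFactors = FALSE)
}

# Made-up example text; "^disast" matches any token starting with that stem
txt <- "The floods were a disaster. Such disasters are costly, and disastrous planning made it worse."
result <- simple_kwic(txt, "^disast", window = 2)
result
```

Matching on the stem with a regular expression picks up "disaster", "disasters" and "disastrous" in one pass, which is the point of the hint in step 4.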

3. Descriptive statistics

1. We can extract basic descriptive statistics for a corpus from its document-feature matrix. Make a dfm from the 2010 Irish budget speeches corpus.

2. Examine the most frequent word features using topfeatures. What are the five most frequent words in the corpus?

3. quanteda provides a function to count syllables in a word: countSyllables. Try the function at the prompt. The code below applies this function to every text in the corpus and sums the results, giving the total number of syllables in the corpus.

# count syllables from texts in the 2010 speech corpus
textSyls <- countSyllables(texts(ie2010Corpus))
# sum the syllable counts
totalSyls <- sum(textSyls)

4. Now compute the readability measure known as the Flesch Reading Ease score, one of the Flesch-Kincaid readability tests. The formula is:

$206.835 - 1.015 \left( \frac{\text{total tokens}}{\text{total sentences}} \right) - 84.6 \left( \frac{\text{total syllables}}{\text{total tokens}} \right)$

You should now have the values for these variables. Use them to calculate this readability score for the Irish budget speeches.
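The formula above translates directly into a small R function. The function name and the counts passed to it below are made up for illustration; substitute the totals you obtained from the corpus.

```r
# The readability formula above, written as a function of the three totals
readability_score <- function(tokens, sentences, syllables) {
  206.835 - 1.015 * (tokens / sentences) - 84.6 * (syllables / tokens)
}

# Made-up counts for illustration only (not taken from the corpus):
# 1000 tokens in 50 sentences, containing 1500 syllables
example_score <- readability_score(tokens = 1000, sentences = 50,
                                   syllables = 1500)
example_score
```

Longer sentences and longer words both lower the score, since each ratio enters the formula with a negative coefficient.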

4. Lexical Diversity over Time

1. We can plot the type-token ratio of the Irish budget speeches over time. To do this, begin by extracting a subset of iebudgets that contains only the first speaker from each year:

data(iebudgets)
finMins <- subset(iebudgets, no=="01")
tokeninfo <- summary(finMins)
## Corpus consisting of 5 documents.
##
##                                Text Types Tokens Sentences year    debate
##    2012_BUDGET_01_Michael_Noonan_FG  1538   6450       294 2012    BUDGET
##     2011_BUDGET_01_Brian_Lenihan_FF  1537   7094       371 2011    BUDGET
##     2010_BUDGET_01_Brian_Lenihan_FF  1655   7799       390 2010    BUDGET
##  2009_BUDGETSUP_01_Brian_Lenihan_FF  1632   7570       410 2009 BUDGETSUP
##       2008_BUDGET_01_Brian_Cowen_FF  1715   8815       417 2008    BUDGET
##  no namefirst namelast party
##  01   Michael   Noonan    FG
##  01     Brian  Lenihan    FF
##  01     Brian  Lenihan    FF
##  01     Brian  Lenihan    FF
##  01     Brian    Cowen    FF
##
## Source:  /Users/kbenoit/Dropbox/QUANTESS/quantedaData_kenlocal_gh/* on x86_64 by kbenoit.
## Created: Mon Sep 15 14:42:09 2014.
## Notes:   .

Note the quotation marks around "01" in the subset condition. Why are these required here?

2. Get the type-token ratio for each text from this subset, and plot the resulting vector of TTRs as a function of the year.

3. Now compare the results from the statLexdiv function applied to the texts. Are the results the same?
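As a sketch of step 2, the type-token ratio can be computed from the Types and Tokens columns of the summary output shown above. To keep the example self-contained, the values are re-entered by hand; the data frame and column names used here are choices made for this illustration.

```r
# Types and tokens copied from the summary(finMins) output shown above
ttr_data <- data.frame(
  year   = c(2012, 2011, 2010, 2009, 2008),
  types  = c(1538, 1537, 1655, 1632, 1715),
  tokens = c(6450, 7094, 7799, 7570, 8815)
)

# Type-token ratio: number of distinct word types divided by total tokens
ttr_data$ttr <- ttr_data$types / ttr_data$tokens

# Order by year so the line is drawn from the earliest speech to the latest
ttr_data <- ttr_data[order(ttr_data$year), ]
round(ttr_data$ttr, 3)

plot(ttr_data$year, ttr_data$ttr, type = "b",
     xlab = "Year", ylab = "Type-token ratio")
```

If tokeninfo is a data frame with Types and Tokens columns, as the printed output suggests, the same ratio could presumably be taken from it directly. Keep in mind that the TTR tends to fall as texts get longer, so speeches of different lengths are not strictly comparable.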

5. Document and word associations

1. Load the presidential inauguration corpus, selecting the speeches from 1900-1950, and create a dfm from this corpus.

2. Measure the document similarities using simil(). Compare the results for Euclidean, Euclidean on the term frequency standardized dfm, cosine, and Jaccard.

3. Measure the term similarities for the following words: economy, health, women.
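Before comparing the measures on the real dfm, it can help to see how they behave on a toy example. The sketch below defines the three measures by hand in base R on a made-up matrix; it does not use simil(), whose arguments and output format should be checked in its documentation.

```r
# Toy document-feature matrix with made-up counts
m <- matrix(c(2, 1, 0, 0,
              4, 2, 0, 0,
              0, 0, 3, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("docA", "docB", "docC"),
                            c("economy", "health", "women", "war")))

euclidean <- function(x, y) sqrt(sum((x - y)^2))
cosine    <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))
# Jaccard on presence/absence: shared features over features used by either
jaccard   <- function(x, y) sum(x > 0 & y > 0) / sum(x > 0 | y > 0)

# docB is docA doubled: same words, same proportions, twice the length
d_raw <- euclidean(m["docA", ], m["docB", ])          # sensitive to length
m_rel <- m / rowSums(m)                               # counts as proportions
d_rel <- euclidean(m_rel["docA", ], m_rel["docB", ])  # length removed
s_cos <- cosine(m["docA", ], m["docB", ])
s_jac <- jaccard(m["docA", ], m["docB", ])
s_cos_ac <- cosine(m["docA", ], m["docC", ])          # no shared words

c(d_raw = d_raw, d_rel = d_rel, s_cos = s_cos, s_jac = s_jac,
  s_cos_ac = s_cos_ac)

# Term similarity uses the same functions on columns instead of rows
cosine(m[, "economy"], m[, "health"])
```

Euclidean distance on raw counts treats docA and docB as different because one is longer, while the standardized version, cosine, and Jaccard all treat them as identical. This is the kind of contrast to look for in step 2.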