### Describing and comparing texts

This exercise covers the material from Day 2, for describing texts using quanteda. We will also be using the quantedaData package, which contains some additional corpora not in the base quanteda package.

To install the quantedaData package, use:

devtools::install_github("kbenoit/quantedaData")

This requires that you have first installed the devtools package.

1. Preparing and pre-processing texts

1. “Cleaning” texts

It is common to “clean” texts before processing, usually by removing punctuation, digits, and converting to lower case.

“Cleaning” in quanteda takes place through decisions made at the tokenization stage. In order to count word frequencies, we first need to split the text into words through a process known as tokenization. Look at the documentation for quanteda’s tokenize command using the built-in help function (? before any object/command name). Use the tokenize command on exampleString (a built-in data object in the quanteda package), and examine the results.

Tokenize this text:

1. with punctuation removed

2. with hyphenated words broken up and hyphens removed

3. with punctuation removed and converted to lowercase

4. segmented into sentences without other cleaning applied.
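A sketch of these four calls, using the tokenize() argument names (removePunct, removeHyphens, what) and the toLower() helper from the quanteda version this exercise assumes:

```r
require(quanteda)
# (1) punctuation removed
tokenize(exampleString, removePunct = TRUE)
# (2) hyphenated words broken up and hyphens removed
tokenize(exampleString, removeHyphens = TRUE)
# (3) punctuation removed and converted to lower case
tokenize(toLower(exampleString), removePunct = TRUE)
# (4) sentence segmentation, with no other cleaning applied
tokenize(exampleString, what = "sentence")
```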

2. Stemming.

Stemming removes suffixes using the Porter stemmer, provided by the SnowballC package. The quanteda function to invoke the stemmer is wordstem(). Apply stemming to exampleString and examine the results. Why does it not work, and what do you need to do to make it work? How would you apply this to the sentence-segmented vector?
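A sketch of the underlying issue: wordstem() expects a character vector of individual words, so the string must be tokenized first (argument names again assume the older quanteda API):

```r
require(quanteda)
# applied to the raw string, wordstem() treats it as a single "word";
# tokenize first, then stem the resulting word vector
toks <- tokenize(exampleString, removePunct = TRUE)
wordstem(toks[[1]])

# for a sentence-segmented vector, tokenize each sentence and stem in turn
sents <- tokenize(exampleString, what = "sentence")[[1]]
lapply(tokenize(sents), wordstem)
```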

3. Applying pre-processing to the creation of a dfm.

quanteda’s dfm() function by default applies certain “cleaning” steps to the text, which are not the defaults in tokenize(). Create a dfm from exampleString. What are the differences between the steps applied by dfm() and the default settings for tokenize()?
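A sketch for making the comparison, with defaults as in the quanteda version assumed here:

```r
require(quanteda)
myDfm <- dfm(exampleString)
features(myDfm)
# dfm() lower-cases and removes punctuation by default, whereas
# tokenize() leaves case and punctuation untouched unless asked
tokenize(exampleString)
```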

Compare the steps required in a similar text preparation package, tm:

require(tm)
data(crude)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, stemDocument)
(dtm <- DocumentTermMatrix(crude))

# same in quanteda
require(quanteda)
crudeCorpus <- corpus(crude)
(crudeDfm <- dfm(crudeCorpus))

Inspect the dimensions of the resulting objects, including the names of the words extracted as features. It is also worth comparing the structure of the document-feature matrices returned by each package. tm uses the simple triplet matrix format from the slam package to represent a sparse matrix.

It is also – in fact almost always – useful to inspect the structure of this object:

str(dtm)
str(crudeDfm)

This indicates that we can extract the names of the words from the tm DocumentTermMatrix object using its Terms() accessor:

head(Terms(dtm), 20)

Compare this to the results of the same operations from quanteda. To get the “words” from a quanteda object, you can use the features() function:

head(features(crudeDfm), 20)

Why are the first 20 terms listed different between the two packages?

What proportion of the entries in crudeDfm are zeros? Compare the sizes of dtm and crudeDfm using the object.size() function.
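Sketches for both computations, assuming the dtm and crudeDfm objects created above:

```r
# proportion of zero entries in the quanteda dfm
sum(crudeDfm == 0) / prod(dim(crudeDfm))

# compare the memory footprints of the two sparse representations
object.size(dtm)
object.size(crudeDfm)
```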

Now we will detach the tm package to remove it from our search path.

search()
detach("package:tm")
search()
2. Key-words-in-context

1. quanteda provides a keyword-in-context function that is easily usable and configurable to explore texts in a descriptive way. Type ?kwic to view the documentation.

2. Using the ie2010Corpus object, examine the context for the word “Irish”. What is its predominant usage?

3. Use the kwic function to discover the context of the word “clean”. Is this associated with environmental policy?

4. Examine the context of words related to “disaster”. Hint: you can use the stem of the word along with setting the regex argument to TRUE. Execute a query using a pattern match that returns different variations of words based on “disaster” (such as disasters, disastrous, disastrously, etc.).

5. Load the text of Herman Melville’s Moby Dick and assign a KWIC search for “Ahab” to an object named kwicAhab.
Examine the structure of this object. Now plot it and describe what this plot represents. Note: You might want to resize this so that the height is just small enough to see the lines.

To access Moby Dick, use the syntax below. (This file is distributed with quanteda but is a bit tricky to access since it is compressed “raw” text data rather than an R formatted object.)

require(quanteda, warn.conflicts = FALSE, quietly = TRUE)
mobydicktf <- textfile(unzip(system.file("extdata", "pg2701.txt.zip", package = "quanteda")))
mobydickCorpus <- corpus(mobydicktf)
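Sketches for the KWIC exercises above; the regex argument follows the wording of the exercise, and plot() applied to a kwic object draws a lexical dispersion plot in the quanteda version assumed here:

```r
require(quanteda)
# context for selected words in the 2010 Irish budget corpus
kwic(ie2010Corpus, "Irish")
kwic(ie2010Corpus, "clean")
# the stem plus regex = TRUE matches disasters, disastrous, disastrously, ...
kwic(ie2010Corpus, "disast", regex = TRUE)

# KWIC for "Ahab" in Moby Dick, then a dispersion plot of the hits
kwicAhab <- kwic(mobydickCorpus, "Ahab")
str(kwicAhab)
plot(kwicAhab)
```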
3. Descriptive statistics

1. We can extract basic descriptive statistics from a corpus from its document feature matrix. Make a dfm from the 2010 Irish budget speeches corpus.

2. Examine the most frequent word features using topfeatures(). What are the five most frequent words in the corpus?

3. Compute the average syllable length for each text in the ie2010Corpus object. To do this, you will use two vectorized functions: ntoken() and syllables(). But be careful, since one of these two functions works fine for a corpus class object, but the other does not. The help pages for each function will tell you which works with what sort of object. (Hint: see also texts().)
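A sketch of these two exercises, assuming (per the older API) that syllables() returns total syllable counts per element when given a character vector:

```r
require(quanteda)
# five most frequent features in the budget speeches
ieDfm <- dfm(ie2010Corpus)
topfeatures(ieDfm, 5)

# syllables() does not accept a corpus, so extract the texts first;
# ntoken() works on the corpus directly
syllables(texts(ie2010Corpus)) / ntoken(ie2010Corpus)
```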

4. Lexical diversity and reading difficulty

1. Compare the post-1960 inaugural speeches grouped by president, in terms of their lexical diversity, using the CTTR measure. Plot these in a dotchart.

2. Compare the post-1960 inaugural speeches (not grouped) in terms of their readability, on both the Flesch-Kincaid and FOG indexes. Plot these values by year, connecting them using the type = "b" option to the base package’s plot() method. (You are welcome to use ggplot2 if you prefer that graphics package.)

3. Extra credit: What is/are the word(s) with the highest Scrabble value spoken in Obama’s 2013 inaugural speech?
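One way this might look with the older quanteda API, where lexdiv() and readability() are assumed to be the lexical diversity and readability functions (both names, and the Year/President docvars of inaugCorpus, are assumptions to verify against your installed version):

```r
require(quanteda)
recentCorpus <- subset(inaugCorpus, Year > 1960)

# lexical diversity (CTTR), grouped by president
presDfm <- dfm(recentCorpus, groups = "President")
dotchart(lexdiv(presDfm, "CTTR"))

# readability by year, ungrouped, on two indexes
fk  <- readability(recentCorpus, "Flesch.Kincaid")
fog <- readability(recentCorpus, "FOG")
plot(docvars(recentCorpus, "Year"), fk, type = "b", ylab = "Readability")
lines(docvars(recentCorpus, "Year"), fog, type = "b", col = "red")
```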

5. Document and word associations

1. Load the presidential State of the Union (SOTU) corpus. Select just the speeches since 1980, and create a dfm from this corpus, after removing quanteda’s built-in list of English stop words and stemming the terms.

To load in the SOTU corpus, make sure you have installed the package quantedaData, and use this command:

data(SOTUCorpus, package = "quantedaData")

For selection, see ?subset.corpus. In R’s “S3” object-oriented system, the .corpus means that this specific method will dispatch when supplied a corpus class object as its first argument. When calling this method for subset, you do not actually need to type subset.corpus(), just subset(). (But you may also type the full name if you wish.) Many methods in R have multiple definitions for different object classes, and understanding this and how it works will serve you well as you learn more R. quanteda is very object-oriented, and this is why many of the same methods can be applied to both corpus and dfm class objects.
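A sketch of the selection and dfm creation; the docvar holding the year (here assumed to be named year — check docvars(SOTUCorpus) for the actual name) and the object name sotuDfm are choices made for this sketch:

```r
require(quanteda)
data(SOTUCorpus, package = "quantedaData")
# subset() dispatches to subset.corpus for a corpus object
recentSOTU <- subset(SOTUCorpus, year >= 1980)
sotuDfm <- dfm(recentSOTU,
               ignoredFeatures = stopwords("english"),
               stem = TRUE)
```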

This also explains why you get certain warning messages when you attach the quanteda package, e.g.

detach("package:quanteda")
require(quanteda)
## Loading required package: quanteda
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:stats':
##
##     df
## The following object is masked from 'package:base':
##
##     sample

Here, the object df() (a function) from the stats package – which is one of the standard packages that is always attached when you start R – has been superseded in priority on R’s search path by another object (also a function) called df() from the quanteda package. Compare the two using ?df, where you should see two versions listed.

Compute the density for an F distribution for a value of 5, with df1 = 1 and df2 = 4. How do you need to address this function to call it?
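Because quanteda’s df() now has higher priority on the search path, the F density must be addressed with its namespace prefix:

```r
# density of the F distribution at 5, with df1 = 1 and df2 = 4
stats::df(5, df1 = 1, df2 = 4)
# quanteda's df() (document frequency) still dispatches normally
# when its first argument is a dfm
```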

Now compute the document frequency of the texts in your selected SOTUCorpus-based corpus. How can you address that function?

Extra credit: Can you describe what the warning messages about the sample object mean?

2. Measure the document similarities using similarity(). Compare the results for the correlation and the cosine methods, with and without first converting the dfm to relative frequencies (also known as “normalization”).

3. Measure the term similarities for the following words: economy, health, women, using cosine distance.

4. Extra credit: Now weight the dfm object using tf-idf weights, and recompute the term similarity matrix. How different are the results? Compute the same similarities for the tf-idf weighted results for a similar dfm but without removing stopwords. How different are these results?
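Sketches for the similarity exercises, assuming the stemmed dfm from exercise 1 is stored as sotuDfm (a name chosen for this sketch) and that weight() with types "relFreq" and "tfidf" performs normalization and tf-idf weighting, as in the older quanteda API:

```r
require(quanteda)
# document similarities, raw vs. normalized to relative frequencies
similarity(sotuDfm, margin = "documents", method = "cosine")
similarity(weight(sotuDfm, "relFreq"),
           margin = "documents", method = "correlation")

# term similarities; note that in a stemmed dfm these words appear
# as their stems (e.g. "economi" rather than "economy")
similarity(sotuDfm, c("economi", "health", "women"),
           margin = "features", method = "cosine")

# tf-idf weighting, then recompute the term similarities
similarity(weight(sotuDfm, "tfidf"), c("economi", "health", "women"),
           margin = "features", method = "cosine")
```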