Revised: 19 October 2014
This exercise covers the material from Days 2 and 3, for describing and comparing texts using quanteda. We will also be using the quantedaData package, which contains some additional corpora not in the base quanteda package.
Always remember to install a fresh copy of both packages, and from the dev
branch for quanteda:
# install devtools if not already present
if (!require(devtools)) install.packages("devtools", dependencies=TRUE)
# install quanteda from the dev branch, plus the quantedaData package
devtools::install_github("kbenoit/quanteda", dependencies=TRUE, ref="dev")
devtools::install_github("kbenoit/quantedaData")
Preparing and pre-processing texts
“Cleaning” texts
It is common to “clean” texts before processing, usually by removing punctuation and digits, and by converting to lower case. Look at the documentation for quanteda’s clean command using ?clean, and use the command on the exampleString text (you can load this from quantedaData using data(exampleString)). Can you think of cases where cleaning could introduce homonymy? (Hint: consider what lowercasing does to an acronym such as “US”.)
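A minimal sketch of applying the command, assuming clean() accepts a character vector:
require(quanteda)
require(quantedaData)
data(exampleString)
# lowercase the text and strip punctuation and digits in one pass
clean(exampleString)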
Tokenizing texts
In order to count word frequencies, we first need to split the text into words through a process known as tokenization. Look at the documentation for quanteda’s tokenize command using the built-in help function (type ? before any object or command name). Use the tokenize command on exampleString and examine the results. Are there cases where it is unclear where the boundary between two words lies? You can experiment with the options to tokenize.
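A minimal sketch (the exact return type varies across early quanteda versions, so inspect it with str()):
toks <- tokenize(exampleString)
str(toks)
# look at the first few tokens
head(unlist(toks), 20)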
Try reshaping the text of exampleString into sentences, using segmentSentence. What sort of object is returned if you tokenize the segmented sentence object?
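A sketch of the two steps, assuming segmentSentence() returns a character vector of sentences:
sents <- segmentSentence(exampleString)
# tokenizing a multi-element character vector: note the class of the result
sentToks <- tokenize(sents)
class(sentToks)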
Stemming
Stemming removes suffixes using the Porter stemmer, found in the SnowballC library. The quanteda function to invoke the stemmer is wordstem. Apply stemming to exampleString and examine the results. Why does it not appear to work, and what do you need to do to make it work? How would you apply this to the sentence-segmented vector?
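One possible approach, sketched with the caveat that wordstem() operates on individual words:
# applied to the untokenized string, the whole text is treated as one "word"
wordstem(exampleString)
# tokenize first, then stem each token
wordstem(unlist(tokenize(exampleString)))
# for the sentence-segmented vector, stem each sentence's tokens in turn
lapply(tokenize(sents), wordstem)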
Applying pre-processing to the creation of a dfm
quanteda’s dfm() function makes it easy to pass the cleaning arguments to clean, which are executed as part of the tokenization implemented by dfm(). Compare the steps required in a similar text preparation package, tm:
require(tm)
## Loading required package: tm
## Loading required package: NLP
data("crude")
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, stemDocument)
tdm <- TermDocumentMatrix(crude)
# same in quanteda
require(quanteda)
## Loading required package: quanteda
## Loading required package: data.table
crudeCorpus <- corpus(crude)
crudeDfm <- dfm(crudeCorpus)
## Creating dfm from a corpus: ... done.
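To mirror the tm pipeline more fully, the stemming step can be requested directly in the same call; a sketch, assuming the stem argument recorded in the dfm settings shown further below:
crudeDfmStemmed <- dfm(crudeCorpus, stem = TRUE)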
Inspect the dimensions of the resulting objects, including the names of the words extracted as features. It is also worth comparing the structure of the document-feature matrices returned by each package. tm uses the slam simple triplet matrix format for representing a sparse matrix.
It is almost always useful to inspect the structure of this object:
str(tdm)
## List of 6
## $ i : int [1:1954] 49 86 110 148 166 167 178 183 184 195 ...
## $ j : int [1:1954] 1 1 1 1 1 1 1 1 1 1 ...
## $ v : num [1:1954] 1 2 1 1 1 1 2 1 1 2 ...
## $ nrow : int 943
## $ ncol : int 20
## $ dimnames:List of 2
## ..$ Terms: chr [1:943] "abdulaziz" "abil" "abl" "about" ...
## ..$ Docs : chr [1:20] "127" "144" "191" "194" ...
## - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
## - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
This indicates that we can extract the names of the words from the tm TermDocumentMatrix object by getting the Terms element of its dimnames:
head(tdm$dimnames$Terms, 20)
## [1] "abdulaziz" "abil" "abl" "about" "abov"
## [6] "above" "abroad" "accept" "accord" "across"
## [11] "act" "activity" "add" "added" "address"
## [16] "adher" "advantag" "advisers" "after" "again"
Compare this to the results of the same operations from quanteda. To get the “words” from a quanteda object, you can use the features() function:
features_quanteda <- features(crudeDfm)
head(features_quanteda, 20)
## [1] "a" "abdulaziz" "abil" "abl" "about"
## [6] "abov" "above" "abroad" "accept" "accord"
## [11] "across" "act" "activity" "ad" "add"
## [16] "added" "address" "adher" "advantag" "advisers"
str(crudeDfm)
## int [1:20, 1:973] 5 7 2 3 1 8 10 2 4 4 ...
## - attr(*, "dimnames")=List of 2
## ..$ docs : chr [1:20] "reut-00001.xml" "reut-00002.xml" "reut-00004.xml" "reut-00005.xml" ...
## ..$ features: chr [1:973] "a" "abdulaziz" "abil" "abl" ...
## - attr(*, "class")= chr [1:2] "dfm" "matrix"
## - attr(*, "settings")=List of 13
## ..$ stopwords : NULL
## ..$ collocations : NULL
## ..$ dictionary : NULL
## ..$ dictionary_regex : logi FALSE
## ..$ stem : logi FALSE
## ..$ delimiter_word : chr " "
## ..$ delimiter_sentence : chr ".!?"
## ..$ delimiter_paragraph: chr "\n\n"
## ..$ clean_tolower : logi TRUE
## ..$ clean_removeDigits : logi TRUE
## ..$ clean_removePunct : logi TRUE
## ..$ units : chr "documents"
## ..$ unitsoriginal : chr "documents"
What proportion of the entries in crudeDfm are zeros? Compare the sizes of tdm and crudeDfm using the object.size() function.
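A sketch of both computations (this version of the dfm is a dense matrix, so the comparison with zero works element-wise):
# proportion of zero entries
sum(crudeDfm == 0) / prod(dim(crudeDfm))
# compare memory footprints
object.size(tdm)
object.size(crudeDfm)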
Keywords-in-context
quanteda provides a keyword-in-context function that is easily usable and configurable to explore texts in a descriptive way. Type ?kwic to view the documentation.
Load the Irish budget debate speeches for the year 2010 using
require(quantedaData)
## Loading required package: quantedaData
data(ie2010Corpus)
and experiment with the kwic function, following the syntax specified on the help page for kwic. kwic can be used either on a character vector or a corpus object. What class of object is returned? Try assigning the return value from kwic to a new object and then examine the object by clicking on it in the environment pane in RStudio (or using the inspection method of your choice).
Use the kwic function to discover the context of the word “clean”. Is this associated with environmental policy?
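A sketch covering this step and the class question above:
kwicClean <- kwic(ie2010Corpus, "clean")
class(kwicClean)  # what sort of object did kwic() return?
kwicClean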
Examine the context of words related to “disaster”. Hint: you can use the stem of the word along with setting the regex argument to TRUE.
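A sketch, using the stem “disast” so that “disaster”, “disasters”, and “disastrous” all match:
kwic(ie2010Corpus, "disast", regex = TRUE)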
Descriptive statistics
We can extract basic descriptive statistics from a corpus from its document feature matrix. Make a dfm from the 2010 Irish budget speeches corpus.
Examine the most frequent word features using topfeatures. What are the five most frequent words in the corpus?
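A sketch of both steps (the object name is illustrative):
ie2010Dfm <- dfm(ie2010Corpus)
topfeatures(ie2010Dfm, 5)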
quanteda provides a function to count syllables in a word, countSyllables. Try the function at the prompt. The code below will apply this function to all the words in the corpus, to give you a count of the total syllables in the corpus.
# count syllables from texts in the 2010 speech corpus
textSyls <- countSyllables(texts(ie2010Corpus))
# sum the syllable counts
totalSyls <- sum(textSyls)
Now compute the readability measure known as the Flesch Reading Ease index. The formula is:
\[206.835 - 1.015 \left( \frac{\text{total tokens}}{\text{total sentences}} \right) - 84.6 \left( \frac{\text{total syllables}}{\text{total tokens}} \right)\]
You should now have the values for these variables, so calculate the Flesch Reading Ease index of the Irish budget speeches.
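A sketch, assuming summary() of a corpus returns a data frame with Tokens and Sentences columns (as in the output shown further below):
ieSummary <- summary(ie2010Corpus)
totalTokens <- sum(ieSummary$Tokens)
totalSentences <- sum(ieSummary$Sentences)
206.835 - 1.015 * (totalTokens / totalSentences) - 84.6 * (totalSyls / totalTokens)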
Lexical Diversity over Time
We can plot the type-token ratio of the Irish budget speeches over time. To do this, begin by extracting a subset of iebudgets that contains only the first speaker from each year:
data(iebudgets)
finMins <- subset(iebudgets, no=="01")
tokeninfo <- summary(finMins)
## Corpus consisting of 5 documents.
##
## Text Types Tokens Sentences year debate
## 2012_BUDGET_01_Michael_Noonan_FG 1538 6450 294 2012 BUDGET
## 2011_BUDGET_01_Brian_Lenihan_FF 1537 7094 371 2011 BUDGET
## 2010_BUDGET_01_Brian_Lenihan_FF 1655 7799 390 2010 BUDGET
## 2009_BUDGETSUP_01_Brian_Lenihan_FF 1632 7570 410 2009 BUDGETSUP
## 2008_BUDGET_01_Brian_Cowen_FF 1715 8815 417 2008 BUDGET
## no namefirst namelast party
## 01 Michael Noonan FG
## 01 Brian Lenihan FF
## 01 Brian Lenihan FF
## 01 Brian Lenihan FF
## 01 Brian Cowen FF
##
## Source: /Users/kbenoit/Dropbox/QUANTESS/quantedaData_kenlocal_gh/* on x86_64 by kbenoit.
## Created: Mon Sep 15 14:42:09 2014.
## Notes: .
Note the quotation marks around the value of no in the call to subset(). Why are these required here?
Get the type-token ratio for each text from this subset, and plot the resulting vector of TTRs as a function of the year.
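A sketch, using the summary data frame created above (the column names follow the output shown):
ttr <- tokeninfo$Types / tokeninfo$Tokens
yrs <- as.numeric(as.character(tokeninfo$year))  # ensure year is numeric
plot(yrs, ttr, xlab = "Year", ylab = "Type-token ratio")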
Now compare the results from the statLexdiv function applied to the texts. Are the results the same?
Document and word associations
Load the presidential inauguration corpus, select the speeches from 1900 to 1950, and create a dfm from this corpus.
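A sketch, assuming the corpus object is named inaugCorpus and carries a Year document variable (check the names in your installed quantedaData):
data(inaugCorpus)
inaugSub <- subset(inaugCorpus, Year >= 1900 & Year <= 1950)
inaugDfm <- dfm(inaugSub)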
Measure the document similarities using simil(). Compare the results for Euclidean distance, Euclidean distance on the term-frequency-standardized dfm, cosine similarity, and Jaccard similarity.
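A sketch, assuming simil() comes from the proxy package and that the dfm can be coerced to an ordinary matrix:
require(proxy)
m <- as.matrix(inaugDfm)
simil(m, method = "cosine")
simil(m, method = "Jaccard")
simil(m, method = "Euclidean")  # a distance measure, converted to a similarity by proxy
# term-frequency standardization: divide each row by its total
mProp <- m / rowSums(m)
simil(mProp, method = "Euclidean")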
Measure the term similarities for the following words: economy, health, women.
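Continuing the sketch above: transpose the dfm so that terms are rows, then compare each target word to all terms (this assumes the three words appear among the features):
termMat <- t(m)
words <- c("economy", "health", "women")
sims <- as.matrix(simil(termMat[words, ], termMat, method = "cosine"))
# terms most similar to "economy" (the word itself will rank first)
head(sort(sims["economy", ], decreasing = TRUE), 10)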