**Due: April 3, 2015**
This exercise is designed to get you working with quanteda. The focus will be on exploring the package and getting some texts into the corpus object format. The quanteda package has several functions for creating a corpus of texts, which we will use in this exercise.
Getting Started.
You can use R or RStudio for these exercises. You will first need to install the package, using the commands below. Also see the instructions for installation from the dev branch page of http://github.com/kbenoit/quanteda.
# needs the devtools package for this to work
if (!require(devtools)) install.packages("devtools", dependencies=TRUE)
# be sure to install the latest version from GitHub, using dev branch:
devtools::install_github("quanteda", username="kbenoit", dependencies=TRUE, ref="dev")
# if this fails remove the ref="dev" part
# and quantedaData
devtools::install_github("quantedaData", username="kbenoit")
Exploring quanteda functions.
You can try running demo(quanteda), and also use the example() function for any function in the package, to run the examples and see how the function works. Of course you should also browse the documentation, especially ?corpus, to see the structure and operations of how to construct a corpus.
Making a corpus and corpus structure
From a vector of texts already in memory.
The simplest way to create a corpus is to use a vector of texts already present in R’s global environment. Some text and corpus objects are built into the package; for example, inaugTexts is the UTF-8 encoded set of 57 presidential inaugural addresses. Try using corpus() on this set of texts to create a corpus.
Once you have constructed this corpus, use the summary() method to see a brief description of the corpus. The names of the character vector inaugTexts should have become the document names.
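For instance, a minimal sketch:
require(quanteda)
# build a corpus from the named character vector of inaugural addresses
myInaugCorpus <- corpus(inaugTexts)
summary(myInaugCorpus, 5)
docnames(myInaugCorpus)[1:3]  # document names inherited from names(inaugTexts)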
From a directory of text files.
The corpus() function can take as its main argument the name of a directory, if you wrap the path to the directory within a directory() call. (See ?directory for an example.) If you call directory() with no arguments, it should allow you to choose the directory interactively (you will need to have installed the tcltk2 package first, though).
Here you are encouraged to select any directory of plain text files of your own. How did it work? Try using docvars() to assign a set of document-level variables.
Note that if you include document-level metadata in your filenames, this can be automatically parsed by corpus.directory() into docvars.
require(quanteda)
## Loading required package: quanteda
mydir <- textfile("~/Dropbox/QUANTESS/corpora/ukManRenamed/*.txt")
mycorpus <- corpus(mydir)
summary(mycorpus, 5)
## Corpus consisting of 101 documents, showing 5 documents.
##
## Text Types Tokens Sentences
## UK_natl_1945_en_Con.txt 1578 6095 275
## UK_natl_1945_en_Lab.txt 1258 4975 241
## UK_natl_1945_en_Lib.txt 1061 3377 158
## UK_natl_1950_en_Con.txt 1806 7411 381
## UK_natl_1950_en_Lab.txt 1342 4879 275
##
## Source: /Users/kbenoit/Dropbox/Classes/Trinity/Data Mining 2015/Notes/Day 6 - Text/* on x86_64 by kbenoit.
## Created: Mon Mar 23 22:11:21 2015.
## Notes: .
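If your filenames encode metadata like the ones above (e.g. UK_natl_1945_en_Con.txt), one way to parse the fields into docvars yourself is sketched below; the Country/Type/Year/Language/Party labels are my own assumptions based on the filename pattern, not anything built into the package:
# split underscore-delimited filename fields into document variables
fields <- do.call(rbind, strsplit(gsub("\\.txt$", "", docnames(mycorpus)), "_"))
docvars(mycorpus) <- data.frame(Country = fields[, 1], Type = fields[, 2],
                                Year = as.integer(fields[, 3]),
                                Language = fields[, 4], Party = fields[, 5],
                                stringsAsFactors = FALSE)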
From .csv or .json files — see the documentation with ?textfile.
Explore some phrases in the text.
You can do this using the kwic() function (for “keywords-in-context”) to explore a specific word or phrase.
kwic(inaugCorpus, "terror", 3)
## preword word
## [1797-Adams, 1183] or violence, by terror,
## [1933-Roosevelt, 100] nameless, unreasoning, unjustified terror
## [1941-Roosevelt, 252] by a fatalistic terror,
## [1961-Kennedy, 763] uncertain balance of terror
## [1961-Kennedy, 872] instead of its terrors.
## [1981-Reagan, 691] Americans from the terror
## [1981-Reagan, 1891] those who practice terrorism
## [1997-Clinton, 929] the fanaticism of terror.
## [1997-Clinton, 1462] strong defense against terror
## [2009-Obama, 1433] aims by inducing terror
## postword
## [1797-Adams, 1183] intrigue, or venality,
## [1933-Roosevelt, 100] which paralyzes needed
## [1941-Roosevelt, 252] we proved that
## [1961-Kennedy, 763] that stays the
## [1961-Kennedy, 872] Together let us
## [1981-Reagan, 691] of runaway living
## [1981-Reagan, 1891] and prey upon
## [1997-Clinton, 929] And they torment
## [1997-Clinton, 1462] and destruction. Our
## [2009-Obama, 1433] and slaughtering innocents,
Try substituting your own search terms, or working with your own corpus.
Create a document-feature matrix, using dfm(). First, read the documentation using ?dfm to see the available options.
mydfm <- dfm(inaugCorpus, ignoredFeatures=stopwords("english"))
## Creating a dfm from a corpus ...
## ... indexing 57 documents
## ... tokenizing texts, found 134,142 total tokens
## ... cleaning the tokens, 461 removed entirely
## ... ignoring 174 feature types, discarding 69,005 total features (51.6%)
## ... summing tokens by document
## ... indexing 9,085 feature types
## ... building sparse matrix
## ... created a 57 x 9085 sparse dfm
## ... complete. Elapsed time: 3.366 seconds.
dim(mydfm)
## [1] 57 9085
topfeatures(mydfm, 20)
## will people government us can upon
## 871 564 561 476 470 371
## must may great states shall world
## 363 338 334 331 314 305
## country every nation peace one new
## 294 291 287 253 244 241
## power public
## 232 223
Experiment with different dfm options, such as stem=TRUE. The function trim() allows you to reduce the size of the dfm following its construction.
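A sketch of both options; the minCount and minDoc argument names are assumptions, so check ?trim for the exact interface:
# stem features at construction time
stemDfm <- dfm(inaugCorpus, stem=TRUE)
# then drop features appearing fewer than 5 times, or in fewer than 3 documents
smallDfm <- trim(stemDfm, minCount=5, minDoc=3)
dim(smallDfm)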
Grouping on a variable is an excellent feature of dfm(), in fact one of my favorites. For instance, if you want to aggregate all speeches by presidential name, you can execute
mydfm <- dfm(inaugCorpus, groups="President")
## Creating a dfm from a corpus ...
## ... grouping texts by variable: President
## ... indexing 34 documents
## ... tokenizing texts, found 134,142 total tokens
## ... cleaning the tokens, 461 removed entirely
## ... summing tokens by document
## ... indexing 9,208 feature types
## ... building sparse matrix
## ... created a 34 x 9208 sparse dfm
## ... complete. Elapsed time: 1.246 seconds.
dim(mydfm)
## [1] 34 9208
docnames(mydfm)
## [1] "Adams" "Buchanan" "Bush" "Carter" "Cleveland"
## [6] "Clinton" "Coolidge" "Eisenhower" "Garfield" "Grant"
## [11] "Harding" "Harrison" "Hayes" "Hoover" "Jackson"
## [16] "Jefferson" "Johnson" "Kennedy" "Lincoln" "Madison"
## [21] "McKinley" "Monroe" "Nixon" "Obama" "Pierce"
## [26] "Polk" "Reagan" "Roosevelt" "Taft" "Taylor"
## [31] "Truman" "VanBuren" "Washington" "Wilson"
Note that this groups Theodore and Franklin D. Roosevelt together – to separate them we would have needed to add a first-name variable using docvars() and grouped on that as well.
Explore the ability to subset a corpus.
There is a subset() method defined for a corpus, which works just like R’s normal subset() command. This provides an easy way to send specific documents to downstream functions, like dfm(), and will be a useful workaround until I implement a full set of subsetting and indexing features for the dfm class object.
For instance, if you want a wordcloud of just Obama’s two inaugural addresses, you would need to subset the corpus first:
obamadfm <- dfm(subset(inaugCorpus, President=="Obama"), stopwords=TRUE)
## Creating a dfm from a corpus ...
## ... indexing 2 documents
## ... tokenizing texts, found 4,525 total tokens
## ... cleaning the tokens, 43 removed entirely
## ... summing tokens by document
## ... indexing 1,333 feature types
## ... building sparse matrix
## ... created a 2 x 1333 sparse dfm
## ... complete. Elapsed time: 0.05 seconds.
plot(obamadfm)
Preparing and pre-processing texts
“Cleaning” texts
It is common to “clean” texts before processing, usually by removing punctuation, digits, and converting to lower case. Look at the documentation for quanteda’s clean command (?clean) and use the command on the exampleString text (you can load this from quantedaData using data(exampleString)). Can you think of cases where cleaning could introduce homonymy?
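A minimal sketch, assuming clean() lowercases and strips punctuation and digits by default:
require(quantedaData)
data(exampleString)
clean(exampleString)
# one source of homonymy: lowercasing makes "US" indistinguishable from "us"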
Tokenizing texts
In order to count word frequencies, we first need to split the text into words through a process known as tokenization. Look at the documentation for quanteda’s tokenize command using the built-in help function (? before any object/command). Use the tokenize command on exampleString, and examine the results. Are there cases where it is unclear where the boundary between two words lies? You can experiment with the options to tokenize.
Try reshaping exampleString into sentences, using segmentSentence. What sort of object is returned if you tokenize the segmented sentence object?
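A sketch:
# segment into sentences, then tokenize the resulting vector
sentences <- segmentSentence(exampleString)
sentToks <- tokenize(sentences)
class(sentToks)  # inspect what tokenize() returns here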
Stemming.
Stemming removes suffixes using the Porter stemmer, found in the SnowballC library. The quanteda function to invoke the stemmer is wordstem. Apply stemming to exampleString and examine the results. Why does it not appear to work, and what do you need to do to make it work? How would you apply this to the sentence-segmented vector?
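A sketch of the comparison:
wordstem(exampleString)  # appears not to work: the input is one long string
# wordstem operates on individual words, so tokenize (and flatten) first
wordstem(unlist(tokenize(exampleString)))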
Applying pre-processing to the creation of a dfm.
quanteda’s dfm() function makes it easy to pass the cleaning arguments to clean, which are executed as part of the tokenization implemented by dfm(). Compare the steps required in a similar text preparation package, tm:
require(tm)
## Loading required package: tm
## Loading required package: NLP
##
## Attaching package: 'tm'
##
## The following objects are masked from 'package:quanteda':
##
## as.DocumentTermMatrix, stopwords
data("crude")
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, stemDocument)
tdm <- TermDocumentMatrix(crude)
# same in quanteda
require(quanteda)
crudeCorpus <- corpus(crude)
crudeDfm <- dfm(crudeCorpus)
## Creating a dfm from a corpus ...
## ... indexing 20 documents
## ... tokenizing texts, found 3,863 total tokens
## ... cleaning the tokens, 0 removed entirely
## ... summing tokens by document
## ... indexing 973 feature types
## ... building sparse matrix
## ... created a 20 x 973 sparse dfm
## ... complete. Elapsed time: 0.047 seconds.
Inspect the dimensions of the resulting objects, including the names of the words extracted as features. It is also worth comparing the structure of the document-feature matrices returned by each package. tm uses the slam simple triplet matrix format for representing a sparse matrix.
It is also – in fact almost always – useful to inspect the structure of this object:
str(tdm)
## List of 6
## $ i : int [1:1954] 49 86 110 148 166 167 178 183 184 195 ...
## $ j : int [1:1954] 1 1 1 1 1 1 1 1 1 1 ...
## $ v : num [1:1954] 1 2 1 1 1 1 2 1 1 2 ...
## $ nrow : int 943
## $ ncol : int 20
## $ dimnames:List of 2
## ..$ Terms: chr [1:943] "abdulaziz" "abil" "abl" "about" ...
## ..$ Docs : chr [1:20] "127" "144" "191" "194" ...
## - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
## - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
This indicates that we can extract the names of the words from the tm TermDocumentMatrix object by getting the Terms element of its dimnames:
head(tdm$dimnames$Terms, 20)
## [1] "abdulaziz" "abil" "abl" "about" "abov"
## [6] "above" "abroad" "accept" "accord" "across"
## [11] "act" "activity" "add" "added" "address"
## [16] "adher" "advantag" "advisers" "after" "again"
Compare this to the results of the same operations from quanteda. To get the “words” from a quanteda object, you can use the features() function:
features_quanteda <- features(crudeDfm)
head(features_quanteda, 20)
## [1] "a" "abdulaziz" "abil" "abl" "about"
## [6] "abov" "above" "abroad" "accept" "accord"
## [11] "across" "act" "activity" "ad" "add"
## [16] "added" "address" "adher" "advantag" "advisers"
str(crudeDfm)
## Formal class 'dfmSparse' [package "quanteda"] with 9 slots
## ..@ settings :List of 1
## .. ..$ : NULL
## ..@ weighting: chr "frequency"
## ..@ smooth : num 0
## ..@ Dim : int [1:2] 20 973
## ..@ Dimnames :List of 2
## .. ..$ docs : chr [1:20] "reut-00001.xml" "reut-00002.xml" "reut-00004.xml" "reut-00005.xml" ...
## .. ..$ features: chr [1:973] "a" "abdulaziz" "abil" "abl" ...
## ..@ i : int [1:2172] 0 1 2 3 4 5 6 7 8 9 ...
## ..@ p : int [1:974] 0 18 19 21 23 31 35 37 38 39 ...
## ..@ x : num [1:2172] 5 7 2 3 1 8 10 2 4 4 ...
## ..@ factors : list()
What proportion of the crudeDfm are zeros? Compare the sizes of tdm and crudeDfm using the object.size() function.
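A sketch, using the sparse-matrix slots shown by str() above:
# proportion of zeros: 1 minus stored (non-zero) cells over total cells
1 - length(crudeDfm@x) / prod(dim(crudeDfm))
object.size(tdm)
object.size(crudeDfm)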
Keywords-in-context
quanteda provides a keyword-in-context function that is easily usable and configurable to explore texts in a descriptive way. Type ?kwic to view the documentation.
Load the Irish budget debate speeches for the year 2010 using
require(quantedaData)
## Loading required package: quantedaData
data(ie2010Corpus)
and experiment with the kwic function, following the syntax specified on its help page. kwic can be used either on a character vector or a corpus object. What class of object is returned? Try assigning the return value from kwic to a new object, then examine the object by clicking on it in the Environment pane in RStudio (or using the inspection method of your choice).
Use the kwic function to discover the context of the word “clean”. Is this associated with environmental policy?
Examine the context of words related to “disaster”. Hint: you can use the stem of the word along with setting the regex argument to TRUE.
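Two sketches of these queries (the window size is arbitrary):
kwic(ie2010Corpus, "clean", 3)               # context window of 3 words
kwic(ie2010Corpus, "disast", 3, regex=TRUE)  # matches disaster, disastrous, ...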
Descriptive statistics
We can extract basic descriptive statistics from a corpus via its document-feature matrix. Make a dfm from the 2010 Irish budget speeches corpus.
Examine the most frequent word features using topfeatures. What are the five most frequent words in the corpus?
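A minimal sketch:
ieDfm <- dfm(ie2010Corpus)
topfeatures(ieDfm, 5)  # five most frequent features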
quanteda provides a function to count syllables in a word — syllables. Try the function at the prompt. The code below will apply this function to all the words in the corpus, to give you a count of the total syllables in the corpus.
# count syllables from texts in the 2010 speech corpus
textSyls <- syllables(texts(ie2010Corpus))
# sum the syllable counts
totalSyls <- sum(textSyls)
Lexical Diversity over Time
We can plot the type-token ratio of the Irish budget speeches over time. To do this, begin by extracting a subset of iebudgetsCorpus that contains only the first speaker from each year:
data(iebudgetsCorpus, package="quantedaData")
finMins <- subset(iebudgetsCorpus, number=="01")
tokeninfo <- summary(finMins)
## Corpus consisting of 6 documents.
##
## Text Types Tokens Sentences year debate
## 2008_BUDGET_01_Brian_Cowen_FF 1705 8659 417 2008 BUDGET
## 2009_BUDGET_01_Brian_Lenihan_FF 1653 7593 418 2009 BUDGET
## 2009_BUDGETSUP_01_Brian_Lenihan_FF 1639 7500 410 2009 BUDGETSUP
## 2010_BUDGET_01_Brian_Lenihan_FF 1649 7719 390 2010 BUDGET
## 2011_BUDGET_01_Brian_Lenihan_FF 1539 7049 371 2011 BUDGET
## 2012_BUDGET_01_Michael_Noonan_FG 1521 6412 294 2012 BUDGET
## number namefirst namelast party
## 01 Brian Cowen FF
## 01 Brian Lenihan FF
## 01 Brian Lenihan FF
## 01 Brian Lenihan FF
## 01 Brian Lenihan FF
## 01 Michael Noonan FG
##
## Source: /Users/kbenoit/Dropbox/QUANTESS/quantedaData_kenlocal_gh/* on x86_64 by kbenoit.
## Created: Sat Nov 15 18:32:54 2014.
## Notes: .
Note the quotation marks around the value for number. Why are these required here?
Get the type-token ratio for each text from this subset, and plot the resulting vector of TTRs as a function of the year.
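A sketch using the Types and Tokens columns of the summary data frame above:
ttr <- tokeninfo$Types / tokeninfo$Tokens
# coerce year to integer in case summary() returned it as character or factor
plot(as.integer(as.character(tokeninfo$year)), ttr, type="b",
     xlab="Year", ylab="Type-token ratio")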
Now compare the results from the lexdiv function applied to the texts. Are the results the same?
Document and word associations
Load the presidential inauguration corpus, selecting the addresses from 1900 to 1950, and create a dfm from this corpus.
Measure the document similarities using similarity(). Compare the results for Euclidean distance, Euclidean distance on the term-frequency standardized dfm, cosine, and Jaccard.
Measure the term similarities for the following words: economy, health, women.
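A sketch; the method and margin argument names, and the available methods, are assumptions to verify against ?similarity:
presDfm <- dfm(subset(inaugCorpus, Year >= 1900 & Year <= 1950))
# document similarities under different measures
similarity(presDfm, method="euclidean")
similarity(weight(presDfm, "relFreq"), method="euclidean")  # standardized
similarity(presDfm, method="cosine")
similarity(presDfm, method="jaccard")
# term similarities for selected words
similarity(presDfm, c("economy", "health", "women"), margin="features")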
Working with dictionaries
Creating a simple dictionary.
Dictionaries are named lists, each consisting of a “key” and a set of entries defining the equivalence class for the given key. To create a simple dictionary of parts of speech, for instance, we could define a dictionary consisting of articles and conjunctions, using:
posDict <- dictionary(list(articles = c("the", "a", "an"),
                           conjunctions = c("and", "but", "or", "nor", "for", "yet", "so")))
To let this define a set of features, we can use this dictionary when we create a dfm, for instance:
posDfm <- dfm(inaugCorpus, dictionary=posDict)
## Creating a dfm from a corpus ...
## ... indexing 57 documents
## ... tokenizing texts, found 134,142 total tokens
## ... cleaning the tokens, 461 removed entirely
## ... applying a dictionary consisting of 2 key entries
## ... created a 57 x 3 sparse dfm
## ... complete. Elapsed time: 1.847 seconds.
posDfm[1:10,]
## Document-feature matrix of: 10 documents, 3 features.
## 10 x 3 sparse Matrix of class "dfmSparse"
## articles conjunctions Non_Dictionary
## 1789-Washington 178 73 1178
## 1793-Washington 15 4 116
## 1797-Adams 344 192 1782
## 1801-Jefferson 232 109 1385
## 1805-Jefferson 256 126 1784
## 1809-Madison 166 63 946
## 1813-Madison 169 63 978
## 1817-Monroe 458 174 2738
## 1821-Monroe 577 195 3685
## 1825-Adams 448 150 2317
Weight the posDfm by term frequency using tf(), and plot the values of articles and conjunctions (actually, here just the coordinating conjunctions) across the speeches. (Hint: you can use docvars(inaugCorpus, "Year") for the x-axis.)
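A sketch of the weighting and the plot; the assumption that tf() with no further arguments returns relative term frequencies should be checked against ?tf:
posDfmRel <- tf(posDfm)  # term-frequency (relative) weighting
plot(docvars(inaugCorpus, "Year"), as.vector(posDfmRel[, "articles"]),
     type="b", xlab="Year", ylab="Relative frequency of articles")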
Is the distribution of normalized articles and conjunctions relatively constant across years, as you would expect?
Replicating a published dictionary analysis
Here we will create and implement the populism dictionary from Rooduijn, Matthijs, and Teun Pauwels. 2011. “Measuring Populism: Comparing Two Methods of Content Analysis.” West European Politics 34(6): 1272–83. Appendix B of that paper provides the term entries for a dictionary key for the concept populism. Implement this as a dictionary, and apply it to the same UK manifestos as in the article.
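A sketch of the dictionary key; the entries below are my reconstruction of Appendix B and should be verified against the article before use:
popDict <- dictionary(list(populism = c("elit*", "consensus*", "undemocratic*",
                                        "referend*", "corrupt*", "propagand*",
                                        "politici*", "*deceit*", "*deceiv*",
                                        "*betray*", "shame*", "scandal*",
                                        "truth*", "dishonest*", "establishm*",
                                        "ruling*")))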
Hint: You can get a corpus of the UK manifestos used in the article with the following:
data(ukManifestos, package="quantedaData")
ukPopCorpus <- subset(ukManifestos, (Year %in% c(1992, 2001, 2005) &
(Party %in% c("Lab", "LD", "Con", "BNP", "UKIP"))))
summary(ukPopCorpus)
## Corpus consisting of 11 documents.
##
## Text Types Tokens Sentences Country Type Year Language
## UK_natl_1992_en_Con 3886 29560 1605 UK natl 1992 en
## UK_natl_1992_en_Lab 2313 11355 623 UK natl 1992 en
## UK_natl_1992_en_LD 3004 17381 939 UK natl 1992 en
## UK_natl_2001_en_Con 2517 13196 721 UK natl 2001 en
## UK_natl_2001_en_Lab 3600 28704 1602 UK natl 2001 en
## UK_natl_2001_en_LD 3291 21174 1232 UK natl 2001 en
## UK_natl_2005_en_BNP 4444 25112 1058 UK natl 2005 en
## UK_natl_2005_en_Con 1860 7685 420 UK natl 2005 en
## UK_natl_2005_en_Lab 3579 23800 1557 UK natl 2005 en
## UK_natl_2005_en_LD 2859 16081 840 UK natl 2005 en
## UK_natl_2005_en_UKIP 2185 8856 425 UK natl 2005 en
## Party
## Con
## Lab
## LD
## Con
## Lab
## LD
## BNP
## Con
## Lab
## LD
## UKIP
##
## Source: /Users/kbenoit/Dropbox/QUANTESS/quantedaData_kenlocal_gh/* on x86_64 by kbenoit.
## Created: Sat Nov 15 18:43:36 2014.
## Notes: .
Create a dfm of the populism dictionary on the UK manifestos. Use this dfm to reproduce the x-axis for the UK-based parties from Figure 1 in the article. Suggestion: use dotchart(). You will need to normalize the values first by term frequency within document. Hint: use weight(x, "relFreq") on the dfm.
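A sketch, assuming the popDict object created above:
popDfm <- dfm(ukPopCorpus, dictionary=popDict)
popRel <- weight(popDfm, "relFreq")  # normalize within document
dotchart(as.vector(popRel[, "populism"]), labels=docnames(popRel),
         xlab="Relative frequency of populism terms")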
You can explore some of these terms within the corpus to see whether you think they are appropriate measures of populism. How can you search the corpus for the regular expression politici* as a “keyword in context”?