This take-home exercise is designed to get you working with quanteda. The focus is on exploring the package and getting some texts into the corpus object format. The quanteda package has several functions for creating a corpus of texts, which we will use in this exercise.
Getting Started.
You can use R or RStudio for these exercises. You will first need to install the package, using:
# needs the devtools package for this to work
if (!require(devtools)) install.packages("devtools", dependencies=TRUE)
# be sure to install the latest version from GitHub, using dev branch:
devtools::install_github("quanteda", username="kbenoit", dependencies=TRUE, ref="dev")
# and quantedaData
devtools::install_github("quantedaData", username="kbenoit")
Exploring quanteda functions.
You can try running demo(quanteda), and also use the example() function for any function in the package, to run the examples and see how each function works. Of course you should also browse the documentation, especially ?corpus, to see the structure of a corpus and the operations for constructing one.
Making a corpus and corpus structure
From a vector of texts already in memory.
The simplest way to create a corpus is to use a vector of texts already present in R’s global environment. Some text and corpus objects are built into the package; for example, inaugTexts is the UTF-8 encoded set of 57 presidential inaugural addresses. Try using corpus() on this set of texts to create a corpus.
Once you have constructed this corpus, use the summary() method to see a brief description of the corpus. The names of the character vector inaugTexts should have become the document names.
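A minimal sketch of these two steps, assuming quanteda and its built-in inaugTexts object are installed as described above:

```r
require(quanteda)
# build a corpus from the named character vector of inaugural addresses
mycorpus <- corpus(inaugTexts)
# print a summary of the first five documents
summary(mycorpus, 5)
```

The names of inaugTexts (e.g. "1789-Washington") should appear as the document names in the summary.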
From a directory of text files.
The corpus() function can take as its main argument the name of a directory, if you wrap the path to the directory in a directory() call. (See ?directory for an example.) If you call directory() with no arguments, it should allow you to choose the directory interactively (you will need to have installed the tcltk2 package first, though).
Here you are encouraged to select any directory of plain text files of your own.
How did it work? Try using docvars() to assign a set of document-level variables. Note that if you have document-level metadata in your filenames, then this can be automatically parsed into docvars by corpus.directory().
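As a sketch of assigning a docvar manually (the replacement form docvars(x, field) <- value is an assumption about the dev-branch API; check ?docvars for the exact interface):

```r
require(quanteda)
mycorpus <- corpus(inaugTexts)
# parse a Year variable from document names such as "1789-Washington"
docvars(mycorpus, "Year") <- as.integer(substr(docnames(mycorpus), 1, 4))
summary(mycorpus, 3)
```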
require(quanteda)
## Loading required package: quanteda
mydir <- directory("~/Dropbox/QUANTESS/corpora/ukManRenamed")
mycorpus <- corpus(mydir)
summary(mycorpus, 5)
## Corpus consisting of 101 documents.
##
## Text Types Tokens Sentences docvar1 docvar2 docvar3
## UK_natl_1945_en_Con 1578 6095 275 UK natl 1945
## UK_natl_1945_en_Lab 1258 4975 241 UK natl 1945
## UK_natl_1945_en_Lib 1060 3377 158 UK natl 1945
## UK_natl_1950_en_Con 1800 7413 381 UK natl 1950
## UK_natl_1950_en_Lab 1342 4879 275 UK natl 1950
## docvar4 docvar5
## en Con
## en Lab
## en Lib
## en Con
## en Lab
##
## Source: /Users/kbenoit/Dropbox/Classes/QTA NYU/Exercises/Exercise 1/* on x86_64 by kbenoit.
## Created: Fri Sep 26 12:39:15 2014.
## Notes: .
From a zipped file.
For this, you can try downloading the zipped file of Irish budget speeches available from /assets/courses/nyu2014qta/iebudget2010.zip. Then execute this code:
myzipfiles <- zipfiles() # allows you to locate the file interactively
mycorpus <- corpus(myzipfiles,
docvarnames=c("year", "debate", "seq", "fname", "lname", "party"))
summary(mycorpus, 5)
If you are familiar with the tm text package for R, it is also now possible to import one of its VCorpus objects directly, using the corpus.VCorpus() method. This is a feature I just added, so I am eager for you to test it.
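A sketch you could use for testing it, assuming tm's built-in crude dataset (a small VCorpus of Reuters newswire articles) is available in your tm version:

```r
require(quanteda)
require(tm)
data(crude, package = "tm")   # a small VCorpus of Reuters newswire texts
mytmcorpus <- corpus(crude)   # dispatches to the corpus.VCorpus() method
summary(mytmcorpus, 5)
```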
Explore some phrases in the text.
You can do this using the kwic() function (for “key words in context”) to explore a specific word or phrase.
kwic(inaugCorpus, "terror", 3)
## preword word
## [1797-Adams, 1183] or violence, by terror,
## [1933-Roosevelt, 100] nameless, unreasoning, unjustified terror
## [1941-Roosevelt, 252] by a fatalistic terror,
## [1961-Kennedy, 763] uncertain balance of terror
## [1961-Kennedy, 872] instead of its terrors.
## [1981-Reagan, 691] Americans from the terror
## [1981-Reagan, 1891] those who practice terrorism
## [1997-Clinton, 929] the fanaticism of terror.
## [1997-Clinton, 1462] strong defense against terror
## [2009-Obama, 1433] aims by inducing terror
## postword
## [1797-Adams, 1183] intrigue, or venality,
## [1933-Roosevelt, 100] which paralyzes needed
## [1941-Roosevelt, 252] we proved that
## [1961-Kennedy, 763] that stays the
## [1961-Kennedy, 872] Together let us
## [1981-Reagan, 691] of runaway living
## [1981-Reagan, 1891] and prey upon
## [1997-Clinton, 929] And they torment
## [1997-Clinton, 1462] and destruction. Our
## [2009-Obama, 1433] and slaughtering innocents,
Try substituting your own search terms, or working with your own corpus.
Create a document-feature matrix, using dfm(). First, read the documentation using ?dfm to see the available options.
mydfm <- dfm(inaugCorpus, stopwords=TRUE)
## Creating dfm from a corpus: ... removing stopwords ... done.
dim(mydfm)
## [1] 57 9087
topfeatures(mydfm, 20)
## will people government us can upon
## 871 564 561 476 470 371
## must may great states shall world
## 363 338 334 331 314 305
## country every nation peace one new
## 294 291 287 253 244 241
## power public
## 232 223
Experiment with different dfm() options, such as stem=TRUE. The function trimdfm() allows you to reduce the size of the dfm following its construction.
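For example (the trimdfm() argument names minCount and minDoc are assumptions; consult ?trimdfm for the exact interface in your version):

```r
require(quanteda)
# stem the features during construction
mydfmStemmed <- dfm(inaugCorpus, stopwords = TRUE, stem = TRUE)
dim(mydfmStemmed)
# drop features occurring fewer than 5 times, or in fewer than 3 documents
mydfmTrimmed <- trimdfm(mydfmStemmed, minCount = 5, minDoc = 3)
dim(mydfmTrimmed)
```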
Grouping on a variable is an excellent feature of dfm(), in fact one of my favorites. For instance, if you want to aggregate all speeches by presidential name, you can execute:
mydfm <- dfm(inaugCorpus, groups="President")
## Creating dfm from a corpus: ... aggregating by group: President... complete ... done.
dim(mydfm)
## [1] 34 9210
docnames(mydfm)
## [1] "Adams" "Buchanan" "Bush" "Carter" "Cleveland"
## [6] "Clinton" "Coolidge" "Eisenhower" "Garfield" "Grant"
## [11] "Harding" "Harrison" "Hayes" "Hoover" "Jackson"
## [16] "Jefferson" "Johnson" "Kennedy" "Lincoln" "Madison"
## [21] "McKinley" "Monroe" "Nixon" "Obama" "Pierce"
## [26] "Polk" "Reagan" "Roosevelt" "Taft" "Taylor"
## [31] "Truman" "VanBuren" "Washington" "Wilson"
Note that this groups Theodore and Franklin D. Roosevelt together – to separate them, we would have needed to add a firstname variable using docvars() and to group on that as well.
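A sketch of how this could be done, relying on the document names (which begin with the year; Theodore Roosevelt's address was in 1905); the docvars() replacement form used here is an assumption about the dev-branch API:

```r
require(quanteda)
# copy the President docvar, then distinguish the two Roosevelts by hand
docvars(inaugCorpus, "PresidentFull") <- docvars(inaugCorpus, "President")
tr <- substr(docnames(inaugCorpus), 1, 4) == "1905"
docvars(inaugCorpus, "PresidentFull")[tr] <- "Roosevelt, Theodore"
mydfm <- dfm(inaugCorpus, groups = "PresidentFull")
docnames(mydfm)
```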
Explore the ability to subset a corpus.
There is a subset() method defined for a corpus, which works just like R’s normal subset() command. This provides an easy way to send specific documents to downstream functions, like dfm(), which will be a useful workaround until I implement a full set of subsetting and indexing features for the dfm class object.
For instance, if you want a wordcloud of just Obama’s two inaugural addresses, you would need to subset the corpus first:
obamadfm <- dfm(subset(inaugCorpus, President=="Obama"), stopwords=TRUE)
## Creating dfm from a corpus: ... removing stopwords ... done.
plot(obamadfm)
quanteda is a work in progress. Please send me suggestions, bug reports, etc. so that I can improve it. You can email these directly to kbenoit@lse.ac.uk.
In addition, if you are having trouble importing your texts into a corpus, I welcome you to send me a set of your texts and I will write the code for you (and possibly add to the functionality of quanteda to make this possible). I am particularly interested in:

* alternative formats - csv, pdf, Word, XML, etc.
* non-English languages - encoding is one of the issues we plan to tackle
* interesting/non-standard document units
* large volumes of text, to see how quanteda’s functionality scales.