Work your way through the examples, studying each to understand what it is doing. Where questions are asked, include your answer when you write this up.
Ways to prepare your answer:
Naming your file: Please use the following convention:
Exercise1_Lastname_FirstName.pdf
(or whatever extension is appropriate)
Submitting your answers: Can be done by email to kbenoit@tcd.ie.
Preliminaries: Installation
First, you need to have quanteda installed. You can do this from inside RStudio, from the Tools…Install Packages menu, or simply using
install.packages("quanteda")
(Optional) You can install some additional corpus data from quantedaData using
## the devtools package is required to install quanteda from Github
devtools::install_github("kbenoit/quantedaData")
Note that on Windows platforms it is also highly recommended that you install the Rtools suite, and on OS X that you install Xcode from the App Store.
Before you can execute the quanteda commands in this file, you will need to attach its functions using a require()
or library()
call.
require(quanteda)
## Loading required package: quanteda
##
## Attaching package: 'quanteda'
##
## The following object is masked from 'package:stats':
##
## df
##
## The following object is masked from 'package:base':
##
## sample
Now summarize some texts in the Irish 2010 budget speech corpus:
summary(ie2010Corpus)
## Corpus consisting of 14 documents.
##
## Text Types Tokens Sentences year debate
## 2010_BUDGET_01_Brian_Lenihan_FF 1754 7916 404 2010 BUDGET
## 2010_BUDGET_02_Richard_Bruton_FG 995 4086 217 2010 BUDGET
## 2010_BUDGET_03_Joan_Burton_LAB 1521 5790 309 2010 BUDGET
## 2010_BUDGET_04_Arthur_Morgan_SF 1499 6510 345 2010 BUDGET
## 2010_BUDGET_05_Brian_Cowen_FF 1544 5964 252 2010 BUDGET
## 2010_BUDGET_06_Enda_Kenny_FG 1087 3896 155 2010 BUDGET
## 2010_BUDGET_07_Kieran_ODonnell_FG 638 2086 133 2010 BUDGET
## 2010_BUDGET_08_Eamon_Gilmore_LAB 1123 3807 202 2010 BUDGET
## 2010_BUDGET_09_Michael_Higgins_LAB 457 1149 44 2010 BUDGET
## 2010_BUDGET_10_Ruairi_Quinn_LAB 415 1181 60 2010 BUDGET
## 2010_BUDGET_11_John_Gormley_Green 381 939 50 2010 BUDGET
## 2010_BUDGET_12_Eamon_Ryan_Green 486 1519 90 2010 BUDGET
## 2010_BUDGET_13_Ciaran_Cuffe_Green 426 1144 45 2010 BUDGET
## 2010_BUDGET_14_Caoimhghin_OCaolain_SF 1110 3699 177 2010 BUDGET
## number foren name party
## 01 Brian Lenihan FF
## 02 Richard Bruton FG
## 03 Joan Burton LAB
## 04 Arthur Morgan SF
## 05 Brian Cowen FF
## 06 Enda Kenny FG
## 07 Kieran ODonnell FG
## 08 Eamon Gilmore LAB
## 09 Michael Higgins LAB
## 10 Ruairi Quinn LAB
## 11 John Gormley Green
## 12 Eamon Ryan Green
## 13 Ciaran Cuffe Green
## 14 Caoimhghin OCaolain SF
##
## Source: /home/paul/Dropbox/code/quantedaData/* on x86_64 by paul
## Created: Tue Sep 16 15:58:21 2014
## Notes:
Create a document-feature matrix from this corpus, removing stop words:
ieDfm <- dfm(ie2010Corpus, ignoredFeatures = c(stopwords("english"), "will"), stem = TRUE)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 14 documents
## ... indexing features: 4,881 feature types
## ... removed 118 features, from 175 supplied (glob) feature types
## ... stemming features (English), trimmed 1510 feature variants
## ... created a 14 x 3253 sparse dfm
## ... complete.
## Elapsed time: 0.102 seconds.
Look at the top occurring features:
topfeatures(ieDfm)
## budget peopl govern year minist tax public economi cut
## 271 266 242 198 197 195 179 172 172
## job
## 148
Make a word cloud:
plot(ieDfm, min.freq=25, random.order=FALSE)
Did you get the same output?
Basic string manipulation functions in R
There are several useful string manipulation functions in the R base library. In addition, we will look at the stringr
package, which provides a consistent interface for simple text manipulation.
The fundamental type (or mode
) in which R stores text is the character vector. The simplest case is a character vector of length one. The nchar
function returns the number of characters in a character vector.
require(quanteda)
s1 <- 'my example text'
length(s1)
## [1] 1
nchar(s1)
## [1] 15
The nchar
function is vectorized, meaning that when called on a vector it returns a value for each element of the vector.
s2 <- c('This is', 'my example text.', 'So imaginative.')
length(s2)
## [1] 3
nchar(s2)
## [1] 7 16 15
sum(nchar(s2))
## [1] 38
We can use this to answer some simple questions about the inaugural addresses.
Which were the longest and shortest speeches? We can query this using the nchar()
function together with which.max()
and which.min()
.
which.max(nchar(inaugTexts))
## 1841-Harrison
## 14
which.min(nchar(inaugTexts))
## 1793-Washington
## 2
Unlike in some other programming languages, it is not possible to index into the characters of a string in R using the [ operator:
s1 <- 'This file contains many fascinating example sentences.'
s1[6:9]
## [1] NA NA NA NA
To extract a substring, instead we use the substr()
function. Using the help page from ?substr
, execute a call to substr()
to return the characters from s1
below from the 6th to the 9th characters.
s1 <- 'This file contains many fascinating example sentences.'
substr(s1, 6, 9)
## [1] "file"
A note for you C programmers: R counts from 1, not 0.
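To see the 1-based indexing in action, this short sketch extracts the first word of the same sentence:

```r
s1 <- 'This file contains many fascinating example sentences.'
# character positions start at 1, so positions 1 through 4 give the first word
substr(s1, 1, 4)
## [1] "This"
```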
Often we would like to split character vectors to extract a term of interest. This is possible using the strsplit
function. Consider the names of the inaugural texts:
names(inaugTexts)
## [1] "1789-Washington" "1793-Washington" "1797-Adams"
## [4] "1801-Jefferson" "1805-Jefferson" "1809-Madison"
## [7] "1813-Madison" "1817-Monroe" "1821-Monroe"
## [10] "1825-Adams" "1829-Jackson" "1833-Jackson"
## [13] "1837-VanBuren" "1841-Harrison" "1845-Polk"
## [16] "1849-Taylor" "1853-Pierce" "1857-Buchanan"
## [19] "1861-Lincoln" "1865-Lincoln" "1869-Grant"
## [22] "1873-Grant" "1877-Hayes" "1881-Garfield"
## [25] "1885-Cleveland" "1889-Harrison" "1893-Cleveland"
## [28] "1897-McKinley" "1901-McKinley" "1905-Roosevelt"
## [31] "1909-Taft" "1913-Wilson" "1917-Wilson"
## [34] "1921-Harding" "1925-Coolidge" "1929-Hoover"
## [37] "1933-Roosevelt" "1937-Roosevelt" "1941-Roosevelt"
## [40] "1945-Roosevelt" "1949-Truman" "1953-Eisenhower"
## [43] "1957-Eisenhower" "1961-Kennedy" "1965-Johnson"
## [46] "1969-Nixon" "1973-Nixon" "1977-Carter"
## [49] "1981-Reagan" "1985-Reagan" "1989-Bush"
## [52] "1993-Clinton" "1997-Clinton" "2001-Bush"
## [55] "2005-Bush" "2009-Obama" "2013-Obama"
# returns a list of parts
parts <- strsplit(names(inaugTexts), '-')
years <- sapply(parts, function(x) x[1])
pres <- sapply(parts, function(x) x[2])
Examine the previous code carefully, as it uses list data types in R, which are fundamentally important to understand. In quanteda, the tokenizedTexts
class of object – created when you call tokenize()
on a character object or corpus – is a type of list. Try it:
toks <- tokenize("This is a sentence containing some caractères Français.")
Now examine the “structure” of that object – assigned to “toks
” – using str()
. What does it indicate?
Try sending toks
to the global environment, by simply typing its name in the console and pressing Enter. Can you explain why it looks the way that it does? Hint: You can examine all available “methods” for an object class using the methods()
function. Try methods(class = "tokenizedTexts")
, and use the help function ?methods
to explain what you see.
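As a plain base-R warm-up for str(), here is what it reports for the kind of list that strsplit() returns (a sketch using a two-element toy vector; the tokenizedTexts object should look similar, as it is also a list, but carries its own class attribute):

```r
# strsplit() returns a list with one character vector per input element
parts <- strsplit(c("1789-Washington", "1793-Washington"), "-")
str(parts)
## List of 2
##  $ : chr [1:2] "1789" "Washington"
##  $ : chr [1:2] "1793" "Washington"
```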
The paste
function is used to join character vectors together. The way in which the elements are combined depends on the values of the sep
and collapse
arguments:
paste('one','two','three')
## [1] "one two three"
paste('one','two','three', sep='_')
## [1] "one_two_three"
paste(years, pres, sep='-')
## [1] "1789-Washington" "1793-Washington" "1797-Adams"
## [4] "1801-Jefferson" "1805-Jefferson" "1809-Madison"
## [7] "1813-Madison" "1817-Monroe" "1821-Monroe"
## [10] "1825-Adams" "1829-Jackson" "1833-Jackson"
## [13] "1837-VanBuren" "1841-Harrison" "1845-Polk"
## [16] "1849-Taylor" "1853-Pierce" "1857-Buchanan"
## [19] "1861-Lincoln" "1865-Lincoln" "1869-Grant"
## [22] "1873-Grant" "1877-Hayes" "1881-Garfield"
## [25] "1885-Cleveland" "1889-Harrison" "1893-Cleveland"
## [28] "1897-McKinley" "1901-McKinley" "1905-Roosevelt"
## [31] "1909-Taft" "1913-Wilson" "1917-Wilson"
## [34] "1921-Harding" "1925-Coolidge" "1929-Hoover"
## [37] "1933-Roosevelt" "1937-Roosevelt" "1941-Roosevelt"
## [40] "1945-Roosevelt" "1949-Truman" "1953-Eisenhower"
## [43] "1957-Eisenhower" "1961-Kennedy" "1965-Johnson"
## [46] "1969-Nixon" "1973-Nixon" "1977-Carter"
## [49] "1981-Reagan" "1985-Reagan" "1989-Bush"
## [52] "1993-Clinton" "1997-Clinton" "2001-Bush"
## [55] "2005-Bush" "2009-Obama" "2013-Obama"
paste(years, pres, collapse='-')
## [1] "1789 Washington-1793 Washington-1797 Adams-1801 Jefferson-1805 Jefferson-1809 Madison-1813 Madison-1817 Monroe-1821 Monroe-1825 Adams-1829 Jackson-1833 Jackson-1837 VanBuren-1841 Harrison-1845 Polk-1849 Taylor-1853 Pierce-1857 Buchanan-1861 Lincoln-1865 Lincoln-1869 Grant-1873 Grant-1877 Hayes-1881 Garfield-1885 Cleveland-1889 Harrison-1893 Cleveland-1897 McKinley-1901 McKinley-1905 Roosevelt-1909 Taft-1913 Wilson-1917 Wilson-1921 Harding-1925 Coolidge-1929 Hoover-1933 Roosevelt-1937 Roosevelt-1941 Roosevelt-1945 Roosevelt-1949 Truman-1953 Eisenhower-1957 Eisenhower-1961 Kennedy-1965 Johnson-1969 Nixon-1973 Nixon-1977 Carter-1981 Reagan-1985 Reagan-1989 Bush-1993 Clinton-1997 Clinton-2001 Bush-2005 Bush-2009 Obama-2013 Obama"
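Note the difference: sep governs how corresponding elements are joined within each pair, while collapse then joins all the resulting strings into a single string. The two can be combined, as in this sketch on a short toy subset:

```r
years <- c("1789", "1793")
pres <- c("Washington", "Washington")
# sep joins within each year/president pair; collapse then joins the pairs
paste(years, pres, sep = "-", collapse = "; ")
## [1] "1789-Washington; 1793-Washington"
```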
tolower
and toupper
change the case of character objects:
tolower(s1)
## [1] "this file contains many fascinating example sentences."
toupper(s1)
## [1] "THIS FILE CONTAINS MANY FASCINATING EXAMPLE SENTENCES."
These are also examples of “vectorized” functions: they operate on each element of a vector, rather than only on single values. Try these functions on the character vectors below:
sVec <- c("Quanteda is the Best Text Package Ever, approved by NATO!",
"Quanteda является лучший текст пакет тех, утвержденной НАТО!")
Try running tolower()
on that vector. What results?
quanteda has its own, smarter lowercase function, called toLower()
. Try it on sVec
. There is an option to preserve the acronym – try it a second time while preserving the acronym NATO
as uppercase. To find out how, read the fine manual (RTFM): ?toLower.
Counting and comparing objects.
Character vectors can be compared using the ==
and %in%
operators:
tolower(s1) == toupper(s1)
## [1] FALSE
'apples'=='oranges'
## [1] FALSE
tolower(s1) == tolower(s1)
## [1] TRUE
'pears' == 'pears'
## [1] TRUE
c1 <- c('apples', 'oranges', 'pears')
'pears' %in% c1
## [1] TRUE
c2 <- c('bananas', 'pears')
c2 %in% c1
## [1] FALSE TRUE
Extra credit: Try using this with the length()
function to figure out how many times new
occurs in the tokenized text of the 57th inaugural speech, which you can access as a quanteda built-in object as inaugTexts[57]
. Hint: use %in%
to return a logical vector, then call sum()
on the result; the logical values are coerced to 1s and 0s, so the sum gives the count.
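The counting idiom from the hint looks like this on a made-up toy token vector (a sketch only; apply it yourself to the tokenized 57th speech):

```r
toks <- c("the", "new", "year", "brings", "new", "hope")
# %in% returns a logical vector; sum() coerces TRUE to 1 and FALSE to 0
sum(toks %in% "new")
## [1] 2
```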
The base functions for searching and replacing within text, grep
and gsub
, are similar to familiar commands from other text manipulation environments. The grep
manual page provides an overview of these functions.
The grep
command returns the indices of the elements in which a pattern occurs:
grep('orange', 'these are oranges')
## [1] 1
grep('pear', 'these are oranges')
## integer(0)
grep('orange', c('apples', 'oranges', 'pears'))
## [1] 2
grep('pears', c('apples', 'oranges', 'pears'))
## [1] 3
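Two related variants are worth knowing: grep() with value = TRUE returns the matching elements themselves rather than their indices, and grepl() returns a logical vector of the same length as its input:

```r
fruit <- c('apples', 'oranges', 'pears')
# value = TRUE returns the matching strings, not their positions
grep('an', fruit, value = TRUE)
## [1] "oranges"
# grepl() returns TRUE/FALSE for every element
grepl('an', fruit)
## [1] FALSE  TRUE FALSE
```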
The gsub
command substitutes one pattern for another within a string:
gsub('oranges', 'apples', 'these are oranges')
## [1] "these are apples"
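Keep in mind that gsub() patterns are regular expressions by default, which matters for metacharacters such as '.'; pass fixed = TRUE to match a pattern literally:

```r
# '.' is a regex metacharacter that matches any single character
gsub('.', '!', 'a.b.c')
## [1] "!!!!!"
# fixed = TRUE treats the pattern as a literal string
gsub('.', '!', 'a.b.c', fixed = TRUE)
## [1] "a!b!c"
```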
Making a corpus and corpus structure
From a vector of texts already in memory.
The simplest way to create a corpus is to use a vector of texts already present in R’s global environment. Some text and corpus objects are built into the package, for example inaugTexts
is the UTF-8 encoded set of 57 presidential inaugural addresses. Try using corpus()
on this set of texts to create a corpus.
Once you have constructed this corpus, use the summary()
method to see a brief description of the corpus. The names of the character vector inaugTexts
should have become the document names.
From a directory of text files.
The corpus()
function can take as its main argument the name of a directory, if you wrap the path to the directory within a directory()
call. (See ?directory
for an example.) If you call directory()
with no arguments, then it should allow you to choose the directory interactively (you will need to have installed the tcltk2
package first, though.)
Here you are encouraged to select any directory of plain text files of your own.
How did it work? Try using docvars()
to assign a set of document-level variables.
Note that if you include document-level metadata in your filenames, then this can be automatically parsed by corpus.directory()
into docvars
.
# mytf <- textfile("~/Dropbox/QUANTESS/corpora/home_office_animals/txts/*.txt", encoding = "UTF-8")
mytf <- textfile("~/Dropbox/QUANTESS/corpora/amicus/all/*.txt")
mycorpus <- corpus(mytf)
summary(mycorpus, 5)
## Corpus consisting of 102 documents, showing 5 documents.
##
## Text Types Tokens Sentences
## sAP01.txt 1660 6441 256
## sAP02.txt 1913 6645 393
## sAP03.txt 1958 8123 475
## sAP04.txt 1258 4922 232
## sAP05.txt 2031 7375 372
##
## Source: /Users/kbenoit/Dropbox/Classes/Trinity/Text Analysis 2016/Exercises/Exercise 1/* on x86_64 by kbenoit
## Created: Sun Feb 7 18:41:50 2016
## Notes:
There are many other ways to create a corpus, most using the intermediate function textfile()
to read texts into R. Explore these ways by studying ?textfile
. Can you reproduce the examples?