Before we dive into our first coding session, let’s become a bit more familiar with the programming tools used in this course.
We will write our annotated R code using Markdown.
Markdown is a simple formatting syntax to generate HTML or PDF documents. In combination with R, it will generate a document that includes the comments, the R code, and the output of running such code.
You can embed R code in chunks like this one:
1 + 1
## [1] 2
You can run each chunk of code one by one, by highlighting the code and clicking Run (or pressing Ctrl + Enter in Windows or command + enter in OS X). You can see the output of the code in the console right below, inside the RStudio window.
Alternatively, you can generate (or knit) an HTML document with all the code, comments, and output in the entire .Rmd file by clicking on Knit HTML.
You can also embed plots and graphics, for example:
x <- c(1, 3, 4, 5)
y <- c(2, 6, 8, 10)
plot(x, y)
If you run the chunk of code, the plot will be generated in the panel in the bottom right corner. If instead you knit the entire file, the plot will appear in the HTML document when you view it.
Using R + Markdown has several advantages: it leaves an “audit trail” of your work, including documentation explaining the steps you made. This is helpful to not only keep your own progress organized, but also make your work reproducible and more transparent. You can easily correct errors (just fix them and run the script again), and after you have finished, you can generate a PDF or HTML version of your work.
We will be exploring R through R Markdown over the next few modules. For more details and documentation see http://rmarkdown.rstudio.com.
Follow the instructions in the class material and install R and RStudio. If you feel more comfortable using the basic R terminal, skip the step of installing RStudio and the corresponding chunk.
Now run the following code to make sure that you have the current version of R.
version$version.string
## [1] "R version 3.4.3 (2017-11-30)"
This chunk should return R version 3.4.3 (2017-11-30).
rstudioapi::versionInfo()$version
This chunk should print 1.1.383.
installed.packages()["quanteda", "Version"]
## [1] "1.0.0"
This chunk should print 1.0.0 (published at CRAN on 15/1/2018).
If any of those chunks do not print the correct version numbers, head back to the slides of the first lab session and follow the steps outlined there to install the correct versions.
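The same checks can also be scripted rather than compared by eye. The following is a minimal sketch, assuming the version numbers quoted above are treated as minimum versions; the variable names r_ok, has_quanteda and quanteda_ok are made up for illustration.

```r
# Minimum versions taken from the text above; adjust if your course material differs
r_ok <- getRversion() >= "3.4.3"

# requireNamespace() returns FALSE instead of stopping when a package is missing
has_quanteda <- requireNamespace("quanteda", quietly = TRUE)
quanteda_ok <- has_quanteda && utils::packageVersion("quanteda") >= "1.0.0"

cat("R recent enough:       ", r_ok, "\n")
cat("quanteda recent enough:", quanteda_ok, "\n")
```

Both values are single TRUE/FALSE flags, so they can be used directly in an if() statement at the top of a script.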
Start by loading quanteda.
library("quanteda")
There are several useful string manipulation functions in the R base library. In addition, there is the stringr package, which provides an additional interface for simple text manipulation. The package will not be covered in this introduction, but it is highly recommended if your work requires more than basic string manipulation.
The fundamental type (or mode) in which R stores text is the character vector. The simplest case is a character vector of length one. The nchar function returns the number of characters in each element of a character vector.
s1 <- 'my example text'
length(s1)
## [1] 1
nchar(s1)
## [1] 15
The nchar function is vectorized, meaning that when called on a vector it returns a value for each element of the vector.
s2 <- c('This is', 'my example text.', 'So imaginative.')
length(s2)
## [1] 3
nchar(s2)
## [1] 7 16 15
sum(nchar(s2))
## [1] 38
We can use this to answer some simple questions about election manifestos by UK parties on immigration.
Which are the longest and shortest statements? We can query this by combining nchar() with which.max() and which.min().
which.max(nchar(data_char_ukimmig2010))
## BNP
## 1
which.min(nchar(data_char_ukimmig2010))
## PC
## 7
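To see why the output shows both a name and a number, here is a small sketch on a made-up named character vector (statements is hypothetical and only stands in for the manifesto texts). It assumes nothing beyond base R.

```r
# Hypothetical named character vector standing in for the party statements
statements <- c(A = "Short text.",
                B = "A considerably longer statement.",
                C = "Mid-length text.")

lens <- nchar(statements)      # names are carried over from the input
longest  <- which.max(lens)    # named integer: position of the first maximum
shortest <- which.min(lens)    # named integer: position of the first minimum

names(longest)                 # name of the longest element
statements[[shortest]]         # the shortest text itself
```

The value returned by which.max() is the position in the vector, and its name is the name of that element, which is what the two-line output above displays.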
Unlike in some other programming languages, it is not possible to index into a string in R:
s1 <- 'This file contains many fascinating example sentences.'
s1[6:9]
## [1] NA NA NA NA
To extract a substring, we instead use the substr() function. Using the help page from ?substr, execute a call to substr() to return the 6th to the 9th characters of s1 below.
s1 <- 'This file contains many fascinating example sentences.'
substr(s1, 6, 9)
## [1] "file"
A note for you C, Python, Java, … programmers: R counts from 1, not 0.
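The NA results from indexing make sense once you recall that the whole string is a single element of a length-one vector. A short base R sketch of this, using illustrative variable names (s, chars, word, first3) that are not part of the exercise:

```r
s <- "This file contains many fascinating example sentences."

length(s)                              # the whole string is one element
chars <- strsplit(s, "")[[1]]          # split into individual characters
word  <- paste(chars[6:9], collapse = "")   # characters 6 to 9, glued back together

# substr() is vectorized over its first argument
first3 <- substr(c("alpha", "beta"), 1, 3)
```

Splitting into characters and indexing gives the same result as substr(s, 6, 9), but substr() is shorter and works on every element of a vector at once.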
Often we would like to split character vectors to extract a term of interest. This is possible using the strsplit function. Consider the names of the inaugural texts corpus (data_corpus_inaugural):
docnames(data_corpus_inaugural)
## [1] "1789-Washington" "1793-Washington" "1797-Adams"
## [4] "1801-Jefferson" "1805-Jefferson" "1809-Madison"
## [7] "1813-Madison" "1817-Monroe" "1821-Monroe"
## [10] "1825-Adams" "1829-Jackson" "1833-Jackson"
## [13] "1837-VanBuren" "1841-Harrison" "1845-Polk"
## [16] "1849-Taylor" "1853-Pierce" "1857-Buchanan"
## [19] "1861-Lincoln" "1865-Lincoln" "1869-Grant"
## [22] "1873-Grant" "1877-Hayes" "1881-Garfield"
## [25] "1885-Cleveland" "1889-Harrison" "1893-Cleveland"
## [28] "1897-McKinley" "1901-McKinley" "1905-Roosevelt"
## [31] "1909-Taft" "1913-Wilson" "1917-Wilson"
## [34] "1921-Harding" "1925-Coolidge" "1929-Hoover"
## [37] "1933-Roosevelt" "1937-Roosevelt" "1941-Roosevelt"
## [40] "1945-Roosevelt" "1949-Truman" "1953-Eisenhower"
## [43] "1957-Eisenhower" "1961-Kennedy" "1965-Johnson"
## [46] "1969-Nixon" "1973-Nixon" "1977-Carter"
## [49] "1981-Reagan" "1985-Reagan" "1989-Bush"
## [52] "1993-Clinton" "1997-Clinton" "2001-Bush"
## [55] "2005-Bush" "2009-Obama" "2013-Obama"
## [58] "2017-Trump"
# returns a list of parts
parts <- strsplit(docnames(data_corpus_inaugural), '-')
years <- sapply(parts, function(x) x[1])
pres <- sapply(parts, function(x) x[2])
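The same pattern can be tried on a smaller input to see what strsplit() returns. This is a sketch on a hypothetical three-element vector (doc_ids, yrs and nms are illustrative names), using vapply() as a stricter alternative to sapply():

```r
# A short, hypothetical subset of document names in the same "year-name" format
doc_ids <- c("1789-Washington", "1797-Adams", "1801-Jefferson")

parts <- strsplit(doc_ids, "-", fixed = TRUE)
class(parts)      # strsplit() always returns a list, one element per input string
parts[[2]]        # double brackets return the element itself, here a character vector

# vapply() works like sapply() but checks that each result is one character string
yrs <- vapply(parts, function(x) x[1], character(1))
nms <- vapply(parts, `[`, character(1), 2)
```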
Examine the previous code carefully, as it uses R’s list data type, which is fundamentally important to understand. In quanteda, the tokens class of object (created when you call tokens() on a character object or corpus) is a type of list. Try it:
toks <- tokens("This is a sentence containing some charactères français.")
Now examine the “structure” of that object, assigned to toks, using str().
Try printing toks by simply typing its name in the console and pressing Enter. Can you explain why it looks the way that it does? Hint: You can examine all available “methods” for an object class using the methods() function. Try methods(class = "tokens"), and use the help function ?methods to explain what you see.
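As a rough illustration of the idea behind the hint, the sketch below builds a toy class on top of a plain list and gives it its own print method. The class name toy_tokens and its print format are invented for this example and are not how quanteda implements its tokens class.

```r
# A hypothetical class "toy_tokens" built on a plain list, for illustration only
toy <- structure(list(text1 = c("This", "is", "a", "sentence")),
                 class = "toy_tokens")

# Printing an object of this class dispatches to this method
print.toy_tokens <- function(x, ...) {
  for (nm in names(x)) {
    cat(nm, ":", paste(x[[nm]], collapse = " | "), "\n")
  }
  invisible(x)
}

print(toy)          # uses print.toy_tokens()
str(unclass(toy))   # the underlying object is still an ordinary list
```

What you see when you type an object's name is therefore decided by the print method for its class, not by how the object is stored.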
The paste function is used to join character vectors together. The way in which the elements are combined depends on the values of the sep and collapse arguments:
paste('one','two','three')
## [1] "one two three"
paste('one','two','three', sep = '_')
## [1] "one_two_three"
paste(years, pres, sep = '-')
## [1] "1789-Washington" "1793-Washington" "1797-Adams"
## [4] "1801-Jefferson" "1805-Jefferson" "1809-Madison"
## [7] "1813-Madison" "1817-Monroe" "1821-Monroe"
## [10] "1825-Adams" "1829-Jackson" "1833-Jackson"
## [13] "1837-VanBuren" "1841-Harrison" "1845-Polk"
## [16] "1849-Taylor" "1853-Pierce" "1857-Buchanan"
## [19] "1861-Lincoln" "1865-Lincoln" "1869-Grant"
## [22] "1873-Grant" "1877-Hayes" "1881-Garfield"
## [25] "1885-Cleveland" "1889-Harrison" "1893-Cleveland"
## [28] "1897-McKinley" "1901-McKinley" "1905-Roosevelt"
## [31] "1909-Taft" "1913-Wilson" "1917-Wilson"
## [34] "1921-Harding" "1925-Coolidge" "1929-Hoover"
## [37] "1933-Roosevelt" "1937-Roosevelt" "1941-Roosevelt"
## [40] "1945-Roosevelt" "1949-Truman" "1953-Eisenhower"
## [43] "1957-Eisenhower" "1961-Kennedy" "1965-Johnson"
## [46] "1969-Nixon" "1973-Nixon" "1977-Carter"
## [49] "1981-Reagan" "1985-Reagan" "1989-Bush"
## [52] "1993-Clinton" "1997-Clinton" "2001-Bush"
## [55] "2005-Bush" "2009-Obama" "2013-Obama"
## [58] "2017-Trump"
paste(years, pres, collapse = '-')
## [1] "1789 Washington-1793 Washington-1797 Adams-1801 Jefferson-1805 Jefferson-1809 Madison-1813 Madison-1817 Monroe-1821 Monroe-1825 Adams-1829 Jackson-1833 Jackson-1837 VanBuren-1841 Harrison-1845 Polk-1849 Taylor-1853 Pierce-1857 Buchanan-1861 Lincoln-1865 Lincoln-1869 Grant-1873 Grant-1877 Hayes-1881 Garfield-1885 Cleveland-1889 Harrison-1893 Cleveland-1897 McKinley-1901 McKinley-1905 Roosevelt-1909 Taft-1913 Wilson-1917 Wilson-1921 Harding-1925 Coolidge-1929 Hoover-1933 Roosevelt-1937 Roosevelt-1941 Roosevelt-1945 Roosevelt-1949 Truman-1953 Eisenhower-1957 Eisenhower-1961 Kennedy-1965 Johnson-1969 Nixon-1973 Nixon-1977 Carter-1981 Reagan-1985 Reagan-1989 Bush-1993 Clinton-1997 Clinton-2001 Bush-2005 Bush-2009 Obama-2013 Obama-2017 Trump"
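The difference between sep and collapse is easier to see on a short input. A minimal sketch with two made-up vectors (yrs2 and nms2 are illustrative names, chosen so as not to overwrite years and pres):

```r
yrs2 <- c("1789", "1793")
nms2 <- c("Washington", "Washington")

# sep joins the arguments element by element: the result has two elements
by_element <- paste(yrs2, nms2, sep = "-")

# collapse then joins those elements into one single string
one_string <- paste(yrs2, nms2, sep = "-", collapse = "; ")

# paste0() is shorthand for paste(..., sep = "")
ids <- paste0("doc", 1:3)
```

In the last call above, only collapse was supplied, so sep kept its default of a single space, which is why the years and names are separated by spaces and the pairs by hyphens.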
tolower and toupper change the case of character objects:
tolower(s1)
## [1] "this file contains many fascinating example sentences."
toupper(s1)
## [1] "THIS FILE CONTAINS MANY FASCINATING EXAMPLE SENTENCES."
These are also examples of “vectorized” functions: they operate on every element of a vector, rather than only on a single value. Try these functions on the character vectors below:
s_vec <- c("Quanteda is the Best Text Package Ever, approved by NATO!",
"Quanteda является лучший текст пакет тех, утвержденной НАТО!")
Try running tolower() on that vector. What results?
quanteda has its own, smarter lowercase function, called char_tolower(). Try it on s_vec. There is an option to preserve acronyms: try it a second time while preserving the acronym NATO as uppercase. To find out how, read the fine manual (RTFM): ?char_tolower.
Note how this works in English as well as in Russian thanks to the marvels of Unicode!
Character vectors can be compared using the == and %in% operators:
char_tolower(s1) == char_toupper(s1)
## [1] FALSE
'apples' == 'oranges'
## [1] FALSE
char_tolower(s1) == char_tolower(s1)
## [1] TRUE
'pears' == 'pears'
## [1] TRUE
c1 <- c('apples', 'oranges', 'pears')
'pears' %in% c1
## [1] TRUE
c2 <- c('bananas', 'pears')
c2 %in% c1
## [1] FALSE TRUE
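Both operators return logical vectors, so they can be used to subset. A brief base R sketch, with fruit as a made-up example vector:

```r
fruit <- c("apples", "oranges", "pears")

# == compares element by element and returns one logical per element
is_pear <- fruit == "pears"

# %in% asks, for each element on the left, whether it occurs anywhere on the right
kept <- fruit[fruit %in% c("pears", "bananas")]

# Comparison is case-sensitive unless the case is normalised first
raw_match   <- "Pears" == "pears"
lower_match <- tolower("Pears") == "pears"
```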
It is common to “clean” texts before processing, usually by removing punctuation and digits and converting to lower case.
“Cleaning” in quanteda takes place through decisions made at the tokenization stage. In order to count word frequencies, we first need to split the text into words through a process known as tokenization. Look at the documentation for quanteda’s tokens command using the built-in help function (? before any object/command). Use the tokens command on data_char_sampletext (a built-in data object in the quanteda package), and examine the results.
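To make the idea of cleaning and tokenizing concrete, here is a deliberately crude base R stand-in. It is not what quanteda's tokens() does internally; it only assumes that words are separated by whitespace, and the sentence in txt is made up.

```r
txt <- "Instead, we have 2 parties promising real change!"

# Remove punctuation and digits, then convert to lower case
no_punct <- gsub("[[:punct:][:digit:]]+", "", txt)
cleaned  <- trimws(tolower(no_punct))

# Split on runs of whitespace to obtain word tokens
toks_base <- strsplit(cleaned, "[[:space:]]+")[[1]]

# Count how often each token occurs
freq <- table(toks_base)
```

Real tokenizers handle many cases this sketch ignores, such as hyphenated words, URLs and languages that do not separate words with spaces, which is one reason to rely on tokens() in practice.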
Stemming.
Stemming removes suffixes using the Porter stemmer, found in the SnowballC library. The quanteda functions to invoke the stemmer are char_wordstem(), tokens_wordstem(), and dfm_wordstem(). Apply stemming to the exampleString and examine the results. Why does it not work, and what do you need to do to make it work? How would you apply this to the sentence-segmented vector?
Applying pre-processing to the creation of a dfm.
quanteda’s dfm() function by default applies certain “cleaning” steps to the text, which are not the defaults in tokens(). Create a dfm from data_char_sampletext. What are the differences between the steps applied by dfm() and the default settings for tokens()?
sample_corpus <- corpus(data_char_sampletext)
(sample_dfm <- dfm(sample_corpus))
## Document-feature matrix of: 1 document, 239 features (0% sparse).
Inspect the dimensions of the resulting objects, including the names of the words extracted as features. To get the “words” from a quanteda object, you can use the featnames() function:
head(featnames(sample_dfm), 20)
## [1] "instead" "we" "have" "a" "fine"
## [6] "gael-labour" "party" "government" "," "coming"
## [11] "into" "power" "promising" "real" "change"
## [16] "but" "slavishly" "following" "the" "previous"
Keywords in context.
Use the kwic function to discover the context of the word “clean”. Is this associated with environmental policy?
Using the data_corpus_irishbudget2010 object, examine the context for the word “Irish”. What is its predominant usage?
Examine the context of words related to “disaster”. Hint: you can use the stem of the word along with setting the valuetype argument to "regex". Execute a query using a pattern match that returns different variations of words based on “disaster” (such as disasters, disastrous, disastrously, etc.).
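Before building the query, it can help to test a candidate pattern on a few words in base R. This sketch only shows how a regular expression anchored on the stem behaves; the word list is made up, and the pattern "^disast" is one possible choice rather than the required answer.

```r
words <- c("disaster", "disasters", "disastrous", "disastrously",
           "distant", "aster")

# "^" anchors the match at the start of the word, so "aster" alone does not match
hits <- grepl("^disast", words)
matched <- words[hits]
matched
```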
Load the text of Herman Melville’s Moby Dick. You can use the base R solution or alternatively install the readtext package and use its simpler interface. Use kwic() to search for “Ahab”, and save this object. Send it to textplot_xray().
# This is a base R solution to reading text from a URL
mobydicktf <- paste(readLines("https://kenbenoit.net/assets/files/pg2701.txt"), collapse = "\n")
# If you install the "readtext" package, you can use the following simpler code:
# readtext::readtext("https://kenbenoit.net/assets/files/pg2701.txt")
# Highly recommended if you plan to read
mobydickCorpus <- corpus(mobydicktf, docvars = data.frame(doc_id = "pg2701.txt"))
# command to produce a kwic
# command to produce the x-ray plot
Descriptive statistics
Compute descriptive statistics for the data_corpus_irishbudget2010 object. Hint: ?summary.
In R’s “S3” object-oriented system, functions of the same name can be written so as to dispatch different “methods” depending on the class of the object on which the function is called. This also explains why you get certain warning messages when you attach the quanteda package, e.g.
detach("package:quanteda")
require(quanteda)
## Loading required package: quanteda
## quanteda version 1.0.0
## Using 3 of 4 threads for parallel computing
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
Here, the object View (a function) from the utils package, which is one of the standard packages that is always attached when you start R, has been superseded in priority in R’s “namespace” by another object (also a function) called View from the quanteda package. Compare the two using ?View, where you should see two versions listed.
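The dispatch mechanism described above can be reproduced in a few lines. The generic describe() and its methods below are hypothetical and exist only for this sketch; the last line shows the :: operator, which selects a function from a named package regardless of masking.

```r
# A hypothetical S3 generic: UseMethod() picks a method based on the class of x
describe <- function(x, ...) UseMethod("describe")

describe.default    <- function(x, ...) "something else"
describe.character  <- function(x, ...) paste("character vector of length", length(x))
describe.data.frame <- function(x, ...) paste("data frame with", nrow(x), "rows")

describe(c("a", "b"))            # dispatches to describe.character()
describe(data.frame(x = 1:3))    # dispatches to describe.data.frame()
describe(1L)                     # no integer method, so describe.default() is used

# pkg::name always refers to that package's version, even if the name is masked
utils::head(letters, 3)
```

In the same way, utils::View and quanteda::View can each be reached explicitly, whichever one a bare View currently resolves to.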