The purpose of this assignment is to familiarize yourself with R, RMarkdown, and quanteda
.
As with the rest of assignments throughout the term, the goal is to write R code in the chunks within this RMarkdown file that will accomplish the tasks detailed below. Where questions are asked, include your answer in the after the question.
Start by loading the quanteda
package.
library("quanteda")
## quanteda version 1.0.4
## Using 7 of 8 threads for parallel computing
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
You will use the corpus of inaugural addresses by U.S. presidents data_corpus_inaugural
for this part of the exercise. These data are available as soon as you load quanteda
and are lazily-loaded once you access them.
Write code to inspect which inaugural address was the longest. Have a look at texts()
to extract character vectors from the quanteda corpus object.
which.max(nchar(texts(data_corpus_inaugural)))
## 1841-Harrison
## 14
Explain how this answer works.
This answer works because of the magic of the most amazing text package in existence. And because whatever.
How could you do this using a quanteda function?
Which president gave the longest speech and in what year was that?
Generate a document-feature matrix from this corpus and call it ic_dfm
. Make sure that the text is put into lowercase and words are stemmed as preprocessing steps.
# fill in the line below
# ic_dfm <- dfm(...
Now, look at the 20 most common features in the document-feature matrix. Hint: ?topfeatures
# examine top 20 common features
# topfeatures(, n = 10)
What do most of those tokens have in common?
What step(s) might you take to focus on a more interesting set of word frequencies?
Write code to figure out how many different types of punctuation and how many different types of stopwords were removed. You may use tokens()
. Hints: try the
verbose = TRUEoption. You might prefer to do this using
tokens()or some of the
tokens_*()` functions.
Plot a wordcloud of the dfm that has the stopwords, punctuation characters, and numbers removed, but without stemming. Plot this only for words with a minimum count of 10.
# textplot_wordcloud( )
Search for the word “Christmas” from the data_corpus_irishbudget2010
object. What is the context in which this term is being used?
Search for the word “Irish” and save the results of this keywords-in-context object to a new object called irish_kwic
. Plot an “x-ray” plot for this kwic, in the code bloc below:
# textplot_xray()
Examine the context of words related to “disaster”. Hint: you can use the stem of the word along with setting the regex
argument to TRUE
. Execute a query using a pattern match that returns different variations of words based on “disaster” (such as disasters, disastrous, disastrously, etc.).