In this assignment, you will use R and quanteda to understand and apply document classification and supervised scaling.
We will start with a classic computer science dataset of movie reviews (Pang and Lee 2004). The movies corpus has a docvar Sentiment that labels each text as either pos or neg according to the star rating of the original review archived on imdb.com. We will begin by examining the conditional probabilities at the word level.
require(quanteda, warn.conflicts = FALSE, quietly = TRUE)
data(data_corpus_movies, package = "quanteda.corpora")
summary(data_corpus_movies, 10)
## Warning in nsentence.character(object, ...): nsentence() does not correctly
## count sentences in all lower-cased text
## Corpus consisting of 2000 documents, showing 10 documents:
##
## Text Types Tokens Sentences Sentiment id1 id2
## neg_cv000_29416 354 841 9 neg cv000 29416
## neg_cv001_19502 156 278 1 neg cv001 19502
## neg_cv002_17424 276 553 3 neg cv002 17424
## neg_cv003_12683 314 564 2 neg cv003 12683
## neg_cv004_12641 380 842 2 neg cv004 12641
## neg_cv005_29357 328 749 1 neg cv005 29357
## neg_cv006_17022 331 643 5 neg cv006 17022
## neg_cv007_4992 325 676 6 neg cv007 4992
## neg_cv008_29326 441 797 10 neg cv008 29326
## neg_cv009_29417 401 965 23 neg cv009 29417
##
## Source: /Users/kbenoit/Dropbox/QUANTESS/quantedaData_kenlocal_gh/* on x86_64 by kbenoit
## Created: Sat Nov 15 18:43:25 2014
## Notes:
What is the overall probability of the class pos in the corpus? Are the classes balanced? (Hint: Use table() on the Sentiment docvar.)
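A minimal sketch of the hint:
# tabulate the Sentiment docvar, then convert to proportions
table(docvars(data_corpus_movies, "Sentiment"))
prop.table(table(docvars(data_corpus_movies, "Sentiment")))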
Make a dfm from the corpus, grouping the documents by the Sentiment docvar.
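One way to do this, assuming the quanteda v1 syntax used elsewhere in this exercise (the object name movies_dfm_grouped is illustrative):
movies_dfm_grouped <- dfm(data_corpus_movies, groups = "Sentiment")   # one row per Sentiment class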
Words with very low overall frequencies in a corpus of this size are unlikely to be good general predictors. Remove words that occur fewer than twenty times using dfm_trim().
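For instance, reusing the hypothetical grouped dfm from the previous sketch and assuming a quanteda version with the min_termfreq argument:
movies_dfm_grouped <- dfm_trim(movies_dfm_grouped, min_termfreq = 20)   # drop features occurring fewer than 20 times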
Compute the word likelihoods within each class, i.e. the probability of each word given the classes pos and neg. What are the word likelihoods for "good" and "great"? What do you learn? Use kwic() to find out the context of "good". Clue: you don't have to compute the probabilities by hand; you should be able to obtain them using dfm_weight().
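A sketch of how these could be obtained from the grouped, trimmed dfm, assuming a quanteda version where dfm_weight() accepts scheme = "prop":
movies_dfm_prop <- dfm_weight(movies_dfm_grouped, scheme = "prop")   # each row (class) now sums to 1
movies_dfm_prop[, c("good", "great")]                                # word likelihoods within each class
kwic(data_corpus_movies, "good")                                     # keyword-in-context view of "good"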
Now we will use quanteda's naive Bayes model textmodel_nb() to run a prediction on the movie reviews. First, shuffle the documents to break up the original ordering of all negative reviews followed by all positive reviews, using the corpus_sample() function:
set.seed(1234) # use this just before the command below
moviesShuffled <- corpus_sample(data_corpus_movies, size = 2000)
Next, make a dfm from the shuffled corpus, and make training labels. In this case, we are using 1500 training labels, and leaving the remaining 500 unlabelled to use as a test set. Trim the dataset to remove rare features.
movieDfm <- dfm_trim(dfm(moviesShuffled, verbose = FALSE), min_termfreq = 10)
trainclass <- factor(c(docvars(moviesShuffled, "Sentiment")[1:1500], rep(NA, 500)))
table(trainclass, useNA = "ifany")
## trainclass
## neg pos <NA>
## 737 763 500
(7.5 points) Now, run the training and testing commands of the Naive Bayes classifier, and compare the predictions for the documents with the actual document labels for the test set using a confusion matrix.
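One possible sketch, assuming a quanteda version where predict() on a fitted textmodel_nb returns the predicted classes (movTable is the confusion-matrix name referred to in the hint below; the other names are illustrative):
movieNb   <- textmodel_nb(movieDfm[1:1500, ], trainclass[1:1500])   # train on the labelled documents
predicted <- predict(movieNb, newdata = movieDfm[1501:2000, ])      # predict the held-out 500 reviews
# confusion matrix: rows are predictions, columns are the true labels
movTable  <- table(predicted, docvars(moviesShuffled, "Sentiment")[1501:2000])
movTable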
Compute the following statistics for the last classification: precision and recall, for the prediction of positive reviews; F1 from the above; and accuracy.
Hint: Computing precision and recall is not the same if we treat the "true positive" as correctly predicting a positive review versus correctly predicting a negative review. Since the levels of Sentiment are ordered alphabetically, and since table() puts the lower factor codes first, movTable by default has its (1,1) cell correspond to correctly predicted negative reviews, so treating that cell as the "true positive" measures performance on the negative class, not the positive class. To get precision and recall for the positive-positive prediction, you will need to reverse-index the table, e.g. movTable[2:1, 2:1].
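A sketch of these computations from movTable, indexing the positive-positive cell by name so the reverse-indexing issue does not arise:
precision <- movTable["pos", "pos"] / sum(movTable["pos", ])   # TP / (TP + FP)
recall    <- movTable["pos", "pos"] / sum(movTable[, "pos"])   # TP / (TP + FN)
F1        <- 2 * precision * recall / (precision + recall)
accuracy  <- sum(diag(movTable)) / sum(movTable)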
Extract the posterior class probabilities for the words good and great. Do the results confirm your previous finding? Clue: look at the documentation for textmodel_nb() for how to extract the posterior class probabilities.
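A possible way to do this, assuming a quanteda version whose fitted naive Bayes object stores the posterior class probability of each word in an element named PcGw (check ?textmodel_nb for the exact name in your version):
movieNb$PcGw[, c("good", "great")]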
We'll start by running the classification task using lasso regression, with the cv.glmnet() function in the glmnet package.
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-13
lasso <- cv.glmnet(x = movieDfm[1:1500,], y = trainclass[1:1500],
alpha = 1, nfolds = 5, family = "binomial")
Show the graph with the cross-validated performance of the model based on the number of features included. You should find a curvilinear pattern. Why do you think this pattern emerges?
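A sketch: plotting a cv.glmnet object shows the cross-validated deviance against the regularisation strength, with the number of non-zero features along the top axis:
plot(lasso)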
Predict the scores for the remaining 500 reviews in the test set and then compute precision and recall for the positive-positive, the F1 score, and the accuracy. Do the results improve?
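One way to sketch this, choosing the lambda.min regularisation value (object names are illustrative); the resulting confusion matrix can then be handled exactly as for the naive Bayes model above:
predicted_lasso <- predict(lasso, newx = movieDfm[1501:2000, ], type = "class", s = "lambda.min")
lassoTable <- table(predicted_lasso, docvars(moviesShuffled, "Sentiment")[1501:2000])
lassoTable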
Look at the coefficients with the highest and lowest values in the best cross-validated model. What type of features is the classifier relying on to make predictions? Do you think this is a good model?
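A sketch for inspecting the extreme coefficients (since the class labels are ordered neg, pos, positive coefficients push predictions towards pos):
beta <- coef(lasso, s = "lambda.min")[, 1]   # named coefficient vector (first element is the intercept)
head(sort(beta, decreasing = TRUE), 20)      # features most associated with pos
head(sort(beta), 20)                         # features most associated with neg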
This exercise uses amicus curiae briefs from US Supreme Court cases on affirmative action in college admissions (Evans et al. 2007). Amicus curiae are persons or organizations not party to a legal case who are permitted by the court to offer it advice in the form of an amicus brief. The amicus briefs in this corpus are from an affirmative action case in which an applicant to a university who was denied a place petitioned the Supreme Court, claiming that they were unfairly rejected because of affirmative action policies. Amicus curiae could advise the court either in support of the petitioner, therefore opposing affirmative action, or in favour of the respondent (the University), therefore supporting affirmative action.
We will use the original briefs from the Bollinger case examined by Evans et al. (2007) as the training set, and the amicus briefs as the test set.
data(data_corpus_amicus, package = "quanteda.corpora")
summary(data_corpus_amicus, 5)
## Corpus consisting of 102 documents, showing 5 documents:
##
## Text Types Tokens Sentences trainclass testclass
## sP1.txt 2384 13878 616 P <NA>
## sP2.txt 2674 15715 635 P <NA>
## sR1.txt 3336 16144 608 R <NA>
## sR2.txt 3021 14359 516 R <NA>
## sAP01.txt 1822 7795 228 <NA> AP
##
## Source: /Users/kbenoit/Dropbox/QUANTESS/quantedaData_kenlocal_gh/* on x86_64 by kbenoit
## Created: Mon Sep 15 09:00:59 2014
## Notes:
The first four texts will be our training set; their classes are already set in the docvars of the data_corpus_amicus object.
# set training class
trainclass <- docvars(data_corpus_amicus, "trainclass")
# set test class
testclass <- docvars(data_corpus_amicus, "testclass")
Construct a dfm, and then predict the test class values using the Naive Bayes classifier.
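A sketch with illustrative object names, training on the four labelled briefs and predicting the rest:
amicusDfm   <- dfm(data_corpus_amicus)
amicusNb    <- textmodel_nb(amicusDfm[!is.na(trainclass), ], trainclass[!is.na(trainclass)])
predAmicus  <- predict(amicusNb, newdata = amicusDfm[is.na(trainclass), ])
amicusTable <- table(predAmicus, testclass[is.na(trainclass)])   # rows: predicted P/R, columns: actual AP/AR
amicusTable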
Compute accuracy, and precision and recall, for both categories.
Now rerun steps 2-3 after weighting the dfm using tf-idf, and see if this improves prediction.
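A sketch of the tf-idf variant, reusing the hypothetical objects above (note that some quanteda versions warn when naive Bayes is fitted to a weighted rather than count dfm):
amicusDfmTfidf <- dfm_tfidf(amicusDfm)
amicusNbTfidf  <- textmodel_nb(amicusDfmTfidf[!is.na(trainclass), ], trainclass[!is.na(trainclass)])
table(predict(amicusNbTfidf, newdata = amicusDfmTfidf[is.na(trainclass), ]),
      testclass[is.na(trainclass)])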
You might find the following code useful for computing precision and recall:
precrecall <- function(mytable, verbose = TRUE) {
    # assumes the "true positive" is the (1,1) cell, i.e. rows are predictions
    # and columns are true labels, with the category of interest first
    truePositives  <- mytable[1, 1]
    falsePositives <- sum(mytable[1, ]) - truePositives   # remaining cells of the predicted-positive row
    falseNegatives <- sum(mytable[, 1]) - truePositives   # remaining cells of the actual-positive column
    precision <- truePositives / (truePositives + falsePositives)
    recall    <- truePositives / (truePositives + falseNegatives)
    if (verbose) {
        print(mytable)
        cat("\n precision =", round(precision, 2),
            "\n recall =", round(recall, 2), "\n")
    }
    invisible(c(precision, recall))
}
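For example, applied to the hypothetical amicusTable from the sketch above:
precrecall(amicusTable)              # precision and recall for the first category
precrecall(amicusTable[2:1, 2:1])    # reverse-indexed: precision and recall for the second category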