
Machine Learning for Text

In this assignment, you will use R and quanteda to understand and apply document classification and supervised scaling.

1. Classifying movie reviews, Part 1 (20 points)

We will start with a classic computer science dataset of movie reviews (Pang and Lee 2004). The movies corpus has a docvar, Sentiment, that labels each text as either pos or neg according to the star rating in the original imdb.com archived newsgroup review. We will begin by examining the conditional probabilities at the word level.

  1. Load the movies dataset and examine the attributes:
require(quanteda, warn.conflicts = FALSE, quietly = TRUE)
data(data_corpus_movies, package = "quanteda.corpora")
summary(data_corpus_movies, 10)
## Warning in nsentence.character(object, ...): nsentence() does not correctly
## count sentences in all lower-cased text
## Corpus consisting of 2000 documents, showing 10 documents:
## 
##             Text Types Tokens Sentences Sentiment   id1   id2
##  neg_cv000_29416   354    841         9       neg cv000 29416
##  neg_cv001_19502   156    278         1       neg cv001 19502
##  neg_cv002_17424   276    553         3       neg cv002 17424
##  neg_cv003_12683   314    564         2       neg cv003 12683
##  neg_cv004_12641   380    842         2       neg cv004 12641
##  neg_cv005_29357   328    749         1       neg cv005 29357
##  neg_cv006_17022   331    643         5       neg cv006 17022
##   neg_cv007_4992   325    676         6       neg cv007  4992
##  neg_cv008_29326   441    797        10       neg cv008 29326
##  neg_cv009_29417   401    965        23       neg cv009 29417
## 
## Source: /Users/kbenoit/Dropbox/QUANTESS/quantedaData_kenlocal_gh/* on x86_64 by kbenoit
## Created: Sat Nov 15 18:43:25 2014
## Notes:
  2. What is the overall probability of the class pos in the corpus? Are the classes balanced? (Hint: use table() on the Sentiment docvar.)

  3. Make a dfm from the corpus, grouping the documents by the Sentiment docvar.

Words with very low overall frequencies in a corpus of this size are unlikely to be good general predictors. Remove words that occur fewer than twenty times using dfm_trim().

  4. Calculate the word-level likelihoods for each class from the reduced dfm; this is the probability of a word given the class pos or neg. What are the word likelihoods for "good" and "great"? What do you learn? Use kwic() to find out the context of "good".

Clue: you don't have to compute the probabilities by hand; you should be able to obtain them using dfm_weight().
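A minimal sketch of one way to approach questions 2-4 follows; it assumes a quanteda version in which dfm_weight() takes a scheme argument (older releases used type = "relfreq" instead of scheme = "prop"):

# class balance of the Sentiment docvar
table(docvars(data_corpus_movies, "Sentiment"))
prop.table(table(docvars(data_corpus_movies, "Sentiment")))

# dfm grouped by class, with words occurring fewer than 20 times removed
moviesDfm <- dfm(data_corpus_movies, groups = "Sentiment")
moviesDfm <- dfm_trim(moviesDfm, min_termfreq = 20)

# word likelihoods P(word | class): each row sums to 1 after weighting
wordLik <- dfm_weight(moviesDfm, scheme = "prop")  # or type = "relfreq" in older quanteda
wordLik[, c("good", "great")]

# context of "good" in the original corpus
head(kwic(data_corpus_movies, "good", window = 5))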

2. Classifying movie reviews, Part 2 (30 points)

Now we will use quanteda's Naive Bayes classifier, textmodel_nb(), to run a prediction on the movie reviews.

  1. The movie corpus contains 1000 positive examples followed by 1000 negative examples. When extracting training and testing labels, we want a mix of positive and negative in each set, so first we need to shuffle the corpus. You can do this with the corpus_sample() function:
set.seed(1234)  # use this just before the command below
moviesShuffled <- corpus_sample(data_corpus_movies, size = 2000)

Next, make a dfm from the shuffled corpus, and make training labels. In this case, we are using 1500 training labels, and leaving the remaining 500 unlabelled to use as a test set. Trim the dataset to remove rare features.

movieDfm <- dfm_trim(dfm(moviesShuffled, verbose = FALSE), min_termfreq = 10)
trainclass <- factor(c(docvars(moviesShuffled, "Sentiment")[1:1500], rep(NA, 500)))
table(trainclass, useNA = "ifany")
## trainclass
##  neg  pos <NA> 
##  737  763  500
  2. (7.5 points) Now, run the training and testing commands of the Naive Bayes classifier, and compare the predictions for the test-set documents with their actual labels using a confusion matrix.

  3. Compute the following statistics for the last classification:

  1. precision and recall, for the positive-positive prediction;

Hint: precision and recall differ depending on which class we treat as the "true positive": predicting positive for a truly positive review, or predicting negative for a truly negative one. Since the levels of Sentiment are ordered alphabetically, and since table() puts the lower factor codes first, the confusion matrix movTable by default has its (1,1) cell corresponding to correctly predicted negative reviews, not positive ones. To get the positive-positive prediction you will need to reverse-index it, e.g. movTable[2:1, 2:1].

  2. \(F_1\) from the above; and

  3. accuracy.

  4. Extract the posterior class probabilities of the words "good" and "great". Do the results confirm your previous finding? Clue: look at the documentation for textmodel_nb() for how to extract the posterior class probabilities.
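A minimal sketch of these steps; it assumes a quanteda version in which predict() on a fitted textmodel_nb returns the predicted classes and the fitted object stores the posterior class probabilities in an element named PcGw (both details vary across quanteda versions, so check ?textmodel_nb):

nbTrained <- textmodel_nb(movieDfm[1:1500, ], trainclass[1:1500])
predicted <- predict(nbTrained, newdata = movieDfm[1501:2000, ])
actual    <- docvars(moviesShuffled, "Sentiment")[1501:2000]
movTable  <- table(predicted, actual)   # confusion matrix
movTable

# positive-positive precision and recall, F1, and accuracy
precision <- movTable["pos", "pos"] / sum(movTable["pos", ])
recall    <- movTable["pos", "pos"] / sum(movTable[, "pos"])
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- sum(diag(movTable)) / sum(movTable)

# posterior class probabilities P(class | word) for selected words
nbTrained$PcGw[, c("good", "great")]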

3. Classifying movie reviews, Part 3 (20 points)

We'll start by running the classification task as a lasso regression, using the cv.glmnet() function in the glmnet package.

library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-13
lasso <- cv.glmnet(x = movieDfm[1:1500,], y = trainclass[1:1500], 
                   alpha = 1, nfolds = 5, family = "binomial")
  1. Show the graph with the cross-validated performance of the model based on the number of features included. You should find a curvilinear pattern. Why do you think this pattern emerges?

  2. Predict the scores for the remaining 500 reviews in the test set and then compute precision and recall for the positive-positive, the F1 score, and the accuracy. Do the results improve?

  3. Look at the coefficients with the highest and lowest values in the best cross-validated model. What type of features is the classifier relying on to make predictions? Do you think this is a good model?
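One possible sketch of these steps, reusing the lasso, movieDfm, trainclass, and moviesShuffled objects from above (note that pos is the second level of the outcome factor, so positive coefficients push predictions towards the pos class):

plot(lasso)   # cross-validated deviance against log(lambda)

# predict classes for the 500 held-out reviews at the best lambda
predLasso <- predict(lasso, newx = movieDfm[1501:2000, ],
                     type = "class", s = "lambda.min")
actual <- docvars(moviesShuffled, "Sentiment")[1501:2000]
lassoTable <- table(predLasso, actual)
lassoTable

# coefficients at the best lambda, sorted
bestCoefs <- as.matrix(coef(lasso, s = "lambda.min"))
head(sort(bestCoefs[, 1], decreasing = TRUE), 20)   # strongest predictors of pos
head(sort(bestCoefs[, 1]), 20)                      # strongest predictors of neg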

4. Classifying amicus briefs using Naive Bayes (30 points)

This exercise uses amicus curiae briefs from US Supreme Court cases on affirmative action in college admissions (Evans et al. 2007). Amicus curiae are persons or organizations not party to a legal case who are permitted by the court to offer it advice in the form of an amicus brief. The amicus briefs in this corpus are from an affirmative action case in which an applicant who was denied a place at a university petitioned the Supreme Court, claiming to have been unfairly rejected because of affirmative action policies. Amicus curiae could advise the court either in support of the petitioner, therefore opposing affirmative action, or in favour of the respondent (the university), therefore supporting affirmative action.

We will use the original briefs from the Bollinger case examined by Evans et al. (2007) as the training set, and the amicus briefs as the test set.

data(data_corpus_amicus, package = "quanteda.corpora")
summary(data_corpus_amicus, 5)
## Corpus consisting of 102 documents, showing 5 documents:
## 
##       Text Types Tokens Sentences trainclass testclass
##    sP1.txt  2384  13878       616          P      <NA>
##    sP2.txt  2674  15715       635          P      <NA>
##    sR1.txt  3336  16144       608          R      <NA>
##    sR2.txt  3021  14359       516          R      <NA>
##  sAP01.txt  1822   7795       228       <NA>        AP
## 
## Source: /Users/kbenoit/Dropbox/QUANTESS/quantedaData_kenlocal_gh/* on x86_64 by kbenoit
## Created: Mon Sep 15 09:00:59 2014
## Notes:

The first four texts will be our training set; these classes are already set in the docvars of the data_corpus_amicus object.

# set training class
trainclass <- docvars(data_corpus_amicus, "trainclass")
# set test class
testclass  <- docvars(data_corpus_amicus, "testclass")
  1. Construct a dfm, and then predict the test class values using the Naive Bayes classifier.

  2. Compute accuracy, and precision and recall, for both categories.

  3. Now rerun the previous two steps after weighting the dfm using tf-idf, and see if this improves prediction.
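A minimal sketch of these steps, assuming the same predict() behaviour for textmodel_nb() as in Part 2; table() silently drops the training documents, whose testclass is NA:

amicusDfm <- dfm(data_corpus_amicus)

# Naive Bayes trained on the four labelled briefs, predicted for all documents
nbAmicus <- textmodel_nb(amicusDfm, trainclass)
predAmicus <- predict(nbAmicus, newdata = amicusDfm)
table(predAmicus, testclass)

# repeat with tf-idf weighting (dfm_weight(x, type = "tfidf") in older quanteda)
amicusTfidf <- dfm_tfidf(amicusDfm)
nbTfidf <- textmodel_nb(amicusTfidf, trainclass)
table(predict(nbTfidf, newdata = amicusTfidf), testclass)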

Hints

You might find the following code useful for computing precision and recall:

# compute precision and recall from a confusion matrix, treating the (1,1)
# cell as the true positives (reverse-index the matrix first if necessary)
precrecall <- function(mytable, verbose = TRUE) {
    truePositives <- mytable[1,1]
    falsePositives <- sum(mytable[1,]) - truePositives
    falseNegatives <- sum(mytable[,1]) - truePositives
    precision <- truePositives / (truePositives + falsePositives)
    recall <- truePositives / (truePositives + falseNegatives)
    if (verbose) {
        print(mytable)
        cat("\n precision =", round(precision, 2), 
            "\n    recall =", round(recall, 2), "\n")
    }
    invisible(c(precision, recall))
}
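For example, applied to the reverse-indexed confusion matrix from Part 2 (movTable as constructed there), it returns precision and recall for the pos class:

precrecall(movTable[2:1, 2:1])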