1. Using regular expressions (5 pts)

    Regular expressions are an essential tool in text processing: they provide a symbolic language for searching and matching text strings. For the dictionary and thesaurus features, we can define equivalency classes in terms of regular expressions. There is an excellent tutorial on regular expressions at http://www.regular-expressions.info.

    This provides an easy way to recover syntactic variations on specific words, without relying on a stemmer. For instance, we could query a regular expression on tax-related words, using:

    library("quanteda")
    kwic(data_corpus_inaugural, "tax", valuetype = "regex")

    What is the difference between the result of that command and the result of kwic(data_corpus_inaugural, "tax")?

    What if we wanted to construct a regular expression to query only “valued” and “values”, but not other variations of the lemma “value”? Could we construct a “glob” pattern to match the same two words?
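
    As a starting point (a sketch only; whether these patterns fully answer the question is left to you), the regular expression can be anchored so that it matches whole tokens:

    # anchored regex matching only the tokens "valued" and "values"
    kwic(data_corpus_inaugural, "^value[ds]$", valuetype = "regex")
    # a candidate glob pattern: "?" matches any single character,
    # so this is not restricted to the same two words
    kwic(data_corpus_inaugural, "value?", valuetype = "glob")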

  2. Descriptive statistics (5 pts)

    summary.corpus() is the corpus method for summary(); it returns summary statistics that can be saved to an object. Save the result of calling this method on data_corpus_irishbudget2010 and use it to compute the type-token ratio (TTR) for each document.
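
    A minimal sketch of the approach (assuming data_corpus_irishbudget2010 is available, e.g. from the quanteda.textmodels package in recent quanteda versions):

    library("quanteda")
    # data(data_corpus_irishbudget2010, package = "quanteda.textmodels")  # if needed (assumption)
    # summary() on a corpus returns a data frame with Types and Tokens per document
    budgetSummary <- summary(data_corpus_irishbudget2010)
    ttr <- budgetSummary$Types / budgetSummary$Tokens
    ttr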

  3. Readability (10 pts)

    Compare the readability of US presidents' State of the Union addresses, grouped by party. You can do this by calling textstat_readability() on a character vector created by grouping the texts by party, using

    data(data_corpus_sotu, package = "quanteda.corpora")
    partyTexts <- texts(data_corpus_sotu, groups = "party")
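
    From there, a sketch of the comparison (assuming textstat_readability() is available from quanteda.textstats in quanteda v3+, or from quanteda itself in earlier versions):

    library("quanteda.textstats")
    # one readability measure per party's combined texts
    textstat_readability(partyTexts, measure = "Flesch")
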
  4. Lexical Diversity over Time (15 pts)

    1. We can plot the type-token ratio of the Irish budget speeches over time. To do this, begin by extracting a subset of data_corpus_irishbudgets that contains only the first speaker from each year:

      require(quanteda, warn.conflicts = FALSE, quietly = TRUE)
      data(data_corpus_irishbudgets, package = "quanteda.corpora")
      finMins <- corpus_subset(data_corpus_irishbudgets, number == "01")
      tokeninfo <- summary(finMins)

      Note the quotation marks around the value condition for the number document variable. Why are these required here?

    2. Get the type-token ratio for each text from this subset, and plot the resulting vector of TTRs as a function of the year.
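
      One possible sketch (assuming the summary object from above and a numeric year document variable):

      # TTR per document, computed from the summary data frame
      ttr <- tokeninfo$Types / tokeninfo$Tokens
      # plot TTR against the year docvar (assumed numeric)
      plot(tokeninfo$year, ttr, type = "b", xlab = "Year", ylab = "Type-token ratio")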

    3. Now compare the results from the textstat_lexdiv() function applied to a dfm constructed from the same documents. Are the results the same?
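
      A sketch of the comparison (assuming quanteda v3+, where textstat_lexdiv() lives in the quanteda.textstats package):

      library("quanteda.textstats")
      finDfm <- dfm(tokens(finMins))
      textstat_lexdiv(finDfm, measure = "TTR")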

  5. Weighting strategies

    Consider the following matrix:

    m <- matrix(c(0, 1, 3, 0, 1, 0, 5, 0, 2, 0, 6, 4), nrow = 3,
                dimnames = list(docs = paste0("doc", 1:3),
                                features = LETTERS[1:4]))
    m
    ##       features
    ## docs   A B C D
    ##   doc1 0 0 5 0
    ##   doc2 1 1 0 6
    ##   doc3 3 0 2 4
    1. Compute the following by “manual” calculation (a base-R sketch follows this list):
    • relative term frequency (within document)
    • the document frequency of each feature
    • the tf-idf using a base 10 logarithm and unnormalized term frequencies.
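
    A base-R sketch of these quantities (verify against your hand calculations):

    rel_tf <- m / rowSums(m)            # relative term frequency within each document
    df <- colSums(m > 0)                # document frequency of each feature
    idf <- log10(nrow(m) / df)          # inverse document frequency, base-10 log
    tfidf <- sweep(m, 2, idf, "*")      # tf-idf with unnormalized term frequencies
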
    2. Coerce the object to a dfm format, and use dfm_weight() to verify your calculations.
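
    A verification sketch (assuming quanteda v3+, where document-frequency weighting is handled by docfreq() and dfm_tfidf() alongside dfm_weight()):

    library("quanteda")
    d <- as.dfm(m)
    dfm_weight(d, scheme = "prop")                # relative term frequency
    docfreq(d)                                    # document frequency of each feature
    dfm_tfidf(d, scheme_tf = "count", base = 10)  # tf-idf, unnormalized tf, base-10 log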