Exercise summary

This exercise is designed to get you working with data in R, and to increase your familiarity with some of the concepts from Day 1.
The focus will be on exploring some of the data structures in R and on implementing some of the data restructuring from the dplyr and the reshape2 packages.

Please submit this exercise by email prior to the deadline of Wednesday, Feb 25. I recommend that you use the Monday time that we would normally hold class for completing it. Because this is harder than it looks, I will not count this first exercise for a grade, but you must complete it if you are taking the course for credit. (However, if you are good with R and have used the packages in question, this will not take you very long to complete.)

  1. Working with data structures in R

    1. Execute the following command and examine the resulting object:

      obj1_1 <- read.table(text = "
                           a  b    c    d 
                           1  2  4.3  Yes
                           3 4L  5.1   No
                           ")

      Was this what you were expecting? Why not?

      Probably not, since a - d were read in as values (the first row of data) rather than as variable names.
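
To see what happened, it can help to inspect the names and dimensions of the result. The sketch below repeats the call on the same input (the object name no_header is only illustrative) and suggests that, without header=TRUE, the letters a to d end up as the first data row while the columns receive default names.

```r
# Same input as above, read without declaring a header row
no_header <- read.table(text = "
  a  b    c    d
  1  2  4.3  Yes
  3 4L  5.1   No
")

names(no_header)   # default names V1 ... V4, not a ... d
nrow(no_header)    # 3 rows: the intended header became row 1
str(no_header)     # every column is text (character or factor, depending on R version)
```

Because the first row contains letters, every column has to be stored as text, which is why even the numeric-looking columns are not numeric here.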

    2. Modify the above command and rerun it with the header=TRUE argument, assigning the result to a new object obj2_1. Examine the object’s structure using str(obj2_1). Was this what you were expecting? Try correcting the input by specifying a stringsAsFactors argument to read.table.

      obj2_1 <- read.table(text = "
                           a  b    c    d 
                           1  2  4.3  Yes
                           3 4L  5.1   No
                           ", header=TRUE)

      **stringsAsFactors=FALSE reads in the non-numeric data as type character rather than creating factors from them.**
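
A side-by-side comparison may make the effect of the argument clearer. This is a minimal sketch on the same input; the object names txt, with_factors and as_strings are illustrative. Note that the default for stringsAsFactors changed from TRUE to FALSE in R 4.0.0, so setting it explicitly avoids version-dependent results.

```r
txt <- "
  a  b    c    d
  1  2  4.3  Yes
  3 4L  5.1   No
"

# Non-numeric columns converted to factors
with_factors <- read.table(text = txt, header = TRUE, stringsAsFactors = TRUE)

# Non-numeric columns kept as character strings
as_strings <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)

sapply(with_factors, class)  # b and d are "factor"
sapply(as_strings, class)    # b and d are "character"
```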

    3. Modify the object so that:
      • b is integer
      • d is a factor

      For this you can use as.integer and factor, but check that as.integer actually produces the conversion you were expecting.

      obj3_1 <- read.table(text = "
                           a  b    c    d 
                           1  2  4.3  Yes
                           3 4L  5.1   No
                           ", header=TRUE, stringsAsFactors=FALSE)
      obj3_1$b <- as.integer(obj3_1$b)
      ## Warning: NAs introduced by coercion
      obj3_1$d <- factor(obj3_1$d)
      str(obj3_1)
      ## 'data.frame':    2 obs. of  4 variables:
      ##  $ a: int  1 3
      ##  $ b: int  2 NA
      ##  $ c: num  4.3 5.1
      ##  $ d: Factor w/ 2 levels "No","Yes": 2 1
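
One reason to be careful with as.integer is that it behaves differently on factors than on character vectors. The sketch below uses a small made-up factor (the values are hypothetical, not from the exercise data) to illustrate the difference.

```r
# A factor whose labels look like numbers (hypothetical values)
f <- factor(c("10", "4L", "2"))

# as.integer() on a factor returns the internal level codes, not the labels
codes <- as.integer(f)

# Going through character recovers the numbers; "4L" cannot be parsed and becomes NA
values <- suppressWarnings(as.integer(as.character(f)))

print(codes)
print(values)
```

If b had been read in as a factor, as.integer(obj$b) would therefore return level codes without any warning, which is arguably worse than the NA seen above.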

    4. If you had trouble getting b to coerce to an integer, try first removing the “L” using gsub() to replace the "L" with "". Get help on this using ?gsub.

      obj4_1 <- read.table(text = "
                           a  b    c    d 
                           1  2  4.3  Yes
                           3 4L  5.1   No
                           ", header=TRUE, stringsAsFactors=FALSE)
      tmp <- gsub("L", "", obj4_1$b)
      obj4_1$b <- as.integer(tmp)
      str(obj4_1)
      ## 'data.frame':    2 obs. of  4 variables:
      ##  $ a: int  1 3
      ##  $ b: int  2 4
      ##  $ c: num  4.3 5.1
      ##  $ d: chr  "Yes" "No"
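
Two small variations on the gsub step may be worth knowing. This is only a sketch: b_raw is a hypothetical stand-in for the b column as it was read in.

```r
b_raw <- c("2", "4L")   # stand-in for the values of column b

# fixed = TRUE treats "L" as a literal string rather than a regular expression
b_int <- as.integer(gsub("L", "", b_raw, fixed = TRUE))

# sub() with an anchored pattern removes only a trailing "L"
b_int2 <- as.integer(sub("L$", "", b_raw))

b_int
b_int2
```

The anchored version is slightly more conservative, since it leaves an "L" anywhere else in the string untouched.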

    5. Finally, make this object into a data.frame, using data.frame. Print the output. Does it look correct?

      obj5_1 <- data.frame(obj4_1)
      str(obj5_1)
      ## 'data.frame':    2 obs. of  4 variables:
      ##  $ a: int  1 3
      ##  $ b: int  2 4
      ##  $ c: num  4.3 5.1
      ##  $ d: chr  "Yes" "No"

      Actually, it was already a data.frame.
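
A quick way to confirm what kind of object you have is to ask R directly. The sketch below uses a small illustrative object named df; the same checks apply to any object returned by read.table.

```r
df <- read.table(text = "
  a  b
  1  2
  3  4
", header = TRUE)

class(df)           # "data.frame"
is.data.frame(df)   # TRUE
is.list(df)         # TRUE: a data.frame is a list of equal-length columns
```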

  2. Working with the dplyr package

    For this part and the next, you should work with the file dail2002.dta from the article Kenneth Benoit and Michael Marsh. 2008. “The Campaign Value of Incumbency: A New Solution to the Puzzle of Less Effective Incumbent Spending.” American Journal of Political Science 52(4, October): 874-890.

    1. Load the Stata dataset used in this paper, available at http://www.kenbenoit.net/files/dail2002.dta. To load this into R, you will need the read.dta command from the foreign package. (Note that you can load straight from the URL using this command.) Call this data object dail2002. What sort of object is this? How can you tell what sort of object it is?

      require(foreign)
      ## Loading required package: foreign
      dail2002 <- read.dta("http://www.kenbenoit.net/files/dail2002.dta")

    2. Filtering: Select only the Fianna Fail candidates using filter(), and assign the filtered data.frame to dail2002FF. Note that you might first want to find out what the labels for party are by using summary() on the party variable.

      require(dplyr)
      ## Loading required package: dplyr
      ## 
      ## Attaching package: 'dplyr'
      ## 
      ## The following object is masked from 'package:stats':
      ## 
      ##     filter
      ## 
      ## The following objects are masked from 'package:base':
      ## 
      ##     intersect, setdiff, setequal, union
      dail2002FF <- filter(dail2002, party=="ff")
      summary(dail2002FF$party)
      ## csp  ff  fg  gp ind lab  pd  sf  sp swp  wp 
      ##   0 106   0   0   0   0   0   0   0   0   0

      How many FF candidates were there in the 2002 election? **106**
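
The zeros in the summary above appear because filter() keeps all the factor levels, even those with no remaining rows. The following sketch uses a hypothetical toy data frame (not the real dail2002 data) to show the filtering step, a direct row count, and how droplevels() removes the empty levels.

```r
library(dplyr)

# Hypothetical toy data standing in for dail2002
toy <- data.frame(
  name  = c("A", "B", "C", "D"),
  party = factor(c("ff", "fg", "ff", "lab"))
)

toy_ff <- dplyr::filter(toy, party == "ff")

nrow(toy_ff)                      # number of matching rows
summary(toy_ff$party)             # unused levels still listed with count 0
table(droplevels(toy_ff$party))   # empty levels dropped before tabulating
```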

    3. Summarizing FF candidates per constituency. On the new data frame dail2002FF, summarize the median spending (spend_total) for FF candidates using the dplyr function summarise. Use “pipes” for extra credit!

      FFspend <- select(dail2002FF, spend_total, constituency) %>%
                    group_by(constituency) %>% 
                        summarise(medspend = median(spend_total))
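
To check your understanding of the grouped summary, it may help to run the same pipeline on data small enough to verify by hand. The data frame below is hypothetical, with made-up spending figures.

```r
library(dplyr)

# Hypothetical toy data: two constituencies, made-up spending figures
toy <- data.frame(
  constituency = c("X", "X", "X", "Y", "Y"),
  spend_total  = c(10, 20, 60, 5, 15)
)

# One row per constituency, holding the median of spend_total in that group
toy_med <- toy %>%
  group_by(constituency) %>%
  summarise(medspend = median(spend_total))

toy_med
```

Here the median for X is the middle value of 10, 20, 60, and the median for Y is the average of 5 and 15.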

      Sort and plot the 42 median spending values using an index plot.

      plot(sort(FFspend$medspend), ylab="Median constituency spending for FF")

      For extra credit, do the same using aggregate instead of dplyr.

      FFspend2 <- aggregate(dail2002FF$spend_total, 
                            list(constituency=dail2002FF$constituency), 
                            median)
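
aggregate also has a formula interface, which some find easier to read and which keeps the original variable name in the output rather than calling it x. A minimal sketch on hypothetical toy data, with tapply shown as another base R route to the same medians:

```r
# Hypothetical toy data: two constituencies, made-up spending figures
toy <- data.frame(
  constituency = c("X", "X", "X", "Y", "Y"),
  spend_total  = c(10, 20, 60, 5, 15)
)

# Formula interface: summarise spend_total within each constituency
toy_agg <- aggregate(spend_total ~ constituency, data = toy, FUN = median)

# tapply() returns the same medians as a named vector
toy_tap <- tapply(toy$spend_total, toy$constituency, median)

toy_agg
toy_tap
```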

  3. Working with the reshape2 package

    The count2 - count16 variables are currently in “wide” format. Use melt to create a candidate-count unit dataset, and then produce a table of the 42 constituencies by their maximum count.

    Hint: First rename the votes1st variable to count1, so that it will be consistent with the others. Then melt the data using reshape2, creating a new variable called count for the new value. Then filter to remove any rows where the vote total is zero. Then group_by constituency, and summarise a count using n().

    You will probably need to consult both the package vignettes and the help pages to accomplish this. It seems complicated but it’s well worth the effort to master these reshaping and summarizing skills – this sort of manipulation and summary of the data is a core part of the activities of data mining and data analysis.
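
Before melting the full dataset, it may help to see what melt does on a tiny example. The wide data frame below is hypothetical (two candidates, two counts); the argument names id.vars, variable.name and value.name are those of reshape2's melt.

```r
library(reshape2)

# Hypothetical wide data: one row per candidate, one column per count
wide <- data.frame(
  wholename = c("A", "B"),
  count1    = c(100, 80),
  count2    = c(120, 0)
)

# Long format: one row per candidate-count combination
long <- melt(wide,
             id.vars       = "wholename",
             variable.name = "count",
             value.name    = "votes")

long
```

Each of the two candidates now contributes one row per count column, so the result has four rows.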

    library(reshape2)
    # rename votes1st
    names(dail2002)[which(names(dail2002)=="votes1st")] <- "count1"
    dail2002melted <- melt(select(dail2002, wholename, district, count1, count2:count16, m), 
                           id.vars = c("wholename", "district", "m"), 
                           variable.name= "count", 
                           value.name = "votes")
    # strip off the number after "count" in the count variable
    dail2002melted$ncount <- as.numeric(gsub("count", "", as.character(dail2002melted$count)))
    dail2002maxcount <- filter(dail2002melted, votes>0) %>%
                            group_by(district, m) %>% 
                                summarise(maxcount = max(ncount))
    # clear relationship between constituency size and number of counts
    with(dail2002maxcount, table(m, maxcount))
    ##    maxcount
    ## m   3 4 5 6 7 8 9 10 11 12 16
    ##   3 3 3 1 5 2 1 0  1  0  0  0
    ##   4 0 1 2 3 3 2 0  1  0  0  0
    ##   5 0 0 0 1 1 2 3  2  3  1  1
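
The filter, group and summarise steps can be checked on a small, already-long data frame. This is a sketch on hypothetical data: district P has positive votes through count 3, while district Q has none after count 2.

```r
library(dplyr)

# Hypothetical long data: P needed 3 counts, Q needed 2
long <- data.frame(
  district = c("P", "P", "P", "Q", "Q", "Q"),
  count    = c("count1", "count2", "count3", "count1", "count2", "count3"),
  votes    = c(50, 60, 70, 40, 45, 0)
)

# Turn "count3" into the number 3
long$ncount <- as.numeric(gsub("count", "", long$count))

# Keep rows with votes, then take the highest count number per district
toy_max <- long %>%
  filter(votes > 0) %>%
  group_by(district) %>%
  summarise(maxcount = max(ncount))

toy_max
```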