Exercise summary

This exercise is designed to get you working with data in R, and to increase your familiarity with some of the concepts from Day 1.
The focus will be on exploring some of the data structures in R and on implementing some of the data restructuring from the dplyr and the reshape2 packages.

Please submit this exercise by email prior to the deadline of Wednesday, Feb 25. I recommend that you use the Monday time that we would normally hold class for completing it. Because this is harder than it looks, I will not count this first exercise for a grade, but you must complete it if you are taking the course for credit. (However, if you are good with R and have used the packages in question, this will not take you very long to complete.)

  1. Working with data structures in R

    1. Execute and example the following object:

      obj1_1 <- read.table(text = "
                           a  b    c    d
                           1  2  4.3  Yes
                           3 4L  5.1   No
                           ")

      Was this what you were expecting? Why not?

    2. Modify the above command and rerun it with the header=TRUE argument, assigning the result to a new object obj2_1.

      Examine the object’s structure using str(obj2_1). Was this what you were expecting? Try correcting the input by specifying a stringsAsFactors argument to read.table.

    3. Modify the object so that:
      • b is integer
      • d is a factor

      For this you can use as.integer – but be careful that this results in the conversion that you were expecting – and factor.

    4. Did you have trouble getting b to coerce to an integer, try first removing the “L” using gsub() to replace the "L" with "". Get help on this using ?gsub.

    5. Finally, make this object into a data.frame, using data.frame. Print the output. Does it look correct?

  2. Working with the dplyr package

    For this part and the next, you should work with the file dail2002.dta from the article Kenneth Benoit and Michael Marsh. 2008. “The Campaign Value of Incumbency: A New Solution to the Puzzle of Less Effective Incumbent Spending.American Journal of Political Science 52(4, October): 874-890.

    1. Load the Stata dataset used in this paper, available here. To load this into R, you will need the read.dta command from the foreign package. (Note that you can load straight from the URL using this command.) Call this data object dail2002. What sort of object is this? How can you tell what sort of object it is?

    2. Filtering: Select only the Fianna Fail candidates using filter(), and assign the filtered data.frame to dail2002FF. Note that you might want to first find out what are the labels for party by using summary() on the party variable.

      How many FF were there in the 2002 election?

    3. Summarizing FF candidates per constituency. On the new data frame dail2002FF, summarize the median spending (spend_total) for FF candidates using the dplyr function summarise. Use “pipes” for extra credit!

      Sort and plot the 42 median spending values using an index plot.

      plot(sort(FFspend$medspend), ylab="Median constituency spending for FF")

      Median constituency spending for FF

      For extra credit, do the same using aggregate instead of dplyr.

  3. Working with the reshape2 package

    The count2 - count16 variables are currently in “wide” format. Use melt to create a candidate-count unit dataset, and then produce a table of the 42 constituencies by their maximum count.

    Hint: First rename the votes1st variable to count1, so that it will be consistent with the others. Then melt the data using reshape2, creating a new variable called count for the new value. Then filter to remove any count variable that is zero. Then group_by constituency, and summarise a count using n().

    You will probably need to consult both the package vignettes and the help pages to accomplish this. It seems complicated but it’s well worth the effort to master these reshaping and summarizing skills – this sort of manipulation and summary of the data is a core part of the activities of data mining and data analysis.