Monday, September 24, 2012

Computing kook density in R

Do you ever see strange lights in the sky? Do you wonder what really goes on in Area 51? Would you like to use your R hacking skills to get to the bottom of the whole UFO conspiracy? Of course, you would!

UFO data from infochimps is the focus of a data munging exercise in Chapter 1 of Machine Learning for Hackers by Drew Conway and John Myles White, two social scientists with a penchant for statistical computing.

The exercise starts with slightly messy data, proceeds through cleaning up some dates. I think I slightly improved on the code given in the book. Have a look (gist:3775873) and see if you agree.

Dividing the data up by state (for sightings in the US), I noticed something funny. My home state of Washington has a lot of UFO sightings. Normalizing by population, this becomes even more pronounced.

I learned a neat trick from the chapter. The transform function helps to compute derived fields in a data.frame. I use transform to compute UFO sightings per capita, after merging in population data by state from the 2000 census.

sightings.by.state <- transform(
 sightings.by.state,
 state=state, state.name=name,
 sightings=sightings,
 sightings.per.cap=sightings/pop)

Creating the plot above, with a pile of ggplot code, we see that Washington state really is off the deep end when it comes to UFO sightings. Our northwest neighbors in Oregon come in second. I asked a couple fellow Washington residents what they thought. The first reasonably conjectured a relationship to the number of air bases. The second Washingtonian gave the explanation I favor: "High kook density".

If you'd like to the data, it's from Chapter 1 of Machine Learning for Hackers. Data and code can be found in John Myles White's github repo.

Thursday, September 13, 2012

OO in R

The R Project

"Is there a package for obfuscating code in #rstats?", someone asked. "The S4 object system?!" came the snarky reply. If you're smiling right now, you know that it wouldn't be funny if it weren't at least a little bit true.

Options: S3, S4 or R5?

There can be little doubt that object oriented programming in R is the cause of some confusion. We'll look at S4 classes more closely in a minute, but be warned that S4 classes are just one of at least three object systems available to the R programmer:

  • S3: simple and lightweight
  • S4: formal classes implemented by the methods package
  • R5: Reference classes

It's not super clear when to use which, at least not to me. It seems to depend strongly on style and personal preference. The Bioconductor folks, for example, make heavy use of S4 classes. Google, on the other hand, advises to "avoid S4 objects and methods when possible".

Here's the way it looks to me. S3 classes feel a bit like Javascript classes - easy, loose and informal. S4 classes are rigid, verbose and harder to understand. But, they offer a better separation between interface and implementation, along with some advanced features like multiple dispatch, validation and type coercion. Reference classes (aka R5) encapsulate mutable state and look more like familiar Java-style classes. They're new and pass-by-reference can violate expectations of R users.

An S4 class example

Now, let's return to S4 classes with a simple example. First, we define a class to represent people.

# define an S4 class for people
setClass(
  "Person",
  representation(name="character", age="numeric"),
  prototype(name=NA_character_, age=NA_real_)
)

A person has a name and an age, which default to NAs of their respective types - character string and numeric. For the sake of demonstrating polymorphism, let's define a couple subclasses.

# define subclasses for different types of people
setClass("Musician",
  representation(instrument="character"),
  contains="Person")

setClass("Programmer",
  representation(language="character"),
  contains="Person")

There's no reason not to write normal R functions that take S4 classes as arguments. Polymorphism is called for when a method has different implementations for different classes. In that case, we declare a generic method.

# create a generic method called 'talent' that
# dispatches on the type of object it's applied to
setGeneric(
  "talent",
  function(object) {
    standardGeneric("talent")
  }
)

The following code implements two subtypes of person, each with a talent for something.

setMethod(
  "talent",
  signature("Programmer"),
  function(object) {
    paste("Codes in", 
      paste(object@language, collapse=", "))
  }
)

setMethod(
  "talent",
  signature("Musician"),
  function(object) {
    paste("Plays the",
      paste(object@instrument, collapse=", "))
  }
)

Now, let's make some talented people.

# create some talented people
donald <- new("Programmer",
  name="Donald Knuth",
  age=74,
  language=c("MMIX"))

coltrane <- new("Musician",
  name="John Coltrane",
  age=40,
  instrument=c("Tenor Sax", "Alto Sax"))

miles <- new("Musician",
  name="Miles Dewey Davis",
  instrument=c("Trumpet"))

monk <- new("Musician",
  name="Theloneous Sphere Monk",
  instrument=c("Piano"))

talent(miles)
[1] "Plays the Trumpet"

talent(donald)
[1] "Codes in MMIX"

talent(coltrane)
[1] "Plays the Tenor Sax, Alto Sax"

Mutability

One common stumbling block with S4 classes concerns changes in state. For instance, we might want to give our hard-working employees a raise.

setClass("Employee",
  representation(boss="Person", salary="numeric"),
  contains = "Person"
)

setGeneric(
  "raise",
  function(object, percent=0) {
    standardGeneric("raise")
  }
)

setMethod(
  "raise",
  signature("Employee"),
  function(object, percent=0) {
    object@salary <- object@salary * (1+percent)
    object
  }
)

True to it's functional heritage, R deals with immutable values. Changes in state happen by making new objects. The trick is to return the new object from the mutator methods and capture it on the way out.

smithers <- new("Employee",
  name="Waylon Smithers",
  boss=new("Person",name="Mr. Burns"),
  salary=100000)

# doesn't work?!?!
raise(smithers, percent=15)
smithers@salary
[1] 100000

Setting a new salary creates a new value. Notice that we return the modified object from the raise function. Don't forget to catch it.


# remember to reassign smithers to the new value
smithers <- raise(smithers, percent=15)
smithers@salary
[1] 115000

Multiple Inheritance

Through the magic of multiple inheritance, the lowly Code Monkey is both a programmer and an employee. Just set the contains value to indicate its two parent classes.


setClass("Code Monkey",
  contains=c("Programmer","Employee"))

setMethod(
  "talent",
  signature("Code Monkey"),
  function(object) {
    paste("Codes in",
      paste(object@language, collapse=", "),
        "for", object@boss@name)
  }
)

chris <- new("Code Monkey",
  name="Chris",
  age=29,
  boss=new("Person", name="The Man"),
  salary=2L,
  language=c("Java", "R", "Python", "Clojure"))

talent(chris)
[1] "Codes in Java, R, Python, Clojure for The Man"

So, there you have it - encapsulation, polymorphism and inheritance in S4 classes. Complete code for this example is in gist:3670578.

OO in R resources

It's lucky that there are loads of places to go to learn about S4 classes.