Sunday, December 27, 2009

SQL group by in R

The R statistical computing environment is awesome, but weird. How to do database operations in R is a common source of questions. The other day I was looking for an equivalent to SQL group by for R data frames. You need this to compute summary statistics on subsets of a data set divided up by some categorical variable. It turns out there are several ways to get the same effect, some more limited than others.

The best answer seems to be plyr. It automates the split-apply-combine strategy for data analysis that you'd otherwise code by hand. The ddply function splits a data frame into subset data frames, performs some function on each subset, and returns the results as a recombined data frame.
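As a minimal sketch (using a made-up toy data frame; plyr installed from CRAN), ddply does the equivalent of a SQL group by:

```r
library(plyr)

# toy data: measurements tagged with a categorical variable
df <- data.frame(group=c("a", "a", "b", "b", "b"),
                 value=c(1, 2, 3, 4, 5))

# like: select group, avg(value), count(*) from df group by group
ddply(df, .(group), summarise, mean.value=mean(value), n=length(value))
#   group mean.value n
# 1     a        1.5 2
# 2     b        4.0 3
```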

Here are a few links: A Fast Intro to Plyr for R, Block-processing a data frame with plyr and Split, apply, and combine in R using PLYR.

This paper is worth reading. It introduces the library and also gives you a nice framework (split-apply-combine) for thinking about a whole class of data-munging problems. A coworker (thanks, Gustavo) pointed out that this is a lot like Google's MapReduce.

Some commands that get you part of the way there are: split, by, tapply (nicely explained here), aggregate. The R wiki has an entry on Performing calculations within sub-sets of a data-frame that uses the reshape library. You could always use sqldf or RSQLite. Several options are discussed here. You can cobble up a fully general process using split, some form of sapply, and unsplit. But, that's what plyr does automatically.
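For comparison, here's a sketch of the base-R routes mentioned above on the same kind of toy data frame (names are made up for illustration):

```r
df <- data.frame(group=c("a", "a", "b", "b", "b"),
                 value=c(1, 2, 3, 4, 5))

# tapply: apply a function over subsets of a vector, split by a factor
tapply(df$value, df$group, mean)
#   a   b
# 1.5 4.0

# aggregate: same idea, but returns a data frame
aggregate(df$value, by=list(group=df$group), FUN=mean)

# the manual split-apply-combine that plyr automates
means <- sapply(split(df$value, df$group), mean)
```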

Side notes: While fooling around with this, I noticed that, for some crazy reason, by splits matrices into nice subunits, but split has the ugly side-effect of reducing matrices to vectors. Also, Google has a style guide for R.
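A quick demonstration of that side note, using a toy matrix:

```r
m <- matrix(1:6, nrow=3)
f <- factor(c("x", "x", "y"))

# split flattens the matrix to a plain vector before grouping
# (the factor is recycled element-wise, so matrix structure is lost)
split(m, f)

# by keeps each group of rows together as a rectangular subunit
by(m, f, identity)
```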


Sunday, December 20, 2009

Layered architecture for a Java application

This is a template for an application architecture that came out of my genome browser project. This Java/Swing desktop application aspires to be very flexible - implementing new visualizations and accommodating task-specific components when necessary. I'm not sure how well I did at this goal, so don't take this as here's-what-you-should-do, just here's-what-one-poor-schmuck-did.

The Application object holds application scope state, and binds components together, performing dependency injection where necessary. The entry point is myapp.Main, which configures Options using command-line arguments and starts the application. The application manages a concurrent event dispatch queue.

Components communicate with each other by events placed on the event queue and have access to an application API. As in Swing, component code may run in the event dispatch thread or, if it is long-running, spin up its own thread. Components are responsible for managing their own threading and synchronization.

Layering, in this template, works something like this:

+------------------------------------+
|                 UI                 |
+-----------------+ +----------------+
|   Application   | |   components   |
+-----------------+ +----------------+
|              services              |
+------------+ +----------+
|   domain   | |   util   |
+------------+ +----------+

Higher layers can depend on lower layers, while dependencies may not flow up. Utilities and the domain model have no internal dependencies, and anything may depend on them. Services are things like file I/O, persistence, and maybe algorithms. The application exposes an API to components and, in turn, implements that API in terms of components. This circular dependency doesn't bother me too much. If it did, the Application could have no dependencies on the components, and the components could have some organized dependency structure among themselves. The UI sits on top, where nothing should depend on it.

Thursday, December 17, 2009

Joining data frames in R

Want to join two R data frames on a common key? Here's one way to do a SQL database style join operation in R.

We start with a data frame describing probes on a microarray. The key is the probe_id and the rest of the information describes the location on the genome targeted by that probe.

> head(probes)
          probe_id sequence strand   start     end
1 mm_ex_fwd_000541      Chr      + 1192448 1192507
2 mm_ex_fwd_000542      Chr      + 1192453 1192512
3 mm_ex_fwd_000543      Chr      + 1192458 1192517
4 mm_ex_fwd_000544      Chr      + 1192463 1192522
5 mm_ex_fwd_000545      Chr      + 1192468 1192527
6 mm_ex_fwd_000546      Chr      + 1192473 1192532

> dim(probes)
[1] 241019      5

We also have a bunch of measurements in a numeric vector. For each probe (well, minus a few probes missing due to bad data) we have a value.

> head(value)
mm_fwd_000002 mm_fwd_000003 mm_fwd_000004 mm_fwd_000005 mm_fwd_000006 mm_fwd_000007 
   0.05294899    0.11979251    0.28160017    0.57284569    0.74402510    0.78644199 

> length(value)
[1] 241007

Let's join up these tables, er, data frame and vector. We'll use the match function. match returns a vector of the positions of the (first) matches of its first argument in its second (or NA if there is no match). So, we're matching our values into our probes.

> joined = cbind(probes[match(names(value), probes$probe_id),], value)

> dim(joined)
[1] 241007      6

> head(joined)
          probe_id sequence strand start end         value
3695 mm_fwd_000002      Chr      +    15  74 0.05294899
3696 mm_fwd_000003      Chr      +    29  88 0.11979251
3697 mm_fwd_000004      Chr      +    43 102 0.28160017
3698 mm_fwd_000005      Chr      +    57 116 0.57284569
3699 mm_fwd_000006      Chr      +    71 130 0.74402510
3700 mm_fwd_000007      Chr      +    85 144 0.78644199

Merge is probably more similar to a database join.

Inner join:   merge(df1, df2, by="common_key_column")
Outer join:   merge(df1, df2, by="common_key_column", all=TRUE)
Left outer:   merge(df1, df2, by="common_key_column", all.x=TRUE)
Right outer:  merge(df1, df2, by="common_key_column", all.y=TRUE)
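A toy example of those variations (made-up frames with a partially overlapping key):

```r
df1 <- data.frame(key=c("a", "b", "c"), x=1:3)
df2 <- data.frame(key=c("b", "c", "d"), y=4:6)

merge(df1, df2, by="key")              # inner: b, c
merge(df1, df2, by="key", all=TRUE)    # outer: a, b, c, d (NAs fill the gaps)
merge(df1, df2, by="key", all.x=TRUE)  # left outer: a, b, c
merge(df1, df2, by="key", all.y=TRUE)  # right outer: b, c, d
```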

If we have two data frames, we can use merge. Let's convert our vector value to a data frame and merge, getting the same result (in a different sort order).

> value.df = data.frame(probe_id=names(value), value=value)

> head(value.df)
                   probe_id      value
mm_fwd_000002 mm_fwd_000002 0.05294899
mm_fwd_000003 mm_fwd_000003 0.11979251
mm_fwd_000004 mm_fwd_000004 0.28160017
mm_fwd_000005 mm_fwd_000005 0.57284569
mm_fwd_000006 mm_fwd_000006 0.74402510
mm_fwd_000007 mm_fwd_000007 0.78644199

> m = merge(probes, value.df, by="probe_id")

> dim(m)
[1] 241007      6

> head(m)
          probe_id sequence strand   start     end     value
1 mm_ex_fwd_000541      Chr      + 1192448 1192507 0.1354668
2 mm_ex_fwd_000542      Chr      + 1192453 1192512 0.1942794
3 mm_ex_fwd_000543      Chr      + 1192458 1192517 0.1924457
4 mm_ex_fwd_000544      Chr      + 1192463 1192522 0.2526351
5 mm_ex_fwd_000545      Chr      + 1192468 1192527 0.1922655
6 mm_ex_fwd_000546      Chr      + 1192473 1192532 0.2610747

There's a good discussion of merge on Stack Overflow, which includes right, left, inner and outer joins. Also, the R wiki covers both match and merge. See also the prior entry on select operations on R data frames.

Wednesday, December 16, 2009

Modern Information Management in Bioinformatics

Jon Udell talks bioinformatics with Randy Julian of Indigo BioSystems on the topic of flexible data repositories.

… without buying into all the hype around semantic web and so on, you would argue that a flexible schema makes more sense in a knowledge gathering or knowledge generation context than a fixed schema does.

His contention is that fixed schemas don't work for knowledge discovery; instead, the right tools are flexible schemas and linked data. Also, it's not enough to represent experimental data in standard ways. We also need to describe the experimental design that provides the context for that data. To accomplish this, use documents annotated with RDF-style triples or XML plus (not-quite-free-text) descriptions built from a controlled vocabulary. Use virtualization to archive complete data analysis environments for reproducibility.

On the IndigoBio blog, there are a couple of posts about interoperable data that make use of R and Cytoscape. Sounds like territory familiar to my current project/nemesis Gaggle.

The conversation then turns to the increasingly distributed nature of the drug industry and the IT challenges of tightly restricted data sharing between highly paranoid competitors. The goal is to produce portable data assets with the ability to merge with any client's knowledge base -- mapping into the other's terms.


Thursday, December 10, 2009


Jeff Atwood, who writes the well-known Coding Horror blog, took on the topic of Microformats recently. His misguided comments about the presumed hackiness of overloading CSS classes with semantic meaning (actually their intended purpose) had people quoting the HTML spec:

The class attribute, on the other hand, assigns one or more class names to an element; the element may be said to belong to these classes. A class name may be shared by several element instances. The class attribute has several roles in HTML:
  • As a style sheet selector (when an author wishes to assign style information to a set of elements).
  • For general purpose processing by user agents.

Browsers work great for navigation and presentation, but we can only really compute with structured data. Microformats combine the virtues of both.

There are at least a couple of ways in which the ability to script interaction with web applications comes in handy. For starters, microformats are a huge advance compared to screen-scraping. The fact that so many people suffered through the hideous ugliness of screen-scraping proves that there must be some utility to be had there.

Also, web-based data sources often have both a browser-based front-end and a web service. Microformats can link these together. A user finds records of interest by searching in the browser; embedded microformats then allow the automated construction of a web service call to retrieve the data in structured form.

Microformats aren't anywhere near the whole answer. But, the real question is how to do data integration at web scale using the web as a channel for structured data.


Wednesday, December 09, 2009

Distilling Free-Form Natural Laws from Experimental Data

Eureqa software implements the symbolic regression technique described in this paper:

Schmidt M., Lipson H. (2009) "Distilling Free-Form Natural Laws from Experimental Data," Science, Vol. 324, no. 5923, pp. 81 - 85.