Thursday, January 28, 2010

Analytics in Seattle

Seattle hosts a lot of activity in analytics (aka data analysis, visualization, and business intelligence).

Robert Gentleman, one of the principals of the R project, was for a long time associated with the Fred Hutchinson Cancer Research Center, just across the lake from where I sit. He was hired in September 2009 by Genentech (now a subsidiary of Roche), and was also recently named to the board of REvolution Computing. REvolution's product development office is here in Seattle.

R is an open source implementation of the S language, which originated at Bell Labs. S-PLUS, a commercial offspring of S, was introduced in 1988 by UW professor R. Douglas Martin's company Statistical Sciences, Inc. That company was bought in 1993 by MathSoft, maker of Mathcad, then sold off again in 2001 as Insightful Corporation. MathSoft continued to struggle and was itself acquired in 2006 by PTC. TIBCO bought Insightful in 2008. TIBCO, as of 2007, also owns Spotfire.

Tableau, maker of analytics and visualization software that competes with Spotfire, is next to Google in Fremont. It probably doesn't hurt that UW has strong statistics, biostatistics, and computer science departments.

Just last year, Microsoft nabbed former South Lake Union resident Rosetta Biosoftware for integration into the spooky-sounding Microsoft Amalga Unified Intelligence System.

Monday, January 18, 2010

Split with escaped delimiters

I have a lot of little scripts that read tab-delimited text files, 'cause that's the way things get done in the creaky world of bioinformatics. Due to the limitations of flat files, you often see an attributes field, which holds key-value pairs in some more-or-less parsable way, typically something like this:

key1=value1;key2=value2;

Parsing this with regular expressions is pretty easy. Take this fragment of Ruby:

 attributes = {}
 File.foreach('attributes.txt') do |line|
   line.chomp!
   # capture each key (anything but ';' or '=') and its value (anything but ';')
   line.scan(/([^;=]+)=([^;]+)/) do |key, value|
     attributes[key] = value
   end
 end

 attributes.each_pair do |key, value|
   puts("#{key}=#{value}")
 end

Because I'm a knucklehead, I decided to make it hard. What if we wanted to allow escaped delimiter characters? RegexGuru tells you that Split() is Not Always The Best Way to Split a String, and this is one of those cases. The following works:

 def unescape(string)
   # Strip the backslash from each escaped character in a single pass.
   # Chained gsubs would mis-handle input like "\\=" (an escaped
   # backslash followed by a bare '='), unescaping it twice.
   string.gsub(/\\([\\=;])/, '\1')
 end

 attributes = {}
 File.foreach('attributes.txt') do |line|
   line.chomp!
   # A key is a run of escape sequences (\\, \= or \;) or characters
   # other than bare ';' and '='; a value is the same, except it may
   # also contain bare '='.
   line.scan(/((?:\\\\|\\=|\\;|[^;=])+)=((?:\\\\|\\;|[^;])+)/) do |key, value|
     attributes[unescape(key)] = unescape(value)
   end
 end

 attributes.each_pair do |key, value|
   puts("#{key}=#{value}")
 end
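
For example, given a made-up input line like

 path=/tmp/a\;b;count=3;

the scan pulls out the key path and the raw value /tmp/a\;b (the escaped semicolon doesn't end the value), and unescape then turns the value into /tmp/a;b.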

Take-home message: this is way more trouble than it's worth. Also, don't forget that split behaves differently with respect to empty fields from one language to another.
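
For instance, R's strsplit documents that a match at the end of the string is treated as if the trailing delimiter weren't there, so the final empty field silently disappears while interior empty fields survive:

> strsplit("a;;b;", ";")[[1]]
[1] "a" ""  "b"

Python's str.split, by contrast, keeps the trailing empty string: 'a;;b;'.split(';') gives ['a', '', 'b', ''].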

Friday, January 15, 2010

State of data integration for biology

The field of biology has a proliferation of data repositories and software tools. That's probably inevitable. Lab techniques, computational methods, and algorithms advance rapidly. Technology trends come and go. And funding for sustained maintenance is hard to come by. Biological data comes in a diversity of types and of varying quality, driving the need to coordinate several lines of evidence.

Data integration has been a persistent problem. Lincoln Stein's Creating a bioinformatics nation and Integrating biological databases give the background. A pair of reviews update this topic.

State of the nation in data integration for bioinformatics

In State of the nation in data integration for bioinformatics, Carole Goble and Robert Stevens give an update (to 2007) with a semantic web spin. The complexity of biological data arises from:

  • diversity of data types
  • varying quality
  • describing a sample and its originating context
  • ridiculous situation with identifiers: Every conceivable pathology of identifiers is standard practice in biology - multiple names for the same thing, the same name for different things. Equality is often hazy and context-dependent.

To address this complication, various strategies have been tried, which they categorize as:

  • Service-oriented architectures: CORBA, WS, REST
  • Link integration
  • Data warehousing
  • View integration (distributed queries)
  • Model-driven service-oriented architecture (caBIG, Gaggle?)
  • Integration applications (not clear to me what these are)
  • Workflows (Taverna)
  • Mashups
  • Semantic Web (smashups):
    • RDF
    • Ontologies
    • Linked data sets
    • Semantic mapping

Messaging probably belongs on this list. They call Gaggle a model-driven service-oriented architecture, which I'm not sure is the right classification. Message passing fits Gaggle better. Messaging tends to get overlooked, in spite of lots of work on message-oriented middleware and enterprise application integration in the business domain. (See Gregor Hohpe.) Messaging has evolved from a remote-procedure-call style toward a more document-centric view, but that shift is not always recognized.

They use the term touch-points to mean common keys on which to join data.

They integrate on a range of common or corresponding touch-points: data values, names, identities, schema properties, ontology terms, keywords, loci, spatial-temporal points, etc. These touch-points are the means by which integration is possible, and a great deal of a bioinformatician’s work is the mapping of one touch-point to another.

They mention BioMoby, a tool for composing web-services into workflows based on describing services and their input and output data types with controlled vocabularies. The founders of BioMoby seem to have drifted steadily towards the semantic web, but at one time they said this:

[...] interoperability in the domain of bioinformatics is, unexpectedly, largely a syntactic rather than a semantic problem. That is to say, interoperability between bioinformatics Web Services can be largely achieved simply by specifying the data structures being passed between the services (syntax) even without rich specification of what those data structures mean (semantics).

Microformats get a brief mention as a means to enrich web content with semantics - structured representation, links to supporting evidence or related data, provenance, etc.

Goble and Stevens also cite the late, great SIMILE Piggy Bank, a Firefox extension allowing users to add their own semantic markup to web pages - a key source of inspiration for Firegoose. Another interesting reference is Alon Halevy's Principles of dataspace systems (2006), an effort to think about (just-in-time) data integration at web scale. (See Dynamic Fusion of Web Data for more.)

State of the cyberinfrastructure

Lincoln Stein's Towards a cyberinfrastructure for the biological sciences (2008) assesses the state of computing in biology with the aim of “the ability to create predictive, quantitative models of complex biological processes”. He defines cyberinfrastructure as consisting of data sources, computing resources, communication (not just networks, but syntactic and semantic connectivity as well), and human infrastructure (skills and cultural aspects).

The current biology cyberinfrastructure has a strong data infrastructure, a weak to non-existent computational grid, patchy syntactic and semantic connectivity, and a strengthening human infrastructure.

He seems to conclude that the semantic web is promising, but not quite there yet. Almost echoing the quote above from the BioMoby paper, Stein says:

[...] in the short term, [...] we can gain many of the benefits of more sophisticated systems by leveraging the human capacity to make sense of noisy and contradictory information. Semantic integration will occur in the traditional way: many human eyes interpreting what they read and many human hands organizing and reorganizing the information.

Both of these papers suggest that the semantic web may finally solve these persistent data integration problems, but this cynical reader was left with a picture of a string of technologies that promised the same and never lived up to the hype. Semantic markup is a significant effort whose costs fall on the data provider but whose benefits go mainly to others. Maybe tools will come along that reduce that effort, but I can't imagine there will ever be a day when quick-and-dirty loses its appeal. This is research, after all. In an exploratory analysis, a scientist is likely to try several false starts before finding a fruitful path. If integration at the (quicker and dirtier) syntactic level is even a little easier, it seems like a better choice for this purpose. The opposite is probably true for final published results.

In both articles, mashups and just-in-time integration show signs of getting some respect, which I like. Same for ideas related to linked data.

The really important goal

There is probably a small set of principles of data organization within a given domain that would impose little or no burden on data providers. And if we all accepted those principles up front, any two arbitrary pieces of data from different sources could be joined together either automatically or with reasonably little effort, and without any need for further central coordination.

Saturday, January 09, 2010

Pivot tables in R

A common data-munging operation is to compute cross tabulations of measurements by categories. Excel has a nice feature called pivot tables for this purpose, and SQL Server has a PIVOT operator. Here we'll figure out how to do pivot operations in R.

Let's imagine an experiment where we're measuring the gene activity of an organism under different conditions -- exposure to different nutrients and toxins. Our conditions are silly: copper, beer, pizza, and cheetos. First we make a list of genes. Then expand.grid generates all combinations of genes and conditions. Finally, we tack on a column of randomly generated measurements. (The measurements are random, so your numbers will differ from those below.)

> genes = paste('MMP', sprintf("%04d",1:10), sep="")
> data = expand.grid(gene=genes, condition=c('copper', 'cheetos', 'beer', 'pizza'))
> data$value = rnorm(40)
> data
      gene condition       value
1  MMP0001    copper  0.90412805
2  MMP0002    copper  0.92664376
3  MMP0003    copper  0.27772147
4  MMP0004    copper  0.08958930
5  MMP0005    copper -0.20132304
6  MMP0006    copper  0.34524729
7  MMP0007    copper -0.33910206
8  MMP0008    copper  1.21006486
9  MMP0009    copper  0.78008022
10 MMP0010    copper  1.05364315
11 MMP0001   cheetos -2.31796229
12 MMP0002   cheetos  0.76706591
13 MMP0003   cheetos -2.93692935
14 MMP0004   cheetos  0.25452306
15 MMP0005   cheetos  0.24168329
16 MMP0006   cheetos  0.28739734
17 MMP0007   cheetos  0.69233543
18 MMP0008   cheetos  0.48865250
19 MMP0009   cheetos -0.11129319
20 MMP0010   cheetos  0.53322842
21 MMP0001      beer -0.74965948
22 MMP0002      beer  0.27105205
23 MMP0003      beer -0.99261363
24 MMP0004      beer  0.65143639
25 MMP0005      beer -0.35589696
26 MMP0006      beer  1.40147484
27 MMP0007      beer  0.37492710
28 MMP0008      beer  0.64453865
29 MMP0009      beer  0.35925345
30 MMP0010      beer  0.96394785
31 MMP0001     pizza -1.91818504
32 MMP0002     pizza  0.31690523
33 MMP0003     pizza -1.20566043
34 MMP0004     pizza -1.91750166
35 MMP0005     pizza  1.98010023
36 MMP0006     pizza  0.90468249
37 MMP0007     pizza  0.04284970
38 MMP0008     pizza -0.08141461
39 MMP0009     pizza -0.72471771
40 MMP0010     pizza -0.01085060

We want to pivot the conditions into columns so that we end up with one column for each condition and one row for each gene. The easy way is to use the reshape package by Hadley Wickham, which is made for restructuring data and does this job nicely. If you don't already have it, you'll have to run install.packages, then load the library.

> install.packages('reshape')
> library(reshape)

Using cast to move conditions into columns is a snap.

> cast(data, gene ~ condition)
      gene     copper    cheetos       beer       pizza
1  MMP0001  0.9041281 -2.3179623 -0.7496595 -1.91818504
2  MMP0002  0.9266438  0.7670659  0.2710521  0.31690523
3  MMP0003  0.2777215 -2.9369294 -0.9926136 -1.20566043
4  MMP0004  0.0895893  0.2545231  0.6514364 -1.91750166
5  MMP0005 -0.2013230  0.2416833 -0.3558970  1.98010023
6  MMP0006  0.3452473  0.2873973  1.4014748  0.90468249
7  MMP0007 -0.3391021  0.6923354  0.3749271  0.04284970
8  MMP0008  1.2100649  0.4886525  0.6445386 -0.08141461
9  MMP0009  0.7800802 -0.1112932  0.3592535 -0.72471771
10 MMP0010  1.0536432  0.5332284  0.9639479 -0.01085060

Done!

That was too easy

Just as an exercise, what would we have to do without reshape? And, just to keep ourselves honest, let's make sure we can deal with missing data (as reshape can). Make some data go missing:

> data.incomplete <- data[data$value > -1.0,]
> dim(data.incomplete)
[1] 35  3
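
(For the record, reshape copes with the gaps directly - assuming its default fill behavior, cast(data.incomplete, gene ~ condition) puts an NA wherever a gene/condition pair is missing.)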

Now, split the data frame up by condition. This produces a list where each element is a data frame containing a subset of the data for each condition. Notice that the cheetos data frame has values for 8 of the 10 genes.

> data.by.condition <- split(data.incomplete, data.incomplete$condition)
> typeof(data.by.condition)
[1] "list"
> names(data.by.condition)
[1] "copper"  "cheetos" "beer"    "pizza"  
> data.by.condition$cheetos
      gene condition      value
12 MMP0002   cheetos  0.7670659
14 MMP0004   cheetos  0.2545231
15 MMP0005   cheetos  0.2416833
16 MMP0006   cheetos  0.2873973
17 MMP0007   cheetos  0.6923354
18 MMP0008   cheetos  0.4886525
19 MMP0009   cheetos -0.1112932
20 MMP0010   cheetos  0.5332284

We're going to recombine the data into a data frame with one row for each gene, so let's get that started:

> result = data.frame(gene=genes)

Now comes some executable line noise. We're going to loop through the list and add a column to the result data frame during each iteration of the loop. We pull the column out of the data frame in the list, but we have to make sure the column has an element for each gene. Merging with the all parameter set is like an outer join: we get a row for each gene, with NA's inserted where the data is missing.

> for (i in seq(along=data.by.condition)) {
+   result[[names(data.by.condition)[i]]] <-
+     merge(data.by.condition[[i]], genes, by.x='gene', by.y=1, all=T)$value
+ }

> result
      gene     copper    cheetos       beer       pizza
1  MMP0001  0.9041281         NA -0.7496595          NA
2  MMP0002  0.9266438  0.7670659  0.2710521  0.31690523
3  MMP0003  0.2777215         NA -0.9926136          NA
4  MMP0004  0.0895893  0.2545231  0.6514364          NA
5  MMP0005 -0.2013230  0.2416833 -0.3558970  1.98010023
6  MMP0006  0.3452473  0.2873973  1.4014748  0.90468249
7  MMP0007 -0.3391021  0.6923354  0.3749271  0.04284970
8  MMP0008  1.2100649  0.4886525  0.6445386 -0.08141461
9  MMP0009  0.7800802 -0.1112932  0.3592535 -0.72471771
10 MMP0010  1.0536432  0.5332284  0.9639479 -0.01085060

Extra finesse points if you can figure out how to do that last step with Reduce instead of a loop.
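
Here's a sketch of one way that might look (untested; it folds the condition names into a growing data frame, using the same merge trick as the loop):

> result <- Reduce(
+   function(acc, cond) {
+     # outer-join this condition's values against the full gene list,
+     # then attach them as a column named after the condition
+     acc[[cond]] <- merge(data.by.condition[[cond]], genes,
+                          by.x='gene', by.y=1, all=T)$value
+     acc
+   },
+   names(data.by.condition),
+   data.frame(gene=genes))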