Digithead's Lab Notebook: 2014

Wednesday, December 24, 2014

What the #@$% is a Monad?

Monads are like fight club. The first rule of monads is don't blog about monads.

Kind of a design pattern for functional programming, monads are already the subject of more than enough well intentioned but confusing tutorials. We'll not commit the monad tutorial fallacy here. But, monads are needed for a couple of the labs from FP101x, an online class in Haskell - labs with a throw-'em-into-the-deep-end quality to them.

Here's a quick list of some of the better resources I found, while struggling to get a handle on these super-abstract objects of mystery.

Starting points

Functors, Applicatives, And Monads In Pictures, by Aditya Bhargava, is a friendly place to start.
Brian Beckman: Don't fear the Monad
FPComplete's School of Haskell explains what monads are and why to use them.

Phillip Wadler

It's been said that "Monads are hard because there are so many bad monad tutorials getting in the way of finally finding Wadler’s nice paper." Find it here:

Monads for functional programming

Need more?

Those got me over the first hump, but here are some I may want to come back to later:

You Could Have Invented Monads! (And Maybe You Already Have)
Monads for the Curious Programmer by Bartosz Milewski who used to be in Black Sabbath. See also Milewki's easiest way to understand Monads.
Monads and Gonads a Google Tech-Talk by Douglas Crockford
Is There Anything Left to Say about Monads?
Monads in small bites
Monads for Dummies
What are monads and why are they useful?

To put monads in a more general context, here's a really great guide to Getting started with Haskell.

Wednesday, December 03, 2014

Lee Edlefsen on Big Data in R

Lee Edlefsen, Chief Scientist at Revolution Analytics, spoke about Big Data in R at the FHCRC a week or two back. He introduced the PEMA or parallel external memory algorithm.

“Parallel external memory algorithms (PEMA's) allow solution of both capacity and speed problems, and can deal with distributed and streaming data.”

When a problem is too big to fit in memory, external memory algorithms come into play. The data to be processed is chunked and loaded into memory a chunk at a time and partial results from each chunk combined into a final result:

initialize
process chunk
update results
process results

Edlefsen made a couple of nice observations about these steps. Processing an individual chunk can often be done independently of other chunks. In this case, it's possible to parallelize. If updating results can be done as new data arrives, you get streaming.

Revolution has developed a framework for writing parallel external memory algorithms in R, RevoPemaR, making use of R reference classes.

I couldn't find Edlefsen's exact slides, but these decks on parallel external memory algorithms and another from UseR 2011 on Scalable data analysis in R seem to cover everything he talked about.

Saturday, November 22, 2014

Haskell class, so far

Well, I'm about 5 weeks into Introduction to Functional Programming, a.k.a. FP101x, an online class taught in Haskell by Eric Meijer. The class itself is a couple weeks ahead of that; I'm lagging a bit. So, how is it so far, you ask?

The first 4 weeks covered basic functional concepts and how to express them in Haskell, closely following chapters 1-7 of the book, Graham Hutton's Programming in Haskell:

Defining and applying functions
Haskell's type system
- parametric types
- type classes
- type signatures of curried functions
pattern matching
list comprehensions
recursion
higher-order functions

Haskell's hierarchy of type classes is elegant, but some obvious things seem to be missing. For example, you can't show a function. But, it would be really helpful to show something like a docstring, or at least the function's type signature. Also machine-word sized Int's don't automatically promote, so if n is an Int, n/5 produces a type error.

Most of the concepts were familiar already from other functional languages, Scheme via SICP, OCAML via Dan Grossman's programming languages class, and Clojure via The Joy of Clojure. So, this early part was mostly a matter of learning Haskell's syntax.

Some nifty examples

a recursive definition of factorial:

factorial :: Integer -> Integer
factorial 0 = 1
factorial n = n * (factorial (n-1))

sum of the first 8 powers of 2:
```
sum (map (2^) [0..7])
```

a recursive definition of map:

map :: (a -> b) -> [a] -> [b]
map f [] = []
map f (x:xs) = f x : map f xs

get all adjacent pairs of elements from a list:

pairs :: [a] -> [(a,a)]
pairs xs = zip xs (tail xs)

check if a list of elements that can be ordered is sorted by confirming that each pair of elements is ordered:
```
sorted :: Ord a => [a] -> Bool
sorted xs = and [x <= y |(x,y) <- pairs xs]
```

Haskell attains its sparse beauty by leaving a lot implied. One thing I figured out during my brief time with OCAML also seems to apply to Haskell. Although these languages lack the forest of parentheses you'll encounter in Lispy languages, it's not that the parentheses aren't there; you just can't see them. A key to reading Haskell is understanding the rules of precedence, associativity and fixity that imply the missing parentheses.

Pre- cedence	Left associative	Non- associative	Right associative
9	`!!`		`.`
8			`^`,`^^`,`**`
7	`*`, `/`, `div`, `mod`, `rem`, `quot`
6	`+`,`-`
5			`:`,`++`
4		`==`, `/=`, `<`, `<=`, `>`, `>=`, `elem`, `notElem`
3			`&&`
2			`\|\|`
1	`>>`, `>>=`
0			`$`, `$!`, `seq`

Another key is reading type signatures of curried functions, as currying is the default in Haskell and is relied upon extensively in composing functions, particularly in the extra terse "point-free" style.

Currently, I'm trying to choke down Graham Hutton's Addendum on Monads. If I end up understanding that, it'll get me a code-monkey merit badge, for sure.

Tuesday, November 11, 2014

The DREAM / RECOMB Conference 2014

The RECOMB/ISCB Conference on Regulatory and Systems Genomics, with DREAM Challenges and Cytoscape Workshops is running this week in San Diego.

A bunch of us from Sage Bionetworks are here to connect with the DREAM community. In introductory remarks, Stephen Friend framed the challenges as piloting new modes of collaboration and engagement addressing multidimensional problems based on the idea that open innovation will trump closed silos.

Lincoln Stein: The Future of Genomic Databases

I first heard Lincoln Stein speak at an O'Reilly conference in 2002, on building a bioinformatics nation. The same themes of openness and integration reappeared in Stein's talk on The Future of Genomic Databases.

Stein asks, "Open Data + open source = reproducible science?" Not exactly. Stein presents some emerging solutions to the remaining obstacles: big data sets, complex workflows, unportable code and data access restrictions.

Cloud computing, specifically colocation of data and compute, enables handling big data. Containers (ie Docker) address the problem of code portability. The Global Alliance is working towards providing APIs both to encapsulate technical complexity and to provide a control point at which to enforce restrictions.

In case we're wondering what to do with all the machine cycles made available by Amazon and Google, bioinformatics workflows are growing in complexity. Workflow managers like Seqware and Galaxy provide a formalized description of multistep processes and manage tools and their dependencies.

Legal restrictions hinder data integration. But, donors want their samples to contribute to research. Licensure for data access combined with uniform consent could reduce the friction resulting in a streamlined data access process. On the other hand, technical solutions involve homomorphic encryption and agent based federated queries.

As a parting thought, Stein notes that digital infrastructure enables experiments in incentive structures and economic models, citing micropayments, ratings, and challenges.

Andrea Califano

Andrea Califano spoke on the genotype to phenotype linkage in cancer. Thinking of the cell as an integrator of signals, Califo's group traces from gene or protein expression signatures of cell states (normal, neoplastic, metastatic) back through the network to the master regulators responsible for that signature. One related paper is Identification of Causal Genetic Drivers of Human Disease through Systems-Level Analysis of Regulatory Networks.

Paul Boutros Somatic Mutation Calling Challenge

Paul Boutros presented the Somatic Mutation Calling Challenge (SMC-DNA). He announced the intention for the SMC challenges to become a living benchmark, an objective standard against which future methods will be tested.

Paul also crowned the Broad Institute's MuTect (single nucleotide) and novoBreak (structural variants) by Ken Chen's lab at MD Anderson the winners of the synthetic tumor phase of SMC-DNA. The plan is to announce winners on real tumor data in February after experimental validation.

The Winners

The SMC challenge is a bit unique for DREAM in its level of specialization. In the other challenge, a couple of methods were highlighted: Gaussian process regression and dictionary learning for sparse representation.

But, increasingly, the main differentiator is application of biological domain knowledge, especially with respect to selecting and processing features. Li Liu of Arizona State's Biodesign Institute, for example, won part of the Accute Myoloid Leukemia challenge by weighting proteins based on their evolutionary conservation.

Another theme is that genetic features seem to have poor signal compared to more downstream features, gene expression or clinical variables. Peddinti Gopalacharyulu, a top performer in the Gene Essentiality Challenge, commented that perhaps the way to use genetics is to extract the component of gene expression that is not explained by genetic features.

DREAM 9.5

Two of the Dream 9.5 challenges are follow-ups to the Somatic Mutation Calling challenge from the 8.5 round. The SMC empire expands into RNA and tumor heterogeneity. In the olfaction challenge, the goal is to predict, from molecular features, odor as described by human subjects. The Prostate cancer challenge asks participants to classify patients according to survival using data sourced from the comparator arms of clinical trials.

For the DREAM 10 round, there's an imaging challenge in the works and a sequel to the ALS challenge challenge from DREAM 7.

On to RECOMB

That's just the DREAM part of the meeting, or, really, the subset that fit into my brain. As an added bonus, there were several representatives from Cytoscape-related projects and some conversation about the Global Alliance for Genomics and Health.

Tuesday, October 14, 2014

Let's learn us a Haskell

Let's say you've been meaning to learn Haskell for a long time, secretly yearning for purely functional programming, laziness and a type system based on more category theory than you can shake a functor at.

Now's your chance. Erik Meijer is teaching an online class Introduction to Functional Programming on edX, about which he says, "This course will use Haskell as the medium for understanding the basic principles of functional programming."

It starts today, but I've gotten a head start by working through the first few chapters of Learn You a Haskell for Great Good! which most agree is the best place to get started with Haskell.

Anyone up for a Seattle study group?

Tutorials

Books

Other resources

Typeclassopedia
H-99: Ninety-Nine Haskell Problems
"Programming with Arrows" paper by John Hughes.
C9 Lectures: Erik Meijer - Functional Programming Fundamentals
What I Wish I Knew When Learning Haskell

It must be some pack-rat instinct that makes me compile these lists.

Thursday, July 03, 2014

Galaxy Community Conference

The 2014 Galaxy Community Conference (GCC2014) wrapped up yesterday at Johns Hopkins University in Baltimore. The best part for me was the pre-conference hackathon.

If you're not familiar with Galaxy, it's a framework for wrapping command line bioinformatics tools in a web UI. On top of that, Galaxy adds lots of sophistication around job scheduling, configurable to run its job on SGE, SLURM, and other queuing systems and to run on virtual clusters with Cloudman. Galaxy users can design reproducible workflows and manage tool versioning and dependencies through the Toolshed.

The project has attracted a vibrant community supported by an active Q&A site and an IRC channel. Vendors offer Galaxy appliances, cloud deployments and consulting.

Hackathon

For the hackathon, 40 developers gathered in James Taylor's new computing lab in one of the red brick buildings interspersed with lawns and groves of trees that make up the Hopkins campus. This was a great setting for an accelerated ramp-up on the Galaxy codebase. Out of 18 proposed projects, 9 got far enough to be called finished. I learned how to write tool wrappers and how to access the Galaxy API.

Photo credit - someone on the Galaxy team

Galaxy + Synapse

Specifically, I was there to put together some integration between Synapse and Galaxy, bulding on what Kyle Ellrott had already started. Reproducibile analysis, provenance and annotation are core concerns of both projects, so it seems like a good fit. After data exchange, which Kyle got working, exchanging provenance and metadata sound like logical next steps.

Galaxy histories should be able to refer to Synapse entities and receive annotations along with data objects.

Serializing Galaxy histories, workflows etc. out to Synapse might allow a usage model where a cloud instances of Galaxy can be spun up and shut down on demand, with all the important details preserved in Synapse in between times.

Portable provenance might be a longer term goal. It would be neat to be able to thread provenance through some arbitrary combination of data sources and analysis platforms. All of these provenance-aware platforms - Figshare, Dryad, Synapse, Arvados, Galaxy, etc - ought to be able to share provenance between themselves.

Next year

The whole event was well put together in general. Next year's Galaxy Community Conference is scheduled for 6-8 July 2015 in Norwich, England.

Anton Nekrutenko group at Penn State
James Taylor's group which recently moved from Emory to Johns Hopkins

Tuesday, June 17, 2014

Ada's Technical Books

If you're in Seattle, you owe it to yourself to spend some time in Ada's Technical Books in my old neighborhood. It's on the top of Capitol Hill on 15th between Republican and Harrison.

The shop features books on all sorts of techy topics - science, math, engineering and, of course, computers along with various nifty maker-type gadgets and a cafe full of tasty treats. Geek heaven!

Thursday, May 08, 2014

How to grep git commits

From the There's-got-to-be-a-better-way Department, comes this tale of subterranian Git spelunking...

It all started because I lost some code. I had done my duty like a good little coder. I found a bug, wrote a test that reproduced it and filed an issue noting the name of the failing test. Then, some travel and various fire drills intervened. Memory faded. Finally, getting back to my routine, I had this bug on my docket. Already had the repro; should be a simple fix, right?

But where was my test? I asked Sublime to search for it, carefully turning off the annoying find-in-selection feature. Nothing. Maybe I stashed it. Do git stash list. Nothing that was obviously my test. Note to self: Use git stash save <message> instead of just git stash. I did git show to a several old stashes, to no avail. How am I going to find this thing?

A sensible person would have stopped here and rewritten the test. But, how much work are we willing to do to avoid doing any work? Lots, apparently.

So, I wrote this ridiculous bit of hackery instead. I call it grep_git_objects:

import os
import sys
import subprocess

if len(sys.argv) < 2:
    print "Usage: python grep_git_objects.py <target>"
    sys.exit()

target = sys.argv[1]

hashes = []
for i in range(0,256):
    dirname = ".git/objects/%02x" % i
    #print dirname
    for filename in os.listdir(dirname):
        h = "%02x%s" % (i, filename)
        hashes.append(h)

        cmd = "git cat-file -p %s" % h
        output = subprocess.check_output(cmd.split(' '))
        if target in output:
            print "found in object: ", h
            print output

Gist: grep_git_objects.py

Run it in the root of a repo. It searches through the .git/objects hierarchy to find your misplaced tests or anything else you've lost.

You'd think git grep would do exactly that. Maybe it does and I'm just clueless. I hope so, 'cause otherwise, why would you have a git grep command? In any case, git grep didn't find my missing test. The above bit of Python code did.

Once I found it, I had the hash code for a Git object. Objects are the leaves of Git trees. So, I grepped again for the hash code of the object and found the tree node it belonged to, working my way up a couple more levels to the commit.

With commit in hand, I could see by the "WIP on develop" message that is was indeed a stash, which I must have dropped by mistake. Anyway, that was fun... sorta.

Tuesday, April 22, 2014

The damaging effects of hypercompetition

In Rescuing US biomedical research from its systemic flaws (PNAS 2014), authors Bruce Alberts, Marc Kirschner, Shirley Tilghman and Harold Varmus recount the ways in which research funding is broken, in particular for life sciences.

PI's pour vast amounts of time and energy into competing for a shrinking pool of funds. Pressure for results causes a shift to short-term thinking and engenders a conservatism poorly suited to producing breakthroughs. In much the same way as the corporate sector abandoned fundamental research in favor of product development, the current funding climate overvalues "translational" research over basic science. The situation is especially dire for young scientists facing the prospect of years as an underpaid and overworked post-doc with faint hopes of ever landing a faculty position. The internet is littered with goodbye-academia letters. [1 2 3 4 5 6]

The Alberts paper proposes a few solutions, some sensible and one in particular that doesn't make much sense to me.

Broadening career paths for scientists is a sound idea. In this respect, life science can borrow from information technology. Computer science departments and the tech industry have a long history of exchange. The barriers in biotech are higher, but the flow of people and ideas between academics and industry can only be a good thing. Probably the biggest factor in diversifying away from the shrinking tenure track, is matching expectation with reality.

Managing expectations seems more sensible than trying to match supply and demand by restricting entry into PhD programs based on the reasoning that the system is training too many PhDs. Isn't too many PhD's a good problem to have? Is educating people a bad thing? Scientifically trained people are a tremendous asset. If we're not using a valuable resource effectively, having less of that resource doesn't seem like an ideal solution. With some creative thinking, productive uses for that talent can and are being found.

I'm not convinced that the pyramid shaped structure of science is the problem. Harnessing the idealism, curiosity and naive overconfidence of youth works. What doesn't work in absence of perpetual growth is the expectation that everyone at the bottom of the pyramid will eventually be at the top. But, what's wrong with a system in which people are supported for a time on research grants then continue onto many diverse career paths?

Getting the mix of competition and camaraderie right can be the difference between a situation that's nurturing rather than toxic. Economics teaches that incentives matter. But, re-engineering science will require a nuanced understanding of what those incentives are, both in crass terms of money and prestige but also the more subtle psychology around freedom, inner drive and higher purpose.

Since nobody asked me

If I was in charge of funding science, I'd place the majority of my bets on longer term grants for small focused labs where the PI does hands-on science and training of new scientists. Longer term grants of 5 years or so would give researchers time to do actual science in teams of 2-8 people. The article recognizes the success of the Howard Hughes Institutes model of explicitly encouraging scientists to pursue risky ideas - "the free play of free intellects".

I'd be reluctant to fund 40 person labs because neither mentoring nor creative thinking scales up to that size particularly well. I'd avoid large multi-center grants. Funding agencies seem to favor this sort of thing perhaps as a safer bet, they're a recipe for infighting and politicking. They risk getting all the disadvantages of both collaboration and competition with little of the benefits.

Research has a great track record of paying off in the long run. But, Bill Janeway, author of Doing Capitalism in the Innovation Economy says that there has to be a surplus in the funding system to absorb the unavoidable costs of uncertainty, both in terms of when the payoff comes and who captures the gains. In the past, this surplus has come from monopolistic corporations like Bell or Xerox, from the taxpayer, or from flush venture capital. In The Great Stagnation, Tyler Cowen posits that we've picked the low-hanging fruit of the current wave of technology - that we're in a period of stagnation while the last wave is fully digested and the stage can be set for the next wave. To that end, maybe some hedge-fund wizard can cook up some financial innovation for reliably financing risky, long-term projects with unevenly distributed payouts.

[1] Goodbye academia, I get a life
[2] Goodbye Academia
[3] Goodbye academia? Hello, academia
[4] The Big Data Brain Drain
[5] Why So Many Academics Quit and Tell
[6] On Leaving Academe
[7] Science: The Endless Frontier, Vannevar Bush, 1945
[8] The Decline of Unfettered Research, Andrew Odlyzko, 1995

Friday, April 11, 2014

Clojure Koans

In an attempt to reach a bit higher plane of enlightenment with respect to Clojure, I did the Clojure Koans. What a great way to get familiar with a new language.

It might be worth watching the video solutions: Clojure Koans Walkthrough in Light Table.

Thursday, February 20, 2014

Regression with multiple predictors

Now that I'm ridiculously behind in the Stanford Online Statistical Learning class, I thought it would be fun to try to reproduce the figure on page 36 of the slides from chapter 3 or page 81 of the book. The result is a curvaceous surface that slices neatly through the data set.

I'm not sure what incurable chemical imbalance explains why such a thing sounds like fun to me. But let's go get us some data, specifically the Advertising data set from the book's website:

Advertising <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv", 
    row.names = 1)

Previously, we fooled around with linear models with one predictor. We now extend regression to the case with multiple predictors.

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 +···+ \hat{\beta}_p x_p \]

Or, in matrix notation:

\[ \hat{Y} = X \cdot \hat{\beta} + \hat{\beta_0} \]

Where the dimensions of X are n by p, and $ \beta $ is p by 1 resulting in a Y of size n by 1, a prediction for each of the n samples.

Find a fit for Sales as a function of TV and Radio advertising plus the interaction between the two.

fit2 = lm(Sales ~ TV * Radio, data = Advertising)

Before making the plot, define a helper function to evenly divide the ranges. This probably exists somewhere in R already, but I couldn't find it.

evenly_divide <- function(series, r = range(series), n = 10) {
    c(r[1], 1:n/n * (r[2] - r[1]) + r[1])
}

Generate a 2D grid over which we'll predict Sales.

x = evenly_divide(Advertising$TV, n = 16)
y = evenly_divide(Advertising$Radio, n = 16)

Using the coefficients of the fit, create a function f that computes the predicted response. It would be nice to use predict for this purpose. From a functional programming point of view, it seems like there should be a function that takes a fit and returns a function equivalent to f. If such a thing exists, I'd like to find it.

beta = coef(fit2)
f <- function(x, y) beta[1] + beta[2] * x + beta[3] * y + beta[4] * x * y
z = outer(x, y, f)

I copied the coloring of the regression surface from the examples for persp. I'm guessing the important part is create a color vector of the right size to cycle through the facets correctly.

nrz <- nrow(z)
ncz <- ncol(z)
nbcol <- 100

# Create color palette
palette <- colorRampPalette(c("blue", "cyan", "green"))
color <- palette(nbcol)

# Compute the z-value at the facet centres
zfacet <- z[-1, -1] + z[-1, -ncz] + z[-nrz, -1] + z[-nrz, -ncz]

# Recode facet z-values into color indices
facetcol <- cut(zfacet, nbcol)

OK, finally we get to the plotting. A call to persp sets up the coordinate system and renders the regression surface.

With the coordinate system we got in the return value, we plot the line segments representing the residuals, using a higher transparency for points under the surface. We do the same for the actual points, so they look a bit like they're underneath the surface.

# Draw the perspective plot
res <- persp(x, y, z, theta = 30, phi = 20, col = color[facetcol], xlab = "TV", 
    ylab = "Radio", zlab = "Sales")

# Draw the residual line segments
xy.true = trans3d(Advertising$TV, Advertising$Radio, Advertising$Sales, pmat = res)
xy.fit = trans3d(Advertising$TV, Advertising$Radio, predict(fit2, Advertising), 
    pmat = res)
colors = rep("#00000080", nrow(Advertising))
colors[residuals(fit2) < 0] = "#00000030"
segments(xy.true$x, xy.true$y, xy.fit$x, xy.fit$y, col = colors)

# Draw the original data points
colors = rep("#cc0000", nrow(Advertising))
colors[residuals(fit2) < 0] = "#cc000030"
points(trans3d(Advertising$TV, Advertising$Radio, Advertising$Sales, pmat = res), 
    col = colors, pch = 16)

Fit surface

To be honest, this doesn't look much like the figure in the book and there's reason to doubt the information conveying ability of 3D graphics, but whatever - it looks pretty cool!