Wednesday, December 24, 2014

What the #@$% is a Monad?

Monads are like Fight Club. The first rule of monads is don't blog about monads.

Kind of a design pattern for functional programming, monads are already the subject of more than enough well-intentioned but confusing tutorials. We won't commit the monad tutorial fallacy here. But monads are needed for a couple of the labs from FP101x, an online class in Haskell, and those labs have a throw-'em-into-the-deep-end quality to them.

Here's a quick list of some of the better resources I found while struggling to get a handle on these super-abstract objects of mystery.

Starting points

Philip Wadler

It's been said that "Monads are hard because there are so many bad monad tutorials getting in the way of finally finding Wadler’s nice paper." Find it here:

Need more?

Those got me over the first hump, but here are some I may want to come back to later:

To put monads in a more general context, here's a really great guide to Getting started with Haskell.

Wednesday, December 03, 2014

Lee Edlefsen on Big Data in R

Lee Edlefsen, Chief Scientist at Revolution Analytics, spoke about Big Data in R at the FHCRC a week or two back. He introduced the PEMA, or parallel external memory algorithm.

“Parallel external memory algorithms (PEMA's) allow solution of both capacity and speed problems, and can deal with distributed and streaming data.”

When a problem is too big to fit in memory, external memory algorithms come into play. The data to be processed is split into chunks and loaded into memory one chunk at a time, and the partial results from each chunk are combined into a final result:

  1. initialize
  2. process chunk
  3. update results
  4. process results
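
As a rough illustration of those four steps, here is a minimal Python sketch (not R, and not Revolution's code) that computes a mean over data read one chunk at a time. The names `chunks`, `process_chunk`, and `external_mean` are made up for this example, and a generator over an in-memory list stands in for reading chunks from disk.

```python
from typing import Iterable, Iterator, List, Tuple


def chunks(data: List[float], size: int) -> Iterator[List[float]]:
    """Hypothetical stand-in for reading a large file one chunk at a time."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def process_chunk(chunk: List[float]) -> Tuple[float, int]:
    """Step 2: reduce one chunk to a small partial result (sum, count)."""
    return sum(chunk), len(chunk)


def external_mean(chunk_source: Iterable[List[float]]) -> float:
    total, count = 0.0, 0                    # step 1: initialize
    for chunk in chunk_source:
        s, n = process_chunk(chunk)          # step 2: process chunk
        total, count = total + s, count + n  # step 3: update results
    if count == 0:
        raise ValueError("no data")
    return total / count                     # step 4: process results


print(external_mean(chunks([1.0, 2.0, 3.0, 4.0, 5.0], size=2)))  # prints 3.0
```

Only one chunk plus the running `(total, count)` pair is held in memory at any time, which is the point of the external memory approach.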

Edlefsen made a couple of nice observations about these steps. Processing an individual chunk can often be done independently of other chunks. In this case, it's possible to parallelize. If updating results can be done as new data arrives, you get streaming.
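
Both observations can be sketched in Python, again as my own illustration rather than anything from the talk or from RevoPemaR. The names `summarize`, `combine`, `parallel_mean`, and `streaming_means` are hypothetical, and a thread pool is used only to keep the example short; a real implementation would more likely spread chunks across processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import Iterable, Iterator, List, Tuple

Partial = Tuple[float, int]  # (sum so far, count so far)


def summarize(chunk: List[float]) -> Partial:
    """Summarize one chunk without looking at any other chunk."""
    return sum(chunk), len(chunk)


def combine(a: Partial, b: Partial) -> Partial:
    """Merge two partial results into one."""
    return a[0] + b[0], a[1] + b[1]


def parallel_mean(chunk_list: List[List[float]], workers: int = 4) -> float:
    """Chunks are independent, so they can be summarized concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(summarize, chunk_list))
    total, count = reduce(combine, partials, (0.0, 0))
    return total / count


def streaming_means(chunk_stream: Iterable[List[float]]) -> Iterator[float]:
    """Update the result as each chunk arrives and yield the mean so far."""
    state: Partial = (0.0, 0)
    for chunk in chunk_stream:
        state = combine(state, summarize(chunk))
        yield state[0] / state[1]


data = [[1.0, 2.0], [3.0, 4.0], [5.0]]
print(parallel_mean(data))          # prints 3.0
print(list(streaming_means(data)))  # prints [1.5, 2.5, 3.0]
```

The same `summarize` and `combine` functions serve both cases. What changes is only when the merging happens: all at once after the workers finish, or incrementally as data arrives.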

Revolution has developed RevoPemaR, a framework for writing parallel external memory algorithms in R that makes use of R reference classes.

I couldn't find Edlefsen's exact slides, but these decks, one on parallel external memory algorithms and another from useR! 2011 on Scalable data analysis in R, seem to cover everything he talked about.