Digithead's Lab Notebook: 12/01/2008

Sunday, December 28, 2008

Five deep questions in Computing

In the January 2008 issue of Communications of the ACM, Jeannette Wing of Carnegie Mellon University poses these questions:

P=NP?
What is computable?
What is intelligence?
What is information?
(How) can we build complex systems simply?

Saturday, December 27, 2008

Functional Programming

I've harbored a secret desire to learn Haskell for a few years now. Simon Peyton Jones is one of the key people behind Haskell. His web site at MSR has tons of papers, a tutorial on concurrent programming in Haskell, and a video lecture, A taste of Haskell. There's also a Simon Peyton Jones podcast at SE-Radio.

What is Haskell

Haskell is a programming language that is

purely functional

lazy

higher order

strongly typed

general purpose

Why should I care?

Functional programming will make you think differently about programming

Mainstream languages are all about state

Functional programming is all about values

Whether or not you drink the Haskell Kool-Aid, you'll be a better programmer in whatever language you regularly use
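
To make the state-versus-values point a little more concrete, here's a rough sketch in Python rather than Haskell. All the function names are made up for illustration, and Python generators only approximate Haskell's laziness, so take it as a taste of the style and nothing more.

```python
from itertools import count, islice


def sum_of_squares_stateful(numbers):
    """State-oriented: mutate an accumulator step by step."""
    total = 0
    for n in numbers:
        total += n * n
    return total


def sum_of_squares(numbers):
    """Value-oriented: one expression, no variable is ever reassigned."""
    return sum(n * n for n in numbers)


def compose(f, g):
    """Higher order: takes two functions and returns a new one, x -> f(g(x))."""
    return lambda x: f(g(x))


def squares():
    """Lazy (approximately): an endless stream, computed only on demand."""
    return (n * n for n in count(1))


increment_then_double = compose(lambda x: x * 2, lambda x: x + 1)
first_five_squares = list(islice(squares(), 5))
```

Both versions of the sum give the same answer; the difference is that the second one never changes anything after it's been created, which is the habit functional languages enforce.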

I should read a Haskell book or two, and, in related functional goodness, I keep reading how great Practical Common Lisp is. I also need to fulfill my quest to finish SICP. I've read the first three chapters twice, doing the examples once in Scheme and again in OCaml. I've read chapter 4 on interpreters. I need to work through the examples in that chapter and take in the fifth and final chapter.

Monday, December 22, 2008

Wordle

wordle

Sunday, December 21, 2008

Principle of least surprise my ass

Let's start off by saying I'm not anti-Ruby. I like Ruby. Ruby is cool. Matz is cool. But, a while back I was wondering, What is a Ruby code block? My feeble curiosity has been revealed for the half-assed dilettantery it is by Paul Cantrell. Mr. Cantrell chases this question down, grabs it by the scruff of the neck, and wrings it out like a bulldog with a new toy. He also rocks on the piano, by the way.

So in fact, there are no less than seven -- count 'em, SEVEN -- different closure-like constructs in Ruby:

block (implicitly passed, called with yield)

block (&b => f(&b) => yield)

block (&b => b.call)

Proc.new

proc

lambda

method

This is quite a dizzying array of syntactic options, with subtle semantic differences that are not at all obvious, and riddled with minor special cases. It's like a big bear trap for programmers who expect the language to just work. Why are things this way? Because Ruby is:

designed by implementation, and

defined by implementation.

Again, neither I nor Mr. P.C. is bashing Ruby. He shows how to pull off some tasty functional goodness like transparent lazy lists later in the article. Thanks to railspikes for the link.

Saturday, December 20, 2008

Random Introduction to Bioinformatics

I was asked recently to recommend some introductory reading material about bioinformatics, which made me realize I haven't read enough myself. Here's what I came up with plus a few additions.

An Introduction to Bioinformatics Algorithms, Jones and Pevzner: a fun and easy read, if you already have a CS background.
Genetics: From Genes to Genomes
Algorithms on Strings, Trees and Sequences, by Daniel Gusfield: Who doesn't love string algorithms?

If you're into stats, both of these are highly regarded, but miles over my head.

Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids by Durbin, Eddy, Krogh, and Mitchison
Statistical Methods in Bioinformatics: An Introduction by Warren J. Ewens and Gregory Grant

Papers

Uri Alon's Network motifs: theory and experimental approaches
Creating a bioinformatics nation by Lincoln Stein is entertaining reading and may let you know the kind of morass you're getting yourself into.
Integrating biological databases, also by Lincoln Stein
An overview of sequence comparison algorithms in molecular biology, Tech. Rep. TR-91-29, Dept. of Computer Science, Univ. of Arizona. Eugene Myers
SR Eddy, "What is Bayesian statistics?" Nat. Biotechnol., 22, #9 (2004) 1177-8.
SR Eddy, "What is a hidden Markov model?" Nat. Biotechnol., 22, #10 (2004) 1315-6.
Foundations for engineering biology, Drew Endy
Engineering life: Building a fab for biology, from Scientific American, 2006
Tangentially related and well worth reading is: Software Design Patterns for Information Visualization by Jeffrey Heer
Executable Cell Biology

Dr. Larry Ruzzo at UW teaches a Computational Biology course. Some of the links above are from his reading list, particularly the Sean Eddy Primer articles from Nature Biotechnology.

In winter of 2008, some UW CS grad students held a seminar course on data management issues in life sciences. In case that link doesn't stay up forever, here's some of the reading list:

Intro to Biology

Overview on biological data integration

Specific tools and techniques

Also on the subject of data: Dynamic Fusion of Web Data.

Books I wanna read

Finally, here are some books that I haven't read, will probably never get the time to read, but I wish I would read.

An Introduction to Systems Biology: Design Principles of Biological Circuits by Uri Alon
Systems Biology: Properties of Reconstructed Networks by Bernhard O. Palsson
Dynamic Models in Biology by Stephen P. Ellner, John Guckenheimer
Evolutionary Dynamics: Exploring the Equations of Life by Martin A. Nowak
System Modeling in Cellular Biology: From Concepts to Nuts and Bolts by Zoltan Szallasi
The Music of Life: Biology beyond the Genome by Denis Noble
Bioinformatics and Functional Genomics by Jonathan Pevsner

Friday, December 12, 2008

Educating Software Developers

Slashdot linked to Bjarne Stroustrup on Educating Software Developers, which follows up on an earlier article, The 'Anti-Java' Professor and the Jobless Programmers. The Anti-Java professor is Robert Dewar at NYU, who coauthored a short paper, Computer Science Education: Where Are the Software Engineers of Tomorrow? They contend that computer science curricula have been dumbed down to counter falling enrollment after the dot-com crash and partially blame Java, which fosters reliance on libraries and garbage collection. But not all of their critique can be written off as language bigotry. The result?

We are training easily replaceable professionals.

Dewar advocates:

mentoring
reading code
working in groups
learning to reuse code

Those sound like solid points to me. One thing the field of medicine really gets right is an emphasis on mentoring. Mentoring is the heart of residency, which, depending on specialty, can last from 3 to 7 years. By the time a physician graduates from residency, they will have performed hundreds of procedures and seen thousands of patients under the guidance of an attending physician. I've often wished there were more of this in the computing field.

Over the years, I've accumulated a list of topics I wish I'd been exposed to as a CS undergrad.

source control
command line foo
linking and dependency management
security
software design: interfaces, refactoring, patterns, components, APIs
software architecture: design with large components - app servers, databases, message queues, transactions, etc.
data modeling: schema design and not just relational DBs. Hierarchical (XML) and graph (OO, RDF) representation
models of computation
- imperative
- OO
- functional
- logical
- pipes and filters
- and how they're related.
scientific/engineering computing: MATLAB, R
probability and statistics, statistical computing, analytics
writing

Of course, then my undergraduate degree would have taken 7 years... On second thought, only my Dad would have complained.

What do you wish you'd learned in college? Post a comment!

Tuesday, December 09, 2008

Firefox and bioinformatics

I wondered who else might be working on bioinformatics-related extensions for Firefox besides Firegoose. One interesting project is iHOPerator, which builds on Greasemonkey. And, there's a hint of something to come here.

It seems like there was a flurry of interest around 2005, in the early days of AJAX and mash-ups, which produced biobar along with two now-dead projects - bioFox and NCBI Search Toolbar. Back in those days, Jon Udell asked, How do you design a remixable Web application? Nifty developments like the REST API in EMBL's STRING 8.0 are starting to provide answers.

Bioinformatics as a Queryable Knowledge Map: the Pygr Project

Pygr is a hypergraph database in Python with applications in bioinformatics, written by Christopher Lee, a faculty member at UCLA. There's a 30-minute video of a talk about Pygr and a bunch of other resources on the Lee Lab website and Lee's thinking bioinformatics blog.

Thesis: hypergraphs are a general model for bioinformatics, and Python's core models are already a good model of bioinformatics data:
Sequence: protein and nucleic acid sequences
Mapping / Graphs: alignment, annotation
Attributes: schema, i.e. relations between data
Namespace (import): the ontology of all bioinformatics data
Pygr aims to show that these Pythonic patterns are a general and scalable solution for bioinformatics.
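
To get a feel for what "Pythonic patterns" might mean here, this is a minimal sketch in plain Python. It does not use Pygr, and none of the names come from Pygr's API; the sequence, gene names, and annotation terms are all invented. It only shows how slicing, dict-of-dict mappings, and attribute access line up with the list above.

```python
# Plain Python only; none of these names are taken from Pygr itself.

# Sequence: a string already supports slicing, so an interval is a slice.
genome = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"  # made-up sequence


class Feature:
    """Attributes: relations between data are reached by attribute access."""

    def __init__(self, name, start, end):
        self.name = name
        self.start = start
        self.end = end

    def sequence(self, parent):
        # The feature's interval is just a slice of its parent sequence.
        return parent[self.start:self.end]


gene_a = Feature("gene_a", 0, 24)
gene_b = Feature("gene_b", 24, 39)

# Mapping / graph: a dict of dicts maps a node to its neighbours, with
# edge information stored as the inner value.
# Here: gene -> annotation term -> evidence.
annotations = {
    "gene_a": {"kinase": {"evidence": "predicted"}},
    "gene_b": {"kinase": {"evidence": "curated"},
               "membrane": {"evidence": "predicted"}},
}


def genes_annotated_with(graph, term):
    """Follow edges backwards: which source nodes point at `term`?"""
    return sorted(src for src, targets in graph.items() if term in targets)
```

For example, `gene_a.sequence(genome)` is the same thing as `genome[0:24]`, and `genes_annotated_with(annotations, "membrane")` picks out `gene_b`. A real hypergraph would let an edge connect more than two nodes, which this toy dict doesn't attempt.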

The general idea is not entirely different from the data types behind Gaggle, especially in the emphasis on basic data structures without a heavy semantic component.

Dr. Lee is also writing a textbook on probabilistic inference.

Saturday, December 06, 2008

Dynamic Fusion of Web Data

I happened across a very cool project on web data integration at the University of Leipzig. Their paper Dynamic Fusion of Web Data is worth a look. They're working towards a theory of on-the-fly data integration for mashup applications that they refer to as dynamic data fusion. Data integration in mashups is dynamic in that it occurs at runtime. This provides for a pay-as-you-go model, rather than a large up-front semantic mapping task that limits the scalability of traditional data integration methods like data warehouses.

They describe mashups as workflow-like. Do they mean mashups are programmatic as opposed to declarative? In place of SQL, this group's iFuice system uses a scripting language with "set operations (e.g., union, intersection, and difference) and data transformation (e.g., fuse, aggregate) which can be used to post-process query results". Other key features are instance-level mapping and accommodation of structured and unstructured data.
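
Here's a rough sketch of what scripting with set operations and instance-level mappings could look like, written in Python. This is not iFuice's scripting language: the `fuse` and `aggregate` functions below are my own guesses at the general idea, and every identifier and value is invented.

```python
# Hypothetical query results from two sources: sets of instance identifiers.
result_a = {"p1", "p2", "p3"}
result_b = {"p2", "p3", "p4"}

in_either = result_a | result_b      # union
in_both = result_a & result_b        # intersection
only_in_a = result_a - result_b      # difference

# Instance-level mapping: correspondences between individual objects in two
# sources, rather than between their schemas. All identifiers are invented.
same_as = {"p1": "x7", "p2": "x9"}

source_a = {"p1": {"title": "Paper one", "year": 2007},
            "p2": {"title": "Paper two", "year": 2008}}
source_x = {"x7": {"citations": 12},
            "x9": {"citations": 30, "year": 2009}}


def fuse(mapping, left, right):
    """Merge the attributes of corresponding instances; left wins on conflict."""
    return {l_id: {**right[r_id], **left[l_id]}
            for l_id, r_id in mapping.items()
            if l_id in left and r_id in right}


def aggregate(records, attribute, fn):
    """Reduce one attribute across all records that have it."""
    return fn(rec[attribute] for rec in records.values() if attribute in rec)


fused = fuse(same_as, source_a, source_x)
total_citations = aggregate(fused, "citations", sum)
```

The point of the sketch is that nothing above needed a global schema: the mapping is just pairs of instances, and conflicts (like the two different years for the second paper) get resolved at the moment of fusing.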

This definitely gets at what Firegoose is good for - using the web as a channel for structured data - an approach that does for data integration what loose coupling does for software. Firegoose, part of the Gaggle framework, is a toolbar for Firefox that allows data to be exchanged between desktop software and the web. Firegoose can read microformats, call web services, query databases, or even perform nasty dirty screen scraping. Unlike a mashup, data integration in Firegoose and Gaggle requires user participation, although the user never deals with schemas, only instances of the Gaggle data types - mainly lists of identifiers, matrices of numeric data, networks, and tuples. The identifiers serve in a role somewhat analogous to primary keys.
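
As a loose illustration of identifiers acting like primary keys across those data types, here's a sketch in Python. These classes are hypothetical stand-ins, not the actual Gaggle data types or their API, and the gene names and numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class Matrix:
    row_names: list      # identifiers, one per row
    column_names: list   # e.g. experimental conditions
    values: list         # list of rows of numbers

    def select_rows(self, identifiers):
        """Keep only the rows whose identifier appears in `identifiers`."""
        wanted = set(identifiers)
        kept = [(name, row) for name, row in zip(self.row_names, self.values)
                if name in wanted]
        return Matrix([name for name, _ in kept],
                      self.column_names,
                      [row for _, row in kept])


@dataclass
class Network:
    edges: list  # (source identifier, interaction type, target identifier)

    def neighbors(self, identifiers):
        """Identifiers directly connected to any of the given ones."""
        seeds = set(identifiers)
        found = set()
        for source, _, target in self.edges:
            if source in seeds:
                found.add(target)
            if target in seeds:
                found.add(source)
        return sorted(found - seeds)


# All identifiers and numbers below are invented.
expression = Matrix(["geneA", "geneB", "geneC"], ["t0", "t1"],
                    [[0.1, 0.9], [1.2, 1.1], [-0.4, 0.3]])
interactions = Network([("geneA", "binds", "geneC"),
                        ("geneB", "regulates", "geneD")])

name_list = ["geneA", "geneC"]              # a list of identifiers
subset = expression.select_rows(name_list)  # same keys, different data type
partners = interactions.neighbors(["geneB"])
```

The same list of names selects rows from a matrix and seeds a neighbor lookup in a network, with no schema mapping in between. That's the join-on-identifier idea in miniature.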

More papers in a similar vein

Tuesday, December 02, 2008

Browsing genomes

I may as well come clean and admit that I'm developing a genome browser. What? Another genome browser? Why? You may well ask these questions. Well, it's a long story. But here is a completely non-exhaustive list of existing genome browsers.

The classics: UCSC Genome Browser Home and paper and its microbial sibling.
Argo a Java rich-client genome browser built at the Broad Institute.
Integrative Genomics Viewer also from Broad (see press release).
Affymetrix spun out its Integrated Genome Browser into an open source project, along with a library of re-usable components called GenoViz.
GBrowse, Lincoln Stein's Perl-based genome browser, followed by the AJAXy javascripty JBrowse (with a paper, nicely written as usual, from Lincoln Stein).
x:map an AJAX genome browser based on the Google Maps API.
The biology of extremophiles lab at the University of Paris-Sud has a nice little web-based browser for Sulfolobus.
NCBI has a new AJAX tool called Sequence Viewer
Flash based OmicBrowse is apparently big in Japan.
MochiView is really nice. Read the paper in BMC Biology.
GenomeGraphs: integrated genomic data visualization with R
GenomeView
Visualization guru Ben Fry (of Processing fame) wrote at least two: cd36 browser and a handheld genome browser
A guy who calls himself Saaien Tist implemented a circular genome browser in ruby-processing
Back in 2002, some Canucks built BioViz, an SVG-based genome browser
The Savant Genome Browser is a desktop visualization tool for genomic data. It was primarily developed for visualizing high throughput (aka next generation) sequencing data... Savant comes out of the Computational Biology Lab at the University of Toronto - also home of Cytoscape Web.

Note: updated in Sept. 2009 to reflect the fact that everyone and their uncle built a genome browser in the past couple of years. See Brother, can you spare a genome browser?

Note: updated again in May of 2010 and again in Feb 2011 to add Savant.

Monday, December 01, 2008

UCSC Genome Browser

A while back, I wrote a little hack to download and parse genome data from NCBI, but was flummoxed by NCBI's format for eukaryotes. A couple of local bioinformatics gurus directed me to UCSC as an alternate data source. UCSC's Genome Browser provides a nice interface to its underlying data through a Table Browser. The main genome browser has data for eukaryotes, while archaea (and other prokaryotes) are in a separate project. The Table Browser for the archaeal genome browser is a little tricky to find, but it's there.
