Wednesday, August 26, 2009

Using R and Bioconductor for sequence analysis

Here's another quick R vignette, in case I pick this up later and need to remind myself where I got stuck. I was trying to use R for a bit of basic sequence analysis, with mixed results.

First, install the BSgenome package, which is part of Bioconductor. Get GeneR while you're at it.

> source("http://bioconductor.org/biocLite.R")
> biocLite("BSgenome")
> biocLite("GeneR")

Follow the instructions in the document "How to forge a BSgenome data package". You'll need to get FASTA files from somewhere such as NCBI's Entrez Genome. Another nice data source is Regulatory Sequence Analysis Tools.

I created a BSgenome package for our favorite model organism Halobacterium salinarum NRC-1, which I named halo for short. Now, I can ask what sequences make up the halo genome and find out how long they are.

> library(BSgenome.halo.NCBI.1)
> seqnames(halo)
[1] "chr"     "pNRC200" "pNRC100"
> seqlengths(halo)
    chr pNRC200 pNRC100 
2014239  365425  191346
> length(halo$chr)
[1] 2014239

There are a few things I wanted to do next. First, I wanted to load a list of genes with their coordinates. That should allow me to quickly get the sequence for each gene, or get the sequence of upstream regions for regulatory motif finding. Second, if I'm going to find any new protein coding regions, I'd like to have a function that could take a stretch of DNA and find ORFs (open reading frames). As far as I can tell, all there is to ORF finding is searching each reading frame for long stretches that start with a methionine codon (ATG in the DNA, AUG in the mRNA) and end with a stop codon (TAG, TGA, or TAA in the DNA; UAG, UGA, or UAA in the mRNA). Maybe there's more to it than that.
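Here is a minimal sketch of both ideas in base R, working on plain character strings rather than Bioconductor sequence objects. The function names (revcomp, gene_seq, find_orfs) and the min_codons parameter are my own inventions for illustration, not part of BSgenome or GeneR. It follows the naive rule described above: within each reading frame, open an ORF at the first ATG and close it at the next in-frame stop codon. It ignores alternative start codons and only scans the strand it is given. To scan the other strand, pass in revcomp(dna), keeping in mind that the coordinates then refer to the reverse complement.

```r
# Reverse complement of a DNA string (expects upper-case A/C/G/T).
revcomp <- function(dna) {
  comp <- chartr("ACGT", "TGCA", dna)
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

# Sequence of a gene given 1-based, inclusive coordinates and a strand.
gene_seq <- function(dna, start, end, strand = "+") {
  s <- substr(dna, start, end)
  if (strand == "-") revcomp(s) else s
}

# Naive ORF finder for one strand: in each of the three frames, an ORF
# runs from the first ATG seen up to and including the next in-frame stop.
# min_codons counts the start and stop codons too.
find_orfs <- function(dna, min_codons = 3) {
  dna <- toupper(dna)
  stops <- c("TAG", "TGA", "TAA")
  hits <- list()
  for (frame in 1:3) {
    n_codons <- (nchar(dna) - frame + 1) %/% 3
    if (n_codons < 1) next
    starts <- seq(frame, by = 3, length.out = n_codons)
    codons <- substring(dna, starts, starts + 2)
    open_at <- NA  # index of the codon that opened the current ORF, if any
    for (i in seq_along(codons)) {
      if (is.na(open_at) && codons[i] == "ATG") {
        open_at <- i
      } else if (!is.na(open_at) && codons[i] %in% stops) {
        if (i - open_at + 1 >= min_codons) {
          hits[[length(hits) + 1]] <- data.frame(
            frame = frame,
            start = starts[open_at],
            end = starts[i] + 2
          )
        }
        open_at <- NA
      }
    }
  }
  if (length(hits) == 0) {
    return(data.frame(frame = integer(0), start = integer(0), end = integer(0)))
  }
  do.call(rbind, hits)
}

# Toy example: a three-codon ORF (ATG AAA TAG) sitting in frame 3.
dna <- "CCATGAAATAGGG"
orfs <- find_orfs(dna)
print(orfs)
print(gene_seq(dna, orfs$start[1], orfs$end[1]))
```

For a real genome, min_codons would need to be much larger (a hundred or more) to filter out the many short chance ORFs. A chromosome held as a Biostrings object would presumably have to be converted with as.character() first, which is an assumption I haven't tested against the halo package.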

This is where I left off. GeneR seems to use an entirely different way of encoding sequence based on buffers. I have to admit to being a little disappointed. I hope it's just my cluelessness and there's really a reasonable way to do this kind of thing in R and Bioconductor.

Related stuff from Blue Collar Bioinformatics

Monday, August 10, 2009

Autocompletion and Swing

I can remember being asked to implement cross-browser autocompletion in 2000 and telling my employers that it couldn't be done. We got a prototype working on Netscape (remember when that was the most advanced browser?) but it was buggy on Internet Explorer (remember when IE was the bane of every web developer's existence? Wait, some things never change...) and latency was way too high for most users. Anyway, I didn't last long in that gig, but I still think that for practical purposes at the time I was right. Of course, things are different now.

Type something into Google or Amazon's search box and you'll get a nice drop-down list of possible completions. For a biological example, check out NCBI's BLAST. Make sure the database chooser reads "Nucleotide collection" and start typing "Pyrococcus". Nice, huh?

Several ways to give poor abandoned Swing an autocompleting upgrade are documented in a java.net article. Sadly, they all seem to suffer from one deficiency or another.

The JIDE common layer, an open source library that spun out of Jide's commercial offerings, seems to be the most stable, but isn't nearly as convenient as the JavaScript versions. GlazedLists does a nice job, but it's currently (still) broken on OS X. Out of the solutions I found, GlazedLists seems most promising, especially if that bug gets fixed.

I also checked out the Substance Java look & feel. It looks really sharp. The developer has done some really slick transitions -- highlights that fade in and out or components that expand like icons on OS X's dock. It's probably great on Windows, but unfortunately, it didn't seem very stable on the Mac.

There should be some good lessons to be learned from the failure of Swing. It seems apparent that several developers who are a lot smarter than me have tried to get Swing to cough up a decent UI. The results seem to be consistently limited. Not that some aren't impressive; they are. But the limits to the success of some very good developers speak very loudly.

More attempts...

  • AutoCompleteCombo by Exterminator13 (haha) -
    Exception in thread "AWT-EventQueue-0" java.lang.ArrayIndexOutOfBoundsException: -1
     at java.util.ArrayList.get(ArrayList.java:323)
     at dzone.AutoCompleteCombo$Model.getElementAt(AutoCompleteCombo.java:476)
  • An autocomplete popup by Pierre Le Lannic, which seems to work as long as you don't care about upper case.
  • Java2sAutoTextField from Sun

Friday, August 07, 2009

Learning biological networks

In a paper titled Learning biological networks: from modules to dynamics, Richard Bonneau explains why network inference is tractable in biological systems, in spite of the combinatorial nature of the problem.

  • Biological networks are neither random nor designed by a known process, and therefore have yet-to-be-determined design principles. Nature does provide several clues, however, via considerations of evolution.
  • Biological systems are inherently modular [...] and taking advantage of modularity is key to success in learning biological networks from data.
  • Biological systems are robust and often have reproducible responses to their environment that enable replicate measurement.
  • There is a lot known about the likely layout of biological networks. Several network motifs are found to be over-represented in the best characterized regulatory networks. We also know that regulatory networks are likely to be sparse (for example, most transcription factors don’t regulate most genes).
  • Time-lagged correlation metrics can be used to discover regulatory relationships from microarray data.
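The time-lagged correlation idea in the last bullet can be sketched in a few lines of base R. This is only a generic illustration, not the method from the paper: the function names (lagged_cor, best_lag), the max_lag parameter, and the toy expression values are all made up. The idea is to correlate a regulator's expression at time t with a target's expression at time t + lag, and see which lag gives the strongest relationship.

```r
# Pearson correlation between regulator[t] and target[t + lag],
# for two equally spaced time series of the same length.
lagged_cor <- function(regulator, target, lag = 1) {
  n <- length(regulator)
  stopifnot(length(target) == n, lag >= 0, lag < n - 1)
  cor(regulator[1:(n - lag)], target[(1 + lag):n])
}

# Try lags 0..max_lag and return the one with the largest |r|.
best_lag <- function(regulator, target, max_lag = 3) {
  lags <- 0:max_lag
  r <- sapply(lags, function(k) lagged_cor(regulator, target, k))
  data.frame(lag = lags, r = r)[which.max(abs(r)), ]
}

# Toy data: the target follows the regulator one time step later.
tf <- c(1, 3, 2, 5, 4, 6, 8, 7)
target <- c(0, tf[1:7]) * 2 + 1
print(best_lag(tf, target))
```

With real microarray time courses the series are short and noisy, so a high lagged correlation is a hint about a possible regulatory relationship rather than evidence of one.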

Milo, R. et al. Network Motifs: Simple Building Blocks of Complex Networks. Science 298, 824–827 (2002). (from Uri Alon's group)

Flaherty, P., Jordan, M.I. & Arkin, A. Robust design of biological experiments. Proc. Neural Inf. Process. Symp. 18, 363–370 (2005).

Fisher, R.A. Statistical Methods, Experimental Design and Scientific Inference (Oxford University Press, Oxford, 1935).