Wednesday, February 25, 2009

Ruby Docs

Ruby cheat sheet. Quick links to Ruby documentation on the web.

ruby-lang.org

Ruby-doc

List operations

Test membership in an Array

Does the array contain the element?

my_array.include?('element')

Ruby on Rails

Ruby QuickRef

Pickaxe book

21 Ruby Tricks You Should Be Using In Your Own Code

Also see Ruby quirks and What is a Ruby code block?

Rubular - regex tester

Sunday, February 22, 2009

REST (Representational State Transfer)

Not too long ago, I used to think REST (Representational state transfer) was a fancy name for sending XML over HTTP and using URLs as an API. I was peripherally aware of the theological debates surrounding REST, but I didn't pay much attention.

In SE-Radio's Episode 98: Stefan Tilkov on REST, Mr. Tilkov describes very clearly what REST is and why it's important.

REST defines a set of proven principles of software architecture for highly-scalable, loosely-coupled, distributed systems.

REST principles:

  1. Identifiers: Every resource has an identifier. (URL) One thing they don't discuss is who's responsible for resolving the identifiers.
  2. Links: One resource can link to another by its identifier. (hyperlinks)
  3. Uniform interface: Every resource implements a common set of methods. (HTTP GET, POST, PUT, DELETE) (~=CRUD)
  4. Representations: you interact with a resource via a representation; a resource can have more than one representation
  5. Statelessness: there is no client specific session state. Move session state into resource state.

Like any good religion, REST has its high priests and scriptures:

ReST vs. WS-*

The contrast is actually kinda interesting. ReST derives from an examination of what makes the web work. Since the web is arguably the most successful distributed computing system of all time, maybe it's worth while taking a look at the architectural principles that evolved along with it. In many ways, web services are the successor of CORBA and DCOM, which modeled computing as distributed objects. Many argue that distributed objects were a failure. Certainly, both CORBA & J2EE evolved away from sharing objects towards exposing facades. (For one reason, because the requirement that cooperating systems share an object model was unworkable in many situations -- application integration for example.)

The podcast pointed out something that was never clear to me before, and I think is a fairly important distinction between the architectures implied by WS and REST. In WS, you assign an identifier to an endpoint. The endpoint accepts requests. Each app has its own API described in a WSDL (or CORBA's IDL) that specifies what requests it accepts. In contrast, any resource in a REST application has an identifier (its URL) and implements the methods of the uniform interface (GET, PUT, POST, and DELETE in HTTP). While each WS app has whatever methods it wants, REST limits the number of methods (or verbs). This allows generic clients (web browsers, curl) and promotes loose coupling. With the limitation on verbs, more prominence goes to the nouns - the addressable resources. Session state, for example, becomes resource state. If application-specific methods are needed, the four methods of HTTP can be seen as a base interface from which application-specific interfaces can inherit.

For my money, I think HTTP's methods probably could have been more clearly named. It's not obvious from their names what the difference in the semantics of PUT and POST should be. The HTTP methods almost map to the familiar CRUD (create, read, update, and delete) but not exactly. For example, PUT is an idempotent operation that inserts or replaces a resource at a particular URL. I'm not convinced the idempotency thing buys you enough to be worth obfuscating plain old CRUD.
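To make the PUT versus POST distinction concrete, here is a toy sketch in Ruby. It is not an HTTP server: ToyStore and its methods are made-up names that only mimic the semantics described above, with a hash standing in for the resources behind each URL.

```ruby
# A toy in-memory "server": URLs map to resource state.
# ToyStore is a hypothetical class for illustration only.
class ToyStore
  attr_reader :resources

  def initialize
    @resources = {}
    @next_id = 1
  end

  # PUT: create or replace the resource at a URL the client names.
  # Repeating the same PUT leaves the store in the same state (idempotent).
  def put(url, body)
    @resources[url] = body
    url
  end

  # POST: add to a collection; the server picks the new URL.
  # Repeating the same POST creates another resource (not idempotent).
  def post(collection_url, body)
    url = "#{collection_url}/#{@next_id}"
    @next_id += 1
    @resources[url] = body
    url
  end

  def get(url)
    @resources[url]
  end

  def delete(url)
    @resources.delete(url)
  end
end

store = ToyStore.new
2.times { store.put("/orders/42", "two widgets") }
2.times { store.post("/orders", "one gadget") }
puts store.resources.size   # 3
```

Two identical PUTs leave one resource behind, while two identical POSTs leave two. That is roughly what idempotency buys you: a client that never got a response can safely retry a PUT.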

As an aside, Mr. Tilkov mentions the MEST architectural style which seems to be an attempt to meld message-based architecture with REST. In MEST, all the meaning is in the message. The uniform interface becomes process(message).

They also discuss a few more nice aspects of REST:

  • sensible (human readable, hackable) URLs
  • standard response codes
  • collections of resources are also resources, an instance of the composite pattern. For example, you might POST an order to a collection of orders.

Stefan makes an interesting point when asked about the utility of using web protocols within a company. His claim is that organizations internally increasingly resemble the internet. Organizations need loosely coupled applications to promote business flexibility.

So, OK, I guess I'm converted to a RESTafarian. At a gut level, I was never a fan of web services. They always seemed to make things harder than they should be. Maybe I'm not completely converted, 'cause I think some of the criticism about what is and isn't truly RESTful gets a little silly. I like that Stefan takes the pragmatic stance that engineers can adopt some or all of the principles as they see fit.

More Information:

Monday, February 09, 2009

Ruby Love

I've dissed Ruby here, here and here. But, I really like Ruby, so it's time to show Ruby some love. Here's some little snippets that I came up with that I think are slicker than shit.

You could hardly ask for a more concise yet readable way to take the reverse complement of a piece of DNA sequence.

# given a DNA sequence in a string return its reverse complement
def reverse_complement(seq)
  # tr (not tr!) is used because tr! returns nil when nothing is replaced
  return seq.reverse().tr('ATCGatcg','TAGCtagc')
end

Next, I wanted to parse out a pair of coordinates of the form "1600481-1600540". The first number is the start position and the second is the end. The interval is inclusive and one-based. I want the coordinates converted to zero-based integers.

s,e = coords.split("-").collect {|x| x.to_i - 1}

On second thought, why doesn't Ruby's Range class have a from_string method? Oops, another quirk.
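In the meantime, a small helper can stand in for the missing method. This is only a sketch: range_from_string is a made-up name, not part of Ruby's Range class, and it assumes the one-based, inclusive "start-end" format described above.

```ruby
# Hypothetical helper: parse "start-end" (one-based, inclusive)
# into a zero-based, inclusive Ruby Range.
def range_from_string(coords)
  s, e = coords.split("-").collect { |x| Integer(x) - 1 }
  (s..e)
end

r = range_from_string("1600481-1600540")
puts r.first   # 1600480
puts r.last    # 1600539
puts r.count   # 60
```

Using Integer() instead of to_i means garbage input raises an ArgumentError rather than quietly turning into -1.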

Ruby Quirks

Ruby is a fun language, but here are a few things that tripped up this n00b whilst stumbling along the learning curve.

Why does string[n] return an integer character code instead of the character? (That's the Ruby 1.8 behavior shown below; from 1.9 on, string[n] returns a one-character string.) Well, Ruby has no such thing as a character type, so then why not a string of length 1?

>> a = "asdf"
=> "asdf"
>> a[0]
=> 97

Either of these works, although they look a little funny:

>> a[0...1]
=> "a"
>> a[0..0]
=> "a"

Also, a[0].chr works. The integer.chr method is described as follows:

int.chr => string
Returns a string containing the ASCII character represented by the receiver's value.

It's non-obvious how to iterate through the characters of a string. The string.each_char is listed in the Ruby core 1.8.6 ruby-docs, but, confusingly, you have to require "jcode" for it to work. Maybe I'm just confused about whether core means "loaded by default" or "included in the Ruby distribution".
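For what it's worth, on Ruby 1.8.7 and later each_char works without requiring jcode. A minimal sketch, where char_list is just an illustrative helper name:

```ruby
# Assumes Ruby 1.8.7 or later, where String#each_char needs no extra require.
def char_list(str)
  list = []
  str.each_char { |c| list << c }   # each c is a one-character string
  list
end

puts char_list("asdf").inspect   # ["a", "s", "d", "f"]
```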

Two toString methods?

In place of object.toString() Ruby has two methods: to_s and inspect. When coercion to a string is required, to_s is called. Docs for inspect say this:

Returns a string containing a human-readable representation of obj. If not overridden, uses the to_s method to generate the string.
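A quick sketch of the two methods side by side. The Gene class and its values are made up for illustration, and both methods are defined explicitly here so the example doesn't lean on the default behavior of inspect.

```ruby
# Hypothetical class: to_s for display, inspect for debugging.
class Gene
  def initialize(name, start, stop)
    @name, @start, @stop = name, start, stop
  end

  # Called by string interpolation and puts.
  def to_s
    @name
  end

  # Called by p and by irb when it echoes a value.
  def inspect
    "#<Gene #{@name} #{@start}..#{@stop}>"
  end
end

g = Gene.new("lacZ", 100, 200)
puts "#{g}"      # lacZ
puts g.inspect   # #<Gene lacZ 100..200>
```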

If, then, else, elsif

If statements are confusing...

if x==123 then
   puts 'wazoo'
end

# then is optional, as long as you have the line break
if x==123
   puts 'wazoo'
end

For one-liners, then is required. Or a colon, if you prefer (the colon form works in Ruby 1.8 only; it was removed in 1.9).

if x==123 then puts 'wazoo' end
if x==123 : puts 'wazoo' end

Curly braces seem not to work at all for if statements. For more curly brace related philosophy and WTFs see this issue. DON'T DO THIS:

if x==123 { puts 'qwer' }

Finally, would someone tell all these languages that crawl out of the primordial Bourne-shell ooze that neither elif nor elsif means jack shit in the English language?!?!?! (sputter... rant... fume...)

if x==123
  puts 'wazoo'
elsif x==456
  puts 'flapdoodle'
else
  puts 'foo'
end

An if statement is an expression and returns a value, but Ruby also offers the good old ternary operator.
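Both forms, sketched with the same silly values as above (describe and describe_short are illustrative names):

```ruby
def describe(x)
  # The value of the whole if expression is the value of the branch that ran.
  label = if x == 123
    'wazoo'
  elsif x == 456
    'flapdoodle'
  else
    'foo'
  end
  label
end

def describe_short(x)
  # The ternary operator handles the simple two-way case.
  x == 123 ? 'wazoo' : 'foo'
end

puts describe(456)        # flapdoodle
puts describe_short(123)  # wazoo
```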

Require vs. Load

There are two ways to import code in Ruby, require and load. See the ruby docs for Kernel (require and load).
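The practical difference is that load runs the file every time it's called, while require runs a given file only once. Here's a small sketch that demonstrates it with a throwaway file; the file name and the $demo_count global are made up for the demo.

```ruby
require 'tmpdir'

# Writes a tiny script that bumps a global counter, then loads it twice
# and requires it twice. Returns [count after loads, count after requires].
def load_vs_require_demo
  $demo_count = 0
  Dir.mktmpdir do |dir|
    path = File.join(dir, "counter_demo.rb")
    File.open(path, "w") { |f| f.puts "$demo_count += 1" }
    2.times { load path }      # runs the file both times
    after_load = $demo_count
    2.times { require path }   # runs the file only the first time
    [after_load, $demo_count]
  end
end

p load_vs_require_demo   # [2, 3]
```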

Defined? and nil?

Ruby has nil instead of null. Ok, and unlike Java's null, nil is a real object. I appreciate the difference between nil and undefined, but I wouldn't have guessed that defined? nil would return "nil". Not to be confused with a truly undefined variable, in which case defined? asdf returns nil. The pickaxe book explains the other strange return values of defined?. Then, there's nil?.

>> asdf.nil?
NameError: undefined local variable or method `asdf' for main:Object
 from (irb):47
 from :0
>> asdf = nil
=> nil
>> asdf.nil?
=> true
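A few of the return values of defined? collected in one place. This is a sketch run against a modern Ruby; defined_examples is just an illustrative wrapper, and the result for a literal nil is left out because it differs between Ruby versions.

```ruby
def defined_examples
  x = nil
  [
    defined?(no_such_name),   # nil: nothing by that name exists
    defined?(x),              # "local-variable", even though x holds nil
    defined?(puts),           # "method"
    defined?(String)          # "constant"
  ]
end

p defined_examples   # [nil, "local-variable", "method", "constant"]
```

Note that defined? is safe to use on a name that doesn't exist, whereas calling nil? on it raises the NameError shown above.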

Command line arguments

Just the array ARGV. Not a quirk. Good.

Return

Return behaves oddly; to exit a script you use Kernel::exit(integer). Trying to return 1 instead causes a LocalJumpError whatever that means?? Trying to return from a code block returns from the surrounding context. That hurts my head.
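The block behavior is easier to see in a small example. In this sketch (first_negative is a made-up method), the return inside the block exits the whole method, not just the current iteration:

```ruby
def first_negative(numbers)
  numbers.each do |n|
    # This return exits first_negative itself, not just the block.
    return n if n < 0
  end
  nil   # reached only if no element was negative
end

puts first_negative([3, -7, 5, -1])   # -7
p first_negative([1, 2, 3])           # nil
```

Once you get used to it, this is handy for bailing out of an iteration early.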

Ruby Exception Handling

Ruby's equivalent of try-catch-finally is begin-rescue-ensure-end.
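A minimal sketch of the pattern, using a made-up safe_divide method and a log array so the order of events is visible:

```ruby
def safe_divide(a, b, log = [])
  begin
    result = a / b
    log << "divided"
    result
  rescue ZeroDivisionError
    # Like catch: runs only when the named exception is raised.
    log << "rescued"
    nil
  ensure
    # Like finally: runs either way, but its value is not returned.
    log << "cleanup"
  end
end

log = []
p safe_divide(6, 3, log)   # 2
p safe_divide(1, 0, log)   # nil
p log                      # ["divided", "cleanup", "rescued", "cleanup"]
```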

No Boolean

There's no Boolean class in Ruby. Instead, there's TrueClass and FalseClass. So, what type of value does the logical expression p ^ q produce? Everything has an implicit boolean value, which is true for everything except false and nil.
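To answer the question: the result is simply true or false, so an instance of TrueClass or FalseClass. A short sketch, where truthy? is an illustrative helper that exposes the implicit boolean value of anything:

```ruby
def truthy?(value)
  value ? true : false
end

p((true ^ false).class)   # TrueClass
p((true ^ true).class)    # FalseClass
p truthy?(0)              # true
p truthy?("")             # true
p truthy?(nil)            # false
p truthy?(false)          # false
```

Zero and the empty string being true is worth remembering if you're coming from C, Perl or Python.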

List operations

(see also Array and Enumerable)

More

Monday, February 02, 2009

Spelunking in the UCSC Genome Browser

The UCSC genome browser is an established workhorse of bioinformatics led by Jim Kent and David Haussler. The software is an open source C and MySQL web-app that generates gifs to display genome annotation in tracks plotted against genomic coordinates. The main UCSC genome browser covers eukaryotic model organisms, while its twin, The UCSC Archaeal Genome Browser covers archaea and, despite its name, several bacteria as well.

Aside from the visualization, the genome browser is also a great source for curated genomic data, which is made available through a convenient mechanism called the Table Browser. Much of this data takes the form of a tuple containing genomic coordinates and some other data, which together are traditionally known as a feature.

(sequence, strand, start, end, ...)

Examples of features might be a gene, a PFAM domain, or a measurement from a microarray or mass spectrometry.

I happen to have a motive for wanting to grab some of this data and use it for my own nefarious purposes. So, here's a little documentation on how to go about doing that.

They allow public access to their MySQL database, so let's explore that first. Each assembly gets its own database, for example hg18 for the March 2006 assembly of the human genome. Application-scope data is in a database called hgcentral, which includes a table dbDb, a database of databases. Most of this information is also in the FAQ entry on releases, but since we can get to the tables, we may as well have a look.

mysql> use hgcentral;
mysql> describe dbDb;
+----------------+--------------+------+-----+---------+-------+
| Field          | Type         | Null | Key | Default | Extra |
+----------------+--------------+------+-----+---------+-------+
| name           | varchar(32)  | NO   |     |         |       | 
| description    | varchar(255) | NO   |     |         |       | 
| nibPath        | varchar(255) | NO   |     |         |       | 
| organism       | varchar(255) | NO   |     |         |       | 
| defaultPos     | varchar(255) | NO   |     |         |       | 
| active         | int(1)       | NO   |     | 0       |       | 
| orderKey       | int(11)      | NO   |     | 1000000 |       | 
| genome         | varchar(255) | NO   |     |         |       | 
| scientificName | varchar(255) | NO   |     |         |       | 
| htmlPath       | varchar(255) | NO   |     |         |       | 
| hgNearOk       | tinyint(4)   | NO   |     | 0       |       | 
| hgPbOk         | tinyint(4)   | NO   |     | 0       |       | 
| sourceName     | varchar(255) | NO   |     |         |       | 
| taxId          | int(11)      | NO   |     | 0       |       | 
+----------------+--------------+------+-----+---------+-------+

The clade table partitions the organisms into clades, so a categorized list of currently active genomes can be generated like this:

     select
       d.name, d.description, d.genome, d.scientificName,
       d.taxId, gc.clade
     from dbDb as d join genomeClade as gc
       on d.genome=gc.genome
     where d.active > 0
     order by clade, scientificName;

There's also a table called dbDbArch, which I was hoping stood for archaea. Sadly, no luck. It looks to mean archived, instead. I wasn't able to find an open MySQL DB for the archaeal genome browser (hints, anyone?), but they list the available organisms right on the home page, so no worries. Scraping that and applying some regexes will get you at least a table of organisms and database names, which is all you really need.

With that, knowing an organism, we can figure out which database to ransack. The first thing we'll want to know is the configuration of the organism's genome. How many chromosomes (or plasmids, etc.) are there and what are their sizes? An HTTP request for the chromInfo table will do that for us. Both GET and POST seem to work.

http://genome.ucsc.edu/cgi-bin/hgTables?db=hg18&hgta_group=allTables&hgta_track=hg18&hgta_table=chromInfo&hgta_regionType=genome&hgta_outputType=primaryTable&hgta_doTopSubmit=

Try it. Now, let's break that down.

http://genome.ucsc.edu/cgi-bin/hgTables?
db=hg18
hgta_group=allTables
hgta_track=hg18
hgta_table=chromInfo
hgta_regionType=genome
hgta_outputType=primaryTable
hgta_doTopSubmit=
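The pieces above can be put back together programmatically. Here's a sketch in Ruby that rebuilds the same URL from its parameters; table_browser_url is a made-up helper name, and it only constructs the string without making any request.

```ruby
require 'uri'

# Hypothetical helper: assemble a Table Browser URL from the parameters
# listed above. The host defaults to the main genome browser.
def table_browser_url(db, group, track, table, host = "genome.ucsc.edu")
  params = [
    ["db", db],
    ["hgta_group", group],
    ["hgta_track", track],
    ["hgta_table", table],
    ["hgta_regionType", "genome"],
    ["hgta_outputType", "primaryTable"],
    ["hgta_doTopSubmit", ""]
  ]
  "http://#{host}/cgi-bin/hgTables?" + URI.encode_www_form(params)
end

puts table_browser_url("hg18", "allTables", "hg18", "chromInfo")
```

An array of pairs is used rather than a hash so the parameters come out in the same order as in the original URL.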

I'm not sure whether all fields shown here are necessary, but this seems to do the trick. Well, one issue is that along with the expected chromosomes 1 through 22, plus X, Y and M for mitochondrial, we get chr1_random, chr6_cox_hap1 and a bunch of weird things like that. What are these things?
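Whatever they turn out to be, they're easy to set aside once the response is parsed. The sketch below assumes the dump is tab-delimited text with a '#' header line and that the first two columns are the chromosome name and its size. The sample rows are placeholders with invented sizes and paths, not real output, and filtering on an underscore in the name is only a guess at a workable rule.

```ruby
# Placeholder data in the assumed shape of a chromInfo dump.
# Sizes and file paths are invented for illustration.
SAMPLE_ROWS = [
  ["#chrom", "size", "fileName"],
  ["chr1", 1000, "/placeholder/chr1.nib"],
  ["chrX", 800, "/placeholder/chrX.nib"],
  ["chrM", 16, "/placeholder/chrM.nib"],
  ["chr1_random", 50, "/placeholder/chr1_random.nib"],
  ["chr6_cox_hap1", 40, "/placeholder/chr6_cox_hap1.nib"]
]
SAMPLE_CHROM_INFO = SAMPLE_ROWS.map { |row| row.join("\t") }.join("\n")

# Parse tab-delimited text into a hash of name => size, skipping the header.
def parse_chrom_info(text)
  sizes = {}
  text.each_line do |line|
    line = line.chomp
    next if line.empty? || line.start_with?("#")
    chrom, size, _file = line.split("\t")
    sizes[chrom] = size.to_i
  end
  sizes
end

# Assumption: the odd entries all have an underscore in their names.
def main_chromosomes(sizes)
  sizes.reject { |name, _size| name.include?("_") }
end

sizes = parse_chrom_info(SAMPLE_CHROM_INFO)
puts main_chromosomes(sizes).keys.inspect   # ["chr1", "chrX", "chrM"]
```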

The equivalent for prokaryotes looks like this:

http://archaea.ucsc.edu/cgi-bin/hgTables?
db=eschColi_K12
hgta_group=allTables
hgta_track=eschColi_K12
hgta_table=chromInfo
hgta_regionType=genome
hgta_outputType=primaryTable
hgta_doTopSubmit=

Now, let's get us some features, in this case refseq genes:

http://genome.ucsc.edu/cgi-bin/hgTables?
db=hg18
hgta_group=genes
hgta_track=refGene
hgta_table=refGene
hgta_regionType=genome
hgta_outputType=primaryTable
hgta_doTopSubmit=

Oddly, on the archaeal side, we need to ask for refSeq rather than refGene. Try that, too.

http://archaea.ucsc.edu/cgi-bin/hgTables?
db=eschColi_K12
hgta_group=genes
hgta_track=refSeq
hgta_table=refSeq
hgta_regionType=genome
hgta_outputType=primaryTable
hgta_doTopSubmit=

Now we can get chromosome information and gene locations for our choice of organism. Of course, we've only scratched the surface of the tracks available, as a click here or here will show. The FAQ entry on linking also has a few hints.

With all this nicely curated data available so easily over HTTP and even straight from the database, it's begging to be mashed up, recombined and reintegrated. The Table Browser is a great idea. It's just a simple database dump to tab-delimited text, but it's so much easier to work with than, say for example, SOAP/WS-*. Thanks, UCSC, for making such a useful resource available!