Tuesday, May 27, 2008

General purpose programming on the GPU

The video-card manufacturers are at an interesting crossroads. nVidia is really pushing general purpose computing on graphics hardware.

Cool idea, but it can't really take off unless programs are written to detect the presence of capable hardware and load code specially compiled for that particular chipset. In other words, GPGPU code written for nVidia cards won't run on ATI cards. Bummer. Clearly nVidia understands the advantage to be had from a large body of code compiled for their instruction set. They want to be in the position of owning the GPU equivalent of the x86 instruction set. Nice move on their part.

But wouldn't it be much cooler for the rest of us if things were a little more open? Is a common instruction set for graphics cards at all plausible? Like ARM is for embedded devices? Or maybe a common intermediate layer and JIT for all types of GPU code? That would be cool. And something like that would induce a lot more developers to take the plunge into compiling for the GPU, which demands drastically different code than a straightforward CPU implementation.

Tuesday, May 20, 2008

What is a Ruby code block?

You can't believe everything you read on the internet. For example, you'll read that, in Ruby, use of the return keyword is optional. Like this:

def foo(a)
  return a*a
end

def bar(a)
  a*a
end

So far, so good. You'll also read that a code block in Ruby is an anonymous function. And what could be more natural than to return from a function? But try this:

[1,2,3].map {|a| a*a}
>> [1,4,9]

[4,5,6].map {|a| return a*a}
>> LocalJumpError: unexpected return
>> from (irb):14
>> from (irb):14:in `map'
>> from (irb):14
>> from :0

Well then, apparently one or both of our premises is wrong. Anyway, what is this LocalJump thing that is apparently in error?

My guess as to what's really going on is this: The code block is an attempt at creating a function-like artifact that matches the behavior of a code block in a non-functional language. That is, the code in the block doesn't execute in its own environment. Calling a code block doesn't cause a stack frame to be created. The reason the return causes problems is that there's no separate environment to return from.
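A quick irb experiment (my own, not from any documentation) is at least consistent with that guess: `return` inside a block jumps out of the enclosing method, so at irb's top level, where there's no enclosing method, there's nowhere to jump to, hence the LocalJumpError. A lambda, by contrast, treats `return` as local to itself, like a proper function:

```ruby
def squares_with_block_return(list)
  # `return` here returns from the *method* on the first element
  list.map { |a| return a * a }
end

def squares_with_lambda(list)
  # a lambda's `return` is local to the lambda, like a real function
  list.map(&lambda { |a| return a * a })
end

p squares_with_block_return([1, 2, 3])  # => 1, not an array!
p squares_with_lambda([1, 2, 3])        # => [1, 4, 9]
```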

It bothers me that I can't seem to google up an explanation of exactly what a Ruby code block (or Proc or lambda or method, for that matter) is in terms of what does or doesn't happen to stack frames or environments. There are plenty of tutorials on how to use these constructs, but little information on what they are, how they work, or why there are so many flavors of function-like things in Ruby.

Here's what I got from comp.lang.ruby.

Wednesday, May 14, 2008

Working with microarray data

State-of-the-art DNA microarrays contain between 1 million and 6 million features (different probes) on a single slide. Assuming 32-bit floats, we need 4 bytes per feature. If we include start and stop coordinates on the genome for each feature, we're up to 12 bytes per feature.

features      MB    MB w/ coords
1 million     4     12
6.5 million   26    78
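The numbers above are just multiplication; a throwaway sketch:

```ruby
BYTES_PER_FLOAT   = 4   # one 32-bit float per feature
BYTES_WITH_COORDS = 12  # float plus 32-bit start and stop coordinates

def megabytes(features, bytes_per_feature)
  features * bytes_per_feature / 1_000_000.0
end

puts megabytes(1_000_000, BYTES_PER_FLOAT)    # => 4.0
puts megabytes(6_500_000, BYTES_WITH_COORDS)  # => 78.0
```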

In tiling arrays, probes target regularly spaced segments of the genome so that expression can be measured at every point along the genome. Our group has done some work with arrays containing 60bp probes tiled every 20 base-pairs along the genome of Halobacterium salinarum. The genome of H. salinarum is 2.6 million bases, so we are able to cover most of the genome with a little less than a quarter of a million probes. To cover the entire human genome at similar resolution would take ~307 million probes.
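Checking that arithmetic (the assumption that the counts include both forward and reverse strands is mine; it's what makes the numbers come out):

```ruby
SPACING = 20  # one probe every 20 base-pairs

def probes_needed(genome_length, strands = 2)
  (genome_length / SPACING) * strands
end

puts probes_needed(2_600_000)      # H. salinarum: 260,000, roughly a quarter million
puts probes_needed(3_070_000_000)  # a ~3.07 Gbp human genome: 307 million
```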

As a side note, AFFY sells a set of 14 arrays with a total of almost 90 million probes that covers the whole genome at 35 bp resolution. (90 million probes × 35bp = 3,150 Mbp. Does their idea of a probe include both forward and reverse strands?)

These arrays are all custom made for an individual genome. I wonder if an array with all ~4 million possible 11-mers would be useful? Since there are only about 4 million of them, you could expect a fair number of collisions, where a probe is non-unique on the genome. How do you do that calculation?
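One back-of-the-envelope answer (my own sketch, assuming 11-mer counts in a random genome are roughly Poisson-distributed):

```ruby
K = 11
NUM_KMERS = 4**K  # 4,194,304 possible 11-mers

# Expected number of 11-mers that occur more than once in a genome,
# modeling each 11-mer's count as Poisson with mean
# (number of positions in the genome) / (number of possible 11-mers).
def expected_nonunique(genome_length)
  positions = genome_length - K + 1
  mean = positions.to_f / NUM_KMERS
  p_multiple = 1 - Math.exp(-mean) * (1 + mean)  # P(count >= 2)
  (NUM_KMERS * p_multiple).round
end

# For H. salinarum's 2.6 Mbp genome this comes out to roughly half a
# million of the ~4.2 million possible 11-mers being non-unique.
puts expected_nonunique(2_600_000)
```

Real genomes are far from random (repeats, skewed GC content), so this is a lower bound at best.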

64-bit Java

As far as I can tell, the capability to handle much bigger heap sizes is the main benefit from 64-bit Java. Primitive types don't change in size. An int is still 32 bits.

Java arrays are indexed by 32-bit integers (actually, signed integers, so 31 bits). Apparently this doesn't change in 64-bit Java. From the Java Language Spec:

Arrays must be indexed by int values; short, byte, or char values may also be used as index values because they are subjected to unary numeric promotion (§5.6.1) and become int values. An attempt to access an array component with a long index value results in a compile-time error.

This is a serious limitation, if you ask me. The main advantage of 64 bits is a simple programming model for large data objects. Apparently, Sun disagrees. Integer array indexes are baked fairly deeply into the language and into existing application code. Think of everywhere you refer to the length of, or an offset into, an array as an int. Imagine how often that's done in the library code. Dunno if they'll ever change it.

Other side-effects of 64-bit

  • Increased object size
    • 64 bit pointers
    • alignment to 8-byte boundaries
  • The minimum required heap size is 51.1% larger
  • More cache misses

source: 64-bit versus 32-bit Virtual Machines for Java

Finally, here's an intriguing tidbit: sun.misc.Unsafe