Saturday, June 25, 2011

The future of money

Back during the first dot-com bubble, PayPal got started with revolutionary intentions. One of the founders, Peter Thiel, recently heard urging college students to log in, drop out, and start up, was even more radical back then.

In his book The PayPal Wars, early PayPal marketing guru Eric M. Jackson recounts a stirring speech Thiel gave to the company's early staff.
"PayPal will give citizens worldwide more direct control over their currencies than they ever had before," Thiel said. "It will be nearly impossible for corrupt governments to steal wealth from their people through their old means because if they try the people will switch to dollars or pounds or yen, in effect dumping the worthless local currency for something more secure."
Unfortunately, that vision never panned out. PayPal thrived when it came to innovating and adapting to stay a step ahead of its early competitors. But the company proved less adept at slaying its more formidable antagonists: lawyers and politicians.
Jack Dorsey's Square, a web 2.0 and mobile-friendly take on PayPal, talks disruption but has already accepted money from VISA. It's a shrewd business model: build anything reasonably credible in the payments space and the odds of being bought out by one of the incumbents are about 1000 percent. Dwolla is a digital and mobile payments startup from Iowa whose early funding came from a credit union.
Startups aren't the only ones who want a piece of VISA's 37% profit margin on $8.6 billion in revenue. Big tech companies like Apple and Google are getting into the payments game, too. Google just rolled out Google Wallet, stirring privacy concerns and getting sued by PayPal.
These days, the wild-eyed radicals look past these tame corporate offerings to Bitcoin, a peer-to-peer digital currency created in 2009 by the mysterious Satoshi Nakamoto. Bitcoin relies on public-key cryptography and digital signatures to guarantee payment and receipt. Satoshi's key insight was the means of validating transactions. Rather than clearing through a central authority, the system validates transactions through a distributed proof-of-work scheme, relying on the majority of honest nodes on the peer-to-peer network to solve cryptographic puzzles faster than any attacker.
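To make the proof-of-work idea a bit more concrete, here's a toy sketch in R (emphatically not Bitcoin's actual protocol, just the flavor of it), assuming the digest package is available for SHA-256 hashing. A miner hunts for a nonce that makes the hash of the block data start with a run of zeros; verifying a solution takes one hash, but finding one takes brute force.

# Toy proof-of-work: find a nonce so that sha256(data + nonce) starts with
# 'difficulty' zero characters. Illustrative only, not Bitcoin's real algorithm.
library(digest)

proof.of.work <- function(block.data, difficulty=3) {
  target <- paste(rep('0', difficulty), collapse='')
  nonce <- 0
  repeat {
    h <- digest(paste(block.data, nonce, sep=''), algo='sha256')
    if (substr(h, 1, difficulty) == target) return(list(nonce=nonce, hash=h))
    nonce <- nonce + 1
  }
}

# checking the answer is a single hash; finding it took many tries
proof.of.work('Alice pays Bob 10 BTC')
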
The idea of an unregulated, decentralized currency appeals to some, but don't expect the government to like it. The potential for black markets and money laundering has already drawn scrutiny and calls for a crackdown. They're undoubtedly wondering how to tax it.
Possibly a bigger threat: Bitcoin has also attracted the attention of thieves. Last week, Mt. Gox, the currency's largest exchange, was hacked. The system relies critically on the security of end-user machines, a shaky proposition. One Bitcoin user, aptly named allinvain, reported $500,000 worth stolen. A Bitcoin-harvesting trojan has already been spotted in the wild.
Whether Bitcoin can overcome these problems or not, it's sure to be a wild ride. Bitcoin's technical underpinnings are fascinating, and an impressive ecosystem has quickly sprung up around it. There are dealers, exchanges, an escrow service, charities, and a place to keep your treasure hoard online.
I'm curious to see how much of the existing financial system gets ported to Bitcoin. Is fractional reserve banking in Bitcoin possible? How about securities denominated in Bitcoin? If you're a skeptic, can you sell short? One thing I love about Bitcoin is the mixture of engineering and economics, and even more, the engineering of economics. Of course, this comes with all the caveats and warnings of version 0.1.
The future of money is here. Are we ready for it?

PS

I'm ready! Support this blog. Tips accepted here: 15Y9pepdBG9GJxyCc6HgsQS39BvsBUqi1W

Friday, June 24, 2011

Drawing heatmaps in R

A while back, while reading chapter 4 of Using R for Introductory Statistics, I fooled around with the mtcars dataset giving mechanical and performance properties of cars from the early 70's. Let's plot this data as a hierarchically clustered heatmap.

# scale data to mean=0, sd=1 and convert to matrix
mtscaled <- as.matrix(scale(mtcars))

# create heatmap and don't reorder columns
heatmap(mtscaled, Colv=F, scale='none')

By default, heatmap clusters by both rows and columns. It then reorders the resulting dendrograms according to mean. Setting Colv to false tells it not to reorder the columns, which will come in handy later. Let's also turn off the default scaling across rows. We've already scaled across columns, which is the sensible thing to do in this case.

If our columns are already in some special order, say as a time series or by increasing dosage, we might want to cluster only rows. We could do that by setting the Colv argument to NA. One thing that clustering the columns tells us in this case is that some information is highly correlated, bordering on redundant. For example, displacement, horsepower and number of cylinders are quite similar. And the idea that to get more power (hp) and go faster (qsec) we need to burn more gas (mpg) is pretty well supported.
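
For instance, a minimal variant of the earlier call that clusters the rows only, leaving the column order alone:

# cluster rows only; Colv=NA suppresses the column dendrogram entirely
heatmap(mtscaled, Colv=NA, scale='none')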

Separating clusters

If we'd like to separate out the clusters, I'm not sure of the best approach. One way is to use hclust and cutree, which allows you to specify k, the number of clusters you want. Don't forget that hclust requires a distance matrix as input.

# cluster rows
hc.rows <- hclust(dist(mtscaled))
plot(hc.rows)

# transpose the matrix and cluster columns
hc.cols <- hclust(dist(t(mtscaled)))

# draw heatmap for first cluster
heatmap(mtscaled[cutree(hc.rows,k=2)==1,], Colv=as.dendrogram(hc.cols), scale='none')

# draw heatmap for second cluster
heatmap(mtscaled[cutree(hc.rows,k=2)==2,], Colv=as.dendrogram(hc.cols), scale='none')

That works, but I'd probably advise creating one heatmap and cutting it up in Illustrator, if need be. I have a nagging feeling that the color scale will end up being slightly different between the two clusters, since the range of values in each submatrix is different. Speaking of colors, if you don't like the default heat colors, try creating a new palette with colorRampPalette.

palette <- colorRampPalette(c('#f0f3ff','#0033BB'))(256)
heatmap(mtscaled, Colv=F, scale='none', col=palette)

Confusing things

Another way to separate the clusters is to get the dendrograms out of heatmap and work with those. But the cutree function applies to objects of class hclust, as returned by hclust, and returns a map assigning each row in the original data to a cluster. It takes either a height to cut at (h) or the desired number of clusters (k), which is nice.
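
To see what that map looks like, using the hc.rows object computed above:

# each row of mtscaled gets a cluster label between 1 and k
cutree(hc.rows, k=2)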

The cut function, on the other hand, applies to dendrograms, which heatmap can return if the keep.dendro option is set. It takes only h, not k, and returns a list with components upper and lower; lower is a list of the subtrees below the cut point.
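
Something along these lines gets at those dendrograms (the cut height of 5 here is arbitrary, just for illustration):

# keep the dendrograms that heatmap computes internally
hm <- heatmap(mtscaled, Colv=F, scale='none', keep.dendro=TRUE)

# cut the row dendrogram at a height; lower holds the resulting subtrees
cut(hm$Rowv, h=5)$lower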

Doing graphics with R starts easy, but gets arcane quickly. There's also a heatmap.2 function in the gplots package that adds color keys, among other sparsely documented features.

This all needs some serious straightening out, but the basics are easy enough. Here are a couple more resources to make your heatmaps extra-hot:

...more on R.

Tuesday, June 07, 2011

Ten Design Lessons

  1. Respect “the genius of a place.”
  2. Subordinate details to the whole.
  3. The art is to conceal art.
  4. Aim for the unconscious.
  5. Avoid fashion for fashion’s sake.
  6. Formal training isn’t required.
  7. Words matter.
  8. Stand for something.
  9. Utility trumps ornament.
  10. Never too much, hardly enough.

from Frederick Law Olmsted, the father of American landscape architecture via This isn't happiness.

Monday, June 06, 2011

Primers in Computational Biology

Nature Biotechnology used to regularly feature primers on various topics in computational biology. Here's an incomplete listing based on what looked interesting to me. Some of these are old, but on topics that are fundamental enough not to go out of style. Lots of these are just mini-tutorials in machine learning.

...just in case you're in need of some bed-time reading or some mad comp-bio skillz. Sorry if some of these are behind a pay-wall, but there's usually a way around, under or over such walls.

Saturday, June 04, 2011

Environments in R

The R Project

One interesting thing about R is that you can get down into the insides fairly easily. You're allowed to see more of how things are put together than in most languages. One of the ways R does this is by having first-class environments.

At first glance, environments are simple enough. An environment is just a place to store variables - a set of bindings between symbols and objects. If you start up R and make an assignment, you're adding an entry in the global environment.

> a <- 1234
> e <- globalenv()
> ls()
[1] "a" "e"
> ls(e)
[1] "a" "e"
> e$a
[1] 1234
> class(e)
[1] "environment"

Hmmm, the variable e is part of the global environment and it refers to the global environment, too, which is kind-of circular.

> ls(e$e$e$e$e$e$e$e)
[1] "a" "e"

We'd better cut that out, before we're sucked into a cosmic vortex.

> rm(e)

Most functional languages have some concept of environments, which serves as a higher level of abstraction over implementation details like allocating variables on the heap or stack. Saying that environments are first-class means that you can manipulate them from within the language, which is less common. Several advanced language features of R are built out of environments. We'll look at functions, packages and namespaces, and point out several Scheme-like features in R.

But first, the basics. The R Language Definition gives this definition:

Environments can be thought of as consisting of two things: a frame, which is a set of symbol-value pairs, and an enclosure, a pointer to an enclosing environment. When R looks up the value for a symbol the frame is examined and if a matching symbol is found its value will be returned. If not, the enclosing environment is then accessed and the process repeated. Environments form a tree structure in which the enclosures play the role of parents. The tree of environments is rooted in an empty environment, available through emptyenv(), which has no parent.

You can make a new environment with new.env() and assign a couple variables. The assign function works, as does the odd but convenient dollar sign notation. Think of the dollar sign as equivalent to the 'dot' operator that dereferences object members in Java-ish languages.

> my.env <- new.env()
> my.env
<environment: 0x114a9d940>
> ls(my.env)
character(0)
> assign("a", 999, envir=my.env)
> my.env$foo = "This is the variable foo."
> ls(my.env)
[1] "a"   "foo"

Now we have two variables named a, one in the global environment, the other in our new environment. Let's stick another variable b in the global environment, just for kicks.

> a
[1] 1234
> my.env$a
[1] 999
> b <- 4567

Also, note that the parent environment of my.env is the global environment.

> parent.env(my.env)
<environment: R_GlobalEnv>

A variable can be accessed using get or the dollar operator. By default, get continues up the chain of parents until it either finds a binding or reaches the empty environment. The dollar operator looks specifically in the given environment.

> get('a', envir=my.env)
[1] 999
> get('b', envir=my.env)
[1] 4567
> my.env$a
[1] 999
> my.env$b
NULL

Functions and environments

Functions have their own environments. This is the key to implementing closures. If you've never heard of a closure, it's just a function packaged up with some state. In fact, some say a closure is a poor man's object, while others insist it's the other way 'round. The R Language Definition explains the relationship between functions and environments like this:

Functions (or more precisely, function closures) have three basic components: a formal argument list, a body and an environment. [...] A function's environment is the environment that was active at the time that the function was created. [...] When a function is called, a new environment (called the evaluation environment) is created, whose enclosure is the environment from the function closure. This new environment is initially populated with the unevaluated arguments to the function; as evaluation proceeds, local variables are created within it.

When a function is evaluated, R looks in a series of environments for any variables in scope. The evaluation environment is first, then the function's enclosing environment, which will be the global environment for functions defined in the workspace. So, the global variable a, which had the value 1234 last time we looked, can be referenced inside a function.

> f <- function(x) { x + a }
> environment(f)
<environment: R_GlobalEnv>
> f(4321)
[1] 5555

We can change a function's environment if we want to.

> environment(f) <- my.env
> environment(f)
<environment: 0x114a9d940>
> my.env$a
[1] 999
> f(1)
[1] 1000

Suppose we wanted a counter to keep track of progress of some kind. That could be written and applied like so:

> createCounter <- function(value) { function(i) { value <<- value+i} }
> counter <- createCounter(0)
> counter(1)
> a <- counter(0)
> a
[1] 1
> counter(1)
> counter(1)
> a <- counter(1)
> a
[1] 4
> a <- counter(5)
> a
[1] 9

Notice the special <<- assignment operator. If we had used the normal <- assignment operator, we would have created a new variable 'value' in the evaluation environment of the function, masking the value in the function closure environment. The evaluation environment disappears as soon as the function returns, sending our new value into the ether. What we want to do is change the value in the function closure environment, so that assignments to value will be persistent across invocations of our counter. Mutable state is generally not the default in functional languages, so we have to use the special assignment operator.
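To see why, here's a hypothetical broken counter (the names brokenCounter and bc are just for illustration) that uses the normal <- operator. The assignment creates a local value in the evaluation environment, so every call starts from the closure's original value and the count never advances:

> brokenCounter <- function(value) { function(i) { value <- value + i; value } }
> bc <- brokenCounter(0)
> bc(1)
[1] 1
> bc(1)
[1] 1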

Just to look under the covers, where is that mutable state? In the counter function's enclosing environment.

> ls(environment(counter))
[1] "value"
> environment(counter)$value
[1] 9

For those who geek out on this stuff, this is an implementation of Paul Graham's Accumulator Generator from his article Revenge of the Nerds, which, years ago, I struggled to implement in Java.

Inspired by Scheme, lexical scoping is R's major point of departure from the S language. Gentleman and Ihaka's papers R: A Language for Data Analysis and Graphics (pdf) and Lexical Scope and Statistical Computing (pdf) describe some of their language design decisions around this point.

For functions defined in a package, the situation gets a bit more interesting. The various parts of the plot function are visible below, including a parameter list (x, y, and some other junk), a block of code, elided here, and an environment, which is the namespace for the graphics package. Packages and namespaces are our next topic.

> plot
function (x, y, ...) 
{
  ...blah, blah, blah...
}
<environment: namespace:graphics>

Packages and namespaces

Walking up the chain of environments starting with the global environment, we see the packages loaded into R.

> globalenv()
<environment: R_GlobalEnv>
> g <- globalenv()
> while (environmentName(g) != 'R_EmptyEnv') { g <- parent.env(g); cat(str(g, give.attr=F)) }
<environment: 0x100fdf078>
<environment: package:stats>
<environment: package:graphics>
<environment: package:grDevices>
<environment: package:utils>
<environment: package:datasets>
<environment: package:methods>
<environment: 0x101a19f58>
<environment: base>
<environment: R_EmptyEnv>

Oddly, you can't test environments for equality. If you try, it says, "comparison (1) is possible only for atomic and list types". That's why we test for the end of the chain by name.
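
As an aside, identical() does compare environments by identity, which gives another way to spot the end of the chain. A quick sketch:

> identical(globalenv(), emptyenv())
[1] FALSE
> identical(emptyenv(), emptyenv())
[1] TRUE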

This same information can be had in slightly nicer form using search.

> search()
 [1] ".GlobalEnv"        "tools:RGUI"        "package:stats"     "package:graphics" 
 [5] "package:grDevices" "package:utils"     "package:datasets"  "package:methods"  
 [9] "Autoloads"         "package:base"

By now, you can guess how attach works. It creates an environment and slots it into the list right after the global environment, then populates it with the objects we're attaching.

> beatles <- list('george'='guitar','ringo'='drums','paul'='bass guitar','john'='guitar')
> attach(beatles)
> search()
 [1] ".GlobalEnv"        "beatles"           "tools:RGUI"        "package:stats"    
 [5] "package:graphics"  "package:grDevices" "package:utils"     "package:datasets" 
 [9] "package:methods"   "Autoloads"         "package:base"     
> john
[1] "guitar"
> paul
[1] "bass guitar"
> george
[1] "guitar"
> ringo
[1] "drums"

Attaching a package using library adds an entry to the chain of environments. A package can optionally have another environment, a namespace, whose purpose is to prevent naming clashes between packages and hide internal implementation details. R Internals explains it like this:

A package pkg with a name space defines two environments namespace:pkg and package:pkg. It is package:pkg that can be attached and form part of the search path.

When a namespaced package is loaded, a new environment is created and all exported items are copied into it. That's package:pkg in the example above and is what you see in the search path. The namespace becomes the environment for the functions in that package. The parent environment of the namespace holds all the imports declared by the package. And the parent of that is a special copy of the base environment whose parent is the global environment.
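
Poking at that chain directly, using the stats namespace (a sketch; stats is loaded by default and the imports environment is identified by its name attribute):

> ns <- getNamespace('stats')
> attr(parent.env(ns), 'name')
[1] "imports:stats"
> parent.env(parent.env(ns))
<environment: namespace:base>
> parent.env(parent.env(parent.env(ns)))
<environment: R_GlobalEnv>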

We can see what namespaces are loaded using loadedNamespaces.

> loadedNamespaces()
[1] "base"      "graphics"  "grDevices" "methods"   "stats"     "tools"    
[7] "utils"

What if the same name is used in multiple environments? In general, R walks up the chain of environments and uses the first binding for a symbol it finds. R is smart enough to distinguish functions from other types. Here we try to mask the mean function, but R can still find it, knowing that we're trying to apply a function.

> z = list(mean='fluffernutter')
> attach(z)
> mean
[1] "fluffernutter"
> mean(c(1,2,3,4))
[1] 2.5
> detach(z)

We can mask a function with another function. Now, the mean of any list of numbers is "flapdoodle".

> z = list(mean=function(x){ return("flapdoodle") })
> attach(z)
The following object(s) are masked from 'package:base':
    mean
> mean(c(4,5,6,7))
[1] "flapdoodle"

The double-colon operator will let us specify which mean function we want. And, if you like to break the rules, the triple-colon operator lets you reach inside namespaces and touch private non-exported elements.

> base::mean(c(6,7,8,9))
[1] 7.5
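
For instance, the default S3 method behind t.test isn't exported from the stats namespace (assuming current stats internals), but the triple colon digs it out anyway:

> is.function(stats:::t.test.default)
[1] TRUE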

So, there you have two fairly advanced language features built on the simple abstraction of environments. Thrown in for free is a nice look at R's functional side.

Is that everything you wanted to know about environments but were afraid to ask? Be warned that I'm just figuring this stuff out myself. If I've gotten anything bass-ackwards, please let me know. There's more information below, in case you can't get enough.

More Information