Tuesday, July 31, 2012

Surfacing good comments

Comments are supposed to make digital media more engaging and interactive. Somewhere in the crowd of readers and viewers are ideas, insight and thoughtful criticism. However, the comments found on popular internet sites like YouTube or news sites generally inspire a loss of faith in humanity. From the comfort and pseudo-anonymity of thousands of living rooms comes a stream of abuse, wingnuttery and outright stupidity that overwhelms etiquette and common sense.

Clay Shirky thinks Gawker is on to something with its attempts to surface quality comments. Gawker redesigned its comments section to serve the people reading the comments rather than the people writing them, moving most comments off an article's main page and enabling better curation.

Manual curation is labor intensive. I wonder whether a machine learning approach could do a reasonable job of identifying good comments, or at least weeding out most of the inane ones. It's not really much different from spam filtering. That might make a fun little project.
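
Off the top of my head, a first pass might look like the spam filters of old: naive Bayes over bag-of-words features. Here's a purely hypothetical sketch in R; the e1071 package and the toy hand-labeled comments are my own stand-ins, and real training data would have to come from human-curated ratings.

# a hypothetical sketch: score comments like a spam filter, using naive Bayes
# over bag-of-words features (e1071 and the toy labels are assumptions)
library(e1071)

comments <- c("thoughtful analysis of the underlying data",
              "interesting counterpoint, thanks for the link",
              "u r all idiots lol",
              "first!!! this is dumb")
quality <- factor(c("good", "good", "bad", "bad"))

# crude bag-of-words features: which vocabulary words appear in each comment
tokens <- strsplit(tolower(comments), "[^a-z]+")
vocab <- unique(unlist(tokens))
bow <- as.data.frame(t(sapply(tokens, function(words) vocab %in% words)))
names(bow) <- make.names(vocab)
bow[] <- lapply(bow, factor, levels=c(FALSE, TRUE))

# train the classifier and sanity-check it on the training comments
model <- naiveBayes(bow, quality)
predict(model, bow)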

Thursday, July 26, 2012

Linear regression by gradient descent

In Andrew Ng's Machine Learning class, the first section demonstrates gradient descent by using it on a familiar problem, that of fitting a linear function to data.

Let's start off by generating some bogus data with known characteristics. Let's make y just a noisy version of x. Let's also add 3 to give the intercept term something to do.

# generate random data in which y is a noisy function of x
x <- runif(1000, -5, 5)
y <- x + rnorm(1000) + 3

# fit a linear model
res <- lm( y ~ x )
print(res)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
     2.9930       0.9981

Fitting a linear model, we should get a slope of 1 and an intercept of 3. Sure enough, we get pretty close. Let's plot it and see how it looks.

# plot the data and the model
plot(x,y, col=rgb(0.2,0.4,0.6,0.4), main='Linear regression by gradient descent')
abline(res, col='blue')

As a learning exercise, we'll do the same thing using gradient descent. As discussed previously, the main idea is to take the partial derivatives of the cost function with respect to the parameters theta. That gradient, scaled by a learning rate, becomes the update rule for the estimated parameter values. Iterate and things should converge nicely.
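
In symbols, with m training examples, design matrix X and learning rate alpha, the squared-error cost and the update that the code below implements work out to:

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta^\top x^{(i)} - y^{(i)} \right)^2

\theta \leftarrow \theta - \frac{\alpha}{m} \, X^\top (X\theta - y)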

# squared error cost function
cost <- function(X, y, theta) {
  sum( (X %*% theta - y)^2 ) / (2*length(y))
}

# learning rate and iteration limit
alpha <- 0.01
num_iters <- 1000

# keep history
cost_history <- double(num_iters)
theta_history <- vector("list", num_iters)

# initialize coefficients
theta <- matrix(c(0,0), nrow=2)

# add a column of 1's for the intercept coefficient
X <- cbind(1, matrix(x))

# gradient descent
for (i in 1:num_iters) {
  error <- (X %*% theta - y)            # residuals for the current theta
  delta <- t(X) %*% error / length(y)   # gradient of the cost
  theta <- theta - alpha * delta        # step downhill, scaled by the learning rate
  cost_history[i] <- cost(X, y, theta)
  theta_history[[i]] <- theta
}

print(theta)

          [,1]
[1,] 2.9928978
[2,] 0.9981226

As expected, theta winds up with the same values as lm returned. Let's do some more plotting:

# plot data and converging fit
plot(x,y, col=rgb(0.2,0.4,0.6,0.4), main='Linear regression by gradient descent')
for (i in c(1,3,6,10,14,seq(20,num_iters,by=10))) {
  abline(coef=theta_history[[i]], col=rgb(0.8,0,0,0.3))
}
abline(coef=theta, col='blue')

Taking a look at how quickly the cost decreases, it seems I could have gotten away with fewer iterations.

plot(cost_history, type='l', col='blue', lwd=2, main='Cost function', ylab='cost', xlab='Iterations')

That was easy enough. The next step is to look into some of the more advanced optimization methods available within R. I'll try to translate more of the Machine Learning class into R. I know others are doing that as well.
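
As a teaser, here's a minimal sketch of what that might look like with R's built-in general-purpose optimizer, optim(), reusing the cost function defined above (BFGS being just one of the methods it offers):

# minimize the same squared-error cost with optim() instead of hand-rolled gradient descent
fit <- optim(par=c(0, 0), fn=function(theta) cost(X, y, theta), method="BFGS")
print(fit$par)  # should land close to the coefficients lm() returned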

  • The code for this post is available on GitHub as a gist

Tuesday, July 10, 2012

Katy Börner's Plug and Play Macroscopes

Two Cytoscape engineers pointed me towards Plug and Play Macroscopes by Katy Börner. The paper envisions highly flexible and configurable software tools for science, built on a plugin architecture, and it's well worth reading if you're involved in building scientific software.

Decision making in science, industry, and politics, as well as in daily life, requires that we make sense of data sets representing the structure and dynamics of complex systems. Analysis, navigation, and management of these continuously evolving data sets require a new kind of data-analysis and visualization tool we call a macroscope...

A macroscope is a modular software framework enabling end users - biologists, physicists, social scientists - to assemble customized tools reusing and recombining data sources, algorithms and visualizations. These tools consist of reconfigurable bundles of software plug-ins, using the same OSGi framework as the Eclipse IDE. Once a component has been packaged as a plug-in, it can be shared and combined with other components in new and creative ways to make sense of complexity, synthesizing related elements, finding patterns and detecting outliers. Börner sees these software tools as instruments on par with the microscope and the telescope.

CIShell

These concepts are implemented in a framework called the Cyberinfrastructure Shell (CIShell), whose source is in the CIShell repo on GitHub. The core abstraction inside CIShell is the algorithm - something that can be executed, might throw exceptions, and returns some data. Data is an object that has a format and key/value metadata.

public interface Algorithm {
    public Data[] execute() throws AlgorithmExecutionException;
}

public interface Data {
    public Dictionary getMetadata();
    public Object getData();
    public String getFormat();
}

Parenthetically, it's too bad there's no really universal abstraction for a function in Java... Callable, Runnable, Method, Function. In general, trying to wedge dynamic code into the highly static world of Java is not the most natural fit.

I'm guessing that integrating a tool involves wrapping its functionality in a set of Algorithm implementations.

The framework also features what looks like support for dynamic GUI construction, a database abstraction with a slot for a named schema, and support for scripting in Python.

An upcoming version of Cytoscape is built on OSGi. Someone should write a genome browser along these same lines.

"To serve the needs of scientists the core architecture must empower non-programmers to plug, play, and share their algorithms and to design custom macroscopes and other tools. " In my experience, scientists who are capable of designing workflows in visual tools are not afraid of learning enough R or Python to accomplish much the same thing. I'm not saying it's obvious that they should do that. Just that the trade-offs are worth considering. The real benefit comes from raising the level of abstraction, rather than replacing command-line code with point-and-click GUIs.

Means of composition

Plugin architecture isn't the only way to compose independently developed software tools. My lab's Gaggle framework links software through messaging. Service-oriented architecture boils down to composition of web services, as in Taverna and MyGrid. GenePattern and Galaxy both fit into this space, although I'm not sure I can do a good job of characterizing them; if I understand correctly, both seem to use common file formats, and conversions between them, as the intermediary between programs. The classic means of composition are Unix pipes and filters - small programs loosely joined - and scripting.

Visualization

In a keynote on Visual Design Principles at VIZBI in March of this year, Börner channeled inspiration from Yann Arthus-Bertrand's Home and Earth from Above, Drew Berry and Edward Tufte, and advised students of visualization to study human visual perception and cognitive processing first in order to "design for the human system".

Her workflow combines data-centric steps similar to processes outlined by Jeffrey Heer and Hadley Wickham with a detailed breakdown of the elements of visualization.

Katy Börner directs the Cyberinfrastructure for Network Science Center at Indiana University. Not content to have written the Atlas of Science (MIT Press, 2010), she is currently hanging out in Amsterdam writing an Atlas of Knowledge.
