Friday, December 23, 2011

Practical advice for applying machine learning

Sprinkled throughout Andrew Ng's machine learning class is a lot of practical advice for applying machine learning. That's what I'm trying to compile and summarize here. Most of Ng's advice centers on making decisions empirically, as befits a data-driven discipline, rather than relying on gut feeling.

Training / validation / test

The key is dividing data into training, cross-validation and test sets. The test set is used only to evaluate performance, not to train parameters or select a model representation. The rationale for this is that training set error is not a good predictor of how well your hypothesis will generalize to new examples. In the course, we saw the cross-validation set used to select degrees of polynomial features and find optimal regularization parameters.
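
As a minimal sketch of such a split in Octave (assuming a feature matrix X and label vector y; the 60/20/20 proportions are just a common choice, not gospel):

% shuffle, then split into training, cross-validation and test sets
m = size(X, 1);
shuffled = randperm(m);
n_train = floor(0.6 * m);
n_cv    = floor(0.2 * m);

X_train = X(shuffled(1:n_train), :);              y_train = y(shuffled(1:n_train));
X_cv    = X(shuffled(n_train+1:n_train+n_cv), :); y_cv    = y(shuffled(n_train+1:n_train+n_cv));
X_test  = X(shuffled(n_train+n_cv+1:end), :);     y_test  = y(shuffled(n_train+n_cv+1:end));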

Model representation

The representation of the hypothesis, the function h, defines the space of solutions that your algorithm can find. The example used in the class was modeling house price as a function of size. The model tells you what parameters your algorithm needs to learn. If we've selected a linear function, then there are two parameters, the slope and intercept of the line.
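
For instance, a linear hypothesis for house prices might look like this in Octave (the parameter values here are made up, just for illustration):

% linear hypothesis: h(x) = theta(1) + theta(2) * x
theta = [50000; 120];                 % made-up intercept and slope
h = @(x) theta(1) + theta(2) .* x;
h(2000)                               % predicted price for a 2000 square foot house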

Feature selection and treatment

Are the given features sufficiently informative to make predictions? Asking whether a human expert can confidently predict the output given the input features will give a good indication.

At times, it may be necessary to derive new features. Polynomial features, for example, can let linear regression fit non-linear functions. Computing products, ratios, differences or logarithms may be informative. Creativity comes in here, but remember to test the effectiveness of your new features on the cross-validation set.
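
A quick Octave sketch of deriving new features, assuming X is the feature matrix:

x = X(:,1);                    % a single original feature, e.g. house size
X_poly = [x, x.^2, x.^3];      % polynomial features let linear regression fit curves
% products, ratios, differences or logs of existing columns can be added the same way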

Features that are on different scales may benefit from feature scaling. Mean normalizing and scaling to a standard deviation of one puts features on an even footing.
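
In Octave, that might look like this:

mu     = mean(X);                                        % column means
sigma  = std(X);                                         % column standard deviations
X_norm = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma); % mean-normalize, then scale each column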

Gathering data might be expensive. Another option is artificial data synthesis, either creating new examples out of whole cloth or by transforming existing examples. In text recognition, a library of digital fonts might be used to generate examples, or existing examples might be warped or reflected.

Overfitting

Often, a learning algorithm may fit the training data very well, but perform poorly on new examples. This failure to generalize is called overfitting.

The classic example is fitting a high degree polynomial, which can lead to a very curvy line that closely fits a large number of data points. Such a complex hypothesis may be fitting noise rather than an underlying relationship, and will therefore generalize poorly.

One way to combat this problem is to use a simpler model. This is valid, but might be limiting. Another option is regularization, which penalizes large parameter values. This prioritizes solutions fitting the training data reasonably well without curving around wildly.

Regularization can be tuned by plotting training set error and validation set error as functions of the regularization parameter, lambda.
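
A sketch of that tuning loop, using the train/CV split from above and hypothetical trainModel and computeError helpers (the course supplied its own versions of these):

lambdas   = [0 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10];
err_train = zeros(size(lambdas));
err_cv    = zeros(size(lambdas));

for i = 1:length(lambdas)
  theta        = trainModel(X_train, y_train, lambdas(i));   % hypothetical regularized training routine
  err_train(i) = computeError(X_train, y_train, theta);      % errors measured without the penalty term
  err_cv(i)    = computeError(X_cv, y_cv, theta);
end

plot(lambdas, err_train, lambdas, err_cv);
legend('train', 'cross-validation'); xlabel('lambda'); ylabel('error');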

Tuning the trade-off between bias and variance

The steps we take to improve performance depend on whether our algorithm is suffering from bias or variance. A learning curve is a diagnostic that can tell which of these situations we're in, by plotting training error and validation error as a function of training set size. Look for high training and cross-validation error, indicating high bias, or a steadily decreasing validation error with a persistent gap above the training error, indicating high variance.
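
A learning-curve sketch under the same assumptions (hypothetical trainModel and computeError helpers, lambda fixed at some previously chosen value):

m_train   = size(X_train, 1);
err_train = zeros(m_train, 1);
err_cv    = zeros(m_train, 1);

for i = 1:m_train
  theta        = trainModel(X_train(1:i,:), y_train(1:i), lambda);  % train on the first i examples
  err_train(i) = computeError(X_train(1:i,:), y_train(1:i), theta);
  err_cv(i)    = computeError(X_cv, y_cv, theta);                   % always evaluate on the full CV set
end

plot(1:m_train, err_train, 1:m_train, err_cv);
legend('train', 'cross-validation'); xlabel('training set size'); ylabel('error');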

Bias

A high bias model has few parameters and may result in underfitting. Essentially we're trying to fit an overly simplistic hypothesis, for example a linear one where we should be looking for a higher order polynomial. In a high bias situation, training and cross-validation error are both high, and more training data is unlikely to help much. Things to try instead:

  • find more features
  • add polynomial features
  • increase parameters (more hidden layer neurons, for example)
  • decrease regularization

Variance

Variance is the opposite problem: lots of parameters, which carries a risk of overfitting. If we are overfitting, the algorithm fits the training set well but has high cross-validation and test error. If we see low training set error with cross-validation error trending downward, the gap between them might be narrowed by training on more data. Things to try:

  • more training data
  • reduce number of features, manually or using a model selection algorithm
  • increase regularization

Error analysis

To improve performance of a machine learning algorithm, one helpful step is to manually examine the cases your algorithm gets wrong. Look for systematic trends in the errors. What features would have helped correctly classify these cases?

For multi-step machine learning pipelines, ceiling analysis can help decide where to invest effort to improve performance. The error due to each stage is estimated by substituting labeled data for that stage, revealing how well the whole pipeline would perform if that stage had no error. Stepping through the stages, we note the potential for improvement at each one.

Precision/Recall

It's helpful to have a single number to easily compare performance. Precision, recall and the F1 score can help when classifying very skewed classes, where one class is rare in the data. Simply taking the percentage of correct classifications can be misleading, since always guessing the more common class means you'll almost always be right.

precision = true positives / predicted positives, where predicted positives = true positives + false positives. (Of all the patients we predicted to have cancer, what fraction actually has cancer?)

recall = true positives / actual positives, where actual positives = true positives + false negatives. (Of all the patients that really had cancer, how many did we correctly predict?)

F1 = 2·p·r / (p + r)
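
Given 0/1 vectors of actual labels y and predictions pred (names chosen just for illustration), these are one-liners in Octave:

tp = sum((pred == 1) & (y == 1));     % true positives
fp = sum((pred == 1) & (y == 0));     % false positives
fn = sum((pred == 0) & (y == 1));     % false negatives

precision = tp / (tp + fp);
recall    = tp / (tp + fn);
F1        = 2 * precision * recall / (precision + recall);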

Miscellaneous tips

Principal component analysis (PCA) can help by reducing the dimensionality of high-dimensional features. Collapsing highly correlated features can help learning algorithms run faster.
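
A minimal sketch, assuming X_norm holds mean-normalized features and k is the target dimension:

Sigma     = (X_norm' * X_norm) / size(X_norm, 1);  % covariance of the mean-normalized features
[U, S, V] = svd(Sigma);
Z         = X_norm * U(:, 1:k);                    % project onto the top k principal components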

Often, an incorrectly implemented machine learning algorithm can appear to work, producing no obvious error, but converging more slowly or to a higher error than a correct implementation would. Gradient checking is a technique for checking your work; it was applied in the class to back propagation, but is probably more generally applicable.

The recommended approach

Quickly testing ideas empirically and optimizing developer time is the approach embodied in these steps:

  • First, implement a simple quick and dirty algorithm, plot learning curves, and perform error analysis.
  • Create a list of potential ideas to try to improve performance. Then, start trying promising ideas, using the validation set to test for improvement.
  • Use a learning algorithm with many parameters and many features - low bias. Get a very large training set.

Supporting that last point are the findings of Banko and Brill, 2001.

“It's not who has the best algorithm that wins. It's who has the most data.”

Wednesday, December 14, 2011

SDCube and hybrid data storage

The Sorger lab at Harvard published a piece in the February 2011 Nature Methods that shows some really clear thinking on the topic of designing data storage for biological data. That paper, Adaptive informatics for multifactorial and high-content biological data, Millard et al., introduces a storage system called SDCubes, short for semantically typed data hypercubes, which boils down to HDF5 plus XML. The software is hosted at semanticbiology.com.

This two part strategy applies HDF5 to store high-dimensional numeric data efficiently, while storing sufficient metadata in XML to reconstruct the design of the experiment. This is smart, because with modern data acquisition technology, you're often dealing with volumes of data where attention to efficiency is required. But, experimental design, the place where creativity directly intersects with science, is a rapidly moving target. The only hope of capturing that kind of data is a flexible semi-structured representation.

This approach is very reminiscent of, if a bit more sophisticated than, something that was tried in the Baliga lab called EMI-ML, which was roughly along the same lines except that the numeric data was stored in tab-separated text files rather than HDF5. Or to put it another way, TSV + XML instead of HDF5 + XML.

Another ISB effort, Addama (Adaptive Data Management) started as a ReSTful API over a content management system and has evolved into a ReSTful service layer providing authentication and search and enabling access to underlying CMS, SQL databases, and data analysis services. Addama has ambitions beyond data management, but shares with SDCube the emphasis on adaptability to the unpredictable requirements inherent in research by enabling software to reflect the design of individual experiments.

There's something to be said for these hybrid approaches. Once you start looking, you see a similar pattern in lots of places.

  • NoSQL - SQL hybrids
  • SQL - XML hybrids. SQL-Server, for one, has great support for XML enabling XQuery and XPath mixed with SQL.
  • Search engines, like Solr, are typically used next to an existing database
  • Key/value tables in a database (aka Entity–attribute–value)

Combining structured and semi-structured data allows room for flexibility where you need it, while retaining RDBMS performance where it fits. Using HDF5 adapts the pattern for scientific applications working with vectors and matrices, structures not well served by either relational or hierarchical models. Where does that leave us in biology with our networks? I don't know whether HDF5 can store graphs. Maybe we need a triple hybrid relational-matrix-graph database.

By the way, HDF5 libraries exist for several languages. SDCube is in Java. MATLAB can read HDF5. There is an HDF5 package for R, but it seems incomplete.

Relational databases work extremely well for some things. But, flexibility has never been their strong point. They've been optimized for 40 years, but shaped by the transaction processing problems they were designed to solve, they just get awkward for certain uses - graphs, matrices and frequently changing schemas, to name a few.

Maybe, before too long, the big database vendors will go multi-paradigm and we'll see matrices, graphs, key-value pairs, XML and JSON as native data structures that can be sliced and diced and joined at will, right along with tables.

More information

  • Quantitative data: learning to share, an essay by Monya Baker in Nature Methods, describes how "Adaptive technologies are helping researchers combine and organize experimental results."

Sunday, December 11, 2011

Neural Networks

Exercise 4 in Andrew Ng's Machine Learning class is on neural networks. Back in the Cretaceous period, in 1994 or so, in one of the periodic phases of popularity for neural networks, I hacked up some neural network code in Borland Turbo C++ on a screaming 90 MHz Pentium. That code is probably on a 3 1/2 inch floppy in the bottom of a box somewhere in my basement. Back then, I remember being fascinated by the idea that we could get a computer to learn by example.

Neural networks go in and out of style like miniskirts. [cue lecherous graphic.] Some people scoff. Neural networks are just a special case of Bayesian networks, they say, and SVMs have more rigorous theory. Fine. But, it's a bit like seeing an old friend to find that neural networks are respectable again.

To see how neural networks work, let's look at how information flows through them, first in the forward direction.

[Figure from Andrew Ng's course slides]

Forward Propagation

An input example Xt becomes the activation at the first layer, the n x 1 vector a1. Prepending a bias node whose value is always 1, we then multiply by the weight matrix, Theta1. This returns z2, whose rows are the sum of the products of the input neurons with their respective weights. We pass those sums through the sigmoid function to get the activations of the next layer. Repeating the same procedure for the output layer gives the outputs of the neural network.

This could, I think, be fully vectorized in Octave to compute all training examples at once, but I did it in a loop over t, which indexes a single training example, like this (a possible vectorized version is sketched after the loop):

a1 = X(t,:)';
z2 = Theta1 * [1; a1];
a2 = sigmoid(z2);
z3 = Theta2 * [1; a2];
a3 = sigmoid(z3);

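For what it's worth, a fully vectorized version over all m examples might look something like this (one example per row of X):

m  = size(X, 1);
A1 = [ones(m,1) X];               % inputs with a bias column prepended
Z2 = A1 * Theta1';
A2 = [ones(m,1) sigmoid(Z2)];     % hidden layer activations, plus bias column
Z3 = A2 * Theta2';
A3 = sigmoid(Z3);                 % one row of network outputs per example
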
As in the previous cases of linear and logistic regression, neural networks have a cost function to be minimized by moving in short steps along a gradient.

[Figure from Andrew Ng's course slides]

Back propagation

The gradients are computed by back propagation, which pushes the error backwards through the hidden layers. It was the publication of this algorithm in Nature in 1986 that led to the resurgence that I caught the tail end of in the 90's.
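
In outline, the backward pass for a single example looks something like this, reusing the a1, a2, a3 and z2 from the forward pass above (y_t is assumed to be the label expanded to a 0/1 vector, and sigmoidGradient to compute g(z).*(1-g(z))):

% Theta1_grad and Theta2_grad start as zero matrices the same size as Theta1 and Theta2
delta3 = a3 - y_t;                                             % output layer error
delta2 = (Theta2(:,2:end)' * delta3) .* sigmoidGradient(z2);   % error pushed back to the hidden layer
Theta1_grad = Theta1_grad + delta2 * [1; a1]';                 % accumulate gradient contributions
Theta2_grad = Theta2_grad + delta3 * [1; a2]';
% after looping over all examples, divide the accumulated gradients by m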

Cross-validation, regularization and gradient checking

When using neural networks, choices need to be made about architecture, the number of layers and number of units in each layer. Considerations include over-fitting, bias and computational costs. Trying a range of architectures and cross-validating is a good way to make this choice.

The layered approach gives neural networks the ability to fit highly non-linear boundaries, but also makes them prone to over-fitting, so it's helpful to add a regularization term to the cost function that penalizes large weights. Selecting the regularization parameter can be done by cross-validation.

Translating the math into efficient code is tricky and it's not hard to get incorrect implementations that still seem to work. It's a good idea to confirm the correctness of your computations with a technique called gradient checking. You compute the partial derivatives numerically and compare with your implementation.
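
A sketch of the numerical side, assuming a costFunction handle that returns the cost for a given (unrolled) parameter vector theta:

epsilon = 1e-4;
numgrad = zeros(size(theta));
for i = 1:numel(theta)
  perturb    = zeros(size(theta));
  perturb(i) = epsilon;
  numgrad(i) = (costFunction(theta + perturb) - costFunction(theta - perturb)) / (2 * epsilon);
end
% numgrad should agree with the back propagation gradient to several decimal places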

Back in the day, I implemented back-prop twice. Once in my C++ code and again in Excel to check the results.

A place in the toolbox

The progression of ideas leading up to this point in the course is very cleverly arranged. Linear regression starts you on familiar ground and helps introduce gradient descent. Logistic regression adds the simple step of transforming your inputs through the sigmoidal function. Neural networks then follow naturally. It's just logistic regression in multiple layers.

In spite of their tumultuous history, neural networks can be looked at as just another tool in the machine learning toolbox, with pluses and minuses like other tools. The history of the idea is interesting, in terms of seeing inside the sausage factory of science.

Thursday, December 08, 2011

Effective Data Visualizations

Noah Iliinsky spoke at UW, yesterday, on the topic of Effective Visualization. Iliinsky has a new book out, Designing Data Visualizations (review), and served as editor of Beautiful Visualization, both from O'Reilly. And, yay Seattle, he lives here in town and has a degree from UW.

If you had to sum up the talk in a sentence, it would be this: Take the advice from your college technical writing class and apply it to data visualization. Know your audience. Have a goal. Consider the needs, interests and prior knowledge of your readers / viewers. Figure out what you want them to take away. Ask, “who is my audience, and what do they need?” I guess that's more than a sentence.

Encoding data

The human eye is great at perceiving small differences in position. Use position for your most salient features.

Color is often used poorly. Question: Is orange higher or lower than purple? Answer: No! Color is not ordered. However, brightness and saturation are and can be used effectively to convey quantitative information. Temperature is something of an exception, since it is widely understood that blue is cold and red is hot. Also, color is often loaded with cultural meanings - think of black hats and white hats or the political meanings of red, orange or green, boy/girl = blue/pink, etc.

Appropriate encodings by data type

As an example of how to do it right, Iliinsky points to Hipmunk, which crams an enormous amount of data into a simple chart of flights from Seattle to Phuket, Thailand.

We can see departure and arrival time and duration, in both the absolute and relative senses, plus layovers, airline, airport and price. And, you can sort by "Agony", which is cool. They've encoded lengths (of time) as lengths, used text (sparingly) for exact amounts, color to show categorical variables (airline) and iconography to indicate the presence or absence of wireless internet on flights.

The cool chart and the quote about encoding were expropriated from the slides from Iliinsky's talk at Strata. If you want more, there's a video of a related talk on YouTube and a podcast on Letting Data Tell the Story. Tools recommended by Iliinsky include R and ggplot, D3 and Protovis, and Tableau.

Tuesday, December 06, 2011

K-means

The topics in this week's programming exercise in the machine learning class are K-means and PCA.

K-means is a fairly easily understood clustering algorithm. Once you specify K, the number of clusters, and pick some random initial centroids, it's just two steps. First, assign each data point to a cluster according to its nearest centroid. Next, recompute the centroids based on the average of the data points in each cluster. Repeat until convergence. It's that simple.

Vectorizing the two steps is a little tricky. Let's take a look at what we're trying to accomplish in step 1. If we had an m by k matrix with the distances where each element i,j was the distance from data point i to centroid j, we could take the min of each row. Actually, we want the index of the min of each row. That would give the assignments for all m data points in one shot.

For example, say we have just 3 data points (m=3) with 2 features each and 2 centroids, (a,b) and (c,d). How do you get to a distance matrix given X and the centroids?

What I came up with can surely be improved upon, but here it is anyway. I loop over each cluster k, finding the distance between that cluster's centroid and all data points.

The cryptic bsxfun may sound like something republicans wouldn't approve of, but really, it's a bit like the apply functions in R. It can work several ways, but in this case it takes a function, the m by n matrix X, and the 1 by n vector holding the kth centroid. It applies the function to each row of X along with the centroid vector. The result is an m by n matrix whose ith row is the difference between the ith example and the kth centroid. We square that matrix element-wise, then sum along each row to get the vector of m squared distances. After we've filled in the distances for all the centroids, we take the min of each row, returning the index of the nearest centroid.

function idx = findClosestCentroids(X, centroids)
  K = size(centroids, 1);
  m = size(X, 1);

  dist = zeros(m, K);

  % squared distance from every example to centroid k
  for k = 1:K
    dist(:,k) = sum(bsxfun(@(x,mu_k) x-mu_k, X, centroids(k,:)) .^ 2, 2);
  end

  % index of the nearest centroid for each example
  [min_dist, idx] = min(dist, [], 2);
end

There must be a cleaner way to do that. If we looped over m rather than k, we could compute mins one row at a time and never need the whole dist matrix in memory. Maybe there's some magic linear algebra that could efficiently do the whole thing. Anyone wanna clue me in on that?

Luckily, the next step is easier. To recompute the centroids, we're finding the average of each cluster. Again, I used a loop over the k clusters. We grab the subset of data points belonging to each cluster using find.

for k = 1:K
  example_idx = find(idx==k);
  % new centroid is the mean of the examples assigned to cluster k
  centroids(k,:) = sum(X(example_idx,:),1) / size(example_idx,1);
end

With those two steps in place, we're clustering away. One puzzle that comes with this particular algorithm is how to choose K. According to Andrew Ng, the elbow method can be problematic because a clear elbow may not present itself. Letting downstream requirements dictate the choice of K seems to be better.

It's getting pretty clear that the trick to most of these programming exercises is vectorization. The combination of functional programming and vectorized operations is very powerful, but definitely comes at the cost of some brain-strain, at least if you have as few working brain cells as I've got left.

Monday, December 05, 2011

International Open Data Hackathon

This past Saturday, I hung out at the Seattle branch of the International Open Data Hackathon. The event was hosted at the Pioneer Square office of Socrata, a small company that helps governments provide public open data.

A pair of data analysts from Tableau were showing off a visualization for the Washington Post's FactChecker blog called Comparing Job Creation Records. Tableau pays these folks to play with data and make cool visualizations that make their software look good. One does politics and the other does pop culture. A nice gig, if you can get it!

A pair of devs from Microsoft's Open Data Protocol (OData) also showed up. OData looks to be a well thought out set of tools for ReST data services. If I understand correctly, it seems to have grown up around pushing relational data over Atom feeds. They let you define typed entities and associations between them, then do CRUD operations on them. You might call it ReSTful enterprise application integration.

Socrata's OpenData portal has all kinds of neat stuff, from White House staff salaries to radiation contamination measurements to investors who were bilked by Bernie Madoff. 13,710 datasets in all. They're available for download as well as through a nice ReST/JSON API. Socrata's platform runs explore.data.gov and data.seattle.gov, among others.

For example, if you've got reasonably fat pipes, and want to know about building permits in Seattle, fire up R and enter this:

> permits.url <- 'http://data.seattle.gov/api/views/mags-97de/rows.csv'
> p <- read.csv(permits.url)
> head(p)

Socrata follows the rails-ish convention of letting you indicate the return format like a file extension. In this case, we're asking for .csv, 'cause R parses it so easily. You can get JSON, XML, RDF and several other formats.

Let's say you want to know what Seattlites are paying for kitchen remodels. Holy crap, it's appalling how boring and middle-aged I've gotten. Someone, shoot me!

> cost <- as.numeric(gsub('\\$(.*)', '\\1', p$Value))
> a <- cost[ grepl('kitchen', p$Description) & p$Category=="SINGLE FAMILY / DUPLEX" & cost > 0 & cost < 200000 ]
> hist(a, xlab='cost $', main='Distribution of kitchen remodels in Seattle', col=sample(gray.colors(10),10))

You saw it here, first, folks!

> a <- cost[ grepl('kitchen', p$Description) & p$Category=="SINGLE FAMILY / DUPLEX" & cost > 0]
> summary(a)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    500   18000   35000   46290   59770  420000

I dunno whose 420 thousand dollar kitchen that is, but if I find out, I'm coming over for dinner!

Socrata's API offers a JSON based way of defining queries. Several datasets are updated in near real time. There's gotta be loads of cool stuff to be done with this data. Let's hope the government sees the value in cheap and innovative ideas like these and continues funding for data.gov.