Friday, December 23, 2011

Practical advice for applying machine learning

Sprinkled throughout Andrew Ng's machine learning class is a lot of practical advice for applying machine learning. That's what I'm trying to compile and summarize here. Most of Ng's advice centers around the idea of making decisions in an empirical way, fitting to a data-driven discipline, rather than relying on gut feeling.

Training / validation / test

The key is dividing data into training, cross-validation and test sets. The test set is used only to evaluate performance, not to train parameters or select a model representation. The rationale for this is that training set error is not a good predictor of how well your hypothesis will generalize to new examples. In the course, we saw the cross-validation set used to select degrees of polynomial features and find optimal regularization parameters.

Model representation

The representation of the hypothesis, the function h, defines the space of solutions that your algorithm can find. The example used in the class was modeling house price as a function of size. The model tells you what parameters your algorithm needs to learn. If we've selected a linear function, then there are two parameters, the slope and intersect of the line.

Feature selection and treatment

Are the given features sufficiently informative to make predictions? Asking whether a human expert can confidently predict the output given the input features will give a good indication.

At times, it may be necessary to derive new features. Polynomial features, for example, can let linear regression fit non-linear functions. Computing products, ratios, differences or logarithms may be informative. Creativity comes in here, but remember to test the effectiveness of your new features on the cross-validation set.

Features are on different scales may benefit from feature scaling. Mean normalizing and scaling to a standard deviation of one puts features on an even footing.

Gathering data might be expensive. Another option is artificial data synthesis, either creating new examples out of whole cloth or by transforming existing examples. In text recognition, a library of digital fonts might be used to generate examples, or existing examples might be warped or reflected.

Overfitting

Often, a learning algorithm may fit the training data very well, but perform poorly on new examples. This failure to generalize is called overfitting.

The classic example is fitting a high degree polynomial, which can lead to a very curvy line that closely fits a large number of data points. Our hypothesis is complex and might be fitting noise rather than an underlying relationship and will therefore generalize poorly.

One way to combat this problem is to use a simpler model. This is valid, but might be limiting. Another option is regularization, which penalizes large parameter values. This prioritizes solutions fitting the training data reasonably well without curving around wildly.

Regularization can be tuned by plotting training set error and validation set error as functions of the regularization parameter, lambda.

Tuning the trade off between bias vs variance.

The steps we take to improve performance depend on whether our algorithm is suffering from bias or variance. A learning curve is a diagnostic that can tell which of these situations we're in, by plotting training error and validation error as a function of training set size. Look for high training and cross-validation error indicating high bias or a steadily decreasing validation error, with a gap between validation and training error indicating high variance.

Bias

A high bias model has few parameters and may result in underfitting. Essentially we're trying to fit an overly simplistic hypothesis, for example linear where we should be looking for a higher order polynomial. In a high bias situation, training and cross-validation error are both high and more training data is unlikely to help much.

  • find more features
  • add polynomial features
  • increase parameters (more hidden layer neurons, for example)
  • decrease regularization

Variance

Variance is the opposite problem, having lots of parameters, which carries a risk of overfitting. If we are overfitting, the algorithm fits the training set well, but has high cross-validation and testing error. If we see low training set error, with cross-validation error trending downward, then the gap between them might be narrowed by training on more data.

  • more training data
  • reduce number of features, manually or using a model selection algorithm
  • increase regularization

Error analysis

To improve performance of a machine learning algorithm, one helpful step is to manually examine the cases your algorithm gets wrong. Look for systematic trends in the errors. What features would have helped correctly classify these cases?

For multi-step machine learning pipelines, ceiling analysis can help decide where to invest effort to improve performance. The error due to each stage is estimated by substituting labeled data for that stage, revealing how well the whole pipeline would perform if that stage had no error. Stepping through the stages, we note the potential for improvement at each one.

Precision/Recall

It's helpful to have a single number to easily compare performance. Precision and recall and the F1 statistic can help when trying to classify very skewed classes, where one class is rare in the data. Simply taking a percentage of correct classifications can be misleading, since always guessing the more common class means you'll almost always be right.

precision: true positives / predicted positives, predicted positives = true positives + false positives (Of all the patients we predicted to have cancer, what fraction actually has cancer?)

recall: true positives / actual positives, actual positives = true positives + false negatives (Of all the patients that really had cancer, how many did we correctly predict?)

F1 = 2·p·r / (p + r)

Miscellaneous tips

Principle component analysis (PCA) can help by reducing dimensionality of high-dimensional features. Collapsing highly correlated features can help learning algorithms run faster.

Often, incorrectly implemented machine learning algorithms can appear to work, producing no obvious error, but simply converging slower or with more error than a correct implementation. Gradient checking is a technique for checking your work, applied in the class to back propagation, but probably more generally applicable.

The recommended approach

Quickly testing ideas empirically and optimizing developer time is the approach embodied in these steps:

  • First, implement a simple quick and dirty algorithm, plot learning curves, and perform error analysis.
  • Create a list of potential ideas to try to improve performance. Then, start trying promising ideas, using the validation set to test for improvement.
  • Use a learning algorithm with many parameters and many features - low bias. Get a very large training set.

Supporting that last point is the findings of Banko and Brill, 2001.

“It's not who has the best algorithm that wins. It's who has the most data.”

Wednesday, December 14, 2011

SDCube and hybrid data storage

The Sorger lab at Harvard published a piece in the February 2011 Nature Methods that shows some really clear thinking on the topic of designing data storage for biological data. That paper, Adaptive informatics for multifactorial and high-content biological data, Millard et al., introduces a storage system called SDCubes, short for semantically typed data hypercubes, which boils down to HDF5 plus XML. The software is hosted semanticbiology.com.

This two part strategy applies HDF5 to store high-dimensional numeric data efficiently, while storing sufficient metadata in XML to reconstruct the design of the experiment. This is smart, because with modern data acquisition technology, you're often dealing with volumes of data where attention to efficiency is required. But, experimental design, the place where creativity directly intersects with science, is a rapidly moving target. The only hope of capturing that kind of data is a flexible semi-structured representation.

This approach is very reminiscent, if a bit more sophisticated, than something that was tried in the Baliga lab called EMI-ML, which was roughly along the same lines except that the numeric data was stored in tab-separated text files rather than HDF5. Or to put it another way, TSV + XML instead of HDF5 + XML.

Another ISB effort, Addama (Adaptive Data Management) started as a ReSTful API over a content management system and has evolved into a ReSTful service layer providing authentication and search and enabling access to underlying CMS, SQL databases, and data analysis services. Addama has ambitions beyond data management, but shares with SDCube the emphasis on adaptability to the unpredictable requirements inherent in research by enabling software to reflect the design of individual experiments.

There's something to be said for these hybrid approaches. Once you start looking, you see a similar pattern in lots of places.

  • NoSQL - SQL hybrids
  • SQL - XML hybrids. SQL-Server, for one, has great support for XML enabling XQuery and XPath mixed with SQL.
  • Search engines, like Solr, are typically used next to an existing database
  • Key/value tables in a database (aka Entity–attribute–value)

Combining structured and semi-structured data allows room for flexibility where you need it, while retaining RDBMS performance where it fits. Using HDF5 adapts the pattern for scientific applications working with vectors and matrices, structures not well served by either relational or hierarchical models. Where does that leave us in biology with our networks?. I don't know whether HDF5 can store graphs. Maybe we need a triple hybrid relational-matrix-graph database.

By the way, HDF5 libraries exist for several languages. SDCube is in Java. MATLAB can read HDF5. There is an HDF5 package for R, but it seems incomplete.

Relational databases work extremely well for some things. But, flexibility has never been their strong point. They've been optimized for 40 years, but shaped by the transaction processing problems they were designed to solve, and they just get awkward for certain uses, to name some - graphs, matrices and frequently changing schemas.

Maybe, before too long the big database vendors go multi-paradigm and we'll see matrices, graphs, key-value pairs, XML and JSON as native data structures that can be sliced and diced and joined at will right along with tables.

More information

  • Quantitative data: learning to share, an essay by Monya Baker in Nature Methods, describes how "Adaptive technologies are helping researchers combine and organize experimental results."

Sunday, December 11, 2011

Neural Networks

Exercise 4 in Andrew Ng's Machine Learning class is on neural networks. Back in the cretaceous period, in 1994 or so, in on of the periodic phases of popularity for neural networks, I hacked up some neural network code in Borland Turbo C++ on a screaming 90MHz Pentium. That code is probably on a 3 1/2 inch floppy in the bottom of a box somewhere in my basement. Back then, I remember being fascinated by the idea that we could get a computer to learn by example.

Neural networks go in and out of style like miniskirts. [cue lecherous graphic.] Some people scoff. Neural networks are just a special cases of Bayesian networks, they say, and SVMs have more rigorous theory. Fine. But, it's a bit like seeing an old friend to find that neural networks are respectable again.

To see how neural networks work, let's look at how information flows through them, first in the forward direction.

(Andrew Ng)

Forward Propogation

An input example Xt becomes the activation at the first layer, the n x 1 vector a1. Prepending a bias node whose value is always 1, we then multiply by the weight matrix, Theta1. This returns z2, whose rows are the sum of the products of the input neurons with their respective weights. We pass those sums through the sigmoid function to get the activations of the next layer. Repeating the same procedure for the output layer gives the outputs of the neural network.

Implementing this in Octave, I think, could be fully vectorized to compute all training examples at once, but I did this in a loop over t which indexes a single training example, like this:

a1 = X(t,:)';
z2 = Theta1 * [1; a1];
a2 = sigmoid(z2);
z3 = Theta2 * [1; a2];
a3 = sigmoid(z3);

As in the previous cases of linear and logistic regression, neural networks have a cost function to be minimized by moving in short steps along a gradient.

(Andrew Ng)

Back propagation

The gradients are computed by back propagation, which pushes the error backwards through the hidden layers. It was the publication of this algorithm in Nature in 1986, that led to the resurgence that I caught the tail end of in the 90's.

Cross-validation, regularization and gradient checking

When using neural networks, choices need to be made about architecture, the number of layers and number of units in each layer. Considerations include over-fitting, bias and computational costs. Trying a range of architectures and cross-validating is a good way to make this choice.

The layered approach gives neural networks the ability to fit highly non-linear boundaries, but also makes them prone to over-fitting, so it's helpful to add a regularization term to the cost function that penalizes large weights. Selecting the regularization parameter can be done by cross-validation.

Translating the math into efficient code is tricky and it's not hard to get incorrect implementations that still seem to work. It's a good idea to confirm the correctness of your computations with a technique called gradient checking. You compute the partial derivatives numerically and compare with your implementation.

Back in the day, I implemented back-prop twice. Once in in my C++ code and again in Excel to check the results.

A place in the toolbox

The progression of ideas leading up to this point in the course is very cleverly arranged. Linear regression starts you on familiar ground and helps introduce gradient descent. Logistic regression adds the simple step of transforming your inputs through the sigmoidal function. Neural networks then follow naturally. It's just logistic regression in multiple layers.

In spite of their tumultuous history, neural networks can be looked at as just another tool in the machine learning toolbox, with pluses and minus like other tools. The history of the idea is interesting, in terms of seeing inside the sausage factory of science.

Thursday, December 08, 2011

Effective Data Visualizations

Noah Iliinksy spoke at UW, yesterday, on the topic of Effective Visualization. Iliinksy has a new book out, Designing Data Visualization (review), and served as editor of Beautiful Visualization, both from O'Reilly. And, yay Seattle, he lives here in town and has a degree from UW.

If you had to sum up the talk in a sentence, it would be this: Take the advice from your college technical writing class and apply it to data visualization. Know your audience. Have a goal. Consider the needs, interests and prior knowledge of your readers / viewers. Figure out what do you want them to take away. Ask, “who is my audience, and what do they need?” I guess that's more than a sentence.

Encoding data

The human eye is great at perceiving small differences in position. Use position for your most salient features.

Color is often used poorly. Question: Is orange higher or lower than purple? Answer: No! Color is not ordered. However, brightness and saturation are and can be used effectively to convey quantitative information. Temperature is something of an exception, since it is widely understood that blue is cold and red is hot. Also, color is often loaded with cultural meanings - think of black hats and white hats or the political meanings of red, orange or green, boy/girl = blue/pink, etc.

Appropriate encodings by data type

Click to expand this handy chart!

As an example of how to do it right, Iliinsky points to Hipmunk, which crams an enormous amount of data into a simple chart of flights from Seattle to Phuket, Thailand.

We can see departure and arrival time and duration, in both the absolute and relative senses, plus layovers, airline, airport and price. And, you can sort by "Agony", which is cool. They've encoded lengths (of time) as lengths, used text (sparingly) for exact amounts, color to show categorical variables (airline) and iconography to indicate the presence or absence of wireless internet on flights.

The cool chart and the quote about encoding, were expropriated from the slides from Iliinksy's talk at Strata. If you want more, there's a video of a related talk on You-Tube and a podcast on Letting Data Tell the Story. Tools recomended by Iliinsky include R and GGPlot, D3 and Protovis, and Tableau.

Tuesday, December 06, 2011

K-means

The topics in this week's programming exercise in the machine learning class are K-means and PCA.

K-means is a fairly easily understood clustering algorithm. Once you specify K, the number of clusters, and pick some random initial centroids, it's just two steps. First, assign each data point to a cluster according to it's nearest centroid. Next, recompute the centroids based on the average of the data points in each cluster. Repeat until convergence. It's that simple.

Vectorizing the two steps is a little tricky. Let's take a look at what we're trying to accomplish in step 1. If we had an m by k matrix with the distances where each element i,j was the distance from data point i to centroid j, we could take the min of each row. Actually, we want the index of the min of each row. That would give the assignments for all m data points in one shot.

For example, say we have just 3 data points (m=3) with 2 features each and 2 centroids, (a,b) and (c,d). How do you get to a distance matrix given X and the centroids?

What I came up with can surely be improved upon, but here it is anyway. I loop over each cluster k, finding the distance between that cluster's centroid and all data points.

The cryptic bsxfun may sound like something republicans wouldn't approve of, but really, it's a bit like the apply functions in R. It can work several ways, but in this case takes a function, a matrix X, m by n, and the n by 1 vector of the kth centroid. It applies the function to each row in X along with the centroid vector. The result is an m by n matrix whose ith row is the difference between the ith example and the kth centroid. We square that matrix, element-wise. Then, sum all the rows to compute the vector of m squared distances. After we've filled in our distances for all the centroids, we take the min, row-wise, returning the index of the nearest centroid.

function idx = findClosestCentroids(X, centroids)
  K = size(centroids, 1);
  m = size(X,1)

  idx = zeros(m, 1);
  dist = zeros(size(X,1), K);

  for k = 1:K
    dist(:,k) = sum(bsxfun(@(x,mu_k) x-mu_k, X, centroids(k,:)) .^ 2, 2);
  end

  [min_dist, idx] = min(dist, [], 2);
end

There must be a cleaner way to do that. If we looped over m rather than k, we could compute mins one row at a time and never need the whole dist matrix in memory. Maybe there's some magic linear algebra that could efficiently do the whole thing. Anyone wanna clue me in on that?

Luckily, the next step is easier. To recompute the centroids, we're finding the average of each cluster. Again, I used a loop over the k clusters. We grab the subset of data points belonging to each cluster using find.

for k = 1:K
  example_idx = find(idx==k);
  centroids(k,:) = sum(X(example_idx,:),1) / size(example_idx,1);
end

With those two steps in place, we're clustering away. One puzzle that comes with this particular algorithm is how to choose K. According to Andrew Ng, The elbow method can be problematic because a clear elbow may not present itself. Letting downstream requirements dictate the choice of K seems to be better.

It's getting pretty clear that the trick to most of these programming exercises is vectorization. The combination of functional programming and vectorized operations is very powerful, but definitely comes at the cost of some brain-strain, at least if you have as few working brain cells as I've got left.

Monday, December 05, 2011

International Open Data Hackathon

This past Saturday, I hung out at the Seattle branch of the International Open Data Hackathon. The event was hosted at the Pioneer Square office of Socrata, a small company that helps governments provide public open data.

A pair of data analysts from Tableau were showing off a visualization for the Washington Post's FactChecker blog called Comparing Job Creation Records. Tableau pays these folks to play with data and make cool visualizations that make their software look good. One does politics and the other does pop culture. A nice gig, if you can get it!

A pair of devs from Microsoft's Open Data Protocol (OData) also showed up. OData looks to be a well thought out set of tools for ReST data services. If I understand correctly, it seems to have grown up around pushing relational data over Atom feeds. They let you define typed entities and associations between them, then do CRUD operations on them. You might call it ReSTful enterprise application integration.

Socrata's OpenData portal has all kinds of neat stuff, from White House staff salaries to radiation contamination measurements to investors who were bilked by Bernie Madoff. 13,710 datasets in all. They're available for download as well as through a nice ReST/JSON API. Socrata's platform runs explore.data.gov, data.seattle.gov among others.

For example, if you've got reasonably fat pipes, and want to know about building permits in Seattle, fire up R and enter this:

> permits.url <- 'http://data.seattle.gov/api/views/mags-97de/rows.csv'
> p <- read.csv(permits.url)
> head(p)

Socrata follows the rails-ish convention of letting you indicate the return format like a file extension. In this case, we're asking for .csv, 'cause R parses it so easily. You can get JSON, XML, RDF and several other formats.

Let's say you want to know what Seattlites are paying for kitchen remodels. Holy crap, it's appalling how boring and middle-aged I've gotten. Someone, shoot me!

> cost <- as.numeric(gsub('\\$(.*)', '\\1', p$Value))
> a <- cost[ grepl('kitchen', p$Description) & p$Category=="SINGLE FAMILY / DUPLEX" & cost > 0 & cost < 200000 ]
> hist(a, xlab='cost $', main='Distribution of kitchen remodels in Seattle', col=sample(gray.colors(10),10))

You saw it here, first, folks!

> a <- cost[ grepl('kitchen', p$Description) & p$Category=="SINGLE FAMILY / DUPLEX" & cost > 0]
> summary(a)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    500   18000   35000   46290   59770  420000

I dunno who's 420 thousand dollar kitchen that is, but if I find out, I'm coming over for dinner!

Socrata's API offers a JSON based way of defining queries. Several datasets are updated in near real time. There's gotta be loads of cool stuff to be done with this data. Let's hope the government sees the value in cheap and innovative ideas like these and continues funding for data.gov.

Wednesday, November 30, 2011

Support Vector Machines

Week 7 of Andrew Ng's Machine Learning class covers support vector machines, pragmatically from the perspective of calling a library rather than implementation. I've been wanting to learn more about SVMs for quite a while, so I was excited for this one.

A support vector machine is a supervised classification algorithm. Given labeled training data, typically high-dimensional vectors, SVM finds the maximum-margin hyperplane separating the positive and negative examples. The algorithm selects a decision boundary that does the best job of separating the classes with the extra stipulation that the boundary be as far as possible from the nearest samples on either side. This is where the large margin part comes from.

(from Andrew Ng's ml-class.)

The cost function used with SVMs is a slightly modified version of that used with logistical regression:

With SVMs, we replace the sigmoid functions with linearized version called cost1 and cost2.

Some error is accepted, allowing for misclassification of some training examples in the interest of getting the majority correct. The parameter C acts as a form of regularization, specifying tolerance for training error.

But, what if the boundary between classes is non-linear, like the one shown here?

(from Andrew Ng's ml-class.)

The SVM algorithm generalizes to non-linear cases with the aid of kernel functions. A straight line in n dimensions, a hyper-plane, can be viewed as a linear kernel. The other widely used class of kernel functions is the guassian kernel. It's my understanding that the kernel function maps a non-linear boundary in the problem space to a linear boundary in a higher dimensional space.

(from the wikipedia entry for support vector machine.)

The SVM algorithm is sped up by a performance hack called the kernel trick, which I understand just in general outline: The kernel trick is a way of mapping observations into a higher dimensional space V, without ever having to compute the mapping explicitly. The trick is to use learning algorithms that only require dot products between the vectors in V, and choose the mapping such that these high-dimensional dot products can be computed within the original space, by means of a kernel function.

There is some equivalence between SVMs and neural networks that I don't quite grasp. The process of computing the kernel function on the input vectors is something like the hidden layer of the neural network, which transforms and weighs the input features. I'm not sure if the analogy between SVMs and ANNs goes deeper. Also by virtue of the kernels, SVMs are a member of a more general class of statistical algorithms called kernel methods.

The exercise was to build a spam filter based on a small subset of the SpamAssassin public corpus of 6047 messages, of which roughly a third are spam. I trained an SVM and tried in on email from my spam-magnet yahoo email address, and it worked!

So, I guess the up-shot is that I'm still a little hazy on SVMs, if a bit less so than before. If I really want to know more, there's the source code that came with the homework. Or, I could read A training algorithm for optimal margin classifiers.

More SVM links

2012 conference dates

Here are a few conferences for 2012 in computing or bioinformatics:

Also, here's the full list of O'Reilly conferences. Last year's Strata on big data/data science was really fun.

Biovis, not to be confused with VizBi (see above) is part of VisWeek and will be in Seattle next October.

We'll soon be adding our own Systems Bioinformatics Workshop to the schedule, probably sometime in the Summer. Hope to see you there.

Thursday, November 10, 2011

Matrix arithmetic

Here are a couple bits of basic linear algebra that'll come in handy in the Machine Learning class.

How to multiply matrices

Matrix Identities

If X and Y are two vectors of length m:

If X is (m x n) matrix and Y is (n x 1) vector

Paul's Online Math Notes on Linear Algebra

Wednesday, October 26, 2011

PostgreSQL cheat sheet

A handy cheat-sheet for the PostgreSQL database, for when I'm too lazy to dig through the docs or find another cheat-sheet.

Start and stop server

sudo su postgres -c '/opt/local/lib/postgresql/bin/postgres -D /opt/local/var/db/postgres/defaultdb'
cnt-z
bg

or

su -c 'pg_ctl start -D /opt/local/var/db/postgres/defaultdb -l postgreslog' postgres

To shutdown

sudo su postgres -c 'pg_ctl stop -D /opt/local/var/db/postgres/defaultdb'

Run client

psql -U postgres
sudo -u postgres psql

Commands

Postgres commands start with a backslash '\' character. Type \l to list databases and \c to connect to a database. \? shows help and \q quits. \d lists tables, views and sequences, \dt lists tables.

Granting access privileges

create database dbname;
create user joe_mamma with password 'password';
grant all privileges on database dbname to joe_mamma;
grant all privileges on all tables in schema public to joe_mamma;
grant all privileges on all sequences in schema public to joe_mamma;

See the docs for GRANT.

SQL dump and restore

pg_dump -U postgres dbname | gzip > dbname.dump.2011.10.24.gz
gunzip < dbname.dump.2011.10.24.gz | sudo -u postgres psql --dbname dbname

For more, see Backup and Restore from the Postgres manual.

Truncate

Delete all data from a table and related tables.

truncate my_table CASCADE;

Sequences

Sequences can be manipulated with currval and setval.

select currval('my_table_id_seq');
select setval('my_table_id_seq',1,false);

Trouble-shooting

If you seen an Ident authentication error...

FATAL:  Ident authentication failed for user "postgres"

... look in your pg_hba.conf file. Ask Postgres where this file is by typing, "show hba_file;".

sudo cat /etc/postgresql/9.0/main/pg_hba.conf

You might see a line that looks like this:

local  all  all      ident

What the 'ident' means is postgres uses your shell account name to log you in. Specifying the user on the command line "psql -U postgres" doesn't help. Either change "ident" in the pg_hba.conf to "md5" or "trust" and restart postgres, or just do what it wants: "sudo -u postgres psql". More on this can be found in “FATAL: Ident authentication failed”, or how cool ideas get bad usage schemas.

Machine-learning: gradient descent

The first section of Andrew Ng's Machine Learning class is about applying gradient descent to linear regression problems.

Andrew Ng

Our input data is an m-by-n matrix X, where we have m training examples with n features each. For these training examples, we know the expected outputs y where y is the variable we're trying to predict. We want to find a line defined by the parameter vector ϴ that minimizes the squared error between the line and our data points.

Gradient descent takes a cost function, which is the squared error of the prediction vs. the training data. I think the 2 in the denominator is there so that it cancels out when we take the derivative, leaving us with a simpler gradient function.

The update rule for each ϴj is the partial derivative of the cost function with respect to ϴj.

Part of the challenge is converting this to matrix notation, to take advantage of fast matrix arithmetic algorithms.

Next we vectorize the update rule and show how to compute least squared error directly with the normal equation.

In action, gradient descent gradually approaches optimal values for ϴ. How gradual depends on the learning rate, α.

While the classwork was done in Octave, I also did a simple gradient descent implementation in R.

Saturday, October 22, 2011

Octave cheat sheet

I'm mucking about with Octave, MATLAB's open source cousin, as part of Stanford's Machine Learning class. Here are a few crib notes to keep me right side up.

The docs for Octave must be served from a Commodore 64 in Siberia judging by the speed, but Matlab's Function Reference is convenient. The Octave WikiBook covers a lot of the basics.

Matrices

Try some matrix operations. Create a 2x3 matrix, and a 3x2 matrix. Multiply them to get a 2x2 matrix. Try indexing.

>> A = [1 2 3; 4 5 6]
A =
   1   2   3
   4   5   6

>> B = 2 * ones(3,2)
B =
   2   2
   2   2
   2   2

>> size(B)
ans =
   3   2

>> A * B  % matrix multiplication
ans =
   12   12
   30   30

>> who    % list variables
A    B    ans

>> A(2,3) % get row 2, column 3
ans =  6

>> A(2,:) % get 2nd row
ans =
   4   5   6

>> A'     % A transpose
ans =
   1   4
   2   5
   3   6

>> A' .* B  % element-wise multiply
ans =
    2    8
    4   10
    6   12

Sum

sum(A,dim) is a little bass-ackwards in that the columns are dimension 1, rows are dimension 2, contrary to R and common sense.

>> sum(A,2)
ans =
    6
   15

Max

The max function operates strangely. There are at least 3 forms of max.

[C,I] = max(A)
C = max(A,B)
[C,I] = max(A,[],dim)

For max(v), if v is a vector, returns the largest element of v. If A is an m x n matrix, max(A) returns a row vector of length n holding the largest element from each column of A. You can also get the indices of the largest values in the I return value.

To get the row maximums, use the third form, with an empty vector as the second parameter. Oddly, setting dim=1 gives you the max of the columns, while dim=2 gives the row maximums.

Navigation and Reading data

Perform file operations with Unix shell type commands: pwd, ls, cd. Import and export data, like this:

>> data = csvread('ex1data1.txt');
>> load binary_file.dat

Printing output

The disp function is Octave's word for 'print'.

disp(sprintf('pi to 5 decimal places: %0.5f', pi))

Histogram

Plot a histogram for some normally distributed random numbers

>> w = -6 + sqrt(10)*(randn(1,10000))  % (mean = 1, var = 2)
>> hist(w,40)

Plotting

Plotting

t = [0:0.01:0.99];
y1 = sin(2*pi*4*t); 
plot(t,y1);
y2 = cos(2*pi*2*t);
hold on;         % "hold off" to turn off
plot(t,y2,'r');
xlabel('time');
ylabel('value');
legend('sin','cos');
title('my plot');
print -dpng 'myPlot.png'
close;           % or,  "close all" to close all figs

Multiple plots in a grid.

figure(2), clf;  % select figure 2 and clear it
subplot(1,2,1);  % Divide plot into 1x2 grid, access 1st element
plot(t,y1);
subplot(1,2,2);  % Divide plot into 1x2 grid, access 2nd element
plot(t,y2);
axis([0.5 1 -1 1]);  % change axis scale

heatmap

figure;
imagesc(magic(15)), colorbar

These crib notes are based on the Octave tutorial from the ml class by Andrew Ng. Also check out the nice and quick Introduction to GNU Octave. I'm also collecting a few notes on matrix arithmetic.

Defining a function

function ret = test(a)
  ret = a + 1;
end

More

Tuesday, October 18, 2011

Python cheat sheet

The most important docs at python.org are the tutorial and library reference.

Pointers to the docs

Important modules: sys, os, os.path, re, math, io

Inspecting objects

>>> help(obj)
>>> dir(obj)

List comprehensions and generators

numbers = [2, 3, 4, 5, 6, 7, 8, 9]
[x * x for x in numbers if x % 2 == 0]

Generators might be thought of as lazy list comprehensions.

def generate_squares():
  i = 1
  while True:
    yield i*i
    i += 1

Main method

def main():
    # do something here

if __name__ == "__main__":
    main()

See Guido's advice on main methods. To parse command line arguments use argparse instead o the older optparse or getopt.

Classes

The tutorial covers classes, but know that there are old-style classes and new-style classes.

class Foo(object):
  'Doc string for class'

  def __init__(self, a, b):
    'Doc string for constructor'
    self.a = a
    self.b = b
  
  def square_plus_a(self, x):
    'Doc string for a useless method'
    return x * x + a
    
  def __str__(self):
    return "Foo: a=%d, b=%d" % (self.a, self.b)

Preoccupation with classes is a bit passé these days. Javascript objects are just bags of properties to which you can add arbitrary properties whenever you feel like it. In Ruby, you might use OpenStruct. It's quite easy in Python. You just have to define your own class. I'll follow the convention I've seen elsewhere of creating an empty class called Object derived from the base object. Why you can't set attributes on an object instance is something I'll leave to the Python gurus.

class Object(object):
    pass

obj = MyEmptyClass()
obj.foo = 123
obj.bar = "A super secret message
dir(obj)
['__doc__', '__module__', 'bar', 'foo']

You can add methods, too, but they act a little funny. Self doesn't seem to work.

Files

Reading text files line by line can be done like so:

with open(filename, 'r') as f:
    for line in f:
        dosomething(line)

Be careful not to mix iteration over lines in a file with readline().

Exceptions

try:
  raise Exception('My spammy exception!', 1234, 'zot')
except Exception as e:
  print type(e)
  print e
finally:
  print "cleanup in finally clause!"

Traceback prints stack traces.

Logging

import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG,
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

Conditional Expressions

Finally added in Python 2.5:

x = true_value if condition else false_value

Packages, Libraries, Modules

What do they call them in Python? Here's a quick tip for finding out where installed packages are:

python -c 'import sys, pprint; pprint.pprint(sys.path)'

To find out what packages are installed, open a python shell and type:

help('modules')

Magic methods

Python's magic methods are the source of much confusion. Rafe Kettler's Guide to Python's Magic Methods sorts things out beautifully.

Saturday, October 08, 2011

Stanford Machine Learning class

Stanford is offering a free online version of it's Machine Learning class taught by Andrew Ng. Study groups are popping up everywhere. Cool!

The class officially starts Monday, October 10th, but the first few lectures are up already, broken into bite size pieces of 10 minutes or so. What I've seen so far is at a basic level, covering a course introduction and terminology. Ng then posses a linear regression problem.

We want to find a line y = ϴ0 + ϴ1 x such that we minimize the squared error between our line and our data points.

The solution is our first learning algorithm, gradient descent.

More about the Machine Learning class

The real class at Stanford is: CS229. Exercises are to be done in Octave. Recommended reading includes the usual suspects:

  • Pattern Recognition and Machine Learning, Christopher Bishop
  • Machine Learning, Tom Mitchell
  • The Elements of Statistical Learning, Hastie, Tibshirani and Friedman

Several of the Primers in Computational Biology series would probably make for good supplementary material.

There are threads related to the class on Quora and Reddit, for whatever that's worth. Also, see some good resources for learning about machine learning.

Monday, September 26, 2011

Hipster programming languages

If you look at the programming languages that are popular these days, a few patterns emerge. I'm not talking about languages that have the most hits on the job sites. I'm talking about what the cool kids are coding in - the folks that hang out on hacker-news or at Strange Loop. Languages like Clojure, Scala and CoffeeScript. What do these diverse languages have in common other than an aura of geek-chic?

  • Functional programming is emphasized over object-oriented programming.
  • Common patterns for manipulating lists: map, filter, reduce.
  • Modern syntax in which everything is an expression and syntactic noise like semicolons is reduced.
  • CoffeeScript compiles to JavaScript, while both Clojure and Scala target the JVM. Targeting legacy platforms seems to be getting easier and more popular.
  • Innovative approaches to concurrency.

This last point deserves some elaboration. It might be a stretch to compare the event-driven nature of node.js with immutable data structures with actor model and software transactional memory, but, at heart, these are all strategies for dealing with concurrency. One place where Java was ahead of its peers was concurrency, so it's cool that Clojure and Scala are taking the next steps in concurrent programming on the JVM.

CoffeeScript

CoffeeScript is javascript, redesigned. It cleans up the syntax adding many of the niceties familiar from Ruby and Python. Curly braces and semicolons are out. String interpolation, list-comprehensions, default arguments, and more tasty sugar are in. Its creator, Jeremy Ashkenas, believes in code as literature and it shows all the way through the project. Take a look at the annotated source code for the CoffeeScript grammar and see if it doesn't make you weep for the ugliness of your own code.

Don't forget, CoffeeScript gets about 10 times hipper when combined with node.js, the event-driven app-server based on google's V8 javascript engine.

Clojure

Clojure is a dialect of lisp targeting the JVM. One of Clojure's key features is a set of immutable functional data structures, with efficiency comparable to Java's mutable collection classes. The beauty of a Lisp dialect is that the syntax of the language is also it's data representation. Code is data, data is code. There are lots of Clojure resources floating around, including a famous talk by Rich Hickey on state, identity, value and time and a project to port SICP to Clojure. Extra points if your client-side code is written in ClojureScript.

Several hip people have recommended reading The Joy of Clojure. Also on Rich Hickey's bookshelf is Chris Okazaki's book Purely Functional Data Structures.

Scala

Scala is a strongly typed object-functional hybrid. It's targets include the JVM and Microsoft's CLR. It's an academic language derived from the ML family, but meant to be a pragmatic replacement for Java. It has a C++ like reputation for being fully understood only by guru level developers. One of it's key features is a type system that is Turing complete in itself. I guess I'm not completely convinced that a rocket-science type system is the answer, but it's there's cool stuff in there - generics done properly, higher-kinded types, which as near as I can tell takes parametric types to a level of meta beyond generics. One nice thing is that Scala has a tighter mapping to Java than Clojure so the interop between the two is a little more reasonable.

The Akka project is a Scala platform for concurrent applications, providing both the actor model and software transactional memory. Those wanting to learn more can track down some interesting talks by language designer Martin Odersky available, plus the Scala Types podcast.

A couple more

Not enough languages for you? I'll throw in a couple more hip languages, R and Haskell. Truly cool kids know Haskell. What can I say, except that I am not yet that cool. I need to go out to a shed in the woods with a couple of books and learn me some Haskell.

R may have a bastardized syntax, but, eventually, it's functional core shines through. R is seeing a surge in popularity based on the highly hip and trendy field of data science, where it's powerful statistical methods and graphing come in handy. Aside from mclapply, R is a bit lacking in support for concurrency. [See correction below in comments!] Rhipe and Revolution Analytics's RHadoop are trying to change that by enabling distributed data analysis with R and Hadoop.

Fashion

You might be tempted to say it's all fashion. What goes around comes around. To some extent that's true, but, in each of these languages, there's something new and worthwhile to be learned. We have a ways to go before code is as expressive as we want it to be. Someone smart said that you'll like a programming language in proportion to what it teaches you. Mostly, I want to remind myself to set aside some time to play with these languages and see what new tricks they have to teach this old dog.

PS: When this post grows up, it wants to be Programming language development: the past 5 years by the very hip Michael Fogus.

Friday, September 23, 2011

Network Science

Network analysis is hip. Applications range over social networks, security, biology, and economics. At this point, you'll hardly be the first one to the party, but if you want to give network science a try, here's a random grab-bag of resources to get started.

Coordination of frontline defense mechanisms under severe oxidative stress, Kaur et al. 2010

Learning network science

Jon Kleinberg, a professor of computer science at Cornell University, co-wrote Networks, Crowds, and Markets: Reasoning About a Highly Connected World along with David Easley. He also wrote Algorithm Design, an undergraduate textbook.

A 2004 review paper by Barabasi and Oltvai Network biology: understanding the cell's functional organization. covers a broad range of applications of networks in modern biology. Barabasi is also author of Linked.

A Science special issue on networks, from July 2009, revisits the foundations of network analysis, and delves into applications to ecological interactions, counter-terrorism, and finance.

Video and slides are available for Drew Conway's presentation on social network analysis in R, which mostly focuses on software tools.

Tools for analyzing networks

Software tools for working with networks include the R packages graph, igraph, network. Also, the NetworkX library for Python looks quite powerful. Visualization tools tend to come and go, but some well-known tools are: Cytoscape, Gephi, and GraphViz.

More network stuff

Monday, September 19, 2011

Applying control theory to the cell

In a talk about research goals in the systems biology of microbes, Adam Arkin referenced the Internal Model Principle of control theory. Here are a couple definitions.

A regulator for which both internal stability and output regulation are structurally stable properties must utilize feedback of the regulated variable and incorporate in the feedback loop a suitably reduplicated model of the dynamic structure of the exogenous signals which the regulator is required to process.

Towards an Abstract Internal Model Principle Wonham, 1976

That's a mouthful. This one's a little less scary.

Internal Model Principle: control can be achieved only if the control system encapsulates, either implicitly or explicitly, some representation of the process to be controlled.

Lecture notes on Introduction to Robust Control by Ming T. Tham, 2002

Driving this thinking is the discovery that microbes show anticipatory behavior and the associations can be fairly readily entrained and lost in a few generations. Ilias Tagkopoulos and Saeed Tavazoie, in Predictive behavior within microbial genetic networks, demonstrated associative learning through rewiring gene regulatory networks. It turns out that when E. coli senses a shift to mammalian body temperature, it begins the transition to anaerobic metabolism, nicely anticipating the correlated structure of it's environment.

In another example, Amir Mitchell working at Weizmann, showed that yeast anticipates the stages of fermentation in Adaptive prediction of environmental changes by microorganism.

This raises some important questions. How is the internal model encoded within the cell? And how does the cell acquire, parameterize and adjust its internal model over evolutionary time scales? The answers will lead to a deeper understanding of living systems and might even feed new techniques and principles back to control theory.

An interesting challenge will be to experimentally read out the information embedded in the cell's control systems and then the informatics problem of how to represent and work with such things.

Understanding how this works is a prerequisite for re-engineering living systems, otherwise known as synthetic biology, championed by George Church and Drew Endy. This month, by the way, the journal Science has a special issue on synthetic biology.

I'm fascinated by the idea of applying engineering principles to biology - evolved systems, rather than engineered artifacts. Maybe that's because my spaghetti code looks a lot like the messy interconnectedness of biology. Creating software feels organic, rather than wholly predesigned. The engineering of complex software systems tends to be an adaptive evolutionary process. As messy as biology is, modularity naturally emerges. Maybe biology has something to teach us about organizing this chaos.

Thursday, August 25, 2011

String functions in R

Here's a quick cheat-sheet on string manipulation functions in R, mostly cribbed from Quick-R's list of String Functions with a few additional links.

  • substr(x, start=n1, stop=n2)
  • grep(pattern,x, value=FALSE, ignore.case=FALSE, fixed=FALSE)
  • gsub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE)
  • gregexpr(pattern, text, ignore.case=FALSE, perl=FALSE, fixed=FALSE)
  • strsplit(x, split)
  • paste(..., sep="", collapse=NULL)
  • sprintf(fmt, ...)
  • toupper/tolower(x)
  • nchar(x)

Also see Regular Expressions as used in R and R String processing.

Note: Just to be clear, R is far from an ideal platform for processing text. For anything where that's the major concern, you're better off going to Python or Ruby.

Monday, August 15, 2011

MySQL and R

Using MySQL with R is pretty easy, with RMySQL. Here are a few notes to keep me straight on a few things I always get snagged on.

Typically, most folks are going to want to analyze data that's already in a MySQL database. Being a little bass-ackwards, I often want to go the other way. One reason to do this is to do some analysis in R and make the results available dynamically in a web app, which necessitates writing data from R into a database. As of this writing, INSERT isn't even mentioned in the RMySQL docs, sadly for me, but it works just fine.

The docs are a bit clearer for RS-DBI, which is the standard R interface to relational databases and of which RMySQL is one implementation.

Opening and closing connections

The best way to close DB connections, like you would do in a finally clause in Java, is to use on.exit, like this:

con <- dbConnect(MySQL(),
         user="me", password="nuts2u",
         dbname="my_db", host="localhost")
on.exit(dbDisconnect(con))

Building queries

Using sprintf to build the queries feels a little primitive. As far as I can tell, there's no prepared statements in RMySQL. I don't suppose SQL-injection is a concern here, but prepared statements might be a little tidier, anyway.

Processing query results

You can process query results row by row, in blocks or all at once. The highly useful function dbGetQuery(con, sql) returns all query results as a data frame. With dbSendQuery, you can get all or partial results with fetch.

con <- dbConnect(MySQL(), user="network_portal", password="monkey2us", dbname=db.name, host="localhost")
rs <- dbSendQuery(con, "select name from genes limit 10;")
data <- fetch(rs, n=10)
huh <- dbHasCompleted(rs)
dbClearResult(rs)
dbDisconnect(con)

If there's no more results, fetch returns a data frame with 0 columns and 0 rows. dbHasCompleted is supposed to indicate whether there are more records to be fetched, but seems broken. The value of huh in the code above is false, which seems wrong to me.

Retrieving AUTO_INCREMENT IDs

A standard newbie question with MySQL is how to retrieve freshly generated primary keys from AUTO_INCREMENT fields. That's what MySQL's LAST_INSERT_ID() is for.

You can retrieve the most recent AUTO_INCREMENT value with the LAST_INSERT_ID() SQL function or the mysql_insert_id() C API function. These functions are connection-specific, so their return values are not affected by another connection which is also performing inserts.

The same works with RMySQL, but there are some traps to watch out for. Let's say you're inserting a row into a table of networks. Don't worry about the specifics. You want to insert related data in another table, so you need the ID of the newly inserted row.

create.network <- function(species.id, network.name, data.source, description) {
  
  con <- dbConnect(MySQL(),
           user="super_schmuck", password="nuts2u",
           dbname="my_db", host="localhost")
  on.exit(dbDisconnect(con))

  sql <- sprintf("insert into networks
                  (species_id, name, data_source, description, created_at)
                  values (%d, '%s', '%s', '%s', NOW());",
                 species.id, network.name, data.source, description)
  rs <- dbSendQuery(con, sql)
  dbClearResult(rs)

  id <- dbGetQuery(con, "select last_insert_id();")[1,1]

  return(id)
}

Don't forget to clear the result of the insert. If you do, you'll get 0 from the last_insert_id(). Also, using dbGetQuery for the insert produces an strange error when you go to call last_insert_id:

Error in mysqlExecStatement(conn, statement, ...) : 
  RS-DBI driver: (could not run statement: Commands out of sync; you can't run this command now)

Alternatively, you can also combine both SQL statements into one call to dbSendQuery, but, you have to remember to set a flag when you make the connection: client.flag=CLIENT_MULTI_STATEMENTS. Trying to use multiple queries seems not to work with dbGetQuery.

create.network <- function(species.id, network.name, data.source, description) {

  con <- dbConnect(MySQL(),
           user="super_schmuck", password="nuts2u",
           dbname="my_db", host="localhost",
           client.flag=CLIENT_MULTI_STATEMENTS)
  on.exit(dbDisconnect(con))

  sql <- sprintf("insert into networks
                  (species_id, name, data_source, description, created_at)
                  values (%d, '%s', '%s', '%s', NOW());
                  select last_insert_id();",
                 species.id, network.name, data.source, description)

  rs <- dbSendQuery(con, sql)

  if (dbMoreResults(con)) {
    rs <- dbNextResult(con)
    id <- fetch(rs)[1,1]
  } else {
    stop('Error getting last inserted id.')
  }

  dbClearResult(rs)

  return(id)
}

Any effort saved by combining the SQL queries is lost in the extra house-keeping so I prefer the first method.

In spite of these few quirks, RMySQL generally works fine and is pretty straightforward.

Wednesday, August 03, 2011

Microarrays

Microarrays are one of the workhorses of modern biology. Measuring transcript levels enables studies of differential expression - asking what the difference is, at the gene expression level, for example, between cancer tumor cells and normal cells.

Bruz Marzolf, who up 'til recently ran my local microarray facility, spoke recently, tracing the journey of microarrays through the full technology life-cycle, starting in 1995 with the publication of Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray in Science. Bruz put microarrays in the category of a utility technology, but not quite to the point of commoditization as there remain major differences between manufacturers.

  • Affymetrix, first to commercialize microarray technologies, is the 800 pound gorilla. Their photolithography process borrows from computer chip manufacturing and their standardized probe sets are well supported by tools such as Bioconductor. The technology is robust but producing the masks is quite expensive, thus custom arrays are not economical.
  • Agilent, which spun out of HP, uses ink-jet technology. Custom arrays can be designed using Agilent's eArray software. Agilent arrays come in a variety of resolutions including 8x60k, 1x244k and 1x1m with 60mer probes.
  • Illumina builds arrays out of beads coated in oligo probes. Beads are laid out randomly on the slides, necessitating a layout discovery step. These chips have extra redundancy to account for randomness in bead-probe count.
  • Nimblegen's maskless photolithography process is more flexible for custom arrays. Nimblegen provides arrays in 385K, 4x72K, and 12x135K resolutions using 60mer probes. They emphasize high array-to-array data reproducibility.

As an aside, our group uses custom spotted arrays and Agilent arrays. We tried Nimblegen and found that inter-array consistency was excellent, but inter-probe consistency was not. Below we see the ribosomal RNAs and adjacent genes with total RNA measured by a custom Agilent array (in blue) plotted next to a custom Nimblegen array (in green). To be fair there might be other explanations for what we saw, but it certainly looks like there is significant variability between probes that we would expect to have identical readings.

In RNA-Seq: a revolutionary tool for transcriptomics (Nature Reviews Genetics, 2009), Zhong Wang, Mark Gerstein & Michael Snyder show this comparison between microarrays and RNA-Seq.

While RNA-seq, no doubt, has a higher dynamic range, does it really have less noise? Some folks say so. With tens or hundreds of thousands of probes, fairly dense coverage of whole microbial genomes is possible. If you know what you're looking for, microarrays are still cheaper. Discovery oriented work is going increasingly toward sequencing.

Links

Friday, July 08, 2011

Notes on Engineering Data Analysis (with R and ggplot2)

Hadley Wickham gave a Google Tech Talk a couple weeks back titled Engineering Data Analysis (with R and ggplot2). These are my notes.

The data analysis cycle is to iteratively transform, visualize and model. Leading into the cycle is data access and the output of the process is knowledge, insight and understanding which can be communicated to others. Transforming the data is almost always necessary to bring data into a workable form. Visualization and modeling have something of a duality where visualization is good at revealing the unexpected but has problems scaling. Models scale better, but will only find expected relationships. A larger cycle comes about when answers to one question lead to more questions.

Hadley makes a case for data analysis in code, rather than GUIs and for R in particular. Working in a programming language gives you a means of:

  • reproducibility
  • automation
  • version control
  • communication

Advantages of R:

  • open source
  • runs anywhere
  • well established community
  • huge library of packages
  • connectivity to other languages

Downsides of R are it's learning curve, strangeness relative to other programming languages, lack of programming infrastructure and prickliness of the community. R scales well up to about a million observations. How to scale the interactive analysis cycle up to billions of observations is an open question. Programming infrastructure is an area where programmers can contribute.

DSLs help express and think clearly about common problems in data analysis. Hadley views his libraries as DSLs (domain specific languages) within R for the phases of the analysis cycle. For visualization, there's ggplot2. DSLs align nicely with ggplot's philosophy as a grammar of graphics. R's model formula is the DSL for modeling. Plyr is the DSL for data transformation.

The four key verbs of data transformation are:

  • subset
  • mutate
  • arrange
  • summarize

...plus...

  • group by
  • join

Data can be divided by subsetting or filtering; mutated, for example adding new columns to a table that are functions of other columns; rearranged or sorted; and summarized, condensing a data set down to a smaller number of values. These actions can be combined with a group by operator. Finally, data sets can be joined to other related data sets.

The second half of a talk is a case study, dissecting a set of cause-of-death statistics from the Mexican government. Finally, Hadley makes a familiar sounding point about the tension between making new things and making well-engineered user-friedly software that does old things.