Monday, June 05, 2017

Postgres lockup on ALTER TABLE

You can learn a lot when you break things.

I had a bit of fun and panic with Postgres the other day while trying to do the equivalent of this innocuous-seeming drop-column query:

ALTER TABLE aschema.sometable DROP COLUMN IF EXISTS unwanted_column;

So innocuous was this query that the column didn't even exist in production. It had already been dropped. I was checking in an Alembic migration to keep staging and dev DBs in line with prod. Harmless, right?

The deploy went live and things started going from green to red. Uh oh.

Well, it turns out that even an inconsequential ALTER TABLE operation takes an exclusive lock on the table. ...which should be totally fine, unless some long-running process has taken out an AccessShareLock and is sitting on it. There's a nice write-up on this exact situation here:

The [ALTER TABLE operation] tries to take the ACCESS EXCLUSIVE it needs, and it queues up behind the first lock. Now all future lock requests queue up behind the waiting ACCESS EXCLUSIVE request. Conceptually, incoming lock requests which are compatible with the already-granted lock could jump over the waiting ACCESS EXCLUSIVE and be granted out of turn, but that is not how PostgreSQL does it. In other words: screeching halt - database frozen. Crap!

Postgres core team member Josh Berkus wrote a pair of articles on the topic (ALTER TABLE and downtime, part 1 and ALTER TABLE and downtime, part 2). Great, now what?

Luckily, Postgres provides means to find and kill errant tasks. So, let's find and cancel the blocked ALTER TABLE query and/or the guilty party holding the shared lock.

In case you're ever in the same situation, here are a few key commands:

select * from pg_locks where granted and relation = 'my_table'::regclass \x\g\x
select * from pg_stat_activity where pid = 1234567 \x\g\x
select pg_cancel_backend(1234567);

...and finally, thanks to our friend pg_cancel_backend, our DB is unfrozen and we can go track down who had that lock held for so long.

Tuesday, May 24, 2016

Docker cheat sheet

If you're using Docker, here's a nice Docker cheat sheet. I've collected a few choice bits of Docker magic here.

Docker comes with a point-n-click way to start a shell with docker hooks attached. Here's an easier way:

eval "$(docker-machine env default)"


Docker terminology has spawned some confusion. For instance: images vs. containers and registry vs. repository. Luckily, there's help, for example this Stack Overflow post by a brilliant, but under-appreciated, hacker on the difference between images and containers.

  • Registry - a service that stores image repositories
  • Repository - a set of Docker images, usually versions of the same application
  • Image - an immutable snapshot from which containers are started. An image consists of layers of file system changes stacked up on top of a base image.
  • Container - a runtime instance of an image

Working with Docker

We don't need to ssh into the container. Maybe you could call this "shelling" into a container? The command below starts a fresh container with an interactive shell; for a container that's already running, docker exec -it <container> bash does the trick.

docker run --rm -it ubuntu:latest bash

... more here later. In the meantime, see wsargent's cheat sheet ...

The Dockerfile is a version-controllable artifact that automates the creation of a customized image from a base image. There's plenty of good advice in the guide to Best practices for writing Dockerfiles.

docker build -t myimagename .


You should almost always run with the --rm flag to avoid leaving shrapnel around. But, that's easy to forget. If there are lots of old containers hanging around, this bit of magic will help:

docker ps -a | grep 'weeks ago' | awk '{print $1}' | xargs --no-run-if-empty docker rm

Images can pile up, too. In What are Docker <none>:<none> images?, Shishir Mahajan shows how to clean up dangling images:

docker rmi $(docker images -f "dangling=true" -q)

Tuesday, May 03, 2016

Topic Modeling with LDA

Rob McDaniel gave a nice presentation on the flaming-hot topic of topic analysis yesterday evening, hosted by Seattle metastartup PitchBook. Grab slides and code from the GitHub repo.

Rob is interested in using NLP to discern the level of objectivity or bias in text. As an example, he took the transcripts of the debates of this year's presidential campaign. Here's part of what he did with them:

For more, have a look at the post on Semantic analysis of GOP debates.

Interesting tidbits:

  • Wikipedia is a source of documents labeled as not objective.
  • Movie reviews are a source of documents labeled by rating, number of stars.
  • Topic cohesion measures how well a given document stays "on-topic" or even "on-message".
  • KL divergence is an entropy-based measure of the relatedness of topics.
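To make the last bullet concrete, here's a minimal sketch of what KL divergence computes. The "topic mixtures" below are invented stand-ins for two documents, not anything from Rob's talk:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two hypothetical topic mixtures over the same four topics:
focused = [0.7, 0.1, 0.1, 0.1]      # mostly "on-message"
diffuse = [0.25, 0.25, 0.25, 0.25]  # spread evenly across topics

d = kl_divergence(focused, diffuse)
```

Note that KL divergence is zero only when the distributions match, and it's asymmetric: D(P||Q) generally differs from D(Q||P).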

There was an interesting side discussion of the orthogonality of topic modeling and word embedding (word2vec).

Some of the sources Rob mentioned were Tethne and one of its tutorials, as well as a pair of papers, Introduction to Probabilistic Topic Models and Probabilistic Topic Models, both by David Blei.

Sunday, May 01, 2016

Future looks bright

Way back in December of 2008, Python 3.0 was released. Seven years later, Python 3 is finally gaining traction.

Why the wait? Broken backward compatibility? Lagging library support? Print's annoying new parentheses? Well, coders are a cranky lot, often not fond of change. Some even thought Python 3 would be the end of Python, suffering a fate similar to Perl 6. There was a bit of controversy:

The story holds some lessons.

Initially, the Python core developers seem to have imagined that people would take the plunge sooner or later, migrating to Python 3 and never looking back. But, there was something of a chicken-and-egg problem. Library maintainers were confronted with the unappealing prospect of supporting two code bases for some transition period of unknown duration. Python application developers were reluctant to upgrade until all their dependencies had made the move. Libraries and applications saw the majority of their peers waiting to upgrade. So, nothing happened. Python 3 adoption languished.

It took a while, but the Python community came up with a better solution. Tools like future and six enable a single code base to support both Python 2 and Python 3. Mostly this works by backporting the new Python 3 idioms to Python 2, allowing existing code to adopt new idioms at a measured pace while maintaining backwards compatibility. This strategy makes a ton of sense in a language as dynamic as Python. And with it, the deadlock was broken and migration could get underway. End of Life for Python 2 has been moved out to 2020, so those shims will be in place for a long time.
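The flavor of that single-code-base style, in miniature (the function here is my own toy, not from either library): the __future__ imports switch on Python 3 semantics under Python 2 and are harmless no-ops under Python 3, so one file runs identically on both interpreters.

```python
from __future__ import absolute_import, division, print_function

# Under Python 2, these imports give "/" true-division and make print
# a function; under Python 3 they change nothing.
def describe(ratio):
    return "ratio is {:.2f}".format(ratio)

message = describe(1 / 2)  # 0.5 on both interpreters, not 0
print(message)
```

The future and six packages go further, backporting renamed modules and builtins, but the mechanism is the same: write Python 3 idioms, shim them on Python 2.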

Contrast the Python 3 story with that of Java where a very high level of backwards compatibility is the standard. At this point, Java folks may be congratulating themselves on making the right call. But, backwards compatibility comes at a cost. There's no denying that those constraints have limited the evolution of the language. Java generics are marred by type erasure. Widely recognized warts in the standard libraries persist - for example, the inconsistency of string.length(), array.length, and collection.size(). Swing is as clunky as ever. Date support in the standard library has only recently been upgraded after years of criticism. Advanced practitioners like those behind Spring resort to esoteric means like byte-code manipulation to extend the capabilities of the language.

At the same time, the JVM became a laboratory for programming language design. Scala is essentially an evolved Java. Clojure brings LISPy metaprogramming to the JVM. These advanced JVM languages are a big influence on modern Java.

So, finally, Python 3 is humming along. The long incubation period has not been without advantages. Python 3.5 is very polished. Most prominent libraries support it. Usage is accelerating. Core developers promise Python 4.0 won't be like Python 3.0.

What lessons will the Python community and that of other technologies take away from the Python 3 experience? That breaking backwards compatibility is a terrible, terrible thing that should never be contemplated again? That's a progress limiting point of view. I hope there are lessons in how to manage substantial change and how to avoid fracturing the community in the process.

Careful attention needs to be paid to a smooth transition. Working code should continue to work while experiments continue to push forward with new and potentially better idioms. Those that offer a real improvement will, in time, take over, but gradually not as a step function. Periods of stability may alternate with accelerated change. End-users should be firmly in control of how cutting-edge they want to be. An essential ingredient is respect for differing rates of change, differing tolerance for risk, etc. There are those who thrive on the new and those who count on the tried and true. But, living breathing technologies need to evolve.

“In the long run every program becomes rococo, then rubble.” (Alan Perlis)

...but not yet! ...can't wait 'til PyCon in a few weeks!

Sunday, November 22, 2015

Rdio was too good to last

When Apple killed Lala, there was a bad guy with motive, opportunity and a smoking gun. In the case of Rdio's recently announced demise, it's different. A postmortem by Casey Newton of the Verge explains Why Rdio died. The team had the design chops, engineering talent and love of music to create a fantastic product but lacked the business and marketing savvy to make it pay off.

Rdio's star feature was certainly design. The site enabled individuals to express themselves and relate to others - for their enthusiasm to feed off one another. You could follow people with intersecting tastes. New listening was suggested algorithmically based on what was "On the rise in your network". You could comment and respond, lovingly curate playlists and follow activity by your musical soul-mates. The social dimension is key to something as personal and tribal as music.

If you wanted to take art pop seriously, do a deep dive into electronic music, exhaustively survey the Zappa catalog, or peruse the archives of 5 years of Sunday Jazz Brunch selections, your people were there. Any platform will play music you already know; Rdio was a place to explore.

In some ways, credit belongs to the users themselves - those who shoehorned rich conversations into a relatively bare-bones comment feature, repurposing shared playlists as the equivalent of discussion forums. In one case, Community Playlist the Trilogy had some 3,522 comments and 116 collaborators.

It had, in a word, community.


Refugees looking for a new musical home can find lots of resources on the Rdio lover's slack channel, including a compilation of tools for exporting playlists and other digital assets. A Python script by Jesse Mullan (playlist_helper), which will soon become the official data exporter, worked nicely for me.

There were several calls for a platform-neutral place for the community to live, independent of which streaming service folks end up migrating to. Some nascent possibilities are The Playlist and Hatchet; one of those fills that role for me, at least for now. Maybe the Rdio Lover's slack channel will survive beyond the transition period.


Which service comes the closest to Rdio? Users on the slack channel have compiled a helpful guide Rdio features compared to the competitors. Roughly in order of how interesting they look to me, the main contenders are:


In A Eulogy for Rdio in the Atlantic, Robinson Meyer calls Rdio "a better streaming service in most every way". So, why did a great service with an intensely loyal following fail?

The economics of digital music are tricky. None of the streaming services are really making money. Rdio's $1.5M in monthly revenue, corresponding to perhaps 150,000 paying users, and $100-150k in advertising couldn't cover their costs of roughly $4M mainly for 140 employees and royalties. This explains the nasty pile-up of $220 million in debt. Music has been called Too free to be expensive. Too expensive to be free.

Pandora is buying Rdio's intellectual property and taking on some of the talent with the intention of introducing their own on-demand streaming service some time in 2016. Interestingly, customer data was not part of that transaction.

One message I hope no one takes away is that community doesn't matter. It's one of the few ways for streaming services to differentiate themselves. Without it, you feel "solitary, lonely and probed" in the characteristic phrasing of CAW a.k.a. The Aquatic Ape.

So long

So, we're left with a reminder that the best doesn't always win. No doubt, we "snobby album purists" will find or co-opt another platform on which to indulge our musical obsessions. Keep in touch, music peeps:

“To all of you that have expanded my musical experience for the past six or so years - thank you, thank you thank you.”
“Man, I'm gonna miss my playlists.”
“I spent A LOT of time here.”
“Thank you all for introducing me to some truly great music (and some truly terrible, which I enjoyed nearly as much).”
“from the start it has been my most active social network”
“perhaps the kindest community in online music.”
“People left comments on albums, and, lo and behold, the writing was good and interesting.”
“you all have been invaluable in helping me not just discover new music, but in helping me open my mind to new kinds of music.”
“such a welcoming and amazing crew of fellow travelers”

...parting comments from Rdio users collected by fangoguagua.

Tuesday, August 04, 2015

Hacking Zebrafish thoughts

The last lab from Scalable Machine Learning with Spark features a guest lecture by Jeremy Freeman, a neuroscientist at Janelia Farm Research Campus.

His group produced this gorgeous video of a living zebrafish brain. Little fish thoughts sparkle away, made visible by a technique called light-sheet fluorescence microscopy, in which proteins that light up when neurons fire are engineered into the fish.

The lab covers principal component analysis in a lively way. Principal components are extracted from time-series data, mapped onto an HSV color wheel, and used to color an image of the zebrafish brain. In the process, we use some fun matrix manipulation to aggregate the time-series data in two different ways: by time relative to the start of a visual stimulus and by the directionality of the stimulus (shown below).
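A toy sketch of that idea, stripped way down (the lab itself uses Spark and NumPy; the data and dimensions here are made up): reduce each pixel's time series to its score along the first principal component, then map that score onto the HSV wheel to get a color.

```python
import math
import colorsys

# Each row is one "pixel": its activity at four time points (invented data).
series = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.1, 6.0, 8.2],
    [0.5, 1.1, 1.4, 2.1],
]

# Center each time point across pixels.
n = len(series)
means = [sum(row[j] for row in series) / n for j in range(len(series[0]))]
centered = [[x - m for x, m in zip(row, means)] for row in series]

def matvec(rows, v):
    """Compute (X^T X) v as X^T (X v), without forming X^T X."""
    xv = [sum(a * b for a, b in zip(row, v)) for row in rows]
    return [sum(rows[i][j] * xv[i] for i in range(len(rows)))
            for j in range(len(rows[0]))]

# First principal component via power iteration.
v = [1.0] * len(centered[0])
for _ in range(50):
    w = matvec(centered, v)
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]

# Project each pixel onto the component; rescale scores to [0, 1] as hue.
scores = [sum(a * b for a, b in zip(row, v)) for row in centered]
lo, hi = min(scores), max(scores)
colors = [colorsys.hsv_to_rgb((s - lo) / (hi - lo), 1.0, 1.0) for s in scores]
```

At fish scale you'd do the same per pixel over thousands of time points, which is exactly where Spark's distributed matrix operations earn their keep.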

The whole series of labs from the Spark classes was nicely done, but this was an especially fun way to finish it out.

Check out the Freeman Lab's papers:

Tuesday, July 21, 2015

Machine learning on music data

The 3rd lab from Scalable Machine Learning with Spark has you predict the year a song was published based on features from the Million Song Dataset. How much farther could you take machine analysis of music? Music has so much structure that's so apparent to our ears. Wouldn't it be cool to be able to parse out that structure algorithmically? Turns out, you can.
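For a feel of the lab's regression task in miniature (the real lab uses Spark's MLlib and many audio features; everything below is invented): fit release year as a linear function of a single audio feature, using the closed-form least-squares solution.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x, closed form."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Hypothetical (audio feature, release year) pairs:
features = [0.2, 0.5, 0.8, 1.1]
years = [1978.0, 1990.0, 2002.0, 2014.0]

a, b = fit_line(features, years)
predicted = a + b * 0.65  # predicted year for a new song's feature value
```

With hundreds of timbre features and a million songs, the math is the same; the lab's contribution is making it scale.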

Apparently The International Society for Music Information Retrieval (ISMIR) is the place to go for this sort of thing. A few papers, based on minutes of rigorous research (aka random googling):

In addition to inferring a song's internal structure, you might want to relate its acoustic features to styles, moods or time periods (as we did in the lab). For that, you'll want music metadata from sources like:

There's a paper on the Million Song Dataset by two researchers at Columbia's EE department and two more at the Echo Nest.

Even Google is interested in the topic: Sharing Learned Latent Representations For Music Audio Classification And Similarity.

Tangentially related, a group out of Cambridge and Stanford say Musical Preferences are Linked to Cognitive Styles. I fear what my musical tastes would reveal about my warped cognitive style.