Tuesday, July 21, 2015

Machine learning on music data

The 3rd lab from Scalable Machine Learning with Spark has you predict the year a song was published based on features from the Million Song Dataset. How much farther could you take machine analysis of music? Music has so much structure that's so apparent to our ears. Wouldn't it be cool to be able to parse out that structure algorithmically? Turns out, you can.

Apparently The International Society for Music Information Retrieval (ISMIR) is the place to go for this sort of thing. A few papers, based on minutes of rigorous research (aka random googling):

In addition to inferring a song's internal structure, you might want to relate it's acoustic features to styles, moods or time periods (as we did in the lab). For that, you'll want music metadata from sources like:

There's a paper on The Million Song Dataset paper by two researchers at Columbia's EE department and two more at the Echo Nest.

Even Google is interested in the topic: Sharing Learned Latent Representations For Music Audio Classification And Similarity.

Tangentially related, a group out of Cambridge and Stanford say Musical Preferences are Linked to Cognitive Styles. I fear what my musical tastes would reveal about my warped cognitive style.

Wednesday, July 08, 2015

Scalable Machine Learning with Spark class on edX

Introduction to Big Data with Apache Spark is an online class hosted on edX that just finished. Its follow-up Scalable Machine Learning with Spark just got started.

If you want to learn Spark - and who doesn't? - sign up.

Spark is a successor to Hadoop that comes out of the AMPLab at Berkeley. It's faster for many operations due to keeping data in memory, and the programming model feels more flexible in comparison to Hadoops rigid framework. The AMPLab provides a suite of related tools including support for machine learning, graphs, SQL and streaming. While Hadoop is most at home with batch processing, Spark is a little better suited to interactive work.

The first class was quick and easy, covering Spark and RDDs through PySpark. No brain stretching on the order of Daphne Koller's Probabilistic Graphical Models to be found here. The lectures stuck to the "applied" aspects, but that's OK. You can always hit the papers to go deeper. The labs were fun and effective at getting you up to speed:

Labs for the first class:

  • Word count, the hello world of map-reduce
  • Analysis of web server log files
  • Entity resolution using a bag-of-words approach
  • Collaborative filtering on a movie ratings database. Apparently, I should watch these: Seven Samurai, Annie Hall, Akira, Stop Making Sense, Chungking Express.

The second installment looks to very cool, delving deeper into mllib the AMPLab's machine learning library for Spark. Its labs cover:

  • Musicology: predict the release year of a song given a set of audio features
  • Prediction of click-through rates
  • Neuroimaging Analysis on brain activity of zebrafish (which I suspect is the phase "Just keep swimming" over and over) done in collaboration with Jeremy Freeman of the Janelia Research Campus.

The labs for both classes are authored as IPython notebooks in the amazingly cool Jupyter framework where prose, graphics and executable code fit combine to make a really nice learning environment.

Echoing my own digital hoarder tendencies, the first course is liberally peppered with links, which I've dutifully culled and categorized for your clicking compulsion:

Big Data Hype


Data Cleaning



The Data Science Process

In case you're still wondering what data scientists actually do, here it is according to...

Jim Gray

  • Capture
  • Curate
  • Communicate

Ben Fry

  • Acquire
  • Parse
  • Filter
  • Mine
  • Represent
  • Refine
  • Interact

Jeff Hammerbacher

  • Identify problem
  • Intrumenting data sources
  • Collect data
  • Prepare data (integrate, transform, clean, filter, aggregate)
  • Build model
  • Evaluate model
  • Communicate results

...and don't forget: Jeffrey Leek and Hadley Wickham.