Friday, January 24, 2014

Online class on Statistical Learning

Trevor Hastie and Robert Tibshirani are teaching an online class on Statistical Learning starting this week.

The first week is introduction and overview, so it's not too late to join up.

They've also published a new book, An Introduction to Statistical Learning, as a more accessible companion to their widely revered The Elements of Statistical Learning. Like its older sibling, the new book is available for free download as a PDF.

The class overlaps a bit with Andrew Ng's Machine Learning class, but I'm looking forward to a different perspective, new material on penalized regression, resampling methods, non-linear fitting, and random forests, and more practice.

The Statistical Learning class is taught with examples in R, which is great.

Amir Sadoughi is starting a community-driven solution guide to the exercises.

If you prefer Python, some folks at Boston startup DataRobot are planning to follow the class with a series of blog posts that show how "statistical learning techniques presented in the course can be applied using tools from the Python ecosystem: “numpy”, “scipy”, “pandas”, “matplotlib”, “scikit-learn”, and “statsmodels”". Awesome!
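
As a taste, here's a minimal sketch (mine, not from the course or the DataRobot series) of two of the course topics, penalized regression and cross-validated resampling, using scikit-learn on synthetic data:

    # Penalized (L1/lasso) regression with the penalty strength chosen
    # by 5-fold cross-validation; a sketch on synthetic data.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LassoCV

    # Synthetic data standing in for a real dataset
    X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                           noise=10.0, random_state=0)

    model = LassoCV(cv=5).fit(X, y)
    print("chosen alpha:", model.alpha_)
    print("nonzero coefficients:", np.sum(model.coef_ != 0))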

Those interested may also like Yaser Abu-Mostafa's MOOC Learning from Data, which ran "live" last year but is now available in "take at your own pace" mode. I haven't taken it, but have heard glowing recommendations. Students of that course produced a truly impressive solutions guide with code in R, Python, Octave, Haskell and several other languages.

For those who want the in-person experience and have the budget for it, Hastie and Tibshirani are teaching a two-day seminar in Palo Alto on March 20-21.

Thursday, January 09, 2014

Guide to Open Science

Open science is the idea that scientific research should be openly and immediately shared.

…when the journal system was developed in the 17th and 18th centuries it was an excellent example of open science. The journals are perhaps the most open system for the dissemination of knowledge that can be constructed — if you’re working with 17th century technology. But, of course, today we can do a lot better.

Refactoring science to take advantage of digital technology is what Michael Nielsen, quoted above, calls "Building a better collective memory."

The problems with the current system reinforce the case for change. The reproducibility crisis, the prevalence of unreliable findings in published research described by Ioannidis, is giving science a credibility problem. Negative results and replications go unpublished. Peer review is uncompensated and sometimes ineffective. The publication process is slow. Over time, data gets lost. Artificial barriers impede potential synergy among researchers, or between research and industry, students, or patients.

The scientific paper has become something of a “choke point”. Unbundling the functions of a paper might allow more degrees of freedom for progress and innovation.

As science and technology progress, the amount of accumulated knowledge that must be mastered to get to the frontier increases. "If one is to stand on the shoulders of giants, one must first climb up their backs, and the greater the body of knowledge, the harder this climb becomes." This "burden of knowledge" changes the effective organization of innovative activity. As the low hanging fruit is depleted, research becomes more specialized and team oriented.

In this new regime, the strategy that comes into play is leveraging the “scale of the communication” of the Internet. The experience of the Polymath project inspired Gowers and Nielsen to write “mass collaboration will extend the limits of human problem-solving ability.”

Not everything digital is necessarily open, but the interesting developments are concentrated at the intersection of open and digital science - in the interaction between technology and the redesign of centuries-old institutions.

Overview

Here are two great places to start:

Open Access

In the primordial days of the web, known to some as 1991, physicist Paul Ginsparg started the arXiv, an on-line preprint library, superseding a mailing list. Today, arXiv offers, "Open access to 901,072 e-prints in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance and Statistics". Researchers enjoy the convenience and increased exposure enough to negotiate exceptions to journal copyright agreements or to ignore them. The arXiv is currently maintained by the Cornell University Library and supporting member institutions at a cost of $826,000 per year.

The economics of information delivery have changed. But, there's also an ethical argument for making the fruits of publicly funded research public.

In 2000, the National Library of Medicine launched PubMed Central, a digital archive of NIH-funded biomedical and life sciences journal literature. Open access is particularly strong at the NIH, where PLoS Computational Biology founding editor Philip E. Bourne was recently named Associate Director for Data Science. The stories of PubMed Central and PLoS are tightly bound. Harold Varmus, now NCI director and formerly director of the NIH in the late 1990s, co-founded PLoS along with Patrick Brown and Michael Eisen, partly out of frustration with the resistance met by PubMed Central.

There's a struggle going on in the US congress over whether to extend the NIH's open access policies to the rest of federally funded research. Publishers, meanwhile, lobby to block or limit open access mandates.

Open access journals give away content for free, typically recouping costs through publication fees and advertising. This may incentivize weaker peer review. Science magazine's open access sting, in which a bogus paper was accepted by many open access journals, has some hallmarks of a FUD (fear, uncertainty, and doubt) campaign, but it also shines a light on a real problem. The limitations of peer review, it is argued, might be addressed by a combination of post-publication review and algorithmic filtering, ensuring that quality research rises to the top.

eLife aims for the highest standards. Its founder, Nobel laureate Randy Schekman, wrote recently of the “inappropriate incentives” created by the “luxury” journals.

Open access publishers
  • BioMed Central, the largest open access publisher; founded in 2000; now owned by Springer
  • Elementa: Science of the Anthropocene, publishing original research on the Earth's physical, chemical, and biological systems
  • eLife, a high-impact open access journal formed in a joint initiative of the Wellcome Trust, the Howard Hughes Medical Institute and the Max Planck Society.
  • F1000 Research, a post-publication reviewed journal from Faculty of 1000
  • Frontiers, community-driven journals with their own peer-review system
  • PeerJ, a peer reviewed journal with a membership model
  • PLOS, launched in October 2003

The DOAJ lists nearly ten thousand open access journals.

Priem and Hemminger propose Decoupling the scholarly journal - unbundling the functions that have been tightly coupled within traditional academic journals since the days of Henry Oldenburg in the 17th century, and freeing up the parts for experimentation.

Traditional functions of a paper
  • Archival: storing scholarship for posterity.
  • Registration: time-stamping discoveries to establish precedence.
  • Dissemination: distribution of scholarly products.
  • Certification: assessing rigor and significance of contributions.

Re-engineering the journal makes new things publishable: negative results, for example, which Science calls "the neglected stepchild of scientific publishing". There are further possibilities yet, outside the format of the paper.

Data and Code

Potentially valuable products of scholarship include: papers, data, software, figures, posters, reviews, talks, slides, instructional material.

Data and source code, in particular, are indispensable for replication and should, in theory, be highly re-usable. Data archives like GenBank and GEO exist for this reason, and more scientific software is finding its way into GitHub or other public source code repositories.

Data are a classic example of a public good, in that shared data do not diminish in value. To the contrary, shared data can serve as a benchmark that allows others to study and refine methods of analysis, and once collected, they can be creatively repurposed by many hands and in many ways, indefinitely.
Permanent archives for published research data would allow us to write an amendment to the centuries-old social contract governing scientific publishing and give data their due.

Todd J. Vision, "Open Data and the Social Contract of Scientific Publishing", BioScience, 2010

Longer term, lots of work is still needed in the ongoing project of creating machine readable data standards, open APIs, rich metadata, and semantic integration.

The phrase "GitHub for science" has many interpretations. There are a handful of platforms looking to implement some version of that idea.

Platforms
  • Arvados spun out of work done in George Church's Lab at Harvard Medical School to support the Personal Genome Project. According to developer Alexander Wait Zaranek, Arvados is "an open-source platform for data-management, analysis and sharing of large biomedical data sets spanning millions of individual humans across numerous organizations and eventually encompassing exabytes of data."

    Technology: Ruby. SDK libraries in Perl, Python and Ruby. "Keep", a content-addressable distributed file system developed by the Arvados team (see the sketch after this list for the general idea). Docker.

  • DataDryad is a curated general-purpose repository that makes the data underlying scientific publications discoverable, freely reusable, and citable. Dryad has integrated data submission for a growing list of journals; submission of data from other publications is also welcome.

    Leadership: Todd J. Vision at the Department of Biology at UNC and a board with deep publishing background.

    Technology: Java. Dryad is built on the open source DSpace software for building open digital repositories.

    Funding: NSF and data publishing charges.

  • Figshare is an open science platform for publishing figures, data sets and other research outputs such that they are discoverable, sharable and citable.

    Technology: source not available.

    Support: Figshare is now supported by Digital Science, employer of Figshare founder Mark Hahnel and a division of Macmillan Publishers, which also owns the Nature Publishing Group.

  • The Open Science Framework is part network of research materials, part version control system, and part collaboration software. "The purpose of the software is to support scientific workflow and help increase the alignment between scientific values and scientific practices." The OSF is the flagship product of the Center for Open Science, founded by Brian Nosek and Jeffrey Spies, who presented OSF at SciPy2013.

    Technology: Python; Git on the back end. Hosted on Linode; Source on GitHub promised soon.

    Funding: The Laura and John Arnold Foundation, Alfred P. Sloan Foundation, the Templeton Foundation, and an anonymous donor.

  • RunMyCode enables scientists to create companion sites for papers, holding the associated code or data. It is aimed at reproducibility and at compatibility with journals.

    Leadership: Victoria Stodden, Christophe Hurlin, and Christophe Perignon, co-authors of the paper describing the project.

    Funding: Sloan, French universities and research agencies.

  • Synapse by Sage Bionetworks is a platform for sharing data, code, and narrative description linked together with provenance. Synapse has been used in conjunction with the DREAM challenges in computational biology and in The Cancer Genome Atlas (TCGA) Pan-Cancer project.

    Technology: Java back end, javascript web front end, Python client and R client for programmatic access.

    Leadership: Developed by Sage Bionetworks, led by Stephen Friend; the Sage team has expertise in drug development, machine learning, oncology, open access, and data governance. The engineering group is led by Michael Kellen.

    Funding: Sloan, Washington State Life Sciences Discovery Fund, NCI, NIH.

By offering collaborative features or the ability to execute code, these platforms aspire to be more than just data repositories, of which the Registry of Research Data Repositories (re3data.org) counts hundreds.
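
To make the content-addressable idea behind Arvados's Keep concrete: a blob's storage address is derived from a hash of its contents, so identical data is stored only once and corruption is detectable on read. Here is a minimal Python sketch of the general concept, not of Keep's actual design:

    import hashlib

    class ContentStore:
        """Toy content-addressable store: the address is the content hash."""

        def __init__(self):
            self._blobs = {}

        def put(self, data: bytes) -> str:
            # The blob's address is the SHA-256 of its contents, so the
            # same data always maps to the same address (deduplication).
            address = hashlib.sha256(data).hexdigest()
            self._blobs[address] = data
            return address

        def get(self, address: str) -> bytes:
            # Re-hash on read to verify the blob hasn't been corrupted.
            data = self._blobs[address]
            assert hashlib.sha256(data).hexdigest() == address
            return data

    store = ContentStore()
    addr = store.put(b"ACGTACGT")
    print(addr[:12], store.get(addr))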

Journals for data

Burying data in supplementary material is a better archival strategy than keeping data on a post-doc's laptop, backed up by a USB key in the post-doc's shirt pocket. A couple of big journals are experimenting with publishing data sets in a more prominent way.

Nature Publishing Group's Scientific Data is an open-access, online-only publication for descriptions of datasets, set to launch in May 2014. Data will be stored in separate repositories. Descriptions will have a narrative part and a structured part, the Data Descriptor metadata, consisting of ISA-Tab formatted information about samples, methods, and data processing steps.

GigaScience, a new on-line publication from BioMed Central, links manuscripts with associated data sets archived in GigaDB, a database hosted by BGI. (src: github.com/gigascience)

The GigaDB paper starts with a nice quote from Tim Berners-Lee: "Data is a precious thing and will last longer than the systems themselves." It describes several use cases: an epigenomics pipeline, the mouse methylome dataset, and 15 TB of hepatocellular carcinoma tumor/normal sequence data. GigaDB aims to "accept any large-scale data including proteomic, environmental, and imaging data" and states the goal of "working with authors to make the computational tools and data processing pipelines described in their papers available and, where possible, executable."

Computing infrastructure

If the functions of the journal, along with new functions, are going to be performed by a loosely connected set of services on the web, a lot of cyberinfrastructure will be needed to tie it all together. One example is ORCID, a service that assigns globally unique IDs to researchers. Similarly, DataCite issues DOIs (Digital Object Identifiers) for data sets, providing a convenient and unambiguous way of citing data.
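
As an illustration of why this glue matters, a dataset DOI can be resolved to ready-made citation metadata through the DOI system's content negotiation. A minimal sketch using the requests library; the DOI below is a placeholder, not a real dataset:

    import requests

    doi = "10.5061/dryad.example"  # hypothetical Dryad-style dataset DOI

    # Asking the DOI resolver for BibTeX instead of a web page returns
    # a citation entry assembled from the registered metadata.
    resp = requests.get("https://doi.org/" + doi,
                        headers={"Accept": "application/x-bibtex"})
    print(resp.text)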

Another bit of programmatic glue is rOpenSci, a set of R packages that wrap the APIs of a variety of scientific data repositories and journals. These packages interface between R and several of the tools mentioned here: Dryad, Figshare, Mendeley, NCBI, DataCite, altmetric.com, PLoS, PubMed, and others.

Consent

Extracting maximal scientific value from data isn't entirely a technical problem. There are sticky legal and ethical issues around the privacy and consent of research subjects, licensing, publication embargoes, and the interface between academic and commercial interests.

Altmetrics

I was once told by a successful P.I. in reference to a scientific database project, "I don't want to be a librarian." Data curation, though undoubtedly valuable, is labor intensive and doesn't always generate much career momentum for a scientist.

Incentives matter. If publishing is the currency of the realm, how does a scientist get credit for curating data or supporting well-engineered software? Altmetrics is a catch-all term for a set of alternative ways of measuring and incentivizing academic progress, including the creation of artifacts such as data sets and software.

  • ImpactStory is a project by Jason Priem and Heather Piwowar to compile and display a broad range of scientific contributions, including everything from papers to source code to Twitter feeds.

    Here are a handful of ImpactStory profiles:

    Technology: Python and javascript source on GitHub

    Funding: Sloan Foundation, NSF

Valuing diverse research products

In Altmetrics: Value all research products (Nature 2013), Heather Piwowar notes that funding agencies are increasingly conscious of the value of diverse research products and claims, “Altmetrics give a fuller picture of how research products have influenced conversation, thought and behaviour.”

Altmetrics.com, also under the Digital Science umbrella, compiles web analytics for papers. Plum Analytics rates individual researchers and whole departments. Microsoft's Academic Search ranks the top authors, publications, conferences, and journals in a given field. Google Scholar keeps profiles for individual researchers.

Given plenty of performance data at the individual level in a highly competitive field, the result will be a Moneyball of science, in which scientists rack up statistics like baseball players.

But, if alternative metrics are to have real value, it will be in re-aligning incentives in scholarly work towards the desired outputs: not just solid reproducible results, but all sorts of specialized supporting contributions that are often undervalued today.

Social Science

There are two contenders for the Facebook or LinkedIn of science, although whether such a thing is needed remains to be seen. ResearchGate seems to have had the pleasing insight that the co-authorship graph is a ready-made social network.

Mendeley is a reference manager and academic social network which surpassed 2 million users before being bought by Elsevier in April of 2013 for something like $69M-$100M. With the buyout, some believe an opportunity to create an iTunes for scientific papers was lost.

The link to openness is this: whatever value comes from these sites (and, to some extent, from open lab notebooks) derives from the ability to build collaboration across organizational boundaries.

Collaborative tools

The journal Push, whose topic is mysteriously described as Research & Applied Theory In Writing With Source, is an experiment in doing something really different. The entire journal, including the source code for its web site, every issue, and all articles, lives in a single GitHub repository and adopts the workflow that implies. Sadly, Push is empty so far.

SciGit, a collaborative tool for writing scientific papers, is another experiment in layering functionality on top of version control. Authorea is a collaborative scientific typesetting system, also with git on the back end.

Integrating code and prose is a great way to communicate and replicate computational methods. In the Python community, IPython notebooks fill this niche, while in R there are knitr and Sweave. These executable documents can be version controlled, collaboratively edited, and automated.
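
For instance, a notebook checked into version control can be re-executed end to end as part of an automated build. A minimal sketch using the nbformat and nbconvert libraries; "analysis.ipynb" is a hypothetical file name:

    import nbformat
    from nbconvert.preprocessors import ExecutePreprocessor

    # Load the notebook, run every code cell in order, and write the
    # executed result back out for review or archiving.
    nb = nbformat.read("analysis.ipynb", as_version=4)
    ExecutePreprocessor(timeout=600).preprocess(nb, {"metadata": {"path": "."}})
    nbformat.write(nb, "analysis_executed.ipynb")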

If papers are transforming into web-native formats, peer review should follow suit.

  • Hypothes.is will be an open platform for the collaborative evaluation of knowledge. It emphasizes reputation and evaluation, using an annotation tool for web pages.

    Team: heavy on start-up experience and light on scientists.

    Technology: Python, javascript, coffeescript. Source on GitHub.

    Funding: Sloan, Shuttleworth and Mellon Foundations.

  • Publons is a platform for open quantifiable peer review, either pre- or post-publication, inspired by Stack Overflow. The founders describe their ambitions for the project in The Future of Academic Research.

    Technology: Postgres, Python/Django, Bootstrap, and D3.js. No open repos.

    Team: Based in Wellington, New Zealand. Emerged from startup accelerator Lightning Lab.

Organizations

A number of organizations advocate and develop tools for open science.

Parting thoughts

That's my attempt to get a handle on open science. Of course, it's incomplete and lopsided. In wrapping up, here are a few stray thoughts:

An open system interfaces and interacts with other systems. Network effects kick in as more capabilities plug in. A virtual lab, made up of people distributed across the globe, could crowdsource funding (Microryza, Consano), farm out experiments (Science Exchange, Assay Depot), and develop an open challenge (DREAM, Nature open innovation pavilion) to analyze the data.

In the future, the humans at the center of that virtual lab may be replaced by algorithms: "...robotic scientists crawling the web of literature extracting, combining and computing on existing research, deducing new knowledge, finding gaps and inconsistencies, and proposing experiments to resolve them."

Many open science pioneers cite the open source software movement as an inspiration and as a foundation to build upon. Open science needs open source tools. Reproducibility practically requires it. The success of open source demonstrates that the open model can work.

In order to succeed, innovation needs to be compatible with and complementary to existing infrastructure. As tempting as it is to start tossing Molotov cocktails, the better strategy is to quietly pry off functionality and perform it competently, offering improvements that are hard for closed incumbents to match.

Open science is fundamentally about increasing trust: trust within science, but also the public's and policy-makers' trust in science. It may offer a solution to reproducibility problems, and perhaps a better realization of a science that aspires to truth in a way that is dependable and verifiable.

Thanks to Brian Bot, my colleague at Sage Bionetworks, who introduced me to many of the open science projects linked here, read a draft of this post, and gave insightful feedback.

Monday, January 06, 2014

Transforming Code into Beautiful, Idiomatic Python

Python core developer Raymond Hettinger shows how to make your code more idiomatic and faster in this talk from PyCon US 2013:

Transforming Code into Beautiful, Idiomatic Python

Here are the slides.