Rufus Pollock on the Use of Open Source Principles for Open Science

From P2P Foundation
Jump to navigation Jump to search

Voices from the future of science: Rufus Pollock of the Open Knowledge Foundation



Intro and first question:

"If there’s a single quote that best captures the ethos of open science, it might be the following bon mot from Rufus Pollock, digital rights activist, economist at the University of Cambridge and a founder of the Open Knowledge Foundation: “The best thing to do with your data will be thought of by someone else.”

It’s also a pithy way to convey both the challenge and opportunity for publishers of scientific research and data. How can we best capitalize on the lessons from the rise of the Web and open source software to accelerate scientific research? What’s the optimal way to package data so it can be used in ways no one anticipates?

I talked to Pollock, who’s been a driving force behind efforts to improve sharing and reuse of data, about where we stand in developing a common legal, technical and policy infrastructure to make open science happen, and what he thinks the next steps should be.

What strategies and concepts can we use from open source to foster open science? Can you give us a big picture description of the role you see the Open Knowledge Foundation playing?

I’d say that in terms of applying lessons from open source, the biggest thing to look at is data. Code and data have so many similarities — indeed, in many ways, the distinction between code and data are beginning to blur. The most important similarity is that both lend themselves naturally to being broken down into smaller chunks, which can then be reused and recombined.

This breaking down into smaller, reusable chunks is something we at the Open Knowledge Foundation refer to as “componentization.” You can break down projects, whether they are data sets or software programs, into pieces of a manageable size — after all, the human brain can only handle so much data — and do it in a way that makes it easier to put the pieces back together again. You might call this the Humpty Dumpty principle. And splitting things up means people can work independently on different pieces of a project, while others can work on putting the pieces back together — that’s where “many minds” come in.

What’s also crucial here is openness: without openness, you have a real problem putting things together. Everyone ends up owning a different piece of Humpty, and it’s a nightmare getting permission to put him back together (to use jargon from economics, you have an anti-commons problem). Similarly, if a data set starts off closed, it’s harder for different people to come along and begin working on bits of it. It’s not impossible to do componentization under a proprietary regime, but it is a lot harder.

With the ability to recombine information as the goal, it’s critical to be explicit about openness — both about what it is, and about what you intend when you make your work available. In the world of software, the key to making open source work is licensing, and I believe the same is true for science. If you want to enable reuse — whether by humans, or more importantly, by machines operated by humans — you’ve got to make it explicit what can be used, and how. That’s why, when we started the Open Knowledge Foundation back in 2004, one of the first things we focused on was defining what “open” meant. That kind of work, along with the associated licensing efforts, can seem rather boring, but it’s absolutely crucial for putting Humpty back together. Implicit openness is not enough.

So, in terms of open science, one of the main things the Open Knowledge Foundation has been doing is conceptual work — for example, providing an explicit definition of openness for data and knowledge in the form of the open knowledge/data definition, and then explaining to people why it’s important to license their data so it conforms to the definition.

So, to return to the main question, I think one of the strategies we should be taking from open source is its approach to the Humpty Dumpty problem. We should be creating and sharing “packages” of data, using the same principles you see at work in Linux distributions — building a Debian of data, if you like. Debian has currently got something like 18,000 software packages, and these are maintained by hundreds, if not thousands, of people — many of whom have never met. We envision the community being able to do the same thing with scientific and other types of data. This way, we can begin to divide and conquer the complexity inherent in the vast amounts of material being produced — complexity I don’t see us being able to manage any other way." (