Open Government Data movement

"There is a global movement to liberate government-"owned" data sets, such as census data, environmental data, and data generated by government-funded research projects. This open data movement aims to make these datasets available, at no cost, to citizens, citizen groups, non-governmental-organizations (NGOs) and businesses. The arguments are many: such data spurs economic activity, helps citizens make better decisions, and helps us understand better who we are and where we are going as a country. Further, these data were collected using tax dollars, yet the government holds a monopoly which makes data available only to those able to pay the high access fees, while some data is not made available at all."

"Some aspects of the open data movement (see also the Hatcher article in this issue) include the following:

  • Open Access (OA), which aims to end restrictive licenses on university research and data, as seen in initiatives such as Open Access News
  • data visualization projects which combine design and data in creative ways to make information more accessible, such as Gapminder
  • grassroots citizen projects using government data sets to improve cities and towns, such as FixMyStreet"


"access to government data is hampered by four main factors: i) the high cost of available data sets; ii) arbitrary decisions about availability of data sets to the public; iii) restrictive licenses; and iv) inaccessible data formats."

Civic Data

"Civic data are a public good, and more specifically, are "numerical quantities or other factual attributes generated by scientists, derived during the research process through observations, experiments, calculations and analysis". It is also "facts, ideas, or discrete pieces of information, especially when in the form originally collected and unanalyzed", and also, from the Report of the National Science Board, "numbers, images, video or audio streams, software and software versioning information, algorithms, equations, animations, or models/simulations". Distinctions are made between raw or level 0 data and derived, refined, synthesized or processed data. Raw data are normally unprocessed; examples include digital signals from a sensor or an instrument (e.g. unprocessed satellite image, thermometer), facts derived from a sample collected for an experiment (e.g. blood sample, ice core), and facts collected by human observation (e.g. mine tailings, census). Computations and data manipulations are related to research objectives and methodologies. Refined or processed data are raw data that have been manipulated, undergone computational modeling, been filtered through an algorithm, sorted into a table or rendered into a map. In these cases, access to the models is as important as access to the output results of those data.

In other words, civic data are the data created and maintained by public organizations and paid for by the public purse as part of the ongoing day-to-day activities of governing. Public data can include crime data at the neighbourhood scale, the number of traffic violations for certain streets, election results, census data, road networks, non-private health data, government expenditure data, school board catchment area boundaries, aggregated test results, environmentally sensitive or contaminated areas, or basic framework map data that include census areas, administrative boundaries, postal code areas and geo-referenced satellite images. Framework data are particularly important as these are the foundational data sets upon which other datasets can be organized. Civic data also include those created as part of government-funded research organizations such as the Social Sciences and Humanities Research Council of Canada (SSHRC) and the Natural Sciences and Engineering Research Council (NSERC) or any other outsourced publicly funded data and information creation activity."


Principles as agreed at the Open Government Meeting:

Government data shall be considered open if it is made public in a way that complies with the principles below:

1. Complete

All public data is made available. Public data is data that is not subject to valid privacy, security or privilege limitations.

2. Primary

Data is as collected at the source, with the highest possible level of granularity, not in aggregate or modified forms.

3. Timely

Data is made available as quickly as necessary to preserve the value of the data.

4. Accessible

Data is available to the widest range of users for the widest range of purposes.

5. Machine processable

Data is reasonably structured to allow automated processing.

6. Non-discriminatory

Data is available to anyone, with no requirement of registration.

7. Non-proprietary

Data is available in a format over which no entity has exclusive control.

8. License-free

Data is not subject to any copyright, patent, trademark or trade secret regulation. Reasonable privacy, security and privilege restrictions may be allowed.
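
The eight principles above can be read as a checklist: data counts as open only if every item holds. A minimal sketch in Python, assuming a hypothetical self-assessment record for one dataset. The DatasetRecord class, its field names and the example values are illustrative only and are not part of the principles themselves:

```python
from dataclasses import dataclass, fields


@dataclass
class DatasetRecord:
    """Hypothetical self-assessment of one dataset against the eight principles."""
    complete: bool             # 1. all public data is made available
    primary: bool              # 2. as collected at the source, not aggregated
    timely: bool               # 3. released quickly enough to preserve its value
    accessible: bool           # 4. widest range of users and purposes
    machine_processable: bool  # 5. structured for automated processing
    non_discriminatory: bool   # 6. no registration required
    non_proprietary: bool      # 7. format not controlled by a single entity
    license_free: bool         # 8. no copyright, patent, trademark or trade secret regulation


def failed_principles(record: DatasetRecord) -> list[str]:
    """Return the names of the principles the record does not satisfy."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


def is_open(record: DatasetRecord) -> bool:
    """A dataset is treated as open only if no principle fails."""
    return not failed_principles(record)


# Invented example: a release that is hard to process automatically
# and requires registration before download.
example = DatasetRecord(
    complete=True, primary=True, timely=True, accessible=True,
    machine_processable=False, non_discriminatory=False,
    non_proprietary=True, license_free=True,
)
print(is_open(example))            # False
print(failed_principles(example))  # ['machine_processable', 'non_discriminatory']
```

Listing the failed principles, rather than returning a bare yes or no, shows a publisher which specific barrier to remove.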


Concerns with government data

Alex Steffen [1]:

"All those qualifications were the subject of substantial discussion, some of which is ongoing on a wiki, which you’re welcome to contribute towards. It was a much faster process to draft an introduction - a mini-manifesto of sorts - which reads in part:

The Internet is the public space of the modern world, and through it governments now have the opportunity to better understand the needs of their citizens and citizens may participate more fully in their government. Information becomes more valuable as it is shared, less valuable as it is hoarded. Open data promotes increased civil discourse, improved public welfare, and a more efficient use of public resources.

The definition will surely evolve, especially as we get input from people who make government policy decisions on matters of data access and security. And there are a couple of questions that couldn’t be addressed in the course of a weekend meeting.

One concerns how broad the definition should be of “government data”. If it includes all data paid for by public funds, then a call for open data has substantial overlap with the Open Access Movement, which seeks to unlock scholarly materials published in licensed journals and make those materials available under less arduous licenses, trying to share scholarly research with people in developing nations. (Much of the scholarship Open Access seeks to unlock is produced with government funding - OA advocates argue that research paid for by public funds needs to be broadly available to the public.) While it would be exciting to see solidarity between these movements, that definition is probably broader than what most of the people in the room were considering when they thought about government data.

A second concern regards non-digital data. The principles above apply to data that’s available in a digital form - they don’t apply to the vast stacks of paper records most governments have accumulated, or obsolete media like inaccessible computer tapes or disks. Ideally, governments will begin to make this material available, but there are unanswered questions of costs incurred during digitization and the priority of bringing old records online. There’s a danger that keeping records in analog format will become a way to avoid digital scrutiny. Before dismissing this as absurd, keep in mind that the current US administration evidently does not use the White House email system for fear of subpoena, and uses laptops issued by the RNC to keep their proceedings from public scrutiny. At some point, a statement of open data principles will need to address the desirability of ensuring that government data becomes digital as soon as reasonably feasible."

An information architecture for open data

Jennifer Bell:


"The UK's Power of Information Task Force has proposed an application framework for implementing government transparency. In a thoughtful blog post this past June, Richard Allen proposed [2] the following re-visioning of the way that the data in a government website is used. Instead of a closed model where the presentation, analysis, and data layers are locked together, Allen presents a model with access layers between data, analysis, and presentation, and an interaction layer laid over top. These access layers give third parties the flexibility to hook into the data directly to provide their own analysis or to use information from the government's analysis layer to provide their own presentation interfaces. Finally, the interaction layer allows people to discuss the information and provide feedback."


"David Robinson, of Princeton's Center for Information Technology Policy, takes the concept of fitting access layers into existing government IT architectures one step further. In his paper Government Data and the Invisible Hand, Robinson argues [3] that intra-departmental reporting channels should be exposed to the public, who can provide external validation to complement internal checks and balances.

If this model were followed by the Canadian federal government, data provided to the Auditor General for fulfilling its mandate of "holding the federal government accountable for its stewardship of public funds" would be opened up to access by external agencies. Like the Peer-to-Patent model, the Auditor General would begin to benefit from scrutiny of the data by external bodies. Systems may well evolve that relieve the burden of oversight from the staff of the Auditor General altogether, allowing them the leisure to pay attention only when issues are reported. With a system built on openness, the public may also start to trust that the government in Canada is in fact well run, instead of being required to take it on faith."
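
The layered model described above, with access points between the data, analysis and presentation layers, can be illustrated with a small sketch. Everything here is assumed for illustration: the spending records, the function names and the choice of plain Python functions as "access layers" are hypothetical and do not come from the Task Force proposal. The interaction layer is left out.

```python
from statistics import mean

# Data layer: hypothetical raw records held by a government department.
_RAW_SPENDING = [
    {"department": "Health", "year": 2008, "amount": 120.0},
    {"department": "Health", "year": 2009, "amount": 135.0},
    {"department": "Transport", "year": 2008, "amount": 80.0},
    {"department": "Transport", "year": 2009, "amount": 70.0},
]


def data_access():
    """Access layer over the data: hands out copies of the raw records."""
    return [dict(row) for row in _RAW_SPENDING]


def government_analysis(rows):
    """Analysis layer: the government's own summary (mean amount per department)."""
    by_department = {}
    for row in rows:
        by_department.setdefault(row["department"], []).append(row["amount"])
    return {dept: mean(amounts) for dept, amounts in by_department.items()}


def analysis_access():
    """Access layer over the analysis: exposes the government's results."""
    return government_analysis(data_access())


def third_party_analysis(rows):
    """A third party hooks into the data directly and computes totals per year."""
    totals = {}
    for row in rows:
        totals[row["year"]] = totals.get(row["year"], 0.0) + row["amount"]
    return totals


def third_party_presentation(summary):
    """A third party reuses the government's analysis with its own presentation."""
    return [f"{dept}: {value:.1f}" for dept, value in sorted(summary.items())]


print(third_party_analysis(data_access()))          # {2008: 200.0, 2009: 205.0}
print(third_party_presentation(analysis_access()))  # ['Health: 127.5', 'Transport: 75.0']
```

In the closed model, only the final presentation would be reachable. Here an outside party can enter at either access point without the government having anticipated that use.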

Transparency Recommendations for Open Government

Jennifer Bell:

"The rewards of a civil service career are asymmetrical and civil servants often feel that they live in a fish bowl. This fish bowl is made of a particular type of filtered glass: one where only the bad light gets through. Overwhelmingly, the disclosed information that gets publicized by the media is the negative, career-destroying kind. Information that points to success and improvement is rarely publicly celebrated. This is something that has to change.

Recognizing that the incentives against transparency outweigh the incentives for, OMBWatch has recommendations for institutionalizing open. These include:

  • having the government leader instruct agencies to request sufficient resources in funding, personnel, and technical capacity, to implement the vision of a more transparent government
  • making transparency part of federal job evaluations where it is part of the job description
  • implementing directives protecting whistle-blowers who disclose waste, fraud, or abuse within an agency
  • creating a system of transparency scorecards for rating agencies
  • giving out transparency awards to celebrate achievements and best practices

Beyond these recommendations, external bodies that use government information should, as much as possible, build systems that create heroes rather than scapegoats. Individuals who find ways to save money, increase efficiency, or deliver a valuable service in an innovative way should be publicly rewarded, either through external financial compensation or public recognition."


  1. Ethan Zuckerman reviews open government initiatives [4]
  2. The Guardian features/pictures 19 initiatives in mostly anglo-saxon countries

Annotation Tools: Django/Metavid

"Adrian Holovaty is one of the superstars in this field, known for creating digital journalism tools like the Chicago Crime mashup and the Django web development framework. He shows off a tool created to help him co-author a book on Django. Rather than putting the text of the book into a wiki and allowing anyone to edit it, the system allows fine-grained commenting on a fixed text. While the book isn’t currently open for commenting, you can see the comments placed on each paragraph of text, often suggesting very specific refinements to the book.

There’s the interesting potential for this model for document annotation to start discussions around political documents. It probably doesn’t make sense to put the text of a political speech in a wiki - the speech was delivered and the discussion is around interpretation of the words of that speech. There’s the exciting possibility that document annotation could become a new form of community interaction. Tom Steinberg of MySociety pointed out that the Free Software Foundation is trying an annotation method to allow group discussion of the new GNU Public Licenses which shows lines that are uncontroversial or more controversial based on the number of comments they’ve received. There’s a sense in which tools for allowing group development of software - versioning systems, repositories - might be applied to group authorship of text as well.

Michael Dale from Metavid has created a remarkable tool for annotating video through a wiki model. It’s a bit like Democracy Player/Miro, DotSub and MediaWiki colliding at high speed. The current MediaWiki site hints at what the future will look like - it currently provides video from CSPAN correlated with transcripts, with the transcript and video embeddable within other publishing platforms. The forthcoming version allows users to improve these captions in wiki form, to search video via captioning, and to edit and package video for export. It looks like it’s going to be an amazing and powerful system when it’s released.

Sunlight Labs

Greg Elin with Sunlight Labs is a master of meshing sets of political data. He talks about Sunlight’s holy grail - one click disclosure - integrating data from Open Congress, Center for Responsive Politics, and others. Sunlight has taken steps to ensure that these sites are cross-referenced and integrated, so you can view portraits of US politicians that include information on fundraising, contributions from lobbyists, voting on earmarks, etc. In the long term, Sunlight is looking into doing real-time analysis of newsfeeds from sources like AP, feeding the data through “data chewers” that monitor the articles for information on politicians and link the references to detailed profiles on the individuals in question. Elin points out that most newspapers don’t have the technical capacity to integrate this sort of data into online stories - his goal is to create a “journalists’ desktop” that puts this information at the hands of every reporter, and makes it as easy as possible for a paper to integrate this information into their coverage.

My Society

Tom Steinberg of MySociety is responsible for some of the most innovative projects in UK politics and online organizing. (Tom was very careful to correct me, reminding me that he’s not a programmer and that MySociety projects are put together by a team of paid and volunteer programmers and designers who work with him - he gives that team the credit for these remarkable projects.) He explains that They Work For You, a site he’s largely responsible for, began as a project to make the Hansard (the record of parliamentary proceedings) accessible, annotatable, and linkable. In the process, TWFY created profile pages for each UK parliamentarian, which includes information characterizing how they’ve voted, how many questions they ask in session and how well they respond to constituent questions.

These pages are often the best linked pages for UK parliamentarians, and they’re generated automatically, based on information reported by the UK government… and from Tom’s scripts as well. One of his sites invites constituents to ask questions of their parliamentarians, and surveys them two weeks later to see whether questions have been answered. This information is included in the profiles of MPs, which gives them a strong incentive to be responsive to constituent questions. (Steinberg has seen evidence that TWFY is so effective that some politicians have resorted to “spam speeches”, attempting to goose their numbers on TWFY to improve their electability.)

Other projects from MySociety focus on more personal aspects of politics. Fix My Street invites citizens to document problems in their local areas, including photos and geolocation information, so that local officials can see problems under their jurisdiction. The site has a comprehensive set of rules for routing email reporting problems to the proper authorities and has registered over 10,000 reported issues thus far. The Travel Time Maps project appeals directly to the heart of many Britons, showing the average commute time per neighborhood for areas across the nation. The isochrone maps make very clear what neighborhoods are and are not well served by public transit."
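
The responsiveness figure shown on MP profiles, as described in the passage above, comes down to a ratio over follow-up survey replies. A rough sketch, assuming a hypothetical list of survey results. The MP names, the record layout and the function name are invented for illustration and are not MySociety's data model or code:

```python
from collections import defaultdict

# Hypothetical follow-up survey results:
# (MP name, whether the constituent reported getting an answer).
SURVEYS = [
    ("MP A", True),
    ("MP A", False),
    ("MP A", True),
    ("MP B", False),
]


def response_rates(surveys):
    """Share of surveyed constituents per MP who reported receiving an answer."""
    asked = defaultdict(int)
    answered = defaultdict(int)
    for mp, got_reply in surveys:
        asked[mp] += 1
        if got_reply:
            answered[mp] += 1
    return {mp: answered[mp] / asked[mp] for mp in asked}


rates = response_rates(SURVEYS)
for mp, rate in sorted(rates.items()):
    print(f"{mp}: {rate:.0%} of surveyed questions answered")
```

Because the figure is computed automatically from survey replies, it updates without editorial effort, which is what makes it an incentive rather than a one-off report.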

Status Report


There are currently a number of exciting initiatives to release government data in bulk. These include:

  • United States: On 21 May 2009 the US Government launched a portal whose purpose is to give direct public access to machine-readable datasets generated by the Executive Branch of the US Federal Government. An initial 47 datasets are online, of the thousands planned for release.
  • United Kingdom: Working with Tim Berners-Lee, one of the inventors of the World Wide Web, the UK government has created a single online access point for government data, launched on 21 January 2010.
  • Australia: the website encourages users to “make government information even more useful by mashing-up the data to create something new and exciting!”
  • New Zealand: a portal for accessing government databases is available. Recent releases include a database from the food safety authority with a breakdown of the major causes of food recalls and the total number of recalls 2001 – 2009, and hospital performance data from the Ministry of Health.
  • Denmark: The Danish National IT and Telecom Agency has created a meta-portal to guide users to available public data.

What about civil society initiatives?

• At the EU level, the Public Geodata Campaign formed in response to the EU’s INSPIRE Directive establishing a framework for spatial data infrastructure in Europe; activists criticise the Directive for its failure to guarantee access to geodata for European citizens and businesses;

• In the UK, the Free Our Data campaign argues that data created with taxpayers’ money, such as Ordnance Survey (mapping) data, should not be sold to the public. In a victory for the campaign, UK Ordnance Survey (mapping) data will be available free of charge from April 2010;

• In New Zealand, an independent website, the Open Data Catalogue, provides a portal to local government datasets in NZ;

• In Slovenia, the speleological association won access to a database of caving information without having to pay for it; the Information Commissioner ruled that when public data is used for not-for-profit purposes, it should be free of charge;

• In the United States, in December 2007 a group of 30 experts and activists produced the “Open Government Data Principles”. The principles were adopted in order “to develop a more robust understanding of why open government data is essential to democracy” and to develop principles that would enable governments of the world to become “more effective, transparent, and relevant to our lives”."


Peter Suber:

"With or without mandates, more governments committed themselves to OA for publicly funded data. Norway adopted an OA mandate for public geodata. Canada, Ireland, and Australia began providing OA to publicly funded digital mapping data, without a mandate. After long resistance, the UK Ordnance Survey began to do the same, at least experimentally. (Earlier in the year, a legal analysis by Charlotte Waelde, an expert on intellectual property at the University of Edinburgh, concluded that the data are not protected by copyright but at most, only by the database right; a JISC report recommended a general UK policy of OA for research data; and the new UK Prime Minister Gordon Brown endorsed the principle of public access to public data.) The Committee of Ministers of the Council of Europe recommended "wide public access to research results to which no copyright restrictions apply" (i.e. data). Publishing consultant Eve Gray reported that the South African government was moving toward a policy of OA for publicly funded research data. The Australian government proposed an Australian National Data Service to promote OA and re-use of publicly funded research data. The Organisation for Economic Co-operation and Development (OECD) issued principles and guidelines to implement its 2004 Declaration on Access to Research Data from Public Funding. California is about to adopt the strongest and broadest OA mandate for greenhouse gas data in the US, and Pennsylvania is about to join the other 49 states in mandating OA for state statutes. And the UN Convention on Long-range Transboundary Air Pollution (LRTAP) adopted an OA mandate for most kinds of data covered by the convention.

The US Government Accountability Office called on four major federal funding agencies (DOE, NASA, NOAA, and NSF) to enforce their existing policies on data sharing. Twenty-two US federal government agencies formed an Interagency Working Group on Digital Data (IWGDD), plan to deposit the data generated by their research grantees in a network of OA repositories, and are considering an OA mandate. The US National Archives joined the OA web portal Geospatial One Stop. The NSF Office of Cyberinfrastructure launched a data interoperability project (INTEROP). Google created a Public Sector Initiative to improve its crawling of OA databases hosted by federal, state, and local government agencies in the US. A group of open government activists convened by O'Reilly Media and Public.Resource.Org drafted principles for open government data. For the first time the US made progress toward OA for its three most notorious non-OA government resources: PACER (Public Access to Court Electronic Records), the database of federal court docket information; NTIS (National Technical Information Service), the online databases of research and business data; and CRS Reports, the highly regarded reports from the Congressional Research Service. The first two began offering OA to selected portions of their content, previously TA, and the third is the subject of a new bill in the Senate to mandate OA.

Nature editorialized in favor of e-notebook science and data sharing, and Nature Biotech recommended "that raw data from proteomics and molecular-interaction experiments be deposited in a public [OA] database before manuscript submission." Maxine Clarke, Publishing Executive Editor at Nature, said that the journal would consider requiring and not merely recommending OA for multimedia data if there were a suitable OA repository supporting annotation and long-term preservation. Wiley threatened legal action when Shelley Batts, a graduate student at the University of Michigan, posted a chart from a Wiley article from the Journal of the Science of Food and Agriculture on her blog; when she replaced it with her own chart of the same data and blogged Wiley's threat, the blogosphere exploded and Wiley said it was all a misunderstanding.

Data-sharing policies were adopted by the UK Medical Research Council, the Ethics Committee of France's Centre National de la Recherche Scientifique (CNRS), the Audiovisual Communications Laboratory at Switzerland's Ecole Polytechnique Fédérale de Lausanne, and the International Telecommunications Union. The NIH launched a new data-sharing program for its neuroscience research. There are too many new OA databases to name separately, but since I've mentioned the NIH, I should add that it launched the Database of Genotype and Phenotype (dbGaP) and SHARe (SNP Health Association Resource). It described SHARe as "one of the most extensive collections of genetic and clinical data ever made freely available to researchers worldwide."

Google began helping researchers exchange datasets up to 120 terabytes in size, too large for ordinary online uploads and downloads. At no charge to the researchers, it will ship a brick-sized box of hard drives from one research team to another, provided that the data have no copyright or licensing restrictions and the bricks stop first at Google headquarters for copying and offline storage. In time, Google hopes to make the datasets OA. The company also began sharing files of its own data with researchers on the condition that they make the results of their research OA.

The year 2007 saw a wave of general OA data repositories spring up, many with built-in features for graphics and analysis: for example, Dabble, Data360, Freebase, Many Eyes, Open Economics, StatCrunch, Swivel, and WikiProteins. At the same time, several projects worked to facilitate the deposit of data in OA repositories, such as EDINA's DataShare and JISC's SPECTRa (Submission, Preservation and Exposure of Chemistry Teaching and Research Data), or to enhance the interface between data repositories and literature repositories, such as JISC's StORe (Source-to-Output Repositories).

By my informal estimate, the fields with the largest advances in OA data during 2007 were archaeology, astronomy, chemistry, the environment (including climate change), geography (including mapping), and medicine (especially genomics and clinical drug trials)."

