Open Data

From P2P Foundation
The concept of '''Open Data''' is used in two different contexts, i.e. both as the availability of scientific raw data and as open access to publicly funded, 'government' information. (There is of course an obvious overlap when the scientific data are produced by public funding or government institutions.)

"'''The technological perspective represented by the Open Data Movement. Open data is a philosophy and practice requiring that certain data be freely available to everyone, without restrictions from copyright, patents or other mechanisms of control'''." [http://www.philippmueller.de/open-statecraft-for-a-brave-new-world/]

URL = http://okfn.org/opendata/
 
=Definition=
 
'''1. Open data is data that can be freely used, shared and built-on by anyone, anywhere, for any purpose.''' [http://blog.okfn.org/2013/10/03/defining-open-data/]
 
 
'''2. OpenDefinition.org:''' “Open data is data that can be freely used, reused and redistributed by anyone – subject only, at most, to the requirement to attribute and sharealike.”  - [http://okfn.org/opendata/]
 
'''3. From Wikipedia''' at http://en.wikipedia.org/wiki/Open_Data
 
"'''Open Data is a philosophy and practice requiring that certain data are freely available to everyone, without restrictions from copyright, patents or other mechanisms of control'''. It has a similar ethos to a number of other "Open" movements and communities such as [[Open Source]] and Open access.
 
Open Data is often focussed on non-textual material such as maps, genomes, chemical compounds, mathematical and scientific formulae, medical data and practice, bioscience and biodiversity. Problems often arise because these are commercially valuable or can be aggregated into works of value. Access to, or re-use of, the data are controlled by organisations, both public and private. Control may be through access restrictions, licenses, copyright, patents and charges for access or re-use. Advocates of Open Data argue that these restrictions are against the communal good and that these data should be made available without restriction or fee. In addition, it is important that the data are re-usable without requiring further permission, though the types of re-use (such as the creation of derivative works) may be controlled by license."
 
 
==What is Open?==
 
"The full Open Definition provides a precise definition of what open data is. There are 2 important elements to openness:
 
* Legal openness: you must be allowed to get the data legally, to build on it, and to share it. Legal openness is usually provided by applying an appropriate (open) license which allows for free access to and reuse of the data, or by placing data into the public domain.
 
* Technical openness: there should be no technical barriers to using that data. For example, providing data as printouts on paper (or as tables in PDF documents) makes the information extremely difficult to work with. So the Open Definition has various requirements for “technical openness,” such as requiring that data be machine readable and available in bulk.
 
 
There are a few key aspects of open which the Open Definition explains in detail. Open Data is useable by anyone, regardless of who they are, where they are, or what they want to do with the data; there must be no restriction on who can use it, and commercial use is fine too.
 
Open data must be available in bulk (so it’s easy to work with) and it should be available free of charge, or at least at no more than a reasonable reproduction cost. The information should be digital, preferably available by downloading through the internet, and easily processed by a computer too (otherwise users can’t fully exploit the power of data – that it can be combined together to create new insights).
 
Open Data must permit people to use it, re-use it, and redistribute it, including intermixing with other datasets and distributing the results.
 
The Open Definition generally doesn’t allow conditions to be placed on how people can use Open Data, but it does permit a data provider to require that data users credit them in some appropriate way, make it clear if the data has been changed, or that any new datasets created using their data are also shared as open data.
 
There are 3 important principles behind this definition of open, which are why Open Data is so powerful:
 
* Availability and Access: that people can get the data
 
* Re-use and Redistribution: that people can reuse and share the data
 
* Universal Participation: that anyone can use the data."
(http://blog.okfn.org/2013/10/03/defining-open-data/)
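The "technical openness" requirement quoted above — machine-readable data, available in bulk — can be illustrated with a short sketch. The dataset and its field names here are hypothetical; the point is that structured text formats like CSV can be loaded and aggregated by any programming language without special tooling, unlike the same table locked inside a PDF or a printout.

```python
import csv
import io

# A hypothetical open dataset published as machine-readable CSV.
RAW = """city,year,population
Amsterdam,2013,779808
Rotterdam,2013,616294
Utrecht,2013,321916
"""

def total_population(csv_text):
    """Sum the population column of a city/year/population table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(int(row["population"]) for row in reader)

print(total_population(RAW))  # -> 1718018
```

Had the same three rows been published as a scanned table, this one-liner aggregation would instead require manual transcription or OCR — which is exactly the barrier the Open Definition's technical-openness clause targets.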
 
=Characteristics=
 
 
'''1. The Open Definition''' gives full details on the requirements for ‘open’ data and content. Key features are:
 
* '''Availability and Access''': the data must be available as a whole and at no more than a reasonable reproduction cost, preferably by downloading over the internet. The data must also be available in a convenient and modifiable form.
 
* '''Reuse and Redistribution''': the data must be provided under terms that permit reuse and redistribution including the intermixing with other datasets. The data must be machine-readable.
 
* '''Universal Participation''': everyone must be able to use, reuse and redistribute – there should be no discrimination against fields of endeavour or against persons or groups. For example, ‘non-commercial’ restrictions that would prevent ‘commercial’ use, or restrictions of use for certain purposes (e.g. only in education), are not allowed.
(http://okfn.org/opendata/)
 
 
 
'''2. Fundamental Open Data Rights:'''
 
"Arguments made on behalf of Open Data include:

* "Data belong to the human race". Typical examples are genomes, data on organisms, medical science, environmental data.
* Public money was used to fund the work and so it should be universally available.
* It was created by or at a government institution (this is common in US National Laboratories and government agencies).
* Facts cannot legally be copyrighted.
* Sponsors of research do not get full value unless the resulting data are freely available.
* Restrictions on data re-use create an anticommons.
* Data are required for the smooth process of running communal human activities (map data, public institutions).
* In scientific research, the rate of discovery is accelerated by better access to data."

(http://en.wikipedia.org/wiki/Open_Data)
 
 
'''3. Relation to other open activities:'''
 
"There are a number of other "Open" philosophies which are similar to, but not synonymous with, Open Data, and which may overlap, be supersets, or subsets. Here they are briefly listed and compared.

* [[Open Source Software]] is concerned with the licenses under which computer programs can be distributed and is not normally concerned primarily with data.
* [[Open Content]] has similarities to Open Data and may be seen as a superset, but differs in that it emphasizes creative works while Open Data is more oriented towards factual data and the output of the scientific research process.
* [[Open Knowledge]]. The Open Knowledge Foundation argues for Openness in a range of issues including, but not limited to, those of Open Data. It covers (a) data: scientific, historical, geographic or otherwise; (b) content such as music, films, books; (c) government and other administrative information."

(http://en.wikipedia.org/wiki/Open_Data)
 
 
'''4. Open Data are opposed by Closed Data:'''
 
"Several intentional or unintentional mechanisms exist for restricting access to or re-use of data. They include:

* compilation in databases or websites to which only registered members or customers can have access.
* use of a proprietary or closed technology or encryption which creates a barrier for access.
* copyright forbidding (or obfuscating) re-use of the data.
* license forbidding (or obfuscating) re-use of the data.
* patent forbidding re-use of the data (for example the 3-dimensional coordinates of some experimental protein structures have been patented).
* restriction of robots to websites, with preference to certain search engines.
* aggregating factual data into "databases" which may be covered by "database rights" or "database directives" (e.g. Directive on the legal protection of databases).
* time-limited access to resources such as e-journals (which on traditional print were available to the purchaser indefinitely).
* political, commercial or legal pressure on the activity of organisations providing Open Data (for example the American Chemical Society lobbied the US Congress to limit funding to the National Institutes of Health for its Open Pubchem data)."

(http://en.wikipedia.org/wiki/Open_Data)
 
 
 
=How-To=
 
==3 Key Rules==
 
"There are three key rules we recommend following when opening up data:
 
'''Keep it simple.''' Start out small, simple and fast. There is no requirement that every dataset must be made open right now. Starting out by opening up just one dataset, or even one part of a large dataset, is fine — of course, the more datasets you can open up the better.
 
Remember this is about innovation. Moving as rapidly as possible is good because it means you can build momentum and learn from experience — innovation is as much about failure as success and not every dataset will be useful.
 
'''Engage early and engage often'''. Engage with actual and potential users and re-users of the data as early and as often as you can, be they citizens, businesses or developers. This will ensure that the next iteration of your service is as relevant as it can be.
It is essential to bear in mind that much of the data will not reach ultimate users directly, but rather via ‘infomediaries’. These are the people who take the data and transform or remix it so that it can be presented. For example, most of us don’t want or need a large database of GPS coordinates; we would much prefer a map. Thus, engage with infomediaries first. They will re-use and repurpose the material.
 
'''Address common fears and misunderstandings'''. This is especially important if you are working with or within large institutions such as government. When opening up data you will encounter plenty of questions and fears. It is important to (a) identify the most important ones and (b) address them at as early a stage as possible."
(http://okfn.org/opendata/)
 
==The Four Steps==
 
"These are in very approximate order – many of the steps can be done simultaneously.
 
* Choose your dataset(s). Choose the dataset(s) you plan to make open. Keep in mind that you can (and may need to) return to this step if you encounter problems at a later stage.
 
* Apply an open license:
** Determine what intellectual property rights exist in the data.
** Apply a suitable ‘open’ license that licenses all of these rights and supports the definition of openness discussed in the section above on ‘What is Open?’
 
* Make the data available – in bulk and in a useful format. You may also wish to consider alternative ways of making it available such as via an API.
 
* Make it discoverable – post on the web and perhaps organize a central catalog to list your open datasets."
 
(http://okfn.org/opendata/)
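The last two steps above — make the data available in bulk, then make it discoverable — can be sketched in a few lines. This is an illustrative sketch, not an official tool: the function name, the metadata fields, and the license identifier are assumptions, loosely modeled on the "data package" idea of shipping a dataset file together with a small machine-readable metadata record that a catalog can index.

```python
import csv
import json
import tempfile
from pathlib import Path

def publish_dataset(rows, name, out_dir):
    """Write a dataset as a bulk CSV file plus a small JSON metadata record.

    The metadata names a license (step 2 of the Four Steps) and points at
    the bulk download (step 3), so a central catalog can list the dataset
    (step 4). All field names here are illustrative.
    """
    out = Path(out_dir)
    data_path = out / f"{name}.csv"
    with open(data_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)       # bulk, machine-readable data
    meta = {
        "name": name,
        "license": "ODC-BY-1.0",            # an open data license
        "path": data_path.name,             # where the bulk file lives
    }
    (out / f"{name}.json").write_text(json.dumps(meta, indent=2))
    return meta

with tempfile.TemporaryDirectory() as d:
    meta = publish_dataset([["id", "value"], ["1", "42"]], "demo", d)
    print(meta["license"])  # -> ODC-BY-1.0
```

Keeping the metadata in a separate, plain JSON file means the dataset itself stays untouched while catalogs and crawlers get everything they need to make it discoverable.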
 
=Discussion=
 
==Why open data may be more important than open source==
 
 
Ian Davis:
 
"Data outlasts code, which leads me to assert that open data is therefore more important than open source. This appears to be controversial.
 
First, it’s important to note what I did not say. I did not say that open source is not important. On the contrary I said that open source was extremely important and it has sounded the death knell for proprietary software. Later speakers at the conference referred to this statement as controversial too :). (What I actually meant to say was that open source has sounded the death knell for proprietary software models). I also mentioned that open source and free software has a long history and that open data is where open source was 25 years ago (I am using the term open source and free software interchangeably here).
 
I also did not say that code does not last nor that algorithms do not last. Of course they last, but data lasts longer. My point was that code is tied to processes usually embodied in hardware whereas data is agnostic to the hardware it resides on. The audience at the conference understand this already: they are archivists and librarians and they deal with data formats like MARC which has had superb longevity. Many of them deal with records every day that are essentially the same as they were two or three decades ago. Those records have gone through multiple generations of code to parse and manipulate the data.
 
It’s true that you need code to access data, but critically it doesn’t have to be the same code from year to year, decade to decade, century to century. Any code capable of reading the data will do, even if it’s proprietary. You can also recreate the code whereas the effort involved in recreating the data could be prohibitively high. This is, of course, a strong argument for open data formats with simple data models: choosing CSV, XML or RDF is going to give you greater data longevity than PDF, XLS or PST because the cost of recreating the parsing code is so much lower.
 
Here’s the central asymmetry that leads me to conclude that open data is more important than open source: if you have data without code then you could write a program to extract information from the data, but if you have code without data then you have lost that information forever.
 
Consider also the rise of software as a service. It really doesn’t matter whether the code such services are built on is open source or not if you cannot access the data they manage for you. Even if you reproduce the service completely, using the same components, your data is buried away, out of your reach. However, if you have access to the data then you can achieve continuity even if you don’t have access to the underlying source of the application. I’ll say it again: open data is more important than open source.
 
Of course we want open standards, open source and open data. But in one or two hundred years which will still be relevant? Patents and copyrights on formats expire, hardware platforms and even their paradigms shift and change. Data persists, open data endures.
 
The problem we have today is that the open data movement is in its infancy when compared to open source. We have so far to go, and there are many obstacles. One of the first steps to maturity is to give people the means to express how open their data is, how reusable it is. The Open Data Commons is an organisation explicitly set up to tackle the problem of open data licensing. If you are publishing data in any way you ought to check out their licences and see if any meet with your goals. If you licence your data openly then it will be copied and reused and will have an even greater chance of persisting over the long term."
(http://iandavis.com/blog/2009/03/open-data-open-source)
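Davis's point that "the cost of recreating the parsing code is so much lower" for simple open formats can be made concrete. The following deliberately naive parser is hypothetical and ignores CSV quoting rules; it exists only to show that a working reader for an open plain-text format is a few lines of code, whereas re-implementing a reader for a closed binary format (XLS, PST) without its specification is a major undertaking.

```python
def parse_simple_csv(text):
    """Parse quoting-free CSV text into a list of dicts, from scratch.

    Decades from now, any programmer could rewrite this in minutes for
    whatever language is current -- the data model (rows of named
    fields) is visible in the file itself.
    """
    lines = [ln for ln in text.strip().splitlines() if ln]
    header = lines[0].split(",")
    return [dict(zip(header, ln.split(","))) for ln in lines[1:]]

records = parse_simple_csv("id,title\n1,Open Data\n2,Open Source\n")
print(records[0]["title"])  # -> Open Data
```

The asymmetry Davis describes falls out directly: the data file carries enough structure to regenerate its own tooling, while the tooling alone cannot regenerate the data.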
 
 
 
=Resources=
 
 
==Open Data Policies==

Recommendations from the U.S. Public Policy Committee of the ACM (USACM):

* Data published by the government should be in formats and approaches that promote analysis and reuse of that data.
* Data republished by the government that has been received or stored in a machine-readable format (such as online regulatory filings) should preserve the machine-readability of that data.
* Information should be posted so as to also be accessible to citizens with limitations and disabilities.
* Citizens should be able to download complete datasets of regulatory, legislative or other information, or appropriately chosen subsets of that information, when it is published by government.
* Citizens should be able to directly access government-published datasets using standard methods such as queries via an API (Application Programming Interface).
* Government bodies publishing data online should always seek to publish using data formats that do not include executable content.
* Published content should be digitally signed or include attestation of publication/creation date, authenticity, and integrity.

(http://www.acm.org/public-policy/open-government)


==Open Access to Government Information==

Refers to the campaign for the openness of data collected by government, against company-centric licensing regimes which withhold access to publicly funded data from the public at large.


===Description===

From the key essay by Peter Weiss, [[Borders in Cyberspace]]:

"Many nations are embracing the concept of open and unrestricted access to public sector information -- particularly scientific, environmental, and statistical information of great public benefit. Federal information policy in the US is based on the premise that government information is a valuable national resource and that the economic benefits to society are maximized when taxpayer funded information is made available inexpensively and as widely as possible. This policy is expressed in the Paperwork Reduction Act of 1995 and in Office of Management and Budget Circular No. A-130, “Management of Federal Information Resources.”[1] This policy actively encourages the development of a robust private sector, offering to provide publishers with the raw content from which new information services may be created, at no more than the cost of dissemination and without copyright or other restrictions.

In other countries, particularly in Europe, publicly funded government agencies treat their information holdings as a commodity used to generate short-term revenue. They assert monopoly control on certain categories of information to recover the costs of its collection or creation. Such arrangements tend to preclude other entities from developing markets for the information or otherwise disseminating the information in the public interest.

In the US, open and unrestricted access to public sector information has resulted in the rapid growth of information intensive industries particularly in the geographic information and environmental services sectors. Similar growth has not occurred in Europe due to restrictive government information practices. As a convenient shorthand, one might label the American and European approaches as ‘open access’ and ‘cost recovery’, respectively. The cost recovery model is now being challenged on a variety of grounds."

(http://www.primet.org/documents/weiss%20-%20borders%20in%20cyberspace.htm)


===OECD Public Sector Information definition===

From http://www.firstmonday.org/issues/issue12_6/wunsch/index.html:

"Public sector information which often has characteristics of being: dynamic and continually generated, directly generated by the public sector, associated with the functioning of the public sector (for example, meteorological data, business statistics), and readily useable in commercial applications; and,

Public content which often has characteristics of being: static (i.e. it is an established record), held by the public sector rather than being directly generated by it (cultural archives, artistic works where third–party rights may be important), not directly associated with the functioning of government, and not necessarily associated with commercial uses but having other public good purposes (culture, education).

The first category comprises public sector “knowledge” which may be the basis for information–intensive industries; these employ the raw data to produce increasingly sophisticated products. The second refers to cultural, educational and scientific public knowledge where wide public diffusion and long–term preservation (e.g. via museums, libraries, schools) are major governmental objectives."

(http://www.firstmonday.org/issues/issue12_6/wunsch/index.html)


===2006 Open Data Movement Status===

By Peter Suber at http://www.earlham.edu/~peters/fos/newsletter/01-02-07.htm:

"2006 was another big year for [[Open Access]] to data. China's Ministry of Science and Technology mandated OA to about 80% of the data generated by publicly-funded research. The Canadian Institutes of Health Research wrote a draft OA policy that would not only mandate OA to research articles but also some of the data files resulting from CIHR-funded research. The Gates Foundation required data sharing for its HIV/AIDS research. The Global Initiative on Sharing Avian Influenza Data was one of several initiatives to encourage OA to avian flu data, breaking the previous, widespread national practices of hoarding it to head off agricultural boycotts or help local scientists scoop foreigners. The US National Science Foundation's Cyberinfrastructure Vision For 21st Century Discovery endorsed open access to data. The Governing Board of the Global Biodiversity Information Facility adopted a Recommendation On Open Access To Biodiversity Data, reaffirming and extending its OA statement from last year. The Conference of the Parties to the Convention on Biological Diversity endorsed OA for biodiversity data. The NIH's OA data repository for biochemistry, PubChem, prevailed against the attempt by the American Chemical Society to defund it or scale it back, and began attracting content from commercial players like Thomson Scientific. The ALPSP and STM, which resist the growth of OA archiving, called for OA to raw data, especially data underlying published journal articles. The Guardian launched the Free Our Data campaign and pressed the UK government to provide OA to publicly-funded data, especially geospatial data. The UK Office of Fair Trading estimated that lack of OA to public data costs the country £500 million/year. The Public Geo Data launched an online petition calling for OA to EU-collected geospatial data. The Commission to the European Parliament recommended OA to publicly-funded EU geodata.

The European Parliament reached a compromise on the INSPIRE Directive (Infrastructure for Spatial Information in Europe), providing OA to some data and providing other data on a cost-recovery basis. The Universal Protein Resource became the first database to use a Creative Commons license to encourage re-use, and Science Commons wrote an FAQ on using CC licenses for databases. The SPARC discussion list on Open Data, moderated by Peter Murray-Rust, though launched in late 2005, came to life in 2006. At least two powerful tools, FortiusOne and Swivel, launched to host and analyze OA data."

(http://www.earlham.edu/~peters/fos/newsletter/01-02-07.htm)


==Open Data Organizations==

* [[CODATA]]
* [[Science Commons]]
* [[Free Our Data]] ([[The Guardian]] technology section), http://www.freeourdata.org.uk/index.php
* [http://www.okfn.org/ The Open Knowledge Foundation]
* [http://www.talis.com/ Talis]
* [http://www.web2express.org/ Web2Express.org, Open data on semantic web]
* [http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData Linking Open Data on the Semantic Web]


==Open Data Companies==

"'''Open data is to media what open source is to technology. Open data is an approach to content creation that explicitly recognizes the value of implicit user data'''. The internet is the first medium to give a voice to the attention that people pay to it. Successful open data companies listen for and amplify the rich data that their audiences produce."

(http://www.attentiontrust.org/node/430)

* Adaptive Blue – Extended browsing
* Aggregate Knowledge – Outsourced recommendations
* Atten.TV – Attention media
* Buzzlogic – Tracking influence
* ClearForest – Text analytics
* Daylife – Hi-touch algorithmic news
* Feedburner – RSS content management
* Lijit Networks – Ranking people
* Majestic Research – Online behavior for investors
* Meetup – America offline
* MyBlogLog – Reader communities
* Omnidrive – Open data storage
* Right Media – Transparent ad network
* Stumbleupon – The "forward" button

(http://www.attentiontrust.org/node/428)


==Open Data Repositories==

Open data sets available on the Web. Examples include [[Wikipedia]], [[Geonames]] and [[MusicBrainz]].

See also: [[PubChem]]

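The USACM recommendation above that published content "be digitally signed or include attestation of publication/creation date, authenticity, and integrity" can be approximated at its simplest by publishing a cryptographic digest alongside each dataset. This is a minimal sketch, not a standard: the record's field names are assumptions, and a SHA-256 checksum only detects tampering — a real digital signature (e.g. via GPG) would additionally prove who published the data.

```python
import hashlib

def attestation(data: bytes, date: str):
    """Build a minimal integrity record for a published dataset.

    A downloader can recompute the SHA-256 digest over the bytes they
    received and compare it with this record to detect corruption or
    tampering. Field names here are illustrative, not a standard.
    """
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "published": date,
    }

record = attestation(b"city,population\nUtrecht,321916\n", "2013-10-03")
print(record["sha256"][:12])
```

Publishing the record at a well-known location next to the bulk download keeps the scheme dependency-free: verification needs nothing beyond a standard hash implementation.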

=Open Data Domains=

The concept of '''Open Data''' is used in different contexts, mostly as either the availability of scientific raw data or as open access to publicly funded, 'government' information. (There is of course an obvious overlap when the scientific data are produced by public funding or government institutions.)


==Open Access to Government Information==

See [[Open Access to Government Information]], as well as [[Open Government Data]] and [[Open Public Data]].

See the sites of UK-based organizations such as [[Free Our Data]] and [[Public Geodata]].


===More Information===

More info at http://www.re-public.gr/en/?p=98. This article specifically focuses on geographic datasets in the UK.




==Open Data in Science==

See: [[Open Data in Science]]


===Definition===

Peter Murray-Rust of the Unilever Centre for Molecular Sciences Informatics at the University of Cambridge (UK):

"The emerging Open Data movement shares many goals with the Open Access and Open Source movements, but encompasses its own distinct issues that are in need of examination by the scientific community. Many advocates of Open Data believe that, although there are substantial potential benefits from sharing and reusing digital data upon which scientific advances are built, today much of it is being lost or underutilized because of legal, technological and other barriers."

(http://www.arl.org/sparc/announce/102405.html)


===Requirements for Open Data in science===

Quoted from http://www.windley.com/archives/2006/05/free_the_data.shtml:

* Re-use structures including schemas and ontologies. It’s more important to use well-understood structures than to use any particular idiom.
* Re-use the licenses that have already been developed. Licensing meta-data (à la Creative Commons) is also important.
* Enable re-use of ideas (contrasted with the expression of the idea). We have to find the proper scope of ‘derivative works’ and re-examine the issue of database copyright. Shockingly, copying the bibliographic data from a work (for purposes of citation) can be seen as a violation of some licenses.
* Attach policy information that says how the information can be used. Some experimental data depends critically on personally identifying information. Anonymization is a hard task, either not working well or being at odds with the underlying research purpose of the data.
* Use open standards.

(Weitzner presentation at http://www.w3.org/2006/Talks/0525-web-data-publishing/#(3); quoted here [http://www.windley.com/archives/2006/05/free_the_data.shtml])


=Status Report 2007=

Peter Suber:

"With or without mandates, more governments committed themselves to OA for publicly funded data. Norway adopted an OA mandate for public geodata. Canada, Ireland, and Australia began providing OA to publicly funded digital mapping data, without a mandate. After long resistance, the UK Ordnance Survey began to do the same, at least experimentally. (Earlier in the year, a legal analysis by Charlotte Waelde, an expert on intellectual property at the University of Edinburgh, concluded that the data are not protected by copyright but at most, only by the database right; a JISC report recommended a general UK policy of OA for research data; and the new UK Prime Minister Gordon Brown endorsed the principle of public access to public data.) The Committee of Ministers of the Council of Europe recommended "wide public access to research results to which no copyright restrictions apply" (i.e. data). Publishing consultant Eve Gray reported that the South African government was moving toward a policy of OA for publicly funded research data. The Australian government proposed an Australian National Data Service to promote OA and re-use of publicly funded research data. The Organisation for Economic Co-operation and Development (OECD) issued principles and guidelines to implement its 2004 Declaration on Access to Research Data from Public Funding. California is about to adopt the strongest and broadest OA mandate for greenhouse gas data in the US, and Pennsylvania is about to join the other 49 states in mandating OA for state statutes. And the UN Convention on Long-range Transboundary Air Pollution (LRTAP) adopted an OA mandate for most kinds of data covered by the convention.

The US Government Accountability Office called on four major federal funding agencies (DOE, NASA, NOAA, and NSF) to enforce their existing policies on data sharing. Twenty-two US federal government agencies formed an Interagency Working Group on Digital Data (IWGDD), plan to deposit the data generated by their research grantees in a network of OA repositories, and are considering an OA mandate. The US National Archives joined the OA web portal Geospatial One Stop. The NSF Office of Cyberinfrastructure launched a data interoperability project (INTEROP). Google created a Public Sector Initiative to improve its crawling of OA databases hosted by federal, state, and local government agencies in the US. A group of open government activists convened by O'Reilly Media and Public.Resource.Org drafted principles for open government data. For the first time the US made progress toward OA for its three most notorious non-OA government resources: PACER (Public Access to Court Electronic Records), the database of federal court docket information; NTIS (National Technical Information Service), the online databases of research and business data; and CRS Reports, the highly regarded reports from the Congressional Research Service. The first two began offering OA to selected portions of their content, previously TA, and the third is the subject of a new bill in the Senate to mandate OA.

Nature editorialized in favor of e-notebook science and data sharing, and Nature Biotech recommended "that raw data from proteomics and molecular-interaction experiments be deposited in a public [OA] database before manuscript submission." Maxine Clarke, Publishing Executive Editor at Nature, said that the journal would consider requiring and not merely recommending OA for multimedia data if there were a suitable OA repository supporting annotation and long-term preservation. Wiley threatened legal action when Shelley Batts, a graduate student at the University of Michigan, posted a chart from a Wiley article from the Journal of the Science of Food and Agriculture on her blog; when she replaced it with her own chart of the same data and blogged Wiley's threat, the blogosphere exploded and Wiley said it was all a misunderstanding.

Data-sharing policies were adopted by the UK Medical Research Council, the Ethics Committee of France's Centre National de la Recherche Scientifique (CNRS), the Audiovisual Communications Laboratory at Switzerland's Ecole Polytechnique Fédérale de Lausanne, and the International Telecommunications Union. The NIH launched a new data-sharing program for its neuroscience research. There are too many new OA databases to name separately, but since I've mentioned the NIH, I should add that it launched the Database of Genotype and Phenotype (dbGaP) and SHARe (SNP Health Association Resource). It described SHARe as "one of the most extensive collections of genetic and clinical data ever made freely available to researchers worldwide."

Google began helping researchers exchange datasets up to 120 terabytes in size, too large for ordinary online uploads and downloads. At no charge to the researchers, it will ship a brick-sized box of hard drives from one research team to another, provided that the data have no copyright or licensing restrictions and the bricks stop first at Google headquarters for copying and offline storage. In time, Google hopes to make the datasets OA. The company also began sharing files of its own data with researchers on the condition that they make the results of their research OA.

The year 2007 saw a wave of general OA data repositories spring up, many with built-in features for graphics and analysis: for example, Dabble, Data360, Freebase, Many Eyes, Open Economics, StatCrunch, Swivel, and WikiProteins. At the same time, several projects worked to facilitate the deposit of data in OA repositories, such as EDINA's DataShare and JISC's SPECTRa (Submission, Preservation and Exposure of Chemistry Teaching and Research Data), or to enhance the interface between data repositories and literature repositories, such as JISC's StORe (Source-to-Output Repositories).

By my informal estimate, the fields with the largest advances in OA data during 2007 were archaeology, astronomy, chemistry, the environment (including climate change), geography (including mapping), and medicine (especially, genomics and clinical drug trials)."

(http://quod.lib.umich.edu/cgi/t/text/text-idx?c=jep;view=text;rgn=main;idno=3336451.0011.110)


===Status Report: Access to Research Data in the OECD===

From http://www.firstmonday.org/issues/issue12_6/wunsch/index.html:

"Throughout OECD Member countries, continuously growing quantities of data are collected by publicly funded researchers and research institutions. This rapidly expanding body of research data represents both a massive investment of public funds and a potential source of the knowledge needed to address the myriad challenges facing humanity.

To promote improved scientific and social return on the public investments in research data, OECD member countries have established a variety of laws, policies and practices concerning access to research data at the national level. In this context, it was recognized that international guidelines would be an important contribution to fostering the global exchange and use of research data."


At the outset, the third OECD Global Research Village Conference addressed policy implications of the use of Information and Communication Technologies (ICT) for the global science system in 2000 [7]. In particular, the conference discussed issues of access to publicly financed research related to ICT as for instance access to intellectual property and data resources. In 2001, the OECD’s Committee for Scientific and Technological Policy (CSTP) agreed to the establishment of a Working Group to draw up commonly agreed principles to guide access to publicly financed research. Access to and sharing of research data from public funding was chosen as the most appropriate focus for the activities of the Working Group [8]. Collaborations with similar working groups such as CODATA [9] were sought.
==Open data and the Commons==


In 2004, OECD Science and Technology Ministers declared that fostering broader, open access to and wide use of research data will enhance the quality and productivity of science systems worldwide. Ministers adopted a Declaration on Access to Research Data from Public Funding, asking the OECD to take further steps towards proposing Principles and Guidelines on Access to Research Data from Public Funding, based on commonly agreed principles to facilitate optimal cost-effective access to digital research data from public funding, and taking into account possible restrictions related to security, property rights and privacy (Annex) [10]. It recognizes “that open access to, and unrestricted use of, data promotes scientific progress and facilitates the training of researchers” and “will maximize the value derived from public investments in data collection efforts”, and entrusted the OECD ’s Committee for Scientific and Technological Policy (CSTP) to work towards the establishment of access regimes for digital research data from public funding. The Ministers asked for the guidelines to be endorsed by the OECD Council at a later stage.
an old story? by Simon Chignard:


An expert group was formed to support this objective of translating Minister’s goals into an OECD policy instrument. The objective of the Expert Group is to draft useful and relevant guidelines that can be used by national governments and a wide variety of research organizations to facilitate and improve the international sharing of, and access to, digital research data gathered with the assistance of public funding.
"There is a direct link between the open data movement and the philosophy of common goods. Open data are an illustration of the notion of common informational goods proposed by Elinor Ostrom, winner of the 2009 Nobel Prize for economics. Open data belong to everyone and, unlike water and air (and other common goods), they are non-exclusive: their use by one does not prevent others. If I reuse an open data set, this does not prevent other reusers from doing so. This proximity between the commons and open data is also suggested by the presence of the initiator of Creative Commons licences, Lawrence Lessig, at the 2007 Sebastopol meeting in which the concept of open data itself was defined.


The nature of “public funding” of research varies significantly from one country to the next, as do existing data access policies and practices at the national, disciplinary and institutional levels. These differences call for a flexible approach to data access and recognition that one size does not fit all."
But despite the strong conceptual and historical linkages, it seems that we, as actors of open data, are often shy to reaffirm the relationship. In our efforts to encourage public and private bodies to embrace open data, we seem almost embarrassed of this cornerstone philosophy. The four proposals I make here aim at one thing: not letting it drop!"
(http://www.firstmonday.org/issues/issue12_6/wunsch/index.html)
(http://blog.okfn.org/2013/01/10/4-ideas-for-defending-the-open-data-commons/)


===More Information===
=More Information=


SPARC Open Data Email Discussion List, at http://www.arl.org/sparc/opendata/index.html
#[[Open Science]]
#[[Open Notebook Science]]


[[2004 OECD Ministerial Declaration on Access to Digital Research Data from Public Funding]]


[[Category:Encyclopedia]]
[[Category:Standards]]
[[Category:Policy]]
[[Category:Open]]
[[Category:Licensing]]
[[Category:Open Data]]

Latest revision as of 22:15, 10 March 2014



=What is Open?=

"The full Open Definition provides a precise definition of what open data is. There are 2 important elements to openness:

*Legal openness: you must be allowed to get the data legally, to build on it, and to share it. Legal openness is usually provided by applying an appropriate (open) license which allows for free access to and reuse of the data, or by placing data into the public domain.
*Technical openness: there should be no technical barriers to using that data. For example, providing data as printouts on paper (or as tables in PDF documents) makes the information extremely difficult to work with. So the Open Definition has various requirements for "technical openness," such as requiring that data be machine readable and available in bulk.


There are a few key aspects of open which the Open Definition explains in detail. Open Data is useable by anyone, regardless of who they are, where they are, or what they want to do with the data; there must be no restriction on who can use it, and commercial use is fine too.

Open data must be available in bulk (so it’s easy to work with) and it should be available free of charge, or at least at no more than a reasonable reproduction cost. The information should be digital, preferably available by downloading through the internet, and easily processed by a computer too (otherwise users can’t fully exploit the power of data – that it can be combined together to create new insights).

Open Data must permit people to use it, re-use it, and redistribute it, including intermixing with other datasets and distributing the results.

The Open Definition generally doesn’t allow conditions to be placed on how people can use Open Data, but it does permit a data provider to require that data users credit them in some appropriate way, make it clear if the data has been changed, or that any new datasets created using their data are also shared as open data.

There are 3 important principles behind this definition of open, which are why Open Data is so powerful:

*Availability and Access: that people can get the data
*Re-use and Redistribution: that people can reuse and share the data
*Universal Participation: that anyone can use the data."

(http://blog.okfn.org/2013/10/03/defining-open-data/)
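The "technical openness" criteria above — machine readability and availability in bulk — are easy to demonstrate in a few lines of code. A minimal sketch in Python, using a hypothetical inline dataset in place of a real bulk download:

```python
import csv
import io

# A tiny stand-in for an openly published, machine-readable dataset.
# (Hypothetical figures; a real open dataset would be fetched in bulk.)
published_csv = """city,year,co2_tonnes
Oslo,2020,1100000
Oslo,2021,1050000
Bergen,2020,400000
"""

# Because the data is machine-readable, a few lines suffice to reuse it,
# combine it, and derive something new from it:
rows = list(csv.DictReader(io.StringIO(published_csv)))
oslo_total = sum(int(r["co2_tonnes"]) for r in rows if r["city"] == "Oslo")
print(oslo_total)  # 2150000
```

The same figures locked in a PDF table would first require error-prone manual extraction, which is exactly the barrier the Open Definition rules out.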

=Characteristics=

1. The Open Definition gives full details on the requirements for ‘open’ data and content. Key features are:

*Availability and Access: the data must be available as a whole and at no more than a reasonable reproduction cost, preferably by downloading over the internet. The data must also be available in a convenient and modifiable form.
*Reuse and Redistribution: the data must be provided under terms that permit reuse and redistribution, including the intermixing with other datasets. The data must be machine-readable.
*Universal Participation: everyone must be able to use, reuse and redistribute – there should be no discrimination against fields of endeavour or against persons or groups. For example, 'non-commercial' restrictions that would prevent 'commercial' use, or restrictions of use for certain purposes (e.g. only in education), are not allowed."

(http://okfn.org/opendata/)


2. Fundamental Open Data Rights:

"Arguments made on behalf of Open Data include:


*"Data belong to the human race." Typical examples are genomes, data on organisms, medical science, environmental data.
*Public money was used to fund the work, and so it should be universally available.
*It was created by or at a government institution (this is common in US National Laboratories and government agencies).
*Facts cannot legally be copyrighted.
*Sponsors of research do not get full value unless the resulting data are freely available.
*Restrictions on data re-use create an anticommons.
*Data are required for the smooth process of running communal human activities (map data, public institutions).
*In scientific research, the rate of discovery is accelerated by better access to data."

(http://en.wikipedia.org/wiki/Open_Data)


3. Relation to other open activities:

"There are a number of other "Open" philosophies which are similar to, but not synonymous with Open Data but which may overlap, be supersets, or subsets. Here they are briefly listed and compared.


*Open Source software is concerned with the licenses under which computer programs can be distributed, and is not normally concerned primarily with data.
*Open Content has similarities to Open Data and may be seen as a superset, but differs in that it emphasizes creative works, while Open Data is more oriented towards factual data and the output of the scientific research process.
*Open Knowledge. The Open Knowledge Foundation argues for openness in a range of issues including, but not limited to, those of Open Data. It covers (a) data: scientific, historical, geographic or otherwise; (b) content, such as music, films, books; (c) government and other administrative information."

(http://en.wikipedia.org/wiki/Open_Data)


4. Open Data are opposed by Closed Data:

"Several intentional or unintentional mechanisms exist for restricting access to or re-use of data. They include:


*compilation in databases or websites to which only registered members or customers can have access;
*use of a proprietary or closed technology or encryption which creates a barrier for access;
*copyright forbidding (or obfuscating) re-use of the data;
*licenses forbidding (or obfuscating) re-use of the data;
*patents forbidding re-use of the data (for example, the three-dimensional coordinates of some experimental protein structures have been patented);
*restriction of robots to websites, with preference given to certain search engines;
*aggregating factual data into "databases" which may be covered by "database rights" or "database directives" (e.g. the Directive on the legal protection of databases);
*time-limited access to resources such as e-journals (which in traditional print were available to the purchaser indefinitely);
*political, commercial or legal pressure on the activity of organisations providing Open Data (for example, the American Chemical Society lobbied the US Congress to limit funding to the National Institutes of Health for its open PubChem data)."

(http://en.wikipedia.org/wiki/Open_Data)


=How-To=

==3 Key Rules==

"There are three key rules we recommend following when opening up data:

'''Keep it simple.''' Start out small, simple and fast. There is no requirement that every dataset must be made open right now. Starting out by opening up just one dataset, or even one part of a large dataset, is fine — of course, the more datasets you can open up the better.

Remember this is about innovation. Moving as rapidly as possible is good because it means you can build momentum and learn from experience — innovation is as much about failure as success, and not every dataset will be useful.

'''Engage early and engage often.''' Engage with actual and potential users and re-users of the data as early and as often as you can, be they citizens, businesses or developers. This will ensure that the next iteration of your service is as relevant as it can be. It is essential to bear in mind that much of the data will not reach ultimate users directly, but rather via 'info-mediaries'. These are the people who take the data and transform or remix it to be presented. For example, most of us don't want or need a large database of GPS coordinates; we would much prefer a map. Thus, engage with infomediaries first. They will re-use and repurpose the material.

'''Address common fears and misunderstandings.''' This is especially important if you are working with or within large institutions such as government. When opening up data you will encounter plenty of questions and fears. It is important to (a) identify the most important ones and (b) address them at as early a stage as possible." (http://okfn.org/opendata/)

==The Four Steps==

"These are in very approximate order – many of the steps can be done simultaneously.

*Choose your dataset(s). Choose the dataset(s) you plan to make open. Keep in mind that you can (and may need to) return to this step if you encounter problems at a later stage.
*Apply an open license:
**Determine what intellectual property rights exist in the data.
**Apply a suitable 'open' license that licenses all of these rights and supports the definition of openness discussed in the section above on 'What is Open?'.
*Make the data available – in bulk and in a useful format. You may also wish to consider alternative ways of making it available, such as via an API.
*Make it discoverable – post on the web and perhaps organize a central catalog to list your open datasets."

(http://okfn.org/opendata/)
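The four steps above can be sketched in code. This is an illustrative sketch only — the dataset, its title, and the choice of the ODC-BY license are hypothetical examples, not something the OKFN text prescribes:

```python
import csv
import hashlib
import io
import json

# Step 1. Choose your dataset (here, a tiny made-up one).
dataset = [
    {"station": "A1", "pm25": "12.4"},
    {"station": "B2", "pm25": "8.9"},
]

# Steps 2–3. Make the data available in bulk, in a useful machine-readable
# format (CSV), with the chosen open license declared in the metadata below.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["station", "pm25"])
writer.writeheader()
writer.writerows(dataset)
bulk_csv = buf.getvalue()

# Step 4. Make it discoverable: a catalog entry describing the published file.
catalog_entry = {
    "title": "Air quality readings (example)",
    "license": "ODC-BY-1.0",  # hypothetical choice of open license
    "format": "text/csv",
    "sha256": hashlib.sha256(bulk_csv.encode()).hexdigest(),
}
print(json.dumps(catalog_entry, indent=2))
```

A real deployment would upload `bulk_csv` somewhere public and register the catalog entry in a portal such as a CKAN instance, but the moving parts are the same: data file, license declaration, discoverable metadata.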

=Discussion=

==Why open data may be more important than open source==

Ian Davis:

"Data outlasts code, which led me to then assert that therefore open data is more important than open source. This appears to be controversial.

First, it's important to note what I did not say. I did not say that open source is not important. On the contrary, I said that open source was extremely important and that it has sounded the death knell for proprietary software. Later speakers at the conference referred to this statement as controversial too :). (What I actually meant to say was that open source has sounded the death knell for proprietary software models.) I also mentioned that open source and free software have a long history and that open data is where open source was 25 years ago. (I am using the terms open source and free software interchangeably here.)

I also did not say that code does not last nor that algorithms do not last. Of course they last, but data lasts longer. My point was that code is tied to processes usually embodied in hardware whereas data is agnostic to the hardware it resides on. The audience at the conference understand this already: they are archivists and librarians and they deal with data formats like MARC which has had superb longevity. Many of them deal with records every day that are essentially the same as they were two or three decades ago. Those records have gone through multiple generations of code to parse and manipulate the data.

It’s true that you need code to access data, but critically it doesn’t have to be the same code from year to year, decade to decade, century to century. Any code capable of reading the data will do, even if it’s proprietary. You can also recreate the code whereas the effort involved in recreating the data could be prohibitively high. This is, of course, a strong argument for open data formats with simple data models: choosing CSV, XML or RDF is going to give you greater data longevity than PDF, XLS or PST because the cost of recreating the parsing code is so much lower.

Here’s the central asymmetry that leads me to conclude that open data is more important than open source: if you have data without code then you could write a program to extract information from the data, but if you have code without data then you have lost that information forever.

Consider also the rise of software as a service. It really doesn't matter whether the code it is built on is open source or not if you cannot access the data it manages for you. Even if you reproduce the service completely, using the same components, your data is buried away, out of your reach. However, if you have access to the data then you can achieve continuity even if you don't have access to the underlying source of the application. I'll say it again: open data is more important than open source.

Of course we want open standards, open source and open data. But in one or two hundred years which will still be relevant? Patents and copyrights on formats expire, hardware platforms and even their paradigms shift and change. Data persists, open data endures.

The problem we have today is that the open data movement is in its infancy when compared to open source. We have so far to go, and there are many obstacles. One of the first steps to maturity is to give people the means to express how open their data is, how reusable it is. The Open Data Commons is an organisation explicitly set up to tackle the problem of open data licensing. If you are publishing data in any way you ought to check out their licences and see if any meet with your goals. If you licence your data openly then it will be copied and reused and will have an even greater chance of persisting over the long term." (http://iandavis.com/blog/2009/03/open-data-open-source)
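Davis's point about the cost of recreating parsing code is easy to make concrete: for a simple open format such as CSV, a replacement parser is a few lines of work, whereas lost data cannot be rewritten. A sketch, with hypothetical archival records:

```python
# Re-creating a parser for an open format is cheap; re-creating lost data
# is not. A deliberately minimal parser for simple, unquoted CSV:
def parse_simple_csv(text):
    lines = [ln for ln in text.strip().splitlines() if ln]
    header = lines[0].split(",")
    return [dict(zip(header, ln.split(","))) for ln in lines[1:]]

# Hypothetical records archived decades ago; any generation of code can
# read them back because the format is simple and documented.
archived = "id,title,year\n1,On Open Data,1998\n2,Data Longevity,2009"
records = parse_simple_csv(archived)
print(records[1]["title"])  # Data Longevity
```

A production system would use a full CSV library (quoting, escaping), but the asymmetry stands: this parser took minutes to write, while the records themselves are irreplaceable.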


=Resources=

==Open Data Policies==

RECOMMENDATIONS from the U.S. Public Policy Committee of the ACM (USACM):


*Data published by the government should be in formats and approaches that promote analysis and reuse of that data.
*Data republished by the government that has been received or stored in a machine-readable format (such as online regulatory filings) should preserve the machine-readability of that data.
*Information should be posted so as to also be accessible to citizens with limitations and disabilities.
*Citizens should be able to download complete datasets of regulatory, legislative or other information, or appropriately chosen subsets of that information, when it is published by government.
*Citizens should be able to directly access government-published datasets using standard methods such as queries via an API (Application Programming Interface).
*Government bodies publishing data online should always seek to publish using data formats that do not include executable content.
*Published content should be digitally signed or include attestation of publication/creation date, authenticity, and integrity.

(http://www.acm.org/public-policy/open-government)
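The last USACM recommendation — signed or attested content — can be illustrated with a simple hash-based integrity check. This is a sketch of one possible mechanism (a published SHA-256 digest), not the specific scheme USACM prescribes, and the dataset bytes are hypothetical:

```python
import hashlib

# Publish a cryptographic digest alongside the dataset, so anyone can
# verify that the bytes they downloaded are the bytes that were published.
published_bytes = b"station,pm25\nA1,12.4\nB2,8.9\n"  # hypothetical dataset
attestation = hashlib.sha256(published_bytes).hexdigest()

def verify(downloaded: bytes, expected_digest: str) -> bool:
    """Return True if the downloaded bytes match the published attestation."""
    return hashlib.sha256(downloaded).hexdigest() == expected_digest

print(verify(published_bytes, attestation))                # True
print(verify(published_bytes + b"tampered", attestation))  # False
```

A digest alone attests integrity; pairing it with a digital signature (e.g. the agency signing the digest) would additionally attest authenticity and publication source, which is what the recommendation asks for.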


==Open Data Organizations==

==Open Data Companies==

"Open data is to media what open source is to technology. Open data is an approach to content creation that explicitly recognizes the value of implicit user data. The internet is the first medium to give a voice to the attention that people pay to it. Successful open data companies listen for and amplify the rich data that their audiences produce." (http://www.attentiontrust.org/node/430)

*Adaptive Blue - Extended browsing
*Aggregate Knowledge - Outsourced recommendations
*Atten.TV - Attention media
*Buzzlogic - Tracking influence
*ClearForest - Text analytics
*Daylife - Hi-touch algorithmic news
*Feedburner - RSS content management
*Lijit Networks - Ranking people
*Majestic Research - Online behavior for investors
*Meetup - America offline
*MyBlogLog - Reader communities
*Omnidrive - Open data storage
*Right Media - Transparent ad network
*Stumbleupon - The "forward" button

(http://www.attentiontrust.org/node/428)

==Open Data Repositories==

Open data sets available on the Web.

Examples include Wikipedia, Geonames, MusicBrainz.

See also: PubChem


==Open Data Domains==

The concept of Open Data is used in different contexts, mostly as either the availability of scientific raw data or as open access to publicly funded, 'government' information.

(There is of course an obvious overlap when the scientific data are produced by public funding or government institutions.)


===Open Access to Government Information===

See [[Open Access to Government Information]], as well as [[Open Government Data]] and [[Open Public Data]]


===Open Data in Science===

See: [[Open Data in Science]]



==Status Report 2007==

Peter Suber:

"With or without mandates, more governments committed themselves to OA for publicly funded data. Norway adopted an OA mandate for public geodata. Canada, Ireland, and Australia began providing OA to publicly funded digital mapping data, without a mandate. After long resistance, the UK Ordnance Survey began to do the same, at least experimentally. (Earlier in the year, a legal analysis by Charlotte Waelde, an expert on intellectual property at the University of Edinburgh, concluded that the data are not protected by copyright but at most, only by the database right; a JISC report recommended a general UK policy of OA for research data; and the new UK Prime Minister Gordon Brown endorsed the principle of public access to public data.) The Committee of Ministers of the Council of Europe recommended "wide public access to research results to which no copyright restrictions apply" (i.e. data). Publishing consultant Eve Gray reported that the South African government was moving toward a policy of OA for publicly funded research data. The Australian government proposed an Australian National Data Service to promote OA and re-use of publicly funded research data. The Organisation for Economic Co-operation and Development (OECD) issued principles and guidelines to implement its 2004 Declaration on Access to Research Data from Public Funding. California is about to adopt the strongest and broadest OA mandate for greenhouse gas data in the US, and Pennsylvania is about to join the other 49 states in mandating OA for state statutes. And the UN Convention on Long-range Transboundary Air Pollution (LRTAP) adopted an OA mandate for most kinds of data covered by the convention.

The US Government Accountability Office called on four major federal funding agencies (DOE, NASA, NOAA, and NSF) to enforce their existing policies on data sharing. Twenty-two US federal government agencies formed an Interagency Working Group on Digital Data (IWGDD), plan to deposit the data generated by their research grantees in a network of OA repositories, and are considering an OA mandate. The US National Archives joined the OA web portal Geospatial One Stop. The NSF Office of Cyberinfrastructure launched a data interoperability project (INTEROP). Google created a Public Sector Initiative to improve its crawling of OA databases hosted by federal, state, and local government agencies in the US. A group of open government activists convened by O'Reilly Media and Public.Resource.Org drafted principles for open government data. For the first time the US made progress toward OA for its three most notorious non-OA government resources: PACER (Public Access to Court Electronic Records), the database of federal court docket information; NTIS (National Technical Information Service), the online databases of research and business data; and CRS Reports, the highly regarded reports from the Congressional Research Service. The first two began offering OA to selected portions of their content, previously TA, and the third is the subject of a new bill in the Senate to mandate OA.

Nature editorialized in favor of e-notebook science and data sharing, and Nature Biotech recommended "that raw data from proteomics and molecular-interaction experiments be deposited in a public [OA] database before manuscript submission." Maxine Clarke, Publishing Executive Editor at Nature, said that the journal would consider requiring and not merely recommending OA for multimedia data if there were a suitable OA repository supporting annotation and long-term preservation. Wiley threatened legal action when Shelley Batts, a graduate student at the University of Michigan, posted a chart from a Wiley article from the Journal of the Science of Food and Agriculture on her blog; when she replaced it with her own chart of the same data and blogged Wiley's threat, the blogosphere exploded and Wiley said it was all a misunderstanding.

Data-sharing policies were adopted by the UK Medical Research Council, the Ethics Committee of France's Centre National de la Recherche Scientifique (CNRS), the Audiovisual Communications Laboratory at Switzerland's Ecole Polytechnique Fédérale de Lausanne, and the International Telecommunications Union. The NIH launched a new data-sharing program for its neuroscience research. There are too many new OA databases to name separately, but since I've mentioned the NIH, I should add that it launched the Database of Genotype and Phenotype (dbGaP) and SHARe (SNP Health Association Resource). It described SHARe as "one of the most extensive collections of genetic and clinical data ever made freely available to researchers worldwide."

Google began helping researchers exchange datasets up to 120 terabytes in size, too large for ordinary online uploads and downloads. At no charge to the researchers, it will ship a brick-sized box of hard drives from one research team to another, provided that the data have no copyright or licensing restrictions and the bricks stop first at Google headquarters for copying and offline storage. In time, Google hopes to make the datasets OA. The company also began sharing files of its own data with researchers on the condition that they make the results of their research OA.

The year 2007 saw a wave of general OA data repositories spring up, many with built-in features for graphics and analysis: for example, Dabble, Data360, Freebase, Many Eyes, Open Economics, StatCrunch, Swivel, and WikiProteins. At the same time, several projects worked to facilitate the deposit of data in OA repositories, such as EDINA's DataShare and JISC's SPECTRa (Submission, Preservation and Exposure of Chemistry Teaching and Research Data), or to enhance the interface between data repositories and literature repositories, such as JISC's StORe (Source-to-Output Repositories).

By my informal estimate, the fields with the largest advances in OA data during 2007 were archaeology, astronomy, chemistry, the environment (including climate change), geography (including mapping), and medicine (especially, genomics and clinical drug trials)." (http://quod.lib.umich.edu/cgi/t/text/text-idx?c=jep;view=text;rgn=main;idno=3336451.0011.110)


=Discussion=

==Open data and the Commons==

An old story? By Simon Chignard:

"There is a direct link between the open data movement and the philosophy of common goods. Open data are an illustration of the notion of common informational goods proposed by Elinor Ostrom, winner of the 2009 Nobel Prize for economics. Open data belong to everyone and, unlike water and air (and other common goods), they are non-exclusive: their use by one does not prevent others. If I reuse an open data set, this does not prevent other reusers from doing so. This proximity between the commons and open data is also suggested by the presence of the initiator of Creative Commons licences, Lawrence Lessig, at the 2007 Sebastopol meeting in which the concept of open data itself was defined.

But despite the strong conceptual and historical linkages, it seems that we, as actors of open data, are often shy to reaffirm the relationship. In our efforts to encourage public and private bodies to embrace open data, we seem almost embarrassed of this cornerstone philosophy. The four proposals I make here aim at one thing: not letting it drop!" (http://blog.okfn.org/2013/01/10/4-ideas-for-defending-the-open-data-commons/)

=More Information=

#[[Open Science]]
#[[Open Notebook Science]]