Big Data Commons

From P2P Foundation

"Big Data Commons: Imagining and constructing the big data commons in order to renegotiate the role and value of data in our post-digital societies and in order to create something like Commons Data."

For details see


Berliner Gazette:

"The industrial or post-industrial age is increasingly being referred to as the “anthropocene,” an era in which humans are “one of the most important factors that influence the biological, geological and atmospheric processes of the earth” (Wikipedia). Must our understanding of the commons shift as a result? Are our ideas and responsibilities actually evolving together with this shift? We have tended to associate the idea of the commons with the collective and community-based protection and cultivation of “natural” phenomena, from the lands that peasants tended and harvested together in the medieval period to today’s rivers and lakes that social movements seek to protect as commons. But can we expand the idea of the commons to those ephemeral “second nature” products of the anthropocene, like data itself? Such questions have been posed for many years by free/libre and open source campaigners and others concerned about intellectual property regimes. Drawing on these and other traditions of thought and activism, we want to ask: can big data be understood and (re)claimed as a commons? By whom and under what circumstances? And with what consequences?

* From top-down to bottom-up big data

Wikipedia reports that “big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy. The term often refers simply to the use of predictive analytics or other certain advanced methods to extract value from data, and seldom to a particular size of data set.” In addition to standard definitions of the technology, politics and history of big data, there has developed, over the last few years, a critical discourse that looks at the concept from the perspective of ideology, hegemony, and social and economic power. In this discourse, big data is often viewed, amongst other things, as the “oil” of the 21st century, as a post-democratic means of surveillance and control or as a theory of everything. We take our cue from this critical discourse and aim to forge a new dimension, building upon the notion that big data are a “phenomenon of co-production” (Jeanette Hofmann) and therefore demand other modes of more democratic governance.

Berliner Gazette:

"This issue not only concerns data that are generated by institutions and are made publicly accessible by them (as an “act of generosity”), but it also looks at the following potential development:

– focus on all types of data that are generated (including personal data and data produced by individuals) and

– assign them the status of a democratically administered and regulated good, access to which can be negotiated in a democratic manner.

One general issue here, and in particular with respect to the above, is whether individuals should have their “right to be forgotten” extended to also cover big data and search engine data, i.e. whether it is necessary to store a certain type of data in the first place. Not storing such data would make it unnecessary later to administer these data (regardless of who would administer them).

It needs to be clear what part of the big data pool is “private”. This private part should, and needs to, remain private property that belongs to an individual’s personal realm, and is fenced-in by data protection rules. Examples of “private data” are: the content of e-mails and other personal communications; data on which web pages an individual has accessed; records of the data that have been transmitted to websites; and personal data on social networks that are only meant to be shared with friends.

Unlike private data, “public data” have been published on purpose. Examples of public data are: an individual’s website, public profiles on social networks (published by artists or journalists for instance), or blogs. An intermediate position is occupied by individuals who appear in public, but use an alias in doing so in order to escape persecution for their political beliefs, for instance."

From this, it follows that the remaining data (i.e. non-private data) may be used and classified as “commons data”. Before elaborating on this, please note the following:

The line between public and private

We think it is best to say that the question of the line between public and private must be debated and determined by the same authority-yet-to-come that will govern and manage the data commons, rather than suggested before the fact by “us.” Just a brief example: what about consumer data? What we buy online can reveal a great deal. And what about the “private” data of public figures whose “private” communications have an impact on the public good? We should discuss whether the criterion here is to be a suggestion or a rule.

The legal dimension

Data protection has historically been understood to be an individual right, i.e. the law defines it as a right that an individual is entitled to. However, in the wake of the NSA surveillance scandal a broad debate has broken out over the question whether data protection or the right to informational self-determination are, in fact, not part of a government’s obligation to protect its citizens, i.e. whether data protection is a collective right that the state must actively protect. The concept of socializing big data or of looking at big data from a collective perspective takes its cue from this debate.

Ownership of data

The question of ownership with regard to data remains completely open. Data protection was established as a right of protection, not as a proprietary right. From a legal perspective we are in uncharted waters here. German law only provides for one area where a public good is produced according to predefined values more or less on a trust basis, this being the Interstate Broadcasting Treaty (the “Rundfunkstaatsvertrag” in Germany). On the basis of this treaty, the parliaments of the German Länder (federal states) commission various bodies to determine how to deal with the public good that is “broadcasting”. This treaty might function as a model for all kinds of legal and practical aspects.

The value of data

We must finally clarify what value data actually have. The reason for this is that the demand for a commons inherently implies a shift from private property to common governance and responsibility. This leads us to try to solve the following issues: 1) what belongs, or should belong, to whom, 2) what is actually worth how much to people, and 3) what something (i.e. in this case “data”) is actually worth, and how we should assess its value. Today, there is a rapidly growing industry dedicated to measuring, monetizing and commodifying data for use by corporations. Here its value is determined solely by markets. We suggest that the assessment of the value of data should be dependent on a dynamic, continuous reassessment of “our” own values as they correspond to social realities. In other words, instead of (or in addition to) the economic valuation of data, we must also assess its value in terms of how it can serve principles like democracy, health and wellness, freedom, environmental sustainability and peace.

Assessments by critical thinkers, advertising professionals or terror prevention analysts have one thing in common: the big data discourse is primarily concerned with data that are personal or have been produced by individuals. After all, we, the users, produce more than 75% of the data that make up our digital universe. Asking “what value do data actually have?”, “who owns the data, and who has the power to process and cross-reference them?” and “how can we transform big data into commons data?” means asking questions precisely about that type of personal data.

In order to lay a basis for a bottom-up approach to big data, we propose to introduce a strategic distinction between personalized and anonymous data. The question is: Which data are necessarily personalized? Which data can be anonymized?

We do not consider anonymized data a solution to the overall problem. Rather, we see the creation of an anonymized big data pool as the basis for constructing an area of the big data commons that can be opened to certain practices of data analytics by third parties. After all, a big data commons can only be considered a dynamic and flourishing commons if various forms of commons-based peer production are enabled, for instance data commons-based peer production in the field of research and science, as proposed by Jane Bambauer in her paper “Tragedy of the Data Commons” (2011).

Against this backdrop, one of the main objectives should be to develop a standard, used by default, which stipulates that data are to be collected anonymously.

Personalized data

The problem with personalized data is that their mere existence can pose a risk to individuals. For instance, many people in the former German Democratic Republic in the 1960s and 1970s unexpectedly did not receive a university place or a promotion they had been counting on. Once the archives of the state security service were opened in the 1990s, they discovered why: information gathered on them by the security service had found its way to other branches of the state, such as universities or state-owned companies. Often this information was plainly false or falsely construed, but because those affected were unaware of the secretly collected data, they could not lodge any objection.

To take another example: In countries that had a central register of Jews, the Nazis were able to locate and deport most of the Jews once they had occupied the country in question. In countries that had no such records, the Nazis were only able to locate very few Jews. It is clear that no matter who owns the data or who has control over personalized data (personal profiles), these data can pose a significant risk to individuals.


Today, there are concerns that companies specializing in tracking personalized data can create highly tailored and specific lists, based on everything from Amazon purchases to GPS tracking, that can identify, for instance, gay, lesbian, bisexual people or people practising otherwise marginalized sexualities. While it is bad enough that these data are sold to marketers, they could also feasibly be used by governments or reactionary groups to target individuals.

Anonymized data

Anonymized data are a solution to this problem. The demand for data is growing, and it can probably not be contained by moral, social, political or educational measures alone. But the demand can easily be satisfied with anonymized data. Data collected in this way deliver the same insights (e.g. someone who buys product A will also be interested in product B) without being attributable to a specific person. It is important in this case, however, to collect and store each data record separately, because an individual about whom everything except his or her actual identity is known is not really anonymous.
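The point about separate storage can be made concrete with a minimal sketch (all names, fields and data are invented for illustration): merely replacing a name with a stable pseudonym keeps the full profile linkable, whereas storing each record on its own preserves the aggregate signal but not the individual profile.

```python
import uuid

# Toy purchase events, with a direct identifier still attached.
events = [
    {"user": "alice", "item": "product A"},
    {"user": "alice", "item": "product B"},
    {"user": "bob",   "item": "product A"},
]

def linked_pseudonyms(events):
    """One stable pseudonym per user: the name is gone, but all of a
    person's records remain linkable, so the profile survives and the
    user is only nominally anonymous."""
    pseudo = {}
    out = []
    for e in events:
        pid = pseudo.setdefault(e["user"], uuid.uuid4().hex)
        out.append({"pid": pid, "item": e["item"]})
    return out

def separate_records(events):
    """Each record stored separately, with no identifier at all: we can
    still learn that products A and B are bought, but no longer that one
    person bought both."""
    return [{"item": e["item"]} for e in events]

linked = linked_pseudonyms(events)
separate = separate_records(events)

# The linked store still reveals a two-item profile for one person;
# the separate store contains no cross-record link at all.
assert len({r["pid"] for r in linked}) == 2
assert all(set(r) == {"item"} for r in separate)
```

The sketch illustrates why the text insists on per-record storage: pseudonymization alone changes the label on a profile, not the profile itself.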

* Commons data

Berliner Gazette:

"After carrying out a subdivision of big data, the data that are to become “commons data” must then to a large extent be anonymized. Anonymization means that (a) the data are no longer private or personal and therefore constitute socialized data, and (b) the data are to a certain degree protected by the anonymization process, i.e., third parties should be prevented by regulation from misusing them; commercial and government bodies should be, for instance, prevented from using them for profiling or surveillance purposes.

The challenge of creating regulations for data protection is great: computing power today is such that anonymization is easily circumvented by cross-referencing many different datasets. Anonymization is held out as a solution, but a clever algorithm can identify an individual fairly precisely from anonymous data. Here is a simplified example: one could feasibly triangulate an individual from (a) anonymized Uber user data (frequent trips to one location, presumably their home); (b) anonymized Amazon data (shipments to the same postal code); and (c) a variety of invisible data marks such as computer type, browser, etc.

According to the current state of the art, in addition to the anonymization of “commons data” it is also possible to anonymize and/or encrypt personal data. The anonymous communication network Tor, for instance, enables users to visit a website without generating data that would permit a third party to establish a relationship between the individual and his or her access to the website. Likewise, the content of any personal digital communication may be encrypted so that only the sender and the recipient can read it. Some social networks additionally use technology to ensure that the personal data an individual has published are shared only with the individual’s personal contacts."
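The triangulation risk described in the quoted passage can be sketched in a few lines (all datasets, field names and thresholds here are invented toy values, not a real attack implementation): two independently “anonymized” datasets are joined on shared quasi-identifiers, re-identifying a household that neither dataset names.

```python
# Toy "anonymized" ride data: no name, but a recurring drop-off location.
rides = [
    {"rider_id": "r1", "dropoff_zip": "10115", "trips_to_zip": 48},
    {"rider_id": "r2", "dropoff_zip": "10247", "trips_to_zip": 3},
]

# Toy "anonymized" shop data: no name, but a shipping postal code.
orders = [
    {"buyer_id": "b9", "ship_zip": "10115", "browser": "Firefox/102"},
    {"buyer_id": "b7", "ship_zip": "10969", "browser": "Chrome/120"},
]

def link_on_quasi_identifiers(rides, orders, min_trips=20):
    """Match riders whose frequent drop-off (presumably home) shares a
    postal code with a shipping address. Real linkage attacks combine
    many more signals (browser fingerprint, timestamps, etc.), which is
    exactly why single-field anonymization is so fragile."""
    matches = []
    for r in rides:
        if r["trips_to_zip"] < min_trips:
            continue  # too infrequent to infer a home address
        for o in orders:
            if o["ship_zip"] == r["dropoff_zip"]:
                matches.append((r["rider_id"], o["buyer_id"]))
    return matches

# Rider r1 and buyer b9 collapse into one identifiable household.
print(link_on_quasi_identifiers(rides, orders))  # → [('r1', 'b9')]
```

Each dataset looks harmless in isolation; the identification only emerges from the join, which is the regulatory difficulty the text points to.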

* Standards for data collection

The provision and collection of data in an anonymized way can generally be achieved by introducing technical standards. One example is a setting in the internet browser that enables the user to decide which personal data he or she wants to disclose and which data are to be transmitted anonymously. Once such a standard has been established, it could be transformed into political and social demands. The “political demand” would advocate the passing of statutory requirements for data collection that implement these standards. The “social demand” would come in the form of pressure on commercial companies to implement them (under community pressure, for example, WhatsApp, an instant messaging app for smartphones, now implements encryption in its software).
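What an “anonymous by default” collection standard might look like on the collecting side can be sketched as follows (a hypothetical illustration: the field names and the `collect` function are invented here, not part of any existing standard): identifying fields are dropped before storage unless the user’s browser-side setting has explicitly disclosed them.

```python
# Fields treated as identifying under this (hypothetical) standard.
IDENTIFYING_FIELDS = {"name", "email", "ip_address", "device_id"}

def collect(record, disclosed=frozenset()):
    """Keep non-identifying fields always; keep identifying fields only
    if the user's browser-side setting explicitly discloses them.
    Anything not disclosed never reaches storage."""
    return {
        k: v for k, v in record.items()
        if k not in IDENTIFYING_FIELDS or k in disclosed
    }

hit = {"page": "/commons", "browser": "Firefox",
       "ip_address": "203.0.113.7", "email": "user@example.org"}

# Default: anonymous collection (page and browser only).
print(collect(hit))
# User has opted to share their e-mail address, nothing else.
print(collect(hit, disclosed={"email"}))
```

The design choice matters: because the filter runs at collection time, undisclosed data are never stored, so there is nothing to administer or delete later, which matches the “do not store it in the first place” argument made earlier in the text.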


Establishing such standards would have consequences at several levels:

– At the individual level: users no longer readily surrender data.

– At the regulatory level: the monopolization of data can be challenged. (This would be a completely new development. Normally, the issue of monopolization is debated from a totally different perspective).

– At the social level: the socialization of data becomes a possibility, which has cognitive, political and economic implications, i.e. people are able to participate

(a) in an existential part of the environment,

(b) in political processes (decision-making on rules, distribution, etc.), and

(c) in economic processes (in which “my” data become a potential economic resource which I am able to exploit myself, or I can have it exploited by third parties).

* Conclusion

By constructing commons data and by democratically negotiating and assessing the value of data, we can initiate a shift from a top-down to a bottom-up narrative on big data. Eventually, this will lay the theoretical and political basis for the big data commons.