Open data and the GDPR

There is an obvious tension between openly publishing data from human subjects research and the European data protection legislation, the GDPR. Transparency guidelines from funders and journals often say you should publicly post your data, while the GDPR says you are not allowed to. Or at least, that is what your university’s legal team tells you, or appears to tell you in legalese. However, it’s not necessarily true that the GDPR and open science don’t mix - but it is true that the relationship is somewhat complicated, particularly in the (experimental) behavioral sciences.

Although the GDPR has been in effect for almost three years now, there is still a lot of unclarity about what to do and what not to do with psychological data. As former department head of my faculty’s research data service, I’ve spent quite a bit of time with ethicists, lawyers, IT people, and researchers trying to figure out what is going on. In this quick primer I’d like to share a couple of tips and insights that may be helpful if you’re balancing the GDPR and open data. This post is necessarily going to be very short and quite superficial: the actual legal theory, as well as the ethical and technological aspects of the GDPR and the wider topic of how science interfaces with society and technology, is (quite literally!) a field of study in itself. Here, I will focus purely on a few practical aspects and clarifications.

Some legalese and inconvenient truths

The GDPR introduces a couple of terms which are important to understand, because your legal department is going to throw them at you if you ask them for help. Most importantly, the GDPR talks about the processing of personal data. Data processing in this sense means everything you do with data: data collection, data storage, data transport, data analysis, data sharing, data publication - even destroying data counts as data processing. As soon as a participant signs an informed consent sheet, you are processing personal data.

Within the GDPR, there are different ‘roles’ or entities involved with this data processing. Most important are:

  • the data subject. In behavioral research, that is the participant
  • the data processors. This is everyone who does something with the data: you, your students, a translation service if you need to translate transcripts, even the organization where you store your data, such as the OSF
  • the data controller. This is the legal entity responsible for compliance with the GDPR (i.e., the organization that will pay the fines if you mess up). In practice, this will be the institution or university you work for.

Another important role defined by the GDPR is that of the data protection officer (DPO), a legal professional in an organization who interprets the GDPR for that organization. The DPO oversees compliance with the GDPR, and will (for example) be the person running an investigation into a data breach or other data-related incidents. Organizations that process personal data at any serious scale - universities included - are legally obliged to have a DPO.

Now, an inconvenient truth here is that legally speaking, any data you collect is not yours. It’s the university’s. The university, as data controller, is responsible for correct handling and processing of the data, and as such, it is the university which decides where you put your data, and whether you can publicly share it or not.

Of course, that’s not how it works in practice. If you ask the board of your university for permission to share your dataset on the OSF, they will probably give you a blank stare. It is very likely that no one in your institution actually knows who is authorized to make specific decisions with regard to data, and that is a major pain. Get in touch with your data protection officer to ask how these things are arranged in your organization. The very last thing you want is to be held personally accountable for a data breach because you published a dataset on the OSF that should not have been openly published! People have been fired for less serious data offenses in the private sector, so make sure you know who is responsible for what, and at the very least make sure your local data protection authority does not come knocking on your door if things go sour. If that means you need someone’s permission to publish a dataset on the OSF, so be it. You really don’t want to be personally accountable for a data incident - that is your organization’s responsibility.

Another inconvenient truth is that, whether you like it or not, and whether you agree with the GDPR or not, the law and the rights of citizens take priority over your work as a scientist - regardless of what funders, publishers, or reviewers demand of you. This means you will have to work with the GDPR, and if you want to change something, violating participants’ rights by publicly posting data anyway as an act of civil disobedience won’t help. Do get in touch with your MPs, let your professional societies know about the problems you run into - anything but civil disobedience, which will only put you at risk.

For who(m)?

So, now that we know some basic legalese, let’s look at when the GDPR actually applies. The GDPR has a territoriality principle: the regulation applies to citizens of the EU. For any project that has a data processor, data controller, or data subject in the EU, the GDPR applies. This means that if you are a researcher working in the EU, or if you are collecting data from EU citizens (even if you are located elsewhere yourself), the GDPR applies. US researchers collecting data from US citizens? The GDPR does not apply. US researchers collecting data from EU citizens? Yes, you will need to adhere to the GDPR. EU researchers collecting data from EU citizens? Yes, the GDPR applies. EU researchers collecting data from US citizens? Yes, the GDPR applies in this case too, even though you’re not collecting data from EU participants!

It gets even more complicated than that. US researchers collecting data from US participants, but collaborating with an EU researcher? The GDPR applies! US researchers collecting data from US participants, but using an EU transcription service for interviews? The GDPR applies!
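If it helps to see the rule of thumb spelled out, here is a minimal sketch in Python. The function name and structure are mine, purely to illustrate the “any role in the EU” logic described above; it is a deliberate simplification, not legal advice.

```python
# Minimal sketch of the rule of thumb above: the GDPR applies as soon as
# any of the three roles involved in the processing is located in the EU.
# (Illustration only - not legal advice.)

def gdpr_applies(subject_in_eu: bool, processor_in_eu: bool, controller_in_eu: bool) -> bool:
    """Return True if any party involved in the data processing is in the EU."""
    return subject_in_eu or processor_in_eu or controller_in_eu

# US researcher, US participants, but an EU transcription service (a processor):
print(gdpr_applies(subject_in_eu=False, processor_in_eu=True, controller_in_eu=False))  # True
```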

Now, there is another catch: the GDPR also applies when you re-use data. If you, as an EU researcher, download from the OSF a dataset containing personal data collected from US citizens, you will have to apply the (stricter) rules of the GDPR.

Of course, all of this works somewhat differently in practice, but it’s important to keep in the back of your mind that the GDPR does apply in many cases. Don’t be surprised if a collaborator or another party you work with brings it up.

The spirit of the GDPR

Now, the General Data Protection Regulation is actually poorly named, because the GDPR is not really about data protection. The goal of the GDPR is not to stop parties from processing personal data, or to prevent personal data from being openly shared. The actual goal of the GDPR is the “emancipation of the data subject” - the natural person whose data is being processed. If you are reading this post on micro.blog, you are a data subject of micro.blog, which is collecting personal data about you via cookies, for example. The main aim of the GDPR is to allow you, as a data subject, to control, or at least be informed about, what personal data is being collected about you and how that data is being used. Sadly, this is a bare necessity, given that we share our personal data on a global scale and that it has become a lot easier to collect and combine personal data using automated scraping and ‘smart’ algorithms. You are probably thinking of microtargeted advertising (e.g. Google Ads), but it starts to get really creepy when governments do this. And they do: a recent scandal in the Netherlands brought to light that our tax authorities had been using profiling on a massive scale to determine whether people had committed fraud with child care benefits. First, this led to an enormous number of false accusations; second, it turned out that there was a strong racial bias in this profiling.

The GDPR was conceived to counter such developments. It ensures that no party can simply collect personal data as they please, and it gives citizens the right to be informed about how their data is being used, as well as the right to demand that records about them are corrected if they are wrong, or even removed from a database entirely. Moreover, the GDPR prohibits parties from collecting special categories of data, such as ethnicity, political views, health data, or sexual orientation - basically, the kind of information we typically consider to be private. Finally, the GDPR introduces the concept of purpose limitation: a data processor is only allowed to use data for the purpose for which it was collected.

In other words: the GDPR is designed to give data subjects more control over how their data is used. Of course, that won’t help much if you, as a data subject, have to enforce these rights yourself all the time and prove that a data processor or controller has violated them. Therefore, the GDPR places the burden of proof for demonstrating compliance on the data controller. A data controller is required to keep a register of all personal data processing, and can be subjected to audits by data protection authorities.

It’s important to note that the ‘new’ part of the GDPR is not so much the restrictions on processing personal data - most precursor laws already had such restrictions - but this shift in the burden of proof for demonstrating compliance.

Exemptions for scientific research

This might all sound well and good from the perspective of ordinary citizens, but for scientists it is a pain. Many people working in the social sciences and humanities rely on personal data for their research, and more often than not they need special categories of data. The GDPR therefore has several exemptions for scientific research.

Most importantly, for scientific purposes you are allowed to:

  • collect special categories of data, such as ethnicity, sexual orientation, political views, etc.
  • refuse data subjects’ requests to have their data corrected or removed from a database
  • archive data after processing for as long as is ‘normal’ in your discipline

Of course, it’s a bit ambiguous what ‘scientific research’ is. The WP29 (the Article 29 Working Party, an advisory body drawn from the national data protection authorities of the EU member states) has defined scientific research as an enterprise aimed at increasing knowledge, guided by specific ‘best practices’ and ethical standards. The latter part in particular is important, because it can effectively be interpreted as “it’s not science if an ethics committee has not approved it”. Now, this is not a universally accepted interpretation of the GDPR, but it is one that works reasonably well.

As long as you process data for scientific research, and your protocol has been approved by an ethics committee, you are not limited by many of the prohibitions in the GDPR (although things like purpose limitation still apply!). So, that’s all good, right? Yes - until you want to publicly share your data.

The open data catch

The main problem with publicly sharing data is that your exemptions for scientific research are voided. Since it is impossible to guarantee your data subjects that their data will only be used for scientific purposes once you have published it openly, you will need to offer them all the rights granted by the GDPR, such as the right to be forgotten or the right to have data amended. With all due respect to citizens’ rights, that is not something you want as a researcher. The problem is, though, that if you do not do so, or if you publicly post personal data without explicit permission from your research participants, you are now responsible for a data breach. And that is not a good thing.

Moreover, all of this assumes you have written an informed consent form that covers data re-use in a rather broad way. If you did not, well - then you have also violated the principle of purpose limitation. That’s also a data breach…

Now, there is some debate on whether data licensing (i.e. adding a statement like “This data can only be used for scientific purposes” or “This data can only be used to verify the original research results”) is sufficient to ensure that the requirements of scientific research and/or purpose limitation are met. Some legal scholars say that secondary data processors (i.e. people who download your dataset) carry part of that responsibility; others are of the opinion that the full burden of ensuring that the criteria for scientific research and purpose limitation are met rests with the original data controller. In case of doubt, check with your data protection officer!

Anonymization versus pseudonymization

“Hah”, I hear you say, “but I anonymized my data, so the GDPR does not apply!” Well, it is quite likely you are wrong. The GDPR was not written with psychologists in mind, and the official definition of anonymization pretty much means that you have to mutilate your dataset beyond recognition for it to be regarded as actually anonymous. There are of course exceptions and situations in which anonymization is perfectly possible, but in general it is quite difficult to properly anonymize a dataset and still keep it useful for other researchers.

The reason is that the GDPR has a very broad definition of personal data. Whereas HIPAA in the US works with a fixed list of 18 identifiers, such as names, addresses, and account numbers, the GDPR states that any piece of data that can in principle be traced back to an individual is personal data. In a typical psychology dataset, individual rows represent individual participants, which means that, yes, this is personal data under the GDPR - even if you did not include any directly identifying information.

Now, you actually can share or publish this data (even if there is identifying information in it), but not openly. As long as you or your institution can retain a certain amount of control over how data is used, e.g. by restricting access to the data to other researchers, and this is covered in the consent form, there should be no legal problems.

Identifiability and utility

There are some quick heuristics to determine whether your data is sufficiently pseudonymized/anonymized to publish openly. Please note that these are quick heuristics that I use and find useful - check with your ethics committee and data protection officer to see if they also work in your case. Basically, I check two things: identifiability and utility.

Identifiability refers to how identifiable individual records are. A computational approach could be to compute k-anonymity. If you collect only two variables, e.g. gender (1 = female, 2 = male, 3 = other) and a response to the question “Do you like ice cream?” (1 = yes, 2 = no), in a sample of 1000 respondents, there will inevitably be many identical records. Identifying an individual is going to be next to impossible. However, if you also record the exact time of each response, and collect your data sequentially, there will be many unique records, and data can be related to individual subjects. Obviously, the more variables you collect, and the larger the range of those variables, the more unique cases you will have in your dataset. If your dataset contains 15 or more variables, it is pretty much a given that there will be single, identifiable records in it. As a very simple rule of thumb: if a participant would be able to trace back their own record, your dataset is definitely identifiable.
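To make the k-anonymity idea concrete, here is a small sketch in Python/pandas (the column names and data are invented for illustration). It computes k as the size of the smallest group of records that share the same combination of quasi-identifiers; k = 1 means at least one record is unique and therefore potentially identifiable.

```python
import pandas as pd

# Toy dataset: gender (1 = female, 2 = male, 3 = other), ice cream preference (1 = yes, 2 = no)
df = pd.DataFrame({
    "gender":          [1, 2, 2, 3, 1, 2],
    "likes_ice_cream": [1, 1, 2, 2, 1, 1],
})

# k-anonymity: size of the smallest group sharing the same quasi-identifier values
quasi_identifiers = ["gender", "likes_ice_cream"]
k = df.groupby(quasi_identifiers).size().min()
print(f"k = {k}")  # k = 1 here: at least one record is unique
```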

Utility, in turn, refers to how much new information you gain from the dataset if you can re-identify someone. In the example above: suppose we ask not 1000 respondents but only 100, and two people indicate they identify as non-binary. I happen to know that my non-binary friend participated in the experiment, and that they do not like ice cream. Lo and behold, on row 29 of the dataset there is a ‘3’ in the gender column and a ‘2’ in the ‘likes ice cream’ column. I have now re-identified my friend, but I have learned exactly nothing new about them. The same principle holds for more complex cases, where you need to combine many variables to build a profile of an individual: if you learn nothing new once you have re-identified someone, there is still no reason to panic. The situation is different, though, when the ice cream survey is coupled with an extensive questionnaire on, let’s say, political views or - even spicier - sexual behavior. In that case, there clearly is a data breach.
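A toy version of the utility check, again with invented data and column names: matching on the quasi-identifiers may single out one row, but whether that constitutes a meaningful disclosure depends on what the remaining columns reveal.

```python
import pandas as pd

df = pd.DataFrame({
    "gender":          [1, 2, 3, 2, 1],      # 3 = other
    "likes_ice_cream": [1, 1, 2, 2, 1],      # 2 = no
    "sensitive_score": [12, 30, 7, 22, 18],  # imagine a sensitive questionnaire score
})

# What I already know about my friend: non-binary, dislikes ice cream
match = df[(df["gender"] == 3) & (df["likes_ice_cream"] == 2)]
print(match)
# With only the first two columns, re-identification teaches me nothing new;
# once a sensitive column is attached, the very same match becomes a disclosure.
```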

A combination of high identifiability and high utility is bad news, and a reason not to post data publicly but to opt for a restricted-access repository instead. However, if both are low, it is very likely you can publish your dataset publicly without problems.

Some concluding remarks

I hope this is at least moderately useful if you’re struggling to balance the GDPR with open data. Check with your local ethics committee and your DPO in case of doubt. And please note that many legal professionals will have absolutely no idea what psychological data actually entails. In my experience, legal professionals and data experts typically think of either clinical records or registry data (such as insurance data) when you ask them how to apply the GDPR to research data. Most psychology data is of a completely different nature and requires a very different approach.

Most importantly, the GDPR is written to be ‘technology neutral’. This leaves some wiggle room: as a scientist, you can demonstrate compliance by showing that you adhere to the best practices in your field, and those best practices can be demonstrated by referring to publications. It is therefore very important for the field to publish on how we deal with data, what problems we encounter, and how we can balance the rights of citizens with scientific transparency. In other words, if you are struggling with the GDPR and open data, don’t keep struggling - write about it and share your experiences.
