In a recent Twitter thread, Malte Olson signaled that several open datasets on the Open Science Framework are improperly anonymized, and thus expose research participants to re-identification risk. Unfortunately, this is not an isolated incident, but a (not entirely unexpected) side effect of the increasing push for open data.
Personally, I am very happy that this is finally(!) getting picked up, and I am going to indulge in a little “I told you so” here… without going into details, please find a 2017 preprint here and a presentation on the matter here.
When I took up the position of department head of the newly formed Research Support department of my faculty in 2017, I was responsible for aligning institutional policies with the GDPR, the European data protection legislation introduced in 2018, which aims to give citizens more say in how their personal data is used. This new legislation has significant implications for research - not so much because it introduces new hurdles and/or restrictions, but because it assigns additional duties of care to data processors and controllers (i.e. researchers and institutions). I wrote a very brief introduction about that, which you can find here.
One of the first things I needed to do was get an idea of how serious the issues regarding open data and the GDPR were. I was very lucky to get help from a very talented student, Yvonne Lomans, who did an inventory of open datasets as part of her master’s thesis research in 2018. We wanted to get an idea of the incidence of serious privacy issues with open datasets in psychology. Yvonne’s thesis has not been published, but it is available via the University Library of the University of Groningen.
The method we used was not ideal, and I have been planning a more formal analysis for the past few years, but as it goes with these things, you never have enough time, particularly not as a department head. Perhaps I will follow this up in the future, now that I have returned to a full-time academic position, but for now I will present Yvonne’s data and conclusions.
The idea was very simple: we looked at a total of 84 psychology papers from four journals mandating open data (Psychological Science, PLOS One, Collabra, Journal of Cognition), and checked whether the accompanying open datasets complied with the GDPR. That quest started out with some disappointment: 8 papers had to be excluded because the alleged ‘open data’ was not open at all - either the data was not present at the linked location, a sign-up was required, or the data belonged to a third party.
For the remaining 76 papers, we looked at the repository used, the ethics approval, the informed consent form, and whether information was given about the participant population. All these aspects matter in determining whether data has been processed in a GDPR-compliant way. Ethics approval and informed consent are especially important if you want to re-use data, as these documents contain vital information about the terms and conditions for re-use. Without the original informed consent form, you can only guess at what you are allowed to do with a dataset. Finally, information about the participant pool matters because it is a source of auxiliary information an attacker can use to re-identify individuals. If you know the participants were first-year students in the English-language psychology bachelor in Groningen, and the data was collected between September and December 2018, that limits your population to 350 rather easily identified individuals, for example (just check their Facebook year group).
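To make that linkage risk concrete, here is a minimal sketch in Python - not the method from the thesis, just an illustration - of how a few seemingly harmless columns combine into quasi-identifiers: group the records by those columns and flag anyone whose combination occurs fewer than k times. All column names and values below are hypothetical.

```python
# A minimal k-anonymity-style check on hypothetical demographic columns.
# Not the method used in the thesis; just an illustration of how
# quasi-identifiers combine to single out participants.
import pandas as pd

def flag_reidentifiable(df: pd.DataFrame, quasi_identifiers: list, k: int = 5) -> pd.DataFrame:
    """Return rows whose quasi-identifier combination occurs fewer than k times."""
    group_sizes = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
    return df[group_sizes < k]

# Three columns that look harmless on their own...
df = pd.DataFrame({
    "age":             [19, 19, 19, 24],
    "gender":          ["F", "F", "M", "M"],
    "study_programme": ["Psych BSc", "Psych BSc", "Psych BSc", "AI MSc"],
    "mean_rt_ms":      [512.3, 498.7, 530.1, 477.9],
})

# ...but in combination, two of the four participants are unique, so
# anyone who knows these three facts about them can read off their data.
print(flag_reidentifiable(df, ["age", "gender", "study_programme"], k=2))
```

The smaller and better-described the participant pool, the smaller these groups get - which is exactly why publishing detailed population information next to the data is risky.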
To assess GDPR compliance, we checked for re-identifiability, processing of sensitive personal data (prohibited under the GDPR unless for scientific purposes; open data is by definition ‘not scientific’, because the data controller cannot guarantee the data will only be used for scientific purposes), and the presence of direct identifiers. Please note that the GDPR definition of ‘personal data’ is a lot broader than the 18 HIPAA identifiers in the US: the GDPR defines personal data as any data relating to an identified or identifiable natural person. The Dutch Data Protection Authority has confirmed in several reports that behavioral data, such as that collected in cognitive tasks, also falls under this definition if it is reported in such a way that it can be related to an individual (e.g. a data file with average RTs per participant would be ‘personal data’ under this definition).
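For the direct-identifier part of such a check, even a crude automated scan can catch the worst offenders before a dataset goes online. The sketch below is my illustration, not part of Yvonne’s method; the file name and regex patterns are assumptions, and a clean scan is of course no guarantee of anonymity.

```python
# A crude pre-publication scan for direct identifiers: look for e-mail
# addresses and phone-number-like strings in a data file. Patterns and
# file name are illustrative assumptions; this only catches the most
# blatant cases and says nothing about indirect re-identifiability.
import re
from pathlib import Path

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"(?:\+|00)\d[\d\s().-]{7,}\d")  # loose international format

def scan_for_direct_identifiers(path: str) -> list:
    """Return suspicious strings found in the file at `path`."""
    text = Path(path).read_text(errors="ignore")
    return EMAIL_RE.findall(text) + PHONE_RE.findall(text)

hits = scan_for_direct_identifiers("open_dataset.csv")  # hypothetical file
if hits:
    print(f"Do NOT publish as-is: {len(hits)} potential direct identifier(s) found.")
```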
Anyway, the results of our investigation were not pretty. Overall, 44.7% of the datasets did not comply with the GDPR; for 22.4% we could not be sure; only 32.9% of the datasets were definitely sufficiently anonymized. One dataset (a questionnaire on religious beliefs, a special category of personal data under the GDPR) even included names, e-mail addresses, and phone numbers of minors. 39.5% of the studies did not include information on ethics approval and/or the content of the informed consent. As a matter of fact, only one study provided the informed consent form with the dataset (which, again, you need in order to determine whether you can lawfully re-use a dataset).
Now, of course, this was all pre-GDPR, and the sample included quite a substantial number of US-based datasets, which may have skewed the results. Nonetheless, we found the results shocking. Our conclusion was that the push for mandatory open data in psychology may have come too early: researchers’ privacy awareness is not yet mature enough to responsibly let them share data publicly. As in our 2017 preprint, we recommended that researchers use restricted-access repositories unless they are absolutely certain of what they are doing. Ah, and please note that although all sampled journals mandated open data, almost 10% of our initial sample (8 of 84 papers) did in fact not include open data - even though they had an open data badge!
I cannot claim that things have gotten better since, simply because I have not done the research. It would be very interesting, though, to see what the situation is like today. Although I do think privacy awareness has increased, I am very doubtful whether it has increased enough to avert disasters. In our department meetings, the biggest fear was always that a clever investigative journalist will one day go browsing on the OSF and write an article about the flagrant violations of participant privacy in psychological science. I am very much afraid that this is still a very real possibility.
Anyway, in conclusion: please be very careful about posting your data publicly. When in doubt, please, please, pretty please with a cherry on top, use a restricted-access repository. Your participants will be grateful.