The Ethics of Using Existing Data in Research

In a few weeks I’ll be presenting a brief workshop on using existing sources of data in research as part of a post-graduate seminar series being run this semester in the UQ School of Psychology. It’s a topic I’m pretty keen on, so I thought I’d write a few posts on some aspects of using existing data that I’ve been thinking about lately. The first topic I wanted to write about was the ethical concerns involved in the use of existing data.

Research ethics has come under heavy scrutiny lately with the revelation that PNAS, a leading academic journal, published a study conducted by Facebook in what seems to be a fairly unethical manner. Some of the discussion around that study focused on whether it was appropriate for the Cornell researchers who worked with Facebook on the study to get ethical approval under the lax guidelines surrounding the use of existing data. The main argument seems to be that because these authors had a hand in designing the study (at least according to the attributions in the paper), it wasn’t appropriate to claim it was an ‘existing dataset.’

While I agree that the use of Cornell’s existing data ethics policy was probably inappropriate in this case, what I’m more interested in talking about is why it’s acceptable to have lower ethical standards for using existing data than for collecting it ourselves. The 2007 National Statement on Ethical Conduct in Human Research, which guides Australian human research practices, states that institutions may exempt research involving existing datasets from ethical review if they contain no identifying information about participants, and involve negligible risk to participants. It’s unclear whether the negligible risk clause refers to the original research or your particular use of the data, which leaves some room for interpretation. A quick search shows several institutions that waive the need for ethical review for the use of any publicly available data.

Intuitively, this doesn’t seem too unreasonable. It seems difficult to do harm to participants from whom data has already been collected – if they did experience any discomfort or mistreatment, it has already occurred, and your analysis of the data will not change that. But this ignores the fact that the collection of data is at least partially demand-driven. To make an admittedly dramatic comparison, while the elephant may have been shot months before you buy an ivory souvenir, it’s still arguably unethical to purchase such products, because doing so increases demand for them and leads to more poaching.

I don’t think the comparison is completely inappropriate. We’re seeing the rise of private companies and organisations with access to data that is incredibly valuable from a scientific standpoint. Dating websites like OkCupid, social network sites like Facebook and Twitter, and search companies like Google have all begun collecting large amounts of detailed data on their users. This is the kind of real-world, detailed data that academic researchers haven’t had a lot of access to until recently, and so I can imagine demand for this data will be high, as it may be highly publishable. The concern then is that if academics are continually willing to analyse and publish data that private companies collect under ethically questionable circumstances, this may drive a push for more unethical research to be conducted.

The take-home message here is not particularly earth-shattering, but it bears saying: just because you didn’t collect the data yourself doesn’t make using it ethical. I think many university ethics boards will adopt higher scrutiny of this kind of data use in light of this latest controversy, but I also think as researchers we need to hold ourselves to a certain standard. I would seriously question whether it’s worth publishing research based on existing data if you wouldn’t feel comfortable running the study that generated it.