OKCupid data release fiasco: It’s time to rethink ethics education

Annette Markham

[addtoany]

originally published on medium.com, cross posted in the social media collective’s blog.

In mid 2016, we confront another ethical crisis related to personal data, social media, the public internet, and social research. This time, it’s a release of some 70,0000 OKCupid users’ data, including some very intimate details about individuals. Responses from several communities of practice highlight the complications of using outdated modes of thinking about ethics and human subjects when considering new opportunities for research through publicly accessible or otherwise easily obtained data sets (e.g., Michael Zimmer produced a thoughtful response in Wired and Kate Crawford pointed us to her recent work with Jacob Metcalf on this topic). There are so many things to talk about in this case, but here, I’d like to weigh in on conversations about how we might respond to this issue as university educators.

The OKCupid case is just the most recent of a long list of moments that reveal how doing something because it is legal is no guarantee that it is ethical. To invoke Kate Crawford’s apt Tweet from March 3, 2016:

Repeat after me: ‘just using public data’ doesn’t make it ethical. Just because you’re researchers doesn’t mean you’re not doxxing.

— Kate Crawford (@katecrawford) March 3, 2016

This is a key point of confusion, apparently. Michael Zimmer, reviewing multiple cases of ethical problems emerging when large datasets are released by researchers emphasizes the flaw in this response, noting:

This logic of “but the data is already public” is an all-too-familiar refrain used to gloss over thorny ethical concerns (in Wired).

In the most recent case, the researcher in question, Emil Kirkegaard, uses this defense in response to questions asking if he anonymized the data: “No. Data is already public.” I’d like to therefore add a line to Crawford’s simple advice:

Data comes from people. Displaying it for the world to see can cause harm.

A few days after this data was released, it was removed from the Open Science Framework, after a DMCA claim by OKCupid. Further legal action could follow. All of this is a good step toward protecting the personal data of users, but in the meantime, many already downloaded and are now sharing the dataset in other forms. As Scott Weingart, digital humanities specialist at Carnegie Mellon, warns:

Your private life is a few big leaks away from being an inescapable matter of public record, once a statistician with BitTorrent gets bored.

— Scott B. Weingart (@scott_bot) May 11, 2016

As a long term university educator, a faculty member at the same university where Kirkegaard is pursuing his Masters degree, and a researcher of digital ethics, this OKCupid affair frustrates me: How is it possible that we continue to reproduce this logic, despite the multiple times “it’s publicly accessible therefore I can do whatever I want with it” has proved harmful? We must attribute some responsibility to existing education systems. Of course, the problem doesn’t start there and “education system” can be a formal institution or simply the way we learn as everyday knowledge is passed around in various forms. So there are plenty of arenas where we learn (or fail to learn) to make good choices in situations fraught with ethical complexity. Let me offer a few trajectories of thought:

What data means to regulators

The myth of “data is already public, therefore ethically fine to use for whatever” persists because traditional as well as contemporary legal and regulatory statements still make a strong distinction between public and private. This is no longer a viable distinction, if it ever was. When we define actions or information as being either in the private or the public realm, this sets up a false binary that is not true in practice or perception. Information is not a stable object that emerges in and remains located in a particular realm or sphere. Data becomes informative or is noticed only when it becomes salient for some reason. On OKCupid or elsewhere, people publish their picture, religious affiliation, or sexual preference in a dating profile as part of a performance of their identity for someone else to see. This placement of information is intended to be part of an expected pattern of interaction — someone is supposed to see and respond to this information, which might then spark conversation or a relationship. This information is not chopped up into discrete units in either a public or private realm. Rather, it is performative and relational. When we only access regulatory language, the more nuanced subtleties of context are rendered invisible.

What data means to people who produce it

Whether information or data is experienced or felt as something public or private is quite different from the information itself. Violation of privacy can be an outcome at any point. This is not related to the data, but the ways in which the data is used. From this standpoint, data can only logically exist as part of continual flows of timespace contexts; therefore, to extract data as points from one or the other static sphere is illogical. Put more simply, the expectation of privacy about one’s profile information comes into play when certain information is registered and becomes meaningful for others. Otherwise, the information would never enter into a context where ‘public’, ‘private’, ‘intimate’, ‘secret’, or any other adjective operates as a relevant descriptor.

This may not be the easiest idea for us to understand, since we generally conceptualize data as static and discrete informational units that can be observed, collected, and analyzed. In experience, this is simply not true. The treatment of personal data is important. It requires sensitivity to the context as well as an understanding of the tools that can be used to grapple with this complexity.

What good researchers know about data and ethics

Reflexive researchers know that regulations may be necessary, but they are insufficient guides for ethics. While many lessons from previous ethical breaches in scientific research find their way into regulatory guidelines or law, unique ethical dilemmas arise as a natural part of any research of any phenomenon. According to the ancient Greeks, doing the right thing is a matter of phronesis or practical wisdom whereby one can discern what would constitute the most ethical choice in any situation, an ability that grows stronger with time, experience, and reflection.

This involves much more than simply following the rules or obeying the letter of the law. Phronesis is a very difficult thing to teach, since it is a skill that emerges from a deep understand of the possible intimacy others have with what we outsiders might label ‘data.’ This reflection requires that we ask different questions than what regulatory prescriptions might require. In addition to asking the default questions such as “Is the data public or private?” or “does this research involve a ‘human subject’?” we should be asking “What is the relationship between a person and her data?” Or “How does the person feel about his relationship with his data?” These latter questions don’t generally appear in regulatory discussions about data or ethics. These questions represent contemporary issues that have emerged as a result of digitization plus the internet, an equation that illustrates information can be duplicated without limits and is swiftly and easily separated from its human origins once it disseminates or moves through the network. In a broader sense, this line of inquiry highlights the extent to which ‘data’ can be mischaracterized.

Where do we learn the ethic of accountability?

While many scholars concerned with data ethics discuss complex questions, the complexity doesn’t often end up traditional classrooms or regulatory documents. We learn to ask the tough questions when complicated situations emerge, or when a problem or ethical dilemma arises. At this point, we may question and adjust our mindset. This is a process of continual reflexive interrogation of the choices we’re making as researchers. And we get better at it over time and practice.

We might be disappointed but we shouldn’t be surprised that many people end up relying on outdated logic that says ‘if data is publicly accessible, it is fair game for whatever we want to do with it’. This thinking is so much easier and quicker than the alternative, which involves not only judgment, responsibility, and accountability, but also speculation about the potential future impact of one’s research.

Learning contemporary ethics in a digitally-saturated and globally networked epoch involves considering the potential impact of one’s decisions and then making the best choice possible. Regulators are well aware of this, which is why they (mostly) include exceptions and specific case guidance in statements about how researchers should treat data and conduct research involving human subjects.

Teaching ethics as ‘levels of impact’

So, how might we change the ways we talk and teach about ethics to better prepare researchers to take the extra step of reflecting on how their research choices matter in the bigger picture? First, we can make this an easier topic to broach by addressing ethics as being about choices we make at critical junctures; choices that will invariably have impact.

We make choices, consciously or unconsciously, throughout the research process. Simply stated, these choices matter. If we do not grapple with natural and necessary change in research practices our research will not reflect the complexities we strive to understand. — Annette Markham, 2003.

Ethics can be thus considered a matter of methods. “Doing the right thing” is an everyday activity, as we make multiple choices about how we might act. Our decisions and actions transform into habits, norms, and rules over time and repetition. Our choices carry consequences. As researchers, we carry more responsibility than users of social media platforms. Why? Because we hold more cards when we present findings of studies and make knowledge statements intended to present some truth -big or little T- about the world to others.

To dismiss our everyday choices as being only guided by extant guidelines is a naïve approach to how ethics are actually produced. Beyond our reactions to this specific situation, as Michael Zimmer emphasizes in his recent Wired article, we must address the conceptual muddles present in big data research.

This is quite a challenge when the terms are as muddled as the concepts. Take the word ‘ethics.’ Although it’s an important term that operates as an important foundation in our work as researchers, it is also abstract, vague, and daunting because it can feel like you ought to have philosophy training to talk about it. As educators, we can lower the barrier to entry into ethical concepts by taking a ‘what if’ impact approach, or discussing how we might assess the ‘creepy’ factor in our research design, data use, or technology development.

At the most basic level of an impact approach, we might ask how our methods of data collection impact humans, directly. If one is interviewing, or the data is visibly connected to a person, this is easy to see. But a distance principle might help us recognize that when the data is very distant from where it originated, it can seem disconnected from persons, or what some regulators call ‘human subjects.’ At another level, we can ask how our methods of organizing data, analytical interpretations, or findings as shared datasets are being used — or might be used — to build definitional categories or to profile particular groups in ways that could impact livelihoods or lives. Are we contributing positive or negative categorizations? At a third level of impact, we can consider the social, economic, or political changes caused by one’s research processes or products, in both the short and long term. These three levels raise different questions than those typically raised by ethics guidelines and regulations. This is because an impact approach is targeted toward the possible or probable impact, rather than the prevention of impact in the first place. It acknowledges that we change the world as we conduct even the smallest of scientific studies, and therefore, we must take some personal responsibility for our methods.

Teaching questions rather than answers

Over the six years I spent writing guidelines for the updated ‘Ethics and decision making in internet research” document for the International Association of Internet Researchers (AoIR), I realized we had shifted significantly from statements to questions in the document. This shift was driven in part by the fact that we came from many different traditions and countries and we couldn’t come to consensus about what researchers should do. Yet we quickly found that posing these questions provided the only stable anchor point as technologies, platforms, and uses of digital media were continually changing. As situations and contexts shifted, different ethical problems would arise. This seemingly endless variation required us to reconsider how we think about ethics and how we might guide researchers seeking advice. While some general ethical principles could be considered in advance, best practices emerged through rigorous self-questioning throughout the course of a study, from the outset to well after the research was completed. Questions were a form that also allowed us to emphasize the importance of active and conscious decision-making, rather than more passive adherence to legal, regulatory, or disciplinary norms.

A question-based approach emphasizes that ethical research is a continual and iterative process of both direct and tacit decision making that must be brought to the surface and consciously accounted for throughout a project. This process of questioning is most obvious when the situation or direction is unclear and decisions must be made directly. But when the questions as well as answers are embedded in and produced as part of our habits, these must be recognized for what they once were — choices at critical junctures. Then, rather than simply adopting tools as predefined options, or taking analytical paths dictated by norm or convention, we can choose anew.

This recent case of the OKCupid data release provides an opportunity for educators to revisit our pedagogical approaches and to confront this confusion head on. It’s a call to think about options that reach into the heart of the matter, which means adding something to our discussions with junior researchers to counteract the depersonalizing effects of generalized top down requirements, forms with checklists, and standardized (and therefore seemingly irrelevant) online training modules.

This involves questioning as well as presenting extant ethical guidelines, so that students understand more about the controversies and ongoing debates behind the scenes as laws and regulations are developed.
It demands that we stop treating IRB or ethics boards requirements as bureaucratic hoops to jump through, so that students can appreciate that in most studies, ethics require revisiting.
It means examining the assumptions underlying ethical conventions and reviewing debates about concepts like informed consent, anonymizing data, or human subjects, so that students better appreciate these as negotiable and context-dependent, rather than settled and universal concepts.
It involves linking ethics to everyday logistic choices made throughout a study, including how questions are framed, how studies are designed, and how data is managed and organized. In this way students can build a practice of reflection on and engagement around their research decisions as meaningful choices rather than externally prescribed procedures.
It asks that we understand ethics as they are embedded in broader methodological processes — perhaps by discussing how analytical categories can construct cultural definitions, how findings can impact livelihoods, or how writing choices and styles can invoke particular versions of stories. In this way, students can understand that their decisions carry over into other spheres and can have unintended or unanticipated results.
It requires adding positive examples to the typically negative cases, which tend to describe what we should not do, or how we can get in trouble. In this way, students can consider the (good and important) ethics of conducting research that is designed to make actual and positive transformations in the broader world.

This list is intended to spark imagination and conversation more than to explain what’s currently happening (for that, I would point to Metcalf’s 2015 review of various pedagogical approaches to ethics in the U.S.). There are obviously many ways to address or respond to this most recent case, or any of the dozens of cases that pose ethical problems.

I, for one, will continue talking more in my classrooms about how, as researchers, our work can be perceived as creepy, stalking, or harassing; exploring how our research could cause harm in the short or long term; and considering what sort of futures we are facilitating as a result of our contributions in the here and now.

For more about data and ethics, I recommend the annual Digital Ethics Symposium at Loyola University-Chicago; the growing body of work emerging from the Council for Big Data, Ethics, & Society; and the international Association of Internet Studies (AoIR) ethics documents and the work of their longstanding ethics committee members. For current discussions around how we conceptualize data in social research, one might take a look at special issues devoted to the topic, like the 2013 issue on Making Data: Big data and beyond in First Monday, or the 2014 issue on Critiquing Big Data in the International Journal of Communication. These are just the first works off the top of my head that have inspired my own thinking and research on these topics.