Introduction

In their 2012 article, “Critical questions for Big Data: Provocations for a cultural, technological, and scholarly phenomenon”, scholars at Microsoft Research and members of the Microsoft Social Media Collective reject the attempt to define “big data” as a phenomenon characterized by the size or complexity of the data, per se (Boyd and Crawford 2012). They note that massive data sets, such as the US Census, were routinely collected and analyzed long before the current era. Instead they define Big Data as a socio-technological phenomenon that integrates—or has the potential to integrate—technology, analysis, and scholarship. Boyd and Crawford (2012) emphasize that Big Data (capitalized to convey its status as a phenomenon to be studied, p. 675, footnote 1) involves scholars from social, quantitative, and technological domains—as well as participants from financial and business domains with limited interest in scholarship, publishing, or academic applications of Big Data. This diversity presents a significant problem: no single discipline exists to govern, or even to create norms around ethical practices in, Big Data use or research.

This normative paper describes an approach to using ethical reasoning to promote professionalism, and to prepare practitioners for responsible and ethical Big Data use, research, and engagement. We draw on the more scholarly domains where Big Data research and tools are developed and employed (e.g., Dourish and Bell 2011, Ch. 1) in order to anchor our approach; however, we believe this paper is also relevant for the more commercial and less academic facets of Big Data, because all should involve properly trained practitioners (e.g., Bollier 2010). We aim to orient readers to the role and relevance of ethical reasoning in a twofold way, both historically and methodologically. We first show the centrality of codes of conduct and ethics in the process of professionalization and discipline formation, and then outline a way to use existing codes of conduct for the ethical training of practitioners. We focus the discussion of the formation of professional community norms on two key disciplines from which Big Data may attract many practitioners: computer science and statistics (i.e., technology and analysis). However, the approach can be used to amplify the reach and resonance of the code of conduct from any field.

The development of a discipline or profession tends to proceed in tandem with a normative agreement on a code of professional ethics or conduct for the practitioner in the community (Starr 1984; Parker 1968; Abbott 1988; see also Berg and Singer 1995; Krimsky 1984; National Society of Genetic Counselors 1992), but with Big Data attracting practitioners from many diverse backgrounds, the development of a single discipline- or profession-specific code of conduct or ethics appears unlikely. As a result, norms around ethical practices in Big Data use or research are likely to develop in a fragmented or piecemeal manner, if they develop at all (see, e.g., Lo 1993). In that case, institutional practice and culture would be Big Data practitioners’ only exposure to community and normative behaviors like “professionalism,” or professional conduct.

For undergraduate and graduate students, particularly those in the sciences, the dominant training model for “ethics” in scholarly work, the responsible conduct of research, is a single static training opportunity (e.g., course or module), often for all “researchers” or science students within an institution. However, not all scientific fields involve research, so if an institutional training program for ethics is characterized as supporting only “research”, it limits the relevance—whether perceived or actual—of this training. This limitation may be particularly true for those involved in, or preparing to engage with, Big Data—whether it is viewed as a new paradigm, a new application, or a new “cultural, technological and scholarly phenomenon” (Boyd and Crawford 2012).

Lo (1993) outlines (and addresses) many objections to the formal integration of ethics training into preparation for scientific research (and by extension, the practice of science). By integrating a published model for training in ethical reasoning with codes of professional conduct with which Big Data practitioners and instructors might be familiar, this paper describes a way forward. Ethical reasoning is a learnable and improvable skill set (Tractenberg and FitzGerald 2012) that can be sustained (Tractenberg et al. in review) beyond the course, and so can serve both to introduce future practitioners to codes of professional conduct and to prepare them for challenges in their practice that cannot be foreseen.

The dominant training model for “ethics” in scholarly work, the responsible conduct of research, usually requires a single training opportunity, often in the context of early preparation for, or gaining approval to conduct, research. Because major applications of Big Data exist outside of formal “research”, a research-oriented institutional training program for ethics may miss important issues and substantial practitioner communities that do or will engage with Big Data. The common model of “institutional training in the responsible conduct of research” typically relates only weakly to Big Data practitioners and quantitative scientists. Professional associations that specify codes of ethical conduct may relate just as weakly because they are discipline specific. By contrast, the fact that Big Data is becoming transdisciplinary or discipline independent suggests that a general, rather than a domain-, discipline-, or profession-specific, model of training and practice in ethical reasoning would be more effective. The ethical reasoning approach we describe targets the decision-making that is inherent to the practitioner’s work (Bollier 2010; see also Boyd and Crawford 2012; Dwork and Mulligan 2013), so that this training can be applied across a wide range of contexts, even if these were not part of the original training program.

The Association of Internet Researchers (AoIR) directly challenges, and offers an alternative to, the dominant responsible conduct of research training model. Specifically, the AoIR’s revised code of conduct, “Ethical Decisionmaking and Internet Research” (AoIR Ethics Committee 2012), includes a set of recommendations focused on the ongoing and iterative decision-making that is required throughout research that involves the Internet (or data derived therefrom), with only passing mention of the core “factual” documentation that comprises the bulk of most responsible conduct of research training “courses” or modules. Moreover, the AoIR specifically targets ethical reflection (contemplation of the ethical implications of decisions made throughout any given research project), and not factual mastery of any sort (which is a more standard approach to the ‘topics to be covered’ in a course on the responsible conduct of research). The National Academy of Engineering (NAE 2013) similarly questioned the goals and utility of the typical training course, concluding that “…(t)he entire community of scientists and engineers benefits from diverse, ongoing options to engage in conversations about the ethical dimensions of research and (practice)” (Kalichman 2013: 13). The perspectives of both engineering and Internet research emphasize ethical reflection and building capacity for ongoing discussion.

From yet another disciplinary perspective, we, too, published an outline of learnable, improvable reasoning skills (Tractenberg and FitzGerald 2012) that can lead to, and support, the type of ethical reflection the AoIR describes and that the NAE (2013) recommends (see also King and Kitchener 1994). Our approach to ethical reasoning is based on a published career-spanning (developmental) paradigm for training in ethical reasoning: the Mastery Rubric (MR) for ethical reasoning (MR-ER, Tractenberg and FitzGerald 2012). A Mastery Rubric is a curriculum building and evaluation tool, similar to a traditional rubric (e.g., Stevens and Levi 2005) in that the desired knowledge, skills and abilities for a curriculum—rather than an assignment or task—are outlined together with performance levels that characterize the respondent from novice to proficient (Tractenberg et al. 2010). The Mastery Rubric for Ethical Reasoning treats ethical reasoning (ER) as a learnable, improvable skill set: (identification and assessment of one’s) prerequisite knowledge; recognition of a moral issue; identification of relevant decision-making frameworks; identification and evaluation of alternative actions; making and justifying a decision (about the moral issue); and reflection on the decision. These skills were derived from compendia of scholarly work reflecting ethical decision-making (http://www.scu.edu/ethics/). This skill set—if developed and fostered—can be utilized to support decision-making in and around Big Data, in contexts that range from career-specific issues of professionalism (e.g., AoIR 2012; Tractenberg 2013) to ethical, legal, and social issues that require discussion and input representing a wide range of disciplinary perspectives (e.g., “ongoing options to engage in conversations about the ethical dimensions of research and (practice)” (Kalichman 2013: 13)).

This approach is as applicable in other science domains (Tractenberg et al. in review) as it would be in Big Data, and was designed and intended to be utilized throughout the course of a scientific and professional career whether the domain is the Internet, biomedical research, or some combination of these domains with others. Given the importance of Big Data, many institutions have created, or will create, training opportunities (e.g., degree programs, workshops) to prepare people to work in and around the domain. The limited “professional practice” guidelines for Big Data may be one reason why insufficient time, space, and thought have been dedicated to how to train new practitioners to engage with the ethical, legal, and social issues in this new domain. Our paradigm can be used to fill this gap.

Methods

We summarize the histories of the codes of professional conduct from two disciplines with clear involvement in Big Data (collection and analysis): computer science and statistics. Our capsule histories (Results) provide historical analogies between the past and the present that focus on three points: the instability of communities at the early phases of emergent professions and disciplines; individual efforts to catalyze professional communities around clear questions; and the centrality of codes of conduct and ethics in the process of professionalization and discipline formation. In our use of historical analogies from computer science and statistics, we draw upon canonical and collaborative studies published by practitioners in each field, as well as by historians of technology, diplomatic historians, and political scientists who provide guidance on the benefits and limits of historical analogies for policymakers (Mazlish 1965; Neustadt and May 1986). To demonstrate how ethical reasoning can support this training for faculty and students alike, we integrated the ER skill set with the professional codes of conduct from the Association for Computing Machinery (ACM) and the American Statistical Association (ASA). Both the ASA (http://www.amstat.org/committees/ethics/) and the ACM (http://www.acm.org/about/code-of-ethics) have extensive and well-crafted codes of professional conduct. The codes are among the best-articulated and most relevant to professional involvement with Big Data, and each can be seen as having specific relevance to professional conduct around Big Data. Because Big Data is multidimensional and multi-disciplinary, choosing “the profession” whose code of conduct most relates to one’s Big Data engagement is an excellent starting point; integrating ethical reasoning knowledge, skills and abilities with ANY code of conduct will help sustain the objectives of “training in ethics” beyond the completion of this training (Tractenberg et al. in review). This approach may therefore be useful in stimulating reflection on novel ethical, legal, and social issues (ELSI) that arise or are encountered beyond the context of training.

Finally, we took the mapping of the codes of conduct (ASA, ACM) to the MR-ER and examined its alignment with the seven criteria for teaching goals of ethical training outlined in the NAE conference report (NAE 2013) on ethics training for science, technology, engineering, and mathematics (STEM) disciplines, within which Big Data clearly falls.

The seven criteria for characterizing teaching goals of ethical training (NAE 2013; Kalichman 2013: 11) are:

  1. (Each ethics teaching) goal should represent something important/relevant to the ethical or responsible conduct of research or practice.

  2. Goal should identify and address some concrete deficiency.

  3. Achievement of the goal should be independent of other (possibly related) goals.

  4. Goal should be actually and observably amenable to an active intervention.

  5. Achievement of the goal should be documented/documentable with either quantitative or qualitative <as appropriate> outcomes.

  6. Achievement of the goal should result in a change that is detectible and meaningful.

  7. Goal should be feasible.

Results

We present the historical summaries of the two codes of conduct first, followed by the mapping of these codes with the MR-ER knowledge, skills, and abilities, and finally present the assessment of the consistency of these maps with the NAE ethics teaching goals criteria (Kalichman 2013).

Historical Summary 1: Association for Computing Machinery

The Association for Computing Machinery (ACM), today a prestigious international society with over 100,000 members, began humbly and informally at a meeting at Columbia University in the fall of 1947. The meeting’s organizer was Edmund C. Berkeley, an insurance industry expert whose first encounter with the power and charm of computers came during World War II, when the Navy assigned him to work at Harvard on Howard Aiken’s Mark I. After the war, Berkeley returned to work at the Prudential Insurance Company, where he explored how Prudential could integrate new machine calculating technologies (Akera 2007). At that time, neither computer technology nor the very meaning of the term “computer” was stable. Moreover, a diverse and dispersed group of military and government officials, mathematicians, engineers, and equipment manufacturers were building and using things that subsequent generations would recognize as early computers. Before Berkeley organized the 1947 meeting at Columbia, there was no obvious reason that this group of thinkers, builders, and users would soon coalesce into a single professional community.

Nevertheless, Berkeley and hundreds of fellow computer enthusiasts gave life to the ACM, and the organization grew and flourished during the 1940s and 1950s. Throughout this period the ACM followed the familiar patterns of professionalization established by communities of technical experts in the late 19th century. Organizations such as the American Society of Mechanical Engineers (founded 1880) and the American Institute of Electrical Engineers (founded 1884) thrived because they were able to carve out distinctive occupational niches and propagate effective mechanisms that could facilitate the growth of an expert community of practitioners. Standards for membership and codes of ethics were key institutional expressions of their shared professional identity (McMahon 1984; Sinclair 1980; Abbott 1988).

ACM members—computer experts in industry, academia, and government—confronted conditions of persistent instability as they created boundaries between “computer science,” “computer engineering,” “software engineering,” and related fields (Shapiro 1997; Mahoney and Haigh 2011; Ensmenger 2012; Jesiek 2013). A fundamental problem in the 1950s and 1960s was that the supply of graduates from academic programs was not well aligned with the growing industrial and commercial demand for programming labor. Indeed, as historian Nathan Ensmenger observed, “there was little agreement within the computing community about who exactly qualified as an experienced, professional practitioner” (Ensmenger 2001).

To cultivate a greater sense of prestige and stability within their field, ACM members formed special interest groups in specific technical fields, as well as committees to make recommendations about social issues such as educational curricula and professional conduct. The latter effort began in 1966, when the ACM Council adopted “Guidelines for Professional Conduct in Information Processing.” Donn B. Parker, Chairman of the ACM Professional Standards and Practices Committee, boldly laid out their motivation in a 1968 article: “There are a number of serious ethical problems in the arts and sciences of information processing. […] It is difficult to discuss ethics in our field without considering professionalism […] [but] the diverse backgrounds of people in the field and the diverse applications of computers in other professions are significant problems in an effort to unify.” Parker continued, “Sooner or later some body or group is bound to do something drastic and bring nationwide attention and disgrace to our profession. We are sitting on the proverbial powder keg. […] The press is creating a fear of computers through their personification with such headlines as ‘Meet the Monster that Checks Your Taxes’” (Parker 1968).

The ACM thus initiated their foray into professional ethics as an urgent response to the fear that Parker articulated so vividly: computers—and computer “professionals”—were vulnerable to public mistrust due to misunderstanding, unscrupulous applications, or both. The ACM formalized its 1966 Guidelines into a Code of Professional Conduct, adopted in 1972. Subsequent revisions refined the 1972 Code until the ACM took a new direction with its Code of Ethics and Professional Conduct in 1992 (ACM 1992).

ACM leaders emphasized the pedagogical function of an “educationally oriented code” that would clarify the social responsibilities of the profession. The ACM reframed its 1992 Code as “a basis for ethical decision making in the conduct of professional work,” organized around 24 “imperatives”: “general moral imperatives” concerning honesty, fairness, social responsibility, intellectual property, privacy, and confidentiality; “specific professional responsibilities” concerning the quality of work and unauthorized access to computer systems; “organizational leadership imperatives” concerning fair and conscientious management of people and resources; and commitments to comply with and uphold the Code. “The future of the computing profession,” the 1992 Code concluded, “depends on both technical and ethical excellence” (Anderson et al. 1993).

The ACM has had mixed success with its plea for technical and ethical excellence in the computing profession. In the twenty-first century, technical and ethical shortcomings became notoriously common throughout computer design, operation, and use (Luca 2014; Scheuerman 2014; Singleton 2014). By design the ACM’s Code of Ethics is voluntary and aspirational—enforceable only through social and professional pressure. The ACM has helped to create and promulgate accreditation criteria for university computer science programs to train students to understand their “professional, ethical, legal, security, and social issues and responsibilities.” But there is no standard curriculum to teach the nuances of ethical reasoning, and high-profile instances of criminal behavior and systematic violations of user privacy undermine the ACM’s aspiration to contribute to social and human well-being.

Historical Summary 2: American Statistical Association

The origins of statistics as a discipline and field are difficult to pinpoint, mainly because there are two common meanings for the term. One meaning, statistics as census-type (summary) data, dates back to the Ancient Greeks, Chinese, and Romans. The second meaning of the term “statistic”, as an estimator of a population (i.e., “true”) parameter, arose only after the invention of calculus in 1684 (Leibniz) and 1687 (Newton), which provided the mathematical tools needed to realize this second meaning.

Possibly the most limiting factor in the development of statistics as a formal discipline was the intractability of the mathematical computations required for anything beyond the most straightforward descriptive statistics for distributions. The advent of computers greatly facilitated calculations that had previously been done by hand and/or with reference to published tables of probabilities. Americans in the 1920s were among the first to apply statistics and computing together to analyze economic, financial, and agricultural issues (Grier 1995).

Mason et al. (1990) describe the creation of the American Statistical Association (ASA) in 1839, while the Royal Statistical Society (RSS) was established in the UK in 1834 (chartered as the “Royal Statistical Society” in 1887). The RSS was created as a special section within the British Association for the Advancement of Science, while the ASA evolved with a close affiliation to, but independent of, the US Federal government. The International Statistical Institute (ISI) was created in 1885.

The original objectives of the ASA were “to collect, preserve, and diffuse statistical information in the different departments of human knowledge,” and to “promote the science of Statistics…” Governments (federal, state, and local) were the main users of statistics in the US in the mid-1800s, but there were important implications for economics, agriculture, and public health in those same census data. The ASA supported the creation of the US Census Bureau and Bureau of Labor Statistics, and professional statisticians have contributed critically to agriculture, economics and other social sciences, clinical and biomedical research, and many other fields. However, professionals from other fields, particularly up until the mid-twentieth century, have also contributed critically to statistics as a discipline.

In the early to mid-1900s, interactions between economists, sociologists, and theoretical statisticians (on statistical applications) prompted changes in statistics education. Since their inceptions in the UK (RSS) and the US (ASA), these statistical societies and their members have been committed to the promotion of the discipline and to its rigor. The ASA actively encouraged and promoted statistics within other disciplines, particularly in the early part of the twentieth century; this led to the establishment of new disciplines or subsets of disciplines wherein computing or statistics (or both) are featured. The interactivity of statistics and statisticians with other domains is represented concretely in the current (2014) ASA objectives, “…to foster statistics and its applications, to promote unity and effectiveness of effort among all concerned with statistical problems, and to increase the contribution of statistics to human welfare… (it) cooperates with other organizations in the advancement of statistics, stimulates research, promotes high professional standards and integrity in the application of statistics, fosters education in statistics, and, in general, makes statistics of service to society.”

The ASA section on “the training of statisticians” (which later became the section on Statistical Education) was established in 1944, only the second section to be created in the ASA. The International Federation for Information Processing was created in 1960 by UNESCO, and the International Association of Statistical Computing was established within the ISI in 1977, but a committee on Computational Statistics and Data Mining for Knowledge Discovery was only established by the ISI in 2012. Thus, the disciplinary and professional status of statistics has changed over time and the reliance on computers and computing must be considered a primary driver of this evolution.

Possibly reflecting the ASA’s origins close to Federal statistics and data, the passage of the Freedom of Information and Privacy Acts, in 1966 and 1974 respectively, raised issues for professional practice among federal statisticians, and an ASA Ad Hoc Committee on Privacy and Confidentiality was appointed. The committee report was published in 1977 (ASA 1977), and later that year, an Ad Hoc “Committee on the Code of Conduct” was appointed. Meanwhile, the Belmont Report (“Ethical Principles and Guidelines for the Protection of Human Subjects of Research”, U.S. Department of Health, Education, and Welfare 1979) was published in 1979, bringing with it significant new initiatives for all participants in biomedical and behavioral research involving human subjects. The Belmont Report went far beyond privacy and confidentiality, and ushered in an era of training in “responsible conduct of research” that applied (or was applied) to any and all federally funded research—and researchers—in the U.S. (see, e.g., U.S. Department of Health and Human Services 1979; Bugliarello 1993; and National Academy of Engineering 2013).

The ASA Committee on the Code of Conduct became the “Committee on Professional Ethics” in 1982, and in 1983, this committee published the first ever “Ethical Guidelines for Statistical Practice” (ASA 1983), although “Principles of Professional Statistical Practice” had been published in the Annals of Mathematical Statistics in 1965 (Deming 1965). The 1983 Guidelines can be seen to have integrated elements from the Belmont Report (introducing “basic ethical principles” and the idea that practice and research are two separable aspects of biomedical research) and the ASA Privacy and Confidentiality report from 1977. In 1985, the ISI published its “Declaration on Professional Ethics” (revised in 2010, ISI 2010), and the Encyclopedia of Statistical Sciences (first edition 1986) has included “Principles of Professional Statistical Practice” in each edition (e.g., Deming 2006). Also in 1986, the Committee on Professional Ethics was made a continuing committee (instead of Ad Hoc). The ASA Guidelines were revised in 1999 and are being revised again in 2014. The current domains on which the ASA guidelines focus are: professionalism; competence, judgment, diligence; responsibilities to funders, clients, and employers (assuring that statistical work is suitable); responsibilities in publications and testimony; responsibilities to research subjects; responsibilities to research team colleagues; responsibilities to other statisticians or statistical practitioners; responsibilities regarding allegations of misconduct; and responsibilities of employers.

The evolution of this code of conduct reflects two key elements of statistical practice and the ASA’s origins: the US Federal initiatives of 1966 and 1974 and the Belmont Report of 1979; and the influence those acts had on federally funded research and the training of new federally funded researchers—primarily in biomedical, clinical, and behavioral research domains. The laws represented an impetus for statisticians practicing with Federal data (government statisticians), and had the potential to affect academic statisticians—who tend to be supported by federal funds, as most academic researchers in applied statistical domains are.

There have been no efforts to integrate professional conduct or the ASA Ethical Guidelines for Statistical Practice into statistics curricula at any level, although standardization efforts for statistics curricula have gone on periodically. The ASA initiated a process to create Guidelines for Assessment and Instruction in Statistics Education for pre-K through 12th grade (ASA 2007) and undergraduates (ASA 2005), to bring statistical training (to be achieved within a single course) to undergraduate non-majors and to early education curricula. Perhaps because the target audience is clearly not practitioners, neither of these includes any elements of “professional ethics” or a code of conduct. However, none of the training initiatives of the ASA mentions professional ethics, either. The current (2014) undergraduate statistics (for majors) curriculum reform would be the first to mention the guidelines.

Mapping Ethical Reasoning and Its Development to Codes of Conduct

These historical summaries outline how and why Big Data scientists-in-training, whose backgrounds may be focused more on data analysis (ASA) or on computing (ACM, as examples), might very well come to their practice with no knowledge of a discipline’s codes of conduct: the codes are not usually formally taught in any curriculum. Furthermore, Big Data practitioners might also be unaware of any discipline-specific code of conduct because they do not identify themselves strictly with one or another profession or professional society. This approach is by no means limited to computing and statistics, since it is well known that Big Data draws practitioners from a wide range of disciplines. The approach is also not unique to training in Big Data; our recent experiences involve students from neuroscience, molecular biology, immunology, and clinical and translational research (Tractenberg et al. in review).

Two examples of syllabi, geared towards graduate-level students, are provided in Appendices 1 and 2. Appendix 1 presents a semester course syllabus intended to introduce students to the eight fundamental moral imperatives of the ACM code, and also to give them opportunities to learn, practice, and get feedback on the individual ethical reasoning skills. These skills are generally supportive of academic activities (writing, speaking, training others), and the ACM’s “General Moral Imperatives” are also widely applicable. However, the syllabus could be modified to capture other issues or domains of interest or relevance to the community (or the time), creating at least a semester’s worth of “…options to engage in conversations about the ethical dimensions of research and (practice)” (Kalichman 2013: 13).

Appendix 2 presents descriptions of meeting objectives for a semester course syllabus intended to teach the ethical reasoning knowledge, skills, and abilities (KSAs) using the eight domains of the ASA code of conduct as well as the NIH “responsible conduct of research topics” list. For simplicity, we did not include the overview, prerequisite/co-requisite, required text, and other details that the full syllabus in Appendix 1 contains. These ancillary and structural materials work equally well for the courses designed around the ACM and ASA codes, and would also work for courses that integrate other professional codes with ethical reasoning training.

Instructors of a course on ethics for professional practice (in many different domains and disciplines) can take the syllabi we have created and teach these semester courses, or they can modify them to capture any other code of conduct. Appendix 2 shows a course designed to ensure that any federally funded students who must take this “responsible conduct of research” training will not only fulfill this requirement, but will also learn ethical reasoning skills and a code of conduct that (unlike the “responsible conduct of research” topics, Antes et al. 2010) could be useful to them in their professional lives.

Table 1 aligns the two syllabi with the seven NAE criteria for ethics education goals articulated earlier (Kalichman 2013).

Table 1 Alignment of ethics course goals criteria (NAE 2013) with syllabi mapping MR-ER to codes of conduct

It can be seen that syllabi organized like those given in Appendices 1 and 2 will meet the NAE ethical training goal criteria. A course organized using Appendix 2 will also meet NIH (2009) and NSF (2009) training requirements for “responsible conduct of research”. Finally, courses such as these also achieve the overarching goal of teaching “…people who want to do good science about the ethical standards and issues in their work, and how to deal with ethical problems that they encounter as scientists” (Swazey and Bird 1997: 5). As outlined in the historical summaries, these codes of professional conduct were created from within the disciplines, and based on professional practitioner experiences “with ethical problems they encounter”.

Discussion

The codes of professional conduct we have outlined clearly show that the decisions that go into the everyday practice of these disciplines, as well as others in STEM, are expected to be grounded in professionalism and responsibility. This is true whether or not practitioners support research specifically. The historical summaries describe the development of the respective codes, but also show how each discipline assumes its membership will adhere to a code of professional ethics, even though neither group formally inculcates this code in its members.

To date, discussions of “ethics” in Big Data (research or commercial applications) tend to be focused on privacy issues relating to the collection—and not the use—of Big Data (e.g., Ringelheim 2007; European Union 2009; Federal Trade Commission 2012; White House 2012; Polonetsky and Tene 2013). For example, a 2009 edited volume, “The Fourth Paradigm: Data-Intensive Scientific Discovery” (Hey et al. 2009), has one chapter (of 28 content chapters) that focuses on policy (“The future of data policy”), with nothing specifically on the ethical use (or collection) of Big Data. In contrast, Dwork and Mulligan (2013) suggest that the computer science, policy, and software engineering domains focus on privacy and its protection because such solutions are easy (or at least easier than the other, more complex problems), stating, “The ease with which policy and technical proposals revert to solutions focused on individual control over personal information reflects a failure to accurately conceptualize other concerns” (see Dwork and Mulligan 2013 for discussion of what, other than privacy, should be of concern for Big Data as of September 2013). Our point is that the relevant ethical issues cannot be predicted, and in some cases they cannot currently even be agreed upon; however, the ability to reason ethically is a stable—learnable and improvable—skill set that students can begin to develop. Engaging a professional code of conduct to introduce ethical reasoning can bring the idea of professionalism, and possibly relevance, to training in ethics. We argue here that this is an important consideration for Big Data practitioners in training, but it is also important for adequate preparation in all sciences.

Like computer experts in the late 1940s, Big Data researchers are an inchoate group of technical experts who (currently) share only a general set of tools and techniques. The ACM and ASA communities eventually coalesced around their respective common professional aspirations or associations; like statistics at its inception, Big Data draws its practitioners from across and within a variety of disciplines. However, unlike statistics and computing machinery, “Big Data” in the 2010s is a term that does not represent a coherent or stable community of practice or academic discipline, but rather a collection of tools used to collect and manipulate large data sets. Therefore, practitioners of Big Data are unlikely to arrive at a single, uniform code of professional conduct. We have yet to see within Big Data research or practice a concerted drive for professionalization, in the shape of an overarching membership organization, a distinct identity of a research community, an organized study of educational curricula, or a widely-shared code of ethics or professional conduct. However, we have also seen that two disciplines contributing to Big Data have all of these things except a “widely-shared code of ethics”—the codes are published, but they are not shared or taught.

The MR-ER is clearly supportive of, and aligned with, the sort of training that would promote the introduction of a code of conduct such as those outlined by the AoIR (Association of Internet Researchers 2012), ASA (e.g., see Tractenberg 2013), or ACM. It is also consistent with existing responsible conduct of research training materials (e.g., Steneck 2007), and it explicitly promotes curriculum development and evaluation. The MR-ER is focused on decision-making and reasoning, two critical elements required for thoughtful contemplation of ethical, legal, and social issues (with respect to Big Data, or any field; e.g., Tractenberg et al. 2014; see also Schmaling and Blume 2009).

There is currently no formal integration of the ASA code of professional conduct into master’s or doctoral training of statisticians (ASA Executive Director, personal communication, July 2013). We were similarly unable to find formal incorporations of the ACM and AoIR recommendations for ethical research into any of the graduate programs described in online materials in the U.S. This suggests that, although our emphasis here is on bringing some structure and support to the integration of ethical training into the preparation of individuals who will work in/with Big Data, our focus could be relevant to other disciplines as well. We propose that academic disciplines that train students to collect and/or use Big Data should incorporate formal training in ethical reasoning—irrespective of whether there are specific (e.g., federal funding) requirements for training in the “responsible conduct of research”. Although our approach is to introduce new practitioners to ethical reasoning and professional conduct with a single semester course, the MR-ER framework also promotes ongoing engagement with both throughout the curriculum and along the entire career trajectory (Tractenberg and FitzGerald 2012; see also Powell et al. 2007).

Big Data research and practice need not wait until there is a single or even a “most relevant” professional organization before trainees and practitioners are introduced to, and oriented towards, ethical reasoning skills that support professional conduct and ethical research. Moreover, the approach we have outlined can be adapted for the code of conduct in any discipline. Individuals in leadership positions may be especially obliged to pursue this ethical development; the fact that training and achievement should differ depending on the level of the professional practitioner is one of our most important contributions to the discussion of the ethical training that is required for competent, ethical, professional practice in and around Big Data. With the approach we have outlined, individuals can document their higher-level achievement, qualifying themselves to be these leaders and to introduce new professionals to these ways of thinking and being.