Recommendation 7
Judith Canner (chair), Matthew Hayat, Jessica Utts, Donna Lalonde
Emphasize responsible and ethical conduct in the collection and use of data and in their analysis.
Ethics must be integrated and discussed at every stage of the statistics and data science process to ensure trustworthy results and minimize risk to others. The complex technology landscape coupled with an increasing availability of real-world data makes it essential that statistics and data science instruction incorporate the ethical aspects of data collection, analysis, and dissemination of results (Raman et al. 2023). As articulated in the Ethical Guidelines for Statistical Practice as published by the American Statistical Association (2022), “Good statistical practice is fundamentally based on transparent assumptions, reproducible results, and valid interpretations.” The following outlines the process of statistics and data science with an emphasis on integrating ethical discussion throughout.
Formulate a Research Question: The first step in any research is to formulate a question or questions, and to determine whether appropriate data collection methods are available. The discussion can include issues such as whether a large enough sample can be collected to provide meaningful results, and whether the study has ecological validity (i.e., mimics how treatments would be applied in the real world as opposed to in an experiment). As we integrate data with context and purpose, instructors can demonstrate ways that data can be used for good (e.g., identifying equity gaps for intervention) or for oppression (e.g., facial recognition, inappropriate use of private data).
Data Collection and Preparation: Ethical considerations in data collection and preparation include the following issues.
- Were the data collection methods appropriate for the research question(s)?
- Is there reason to question the accuracy of the data based on how and from whom they were collected?
- Is there reason to think that the data may be biased or cause misleading results, for instance if dropout rates differed among treatment groups?
- Was appropriate consent given? If the data are from human subjects, did they give permission for their data to be used this way, for instance to be shared publicly or to be used for a purpose other than that stated in the original informed consent?
- Are appropriate measures being taken to ensure data protection and confidentiality?
- How will data cleaning, handling missing data, and transforming data impact our results, and have all such steps been documented?
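One way to make the documentation question above concrete for students is to have every cleaning step record what it did and how many rows it removed. The following is a minimal sketch, not a prescribed workflow: the `CleaningLog` class, the field names, and the example records are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CleaningLog:
    """Hypothetical helper that records the effect of each cleaning step."""
    steps: list = field(default_factory=list)

    def apply(self, rows, description, keep):
        # Keep only rows that satisfy the rule, and log how many were dropped.
        kept = [row for row in rows if keep(row)]
        self.steps.append({
            "step": description,
            "rows_in": len(rows),
            "rows_out": len(kept),
            "dropped": len(rows) - len(kept),
        })
        return kept


# Made-up records for illustration only.
records = [
    {"id": 1, "age": 34, "score": 71.0},
    {"id": 2, "age": None, "score": 64.5},
    {"id": 3, "age": 29, "score": None},
    {"id": 4, "age": 212, "score": 80.0},
    {"id": 5, "age": 41, "score": 77.5},
]

log = CleaningLog()
cleaned = log.apply(
    records,
    "drop rows with missing age or score",
    lambda r: r["age"] is not None and r["score"] is not None,
)
cleaned = log.apply(
    cleaned,
    "drop implausible ages (> 120)",
    lambda r: r["age"] <= 120,
)

# The log can be reported alongside the results.
for step in log.steps:
    print(step)
```

A log like this gives students something specific to discuss: which rows were excluded, why, and whether the exclusions could change the conclusions.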
Exploratory Data Analysis (EDA): In addition to data visualization and software tools, students should be able to discuss the ethics of data representation, relating back to how to appropriately answer a research question, choosing the right statistic(s) to address the question, making comparisons when groups differ at baseline, and other ways EDA could be misused or misinterpreted (e.g., Prison COVID Cases Fuel New Rural Hotspot; State’s Map Draws Criticism | Georgia Public Broadcasting; Visualizing the Virus). In addition, this step can be used to discuss virtue ethics, such as compassionate data visualization: how do we communicate both the humanity and reality of the data in a visualization?
Analysis: In model development, discussions about selecting appropriate data analysis or modeling methods based on data are important as students consider the limitations of inference and the assumptions underlying the modeling method. Every analysis should be reproducible¹ (New Report Examines Reproducibility and Replicability in Science, Recommends Ways to Improve Transparency and Rigor in Research | National Academies).
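For analyses that involve random resampling or simulation, one common classroom illustration of reproducibility is fixing the random seed so that rerunning the code on the same data gives the same numbers. The sketch below assumes a simple percentile bootstrap for a mean; the function name `bootstrap_mean_interval`, the default seed, and the data values are hypothetical choices made for this example.

```python
import random
import statistics


def bootstrap_mean_interval(data, n_resamples=1000, seed=2024):
    """Approximate 95% percentile bootstrap interval for the mean.

    A dedicated, seeded generator makes the resampling repeatable.
    """
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    # Integer arithmetic avoids floating-point surprises in the indices.
    lower = means[n_resamples * 25 // 1000]
    upper = means[n_resamples * 975 // 1000 - 1]
    return lower, upper


# Made-up measurements for illustration only.
sample = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.9]

first_run = bootstrap_mean_interval(sample)
second_run = bootstrap_mean_interval(sample)

# Same data and same seed give the same interval on every run.
print(first_run == second_run)
```

Recording the seed, the software version, and the code itself alongside the results is what lets someone else repeat the analysis and compare.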
Drawing Conclusions and Communicating Results: Presentation of results should include statements related to causality and the generalizability of the results. For example: (1) How does data collection limit how the results can be generalized? (2) Can we use the analysis to make reliable predictions? (3) Who is responsible for how others use a model or result? (4) Is the full story presented, or are only the most striking results presented, thus increasing the likelihood of a false positive?
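The last question can be illustrated with a short calculation. Under the simplifying assumptions that all tests are independent and every null hypothesis is true, the chance of at least one false positive among m tests at significance level alpha is 1 - (1 - alpha)^m. The sketch below uses a hypothetical helper, `familywise_error_rate`, to show how quickly that chance grows, and how a Bonferroni adjustment (testing each hypothesis at alpha / m) keeps it near alpha.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive among independent tests
    when every null hypothesis is true (a simplifying assumption)."""
    return 1 - (1 - alpha) ** n_tests


alpha = 0.05

# Unadjusted: the chance of a spurious "striking" result grows with
# the number of tests that were run.
for n_tests in (1, 5, 20):
    rate = familywise_error_rate(alpha, n_tests)
    print(f"{n_tests:>2} tests: {rate:.3f}")

# Bonferroni adjustment: test each hypothesis at alpha / n_tests.
n_tests = 20
adjusted = familywise_error_rate(alpha / n_tests, n_tests)
print(f"Bonferroni-adjusted, {n_tests} tests: {adjusted:.3f}")
```

Reporting only the one significant result out of twenty hides this context from the reader, which is why disclosing every analysis that was run matters.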
Considering Changes in Data and Technology: In practice, one of the most important steps in both data science and statistics is recognizing that they are iterative processes. The acquisition of new knowledge may require the modification of initial findings. Data and analysis may need to be revisited, and decisions based on inference may need to be reformulated in different contexts or under varying conditions.
The field of statistics is changing, and with these changes come new ethical imperatives in statistics education. As Rameela Raman, Jessica Utts, Andrew Cohen, and Matthew Hayat write in a 2023 article in The American Statistician, “with technological advancement and the increase in availability of real-world datasets, it is necessary that instruction also integrate the ethical aspects around data sources, such as privacy, how the data were obtained and whether participants consent to the use of their data.”
Additional Resources
Annotated bibliography
Examples
Assessments
References
Footnotes
1. Reproducibility in analysis requires that all computational analyses, when repeated on the same data, produce consistent results.