DL Seminar | The Unbearable Lightness of Teaching Responsible Data Science
Updated: Apr 24
Individual reflections by Wenbo Huang and Ruoxi Zhang.
By Wenbo Huang
The Unbearable Lightness of Teaching Responsible Data Science
Data reflects the world we live in. It captures and records every small step of society's progress. In the past, data scientists focused on data’s capability, and the accuracy and efficiency of the programs they developed became the key metrics of success. However, as capability increases, it becomes increasingly important for data scientists to think carefully about their responsibility when processing data. Irresponsible use of data can harm individuals, groups, and even society in various ways, such as increasing inequality and amplifying systemic racism. Therefore, data should be used in accordance with ethical and moral norms, and we should combine technology development with legal and policy considerations.
On Feb 25th, Julia Stoyanovich from the Center for Data Science at New York University shared her experience of developing the course Responsible Data Science, which she has taught since 2019. Unlike earlier pedagogical approaches in ethical data science courses, where students focus on texts, Dr. Stoyanovich embeds coding into the material. Students get hands-on opportunities to explore the whole data science lifecycle and then study how the social sciences interface with technology. For example, students go through the entire data analysis process, including data collection, cleaning, feature engineering, and other pre-processing steps, before studying the fairness of a classification method. The course is organized into four modules: Fairness, Data Science Lifecycle, Data Protection, and Transparency and Interpretability.
The first topic, fairness, tackles the issues arising from statistical and societal bias in a dataset. Data reflects the world, but it may be distorted by problems such as sampling errors. Even if the data genuinely reflects reality, it does not show people what the world could or should be. It is up to data scientists to determine whether the dataset reflects their expectations. Second, for the Data Science Lifecycle, Dr. Stoyanovich elaborates on three types of bias in computer systems: pre-existing bias originates in society and is independent of algorithms; technical bias is introduced or exacerbated by technical processes; and emergent bias arises from the context of use.
There are also three corresponding solutions: representation equity asks whether the data faithfully reflects the world, access equity concerns the information needed to study inequity, and outcome equity asks whether it is possible to assess and mitigate inequities outside the system’s direct control. The third topic, data protection and privacy, focuses on whether the data can provide sufficient representation and sufficient transparency without compromising individuals' privacy. The fourth topic covers transparency and interpretability. Automated decision algorithms are often black-box models, so it is important to interrogate and regulate these models to ensure that inequity does not arise from them. Across the four topics, we can see that all solutions are socio-technical, because some problems originate on the societal side and cannot be solved with technology alone. As data analysts, we are compelled to develop skills for operationalizing responsibility, and ethical and moral norms should be embedded into the development of algorithmic techniques.
I recently finished reading the book Weapons of Math Destruction by Cathy O’Neil, in which she illustrates how irresponsible big data algorithms increase inequality and threaten democracy. In the book, the author points out that mathematical algorithms that claim to quantify people’s important traits (e.g., credit score or risk of recidivism) may, because of their black-box nature, harm people more than they benefit them. For example, a model that measures an individual’s risk of recidivism uses the timing of the person’s first encounter with law enforcement as an input. However, because of racist policing practices (e.g., stop and frisk), Black people are likely to have their first encounter earlier than white people. Therefore, the output may also be biased, and a model designed to fight injustice produces the opposite result. This point is close to Dr. Stoyanovich’s view on transparency. Such algorithms are opaque, which means that people cannot understand how the model interprets its inputs and generates its outputs. Data quality also matters, because if bias is passed into the model, the output is likely to be biased as well, eventually leading to undesired results.
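The recidivism example can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the records are hand-made rather than real data, the field name `age_first_encounter` is hypothetical, and `toy_risk_score` is a stand-in formula, not the model discussed in the book. The two groups are built to differ only in how early the first recorded encounter happens, which is the effect that heavier policing would have.

```python
from statistics import mean

# Hypothetical, hand-made records: NOT real data.
# Both groups are meant to have the same underlying behaviour; the only
# difference is that group "B" is stopped more often, so its recorded
# age at first encounter is lower.
records = [
    {"group": "A", "age_first_encounter": 24},
    {"group": "A", "age_first_encounter": 27},
    {"group": "A", "age_first_encounter": 30},
    {"group": "B", "age_first_encounter": 16},
    {"group": "B", "age_first_encounter": 18},
    {"group": "B", "age_first_encounter": 20},
]


def toy_risk_score(age_first_encounter: int) -> float:
    """Toy score in [0, 1]: an earlier first encounter gives a higher score.

    Illustrative stand-in only, not any real recidivism model.
    """
    return max(0.0, min(1.0, (40 - age_first_encounter) / 40))


def mean_score_by_group(rows):
    """Average the toy score separately for each group."""
    scores = {}
    for row in rows:
        scores.setdefault(row["group"], []).append(
            toy_risk_score(row["age_first_encounter"])
        )
    return {group: mean(values) for group, values in scores.items()}


group_means = mean_score_by_group(records)
print(group_means)
```

Group B ends up with the higher average score even though the score never looks at group membership. The biased input feature carries the disparity into the output on its own.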
As Dr. Stoyanovich suggests, the lesson is that data should be used in accordance with ethical and moral norms. Technology development should not be separated from legal and policy considerations, because irresponsible use of data can harm people as much as, or more than, it benefits society.
By Ruoxi Zhang
The Unbearable Lightness of Teaching Responsible Data Science
On February 25th, 2021, the Cornell Digital Life Research Seminar welcomed Julia Stoyanovich to share her insights on teaching responsible data science. She is an Assistant Professor in the Department of Computer Science and Engineering at the Tandon School of Engineering and at the Center for Data Science. Her undergraduate degrees are in computer science, mathematics and statistics, and her master’s degree is in computer science with a focus on data management. This is a classical path for a computer scientist, and one on which it is fairly unlikely to study philosophy or ethics. However, through her work, the intersection of technology and its societal impacts drew her attention and prompted her to dive deeper into the ethical aspects of data science. Julia Stoyanovich pointed out that what the field most lacks and most needs is not clever algorithms or computational systems, but a structured methodology for teaching computer scientists, and people on their way to becoming computer scientists, how to navigate the current sociotechnical world. To meet that need, Julia Stoyanovich developed the Responsible Data Science course, which she has been teaching to technical students at New York University since 2019.
What is Responsible Data Science and why is it important?
Julia Stoyanovich started with a concrete example to illustrate the concept of responsible data science: automated hiring systems. Such a system acts like a funnel consisting of five stages. The first is sourcing: companies get candidates to apply by posting ads on various platforms. The second stage is screening: an automated system screens the candidates, and those who pass move on to an interview. The third stage, the interview, is also machine-assisted, potentially with some human evaluation on the other end. The fourth stage is background checks, which could rely on predictive analysis. For example, when you search Google for a name that sounds African American, an ad may pop up and ask whether you would like to see that person's criminal history. Once the background checks are passed, an offer is issued. In such an automated funnel, there are numerous opportunities for the system to treat applicants in a discriminatory way by race, gender, disability status and more.
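To see why a multi-stage funnel offers so many openings for unequal treatment, here is a minimal sketch that tracks two groups of applicants through the five stages. The group names and every count are made up for illustration; they are not figures from the talk. The sketch only computes per-stage pass rates and the overall offer rate.

```python
STAGES = ["sourcing", "screening", "interview", "background check", "offer"]

# Hypothetical counts of applicants remaining after each stage: NOT real data.
remaining = {
    "group_x": [1000, 500, 200, 100, 80],
    "group_y": [1000, 400, 120, 48, 36],
}


def pass_rates(counts):
    """Fraction of applicants who survive each transition between stages."""
    return [later / earlier for earlier, later in zip(counts, counts[1:])]


def overall_rate(counts):
    """Fraction of sourced applicants who end up with an offer."""
    return counts[-1] / counts[0]


for group, counts in remaining.items():
    # The first stage (sourcing) is the starting pool, so rates begin at screening.
    for stage, rate in zip(STAGES[1:], pass_rates(counts)):
        print(f"{group} {stage}: {rate:.2f}")
    print(f"{group} overall: {overall_rate(counts):.3f}")

# Ratio of overall offer rates between the two hypothetical groups.
offer_ratio = overall_rate(remaining["group_y"]) / overall_rate(remaining["group_x"])
print(f"offer-rate ratio (group_y / group_x): {offer_ratio:.2f}")
```

In this made-up data, no single stage looks dramatically different between the groups, yet the gaps multiply along the funnel, so the overall offer rates end up far apart. Auditing only one stage could miss that.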
Data science should take responsibility for the decisions it learns to make; the components of responsible data science are broken down in more detail in the sections that follow. Julia Stoyanovich commented that the main issue with such a system is that it takes things out of context and does not learn on its own. Therefore, for technical students who will go on to work for companies that produce similar systems or technologies, it is crucial to teach them about the potential harms such technology may bring to the world and to make them aware of the ethics and responsibilities involved.
RDS covers fairness, the data science lifecycle, data protection & privacy, and transparency and interpretability.
The RDS course is composed of the four modules listed above. The first one, algorithmic fairness, deals with cases where the data is distorted and does not represent the world correctly, or where the data does not represent how the world should be. In the latter case, the bias comes not from the statistical side but from the societal side, and the data scientist needs to adjust for it to make the result fair.
The data science lifecycle is used to observe and intervene on inequities. Specifically, there are three types of bias in computer systems: pre-existing, technical, and emergent. To address those biases, three types of equity are considered: representation equity, access equity, and outcome equity.
The main goal of data protection & privacy is to ensure high-quality surveillance and the privacy of individuals, with sufficient representation and public transparency of the technical system. Besides technology, legal frameworks such as Canada's Directive on Automated Decision-Making are also covered in RDS. Similarly, transparency and interpretability are used to ensure that no inequalities arise in various circumstances.
A quest for balance is made in the Responsible Data Science course.
The RDS course is meant to talk productively with technical students about the ethical aspects of data science without drifting away from technical training in data science skills. Just like traditional computer science courses, the Responsible Data Science course covers the pragmatic side of coding, with students working hands-on on practical projects. But before building anything, students go through reading and critical thinking to adjust the design of the technology they are about to build so that it reflects the world correctly.
The Responsible Data Science course is growing at NYU CDS.
The RDS course is open to undergraduate and graduate students who major in computer science or a related field. The course is designed for technical students who are in the beginning stages of their data science journey, so the prerequisites are set low. This is because it is ideal to get people thinking critically about data science before they become deeply invested in it. The biggest challenge in teaching this course is how to reconcile technical training with interdisciplinary ethical concerns.
Since 2019, the course has become more and more popular, and the class cohort is steadily growing. The course consists of lectures as well as labs in which students engage with technical material in Jupyter Notebooks in Python. Julia Stoyanovich stated that among the computer science courses she has taught over the past ten years, the RDS cohort is the most diverse demographically. Because the course is not required, this suggests that the topic itself is attractive to students.