Jessie G Taft
- Mar 28, 2019

DL Seminar | Towards an Ecology of Care for Data-Driven Decision-Making

Updated: Apr 24, 2019

Debarun Dhar | MS Student, Cornell Tech

With the recent successes and democratization of modern machine learning and data analysis techniques, a natural question is how we might use these tools to improve decision making and outcomes. On March 14th, Angela Zhou, a Cornell Tech PhD student and DLI Doctoral Fellow, gave a talk at the Digital Life Seminar in which she provided an overview of data-driven decision making and the common pitfalls of applying “black-box” models to real-world observational data. This blog post briefly summarizes the main themes of the talk, with examples drawn from the challenging domain of healthcare.

While operations research methods and machine learning have been in use for quite some time, the abundance of rich data now available opens up new possibilities for what can be achieved with them. This has drawn the interest of institutional actors, whether in government or in hospitals, in using their data to inform and improve their processes, decisions and outcomes. However, Zhou argues that within the “ecology” of data-driven decision making there are a number of challenges which need to be addressed when working with observational data.

The first core set of problems arises from the nature of observational data itself. In medical research, randomized controlled trials (RCTs) are the gold standard for understanding the general effects of a particular drug or treatment on a given patient population. By design, these trials focus on outcomes for large groups, without accounting for each individual patient’s specific context. On the other hand, there has been a recent push for personalized medicine, which draws on the large amounts of observational data available in electronic health records. A fundamental problem is that, unlike in RCTs, the data in observational studies is extremely messy and there is no control over treatment assignment, which can lead to highly error-prone predictions and spurious correlations. Confounding, defined as the presence of alternative explanations for a given statistical observation, also plays a major role. Zhou illustrated this with a set of parallel clinical trials and observational studies conducted by the Women’s Health Initiative (WHI) to understand the effects of hormone replacement therapy on coronary heart disease prevention. The clinical trials had to be cut short because of an increase in the rate of heart attacks, cancer and stroke, whereas the observational studies suggested that the therapy had a protective effect. This disparity demonstrates the difficulty of extracting insights about causal effects from observational data.
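To make the role of confounding concrete, here is a minimal simulation sketch. It does not use WHI data: the hidden "healthy" variable, the uptake probabilities and the risk numbers are all invented for illustration, and the function name `simulate` is hypothetical. The treatment is built to be harmful (it adds 0.05 to everyone's risk), yet healthier people are assumed to choose it more often.

```python
import random


def simulate(n, randomized, seed=0):
    """Return (event rate among treated) minus (event rate among untreated)."""
    rng = random.Random(seed)
    treated_events = treated_n = 0
    control_events = control_n = 0
    for _ in range(n):
        # Hidden confounder: general health status (never recorded).
        healthy = rng.random() < 0.5
        if randomized:
            # RCT: a coin flip decides treatment, independent of health.
            treated = rng.random() < 0.5
        else:
            # Assumption: healthier people opt into treatment more often.
            treated = rng.random() < (0.8 if healthy else 0.2)
        # Assumed risks: 0.10 if healthy, 0.30 otherwise; treatment adds 0.05.
        risk = (0.10 if healthy else 0.30) + (0.05 if treated else 0.0)
        event = rng.random() < risk
        if treated:
            treated_n += 1
            treated_events += event
        else:
            control_n += 1
            control_events += event
    return treated_events / treated_n - control_events / control_n


observational = simulate(200_000, randomized=False)
trial = simulate(200_000, randomized=True)
print(f"observational risk difference: {observational:+.3f}")
print(f"randomized risk difference:    {trial:+.3f}")
```

Under these assumed numbers the expected observational difference is 0.19 - 0.26 = -0.07 (the treatment looks protective), while the expected randomized difference is +0.05 (the harm that was built in). The sign flips only because the treated group is healthier to begin with.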

Another issue with observational data stems from biases introduced by the specific context in which the data was collected. A case study illustrating this is described in the paper “Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission” by Caruana et al., where the researchers aimed to predict the probability of death for patients with pneumonia, with the goal of improving patient triage. When analysing the well-performing models, they found the surprising result that, according to the models, patients who had both pneumonia and asthma had a lower risk of death. This is misleading as a statement about asthma itself and was purely an artifact of the fact that these patients were admitted directly into the ICU and received aggressive care. This information was not accounted for in the dataset and so led to a form of “label leakage” which generated the anomalous result.
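The mechanism behind this result can be sketched with a toy simulation. All numbers below are invented for illustration and are not taken from Caruana et al.; the function name `observed_death_rates` is hypothetical. Asthma is assumed to double the untreated risk of death, but every asthma patient is sent to the ICU, and that admission is never written to the dataset.

```python
import random


def observed_death_rates(n=200_000, seed=1):
    """Return (death rate with asthma, death rate without asthma) as recorded."""
    rng = random.Random(seed)
    deaths = {True: 0, False: 0}
    counts = {True: 0, False: 0}
    for _ in range(n):
        asthma = rng.random() < 0.2
        # Assumed hospital policy: asthma patients go straight to the ICU.
        # This variable is NOT part of the recorded data.
        icu = asthma
        # Assumed untreated risk: asthma makes pneumonia more dangerous.
        base_risk = 0.20 if asthma else 0.10
        # Assumed effect of aggressive ICU care: risk drops to 30% of baseline.
        risk = base_risk * (0.3 if icu else 1.0)
        died = rng.random() < risk
        counts[asthma] += 1
        deaths[asthma] += died
    return deaths[True] / counts[True], deaths[False] / counts[False]


asthma_rate, no_asthma_rate = observed_death_rates()
print(f"recorded death rate with asthma:    {asthma_rate:.3f}")
print(f"recorded death rate without asthma: {no_asthma_rate:.3f}")
```

With these assumed numbers the expected recorded rates are 0.06 with asthma and 0.10 without, so any model fit to the recorded columns would learn that asthma lowers risk. A triage system that acted on that pattern would withdraw exactly the care that produced it.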

In addition to discussing the problems with observational data, Zhou stressed that a number of issues can arise during the modeling process itself. Many parametric models are built on strong assumptions which may not be valid for settings with observational data about people, such as when we want to predict a medical or social outcome. When these models are used as “black boxes” without any adjustments, their underlying assumptions are therefore frequently and systematically violated. Another strong assumption made by many methods which learn to personalize from observational data is that of “unconfoundedness”, i.e. that the effect of unmeasured confounders is not significant. For example, in the case of electronic health records, one might assume that as the data gets richer it will become possible to observe every factor which affects a given treatment. In practice, however, this is untrue, as there will always be some residual confounding due to human decision making.
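For readers who prefer a formal statement, unconfoundedness is commonly written in potential-outcomes notation. The notation here is an assumption of this summary, not taken from the talk: T is the treatment, X the recorded covariates, and Y(t) the outcome a patient would have under treatment t.

```latex
\[
  \bigl(Y(0),\, Y(1)\bigr) \;\perp\!\!\!\perp\; T \,\mid\, X
\]
```

In words: once the recorded covariates X are accounted for, treatment assignment carries no further information about the potential outcomes. Residual confounding means that some unrecorded factor U, such as a clinician's judgment, influences both T and the outcome, so the independence can hold given (X, U) while failing given X alone.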

By going over each of these challenges during her talk, Zhou provided a comprehensive summary of the common pitfalls which practitioners may encounter when applying data-driven methods to decision making. Even so, Zhou maintains that while these models may rest on wrong assumptions, they can still be useful, and that researchers should pay particular attention to understanding the prescriptive validity of the evidence they present. There is a lot of potential value in large-scale observational data, and more research is needed to unlock it.
