
DL Seminar | Choices, Risks and Reward Reports


Combined reflections by Jacob Nadelman and Vilom Oza.




By Jacob Nadelman and Vilom Oza

Cornell Tech


In his talk “Choices, Risks and Reward Reports: Charting Public Policy for Reinforcement Learning Systems,” Tom Gilbert (pictured above) considers a fundamental question: what makes AI systems good? As he repeatedly emphasizes, being good does not necessarily mean that these systems are “fair, transparent, or explainable.” AI systems may have those traits, but as a means to an end rather than as goals in themselves. To describe a system as “good,” we need a way of documenting it that states its goal and the criteria by which to judge it. Gilbert therefore asks “what does it mean to document systems in a way that makes it possible to deliberate about what we want them to do, how we want them to behave, and how do we make them into the kind of system … operating in a way we substantively want.” With this in mind, he breaks the talk into three distinct parts.


1. Reinforcement Learning (RL) and how it differs from static Machine Learning (ML)


Central to understanding how AI systems can be evaluated is a particular subdomain called reinforcement learning (RL). RL is an emerging subfield within AI/ML that differs from traditional approaches in a few key ways. In particular, traditional ML subareas are static: unsupervised learning summarizes unlabeled data, while supervised learning predicts values from labeled data. These settings are fixed and can be thought of as “one-off.”


Reinforcement Learning differs from traditional Machine Learning in that it solves “sequential decision-making problems” using whatever observation techniques the designer specifies for it. Put another way, reinforcement learning “navigates” a dynamic environment by selecting actions that maximize potential future rewards. Additionally, in RL the agent collects its own data: you set up an environment, define a reward, and the agent learns which actions bring it closer to that reward.


Consider an example that makes this concrete. In your home, you set a thermostat and tell it how hot or cold you want your house to be. After taking a temperature measurement, the system uses a traditional ML algorithm to figure out what to do to reach the acceptable temperature range. This is called control feedback. In addition, the system begins a “trial and error” approach to understand the environment, taking a sequence of actions over time to see how much reward it can accumulate. This is behavioral feedback. For the thermostat example, Gilbert claims this would be akin to the device learning how quickly to adjust the temperature so that the room’s temperature remains steadier over time.
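The thermostat story can be sketched in code. The toy below is my own illustration, not from the talk: the environment, setpoint, and reward (negative distance from 20 degrees) are invented for the example. The `step` function is the control-feedback loop (an action changes the temperature), while the tabular Q-learning loop is the behavioral feedback: trial and error over many episodes to learn which action earns the most cumulative reward in each state.

```python
import random

random.seed(0)

# Toy thermostat environment (hypothetical numbers): state is the room
# temperature in whole degrees, actions nudge the heater, and reward is
# highest near the 20-degree setpoint.
SETPOINT = 20
ACTIONS = [-1, 0, +1]  # cool, hold, heat

def step(temp, action):
    """Control feedback: the action changes the temperature, and the
    room also drifts slightly toward the colder outside air."""
    new_temp = temp + action + random.choice([-1, 0])
    reward = -abs(new_temp - SETPOINT)  # closer to setpoint -> higher reward
    return new_temp, reward

# Behavioral feedback: tabular Q-learning, i.e. trial and error over
# many episodes to estimate the long-run value of each (state, action).
q = {}  # (temp, action) -> estimated future reward
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(2000):
    temp = random.randint(10, 30)
    for _ in range(50):
        if random.random() < epsilon:
            action = random.choice(ACTIONS)  # explore
        else:
            action = max(ACTIONS, key=lambda a: q.get((temp, a), 0.0))
        new_temp, reward = step(temp, action)
        best_next = max(q.get((new_temp, a), 0.0) for a in ACTIONS)
        old = q.get((temp, action), 0.0)
        q[(temp, action)] = old + alpha * (reward + gamma * best_next - old)
        temp = new_temp
```

After training, the learned policy heats when the room is below the setpoint and cools when it is above, without ever being told those rules explicitly.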






Finally, Gilbert introduces one more layer of complexity: exo-feedback. As the RL system adjusts temperature, the way other agents respond to the system changes. Perhaps our thermostat learns that it is far easier to maintain temperature if all members of the household are in one room. It could make one room pleasant and the rest far too cold, causing the family to all be in this one space. Under Gilbert’s reasoning, this amounts to a ‘shift’ in the actual dynamic of what it means to inhabit the house.




These shifts are fundamental to the overarching promise and problem of RL (and of the repeated use of traditional ML systems). The formulation and use of ML systems can reinforce or actually change dynamics over time, fundamentally altering what it means to interact in those environments. These changes can have quite pernicious effects on the user (think of how social media use has led users to focus on gathering likes rather than connecting with friends), and Gilbert next discusses strategies to deal with these risks.


2. Distinct Risks in the Formation of RL Systems

To mitigate these risks posed by AI systems, we first need to understand how they are designed. Gilbert focuses on four different design choices in particular.

1. Scoping the horizon

Determining the timescale and space of an agent’s environment has an enormous impact on its behavior. For example, for self-driving cars, one design choice concerns how far the car can see, i.e., what it can observe about the road.

2. Defining rewards

It is not straightforward to define rewards in any complex environment. As a result, designers have to define proxies that lead the agent to learn behaviors maximizing the desired output. However, in the real world, this can often result in unexpected and exploitative behavior, known as “reward hacking.”
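The gap between a proxy reward and the true goal can be shown in a few lines. The sketch below is a hypothetical illustration (the item names and scores are invented, not from the talk): a recommender rewarded on a proxy (clicks) rather than the true goal (user wellbeing) greedily serves the item that scores worst on the true objective.

```python
# Hypothetical catalog: each item has an expected click rate (the proxy
# the system is rewarded on) and a true-wellbeing score (the actual goal).
items = {
    # item:           (clicks_proxy, true_wellbeing)
    "outrage_post":   (0.9, -0.5),
    "friend_update":  (0.4,  0.6),
    "news_digest":    (0.5,  0.3),
}

# Greedily maximizing the proxy vs. maximizing the true objective.
proxy_choice = max(items, key=lambda i: items[i][0])
true_choice = max(items, key=lambda i: items[i][1])

print(proxy_choice)  # the proxy optimizer serves the outrage post
print(true_choice)   # the true objective would prefer the friend update
```

The proxy optimizer is working exactly as specified; the failure lies in the specification, which is why Gilbert treats reward definition as a design risk rather than a bug.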




3. Pruning information

A common practice in RL research is to change the environment to fit your needs. In the real world, modifying the environment changes the information flow from the environment to your agent. Doing so can dramatically alter what the reward function means for your agent and offload the risk to external systems.



4. Training multiple agents

Multi-agent training is one of the most rapidly growing and most discussed subdomains; however, little is known about how these learning systems will interact. For example, driving a self-driving car on a road predominantly occupied by human drivers may be very different from driving on a road where the majority of cars are self-driven.
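A tiny sketch can show why multi-agent learning is hard to reason about. This toy is my own illustration, not from the talk: in a matching-pennies game, two best-responding agents chase each other forever, so each agent's "environment" (the other agent) keeps shifting under it and learning never settles.

```python
# Matching pennies: the matcher wins by picking the same move as its
# opponent; the mismatcher wins by picking the opposite move.
def best_response_matcher(opp_action):
    return opp_action          # copy the opponent's last move

def best_response_mismatcher(opp_action):
    return 1 - opp_action      # do the opposite of the opponent's last move

# Both agents simultaneously best-respond to the other's previous move.
a, b = 0, 0
history = []
for _ in range(8):
    a, b = best_response_matcher(b), best_response_mismatcher(a)
    history.append((a, b))

print(history)  # the joint play cycles instead of converging
```

Each agent is individually rational at every step, yet the joint behavior cycles indefinitely: a simple case of the non-stationarity that makes multi-agent outcomes hard to predict.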

3. Reward Reports: Proposing a New Kind of AI Documentation

Each of the above key design choices comes with its set of risks. Listing the pitfalls of just one can be quite complicated: when it comes to social media, reward hacking is a major design risk. A user does not get onto social media with the intention of being engaged but rather to connect with friends or to network. However, due to reward hacking, it is common to see that social media usage leads to declining user wellbeing and increasing public distrust. Hence there is an increasing need to understand the dynamics of the system, what is at stake, and what it means to optimize the system. Given that these dynamics are widespread, it would be useful to have a summary report that can be applied to any system.


To achieve this goal, Gilbert proposes a Reward Report. The forms of documentation that currently exist only explain how accurate a model is, how (and by whom) the data was collected, or under what sets of features the model operates. They do not take into account the dynamic nature of models or the ex ante assumptions that can drastically alter their real-world effects. Consequently, Reward Reports will include domain assumptions, the horizon, and the specification components. In addition, they need to be regularly updated, both to ensure compliance with regulation and the designer’s intentions and to keep up with the dynamic nature of these models. Gilbert contends that these changes will ensure better monitoring as these systems become more widespread and further allow for administrative or legal oversight if needed.
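A Reward Report could even be made machine-readable. The sketch below is purely illustrative: the field names are my own, loosely drawn from the components mentioned above (domain assumptions, horizon, specification), not an official schema from Gilbert's proposal, and the thermostat entries are invented.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class RewardReport:
    """Minimal, hypothetical sketch of a Reward Report record."""
    system_name: str
    domain_assumptions: list      # what designers assume about the world
    horizon: str                  # the timescale/space the agent is scoped to
    reward_specification: str     # the optimized reward and its proxies
    known_risks: list             # e.g. reward hacking, exo-feedback
    revisions: list = field(default_factory=list)

    def update(self, note):
        # Reward Reports are living documents: log every revision.
        self.revisions.append((date.today(), note))

report = RewardReport(
    system_name="smart_thermostat",
    domain_assumptions=["occupants prefer ~20C", "sensor noise is small"],
    horizon="day-to-day temperature control within one household",
    reward_specification="negative distance from setpoint (proxy for comfort)",
    known_risks=["herding occupants into one room (exo-feedback)"],
)
report.update("Added exo-feedback risk after household-behavior review")
```

The revision log is the key design choice: unlike a one-time model card, the document is expected to change as the deployed system's dynamics change.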


Conclusion


Gilbert’s talk proposes some provocative ideas surrounding the implications of AI, how to understand these implications with Reinforcement Learning, and a mitigation strategy to address the risks. Should Reward Reports be adopted, legislators and oversight groups would have significantly more information and control to make informed decisions regarding the success or failure of an AI system to achieve its goal. It is only with this extreme level of vetting that we can say with any certainty that an AI system is ‘good.’