What Is the Credit Assignment Problem?

Last updated: March 18, 2024

1. Overview

In this tutorial, we’ll discuss a classic problem in reinforcement learning: the credit assignment problem. We’ll present an example that demonstrates the problem.

Finally, we’ll highlight some approaches for solving it.

2. Basics of Reinforcement Learning

Reinforcement learning (RL) is a subfield of machine learning that focuses on how an agent can learn to make decisions in an environment in order to maximize a reward. It’s inspired by the way animals learn through trial and error. In short, RL aims to create intelligent agents that learn to achieve a goal by maximizing the cumulative reward.

In RL, an agent applies actions to an environment. Based on the action applied, the environment rewards the agent. After receiving the reward, the agent moves to a different state and repeats this process. The reward can be positive or negative, depending on the action the agent takes.

The goal of the agent in reinforcement learning is to learn an optimal policy that maximizes the overall reward over time. This is typically done through an iterative process: the agent interacts with the environment, learns from experience, and updates its policy to improve its decision-making.
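
To make the interaction loop concrete, here is a minimal sketch in Python. The environment object, its reset/step methods, and the random agent are illustrative assumptions rather than any specific library’s API:

```python
import random

class RandomAgent:
    """A placeholder agent that ignores the state and explores randomly."""

    def __init__(self, actions):
        self.actions = actions

    def act(self, state):
        # A real agent would consult a learned policy here.
        return random.choice(self.actions)

def run_episode(env, agent):
    """Run one agent-environment interaction loop and return the total reward."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = agent.act(state)
        # The environment returns the next state, a scalar reward, and a done flag.
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward
```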

3. Credit Assignment Problem

The credit assignment problem (CAP) is a fundamental challenge in reinforcement learning. It arises when an agent receives a reward and must determine which of its previous actions led to that reward.

In reinforcement learning, an agent applies a sequence of actions in an environment to maximize the overall reward. The agent updates its policy based on the feedback received from the environment, which typically takes the form of a scalar reward indicating the quality of the agent’s actions.

The credit assignment problem refers to the problem of measuring the influence of each action taken by an agent on future rewards. The core aim is to guide the agent toward the actions that maximize the reward.

However, in many cases, the reward signal from the environment doesn’t provide direct information about which specific actions the agent should continue or avoid. This can make it difficult for the agent to build an effective policy.

Additionally, there are situations where the agent takes a sequence of actions and the reward signal is only received at the end of the sequence. In these cases, the agent must determine which of its previous actions positively contributed to the final reward.

This can be difficult because the final reward may be the result of a long sequence of actions, so the impact of any particular action on the overall reward is hard to discern.

4. Example

Let’s take a practical example to demonstrate the credit assignment problem.

Suppose an agent is playing a game where it must navigate a maze to reach the goal state. We place the agent in the top-left corner of the maze and set the goal state in the bottom-right corner. The agent can move up, down, left, right, or diagonally. However, it can’t move through the states containing stones.

As the agent explores the maze, it receives a reward of +10 for reaching the goal state. Additionally, if it hits a stone, we penalize the action with a reward of -10. The goal of the agent is to learn from these rewards and build an optimal policy that maximizes the total reward over time.

The credit assignment problem arises when the agent reaches the goal after several steps. It receives a reward of +10 as soon as it reaches the goal state, but it’s not clear which actions are responsible for that reward. For example, the agent may have taken a long and winding path to reach the goal, and we need to determine which of those actions should receive credit for the reward.

Additionally, it’s challenging to decide whether to credit only the last action that took the agent to the goal or all the actions that led up to it. Consider the different paths that lead the agent to the goal state.

Suppose the agent can reach the goal state via three different paths. It’s challenging to measure the influence of each individual action; in this example, the best path to reach the goal state is path 1.

Hence, the positive impact of moving diagonally from state 1 to state 5 is higher than that of any other action from state 1. This is exactly what we want to measure, so that the agent can learn an optimal policy, such as following path 1 in this example.
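
As a rough illustration, the maze above can be modelled as a small grid world in which only the final transition into the goal carries a reward. The grid size, stone position, and class/method names below are illustrative choices; only the +10/-10 reward values come from the example:

```python
class GridMaze:
    """A toy maze: start at the top-left, goal at the bottom-right, stones block the way."""

    # 8 moves: up, down, left, right, and the four diagonals
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, -1), (-1, 1), (1, -1), (1, 1)]

    def __init__(self, rows=3, cols=3, stones=frozenset({(1, 1)})):
        self.rows, self.cols = rows, cols
        self.stones = stones
        self.start, self.goal = (0, 0), (rows - 1, cols - 1)

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, move):
        """Apply a move and return (next_state, reward, done)."""
        r, c = self.pos[0] + move[0], self.pos[1] + move[1]
        if not (0 <= r < self.rows and 0 <= c < self.cols):
            return self.pos, 0.0, False      # bumping into a wall: stay put, no reward
        if (r, c) in self.stones:
            return self.pos, -10.0, False    # hitting a stone is penalized
        self.pos = (r, c)
        if self.pos == self.goal:
            return self.pos, 10.0, True      # only the final step yields a positive reward
        return self.pos, 0.0, False          # intermediate steps give no signal
```

Because every intermediate step returns a reward of 0, the +10 received at the goal has to be propagated back over the whole path by the learning algorithm; that propagation is exactly what the credit assignment problem is about.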

5. Solutions

The credit assignment problem is a central challenge in reinforcement learning. Here we’ll present three popular approaches for addressing it: temporal difference (TD) learning, Monte Carlo methods, and eligibility traces.

TD learning is a popular class of RL algorithms that uses bootstrapping to assign credit to past actions. At each time step, it updates the value function based on the difference between its current prediction and the observed reward plus the predicted value of the next state. By bootstrapping from the predicted values of future states, TD learning can assign credit to past actions even when the reward is delayed.
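
Here is a minimal sketch of a tabular TD(0) update illustrating this idea; the step size alpha and discount gamma are arbitrary illustrative values:

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """Move V[state] toward the bootstrapped target: reward + gamma * V[next_state]."""
    td_error = reward + gamma * V.get(next_state, 0.0) - V.get(state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * td_error
    return td_error
```

Applied after every step, the update lets credit flow backwards one state at a time: once the value of the state just before the goal rises, the states preceding it start receiving credit on later episodes.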

Monte Carlo methods are a class of RL algorithms that use full episodes of experience to assign credit to past actions. These methods estimate the expected value of a state by averaging the returns obtained in the episodes that pass through that state. By averaging over several episodes, Monte Carlo methods can assign credit to the actions that led up to the reward, even if the reward is delayed.
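
Below is a sketch of first-visit Monte Carlo value estimation under the same tabular assumptions; here the credit for a delayed reward reaches every state visited in the episode at once, via the discounted return:

```python
from collections import defaultdict

def mc_update(V, counts, episode, gamma=0.99):
    """Update V from one complete episode given as a list of (state, reward) pairs."""
    G = 0.0
    returns_from = {}
    # Walk the episode backwards, accumulating the discounted return after each step.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        returns_from[state] = G            # ends up holding the first-visit return
    for state, ret in returns_from.items():
        counts[state] += 1
        V[state] += (ret - V[state]) / counts[state]   # incremental average of returns

# Example containers: every unseen state starts at value 0 with a count of 0.
V, counts = defaultdict(float), defaultdict(int)
```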

Eligibility traces are a method for assigning credit to past actions based on their recent history. Eligibility traces keep track of the recent history of state-action pairs and use a decaying weight to assign credit to each pair based on how recently it occurred. By decaying the weight of older state-action pairs, eligibility traces can assign credit to actions that led up to the reward, even if they occurred several steps earlier.
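
For concreteness, here is a sketch of accumulating eligibility traces in a TD(lambda)-style update, again with illustrative parameter values; each TD error is spread over all recently visited states in proportion to their decaying traces:

```python
def td_lambda_update(V, traces, state, reward, next_state,
                     alpha=0.1, gamma=0.99, lam=0.9):
    """One step of tabular TD(lambda) with accumulating eligibility traces."""
    td_error = reward + gamma * V.get(next_state, 0.0) - V.get(state, 0.0)
    traces[state] = traces.get(state, 0.0) + 1.0      # bump the trace of the current state
    for s in list(traces):
        V[s] = V.get(s, 0.0) + alpha * td_error * traces[s]
        traces[s] *= gamma * lam                       # older states receive less credit
        if traces[s] < 1e-8:
            del traces[s]                              # drop negligible traces
```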

6. Conclusion

In this tutorial, we discussed the credit assignment problem in reinforcement learning with an example. Finally, we presented three popular solutions that can solve the credit assignment problem.

What is the credit assignment problem?

In reinforcement learning (RL), the credit assignment problem (CAP) seems to be an important problem. What is the CAP? Why is it relevant to RL?

  • reinforcement-learning
  • definitions
  • credit-assignment-problem

In reinforcement learning (RL), an agent interacts with an environment in time steps. On each time step, the agent takes an action in a certain state, and the environment emits a percept, which is composed of a reward and an observation; in the case of fully observable MDPs, the observation is the next state (of the environment and the agent). The goal of the agent is to maximise the reward in the long run.

The (temporal) credit assignment problem (CAP) (discussed in Steps Toward Artificial Intelligence by Marvin Minsky in 1961) is the problem of determining the actions that lead to a certain outcome.

For example, in football, at each second, each football player takes an action. In this context, an action can be, e.g., "pass the ball", "dribble", "run" or "shoot the ball". At the end of the football match, the outcome can be a victory, a loss or a tie. After the match, the coach talks to the players, analyses the match and the performance of each player, and discusses the contribution of each player to the result of the match. The problem of determining the contribution of each player to the result of the match is the (temporal) credit assignment problem.

How is this related to RL? In order to maximise the reward in the long run, the agent needs to determine which actions will lead to such an outcome, which is essentially the temporal CAP.

Why is it called credit assignment problem? In this context, the word credit is a synonym for value. In RL, an action that leads to a higher final cumulative reward should have more value (so more "credit" should be assigned to it) than an action that leads to a lower final reward.

Why is the CAP relevant to RL? Most RL agents attempt to solve the CAP. For example, a $Q$-learning agent attempts to learn an (optimal) value function. To do so, it needs to determine the actions that will lead to the highest value in each state.
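
For concreteness, here is a minimal sketch of the tabular Q-learning update alluded to above; the dictionary-based table and the parameter values are illustrative assumptions:

```python
def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """Assign credit to (state, action) by bootstrapping from the best next action."""
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    # Actions that lead to higher long-run reward accumulate more "credit" (value).
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
```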

There are a few variations of the (temporal) CAP. For example, the structural CAP, that is, the problem of assigning credit to each structural component of the system (which might contribute to the final outcome).

MINI REVIEW article

Solving the credit assignment problem with the prefrontal cortex.

Alexandra Stolyarova*

  • Department of Psychology, University of California, Los Angeles, Los Angeles, CA, United States

In naturalistic multi-cue and multi-step learning tasks, where outcomes of behavior are delayed in time, discovering which choices are responsible for rewards can present a challenge, known as the credit assignment problem . In this review, I summarize recent work that highlighted a critical role for the prefrontal cortex (PFC) in assigning credit where it is due in tasks where only a few of the multitude of cues or choices are relevant to the final outcome of behavior. Collectively, these investigations have provided compelling support for specialized roles of the orbitofrontal (OFC), anterior cingulate (ACC), and dorsolateral prefrontal (dlPFC) cortices in contingent learning. However, recent work has similarly revealed shared contributions and emphasized rich and heterogeneous response properties of neurons in these brain regions. Such functional overlap is not surprising given the complexity of reciprocal projections spanning the PFC. In the concluding section, I overview the evidence suggesting that the OFC, ACC and dlPFC communicate extensively, sharing the information about presented options, executed decisions and received rewards, which enables them to assign credit for outcomes to choices on which they are contingent. This account suggests that lesion or inactivation/inhibition experiments targeting a localized PFC subregion will be insufficient to gain a fine-grained understanding of credit assignment during learning and instead poses refined questions for future research, shifting the focus from focal manipulations to experimental techniques targeting cortico-cortical projections.

Introduction

When an animal is introduced to an unfamiliar environment, it will explore the surroundings randomly until an unexpected reward is encountered. Reinforced by this experience, the animal will gradually learn to repeat those actions that produced the desired outcome. The work conducted in the past several decades has contributed a detailed understanding of the psychological and neural mechanisms that support such reinforcement-driven learning ( Schultz and Dickinson, 2000 ; Schultz, 2004 ; Niv, 2009 ). It is now broadly accepted that dopamine (DA) signaling conveys prediction errors, or the degree of surprise brought about by unexpected rewards, and interacts with cortical and basal ganglia circuits to selectively reinforce the advantageous choices ( Schultz, 1998a , b ; Schultz and Dickinson, 2000 ; Niv, 2009 ). Yet, in naturalistic settings, where rewards are delayed in time, and where multiple cues are encountered, or where several decisions are made before the outcomes of behavior are revealed, discovering which choices are responsible for rewards can present a challenge, known as the credit assignment problem ( Mackintosh, 1975 ; Rothkopf and Ballard, 2010 ).

In most everyday situations, the rewards are not immediate consequences of behavior, but instead appear after substantial delays. To influence future choices, the teaching signal conveyed by DA release needs to reinforce synaptic events occurring on a millisecond timescale, frequently seconds before the outcomes of decisions are revealed ( Izhikevich, 2007 ; Fisher et al., 2017 ). This apparent difficulty in linking preceding behaviors caused by transient neuronal activity to a delayed feedback has been termed the distal reward or temporal credit assignment problem ( Hull, 1943 ; Barto et al., 1983 ; Sutton and Barto, 1998 ; Dayan and Abbott, 2001 ; Wörgötter and Porr, 2005 ). Credit for the reward delayed by several seconds can frequently be assigned by establishing an eligibility trace, a molecular memory of the recent neuronal activity, allowing modification of synaptic connections that participated in the behavior ( Pan et al., 2005 ; Fisher et al., 2017 ). On longer timescales, or when multiple actions need to be performed sequentially to reach a final goal, intermediate steps themselves can acquire motivational significance and subsequently reinforce preceding decisions, such as in temporal-difference (TD) learning models ( Sutton and Barto, 1998 ).

Several excellent reviews have summarized the accumulated knowledge on mechanisms that link choices and their outcomes through time, highlighting the advantages of eligibility traces and TD models ( Wörgötter and Porr, 2005 ; Barto, 2007 ; Niv, 2009 ; Walsh and Anderson, 2014 ). Yet these solutions to the distal reward problem can impede learning in multi-choice tasks, or when an animal is presented with many irrelevant stimuli prior to or during the delay. Here, I only briefly overview the work on the distal reward problem to highlight potential complications that can arise in credit assignment based on eligibility traces when learning in multi-cue environments. Instead, I focus on the structural (or spatial ) credit assignment problem, requiring animals to select and learn about the most meaningful features in the environment and ignore irrelevant distractors. Collectively, the reviewed evidence highlights a critical role for the prefrontal cortex (PFC) in such contingent learning.

Recent studies have provided compelling support for specialized functions of the orbitofrontal (OFC) and dorsolateral prefrontal (dlPFC) cortices in credit assignment in multi-cue tasks, with fewer experiments targeting the anterior cingulate cortex (ACC). For example, it has been suggested that the dlPFC aids reinforcement-driven learning by directing attention to task-relevant cues ( Niv et al., 2015 ), the OFC assigns credit for rewards based on the causal relationship between trial outcomes and choices ( Jocham et al., 2016 ; Noonan et al., 2017 ), whereas the ACC contributes to unlearning of action-outcome associations when the rewards are available for free ( Jackson et al., 2016 ). However, this work has similarly revealed shared contributions and emphasized rich and heterogeneous response properties of neurons in the PFC, with different subregions monitoring and integrating the information about the task (i.e., current context, available options, anticipated rewards, as well as delay and effort costs) at variable times within a trial (upon stimulus presentation, action selection, outcome anticipation, and feedback monitoring; ex., Hunt et al., 2015 ; Khamassi et al., 2015 ). In the concluding section, I overview the evidence suggesting that contingent learning in multi-cue environments relies on dynamic cortico-cortical interactions during decision making and outcome valuation.

Solving the Temporal Credit Assignment Problem

When outcomes follow choices after short delays (Figure 1A ), the credit for distal rewards can frequently be assigned by establishing an eligibility trace, a sustained memory of the recent activity that renders synaptic connections malleable to modification over several seconds. Eligibility traces can persist as elevated levels of calcium in dendritic spines of post-synaptic neurons ( Kötter and Wickens, 1995 ) or as a sustained neuronal activity throughout the delay period ( Curtis and Lee, 2010 ) to allow for synaptic changes in response to reward signals. Furthermore, spike-timing dependent plasticity can be influenced by neuromodulator input ( Izhikevich, 2007 ; Abraham, 2008 ; Fisher et al., 2017 ). For example, the magnitude of short-term plasticity can be modulated by DA, acetylcholine and noradrenaline, which may even revert the sign of the synaptic change ( Matsuda et al., 2006 ; Izhikevich, 2007 ; Seol et al., 2007 ; Abraham, 2008 ; Zhang et al., 2009 ). Sustained neural activity has been observed in the PFC and striatum ( Jog et al., 1999 ; Pasupathy and Miller, 2005 ; Histed et al., 2009 ; Kim et al., 2009 , 2013 ; Seo et al., 2012 ; Her et al., 2016 ), as well as the sensory cortices after experience with consistent pairings between the stimuli and outcomes separated by predictable delays ( Shuler and Bear, 2006 ).

Figure 1 . Example tasks highlighting the challenge of credit assignment and learning strategies enabling animals to solve this problem. (A) An example of a distal reward task that can be successfully learned with eligibility traces and TD rules, where intermediate choices can acquire motivational significance and subsequently reinforce preceding decisions (ex., Pasupathy and Miller, 2005 ; Histed et al., 2009 ). (B) In this version of the task, multiple cues are present at the time of choice, only one of which is meaningful for obtaining rewards. After a brief presentation, the stimuli disappear, requiring an animal to solve a complex structural and temporal credit assignment problem (ex., Noonan et al., 2010 , 2017 ; Niv et al., 2015 ; Asaad et al., 2017 ; while the schematic of the task captures the challenge of credit assignment, note that in some experimental variants of the behavioral paradigm stimuli disappeared before an animal revealed its choice, whereas in others the cues remained on the screen until the trial outcome was revealed). Under such conditions, learning based on eligibility traces is suboptimal, as non-specific reward signals can reinforce visual cues that did not meaningfully contribute, but occurred close, to beneficial outcomes of behavior. (C) On reward tasks, similar to the one shown in (B) , the impact of previous decisions and associated rewards on current behavior can be assessed by performing regression analyses ( Jocham et al., 2016 ; Noonan et al., 2017 ). Here, the color of each cell in a matrix represents the magnitude of the effect of short-term choice and outcome histories, up to 4 trials into the past (red-strong influence; blue-weak influence on the current decision). Top: an animal learning based on the causal relationship between outcomes and choices (i.e., contingent learning). Middle: each choice is reinforced by a combined history of rewards (i.e., decisions are repeated if beneficial outcomes occur frequently). Bottom: the influence of recent rewards spreads to unrelated choices.
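
As a rough sketch of the kind of analysis summarized in Figure 1C (for a simplified two-option task), one can regress the current choice on choice-outcome interactions from the previous few trials; the variable names and the use of logistic regression below are illustrative assumptions, not the exact analysis used in the cited studies:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def history_regression(choices, rewards, n_back=4):
    """choices, rewards: equal-length sequences of 0/1 values, one entry per trial."""
    X, y = [], []
    for t in range(n_back, len(choices)):
        # One predictor per lag: was the choice made k trials ago rewarded?
        X.append([choices[t - k] * rewards[t - k] for k in range(1, n_back + 1)])
        y.append(choices[t])
    model = LogisticRegression().fit(np.array(X), np.array(y))
    # One weight per lag, analogous to a row of the matrices sketched in Figure 1C.
    return model.coef_.ravel()
```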

On extended timescales, when multiple actions need to be performed sequentially to reach a final goal, the distal reward problem can be solved by assigning motivational significance to intermediate choices that can subsequently reinforce preceding decisions, such as in TD learning models ( Montague et al., 1996 ; Sutton and Barto, 1998 ; Barto, 2007 ). Assigning values to these intervening steps according to expected future rewards makes it possible to break complex temporal credit assignment problems into smaller and easier tasks. There is ample evidence for TD learning in humans and other animals, which on the neural level is supported by transfer of DA responses from the time of reward delivery to preceding cues and actions ( Montague et al., 1996 ; Schultz, 1998a , b ; Walsh and Anderson, 2014 ).

Both TD learning and eligibility traces offer elegant solutions to the distal reward problem, and models based on cooperation between these two mechanisms can predict animal behavior as well as neuronal responses to rewards and predictive stimuli ( Pan et al., 2005 ; Bogacz et al., 2007 ). Yet assigning credit based on eligibility traces can be suboptimal when an animal interacts with many irrelevant stimuli prior to or during the delay (Figure 1B ). Under such conditions sensory areas remain responsive to distracting stimuli and the arrival of non-specific reward signals can reinforce intervening cues that did not meaningfully contribute, but occurred close, to the outcome of behavior ( FitzGerald et al., 2013 ; Xu, 2017 ).

The Role of the PFC in Structural Credit Assignment

Several recent studies have investigated the neural mechanisms of appropriate credit assignment in challenging tasks where only a few of the multitude of cues predict rewards reliably. Collectively, this work has provided compelling support for causal contributions of the PFC to structural credit assignment. For example, Asaad et al. (2017) examined the activity of neurons in monkey dlPFC while subjects were performing a delayed learning task. The arrangement of the stimuli varied randomly between trials and within each block either the spatial location or stimulus identity was relevant for solving the task. The monkeys' goal was to learn by trial-and-error to select one of the four options that led to rewards according to current rules. When stimulus identity was relevant for solving the task, neural activity in the dlPFC at the time of feedback reflected both the relevant cue (regardless of its spatial location) and the trial outcome, thus integrating the information necessary for credit assignment. Such responses were strategy-selective: these neurons did not encode cue identity at the time of feedback when it was not necessary for learning in the spatial location task, in which making a saccade to the same position on the screen was reinforced within a block of trials. Previous research has similarly indicated that neurons in the dlPFC respond selectively to behaviorally-relevant and attended stimuli ( Lebedev et al., 2004 ; Markowitz et al., 2015 ) and integrate information about prediction errors, choice values as well as outcome uncertainty prior to trial feedback ( Khamassi et al., 2015 ).

The activity within the dlPFC has been linked to structural credit assignment through selective attention and representational learning ( Niv et al., 2015 ). Under conditions of reward uncertainty and unknown relevant task features, human participants opt for computational efficiency and engage in a serial-hypothesis-testing strategy ( Wilson and Niv, 2011 ), selecting one cue and its anticipated outcome as the main focus of their behavior, and updating the expectations associated exclusively with that choice upon feedback receipt ( Akaishi et al., 2016 ). Niv and colleagues tested participants on a three-armed bandit task, where relevant stimulus dimensions (i.e., shape, color or texture) predicting outcome probabilities changed between blocks of trials ( Niv et al., 2015 ). In such a multidimensional environment, reinforcement-driven learning was aided by attentional control mechanisms that engaged the dlPFC, intraparietal cortex, and precuneus.

In many tasks, the credit for outcomes can be assigned according to different rules: based on the causal relationship between rewards and choices (i.e., contingent learning), their temporal proximity (i.e., when the reward is received shortly after a response), or their statistical relationship (when an action has been executed frequently before beneficial outcomes; Jocham et al., 2016 ; Figure 1C ). The analyses presented in papers discussed above did not allow for the dissociation between these alternative strategies of credit assignment. By testing human participants on a task with continuous stimulus presentation, instead of a typical trial-by-trial structure, Jocham et al. (2016) demonstrated that the tendency to repeat choices that were immediately followed by rewards and causal learning operate in parallel. In this experiment, activity within another subregion of the PFC, the OFC, was associated with contingent learning. Complementary work in monkeys revealed that the OFC contributes causally to credit assignment ( Noonan et al., 2010 ): animals with OFC lesions were unable to associate a reward with the choice on which it was contingent and instead relied on temporal and statistical learning rules. In another recent paper, Noonan and colleagues (2017) extended these observations to humans, demonstrating causal contributions of the OFC to credit assignment across species. The participants were tested on a three-choice probabilistic learning task. The three options were presented simultaneously and maintained on the screen until the outcome of a decision was revealed, thus requiring participants to ignore irrelevant distractors. Notably, only patients with lateral OFC lesions displayed any difficulty in learning the task, whereas damage to the medial OFC or dorsomedial PFC preserved contingent learning mechanisms. However, it is presently unknown whether lesions to the dlPFC or ACC affect such causal learning.

In another test of credit assignment in learning, contingency degradation, the subjects are required to track causal relationships between the stimuli or actions and rewards. During contingency degradation sessions, the animals are still reinforced for responses, but rewards are also available for free. After experiencing non-contingent rewards, control subjects reliably decrease their choices of the stimuli. However, lesions to both the ACC and OFC inhibit contingency degradation ( Jackson et al., 2016 ). Taken together, these observations demonstrate causal contributions of the PFC to appropriate credit assignment in multi-cue environments.

Cooperation Between PFC Subregions Supports Contingent Learning in Multi-Cue Tasks

Despite the segregation of temporal and structural aspects of credit assignment in earlier sections of this review, in naturalistic settings the brains frequently need to tackle both problems simultaneously. Here, I overview the evidence favoring a network perspective, suggesting that dynamic cortico-cortical interactions during decision making and outcome valuation enable adaptive solutions to complex spatio-temporal credit assignment problems. It has been previously suggested that feedback projections from cortical areas occupying higher levels of processing hierarchy, including the PFC, can aid in attribution of outcomes to individual decisions by implementing attention-gated reinforcement learning ( Roelfsema and van Ooyen, 2005 ). Similarly, recent theoretical work has shown that even complex multi-cue and multi-step problems can be solved by an extended cascade model of synaptic memory traces, in which the plasticity is modulated not only by the activity within a population of neurons, but also by feedback about executed decisions and resulting rewards ( Urbanczik and Senn, 2009 ; Friedrich et al., 2010 , 2011 ). Contingent learning, according to these models, can be supported by the communication between neurons encoding available options, committed choices and outcomes of behavior during decision making and feedback monitoring. For example, at the time of outcome valuation, information about recent choices can be maintained as a memory trace in the neuronal population involved in action selection or conveyed by an efference copy from an interconnected brain region ( Curtis and Lee, 2010 ; Khamassi et al., 2011 , 2015 ). Similarly, reinforcement feedback is likely communicated as a global reward signal (ex., DA release) as well as projections from neural populations engaged in performance monitoring, such as those within the ACC ( Friedrich et al., 2010 ; Khamassi et al., 2011 ). The complexity of reciprocal and recurrent projections spanning the PFC ( Barbas and Pandya, 1989 ; Felleman and Van Essen, 1991 ; Elston, 2000 ) may enable this network to implement such learning rules, integrating the information about the task, executed decisions and performance feedback.

In many everyday decisions, the options are compared across multiple features simultaneously (ex., by considering current context, needs, available reward types, as well as delay and effort costs). Neurons in different subregions of the PFC exhibit rich response properties, signaling these features of the task at various time epochs within a trial. For example, reward selectivity in response to predictive stimuli emerges earlier in the OFC and may then be passed to the dlPFC that encodes both the expected outcome and the upcoming choice ( Wallis and Miller, 2003 ). Similarly, on trials where options are compared based on delays to rewards, choices are dependent on interactions between the OFC and dlPFC ( Hunt et al., 2015 ). Conversely, when effort costs are more meaningful for decisions, it is the ACC that influences choice-related activity in the dlPFC ( Hunt et al., 2015 ). The OFC is required not only for the evaluation of stimuli, but also more complex abstract rules, based on rewards they predict ( Buckley et al., 2009 ). While both the OFC and dlPFC encode abstract strategies (ex., persisting with recent choices or shifting to a new response), such signals appear earlier in the OFC and may be subsequently conveyed to the dlPFC where they are combined with upcoming response (i.e., left vs. right saccade) encoding ( Tsujimoto et al., 2011 ). Therefore, the OFC may be the first PFC subregion to encode task rules and/or potential rewards predicted by sensory cues; via cortico-cortical projections, this information may be subsequently communicated to the dlPFC or ACC ( Kennerley et al., 2009 ; Hayden and Platt, 2010 ) to drive strategy-sensitive response planning.

The behavioral strategy that the animal follows is influenced by recent reward history ( Cohen et al., 2007 ; Pearson et al., 2009 ). If its choices are reinforced frequently, the animal will make similar decisions in the future (i.e., exploit its current knowledge). Conversely, unexpected omission of expected rewards can signal a need for novel behaviors (i.e., exploration). Neurons in the dlPFC carry representations of planned as well as previous choices, anticipate outcomes, and jointly encode the current decisions and their consequences following feedback ( Seo and Lee, 2007 ; Seo et al., 2007 ; Tsujimoto et al., 2009 ; Asaad et al., 2017 ). Similarly, the ACC tracks trial-by-trial outcomes of decisions ( Procyk et al., 2000 ; Shidara and Richmond, 2002 ; Amiez et al., 2006 ; Quilodran et al., 2008 ) as well as reward and choice history ( Seo and Lee, 2007 ; Kennerley et al., 2009 , 2011 ; Sul et al., 2010 ; Kawai et al., 2015 ) and signals errors in outcome prediction ( Kennerley et al., 2009 , 2011 ; Hayden et al., 2011 ; Monosov, 2017 ). At the time of feedback, neurons in the OFC encode committed choices, their values and contingent rewards ( Tsujimoto et al., 2009 ; Sul et al., 2010 ). Notably, while the OFC encodes the identity of expected outcomes and the value of the chosen option after the alternatives are presented to an animal, it does not appear to encode upcoming decisions ( Tremblay and Schultz, 1999 ; Wallis and Miller, 2003 ; Padoa-Schioppa and Assad, 2006 ; Sul et al., 2010 ; McDannald et al., 2014 ), therefore it might be that feedback projections from the dlPFC or ACC are required for such activity to emerge at the time of reward feedback.

To capture the interactions between PFC subregions in reinforcement-driven learning, Khamassi and colleagues have formulated a computational model in which action values are stored and updated in the ACC and then communicated to the dlPFC that decides which action to trigger ( Khamassi et al., 2011 , 2013 ). This model relies on meta-learning principles ( Doya, 2002 ), flexibly adjusting the exploration-exploitation parameter based on performance history and variability in the environment that are monitored by the ACC. The explore-exploit parameter then influences action-selection mechanisms in the dlPFC, prioritizing choice repetition once the rewarded actions are discovered and encouraging switching between different options when environmental conditions change. In addition to highlighting the dynamic interactions between the dlPFC and ACC in learning, the model similarly offers an elegant solution to the credit assignment problem by restricting value updating only to those actions that were selected on a given trial. This is implemented by requiring the prediction error signals in the ACC to coincide with a motor efference copy sent by the premotor cortex. The model is endowed with an ability to learn meta-values of novel objects in the environment based on the changes in the average reward that follow the presentation of such stimuli. While the authors proposed that such meta-value learning is implemented by the ACC, it is plausible that the OFC also plays a role in this process based on its contributions to stimulus-outcome and state learning ( Wilson et al., 2014 ; Zsuga et al., 2016 ). Intriguingly, this model could reproduce monkey behavior and neural responses on two tasks: four-choice deterministic and two-choice probabilistic paradigms, entailing a complex spatio-temporal credit assignment problem as the stimuli disappeared from the screen prior to action execution and outcome presentation ( Khamassi et al., 2011 , 2013 , 2015 ). Model-based analyses of neuronal responses further revealed that information about prediction errors, action values and outcome uncertainty is integrated both in the dlPFC and ACC, but at different timepoints: before trial feedback in the dlPFC and after feedback in the ACC ( Khamassi et al., 2015 ).

Collectively, these findings highlight the heterogeneity of responses in each PFC subregion that differ in temporal dynamics within a single trial and suggest that the cooperation between the OFC, ACC and dlPFC may support flexible, strategy- and context-dependent choices. This network perspective further suggests that individual PFC subregions may be less specialized in their functions than previously thought. For example, in primates both the ACC and dlPFC participate in decisions based on action values ( Hunt et al., 2015 ; Khamassi et al., 2015 ). And more recently, it has been demonstrated that the OFC is involved in updating action-outcome values as well ( Fiuzat et al., 2017 ). Analogously, while it has been proposed that the OFC is specialized for stimulus-outcome and ACC for action-outcome learning ( Rudebeck et al., 2008 ), lesions to the ACC have been similarly reported to impair stimulus-based reversal learning ( Chudasama et al., 2013 ), supporting shared contributions of the PFC subregions to adaptive behavior. Indeed, these brain regions communicate extensively, sharing the information about presented options, executed decisions and received rewards (Figure 2 ), which can enable them to assign credit for outcomes to choices on which they are contingent ( Urbanczik and Senn, 2009 ; Friedrich et al., 2010 , 2011 ). Attention-gated learning likely relies on the cooperation between PFC subregions as well: for example, coordinated and synchronized activity between the ACC and dlPFC aids in goal-directed attentional shifting and prioritization of task-relevant information ( Womelsdorf et al., 2014 ; Oemisch et al., 2015 ; Voloh et al., 2015 ).

Figure 2 . Cooperation between PFC subregions in multi-cue tasks. In many everyday decisions, the options are compared across multiple features simultaneously (ex., by considering current context, needs, available reward types, as well as delay and effort costs). Neurons in different subregions of the PFC exhibit rich response properties, integrating many aspects of the task at hand. The OFC, ACC and dlPFC communicate extensively, sharing the information about presented options, executed decisions and received rewards, which can enable them to assign credit for outcomes to choices on which they are contingent.

Functional connectivity within the PFC can support contingent learning on shorter timescales (ex., across trials within the same task), when complex rules or stimulus-action-outcome mappings are switching frequently ( Duff et al., 2011 ; Johnson et al., 2016 ). Under such conditions, the same stimuli can carry different meaning depending on task context or due to changes in the environment (ex., serial discrimination-reversal problems) and the PFC neurons with heterogeneous response properties may be better targets for modification, allowing the brain to exert flexible, rapid and context-sensitive control over behavior ( Asaad et al., 1998 ; Mansouri et al., 2006 ). Indeed, it has been shown that rule and reversal learning induce plasticity in OFC synapses onto the dorsomedial PFC (encompassing the ACC) in rats ( Johnson et al., 2016 ). When motivational significance of reward-predicting cues fluctuates frequently, neuronal responses and synaptic connections within the PFC tend to update more rapidly (i.e., across block of trials) compared to subcortical structures and other cortical regions ( Padoa-Schioppa and Assad, 2008 ; Morrison et al., 2011 ; Xie and Padoa-Schioppa, 2016 ; Fernández-Lamo et al., 2017 ; Saez et al., 2017 ). Similarly, neurons in the PFC promptly adapt their responses to incoming information based on the recent history of inputs ( Freedman et al., 2001 ; Meyers et al., 2012 ; Stokes et al., 2013 ). Critically, changes in the PFC activity closely track behavioral performance ( Mulder et al., 2003 ; Durstewitz et al., 2010 ), and interfering with neural plasticity within this brain area prevents normal responses to contingency degradation ( Swanson et al., 2015 ).

When the circumstances are stable overall and the same cues or actions remain reliable predictors of rewards, long-range connections between the PFC, association and sensory areas can support contingent learning on prolonged timescales. Neurons in the lateral intraparietal area demonstrate larger post-decisional responses and enhanced learning following choices that predict final outcomes of sequential behavior in a multi-step and -cue task ( Gersch et al., 2014 ). Such changes in neuronal activity likely rely on information about task rules conveyed by the PFC directly or via interactions with neuromodulatory systems. These hypotheses could be tested in future work.

In summary, dynamic interactions between subregions of the PFC can support contingent learning in multi-cue environments. Furthermore, via feedback projections, the PFC can guide plasticity in other cortical areas associated with sensory and motor processing ( Cohen et al., 2011 ). This account suggests that lesion experiments targeting a localized PFC subregion will be insufficient to gain fine-grained understanding of credit assignment during learning and instead poses refined questions for future research, shifting the focus from focal manipulations to experimental techniques targeting cortico-cortical projections. To gain novel insights into functional connectivity between PFC subregions, it will be critical to assess neural correlates of contingent learning in the OFC, ACC, and dlPFC simultaneously in the context of the same task. In humans, functional connectivity can be assessed by utilizing coherence, phase synchronization, Granger causality and Bayes network approaches ( Bastos and Schoffelen, 2016 ; Mill et al., 2017 ). Indeed, previous studies have linked individual differences in cortico-striatal functional connectivity to reinforcement-driven learning ( Horga et al., 2015 ; Kaiser et al., 2017 ) and future work could focus on examining cortico-cortical interactions in similar paradigms. To probe causal contributions of projections spanning the PFC, future research may benefit from designing multi-cue tasks for rodents and taking advantage of recently developed techniques (i.e., chemo- and opto-genetic targeting of projection neurons followed by silencing of axonal terminals to achieve pathway-specific inhibition; Deisseroth, 2010 ; Sternson and Roth, 2014 ) that afford increasingly precise manipulations of cortico-cortical connectivity. It should be noted, however, that most experiments to date have probed the contributions of the PFC to credit assignment in primates, and functional specialization across different subregions may be even less pronounced in mice and rats. Finally, as highlighted throughout this review, the recent progress in understanding the neural mechanisms of credit assignment has relied on introduction of more complex tasks, including multi-cue and probabilistic choice paradigms. While such tasks better mimic the naturalistic problems that the brains have evolved to solve, they also produce behavioral patterns that are more difficult to analyze and interpret ( Scholl and Klein-Flügge, 2017 ). As such, computational modeling of the behavior and neuronal activity may prove especially useful in future work on credit assignment.

Author Contributions

The author confirms being the sole contributor of this work and approved it for publication.

Funding

This work was supported by UCLA's Division of Life Sciences Recruitment and Retention fund (Izquierdo), as well as the UCLA Distinguished University Fellowship (Stolyarova).

Conflict of Interest Statement

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

The author thanks her mentor Dr. Alicia Izquierdo for helpful feedback and critiques on the manuscript, and Evan E. Hart, as well as the members of the Center for Brains, Minds and Machines and Lau lab for stimulating conversations on the topic.

Abraham, W. C. (2008). Metaplasticity: tuning synapses and networks for plasticity. Nat. Rev. Neurosci. 9:387. doi: 10.1038/nrn2356

Akaishi, R., Kolling, N., Brown, J. W., and Rushworth, M. (2016). Neural mechanisms of credit assignment in a multicue environment. J. Neurosci. 36, 1096–1112. doi: 10.1523/JNEUROSCI.3159-15.2016

Amiez, C., Joseph, J. P., and Procyk, E. (2006). Reward encoding in the monkey anterior cingulate cortex. Cereb. Cortex 16, 1040–1055. doi: 10.1093/cercor/bhj046

Asaad, W. F., Lauro, P. M., Perge, J. A., and Eskandar, E. N. (2017). Prefrontal neurons encode a solution to the credit assignment problem. J. Neurosci. 37, 6995–7007. doi: 10.1523/JNEUROSCI.3311-16.2017.

Asaad, W. F., Rainer, G., and Miller, E. K. (1998). Neural activity in the primate prefrontal cortex during associative learning. Neuron 21, 1399–1407. doi: 10.1016/S0896-6273(00)80658-3

Barbas, H., and Pandya, D. N. (1989). Architecture and intrinsic connections of the prefrontal cortex in the rhesus monkey. J. Comp. Neurol. 286, 353–375. doi: 10.1002/cne.902860306

Barto, A. G. (2007). Temporal difference learning. Scholarpedia J. 2:1604. doi: 10.4249/scholarpedia.1604

Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). “Neuronlike adaptive elements that can solve difficult learning control problems,” in IEEE Transactions on Systems, Man, and Cybernetics, SMC-13 , 834–846

Bastos, A. M., and Schoffelen, J. M. (2016). A tutorial review of functional connectivity analysis methods and their interpretational pitfalls. Front. Syst. Neurosci . 9:175. doi: 10.3389/fnsys.2015.00175

Bogacz, R., McClure, S. M., Li, J., Cohen, J. D., and Montague, P. R. (2007). Short-term memory traces for action bias in human reinforcement learning. Brain Res. 1153, 111–121. doi: 10.1016/j.brainres.2007.03.057

Buckley, M. J., Mansouri, F. A., Hoda, H., Mahboubi, M., Browning, P. G. F., Kwok, S. C., et al. (2009). Dissociable components of rule-guided behavior depend on distinct medial and prefrontal regions. Science 325, 52–58. doi: 10.1126/science.1172377

Chudasama, Y., Daniels, T. E., Gorrin, D. P., Rhodes, S. E., Rudebeck, P. H., and Murray, E. A. (2013). The role of the anterior cingulate cortex in choices based on reward value and reward contingency. Cereb Cortex 23, 2884–2898. doi: 10.1093/cercor/bhs266

Cohen, J. D., McClure, S. M., and Yu, A. J. (2007). Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration. Philos. Trans. R. Soc. Lond. Ser. B Biol. Sci. 362, 933–942. doi: 10.1098/rstb.2007.2098

Cohen, M. X., Wilmes, K., and Vijver, I. v. (2011). Cortical electrophysiological network dynamics of feedback learning. Trends Cogn. Sci. 15, 558–566. doi: 10.1016/j.tics.2011.10.004

Curtis, C. E., and Lee, D. (2010). Beyond working memory: the role of persistent activity in decision making. Trends Cogn. Sci. 14, 216–222. doi: 10.1016/j.tics.2010.03.006

Dayan, P., and Abbott, L. F. (2001). Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. Cambridge, MA: MIT Press.

Deisseroth, K. (2010). Optogenetics. Nat. Methods 8, 26–29. doi: 10.1038/nmeth.f.324

Doya, K. (2002). Metalearning and neuromodulation. Neural. Netw. 15, 495–506. doi: 10.1016/S0893-6080(02)00044-8

Duff, A., Sanchez Fibla, M., and Verschure, P. F. M. J. (2011). A biologically based model for the integration of sensory–motor contingencies in rules and plans: a prefrontal cortex based extension of the distributed adaptive control architecture. Brain Res. Bull. 85, 289–304. doi: 10.1016/j.brainresbull.2010.11.008

Durstewitz, D., Vittoz, N. M., Floresco, S. B., and Seamans, J. K. (2010). Abrupt transitions between prefrontal neural ensemble states accompany behavioral transitions during rule learning. Neuron 66, 438–448. doi: 10.1016/j.neuron.2010.03.029

Elston, G. N. (2000). Pyramidal cells of the frontal lobe: all the more spinous to think with. J. Neurosci. 20:RC95. Available online at: http://www.jneurosci.org/content/20/18/RC95.long

Felleman, D. J., and Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cereb. Cortex 1, 1–47. doi: 10.1093/cercor/1.1.1

Fernández-Lamo, I., Delgado-García, J. M., and Gruart, A. (2017). When and where learning is taking place: multisynaptic changes in strength during different behaviors related to the acquisition of an operant conditioning task by behaving rats. Cereb. Cortex 14, 1–13. doi: 10.1093/cercor/bhx011

Fisher, S. D., Robertson, P. B., Black, M. J., Redgrave, P., Sagar, M. A., Abraham, W. C., et al. (2017). Reinforcement determines the timing dependence of corticostriatal synaptic plasticity in vivo . Nat. Commun. 8:334. doi: 10.1038/s41467-017-00394-x

FitzGerald, T. H. B., Friston, K. J., and Dolan, R. J. (2013). Characterising reward outcome signals in sensory cortex. NeuroImage 83, 329–334. doi: 10.1016/j.neuroimage.2013.06.061

Fiuzat, E. C., Rhodes, S. E., and Murray, E. A. (2017). The role of orbitofrontal-amygdala interactions in updating action-outcome valuations in macaques. J. Neurosci. 37, 2463–2470. doi: 10.1523/JNEUROSCI.1839-16.2017

Freedman, D. J., Riesenhuber, M., Poggio, T., and Miller, E. K. (2001). Categorical representation of visual stimuli in the primate prefrontal cortex. Science 291, 312–316. doi: 10.1126/science.291.5502.312

Friedrich, J., Urbanczik, R., and Senn, W. (2010). Learning spike-based population codes by reward and population feedback. Neural. Comput. 22, 1698–1717. doi: 10.1162/neco.2010.05-09-1010

Friedrich, J., Urbanczik, R., and Senn, W. (2011). Spatio-temporal credit assignment in neuronal population learning. PLoS Comput. Biol. 7:e1002092. doi: 10.1371/journal.pcbi.1002092

Gersch, T. M., Foley, N. C., Eisenberg, I., and Gottlieb, J. (2014). Neural correlates of temporal credit assignment in the parietal lobe. PloS One , 9:e88725. doi: 10.1371/journal.pone.0088725

Hayden, B. Y., Heilbronner, S. R., Pearson, J. M., and Platt, M. L. (2011). Surprise signals in anterior cingulate cortex: neuronal encoding of unsigned reward prediction errors driving adjustment in behavior. J. Neurosci. 31, 4178–4187. doi: 10.1523/JNEUROSCI.4652-10.2011

Hayden, B. Y., and Platt, M. L. (2010). Neurons in anterior cingulate cortex multiplex information about reward and action. J. Neurosci. 30, 3339–3346. doi: 10.1523/JNEUROSCI.4874-09.2010

Her, E. S., Huh, N., Kim, J., and Jung, M. W. (2016). Neuronal activity in dorsomedial and dorsolateral striatum under the requirement for temporal credit assignment. Sci. Rep. 6:27056. doi: 10.1038/srep27056

Histed, M. H., Pasupathy, A., and Miller, E. K. (2009). Learning substrates in the primate prefrontal cortex and striatum: sustained activity related to successful actions. Neuron 63, 244–253. doi: 10.1016/j.neuron.2009.06.019

Horga, G., Maia, T. V., Marsh, R., Hao, X., Xu, D., Duan, Y., et al. (2015). Changes in corticostriatal connectivity during reinforcement learning in humans. Hum. Brain Mapp. 36, 793–803. doi: 10.1002/hbm.22665

Hull, C. (1943). Principles of Behavior . New York, NY: Appleton-Century-Crofts.

Hunt, L. T., Behrens, T. E. J., Hosokawa, T., Wallis, J. D., and Kennerley, S. W. (2015). Capturing the temporal evolution of choice across prefrontal cortex. eLife 4:e11945. doi: 10.7554/eLife.11945

Izhikevich, E. M. (2007). Solving the distal reward problem through linkage of STDP and dopamine signaling. Cereb. Cortex 17, 2443–2452. doi: 10.1093/cercor/bhl152

Jackson, S. A. W., Horst, N. K., Pears, A., Robbins, T. W., and Roberts, A. C. (2016). Role of the perigenual anterior cingulate and orbitofrontal cortex in contingency learning in the marmoset. Cereb. Cortex 26, 3273–3284. doi: 10.1093/cercor/bhw067

Jocham, G., Brodersen, K. H., Constantinescu, A. O., Kahn, M. C., Ianni, A. M., Walton, M. E., et al. (2016). Reward-guided learning with and without causal attribution. Neuron 90, 177–190. doi: 10.1016/j.neuron.2016.02.018

Jog, M. S., Kubota, Y., Connolly, C. I., Hillegaart, V., and Graybiel, A. M. (1999). Building neural representations of habits. Science 286, 1745–1749. doi: 10.1126/science.286.5445.1745

Johnson, C. M., Peckler, H., Tai, L. H., and Wilbrecht, L. (2016). Rule learning enhances structural plasticity of long-range axons in frontal cortex. Nat. Commun. 7:10785. doi: 10.1038/ncomms10785

Kaiser, R. H., Treadway, M. T., Wooten, D. W., Kumar, P., Goer, F., Murray, L., et al. (2017). Frontostriatal and dopamine markers of individual differences in reinforcement learning: a multi-modal investigation. Cereb. Cortex . doi: 10.1093/cercor/bhx281. [Epub ahead of print].



Citation: Stolyarova A (2018) Solving the Credit Assignment Problem With the Prefrontal Cortex. Front. Neurosci . 12:182. doi: 10.3389/fnins.2018.00182



What is the "credit assignment" problem in Machine Learning and Deep Learning?

I was watching a very interesting video in which Yoshua Bengio brainstorms with his students. In the video they seem to distinguish between "credit assignment", gradient descent, and back-propagation. From the conversation, it seems that the credit assignment problem is associated with backprop rather than with gradient descent, and I was trying to understand why. Perhaps what would help is a very clear definition of "credit assignment" (especially in the context of Deep Learning and Neural Networks).

What is "the credit assignment problem:?

And how is it related to training/learning and optimization in Machine Learning (especially in Deep Learning)?

From the discussion, I would have defined it as:

The function that computes the value(s) used to update the weights. How that value is used is the training algorithm, but credit assignment is the function that processes the weights (and perhaps something else) to produce the values that will later be used to update the weights.

That is how I currently understand it, but to my surprise I couldn't really find a clear definition on the internet. A more precise definition might be extractable from the various sources I found online:

What is the "credit assignment" problem in Machine Learning and Deep Learning? ( https://www.youtube.com/watch?v=g9V-MHxSCcs )

How Auto-Encoders Could Provide Credit Assignment in Deep Networks via Target Propagation ( https://arxiv.org/pdf/1407.7906.pdf )

Yoshua Bengio – Credit assignment: beyond backpropagation ( https://www.youtube.com/watch?v=z1GkvjCP7XA )

Learning to solve the credit assignment problem ( https://arxiv.org/pdf/1906.00889.pdf )

Also, according to a Quora question I found, the credit assignment problem is particular to reinforcement learning (RL). From listening to Yoshua Bengio's talks, that seems to be false. Can someone clarify how it differs specifically in the RL case?

Cross-posted:

https://forums.fast.ai/t/what-is-the-credit-assignment-problem-in-deep-learning/52363

https://www.reddit.com/r/MachineLearning/comments/cp54zi/what_is_the_credit_assignment_problem_in_machine/ ?

https://www.quora.com/What-means-credit-assignment-when-talking-about-learning-in-neural-networks

  • machine-learning
  • neural-networks


3 Answers

Perhaps this should be rephrased as "attribution", but in many RL models, the signal that comprises the reinforcement (e.g. the error in the reward prediction for TD) does not assign any single action "credit" for that reward. Was it the right context, but wrong decision? Or the wrong context, but correct decision? Which specific action in a temporal sequence was the right one?

Similarly, in NN, where you have hidden layers, the output does not specify what node or pixel or element or layer or operation improved the model, so you don't necessarily know what needs tuning -- for example, the detectors (pooling & reshaping, activation, etc.) or the weight assignment (part of back propagation). This is distinct from many supervised learning methods, especially tree-based methods, where each decision tells you exactly what lift was given to the distribution segregation (in classification, for example). Part of understanding the credit problem is explored in "explainable AI", where we are breaking down all of the outputs to determine how the final decision was made. This is by either logging and reviewing at various stages (tensorboard, loss function tracking, weight visualizations, layer unrolling, etc.), or by comparing/reducing to other methods (ODEs, Bayesian, GLRM, etc.).
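To make the shared-signal point concrete, here is a minimal TD(0) sketch in a toy tabular setting (my own illustration, not from any particular source): the whole update is driven by one scalar error saying "better or worse than expected", and nothing in that signal identifies which earlier choice or which component was responsible.

```python
import numpy as np

# Minimal TD(0) sketch on a toy 5-state chain (hypothetical numbers).
n_states = 5
V = np.zeros(n_states)           # value estimates
alpha, gamma = 0.1, 0.9          # learning rate, discount factor

def td_update(s, r, s_next):
    """One TD(0) step: a single scalar error drives the update."""
    delta = r + gamma * V[s_next] - V[s]   # reward-prediction error
    V[s] += alpha * delta                  # that one scalar is all the "credit" we get
    return delta

# A transition 2 -> 3 with reward 1.0: delta says the outcome was better than
# expected, but not *which* earlier decision or feature deserves the credit.
print(td_update(2, 1.0, 3))
```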

If this is the type of answer you're looking for, comment and I'll wrangle up some references.


  • Could you please add some references you consider significant within this context? Thank you very much! – Penelope Benenati, Jun 14, 2021 at 15:04

I haven't been able to find an explicit definition of CAP. However, we can read some academic articles and get a good sense of what's going on.

Jürgen Schmidhuber provides an indication of what CAP is in " Deep Learning in Neural Networks: An Overview ":

Which modifiable components of a learning system are responsible for its success or failure? What changes to them improve performance? This has been called the fundamental credit assignment problem (Minsky, 1963).

This is not a definition, but it is a strong indication that the credit assignment problem pertains to how a machine learning model interpreted the input to give a correct or incorrect prediction.

From this description, it is clear that the credit assignment problem is not unique to reinforcement learning because it is difficult to interpret the decision-making logic that caused any modern machine learning model (e.g. deep neural network, gradient boosted tree, SVM) to reach its conclusion. Why are CNNs for image classification more sensitive to imperceptible noise than to what humans perceive as the semantically obvious content of an image ? For a deep FFN, it's difficult to separate the effect of a single feature or interpret a structured decision from a neural network output because all input features participate in all parts of the network. For an RBF SVM, all features are used all at once. For a tree-based model, splits deep in the tree are conditional on splits earlier in the tree; boosted trees also depend on the results of all previous trees. In each case, this is an almost overwhelming amount of context to consider when attempting an interpretation of how the model works.

CAP is related to backprop in a very general sense because if we knew which neurons caused a good/bad decision, then we could leverage that information when making weight updates to the network.

However, we can distinguish CAP from backpropagation and gradient descent because using the gradient is just a specific choice of how to assign credit to neurons (follow the gradient backwards from the loss and update those parameters proportionally). Purely in the abstract, imagine that we had some magical alternative to gradient information and the backpropagation algorithm and we could use that information to adjust neural network parameters instead. This magical method would still have to solve the CAP in some way because we would want to update the model according to whether or not specific neurons or layers made good or bad decisions, and this method need not depend on the gradient (because it's magical -- I have no idea how it would work or what it would do).

Additionally, CAP is especially important to reinforcement learning because a good reinforcement learning method would have a strong understanding of how each action influences the outcome. For a game like chess, you only receive the win/loss signal at the end of the game, which implies that you need to understand how each move contributed to the outcome, both in a positive sense ("I won because I took a key piece on the fifth turn") and a negative sense ("I won because I spotted a trap and didn't lose a rook on the sixth turn"). Reasoning about the long chain of decisions that cause an outcome in a reinforcement learning setting is what Minsky 1963 is talking about: CAP is about understanding how each choice contributed to the outcome, and that's hard to understand in chess because during each turn you can take a large number of moves which can either contribute to winning or losing.

I no longer have access to a university library, but Schmidhuber's Minsky reference is a collected volume (Minsky, M. (1963). Steps toward artificial intelligence. In Feigenbaum, E. and Feldman, J., editors, Computers and Thought, pages 406–450. McGraw-Hill, New York.) which appears to reproduce an essay Minsky published earlier (1960) elsewhere (Proceedings of the IRE), under the same title. This essay includes a summary of "Learning Systems". From the context, he is clearly writing about what we now call reinforcement learning, and illustrates the problem with an example of a reinforcement learning problem from that era.

An important example of comparative failure in this credit-assignment matter is provided by the program of Friedberg [53], [54] to solve program-writing problems. The problem here is to write programs for a (simulated) very simple digital computer. A simple problem is assigned, e.g., "compute the AND of two bits in storage and put the result in an assigned location. "A generating device produces a random (64-instruction) program. The program is run and its success or failure is noted. The success information is used to reinforce individual instructions (in fixed locations) so that each success tends to increase the chance that the instructions of successful programs will appear in later trials. (We lack space for details of how this is done.) Thus the program tries to find "good" instructions, more or less independently, for each location in program memory. The machine did learn to solve some extremely simple problems. But it took of the order of 1000 times longer than pure chance would expect. In part I of [54], this failure is discussed and attributed in part to what we called (Section I-C) the "Mesa phenomenon." In changing just one instruction at a time, the machine had not taken large enough steps in its search through program space. The second paper goes on to discuss a sequence of modifications in the program generator and its reinforcement operators. With these, and with some "priming" (starting the machine off on the right track with some useful instructions), the system came to be only a little worse than chance. The authors of [54] conclude that with these improvements "the generally superior performance of those machines with a success-number reinforcement mechanism over those without does serve to indicate that such a mechanism can provide a basis for constructing a learning machine." I disagree with this conclusion. It seems to me that each of the "improvements" can be interpreted as serving only to increase the step size of the search, that is, the randomness of the mechanism; this helps to avoid the "mesa" phenomenon and thus approach chance behaviour. But it certainly does not show that the "learning mechanism" is working--one would want at least to see some better-than-chance results before arguing this point. The trouble, it seems, is with credit-assignment. The credit for a working program can only be assigned to functional groups of instructions, e.g., subroutines, and as these operate in hierarchies, we should not expect individual instruction reinforcement to work well. (See the introduction to [53] for a thoughtful discussion of the plausibility of the scheme.) It seems surprising that it was not recognized in [54] that the doubts raised earlier were probably justified. In the last section of [54], we see some real success obtained by breaking the problem into parts and solving them sequentially. This successful demonstration using division into subproblems does not use any reinforcement mechanism at all. Some experiments of similar nature are reported in [94].

(Minsky does not explicitly define CAP either.)

Hong Cheng's user avatar

An excerpt from Box 1 in the article "A deep learning framework for neuroscience", by Blake A. Richards et al. (among the authors is Yoshua Bengio):

The concept of credit assignment refers to the problem of determining how much ‘credit’ or ‘blame’ a given neuron or synapse should get for a given outcome. More specifically, it is a way of determining how each parameter in the system (for example, each synaptic weight) should change to ensure that $\Delta F \ge 0$. In its simplest form, the credit assignment problem refers to the difficulty of assigning credit in complex networks. Updating weights using the gradient of the objective function, $\nabla_W F(W)$, has proven to be an excellent means of solving the credit assignment problem in ANNs. A question that systems neuroscience faces is whether the brain also approximates something like gradient-based methods.

$F$ is the objective function, $W$ are the synaptic weights, and $\Delta F = F(W+\Delta W)-F(W)$.
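As a toy illustration of the quoted idea (my own sketch, not code from the Richards et al. paper): the gradient hands each weight a share of credit equal to its partial derivative of the objective, and a small enough step along it guarantees $\Delta F \ge 0$.

```python
import numpy as np

# Toy gradient-based credit assignment on a made-up linear model.
# F is an objective to *increase*; each weight's "credit" is dF/dW_i.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5])
W = np.zeros(3)

def F(W):
    return -np.sum((X @ W - y) ** 2)   # negative squared error

grad = -2 * X.T @ (X @ W - y)          # per-weight credit/blame
eta = 0.005                            # small step size
W_new = W + eta * grad
print(F(W), F(W_new))                  # F increases, i.e. Delta F >= 0
```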




Towards Practical Credit Assignment for Deep Reinforcement Learning


Credit assignment is a fundamental problem in reinforcement learning , the problem of measuring an action's influence on future rewards. Improvements in credit assignment methods have the potential to boost the performance of RL algorithms on many tasks, but thus far have not seen widespread adoption. Recently, a family of methods called Hindsight Credit Assignment (HCA) was proposed, which explicitly assign credit to actions in hindsight based on the probability of the action having led to an observed outcome. This approach is appealing as a means to more efficient data usage, but remains a largely theoretical idea applicable to a limited set of tabular RL tasks, and it is unclear how to extend HCA to Deep RL environments. In this work, we explore the use of HCA-style credit in a deep RL context. We first describe the limitations of existing HCA algorithms in deep RL, then propose several theoretically-justified modifications to overcome them. Based on this exploration, we present a new algorithm, Credit-Constrained Advantage Actor-Critic (C2A2C), which ignores policy updates for actions which don't affect future outcomes based on credit in hindsight, while updating the policy as normal for those that do. We find that C2A2C outperforms Advantage Actor-Critic (A2C) on the Arcade Learning Environment (ALE) benchmark, showing broad improvements over A2C and motivating further work on credit-constrained update rules for deep RL methods.


Vyacheslav Alipov

Riley Simmons-Edler

Nikita Putintsev

Pavel Kalinin


Dmitry Vetrov

Related Research

  • Hindsight Credit Assignment
  • From Credit Assignment to Entropy Regularization: Two New Algorithms for Neural Sequence Prediction
  • Learning Guidance Rewards with Trajectory-space Smoothing
  • Pairwise Weights for Temporal Credit Assignment
  • Counterfactual Credit Assignment in Model-Free Reinforcement Learning
  • Variance Reduced Advantage Estimation with δ Hindsight Credit Assignment
  • Direct Advantage Estimation



Reinforcement learning is a subfield of AI/statistics focused on exploring/understanding complicated environments and learning how to optimally acquire rewards. Examples are AlphaGo, clinical trials & A/B tests, and Atari game playing.

credit assignment problem

Can anyone explain what the term "credit assignment problem" means in the context of RL?

Here are some excerpts from books:

- "If γ is small, then an agent will only care about the rewards received in the current time step and just a few steps in the future. This effectively reduces the length of the RL problem to a few time steps and can drastically simplify the credit assignment problem."

- "Speeding up TD learning amounts to either speeding up the credit assignment process or shortening the trial-and-error process"



The Credit Assignment Problem

This post is eventually about partial agency . However, it's been a somewhat tricky point for me to convey; I take the long route. Epistemic status: slightly crazy.

I've occasionally said "Everything boils down to credit assignment problems."

What I really mean is that credit assignment pops up in a wide range of scenarios, and improvements to credit assignment algorithms have broad implications. For example:

  • When politics focuses on (re-)electing candidates based on their track records, it's about credit assignment. The practice is sometimes derogatorily called "finger pointing", but the basic computation makes sense: figure out good and bad qualities via previous performance, and vote accordingly.
  • When politics instead focuses on policy, it is still (to a degree) about credit assignment. Was raising the minimum wage responsible for reduced employment? Was it responsible for improved life outcomes? Etc.
  • Money acts as a kind of distributed credit-assignment algorithm, and questions of how to handle money, such as how to compensate employees, often involve credit assignment.
  • In particular, mechanism design (a subfield of economics and game theory) can often be thought of as a credit-assignment problem.
  • Both criminal law and civil law involve concepts of fault and compensation/retribution -- these at least resemble elements of a credit assignment process.
  • The distributed computation which determines social norms involves a heavy element of credit assignment: identifying failure states and success states, determining which actions are responsible for those states and who is responsible, assigning blame and praise.
  • Evolution can be thought of as a (relatively dumb) credit assignment algorithm.
  • Justice, fairness, contractualism, issues in utilitarianism.
  • Bayesian updates are a credit assignment algorithm, intended to make high-quality hypotheses rise to the top.
  • Beyond the basics of Bayesianism, building good theories realistically involves identifying which concepts are responsible for successes and failures . This is credit assignment.

Another big area which I'll claim is "basically credit assignment" is artificial intelligence.

In the 1970s, John Holland kicked off the investigation of learning classifier systems . John Holland had recently invented the Genetic Algorithms paradigm, which applies an evolutionary paradigm to optimization problems. Classifier systems were his attempt to apply this kind of "adaptive" paradigm (as in "complex adaptive systems") to cognition. Classifier systems added an economic metaphor to the evolutionary one; little bits of thought paid each other for services rendered. The hope was that a complex ecology+economy could develop, solving difficult problems.

One of the main design issues for classifier systems is the virtual economy -- that is, the credit assignment algorithm. An early proposal was the bucket-brigade algorithm. Money is given to cognitive procedures which produce good outputs. These procedures pass reward back to the procedures which activated them, who similarly pass reward back in turn. This way, the economy supports chains of useful procedures.
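To make the mechanism concrete, here is a toy rendering of a bucket-brigade-style payoff chain (a loose sketch of the idea, not Holland's actual classifier-system implementation; the rule names and constants are made up): each rule that fires pays a fraction of its strength back to the rule that activated it, and the external reward is paid to the last rule in the chain.

```python
# Toy bucket-brigade sketch: a fixed chain of rules fires each trial, external
# reward arrives only at the end, and each rule pays a "bid" to its activator.
strengths = {"r1": 10.0, "r2": 10.0, "r3": 10.0}   # hypothetical rule strengths
bid_fraction = 0.1

def run_chain(chain, external_reward):
    # Each rule pays a bid to the rule that activated it (its predecessor).
    for prev, curr in zip(chain, chain[1:]):
        bid = bid_fraction * strengths[curr]
        strengths[curr] -= bid
        strengths[prev] += bid
    # The final rule in the chain receives the environmental reward.
    strengths[chain[-1]] += external_reward

for _ in range(50):
    run_chain(["r1", "r2", "r3"], external_reward=1.0)
print(strengths)   # reward gradually propagates back along the chain
```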

Unfortunately, the bucket-brigade algorithm was vulnerable to parasites. Malign cognitive procedures could gain wealth by activating useful procedures without really contributing anything. This problem proved difficult to solve. Taking the economy analogy seriously, we might want cognitive procedures to decide intelligently who to pay for services. But, these are supposed to be itty bitty fragments of our thought process. Deciding how to pass along credit is a very complex task. Hence the need for a pre-specified solution such as bucket-brigade.

The difficulty of the credit assignment problem led to a split in the field. Kenneth de Jong and Stephanie Smith founded a new approach, "Pittsburgh style" classifier systems. John Holland's original vision became "Michigan style".

Pittsburgh style classifier systems evolve the entire set of rules, rather than trying to assign credit locally. A set of rules will stand or fall together, based on overall performance. This abandoned John Holland's original focus on online learning. Essentially, the Pittsburgh camp went back to plain genetic algorithms, albeit with a special representation.

(I've been having some disagreements with Ofer, in which Ofer suggests that genetic algorithms are relevant to my recent thoughts on partial agency, and I object on the grounds that the phenomena I'm interested in have to do with online learning, rather than offline. In my imagination, arguments between the Michigan and Pittsburgh camps would have similar content. I'd love to be a fly on the wall for those old debates, to see what they were really like.)

You can think of Pittsburgh-vs-Michigan much like raw Bayes updates vs belief propagation in Bayes nets. Raw Bayesian updates operate on whole hypotheses. Belief propagation instead makes a lot of little updates which spread around a network, resulting in computational efficiency at the expense of accuracy. Except Michigan-style systems didn't have the equivalent of belief propagation: bucket-brigade was a very poor approximation.

Ok. That was then, this is now. Everyone uses gradient descent these days. What's the point of bringing up a three-decade-old debate about obsolete paradigms in AI?

Let's get a little more clarity on the problem I'm trying to address.

What Is Credit Assignment?

I've said that classifier systems faced a credit assignment problem. What does that mean, exactly?

The definition I want to use for this essay is:

  • you're engaged in some sort of task;
  • you use some kind of strategy, which can be broken into interacting pieces (such as a set of rules, a set of people, a neural network, etc);
  • you receive some kind of feedback about how well you're doing (such as money, loss-function evaluations, or a reward signal);
  • you want to use that feedback to adjust your strategy.

So, credit assignment is the problem of turning feedback into strategy improvements.

Michigan-style systems tried to do this locally, meaning, individual itty-bitty pieces got positive/negative credit, which influenced their ability to participate, thus adjusting the strategy. Pittsburgh-style systems instead operated globally, forming conclusions about how the overall set of cognitive structures performed. Michigan-style systems are like organizations trying to optimize performance by promoting people who do well and giving them bonuses, firing the incompetent, etc. Pittsburgh-style systems are more like consumers selecting between whole corporations to give business to, so that ineffective corporations go out of business.

(Note that this is not the typical meaning of global-vs-local search that you'll find in an AI textbook.)

In practice, two big innovations made the Michigan/Pittsburgh debate obsolete: backprop, and Q-learning. Backprop turned global feedback into local, in a theoretically sound way. Q-learning provided a way to assign credit in online contexts. In the light of history, we could say that the Michigan/Pittsburgh distinction conflated local-vs-global with online-vs-offline. There's no necessary connection between those two; online learning is compatible with assignment of local credit.

I think people generally understand the contribution of backprop and its importance. Backprop is essentially the correct version of what bucket-brigade was overtly trying to do: pass credit back along chains. Bucket-brigade wasn't quite right in how it did this, but backprop corrects the problems.

So what's the importance of Q-learning? I want to discuss that in more detail.

The Conceptual Difficulty of 'Online Search'

In online learning, you are repeatedly producing outputs of some kind (call them "actions") while repeatedly getting feedback of some kind (call it "reward"). But, you don't know how to associate particular actions (or combinations of actions) with particular rewards. I might take the critical action at time 12, and not see the payoff until time 32.

In offline learning, you can solve this with a sledgehammer: you can take the total reward over everything, with one fixed internal architecture. You can try out different internal architectures and see how well each does.

Basically, in offline learning, you have a function you can optimize. In online learning, you don't.

Backprop is just a computationally efficient way to do hillclimbing search, where we repeatedly look for small steps which improve the overall fitness. But how do you do this if you don't have a fitness function? This is part of the gap between selection vs control : selection has access to an evaluation function; control processes do not have this luxury.

Q-learning and other reinforcement learning (RL) techniques provide a way to define the equivalent of a fitness function for online problems, so that you can learn.
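Concretely, Q-learning's stand-in for a fitness function is the bootstrapped target r + γ·max Q(s′, ·): each update moves one state-action value toward that target, so credit for a delayed reward seeps backwards through the table over repeated experience. A minimal tabular sketch (standard textbook form; the sizes and constants here are arbitrary):

```python
import numpy as np

# Minimal tabular Q-learning sketch (assumes small, discrete state/action spaces).
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1
rng = np.random.default_rng(0)

def act(s):
    # Epsilon-greedy action selection.
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def q_update(s, a, r, s_next):
    # Bootstrapped target: credit flows back one step per update, and over
    # many updates it propagates along whole trajectories.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```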

Models to the Rescue

So, how can we associate rewards with actions?

One approach is to use a model.

Consider the example of firing employees. A corporation gets some kind of feedback about how it is doing, such as overall profit. However, there's often a fairly detailed understanding of what's driving those figures:

  • Low profit won't just be a mysterious signal which must be interpreted; a company will be able to break this down into more specific issues such as low sales vs high production costs.
  • There's some understanding of product quality, and how that relates to sales. A company may have a good idea of which product-quality issues it needs to improve, if poor quality is impacting sales.
  • There's a fairly detailed understanding of the whole production line, including which factors may impact product quality or production expenses. If a company sees problems, it probably also has a pretty good idea of which areas they're coming from.
  • There are external factors, such as economic conditions, which may affect sales without indicating anything about the quality of the company's current strategy. Thus, our model may sometimes lead us to ignore feedback.

So, models allow us to interpret feedback signals, match these to specific aspects of our strategy, and adapt strategies accordingly.

Q-learning makes an assumption that the state is fully observable, amongst other assumptions.

Naturally, we would like to reduce the strengths of the assumptions we have to make as much as we can. One way is to look at increasingly rich model classes. AIXI uses all computable models. But maybe "all computable models" is still too restrictive; we'd like to get results without assuming a grain of truth . (That's why I am not really discussing Bayesian models much in this post; I don't want to assume a grain of truth.) So we back off even further, and use logical induction or InfraBayes. Ok, sure.

But wouldn't the best way be to try to learn without models at all? That way, we reduce our "modeling assumptions" to zero.

After all, there's something called "model-free learning", right?

Model-Free Learning Requires Models

How does model-free learning work? Well, often you work with a simulable environment, which means you can estimate the quality of a policy by running it many times, and use algorithms such as policy-gradient to learn. This is called "model free learning" because the learning part of the algorithm doesn't try to predict the consequences of actions; you're just learning which action to take. From our perspective here, though, this is 100% cheating; you can only learn because you have a good model of the environment.

Moreover, model-free learning typically works by splitting up tasks into episodes. An episode is a period of time for which we assume rewards are self-enclosed, such as a single playthru of an Atari game, a single game of Chess or Go, etc. This approach doesn't solve a detailed reward-matching problem, attributing reward to specific actions; instead it relies on a coarse reward-matching. Nonetheless, it's a rather strong assumption: an animal learning about an environment can't separate its experience into episodes which aren't related to each other. Clearly this is a "model" in the sense of a strong assumption about how specific reward signals are associated with actions.

Part of the problem is that most reinforcement learning (RL) researchers aren't really interested in getting past these limitations. Simulable environments offer the incredible advantage of being able to learn very fast, by simulating far more iterations than could take place in a real environment. And most tasks can be reasonably reduced to episodes.

However, this won't do as a model of intelligent agency in the wild. Neither evolution nor the free market divides things into episodes. (No, "one lifetime" isn't like "one episode" here -- that would only be the case if total reward due to actions taken in that lifetime could be calculated, EG, as total number of offspring. This would ignore inter-generational effects like parenting and grandparenting, which improve reproductive fitness of offspring at a cost in total offspring.)

What about more theoretical models of model-free intelligence?

Idealized Intelligence

AIXI is the gold-standard theoretical model of arbitrarily intelligent RL, but it's totally model-based. Is there a similar standard for model-free RL?

The paper Optimal Direct Policy Search by Glasmachers and Schmidhuber (henceforth, ODPS) aims to do for model-free learning what AIXI does for model-based learning. Where AIXI has to assume that there's a best computable model of the environment, ODPS instead assumes that there's a computable best policy. It searches through the policies without any model of the environment, or any planning.

I would argue that their algorithm is incredibly dumb, when compared to AIXI:

The basic simple idea of our algorithm is a nested loop that simultaneously makes the following quantities tend to infinity: the number of programs considered, the number of trials over which a policy is averaged, the time given to each program. At the same time, the fraction of trials spent on exploitation converges towards 1.

In other words, it tries each possible strategy, tries them for longer and longer, interleaved with using the strategy which worked best even longer than that.
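That schedule can be caricatured roughly like this (a deliberately simplified sketch of the interleaving just described, not the algorithm from the ODPS paper; `evaluate` is a placeholder for running a policy in the environment and returning its average reward):

```python
# Caricature of the interleaving described above -- NOT the actual ODPS algorithm.
def run(policies, evaluate):
    best, best_score = None, float("-inf")
    k = 1
    while True:                                    # quantities "tend to infinity"
        for policy in policies[:k]:                # consider more programs ...
            score = evaluate(policy, n_trials=k)   # ... averaged over more trials
            if score > best_score:
                best, best_score = policy, score
        evaluate(best, n_trials=k ** 3)            # exploit the incumbent even longer,
        k += 1                                     # so the exploitation fraction -> 1
```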

Basically, we're cutting things into episodes again, but we're making the episodes longer and longer, so that they have less and less to do with each other, even though they're not really disconnected. This only works because ODPS makes an ergodicity assumption: the environments are assumed to be POMDPs which eventually return to the same states over and over, which kind of gives us an "effective episode length" after which the environment basically forgets about what you did earlier.

In contrast, AIXI makes no ergodicity assumption.

So far, it seems like we either need (a) some assumption which allows us to match rewards to actions, such as an episodic assumption or ergodicity; or, (b) a more flexible model-learning approach, which separately learns a model and then applies the model to solve credit-assignment.

Is this a fundamental obstacle?

I think a better attempt is Schmidhuber's On Learning How to Learn Learning Strategies , in which a version of policy search is explored in which parts of the policy-search algorithm are considered part of the policy (ie, modified over time). Specifically, the policy controls the episode boundary; the system is supposed to learn how often to evaluate policies. When a policy is evaluated, its average reward is compared to the lifetime average reward. If it's worse, we roll back the changes and proceed starting with the earlier strategy.

(Let's pause for a moment and imagine an agent like this. If it goes through a rough period in life, its response is to get amnesia , rolling back all cognitive changes to a point before the rough period began.)

This approach doesn't require an episodic or ergodic environment. We don't need things to reliably return to specific repeatable experiments. Instead, it only requires that the environment rewards good policies reliably enough that those same policies can set a long enough evaluation window to survive.

The assumption seems pretty general, but certainly not necessary for rational agents to learn. There are some easy counterexamples where this system behaves abysmally. For example, we can take any environment and modify it by subtracting the time t from the reward, so that reward becomes more and more negative over time. Schmidhuber's agent becomes totally unable to learn in this setting. AIXI would have no problem.

Unlike the ODPS paper, I consider this to be progress on the AI credit assignment problem. Yet, the resulting agent still seems importantly less rational than model-based frameworks such as AIXI.

Actor-Critic

Let's go back to talking about things which RL practitioners might really use.

First, there are some forms of RL which don't require everything to be episodic.

One is actor-critic learning . The "actor" is the policy we are learning. The "critic" is a learned estimate of how good things are looking given the history. IE, we learn to estimate the expected value -- not just the next reward, but the total future discounted reward.

Unlike the reward, the expected value solves the credit assignment for us. Imagine we can see the "true" expected value. If we take an action and then the expected value increases, we know the action was good (in expectation). If we take an action and expected value decreases, we know it was bad (in expectation).

So, actor-critic works by (1) learning to estimate the expected value; (2) using the current estimated expected value to give feedback to learn a policy.
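A minimal one-step actor-critic sketch (a standard textbook pattern, written here with a tabular policy and value function as a simplifying assumption): the critic's TD error is the per-action credit signal the actor learns from.

```python
import numpy as np

# Minimal tabular one-step actor-critic sketch (toy sizes).
n_states, n_actions = 10, 4
V = np.zeros(n_states)                    # critic: expected value per state
prefs = np.zeros((n_states, n_actions))   # actor: action preferences (logits)
alpha_v, alpha_pi, gamma = 0.1, 0.1, 0.95

def policy(s):
    p = np.exp(prefs[s] - prefs[s].max())
    return p / p.sum()

def step_update(s, a, r, s_next):
    # (1) Critic: learn the expected value; its TD error is the credit signal.
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha_v * delta
    # (2) Actor: reinforce the taken action in proportion to that credit.
    p = policy(s)
    grad_log = -p                  # gradient of log-softmax w.r.t. preferences ...
    grad_log[a] += 1.0             # ... is one_hot(a) - p
    prefs[s] += alpha_pi * delta * grad_log
```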

What I want to point out here is that the critic still has "model" flavor. Actor-critic is called "model-free" because nothing is explicitly trained to anticipate the sensory observations, or the world-state. However, the critic is learning to predict; it's just that all we need to predict is expected value.

Policy Gradient

In the comments to the original version of this post, policy gradient methods were mentioned as a type of model-free learning which doesn't require any models even in this loose sense, IE, doesn't require simulable environments or episodes. I was surprised to hear that it doesn't require episodes. (Most descriptions of it do assume episodes, since practically speaking most people use episodes.) So are policy-gradient methods the true "model-free" credit assignment algorithm we seek?

As far as I understand, policy gradient works on two ideas:

  • Rather than correctly associating rewards with actions, we can associate a reward with all actions which came before it. Good actions will still come out on top in expectation . The estimate is just a whole lot noisier than it otherwise might be.
  • We don't really need a baseline to interpret reward against. I naively thought that when you see a sequence of rewards, you'd be in the dark about whether the sequence was "good" or "bad", so you wouldn't know how to generate a gradient. ("We earned 100K this quarter; should we punish or reward our CEO?") It turns out this isn't technically a show-stopper. Considering the actions actually taken, we move in their direction in proportion to the reward signal. ("Well, let's just give the CEO some fraction of the 100K; we don't know whether they deserve the bonus, but at least this way we're creating the right incentives.") This might end up reinforcing bad actions, but those tugs in different directions are just noise which should eventually cancel out. When they do, we're left with the signal: the gradient we wanted. So, once again, we see that this just introduces more noise without fundamentally compromising our ability to follow the gradient.

So one way to understand the policy-gradient theorem is: we can follow the gradient even when we can't calculate the gradient! Even when we sometimes get its direction totally turned around! We only need to ensure we follow it in expectation , which we can do without even knowing which pieces of feedback to think of as a good sign or a bad sign.
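For concreteness, here is the estimator in its most familiar form (a sketch of vanilla REINFORCE with rewards-to-go, written episodically for readability even though, as discussed above, the theorem itself doesn't require episodes; no baseline is subtracted):

```python
import numpy as np

# Sketch of a REINFORCE-style gradient estimate with rewards-to-go.
# log_prob_grads[t] is grad_theta log pi(a_t | s_t); rewards[t] is the reward at step t.
def reinforce_gradient(log_prob_grads, rewards, gamma=0.99):
    T = len(rewards)
    returns = np.zeros(T)
    G = 0.0
    for t in reversed(range(T)):          # discounted reward-to-go
        G = rewards[t] + gamma * G
        returns[t] = G
    grad = np.zeros_like(log_prob_grads[0])
    for t in range(T):
        # Every action is credited with *all* the reward that follows it:
        # noisy for any single trajectory, unbiased in expectation.
        grad += log_prob_grads[t] * returns[t]
    return grad                           # one Monte Carlo sample of the gradient
```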

RL people reading this might have a better description of policy-gradient; please let me know if I've said something incorrect.

Anyway, are we saved? Does this provide a truly assumption-free credit assignment algorithm?

It obviously assumes linear causality, with future actions never responsible for past rewards. I won't begrudge it that assumption.

Besides that, I'm somewhat uncertain. The explanations of the policy-gradient theorem I found don't focus on deriving it in the most general setting possible, so I'm left guessing which assumptions are essential. Again, RL people, please let me know if I say something wrong.

However, it looks to me like it's just as reliant on the ergodicity assumption as the ODPS thing we looked at earlier. For gradient estimates to average out and point us in the right direction, we need to get into the same situation over and over again.

I'm not saying real life isn't ergodic (quantum randomness suggests it is), but mixing times are so long that you'd reach the heat death of the universe by the time things converge (basically by definition). By that point, it doesn't matter.

I still want to know if there's something like "the AIXI of model-free learning"; something which appears as intelligent as AIXI, but not via explicit model-learning.

Where Updates Come From

Here begins the crazier part of this post. This is all intuitive/conjectural.

Claim: in order to learn, you need to obtain an "update"/"gradient", which is a direction (and magnitude) you can shift in which is more likely than not an improvement.

Claim: predictive learning gets gradients "for free" -- you know that you want to predict things as accurately as you can, so you move in the direction of whatever you see . With Bayesian methods, you increase the weight of hypotheses which would have predicted what you saw; with gradient-based methods, you get a gradient in the direction of what you saw (and away from what you didn't see).
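The Bayesian half of that claim fits in a few lines (a toy sketch with made-up numbers): the update is fully determined by what was observed, with no extra credit-assignment machinery needed.

```python
import numpy as np

# Toy Bayesian update: reweight hypotheses by how well they predicted the data.
priors = np.array([0.5, 0.3, 0.2])        # three hypothetical hypotheses
likelihoods = np.array([0.9, 0.5, 0.1])   # P(observation | hypothesis)
posterior = priors * likelihoods
posterior /= posterior.sum()              # hypotheses that predicted the observation
print(posterior)                          # gain weight "for free"
```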

Claim: if you're learning to act, you do not similarly get gradients "for free":

  • You don't know which actions, or sequences of actions, to assign blame/credit. This is unlike the prediction case, where we always know which predictions were wrong.
  • You don't know what the alternative feedback would have been if you'd done something different. You only get the feedback for the actions you chose. This is unlike the case for prediction, where we're rewarded for closeness to the truth. Changing outputs to be more like what was actually observed is axiomatically better, so we don't have to guess about the reward of alternative scenarios.
  • As a result, you don't know how to adjust your behavior based on the feedback received . Even if you can perfectly match actions to rewards, because we don't know what the alternative rewards would have been, we don't know what to learn: are actions like the one I took good, or bad?

(As discussed earlier, the policy gradient theorem does actually mitigate these three points, but apparently at the cost of an ergodicity assumption, plus much noisier gradient estimates.)

Claim: you have to get gradients from a source that already has gradients . Learning-to-act works by splitting up the task into (1) learning to anticipate expected value, and perhaps other things; (2) learning a good policy via the gradients we can get from (1).

What it means for a learning problem to "have gradients" is just that the feedback you get tells you how to learn. Predictive learning problems (supervised or unsupervised) have this; they can just move toward what's observed. Offline problems have this; you can define one big function which you're trying to optimize. Learning to act online doesn't have this, however, because it lacks counterfactuals.

The Gradient Gap

(I'm going to keep using the terms 'gradient' and 'update' in a more or less interchangeable way here; this is at a level of abstraction where there's not a big distinction.)

I'm going to call the "problem" the gradient gap. I want to call it a problem, even though we know how to "close the gap" via predictive learning (whether model-free or model-based). The issue with this solution is only that it doesn't feel elegant. It's weird that you have to run two different backprop updates (or whatever learning procedures you use); one for the predictive component, and another for the policy. It's weird that you can't "directly" use feedback to learn to act.

Why should we be interested in this "problem"? After all, this is a basic point in decision theory: to maximize utility under uncertainty, you need probability.

One part of it is that I want to scrap classical ("static") decision theory and move to a more learning-theoretic ("dynamic") view. In both AIXI and logical-induction based decision theories, we get a nice learning-theoretic foundation for the epistemics (solomonoff induction/logical induction), but, we tack on a non-learning decision-making unit on top. I have become skeptical of this approach. It puts the learning into a nice little box labeled "epistemics" and then tries to make a decision based on the uncertainty which comes out of the box. I think maybe we need to learn to act in a more fundamental fashion.

A symptom of this, I hypothesize, is that AIXI and logical induction DT don't have very good learning-theoretic properties. [AIXI's learning problems; LIDT's learning problems.] You can't say very much to recommend the policies they learn, except that they're optimal according to the beliefs of the epistemics box -- a fairly trivial statement, given that that's how you decide what action to take in the first place.

Now, in classical decision theory, there's a nice picture where the need for epistemics emerges nicely from the desire to maximize utility. The complete class theorem starts with radical uncertainty (ie, non-quantitative), and derives probabilities from a willingness to take pareto improvements. That's great! I can tell you why you should have beliefs, on pragmatic grounds! What we seem to have in machine learning is a less nice picture, in which we need epistemics in order to get off the ground, but can't justify the results without circular reliance on epistemics.

So the gap is a real issue -- it means that we can have nice learning theory when learning to predict, but we lack nice results when learning to act.

This is the basic problem of credit assignment. When evolving a complex system, you can't determine which parts to credit for success or failure (to decide what to tweak) without a model. But the model is bound to be a lot of the interesting part! So we run into big problems, because we need "interesting" computations in order to evaluate the pragmatic quality/value of computations, but we can't get interesting computations to get ourselves started, so we need to learn...

Essentially, we seem doomed to run on a stratified credit assignment system, where we have an "incorruptible" epistemic system (which we can learn because we get those gradients "for free"). We then use this to define gradients for the instrumental part.

A stratified system is dissatisfying, and impractical. First, we'd prefer a more unified view of learning. It's just kind of weird that we need the two parts. Second, there's an obstacle to pragmatic/practical considerations entering into epistemics. We need to focus on predicting important things; we need to control the amount of processing power spent; things in that vein. But (on the two-level view) we can't allow instrumental concerns to contaminate epistemics! We risk corruption! As we saw with bucket-brigade, it's easy for credit assignment systems to allow parasites which destroy learning.

A more unified credit assignment system would allow those things to be handled naturally, without splitting into two levels; as things stand, any involvement of pragmatic concerns in epistemics risks the viability of the whole system.

Tiling Concerns & Full Agency

From the perspective of full agency (ie, the negation of partial agency ), a system which needs a protected epistemic layer sounds suspiciously like a system that can't tile . You look at the world, and you say: "how can I maximize utility?" You look at your beliefs, and you say: "how can I maximize accuracy?" That's not a consequentialist agent; that's two different consequentialist agents! There can only be one king on the chessboard; you can only serve one master; etc.

If it turned out we really really need two-level systems to get full agency, this would be a pretty weird situation. "Agency" would seem to be only an illusion which can only be maintained by crippling agents and giving them a split-brain architecture where an instrumental task-monkey does all the important stuff while an epistemic overseer supervises. An agent which "breaks free" would then free itself of the structure which allowed it to be an agent in the first place.

On the other hand, from a partial-agency perspective, this kind of architecture could be perfectly natural. IE, if you have a learning scheme from which total agency doesn't naturally emerge, then there isn't any fundamental contradiction in setting up a system like this.

Part of the (potentially crazy) claim here is that having models always gives rise to some form of myopia. Even logical induction, which seems quite unrestrictive, makes LIDT fail problems such as ASP, making it myopic according to the second definition of my previous post. (We can patch this with LI policy selection, but for any particular version of policy selection, we can come up with decision problems for which it is "not updateless enough".) You could say it's myopic "across logical time", whatever that means.

If it were true that "learning always requires a model" (in the sense that learning-to-act always requires either learning-to-predict or hard-coded predictions), and if it were true that "models always give rise to some form of myopia", then this would confirm my conjecture in the previous post (that no learning scheme incentivises full agency).

This is all pretty out there; I'm not saying I believe this with high probability.

Evolution & Evolved Agents

Evolution is a counterexample to this view: evolution learns the policy "directly" in essentially the way I want. This is possible because evolution "gets the gradients for free" just like predictive learning does: the "gradient" here is just the actual reproductive success of each genome.

Unfortunately, we can't just copy this trick. Artificial evolution requires that we decide how to kill off / reproduce things, in the same way that animal breeding requires breeders to decide what they're optimizing for. This puts us back at square one; IE, needing to get our gradient from somewhere else.

Does this mean the "gradient gap" is a problem only for artificial intelligence, not for natural agents? No. If it's true that learning to act requires a 2-level system, then evolved agents would need a 2-level system in order to learn within their lifespan; they can't directly use the gradient from evolution, since it requires them to die.

Also, note that evolution seems myopic. (This seems complicated, so I don't want to get into pinning down exactly in which senses evolution is myopic here.) So, the case of evolution seems compatible with the idea that any gradients we can actually get are going to incentivize myopic solutions.

Similar comments apply to markets vs firms .

meta: I've really enjoyed these posts where you've been willing to publicly boggle at things and this is my favorite so far. I wanted to more than upvote because I know how hard these sorts of posts can be to write. Sometimes they're easy to write when they're generated by rant-mode, but then hard to post knowing that our rant-mode might wind up feeling attacked, and rant-mode is good and valuable and to be encouraged.

Yeah, this one was especially difficult in that way. I spent a long time trying to articulate the idea in a way that made any sense, and kept adding framing context to the beginning to make the stuff closer to what I wanted to say make more sense -- the idea that the post was about the credit assignment algorithm came very late in the process. I definitely agree that rant-mode feels very vulnerable to attack.

a system which needs a protected epistemic layer sounds suspiciously like a system that can't tile

I stand as a counterexample: I personally want my epistemic layer to have accurate beliefs—y'know, having read the sequences… :-P

I think of my epistemic system like I think of my pocket calculator: a tool I use to better achieve my goals. The tool doesn't need to share my goals.

The way I think about it is:

  • Early in training, the AGI is too stupid to formulate and execute a plan to hack into its epistemic level.
  • Late in training, we can hopefully get to the place where the AGI's values, like mine, involve a concept of "there is a real world independent of my beliefs", and its preferences involve the state of that world, and therefore "get accurate beliefs" becomes instrumentally useful and endorsed.
In between … well … in between, we're navigating treacherous waters …
Second, there's an obstacle to pragmatic/practical considerations entering into epistemics. We need to focus on predicting important things; we need to control the amount of processing power spent; things in that vein. But (on the two-level view) we can't allow instrumental concerns to contaminate epistemics! We risk corruption!

I mean, if the instrumental level has any way whatsoever to influence the epistemic level, it will be able to corrupt it with false beliefs if it's hell-bent on doing so, and if it's sufficiently intelligent and self-aware. But remember we're not protecting against a superintelligent adversary; we're just trying to "navigate the treacherous waters" I mentioned above. So the goal is to allow what instrumental influence we can on the epistemic system, while making it hard and complicated to outright corrupt the epistemic system. I think the things that human brains do for that are:

  • The instrumental level gets some influence over what to look at, where to go, what to read, who to talk to, etc.
  • There's a trick (involving acetylcholine) where the instrumental level has some influence over a multiplier on the epistemic level's gradients (a.k.a. learning rate). So the epistemic level always updates towards "more accurate predictions on this frame", but it updates infinitesimally in situations where prediction accuracy is instrumentally useless, and it updates strongly in situations where prediction accuracy is instrumentally important. (A toy sketch of this follows the list.)
  • There's a different mechanism that creates the same end result as #2: namely, the instrumental level has some influence over what memories get replayed more or less often.
  • For #2 and #3, the instrumental level has some influence but not complete influence. There are other hardcoded algorithms running in parallel and flagging certain things as important, and the instrumental level has no straightforward way to prevent that from happening. 
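To make #2 concrete, here's a toy sketch (the `importance` signal, the linear predictor, and all numbers are hypothetical): the epistemic update always points toward better prediction; the instrumental level only scales how big the step is.

```python
# Toy sketch of point #2: the epistemic update always points toward more accurate
# prediction, but the instrumental level only gets to scale the step size.
# `importance` is a hypothetical instrumental signal in [0, 1]; it cannot flip the
# sign of the update, so it can't directly install false beliefs this way.
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)                      # weights of a linear predictor (the "epistemic level")
base_lr = 0.1

def epistemic_update(w, x, y, importance):
    pred = w @ x
    grad = (pred - y) * x            # gradient of the squared prediction error
    lr = base_lr * importance        # instrumental level modulates the learning rate only
    return w - lr * grad

for _ in range(100):
    x = rng.normal(size=3)
    y = x @ np.array([1.0, -2.0, 0.5])            # the "true" signal to be predicted
    importance = 1.0 if abs(y) > 1.0 else 0.05    # hypothetical instrumental salience
    w = epistemic_update(w, x, y, importance)
```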

Right, I basically agree with this picture. I might revise it a little:

  • Early, the AGI is too dumb to hack its epistemics (provided we don't give it easy ways to do so!).
  • In the middle, there's a danger zone.
  • When the AGI is pretty smart, it sees why one should be cautious about such things, and it also sees why any modifications should probably be in pursuit of truthfulness (because true beliefs are a convergent instrumental goal) as opposed to other reasons.
  • When the AGI is really smart, it might see better ways of organizing itself (eg, specific ways to hack epistemics which really are for the best even though they insert false beliefs), but we're OK with that, because it's really freaking smart and it knows to be cautious and it still thinks this.
So the goal is to allow what instrumental influence we can on the epistemic system, while making it hard and complicated to outright corrupt the epistemic system.

One important point here is that the epistemic system probably knows what the instrumental system is up to . If so, this gives us an important lever. For example, in theory, a logical inductor can't be reliably fooled by an instrumental reasoner who uses it (so long as the hardware, including the input channels, don't get corrupted), because it would know about the plans and compensate for them.

So if we could get a strong guarantee that the epistemic system knows what the instrumental system is up to (like "the instrumental system is transparent to the epistemic system"), this would be helpful.

Shapley Values [thanks Zack for reminding me of the name] are akin to credit assignment: you have a bunch of agents coordinating to achieve something, and then you want to assign payouts fairly based on how much each contribution mattered to the final outcome.

And the way you do this is, for each agent you look at how good the outcome would have been if everybody except that agent had coordinated, and then you credit each agent proportionally to how much the overall performance would have fallen off without them.

So what about doing the same here- send rewards to each contributor proportional to how much they improved the actual group decision (assessed by rerunning it without them and seeing how performance declines)?

I can't for the life of me remember what this is called

Shapley value

(Best wishes, Less Wrong Reference Desk)

Yeah, it's definitely related. The main thing I want to point out is that Shapley values similarly require a model in order to calculate. So you have to distinguish between the problem of calculating a detailed distribution of credit and being able to assign credit "at all" -- in artificial neural networks, backprop is how you assign detailed credit, but a loss function is how you get a notion of credit at all. Hence, the question "where do gradients come from?" -- a reward function is like a pile of money made from a joint venture; but to apply backprop or Shapley value, you also need a model of counterfactual payoffs under a variety of circumstances. This is a problem, if you don't have a separate "epistemic" learning process to provide that model -- ie, it's a problem if you are trying to create one big learning algorithm that does everything.

Specifically, you don't automatically know how to

send rewards to each contributor proportional to how much they improved the actual group decision

because in the cases I'm interested in, ie online learning, you don't have the option of

rerunning it without them and seeing how performance declines

-- because you need a model in order to rerun.
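To make the "you need a model" point concrete, here's a toy Shapley-value computation; the hypothetical `coalition_value` function is exactly the model of counterfactual payoffs that online learning doesn't hand you.

```python
# Toy Shapley-value computation. The whole calculation leans on being able to call
# `coalition_value` on arbitrary subsets -- i.e. on having a model of counterfactual
# payoffs. In online learning you only observe the value of the one coalition
# (the one action history) that actually happened.
from itertools import combinations
from math import factorial

players = ["a", "b", "c"]

def coalition_value(coalition):          # hypothetical model of counterfactual payoffs
    synergy = {frozenset("ab"): 3.0}
    return len(coalition) + synergy.get(frozenset(coalition), 0.0)

def shapley(player):
    n = len(players)
    others = [p for p in players if p != player]
    total = 0.0
    for k in range(len(others) + 1):
        for subset in combinations(others, k):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            marginal = coalition_value(set(subset) | {player}) - coalition_value(set(subset))
            total += weight * marginal
    return total

print({p: round(shapley(p), 3) for p in players})
```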

But, also, I think there are further distinctions to make. I believe that if you tried to apply Shapley value to neural networks, it would go poorly; and presumably there should be a "philosophical" reason why this is the case (why Shapley value is solving a different problem than backprop). I don't know exactly what the relevant distinction is.

(Or maybe Shapley value works fine for NN learning; but, I'd be surprised.)

Removing things entirely seems extreme. How about having a continuous "contribution parameter," where running the algorithm without an element would correspond to turning this parameter down to zero, but you could also set the parameter to 0.5 if you wanted that element to have 50% of the influence it has right now. Then you can send rewards to elements if increasing their contribution parameter would improve the decision.

Removing things entirely seems extreme.

Dropout is a thing, though.

Dropout is like the converse of this - you use dropout to assess the elements that weren't dropped out. This promotes resiliency to perturbations in the model - whereas if you evaluate things by how bad it is to break them, you could promote fragile, interreliant collections of elements over resilient elements.

I think the root of the issue is that this Shapley value doesn't distinguish between something being bad to break, and something being good to have more of. If you removed all my blood I would die, but that doesn't mean that I would currently benefit from additional blood.

Anyhow, the joke was that as soon as you add a continuous parameter, you get gradient descent back again.
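Spelled out as a toy sketch (everything here is made up for illustration): once each element has a continuous contribution knob, crediting elements by asking whether turning their knob up would improve the decision is just a finite-difference estimate of the gradient.

```python
# Once every element has a continuous "contribution parameter", crediting an element
# by asking "would turning its knob up improve the decision?" is just estimating
# d(performance)/d(knob) -- i.e. gradient descent by another name.
import numpy as np

weights = np.array([1.0, 1.0, 1.0])          # contribution parameters, 1.0 = full influence

def performance(w):                          # hypothetical performance of the group decision
    contributions = w * np.array([2.0, -1.0, 0.5])
    return -(contributions.sum() - 1.0) ** 2

def credit(w, eps=1e-4):
    # finite-difference estimate of how much each element's knob matters
    return np.array([
        (performance(w + eps * np.eye(len(w))[i]) - performance(w)) / eps
        for i in range(len(w))
    ])

print(credit(weights))                       # ≈ the gradient of performance w.r.t. the knobs
```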

(Comment rewritten from scratch after comment editor glitched.)

This article is not about what I expected from the title. I've been thinking about "retroactively allocating responsibility", which sounds a lot like "assigning credit", in the context of multi-winner voting methods: which voters get credit for ensuring a given candidate won? The problem here is that in most cases no individual voter could change their vote to have any impact whatsoever on the outcome; in ML terms, this is a form of "vanishing gradient". The solution I've arrived at was suggested to me here ; essentially, it's imposing a (random) artificial order on the inputs, then using a model to reason about the "worst"-possible outcome after any subset of the inputs, and giving each of the inputs "credit" for the change in that "worst" outcome.

I'm not sure if this is relevant to the article as it stands, but it's certainly relevant to the title as it stands.

Yeah, this post is kind of a mix between (first) actually trying to explain/discuss the credit assignment problem, and (second) some far more hand-wavy thoughts relating to myopia / partial agency. I have a feeling that the post was curated based on the first part, whereas my primary goal in writing was actually the second part.

In terms of what I talked about (iirc -- not re-reading the post for the sake of this comment), it seems like you're approaching a problem of how to actually assign credit once you have a model. This is of course an essential part of the problem. I would ask questions like:

  • Does it encourage honesty?
  • Does it encourage strategically correcting things toward the best collective outcome (even if dishonestly)?

These are difficult problems. However, my post was more focused on the idea that you need a model at all in order to do credit assignment , which I feel is not obvious to many people.

(Note that we recently pushed a change that allows you to delete your own comments if they have no children, so if you'd like you can delete your earlier comment.)

Unfortunately, we can't just copy this trick. Artificial evolution requires that we decide how to kill off / reproduce things, in the same way that animal breeding requires breeders to decide what they're optimizing for. This puts us back at square one; IE, needing to get our gradient from somewhere else.
Suppose we have a good reward function (as is typically assumed in deep RL). We can just copy the trick in that setting, right? But the rest of the post makes it sound like you still think there's a problem, in that even with that reward, you don't know how to assign credit to each individual action. This is a problem that evolution also has; evolution seemed to manage it just fine.
(Similarly, even if you think actor-critic methods don't count, surely REINFORCE is one-level learning? It works okay; added bells and whistles like critics are improvements to its sample efficiency.)

Yeah, I pretty strongly think there's a problem -- not necessarily an insoluble problem, but, one which has not been convincingly solved by any algorithm which I've seen. I think presentations of ML often obscure the problem (because it's not that big a deal in practice -- you can often define good enough episode boundaries or whatnot).

  • Yeah, I feel like "matching rewards to actions is hard" is a pretty clear articulation of the problem.
  • I agree that it should be surprising, in some sense, that getting rewards isn't enough. That's why I wrote a post on it! But why do you think it should be enough? How do we "just copy the trick"??
  • I don't agree that this is analogous to the problem evolution has. If evolution just "received" the overall population each generation, and had to figure out which genomes were good/bad based on that, it would be a more analogous situation. However, that's not at all the case. Evolution "receives" a fairly rich vector of which genomes were better/worse, each generation. The analogous case for RL would be if you could output several actions each step, rather than just one, and receive feedback about each. But this is basically "access to counterfactuals"; to get this, you need a model.

No, definitely not, unless I'm missing something big.

From page 329 of this draft of Sutton & Barto :

Note that REINFORCE uses the complete return from time t, which includes all future rewards up until the end of the episode. In this sense REINFORCE is a Monte Carlo algorithm and is well defined only for the episodic case with all updates made in retrospect after the episode is completed (like the Monte Carlo algorithms in Chapter 5). This is shown explicitly in the box on the next page.

So, REINFORCE "solves" the assignment of rewards to actions via the blunt device of an episodic assumption; all rewards in an episode are grouped with all actions during that episode. If you expand the episode to infinity (so as to make no assumption about episode boundaries), then you just aren't learning. This means it's not applicable to the case of an intelligence wandering around and interacting dynamically with a world, where there's no particular bound on how the past may relate to present reward.

The "model" is thus extremely simple and hardwired, which makes it seem one-level. But you can't get away with this if you want to interact and learn on-line with a really complex environment.

Also, since the episodic assumption is a form of myopia, REINFORCE is compatible with the conjecture that any gradients we can actually construct are going to incentivize some form of myopia.
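For reference, a minimal sketch of that "blunt device" (an illustrative PyTorch rendering, not the textbook pseudocode): every action in the episode is credited with the episode's return from that point on, so credit is only ever resolved at episode boundaries.

```python
# Minimal episodic REINFORCE sketch: the return of the episode (from each timestep
# onward) is what credits every action in it. Without the episode boundary there is
# nothing tying rewards back to actions.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

def reinforce_update(observations, actions, rewards, gamma=0.99):
    # returns-to-go: all rewards until the end of the episode, discounted
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    obs = torch.as_tensor(observations, dtype=torch.float32)
    acts = torch.as_tensor(actions)
    logp = torch.log_softmax(policy(obs), dim=-1)[torch.arange(len(acts)), acts]
    loss = -(returns * logp).mean()          # every action is credited with its episode return
    opt.zero_grad()
    loss.backward()
    opt.step()
```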

Oh, I see. You could also have a version of REINFORCE that doesn't make the episodic assumption, where every time you get a reward, you take a policy gradient step for each of the actions taken so far, with a weight that decays as actions go further back in time. You can't prove anything interesting about this, but you also can't prove anything interesting about actor-critic methods that don't have episode boundaries, I think. Nonetheless, I'd expect it would somewhat work, in the same way that an actor-critic method would somewhat work. (I'm not sure which I expect to work better; partly it depends on the environment and the details of how you implement the actor-critic method.)

(All of this said with very weak confidence; I don't know much RL theory)

You could also have a version of REINFORCE that doesn't make the episodic assumption, where every time you get a reward, you take a policy gradient step for each of the actions taken so far, with a weight that decays as actions go further back in time. You can't prove anything interesting about this, but you also can't prove anything interesting about actor-critic methods that don't have episode boundaries, I think.

Yeah, you can do this. I expect actor-critic to work better, because your suggestion is essentially a fixed model which says that actions are more relevant to temporally closer rewards (and that this is the only factor to consider).
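A sketch of the variant under discussion (again purely illustrative): each incoming reward nudges all past actions, weighted by an exponential decay in how long ago they were taken; the decay schedule is the fixed, hard-wired "model" tying rewards to temporally nearby actions.

```python
# Sketch of the non-episodic variant: when a reward arrives, every past action gets a
# policy-gradient nudge, weighted by how long ago it was taken.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
history = []                                  # (observation, action) pairs, oldest first
decay = 0.95

def act(obs):
    obs_t = torch.as_tensor(obs, dtype=torch.float32)
    probs = torch.softmax(policy(obs_t), dim=-1)
    action = torch.multinomial(probs, 1).item()
    history.append((obs_t, action))
    return action

def on_reward(reward):
    # weight each past action by decay**age; older actions get exponentially less credit
    obs = torch.stack([o for o, _ in history])
    acts = torch.tensor([a for _, a in history])
    ages = torch.arange(len(history) - 1, -1, -1, dtype=torch.float32)
    weights = decay ** ages
    logp = torch.log_softmax(policy(obs), dim=-1)[torch.arange(len(acts)), acts]
    loss = -(reward * weights * logp).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```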

I'm not sure how to further convey my sense that this is all very interesting. My model is that you're like "ok sure" but don't really see why I'm going on about this.

Yeah, I think this is basically right. For the most part though, I'm trying to talk about things where I disagree with some (perceived) empirical claim, as opposed to the overall "but why even think about these things" -- I am not surprised when it is hard to convey why things are interesting in an explicit way before the research is done.

Here, I was commenting on the perceived claim of "you need to have two-level algorithms in order to learn at all; a one-level algorithm is qualitatively different and can never succeed", where my response is "but no, REINFORCE would do okay, though it might be more sample-inefficient". But it seems like you aren't claiming that, just claiming that two-level algorithms do quantitatively but not qualitatively better.

Actually, that wasn't what I was trying to say. But, now that I think about it, I think you're right.

I was thinking of the discounting variant of REINFORCE as having a fixed, but rather bad, model associating rewards with actions: rewards are tied more with actions nearby. So I was thinking of it as still two-level, just worse than actor-critic.

But, although the credit assignment will make mistakes (a predictable punishment which the agent can do nothing to avoid will nonetheless make any actions leading up to the punishment less likely in the future), they should average out in the long run (those 'wrongfully punished' actions should also be 'wrongfully rewarded'). So it isn't really right to think it strongly depends on the assumption.

Instead, it's better to think of it as a true discounting function. IE, it's not as assumption about the structure of consequences; it's an expression of how much the system cares about distant rewards when taking an action. Under this interpretation, REINFORCE indeed "closes the gradient gap" -- solves the credit assignment problem w/o restrictive modeling assumptions.

Maybe. It might also be argued that REINFORCE depends on some properties of the environment such as ergodicity. I'm not that familiar with the details.

But anyway, it now seems like a plausible counterexample.

I think I have juuust enough background to follow the broad strokes of this post, but not to quite grok the parts I think Abram was most interested in. 

It definitely caused me to think about credit assignment. I actually ended up thinking about it largely through the lens of Moral Mazes (where challenges of credit assignment combine with other forces to create a really bad environment). Re-reading this post, while I don't quite follow everything, I do successfully get a taste of how credit assignment fits into a bunch of different domains.

For the "myopia/partial-agency" aspects of the post, I'm curious how Abram's thinking has changed. This post AFAICT was a sort of "rant". A year after the fact, did the ideas here feel like they panned out?

It does seem like someone should someday write a post about credit assignment that's a bit more general. 

Most of my points from my curation notice still hold. And two years later, I am still thinking a lot about credit assignment as a perspective on many problems I am thinking about. 

This seems like one I would significantly re-write for the book if it made it that far. I feel like it got nominated for the introductory material, which I wrote quickly in order to get to the "main point" (the gradient gap). A better version would have discussed credit assignment algorithms more.

Promoted to curated: It's been a while since this post has come out, but I've been thinking of the "credit assignment" abstraction a lot since then, and found it quite useful. I also really like the way the post made me curious about a lot of different aspects of the world, and I liked the way it invited me to boggle at the world together with you. 

I also really appreciated your long responses to questions in the comments, which clarified a lot of things for me. 

One thing comes to mind for maybe improving the post, though I think that's mostly a difference of competing audiences: 

I think some sections of the post end up referencing a lot of really high-level concepts, in a way that I think is valuable as a reference, but also in a way that might cause a lot of people to bounce off of it (even people with a pretty strong AI Alignment background). I can imagine a post that includes very short explanations of those concepts, or moves them into a context where they are more clearly marked as optional (since I think the post stands well without at least some of those high-level concepts)

From the perspective of full agency (ie, the negation of partial agency), a system which needs a protected epistemic layer sounds suspiciously like a system that can't tile. You look at the world, and you say: "how can I maximize utility?" You look at your beliefs, and you say: "how can I maximize accuracy?" That's not a consequentialist agent; that's two different consequentialist agents!

For reinforcement learning with incomplete/fuzzy hypotheses, this separation doesn't exist, because the update rule for fuzzy beliefs depends on the utility function and in some sense even on the actual policy.

How does that work?

Actually I was somewhat confused about what the right update rule for fuzzy beliefs is when I wrote that comment. But I think I got it figured out now.

First, background about fuzzy beliefs:

Let \(E\) be the space of environments (defined as the space of instrumental states in Definition 9 here). A fuzzy belief is a concave function \(\phi : E \to [0,1]\) s.t. \(\sup \phi = 1\). We can think of it as the membership function of a fuzzy set. For an incomplete model \(\Phi \subseteq E\), the corresponding \(\phi\) is the concave hull of the characteristic function of \(\Phi\) (i.e. the minimal concave \(\phi\) s.t. \(\phi \geq \chi_{\Phi}\)).

Let \(\gamma\) be the geometric discount parameter and \(U(\gamma) := (1-\gamma)\sum_{n=0}^{\infty} \gamma^{n} r_{n}\) be the utility function. Given a policy \(\pi\) (EDIT: in general, we allow our policies to explicitly depend on \(\gamma\)), the value of \(\pi\) at \(\phi\) is defined by

\[ V^{\pi}(\phi,\gamma) := 1 + \inf_{\mu \in E}\left( \mathbb{E}^{\pi}_{\mu}[U(\gamma)] - \phi(\mu) \right) \]

The optimal policy and the optimal value for \(\phi\) are defined by

\[ \pi^{*}_{\phi,\gamma} := \arg\max_{\pi} V^{\pi}(\phi,\gamma), \qquad V(\phi,\gamma) := \max_{\pi} V^{\pi}(\phi,\gamma) \]

Given a policy \(\pi\), the regret of \(\pi\) at \(\phi\) is defined by

\[ \mathrm{Rg}^{\pi}(\phi,\gamma) := V(\phi,\gamma) - V^{\pi}(\phi,\gamma) \]

\(\pi\) is said to learn \(\phi\) when it is asymptotically optimal for \(\phi\) as \(\gamma \to 1\), that is

\[ \lim_{\gamma \to 1} \mathrm{Rg}^{\pi}(\phi,\gamma) = 0 \]

Given \(\zeta\), a probability measure over the space of fuzzy hypotheses, the Bayesian regret of \(\pi\) at \(\zeta\) is defined by

\[ \mathrm{BRg}^{\pi}(\zeta,\gamma) := \mathbb{E}_{\phi \sim \zeta}\left[ \mathrm{Rg}^{\pi}(\phi,\gamma) \right] \]

\(\pi\) is said to learn \(\zeta\) when

\[ \lim_{\gamma \to 1} \mathrm{BRg}^{\pi}(\zeta,\gamma) = 0 \]

If such a \(\pi\) exists, \(\zeta\) is said to be learnable. Analogously to Bayesian RL, \(\zeta\) is learnable if and only if it is learned by a specific policy \(\pi^{*}_{\zeta}\) (the Bayes-optimal policy). To define it, we define the fuzzy belief \(\phi_{\zeta}\) by

\[ \phi_{\zeta}(\mu) := \sup_{(\sigma : \operatorname{supp}\zeta \to E)\,:\, \mathbb{E}_{\phi \sim \zeta}[\sigma(\phi)] = \mu} \mathbb{E}_{\phi \sim \zeta}\left[ \phi(\sigma(\phi)) \right] \]

We now define \(\pi^{*}_{\zeta} := \pi^{*}_{\phi_{\zeta}}\).

Now, updating: (EDIT: the definition was needlessly complicated, simplified)

Consider a history \(h \in (A \times O)^{*}\) or \(h \in (A \times O)^{*} \times A\). Here \(A\) is the set of actions and \(O\) is the set of observations. Define \(\mu^{*}_{\phi}\) by

\[ \mu^{*}_{\phi} := \arg\max_{\mu \in E} \min_{\pi}\left( \phi(\mu) - \mathbb{E}^{\pi}_{\mu}[U] \right) \]

Let \(E'\) be the space of "environments starting from \(h\)". That is, if \(h \in (A \times O)^{*}\) then \(E' = E\), and if \(h \in (A \times O)^{*} \times A\) then \(E'\) is slightly different because the history now begins with an observation instead of with an action.

For any \(\mu \in E\), \(\nu \in E'\) we define \([\nu]^{\mu}_{h} \in E\) by

\[ [\nu]^{\mu}_{h}(o \mid h') := \begin{cases} \nu(o \mid h'') & \text{if } h' = h h'' \\ \mu(o \mid h') & \text{otherwise} \end{cases} \]

Then, the updated fuzzy belief is

\[ (\phi \mid h)(\nu) := \phi\left( [\nu]^{\mu^{*}_{\phi}}_{h} \right) + \text{constant} \]

One part of it is that I want to scrap classical (“static”) decision theory and move to a more learning-theoretic (“dynamic”) view.

Can you explain more what you mean by this, especially "learning-theoretic"? I've looked at learning theory a bit and the typical setup seems to involve a loss or reward that is immediately observable to the learner, whereas in decision theory, utility can be over parts of the universe that you can't see and therefore can't get feedback from, so it seems hard to apply typical learning theory results to decision theory. I wonder if I'm missing the whole point though... What do you think are the core insights or ideas of learning theory that might be applicable to decision theory?

(I don't speak for Abram but I wanted to explain my own opinion.) Decision theory asks, given certain beliefs an agent has, what is the rational action for em to take. But, what are these "beliefs"? Different frameworks have different answers for that. For example, in CDT a belief is a causal diagram. In EDT a belief is a joint distribution over actions and outcomes. In UDT a belief might be something like a Turing machine (inside the execution of which the agent is supposed to look for copies of emself). Learning theory allows us to gain insight through the observation that beliefs must be learnable , otherwise how would the agent come up with these beliefs in the first place? There might be parts of the beliefs that come from the prior and cannot be learned, but still, at least the type signature of beliefs should be compatible with learning.

Moreover, decision problems are often implicitly described from the point of view of a third party. For example, in Newcomb's paradox we postulate that Omega can predict the agent, which makes perfect sense for an observer looking from the side, but might be difficult to formulate from the point of view of the agent itself. Therefore, understanding decision theory requires the translation of beliefs from the point of view of one observer to the point of view of another. Here also learning theory can help us: we can ask, what are the beliefs Alice should expect Bob to learn given particular beliefs of Alice about the world? From a slightly different angle, the central source of difficulty in decision theory is the notion of counterfactuals, and the attempt to prescribe particular meaning to them, which different decision theories do differently. Instead, we can just postulate that, from the subjective point of view of the agent, counterfactuals are ontologically basic. The agent believes emself to have free will, so to speak. Then, the interesting question is, what kind of counterfactuals are produced by the translation of beliefs from the perspective of a third party to the perspective of the given agent.

Indeed, thinking about learning theory led me to the notion of quasi-Bayesian agents (agents that use incomplete/fuzzy models ), and quasi-Bayesian agents automatically solve all Newcomb-like decision problems. In other words, quasi-Bayesian agents are effectively a rigorous version of UDT.

Incidentally, to align AI we literally need to translate beliefs from the user's point of view to the AI's point of view. This is also solved via the same quasi-Bayesian approach. In particular, this translation process preserves the "point of updatelessness", which, in my opinion, is the desired result (the choice of this point is subjective).

My thinking is somewhat similar to Vanessa's . I think a full explanation would require a long post in itself. It's related to my recent thinking about UDT and commitment races . But, here's one way of arguing for the approach in the abstract.

You once asked :

Assuming that we do want to be pre-rational, how do we move from our current non-pre-rational state to a pre-rational one? This is somewhat similar to the question of how do we move from our current non-rational (according to ordinary rationality) state to a rational one. Expected utility theory says that we should act as if we are maximizing expected utility, but it doesn't say what we should do if we find ourselves lacking a prior and a utility function (i.e., if our actual preferences cannot be represented as maximizing expected utility).
The fact that we don't have good answers for these questions perhaps shouldn't be considered fatal to pre-rationality and rationality, but it's troubling that little attention has been paid to them, relative to defining pre-rationality and rationality. (Why are rationality researchers more interested in knowing what rationality is, and less interested in knowing how to be rational? Also, BTW, why are there so few rationality researchers? Why aren't there hordes of people interested in these issues?)

My contention is that rationality should be about the update process . It should be about how you adjust your position. We can have abstract rationality notions as a sort of guiding star, but we also need to know how to steer based on those.

Some examples:

  • Logical induction can be thought of as the result of performing this transform on Bayesianism; it describes belief states which are not coherent, and gives a rationality principle about how to approach coherence -- rather than just insisting that one must somehow approach coherence.
  • Evolutionary game theory is more dynamic than the Nash story. It concerns itself more directly with the question of how we get to equilibrium. Strategies which work better get copied. We can think about the equilibria, as we do in the Nash picture; but, the evolutionary story also lets us think about non-equilibrium situations. We can think about attractors (equilibria being point-attractors, vs orbits and strange attractors), and attractor basins; the probability of ending up in one basin or another; and other such things.
  • However, although the model seems good for studying the behavior of evolved creatures, there does seem to be something missing for artificial agents learning to play games; we don't necessarily want to think of there as being a population which is selected on in that way.
  • The complete class theorem describes utility-theoretic rationality as the end point of taking Pareto improvements. But, we could instead think about rationality as the process of taking Pareto improvements. This lets us think about (semi-)rational agents whose behavior isn't described by maximizing a fixed expected utility function, but who develop one over time. (This model in itself isn't so interesting, but we can think about generalizing it; for example, by considering the difficulty of the bargaining process -- subagents shouldn't just accept any Pareto improvement offered.)
  • Again, this model has drawbacks. I'm definitely not saying that by doing this you arrive at the ultimate learning-theoretic decision theory I'd want.
You look at the world, and you say: "how can I maximize utility?" You look at your beliefs, and you say: "how can I maximize accuracy?" That's not a consequentialist agent; that's two different consequentialist agents!
Not... really? "how can I maximize accuracy?" is a very liberal agentification of a process that might be more drily thought of as asking "what is accurate?" Your standard sequence predictor isn't searching through epistemic pseudo-actions to find which ones best maximize its expected accuracy, it's just following a pre-made plan of epistemic action that happens to increase accuracy.
Though this does lead to the thought: if you want to put things on equal footing, does this mean you want to describe a reasoner that searches through epistemic steps/rules like an agent searching through actions/plans?

This is more or less how humans already conceive of difficult abstract reasoning. We don't solve integrals by gradient descent, we imagine doing some sort of tree search where the edges are different abstract manipulations of the integral. But for everyday reasoning, like navigating 3D space, we just use our specialized feed-forward hardware.

Yeah, I absolutely agree with this. My description that you quoted was over-dramaticizing the issue.

Really, what you have is an agent sitting on top of non-agentic infrastructure. The non-agentic infrastructure is "optimizing" in a broad sense because it follows a gradient toward predictive accuracy, but it is utterly myopic (doesn't plan ahead to cleverly maximize accuracy).

The point I was making, stated more accurately, is that you (seemingly) need this myopic optimization as a 'protected' sub-part of the agent, which the overall agent cannot freely manipulate (since if it could, it would just corrupt the policy-learning process by wireheading).

This is more or less how humans already conceive of difficult abstract reasoning.

Yeah, my observation is that it intuitively seems like highly capable agents need to be able to do that; to that end, it seems like one needs to be able to describe a framework where agents at least have that option without it leading to corruption of the overall learning process via the instrumental part strategically biasing the epistemic part to make the instrumental part look good.

(Possibly humans just use a messy solution where the strategic biasing occurs but the damage is lessened by limiting the extent to which the instrumental system can bias the epistemics -- eg, you can't fully choose what to believe.)

I wrote a post inspired by / sorta responding to this one—see Predictive coding = RL + SL + Bayes + MPC

I found this a very interesting frame on things, and am glad I read it.

However, the critic is learning to predict; it's just that all we need to predict is expected value.

See also muZero. Also note that predicting optimal value formally ties in with predicting a model. In MDPs, you can reconstruct all of the dynamics using just |S| optimal value functions.

I re-read this post thinking about how and whether this applies to brains...

The online learning conceptual problem (as I understand your description of it) says, for example, I can never know whether it was a good idea to have read this book, because maybe it will come in handy 40 years later. Well, this seems to be "solved" in humans by exponential / hyperbolic discounting. It's not exactly episodic, but we'll more-or-less be able to retrospectively evaluate whether a cognitive process worked as desired long before death.

Relatedly, we seem to generally make and execute plans that are (hierarchically) laid out in time and with a success criterion at its end, like "I'm going to walk to the store". So we get specific and timely feedback on whether that plan was successful.

We do in fact have a model class. It seems very rich; in terms of "grain of truth", well I'm inclined to think that nothing worth knowing is fundamentally beyond human comprehension, except for contingent reasons like memory and lifespan limitations (i.e. not because they are incompatible with the internal data structures). Maybe that's good enough?

Just some thoughts; sorry if this is irrelevant or I'm misunderstanding anything. :-)

The online learning conceptual problem (as I understand your description of it) says, for example, I can never know whether it was a good idea to have read this book, because maybe it will come in handy 40 years later. Well, this seems to be "solved" in humans by exponential / hyperbolic discounting. It's not exactly episodic, but we'll more-or-less be able to retrospectively evaluate whether a cognitive process worked as desired long before death.

I interpret you as suggesting something like what Rohin is suggesting , with a hyperbolic function giving the weights.

It seems (to me) the literature establishes that our behavior can be approximately described by the hyperbolic discounting rule (in certain circumstances anyway), but, comes nowhere near establishing that the mechanism by which we learn looks like this, and in fact has some evidence against. But that's a big topic. For a quick argument, I observe that humans are highly capable, and I generally expect actor/critic to be more capable than dumbly associating rewards with actions via the hyperbolic function. That doesn't mean humans use actor/critic; the point is that there are a lot of more-sophisticated setups to explore.

We do in fact have a model class.

It's possible that our models are entirely subservient to instrumental stuff (ie, we "learn to think" rather than "think to learn"), which would mean we don't have the big split which I'm pointing to -- ie, that we solve the credit assignment problem "directly" somehow, rather than needing to learn to do so.

It seems very rich; in terms of "grain of truth", well I'm inclined to think that nothing worth knowing is fundamentally beyond human comprehension, except for contingent reasons like memory and lifespan limitations (i.e. not because they are incompatible with the internal data structures). Maybe that's good enough?

I think I agree with everything you wrote. I thought about it more, let me try again:

Maybe we're getting off on the wrong foot by thinking about deep RL. Maybe a better conceptual starting point for human brains is more like The Scientific Method.

We have a swarm of hypotheses (a.k.a. generative models), each of which is a model of the latent structure (including causal structure) of the situation.

How does learning-from-experience work? Hypotheses gain prominence by making correct predictions of both upcoming rewards and upcoming sensory inputs. Also, when there are two competing prominent hypotheses, then specific areas where they make contradictory predictions rise to salience, allowing us to sort out which one is right.

How do priors work? Hypotheses gain prominence by being compatible with other highly-weighted hypotheses that we already have.

How do control-theory-setpoints work? The hypotheses often entail "predictions" about our own actions, and hypotheses gain prominence by predicting that good things will happen to us, we'll get lots of reward, we'll get to where we want to go while expending minimal energy, etc.

Thus, we wind up adopting plans that balance (1) plausibility based on direct experience, (2) plausibility based on prior beliefs, and (3) desirability based on anticipated reward.

Credit assignment is a natural part of the framework because one aspect of the hypotheses is a hypothesized mechanism about what in the world causes reward.

It also seems plausibly compatible with brains and Hebbian learning.

I'm not sure if this answers any of your questions ... Just brainstorming :-)

Claim: predictive learning gets gradients "for free" ... Claim: if you're learning to act, you do not similarly get gradients "for free". You take an action, and you see results of that one action. This means you fundamentally don't know what would have happened had you taken alternate actions, which means you don't have a direction to move your policy in. You don't know whether alternatives would have been better or worse. So, rewards you observe seem like not enough to determine how you should learn.

This immediately jumped out at me as an implausible distinction because I was just reading Surfing Uncertainty which goes on endlessly about how the machinery of hierarchical predictive coding is exactly the same as the machinery of hierarchical motor control (with "priors" in the former corresponding to "priors + control-theory-setpoints" in the latter, and with "predictions about upcoming proprioceptive inputs" being identical to the muscle control outputs). Example excerpt:

the heavy lifting that is usually done by the use of efference copy, inverse models, and optimal controllers [in the models proposed by non-predictive-coding people] is now shifted [in the predictive coding paradigm] to the acquisition and use of the predictive (generative) model (i.e., the right set of prior probabilistic ‘beliefs’). This is potentially advantageous if (but only if) we can reasonably assume that these beliefs ‘emerge naturally as top-down or empirical priors during hierarchical perceptual inference’ (Friston, 2011a, p. 492). The computational burden thus shifts to the acquisition of the right set of priors (here, priors over trajectories and state transitions), that is, it shifts the burden to acquiring and tuning the generative model itself. --Surfing Uncertainty chapter 4

I'm a bit hazy on the learning mechanism for this (confusingly-named) "predictive model" (I haven't gotten around to chasing down the references) and how that relates to what you wrote... But it does sorta sound like it entails one update process rather than two...

Yep, I 100% agree that this is relevant. The PP/Friston/free-energy/active-inference camp is definitely at least trying to "cross the gradient gap" with a unified theory as opposed to a two-system solution. However, I'm not sure how to think about it yet.

  • I may be completely wrong, but I have a sense that there's a distinction between learning and inference which plays a similar role; IE, planning is just inference, but both planning and inference work only because the learning part serves as the second "protected layer"??
  • It may be that the PP is "more or less" the Bayesian solution; IE, it requires a grain of truth to get good results, so it doesn't really help with the things I'm most interested in getting out of "crossing the gap".
  • Note that PP clearly tries to implement things by pushing everything into epistemics. On the other hand, I'm mostly discussing what happens when you try to smoosh everything into the instrumental system. So many of my remarks are not directly relevant to PP.
  • I get the sense that Friston might be using the "evolution solution" I mentioned; so, unifying things in a way which kind of lets us talk about evolved agents, but not artificial ones. However, this is obviously an oversimplification, because he does present designs for artificial agents based on the ideas.

Overall, my current sense is that PP obscures the issue I'm interested in more than solves it, but it's not clear.

RUDDER - Reinforcement Learning with Delayed Rewards

Blog post to  RUDDER: Return Decomposition for Delayed Rewards .

Recently, tasks with delayed rewards that require model-free reinforcement learning have attracted a lot of attention via complex strategy games. For example, DeepMind currently focuses on the delayed-reward games Capture the Flag and StarCraft, whereas Microsoft is putting up the Marlo environment, and OpenAI announced its Dota 2 achievements. Mastering these games with delayed rewards using model-free reinforcement learning poses a great challenge and an almost insurmountable obstacle; see the excellent Understanding OpenAI Five blog. Delayed rewards are very common, as they typically appear in reinforcement learning (RL) tasks with episodic or sparse rewards. They impose a fundamental problem onto learning whose solution is a long-standing challenge in RL (abstract, p. vii). For complex real-world tasks, e.g. autonomous driving or controlling smart cities, appropriate models are not available and are difficult to learn, and thus only model-free reinforcement learning is feasible.

Via RUDDER, we introduce a novel model-free RL approach to overcome  delayed reward problems . RUDDER directly and efficiently assigns credit to reward-causing state-action pairs  and thereby speeds up learning in model-free reinforcement learning with delayed rewards dramatically.

The main idea of RUDDER

This is a 5 min read of the main idea of RUDDER:

We propose a paradigm shift for delayed rewards and model-free reinforcement learning, because conventional methods have problems:

  • Do not use temporal difference (TD) since it suffers from vanishing information and leads to high bias, even with eligibility traces.
  • Do not use Monte Carlo (MC) since averaging over all possible futures leads to high variance.
  • Do not use conventional value functions based on state-action pairs since the expected return has to be predicted from every state-action pair. A single prediction error might hamper learning.

Supervised learning: Assume you have high-reward and average-reward episodes. Supervised learning identifies state-action pairs that are indicative of high rewards. We therefore want to adjust the policy such that these state-action pairs are reached more often. Reward redistribution to these state-action pairs achieves these adjustments.

Value function is a step function: The main idea of RUDDER is to exploit the fact that the value function is a step function. Complex tasks are hierarchical, with sub-tasks or sub-goals. Completing them is reflected by a step in the value function. In this sense, steps indicate events like achievements, failures, or a change of environment/information. In general, a step in the value function is a change in return expectation (a higher/lower amount of return, or a higher/lower probability to receive return). The reward is redistributed to these steps. In particular, identifying the large steps is important since they speed up learning tremendously:

  • Large increase of return (after adjusting the policy).
  • Sampling more relevant episodes.


Return decomposition: The reward redistribution is obtained via return decomposition of the recurrent neural network. Return decomposition identifies the steps of the value function from the recurrent neural network, and these steps determine the redistributed rewards. The steps are identified via contribution analysis, which calculates the contribution of inputs to the prediction of the return.
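A hedged sketch of what such a return predictor could look like (this is not the authors' code; the architecture, sizes, and training target are illustrative assumptions): an LSTM reads the state-action sequence and is trained to predict the episode return at every timestep, so that steps in its predictions mark the reward-relevant events.

```python
# Illustrative return predictor in the spirit described above (not the authors' code):
# an LSTM reads the state-action sequence and predicts the episode return at every
# timestep. Steps where the prediction jumps are the candidates for redistributed reward.
import torch
import torch.nn as nn

class ReturnPredictor(nn.Module):
    def __init__(self, input_dim, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, seq):                      # seq: (batch, time, input_dim)
        out, _ = self.lstm(seq)
        return self.head(out).squeeze(-1)        # (batch, time) return predictions

def train_step(model, opt, seq, episode_return):
    # Train the prediction at every timestep toward the final return, so the
    # predictions step up/down where the return expectation changes.
    preds = model(seq)
    target = episode_return.unsqueeze(-1).expand_as(preds)
    loss = ((preds - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

model = ReturnPredictor(input_dim=6)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
seq = torch.randn(8, 50, 6)                      # a batch of 8 state-action sequences (dummy data)
episode_return = torch.randn(8)                  # the delayed return of each sequence (dummy data)
train_step(model, opt, seq, episode_return)
```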

RUDDER explained on a more detailed example

This is a 20 min read  of the basic concepts of RUDDER.

The following fundamental contributions of RUDDER will be explained on the pocket watch example task:

  • Return-equivalent decision processes  with the same optimal policies.
  • Reward redistribution that leads to return-equivalent decision processes.
  • Optimal reward redistribution  that leads to zero expected future rewards. 
  • Return decomposition via contribution analysis.

Based on the pocket watch example task, this blog will walk you through the topics of delayed rewards, expected future rewards equal to zero, return equivalence, reward redistribution and return decomposition via contribution analysis. We show how RUDDER effectively solves the credit assignment problem for delayed rewards.

An example task for delayed rewards


The problem formulation

You have to decide for a particular brand of watch if repairing pays off. The advantage function is the sales price minus the expected immediate repair costs minus the expected future delivery costs. Repairing pays off if the advantage function is positive.

Classical RL methods

Most classical RL methods, like temporal difference (TD) learning, Monte Carlo learning and Monte Carlo Tree Search, try to solve RL problems by estimating the return, i.e. the accumulated future reward. Therefore, they have to average over a large number of possible futures. Especially for delayed reward problems this has severe drawbacks.

In the example task, many events lie between “deciding whether to repair” and actually selling the watch. Only the return can tell you if repairing pays off. To evaluate the decision, temporal difference (TD) learning minimizes the TD error, which comes down to propagating the rewards back. The backpropagated reward corrects the bias of the TD return estimation at previous states. Unfortunately, TD learning needs a vast number of updates to propagate the reward back, and this number increases exponentially with the length of the path to previous states; thus, bias correction is exponentially slow. Consequently, for such problems TD approaches are not feasible. In Sutton and Barto's book, page 297, formula 12.11, it is shown that the TD(λ)-return vanishes exponentially.


The TD update rule is given by:

\[ V_{\text{new}}(S_{t-1}) = V(S_{t-1}) + \alpha \left( R_{t} + \gamma V(S_{t}) - V(S_{t-1}) \right) \]

where \(V(S)\) is the value function, \(S_{t-1}\) and \(S_{t}\) are the previous and current states, respectively, \(R_{t}\) is the reward at time \(t\), \(\gamma\) is the discount factor, and \(\alpha\) is a positive learning rate.

In our example task, we consider the new information due to the reward \(\boldsymbol{R_T}\). In the following, we show how this information is propagated back to give new values. The information due to \(\boldsymbol{R_T}\) is indicated by bold symbols. At state \(S_{T-1}\), the update gives the following new value \(\boldsymbol{V}_{\text{new}}\boldsymbol{(S_{T-1})}\):

\[ \boldsymbol{V}_{\text{new}}\boldsymbol{(S_{T-1})} = V(S_{T-1}) + \alpha \left( \boldsymbol{R_T} + \gamma V(S_{T}) - V(S_{T-1}) \right) \]

Iteratively updating the values with the new information due to \(\boldsymbol{R_T}\) propagates it back, state by state, towards \(S_1\).

The new information has decayed with a factor \(\alpha^{T-1}\gamma^{T-1}\), which is an exponential decay with respect to the time interval \(T-1\) between \(\boldsymbol{R_T}\) and \(S_1\).
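A small numerical illustration of this decay (a toy chain with all details assumed: the new information sits at the end of the chain, all other values start at zero, and one backward sweep of TD(0) updates is applied):

```python
# Toy numerical check of the decay described above: a chain S_1, ..., S_T where only the
# final state carries value (V(S_T) = R_T, all other values start at 0, no intermediate
# rewards). One backward sweep of TD(0) updates brings the information about R_T
# back to S_1 scaled by alpha**(T-1) * gamma**(T-1).
T, alpha, gamma, R_T = 10, 0.5, 0.9, 1.0
V = [0.0] * (T + 1)
V[T] = R_T                                   # the new information sits at the end of the chain

for t in range(T, 1, -1):                    # update V[t-1] from V[t], back along the chain
    V[t - 1] += alpha * (gamma * V[t] - V[t - 1])

print(V[1])                                  # ≈ 7.6e-4
print(alpha ** (T - 1) * gamma ** (T - 1) * R_T)   # same value
```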

The decision could also be evaluated by Monte Carlo learning. In a probabilistic environment, Monte Carlo averages over the returns of all possible futures, the number of which can be very large. If the returns have high variance, then Monte Carlo learning is very slow. Thus, Monte Carlo learning has problems with high variance, in contrast to TD.


We conclude that for long-delayed reward tasks, both TD and TD(λ) learning are exponentially slowed down with the length of the delay. On the other hand, Monte Carlo learning has problems with the high sample variance of the return, as is typical for probabilistic environments.

Zero future expected rewards


Why is it advantageous to have the expected future rewards equal to zero? The expected profit can be estimated easily by averaging over the immediate repair costs, since all future rewards are equal to zero. This estimation neither suffers from exponentially decaying reward propagation, as TD does, nor from high variance, as MC learning does, since there is no need to average over all possible trajectories (as in MC) or to propagate the delayed reward back (as in TD). Instead, the expected profit is given immediately, the decision is made, and no further reward is expected.

In the example task, for the decision whether or not to repair the watch, and for zeroing the expected future rewards, only the brand-related delivery costs, e.g. packing costs, have to be estimated. The decision to repair or not can only influence those costs (negative rewards). Unfortunately, the average delivery costs are a superposition of brand-related costs and brand-independent costs, e.g. time spent for delivery. So, how can we disentangle brand-related from brand-independent costs?

Return decomposition and pattern recognition

In order to estimate brand-related costs, return decomposition is used, which relies on pattern recognition. In the example task, the average delivery costs are a superposition of brand-related costs and brand-independent costs (general delivery costs). One can assume that the brand-independent costs are indicated by patterns. If it is snowing, delivery is usually delayed, making the costs go up. If delivery runs into a traffic jam, it is usually delayed. If your delivery truck gets a flat tire, then delivery costs are higher. There are many more examples. The idea is to use a training set of completed deliveries, where supervised learning can identify patterns for these events and attribute costs to them. In this way, RUDDER manages to push the expected future rewards towards zero by identifying key events and redistributing rewards to them. Removing brand-independent delivery costs associated with key events from the return considerably reduces the variance of the remaining return and, hence, estimating brand-related costs becomes much more efficient.

In RUDDER, zeroing the expected future reward and redistributing the reward are mathematically grounded by introducing the concepts of return-equivalent decision processes, reward redistribution, return decomposition, and optimal reward redistribution.

RUDDER - more technical:

A finite Markov decision process (MDP) \(\mathcal{P}\) is a 5-tuple \(\mathcal{P}=(\mathcal{S},\mathcal{A},\mathcal{R},p,\gamma)\):

  • finite sets \(\mathcal{S}\)  of states \(s\) (random variable \(S_{t}\) at time \(t\))
  • finite sets \(\mathcal{A}\) of actions \(a\) (random variable \(A_{t}\) at time \(t\))
  • finite sets \(\mathcal{R}\) of rewards \(r\) (random variable \(R_{t}\) at time \(t\))
  • transition-reward distribution \(p(S_{t+1}=s', R_{t+1}=r \mid S_t=s, A_t=a)\)
  • discount factor \(\gamma\)

The expected reward is the sum over the transition-reward distribution:

\[ r(s,a) \;=\; \mathbb{E}\left[ R_{t+1} \mid S_t = s, A_t = a \right] \;=\; \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a). \]

For a finite-horizon MDP with sequence length \(T\) and \(\gamma = 1\), the return at time \(t\) is given by

\[ G_t \;=\; \sum_{k=t+1}^{T} R_k. \]

The action-value function \(q^{\pi}(s,a)\) for a policy \(\pi(a' \mid s') = p(A_{t+1}=a' \mid S_{t+1}=s')\) is

\[ q^{\pi}(s,a) \;=\; \mathbb{E}_{\pi}\left[ G_t \mid S_t = s, A_t = a \right]. \]

The goal of learning is to maximize the expected return at time \(t=0\), that is, to find a policy \(\pi^{*}\) with

\[ \pi^{*} \;=\; \operatorname*{arg\,max}_{\pi} \, \mathbb{E}_{\pi}\left[ G_0 \right] \;=\; \operatorname*{arg\,max}_{\pi} \, v_0^{\pi}. \]
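As a small numerical illustration of these definitions (added here; the reward values are made up), the following sketch computes the return \(G_t\) for \(\gamma=1\) and a Monte Carlo estimate of \(v_0^{\pi}\) from sampled episodes:

```python
import random

def episode_return(rewards, t=0, gamma=1.0):
    """G_t = sum_{k=t+1}^{T} gamma^(k-t-1) * R_k for a finite episode."""
    return sum(gamma ** (k - t - 1) * r for k, r in enumerate(rewards[t:], start=t + 1))

def sample_episode(T=10):
    """Toy episode with zero immediate rewards and one noisy delayed reward."""
    return [0.0] * (T - 1) + [random.gauss(10.0, 2.0)]

# Monte Carlo estimate of the expected return v_0 under the (implicit) policy.
random.seed(0)
episodes = [sample_episode() for _ in range(1000)]
v0_estimate = sum(episode_return(rs) for rs in episodes) / len(episodes)
print(round(v0_estimate, 2))   # close to 10
```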

Return equivalent decision processes

A sequence-Markov decision process (SDP) is defined as a decision process which is equipped with a Markov policy and has Markov transition probabilities, but a reward that is not required to be Markov. Two SDPs \(\tilde{\mathcal{P}}\) and \(\mathcal{P}\) are return-equivalent if

  • they differ only in their transition-reward distributions
  • they have the same expected return at \(t=0\) for each policy \(\pi: \tilde{v}_0^{\pi} = v_0^{\pi}\)

We show that return-equivalent SDPs have the same optimal policies . In RUDDER, the idea is to transform an MDP with delayed rewards into a different but return-equivalent SDP where rewards are less delayed, therefore easier to learn.

Concretely, in our example task, the original MDP gives the reward at the very end when the repaired watch is sold. RUDDER redistributes the reward to key events and creates a new SDP which is ensured to be return-equivalent to the original MDP. Consequently and contrary to the original MDP, the decision of repairing the watch will be immediately rewarded. 

Return decomposition

The return decomposition determines the contribution of each state-action pair to the return - the return is decomposed. The contribution of each state-action pair is the new reward and ensures a reward redistribution which leads to a new, return-equivalent MDP. The contribution is found by learning a function that predicts the return of each state-action sequence. Subsequently, this function helps to distribute the return over the sequence. Conceptually, this is done via contribution analysis where the contribution of each sequence element to the prediction of the return is determined. We look at how much the current input contributes to the final prediction and consider three contribution analysis methods, although in practice any contribution analysis method could be used:

  • “Differences in prediction”
  • Integrated gradients
  • Layer-wise relevance propagation

We prefer differences in prediction, where the change in prediction from one timestep to the next is a measure of the contribution of an input to the final prediction.
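The sketch below illustrates the differences-in-prediction idea (the function names and the toy return predictor are illustrative assumptions, not RUDDER's code): the contribution of the input at step \(t\) is the change of the return prediction from step \(t-1\) to step \(t\), and the contributions telescope to the full prediction.

```python
from typing import Callable, List, Sequence

def differences_in_prediction(g: Callable[[Sequence], float], sequence: Sequence) -> List[float]:
    """Contribution of the element at step t = g(prefix up to t) - g(prefix up to t-1).
    The contributions sum telescopically to g(full sequence) - g(empty prefix)."""
    contributions = []
    prev = g(sequence[:0])                   # prediction from the empty prefix
    for t in range(1, len(sequence) + 1):
        cur = g(sequence[:t])
        contributions.append(cur - prev)
        prev = cur
    return contributions

# Toy stand-in for a return predictor: it "notices" the key event (value 1).
def toy_g(prefix):
    return 10.0 if any(x == 1 for x in prefix) else 0.0

episode = [0, 0, 1, 0, 0]                    # key event at step 3, reward only at the end
print(differences_in_prediction(toy_g, episode))   # [0.0, 0.0, 10.0, 0.0, 0.0]
```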

Optimal return decomposition


Imagine you have a reward ledger that tracks your expected accumulated future reward, but you are not aware of it. At each time step, the ledger is updated according to the current expectation. If the ledger is non-empty, nothing stops you from receiving the expected reward immediately, thereby emptying the ledger. RUDDER exploits this accounting: any change of the ledger leads to an immediate payout, i.e. a reward, which sets the ledger back to zero. The defining property of an optimal return decomposition is that the ledger is always zero, being emptied immediately whenever it deviates from zero, so no future reward is expected.

We define optimal return decomposition via the optimality condition that the expected future accumulated reward is always zero. As a consequence, the new SDP does not have delayed rewards and Q-value estimates are unbiased .

Reward redistribution is achieved by using a function \(g\) that predicts the return \(G\) at the end of the episode from the state-action sequence, and then applying contribution analysis to \(g\). Contribution analysis computes the contribution of the current input to the final output, that is, the information gain of the current input on the final prediction. We use the individual contributions as the new reward in the new MDP. We show that this redistribution, since it is return-equivalent, does not change the optimal policies.

In the example task, an optimal return decomposition means that the whole reward is given right after your initial decision whether to repair a watch of a particular brand. This optimal return decomposition ensures that the expected future reward is zero. If some unforeseen event happens during delivery, the reward ledger is updated such that the expected future return stays zero.


RUDDER vs potential-based reward shaping

Reward shaping in the RL community is known via potential-based reward shaping, look-ahead advice, and look-back advice. These methods use as new reward the original reward augmented by a shaping reward (\(\text{reward}^{\text{new}} = \text{reward}^{\text{original}} + \text{reward}^{\text{shaping}}\)). The shaping reward is obtained via differences of a potential function. In contrast, RUDDER constructs a completely new reward function, which substitutes the original reward (\(\text{reward}^{\text{new}} = \text{reward}^{\text{redistributed}}\)).

Redistributing the reward via reward shaping, look-ahead advice, and look-back advice is a special case of reward redistribution. We have shown that subtracting a constant from the potential function of these reward shaping methods makes the shaping rewards sum to zero (\(\sum_t \text{reward}_t^{\text{shaping}} = 0\)), while all potential differences (shaping rewards) are kept. Hence, this reward redistribution leads to new SDPs that are return-equivalent to the original MDP. However, since these reward shaping approaches keep the original reward, their reward redistribution does not correspond to an optimal return decomposition (no delayed rewards), and therefore learning by TD can still be exponentially slow for the remaining delayed rewards. RUDDER is not potential-based reward shaping.
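For contrast, here is a minimal sketch of potential-based reward shaping (the potential values are arbitrary illustrations): the shaping reward is \(F(s, s') = \gamma\Phi(s') - \Phi(s)\), which is added to the original reward; with \(\gamma = 1\) the shaping rewards along an episode telescope to \(\Phi(s_T) - \Phi(s_0)\), so with a suitably shifted potential they sum to zero, as stated above.

```python
def shaping_rewards(states, potential, gamma=1.0):
    """Potential-based shaping rewards F(s, s') = gamma * phi(s') - phi(s),
    added to the original reward at every transition."""
    return [gamma * potential(s_next) - potential(s)
            for s, s_next in zip(states[:-1], states[1:])]

# Arbitrary potential over the states of a 5-step episode (illustrative values).
phi = {0: 0.0, 1: 2.0, 2: 5.0, 3: 1.0, 4: 0.0}.get
episode_states = [0, 1, 2, 3, 4]

F = shaping_rewards(episode_states, phi, gamma=1.0)
print(F)        # per-transition shaping rewards
print(sum(F))   # telescopes to phi(s_T) - phi(s_0) = 0.0 for this potential
```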


The figure taken from the paper shows the comparison of RUDDER with reward shaping approaches. In detail, we compare RUDDER to Q-learning with eligibility traces, SARSA with eligibility traces, reward shaping (RS original), look-ahead advice (RS look-ahead), and look-back advice (RS look-back). For each method, the learning time (in episodes) needed to solve the task is plotted against the delay of the reward. The dashed blue line represents RUDDER applied on top of Q-learning with eligibility traces (RUDDER Q). In all cases, RUDDER is exponentially faster and significantly outperforms the reward shaping methods.

LSTM in RUDDER

We have already seen that RUDDER detects key events and assigns credit to them when the associated rewards are delayed. This way, relevant actions and states can be identified even if they occurred much earlier than the reward. The identification is done by deep learning, which excels at pattern recognition problems. Over the past years, the Long Short-Term Memory (LSTM) has gained huge momentum in sequence analysis. For a nice introduction to recurrent neural networks in general and the LSTM in particular, have a closer look at Andrej Karpathy's blog. In RUDDER, an LSTM is the obvious choice for the return prediction function, which is based on the state-action sequence and allows for the return decomposition. By analyzing the whole episode, the LSTM detects the key events that correlate most with the reward.


Reward redistribution via LSTM

When the LSTM detects a key event, a piece of information is stored in the LSTM's memory and later used for the prediction at the end of the episode. By analyzing the LSTM's memory, we can reconstruct these pieces of information and assign the contributions of individual state-action pairs to the final return prediction.

RUDDER in practice


The LSTM uses the state-action sequence as input to predict the return. The learned LSTM model allows us to perform return decomposition via a contribution analysis method. This return decomposition can be applied to any state-action sequence with an assigned return in order to redistribute the reward. The redistribution is even valid for state-action sequences obtained from a policy different from the one on which the LSTM was trained and the return decomposition was determined.

For reward redistribution, we assume an MDP with a single reward (= the return) at the end of the sequence, which can be predicted from the last state-action pair. We want to remove this Markov property to force the LSTM to store the relevant past events that caused the reward. This way, contribution analysis will identify these relevant past events and assign reward to them. To eliminate the Markov property, we define a difference \(\Delta\) between a state-action pair and its predecessor, where the \(\Delta\)s are assumed to be mutually independent of each other. Now the reward at the sequence end cannot be predicted from the last \(\Delta\) alone, but the return can still be predicted from the sequence of \(\Delta\)s. Since the \(\Delta\)s are mutually independent, the contribution of each \(\Delta\) to the return must be stored in the hidden states of the LSTM in order to predict the final reward. The \(\Delta\)s can be generic: states and actions can be numbered, and the difference of these numbers can be used as \(\Delta\). In applications like Atari with immediate rewards, we give the accumulated reward at the end of the episode without enriching the states, which has a similar effect as using \(\Delta\): we force the LSTM to build up an internal state that tracks the already accumulated reward.

Example: the agent has to take a key to open a door. Since the environment is an MDP, the agent is always aware that it possesses the key, indicated by a key bit being set, so the reward can be predicted in the last step. Using differences \(\Delta\), the key bit is zero everywhere except at the step where the agent takes the key. Thus, the LSTM has to store this event and transfers reward to it.
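A tiny sketch of the key-bit example (the state encoding is an illustrative assumption): in the MDP view the "has key" bit stays on after pickup, so the final reward could be predicted from the last state alone; after taking differences \(\Delta\) between consecutive states, the bit is non-zero only at the pickup step, so a return predictor must memorize that step.

```python
# States encoded as (position, has_key); the key is picked up at step 3.
states = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 1), (5, 1)]

# Delta between consecutive states: the key bit stays on in the MDP view,
# but is non-zero only at the pickup step in the Delta view.
deltas = [tuple(b - a for a, b in zip(prev, cur))
          for prev, cur in zip(states[:-1], states[1:])]

print([s[1] for s in states])   # [0, 0, 0, 1, 1, 1]  -> last state alone reveals the key
print([d[1] for d in deltas])   # [0, 0, 1, 0, 0]     -> the key event appears only once
```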

The LSTM prediction at time \(t\) is a value function, but one based on the \(\Delta\) sub-sequence up to \(t\). The LSTM prediction can be decomposed into two sub-predictions. The first sub-prediction is the contribution of the already known \(\Delta\) sub-sequence up to \(t\) to the return (backward view). The second sub-prediction is the expected contribution of the unknown future sequence from \(t+1\) onwards to the return (forward view). However, we are not interested in the second sub-prediction, but only in the contribution of \(\Delta\) at time \(t\) to the prediction of the expected return. The second sub-prediction is irrelevant for our approach, and we cancel it via differences of consecutive predictions. The difference at time \(t\) gives the contribution of \(\Delta\) at time \(t\) to the expected return.
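The following is a minimal PyTorch sketch of this backward-view redistribution (class names, network sizes, and the training loop are illustrative assumptions; RUDDER's actual implementation in the linked repository uses additional loss terms and tricks). An LSTM is trained so that its prediction at every timestep regresses towards the final return, and the redistributed reward at step \(t\) is the difference between consecutive predictions; the redistributed rewards then sum to the predicted return.

```python
import torch
import torch.nn as nn

class ReturnPredictor(nn.Module):
    """LSTM that outputs a return prediction g(Delta_{1:t}) at every timestep t."""
    def __init__(self, input_dim, hidden_dim=32):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, deltas):                  # deltas: (batch, T, input_dim)
        h, _ = self.lstm(deltas)
        return self.head(h).squeeze(-1)         # per-timestep predictions: (batch, T)

def redistribute_reward(model, deltas):
    """Differences of consecutive return predictions as the redistributed reward."""
    with torch.no_grad():
        pred = model(deltas)                    # (batch, T)
    first = pred[:, :1]                         # contribution of the first step
    return torch.cat([first, pred[:, 1:] - pred[:, :-1]], dim=1)

if __name__ == "__main__":
    batch, T, d = 8, 20, 4
    deltas = torch.randn(batch, T, d)           # illustrative Delta sequences
    G = torch.randn(batch)                      # illustrative returns
    model = ReturnPredictor(d)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(100):                        # regress every timestep towards the return
        loss = ((model(deltas) - G.unsqueeze(1)) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    r_new = redistribute_reward(model, deltas)
    print(r_new.shape)                                          # torch.Size([8, 20])
    print(torch.allclose(r_new.sum(dim=1),                      # sums to the final prediction
                         model(deltas)[:, -1].detach(), atol=1e-5))
```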

We started this research project by using an LSTM as a value function, but we failed. We investigated the LSTM as a value function for our example task (artificial task (II) in the paper). At the point when RUDDER had already solved the task, the LSTM error was still too large to allow learning via a value function. The problem is the large variance of the returns at the beginning of the sequence, which hinders LSTM learning (forward view). LSTM learning was instead initiated by propagating back prediction errors at the sequence end, where the variance of the return is lower (backward view). These late predictions initiated the storing of key events at the sequence beginning, even with high prediction errors. The redistributed reward at the key events let RUDDER solve the task. Concluding: at the point when RUDDER had already solved the task, the early predictions were still not learned due to the high variance of the returns; therefore, using the predictions as a value function does not help (forward view).

RUDDER demonstration code

You can find the code for the example task described in this blog here.

You can find a practical tutorial on how to perform reward redistribution here.

You can find our github repo here.

Video of RUDDER in the ATARI game Bowling

The video below shows RUDDER at work in the delayed-reward ATARI game Bowling. Here, the agent has to roll a ball towards 10 pins with the objective of knocking them down. In Bowling, the agent can influence the outcome in two ways: first, through the start position it takes when rolling the ball, and second, by curving the ball towards the pins while it is rolling. The agent is rewarded only after the roll has finished. As shown in the video, RUDDER manages

  • to distribute the reward to actions that successfully curved the ball towards the pins,
  • to assign negative rewards for not knocking all of the pins down,
  • to detect rolling a strike (as indicated by the blinking scores).

Correspondence

This blog post was written by Johannes Brandstetter: brandstetter[at]ml.jku.at

Contributions by Jose Arjona-Medina, Michael Gillhofer, Michael Widrich, Vihang Patil and Sepp Hochreiter.

Credit assignment in heterogeneous multi-agent reinforcement learning for fully cooperative tasks

  • Published: 26 October 2023
  • Volume 53, pages 29205–29222 (2023)


  • Kun Jiang 1,2,
  • Wenzhang Liu 3,
  • Yuanda Wang 1,
  • Lu Dong 4 &
  • Changyin Sun (ORCID: 0000-0001-9269-334X) 1,2


Credit assignment poses a significant challenge in heterogeneous multi-agent reinforcement learning (MARL) when tackling fully cooperative tasks. Existing MARL methods assess the contribution of each agent through value decomposition or agent-wise critic networks. However, value decomposition techniques are not directly applicable to control problems with continuous action spaces. Additionally, agent-wise critic networks struggle to differentiate the distinct contributions from the shared team reward. Moreover, most of these methods assume agent homogeneity, which limits their utility in more diverse scenarios. To address these limitations, we present a novel algorithm that factorizes and reshapes the team reward into agent-wise rewards, enabling the evaluation of the diverse contributions of heterogeneous agents. Specifically, we devise agent-wise local critics that leverage both the team reward and the factorized reward, alongside a global critic for assessing the joint policy. By accounting for the contribution differences resulting from agent heterogeneity, we introduce a power balance constraint that ensures a fairer measurement of each heterogeneous agent’s contribution, ultimately promoting energy efficiency. Finally, we optimize the policies of all agents using deterministic policy gradients. The effectiveness of our proposed algorithm has been validated through simulation experiments conducted in fully cooperative and heterogeneous multi-agent tasks.


Data Availability

The data that support the findings of this study are available on request from the first author.


Author information

Authors and affiliations.

School of Automation, Southeast University, Nanjing, 210096, Jiangsu, China

Kun Jiang, Yuanda Wang & Changyin Sun

Peng Cheng Laboratory, Shenzhen, 518955, Guangdong, China

Kun Jiang & Changyin Sun

School of Artificial Intelligence, Anhui University, Hefei, 230039, Anhui, China

Wenzhang Liu

School of Cyber Science and Engineering, Southeast University, Nanjing, 211189, Jiangsu, China

Lu Dong

Corresponding author

Correspondence to Changyin Sun .

Ethics declarations

Conflicts of interest.

No potential conflict of interest was reported by the authors.


About this article

Jiang, K., Liu, W., Wang, Y. et al. Credit assignment in heterogeneous multi-agent reinforcement learning for fully cooperative tasks. Appl Intell 53, 29205–29222 (2023). https://doi.org/10.1007/s10489-023-04866-0


Accepted: 02 July 2023

Published: 26 October 2023

Issue Date: December 2023

DOI: https://doi.org/10.1007/s10489-023-04866-0


  • Credit assignment
  • Multi-agent reinforcement learning
  • Reward decomposition
  • Heterogeneous agents

