Risk-sensitive reinforcement learning

Standard

Risk-sensitive reinforcement learning. / Shen, Yun; Tobia, Michael J; Sommer-Blöchl, Tobias; Obermayer, Klaus.

in: NEURAL COMPUT, Vol. 26, No. 7, 01.07.2014, pp. 1298-1328.

Publications: SCORING: Contribution to journal/newspaper, SCORING: Journal article, Research, Peer review

Harvard

Shen, Y, Tobia, MJ, Sommer-Blöchl, T & Obermayer, K 2014, 'Risk-sensitive reinforcement learning', NEURAL COMPUT, vol. 26, no. 7, pp. 1298-1328. https://doi.org/10.1162/NECO_a_00600

Bibtex

@article{0517b76c7d734f238cf949bc12ec4263,
title = "Risk-sensitive reinforcement learning",
abstract = "We derive a family of risk-sensitive reinforcement learning methods for agents, who face sequential decision-making tasks in uncertain environments. By applying a utility function to the temporal difference (TD) error, nonlinear transformations are effectively applied not only to the received rewards but also to the true transition probabilities of the underlying Markov decision process. When appropriate utility functions are chosen, the agents' behaviors express key features of human behavior as predicted by prospect theory (Kahneman & Tversky, 1979 ), for example, different risk preferences for gains and losses, as well as the shape of subjective probability curves. We derive a risk-sensitive Q-learning algorithm, which is necessary for modeling human behavior when transition probabilities are unknown, and prove its convergence. As a proof of principle for the applicability of the new framework, we apply it to quantify human behavior in a sequential investment task. We find that the risk-sensitive variant provides a significantly better fit to the behavioral data and that it leads to an interpretation of the subject's responses that is indeed consistent with prospect theory. The analysis of simultaneously measured fMRI signals shows a significant correlation of the risk-sensitive TD error with BOLD signal change in the ventral striatum. In addition we find a significant correlation of the risk-sensitive Q-values with neural activity in the striatum, cingulate cortex, and insula that is not present if standard Q-values are used.",
author = "Yun Shen and Tobia, {Michael J} and Tobias Sommer-Bl{\"o}chl and Klaus Obermayer",
year = "2014",
month = jul,
day = "1",
doi = "10.1162/NECO_a_00600",
language = "English",
volume = "26",
pages = "1298--1328",
journal = "NEURAL COMPUT",
issn = "0899-7667",
publisher = "MIT Press",
number = "7",

}
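
The abstract above describes applying a utility function to the temporal-difference (TD) error before the Q-value update. Below is a minimal, hypothetical Python sketch of such a risk-sensitive Q-learning step, not the authors' implementation: the piecewise-linear utility (separate slopes for positive and negative TD errors), the tabular environment, and all parameter values are illustrative assumptions.

import numpy as np

def utility(td_error, k_pos=0.8, k_neg=1.2):
    # Piecewise-linear utility: positive and negative TD errors are weighted
    # differently, giving a prospect-theory-like asymmetry between gains and
    # losses (the slopes here are assumed values, not from the paper).
    return k_pos * td_error if td_error >= 0 else k_neg * td_error

def risk_sensitive_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    # Ordinary Q-learning TD error ...
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    # ... but the utility of the TD error, rather than the raw TD error,
    # drives the value update.
    Q[s, a] += alpha * utility(td_error)
    return Q

# Toy usage on a hypothetical 5-state, 2-action Q-table.
Q = np.zeros((5, 2))
Q = risk_sensitive_q_update(Q, s=0, a=1, r=1.0, s_next=2)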

RIS

TY - JOUR

T1 - Risk-sensitive reinforcement learning

AU - Shen, Yun

AU - Tobia, Michael J

AU - Sommer-Blöchl, Tobias

AU - Obermayer, Klaus

PY - 2014/7/1

Y1 - 2014/7/1

N2 - We derive a family of risk-sensitive reinforcement learning methods for agents, who face sequential decision-making tasks in uncertain environments. By applying a utility function to the temporal difference (TD) error, nonlinear transformations are effectively applied not only to the received rewards but also to the true transition probabilities of the underlying Markov decision process. When appropriate utility functions are chosen, the agents' behaviors express key features of human behavior as predicted by prospect theory (Kahneman & Tversky, 1979), for example, different risk preferences for gains and losses, as well as the shape of subjective probability curves. We derive a risk-sensitive Q-learning algorithm, which is necessary for modeling human behavior when transition probabilities are unknown, and prove its convergence. As a proof of principle for the applicability of the new framework, we apply it to quantify human behavior in a sequential investment task. We find that the risk-sensitive variant provides a significantly better fit to the behavioral data and that it leads to an interpretation of the subject's responses that is indeed consistent with prospect theory. The analysis of simultaneously measured fMRI signals shows a significant correlation of the risk-sensitive TD error with BOLD signal change in the ventral striatum. In addition we find a significant correlation of the risk-sensitive Q-values with neural activity in the striatum, cingulate cortex, and insula that is not present if standard Q-values are used.

AB - We derive a family of risk-sensitive reinforcement learning methods for agents, who face sequential decision-making tasks in uncertain environments. By applying a utility function to the temporal difference (TD) error, nonlinear transformations are effectively applied not only to the received rewards but also to the true transition probabilities of the underlying Markov decision process. When appropriate utility functions are chosen, the agents' behaviors express key features of human behavior as predicted by prospect theory (Kahneman & Tversky, 1979), for example, different risk preferences for gains and losses, as well as the shape of subjective probability curves. We derive a risk-sensitive Q-learning algorithm, which is necessary for modeling human behavior when transition probabilities are unknown, and prove its convergence. As a proof of principle for the applicability of the new framework, we apply it to quantify human behavior in a sequential investment task. We find that the risk-sensitive variant provides a significantly better fit to the behavioral data and that it leads to an interpretation of the subject's responses that is indeed consistent with prospect theory. The analysis of simultaneously measured fMRI signals shows a significant correlation of the risk-sensitive TD error with BOLD signal change in the ventral striatum. In addition we find a significant correlation of the risk-sensitive Q-values with neural activity in the striatum, cingulate cortex, and insula that is not present if standard Q-values are used.

U2 - 10.1162/NECO_a_00600

DO - 10.1162/NECO_a_00600

M3 - SCORING: Journal article

C2 - 24708369

VL - 26

SP - 1298

EP - 1328

JO - NEURAL COMPUT

JF - NEURAL COMPUT

SN - 0899-7667

IS - 7

ER -