Paper ID: 3358

Title: Distributional Reward Decomposition for Reinforcement Learning

Originality: The work proposes an interesting framework for distributional reward decomposition. The work is not particularly novel, since it builds on several prior works, albeit in a non-trivial way. For example, as discussed in point 3 of the contributions section above, the authors are inspired by [Grimm and Singh, 2019] but propose their own disentanglement loss term.

Quality: The intuitions behind the various steps seem quite reasonable to me. The experiments show significant improvements in RL performance, and the authors provide experiments showing how the computed rewards correlate with the ground truth. On the other hand, only decompositions into 2 or 3 terms are performed, and other decomposition methods are not assessed.

Significance: Disentangled reward decomposition is a very important area in RL, so the work can be used by other researchers or practitioners.

UPDATE: Thanks for the authors' feedback. I acknowledge that I have read the rebuttal and the other reviews. I believe this is an interesting work that successfully brings together various promising ideas from the existing literature, so I think it qualifies for acceptance. Therefore, I keep my original score of 6. The reason I am not giving a higher score is that the experiments are not entirely convincing to me. For instance, I raised the point about high-dimensional experiments: the authors mentioned that their framework is in principle applicable to high-dimensional settings, but I think experiments would be needed to back up this claim. Nevertheless, I still think this is an interesting work, relevant to the RL community.

The submission introduces a method for distributional reward decomposition that is more generally applicable than prior work, removing requirements for arbitrary resets as well as domain knowledge. The method models sub-rewards as categorical distributions, treats reward composition as 1D convolution, and relies on the update rules from prior work on distributional Q-learning (C51). To further strengthen disentanglement, the objective is extended to maximise the KL divergence between the distributions resulting from actions optimising for different sub-rewards (treating the learned Q-functions as epsilon-greedy policies).

Overall, the work provides a valuable contribution to RL by investigating (and benefiting from) reward decomposition in a distributional setting. The combination of reward decomposition and distributional RL provides novelty and, as demonstrated in the experimental section, better agent performance by exploiting task structure. It would be interesting in this context to see how the approach fares in tasks with only a single source of reward, and in potential situations where the method might perform worse than the baseline.

On the experiment section, it would be important to additionally compare against distributions with the same number of atoms, as it might simply be easier to fit distributions with M/N atoms than with M atoms, leading to an unplanned benefit of the proposed algorithm. In Figure 3, it would be interesting to investigate situations where either sub-reward spikes but the original one does not, to better understand the model. To be fair to prior work and provide a more complete evaluation, it would be good to compare against Van Seijen et al. and Grimm and Singh in environments where their requirements are fulfilled.
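The "reward composition as 1D convolution" idea summarised above can be illustrated with a minimal numpy sketch (this is an illustration of the general principle, not the authors' code; the function name is mine): summing two independent categorical sub-returns on an evenly spaced atom grid amounts to convolving their probability mass functions.

```python
import numpy as np

def compose_subreturns(p1, p2):
    """Compose two independent categorical sub-return distributions.

    If Z = Z1 + Z2 with Z1, Z2 independent and supported on the same
    evenly spaced atom grid, the pmf of Z is the convolution of the pmfs.
    """
    return np.convolve(p1, p2)

# Two sub-return distributions over 3 atoms each.
p1 = np.array([0.2, 0.5, 0.3])
p2 = np.array([0.6, 0.3, 0.1])
p = compose_subreturns(p1, p2)
assert np.isclose(p.sum(), 1.0)  # a valid distribution over 5 atoms
```

Note the composed support grows (3 atoms each yield 5 atoms total), which is also why the reviewer's point below about comparing against baselines with the same total number of atoms matters.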
Minor:
- A couple of spelling/grammatical mistakes in lines 10, 89, 90, 101.
- The statement about stochastic policies could leave out the epsilon-greedy part to be more general, as stochasticity can also apply to continuous-action policies.
- After introducing the UVFA-like trick in 3.4, is the state splitting still required?
- Figure 1a: to be more self-contained, it would be beneficial to also explain the symbols here.
- Possibly related work: RUDDER: Return Decomposition for Delayed Rewards; Jose A. Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Johannes Brandstetter, Sepp Hochreiter.

I appreciate the author feedback, in particular the additional ablation and investigation of the model, which will help answer some open questions. In addition, I hope the authors will spend time on the promised detailed evaluation of failure cases and on investigating why the approach improves performance even when its assumptions seem broken (one reward source).

I find the ideas presented in this work to be sound. Learning a decomposed reward representation seems to give more representational power, which leads to improved performance. However, the ideas are quite incremental (combining two well-studied approaches), there is no new analysis, and the experiments are not too impressive. To be more convinced, I would like to see an ablative analysis of the results: how each component of this work (new loss, new architecture, distributional RL) contributed to the final solution.

Detailed comments:

While many reward decomposition papers try to learn a different policy for each component, to be combined later on, here the focus is on learning better representations by using the decomposed representation. I would like the authors to emphasize that more in the text and explain why they took this approach.

Section 3.1. This section is quite confusing. Equations are derived, but then it is explained that they are ignored. The authors mention that they performed experiments with the full distribution method (the non-factorial one) but that it did not perform well. I would like to see the distributional model developed for this case as well, together with the supporting experiments, to be convinced. The fact that the distribution of the sum of two independent random variables is given by the convolution of their distributions holds for any two such random variables (this is taught in basic probability courses); the way it is presented here may lead the reader to think that there is some novelty in it. I would also like to see a derivation of equation 4; I am not confident that the optimal Bellman equation is linear.

Section 3.3. Is the projected Bellman loss novel to this work, or was it proposed in the Bellemare et al. (2017) paper? Please be specific. If it was proposed before, then why wasn't it implemented in the Rainbow architecture? Does this new loss only improve results when combined with the reward decomposition?
I would like to see more experiments about this loss, with and without other components, as well as a detailed explanation of when it was first proposed and where it has been used. In equation 7 there is a subscript i, but it is not used in equation 8; can you please explain how you move between these equations?

Experiments. How were the hyperparameters selected? How do they differ from the classic parameters of Rainbow? Did you find the algorithm to be sensitive to these parameters? The regularization coefficient \lambda seems to be quite small. Can you elaborate on that? Was it needed at all?

---------------------------------------

Following the rebuttal: I appreciate the authors' effort in addressing my questions. I also find the additional experiments provided in the rebuttal to be interesting, and I believe that they improve the quality of the paper significantly. Since the authors also agreed to address most of my concerns in the final version of the paper, I am increasing my score from 4 to 6.
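For context on the projected Bellman loss question raised above: the categorical projection step of C51 (Bellemare et al., 2017) can be sketched as below. This is a minimal single-transition illustration of that prior work's projection, not the submission's implementation; the function name and example values are mine.

```python
import numpy as np

def categorical_projection(atoms, probs, reward, gamma):
    """Project the Bellman-shifted distribution reward + gamma * Z back
    onto the fixed support `atoms`, as in C51 (Bellemare et al., 2017)."""
    v_min, v_max = atoms[0], atoms[-1]
    dz = atoms[1] - atoms[0]
    tz = np.clip(reward + gamma * atoms, v_min, v_max)  # shifted, clipped atoms
    b = (tz - v_min) / dz                               # fractional grid index
    lo, hi = np.floor(b).astype(int), np.ceil(b).astype(int)
    out = np.zeros_like(probs)
    for j in range(len(atoms)):
        if lo[j] == hi[j]:   # shifted atom lands exactly on the grid
            out[lo[j]] += probs[j]
        else:                # split mass between the two nearest atoms
            out[lo[j]] += probs[j] * (hi[j] - b[j])
            out[hi[j]] += probs[j] * (b[j] - lo[j])
    return out

# Example: support {0, 1, 2}, reward 1.0, discount 0.5.
atoms = np.array([0.0, 1.0, 2.0])
probs = np.array([0.2, 0.5, 0.3])
proj = categorical_projection(atoms, probs, 1.0, 0.5)
assert np.isclose(proj.sum(), 1.0)  # still a proper distribution
```

The cross-entropy between this projected target and the predicted distribution is the loss that paper minimises; whether the submission's "projected Bellman loss" goes beyond this step is exactly the question the review asks the authors to clarify.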