MAVRL

Learning Reward Functions from Multiple Feedback Types
with Amortized Variational Inference

¹ETH AI Center  ²Department of Computer Science, ETH Zurich  ³Department of Informatics, University of Zurich
ICML 2026
MAVRL architecture: a shared reward encoder processes multi-type feedback through feedback-specific decoders, trained via a single ELBO objective.

MAVRL learns a shared probabilistic reward encoder from heterogeneous feedback types. Each feedback modality — preferences, demonstrations, ratings, and stops — contributes through a dedicated likelihood decoder. The entire model is trained end-to-end by optimizing a single evidence lower bound (ELBO).
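
The structure described above (one shared reward posterior, one likelihood per feedback type, one objective) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear encoder, the standard-normal prior, the Bradley-Terry preference likelihood, the Gaussian rating likelihood, and every function name below are hypothetical choices made for the example. Only two of the four decoders are shown; demonstrations and stops would add further log-likelihood terms to the same sum.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(features, W_mu, W_logstd):
    """Hypothetical linear encoder: state features -> per-state reward mean and std."""
    return features @ W_mu, np.exp(features @ W_logstd)

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -np.logaddexp(0.0, -x)

def pref_loglik(r, prefs):
    """Assumed Bradley-Terry decoder: trajectory a beats b with prob sigmoid(R_a - R_b)."""
    return sum(log_sigmoid(r[a].sum() - r[b].sum()) for a, b in prefs)

def rating_loglik(r, ratings, noise_std=1.0):
    """Assumed Gaussian decoder: rating = trajectory return + Gaussian noise."""
    const = -0.5 * np.log(2.0 * np.pi * noise_std**2)
    return sum(const - 0.5 * ((y - r[t].sum()) / noise_std) ** 2 for t, y in ratings)

def kl_to_std_normal(mu, std):
    """KL( N(mu, diag(std^2)) || N(0, I) ), the assumed prior."""
    return 0.5 * np.sum(std**2 + mu**2 - 1.0 - 2.0 * np.log(std))

def elbo(features, W_mu, W_logstd, prefs, ratings, n_samples=64):
    """Monte Carlo ELBO: expected sum of per-type log-likelihoods minus the KL term."""
    mu, std = encode(features, W_mu, W_logstd)
    total = 0.0
    for _ in range(n_samples):
        r = mu + std * rng.standard_normal(mu.shape)  # reparameterized reward sample
        total += pref_loglik(r, prefs) + rating_loglik(r, ratings)
    return total / n_samples - kl_to_std_normal(mu, std)

# Toy usage: 4 states with one-hot features; trajectories are lists of state indices.
features = np.eye(4)
W_mu, W_logstd = np.zeros(4), np.zeros(4)
prefs = [([0, 1], [2, 3])]      # trajectory [0, 1] preferred over [2, 3]
ratings = [([0, 1], 1.0)]       # trajectory [0, 1] rated 1.0
value = elbo(features, W_mu, W_logstd, prefs, ratings)
```

Each feedback type enters the objective only through its own log-likelihood, so no hand-tuned weights between the terms appear in this sketch.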

Abstract

Reward learning typically relies on a single feedback type or combines multiple feedback types using manually weighted loss terms. It remains unclear how to jointly learn reward functions from heterogeneous feedback types (such as demonstrations, comparisons, ratings, and stops) that provide qualitatively different signals.

We address this challenge by formulating reward learning from multiple feedback types as Bayesian inference over a shared latent reward function, where each feedback type contributes information through an explicit likelihood. We introduce a scalable amortized variational inference approach that learns a shared reward encoder and feedback-specific likelihood decoders, trained by optimizing a single evidence lower bound.
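
As a sketch of what this formulation might look like, the block below writes out the posterior and the evidence lower bound in notation that is assumed here rather than taken from the paper: $r$ is the latent reward function, $\mathcal{D}_k$ is the data for feedback type $k$, $p_{\theta_k}$ is the likelihood decoder for that type, and $q_\phi$ is the variational posterior produced by the shared encoder. It also assumes the feedback types are conditionally independent given $r$.

```latex
% Posterior over the shared latent reward (assumed conditional independence)
p(r \mid \mathcal{D}_1, \dots, \mathcal{D}_K)
  \;\propto\; p(r) \prod_{k=1}^{K} p_{\theta_k}(\mathcal{D}_k \mid r)

% Single evidence lower bound over all feedback types
\mathcal{L}(\phi, \theta)
  = \mathbb{E}_{q_\phi(r)}\!\left[ \sum_{k=1}^{K} \log p_{\theta_k}(\mathcal{D}_k \mid r) \right]
  - \mathrm{KL}\!\left( q_\phi(r) \,\|\, p(r) \right)
  \;\le\; \log p(\mathcal{D}_1, \dots, \mathcal{D}_K)
```

Under these assumptions every feedback type contributes one log-likelihood term to the same bound, and the relative influence of each type is set by its likelihood rather than by a manually chosen weight.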

Our approach avoids reducing feedback to a common intermediate representation and eliminates the need for manual loss balancing. Across discrete and continuous-control benchmarks, we show that jointly inferred reward posteriors outperform single-type baselines, exploit complementary information across feedback types, and yield policies that are more robust to environment perturbations.


Feedback Types Complement Each Other

Each feedback type induces a characteristic pattern in the inferred reward and uncertainty. Demonstrations yield low-uncertainty estimates along expert trajectories but leave large regions underdetermined. Preferences provide broader coverage but can over-reward frequently visited states. Ratings reliably identify goals but offer limited landscape information. Stops strongly constrain unsafe regions while providing little guidance on desirable behavior.

Qualitative comparison of reward estimates from individual feedback types and their combination.

Reward estimates (color) and uncertainty (cell size) inferred from individual feedback types and their combination on a 10×10 grid environment. Combining all four modalities recovers the ground-truth reward with high fidelity.


Combining Feedback Improves Performance

Across six environments spanning tabular grid worlds and continuous-control tasks, combining all four feedback types with MAVRL consistently achieves the strongest overall performance. No single feedback type dominates across all settings.

  • Combining all four modalities achieves 100% normalized return in 3 of 6 environments and is the best or second-best method in 4 of 6.
  • MAVRL matches MCMC posterior inference quality at ~30× lower compute on grid environments, and scales to continuous-control tasks where MCMC is intractable.
  • Post-hoc reward averaging (training separate models per feedback type and ensembling) substantially underperforms joint inference, confirming that independent models miss cross-modal complementarity.
  • Equal-budget ablations confirm that performance gains reflect genuine complementarity, not simply more data.
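
One way to see why post-hoc averaging can lose to joint inference is a toy conjugate-Gaussian case for a single scalar reward. This is an illustration only, not one of the paper's experiments: the Gaussian prior and likelihoods, the function names, and the numbers are all assumptions chosen for the example. Joint inference multiplies the likelihoods, so each feedback type is weighted by its precision; averaging separately fitted posteriors gives an uninformative type the same vote as an informative one.

```python
import numpy as np

def joint_posterior(means, variances, prior_mean=0.0, prior_var=1.0):
    """Posterior over a scalar reward from a Gaussian prior and independent
    Gaussian observations (one per feedback type), combined in one update."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    precision = 1.0 / prior_var + np.sum(1.0 / variances)
    mean = (prior_mean / prior_var + np.sum(means / variances)) / precision
    return float(mean), float(1.0 / precision)

def posthoc_average(means, variances, prior_mean=0.0, prior_var=1.0):
    """Fit a separate posterior per feedback type, then average the posterior means."""
    per_type = [
        joint_posterior([m], [v], prior_mean, prior_var)[0]
        for m, v in zip(means, variances)
    ]
    return float(np.mean(per_type))

# Hypothetical setup: the true reward is 2.0. One feedback type observes it
# precisely (variance 0.01); the other is nearly uninformative (variance 100).
obs_means = [2.0, 0.0]
obs_vars = [0.01, 100.0]

joint_mean, joint_var = joint_posterior(obs_means, obs_vars)
avg_mean = posthoc_average(obs_means, obs_vars)
```

In this setup the joint estimate stays close to 2.0 (about 1.98), while the averaged estimate is pulled to about 0.99, because the uninformative type's posterior simply falls back to the prior mean and drags the ensemble with it.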

Multi-Type Feedback Is More Robust

Reward functions learned from multiple feedback types degrade more gracefully when the environment changes at deployment. Under dynamics perturbations — increased stochasticity, altered physics, stronger wind — policies trained on multi-type rewards are generally more stable, outperforming single-feedback baselines in the grid environments and remaining competitive in continuous-control tasks.

Robustness under dynamics perturbations across four environments.

Normalized returns under increasing dynamics perturbations across four environments. Multi-type feedback (PDRS: preferences, demonstrations, ratings, and stops; shown in bold) generally degrades more gracefully than single-type baselines and imitation.

MAVRL is also resilient to feedback misspecification: when one feedback channel is corrupted (e.g., noisier preferences or miscalibrated ratings), the remaining modalities compensate. In 3 of 4 corruption scenarios, MAVRL retains ≥90% of its well-specified performance, while single-modality baselines collapse.


Citation

@inproceedings{baur2026mavrl,
  title={MAVRL: Learning Reward Functions from Multiple Feedback Types with Amortized Variational Inference},
  author={Baur, Rapha{\"e}l and Metz, Yannick and Gkoulta, Maria and El-Assady, Mennatallah and Ramponi, Giorgia and Kleine Buening, Thomas},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2026},
}