Soft Adaptive Policy Optimization

SAPO: A Stable and Performant Reinforcement Learning Method for Training Large Language Models

This article introduces Soft Adaptive Policy Optimization (SAPO), a reinforcement learning method that stabilizes and improves policy optimization when training large language models.