In on-policy RL training (RLHF/GRPO/DAPO), the rollout phase dominates runtime, typically accounting for over 90% of total training time. Due to the highly variable response lengths across samples, ...
Code for paper "Prompt Engineering a Prompt Engineer" (https://arxiv.org/abs/2311.05661), to appear at ACL 2024 (Findings). In the paper, conda create --name pe2 ...