Sleep-time compute is introduced as a method for improving the efficiency of large language models: during idle time, before any query arrives, the model anticipates likely questions and pre-computes useful inferences about its context, reducing the reasoning needed at test time. The study reports that this approach can lower test-time compute requirements by approximately 5x and improve accuracy by up to 18% on specific reasoning tasks. A multi-query extension is also proposed that amortizes the sleep-time computation across related queries about the same context, further reducing the average compute cost per query.
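A minimal sketch of the idea, assuming a generic `llm()` helper and hypothetical function names (these are illustrative, not the paper's actual interfaces): an offline "sleep-time" pass turns the raw context into pre-computed notes with a generous token budget, and the online pass answers each query against those notes with a much smaller budget.

```python
def llm(prompt: str, max_tokens: int) -> str:
    """Stand-in for a call to any chat/completions model."""
    raise NotImplementedError("wire up your own model client here")


def sleep_time_compute(context: str) -> str:
    """Offline phase: anticipate likely questions and pre-compute useful
    inferences about the context before any query arrives."""
    prompt = (
        "Study the context below. Derive facts, intermediate results, and "
        "likely answers that would help respond to future questions.\n\n"
        f"Context:\n{context}"
    )
    # Large budget is acceptable here: no user is waiting on this call.
    return llm(prompt, max_tokens=2048)


def answer_query(context: str, notes: str, query: str) -> str:
    """Online phase: answer with a reduced test-time budget by reusing
    the pre-computed notes."""
    prompt = (
        f"Context:\n{context}\n\n"
        f"Pre-computed notes:\n{notes}\n\n"
        f"Question: {query}\nAnswer concisely."
    )
    return llm(prompt, max_tokens=256)


def answer_many(context: str, queries: list[str]) -> list[str]:
    """Multi-query amortization: one sleep-time pass serves many related queries."""
    notes = sleep_time_compute(context)  # paid once per context
    return [answer_query(context, notes, q) for q in queries]
```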
The paper examines reward modeling for reinforcement learning with large language models, focusing on how reward quality can be scaled at inference time. It introduces Self-Principled Critique Tuning (SPCT), which trains a generative reward model to produce its own evaluation principles and critiques, and adds a meta reward model that guides the aggregation of sampled judgments during inference. Empirical results show that SPCT markedly improves both the quality and the inference-time scalability of reward models compared with existing methods.
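A minimal sketch of the inference-time scaling loop described above, under stated assumptions: `generate_judgment` and `meta_rm_score` are hypothetical callables standing in for the generative reward model and the meta reward model, and the sample/vote procedure is a simplified illustration rather than the paper's exact algorithm.

```python
from collections import Counter
from typing import Callable, Optional


def scaled_reward(
    query: str,
    responses: list[str],
    generate_judgment: Callable[[str, list[str]], dict[int, int]],
    meta_rm_score: Optional[Callable[[dict[int, int]], float]] = None,
    num_samples: int = 8,
    keep_top: int = 4,
) -> dict[int, int]:
    """Score candidate responses by sampling multiple generative judgments
    and aggregating them by voting, optionally filtered by a meta reward model."""
    # 1. Sample several independent judgments; each maps response index -> score.
    judgments = [generate_judgment(query, responses) for _ in range(num_samples)]

    # 2. Optionally keep only the judgments the meta reward model rates highest.
    if meta_rm_score is not None:
        judgments.sort(key=meta_rm_score, reverse=True)
        judgments = judgments[:keep_top]

    # 3. Vote: sum the scores across the retained judgments.
    totals: Counter = Counter()
    for judgment in judgments:
        totals.update(judgment)
    return dict(totals)
```

Increasing `num_samples` is the scaling knob: more sampled judgments cost more inference compute but, per the paper's findings, yield a more reliable aggregate reward signal.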