Sleep-time compute lets large language models reason offline about a context before any query arrives: the model anticipates likely user queries and pre-computes useful inferences, which reduces the compute needed at test time. The study reports that this approach cuts the test-time compute required to reach the same accuracy by approximately 5x and, when sleep-time compute is scaled up, improves accuracy by up to 18% on specific reasoning tasks. A Multi-Query extension amortizes the sleep-time cost across related queries about the same context, lowering the average cost per query.
sleep-time-compute
large-language-models
inference-scaling
reasoning-tasks
query-predictability