Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum

Nived Rajaraman, Audrey Huang, Miro Dudik, Rob Schapire, Dylan Foster, Akshay Krishnamurthy
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:5518-5555, 2026.

Abstract

Chain-of-thought reasoning, where language models expend additional computation by producing thinking tokens prior to final responses, has driven significant advances in model capabilities. However, training these reasoning models is extremely costly in terms of both data and compute, as it involves collecting long traces of reasoning behavior from humans or synthetic generators and further post-training the model via reinforcement learning. Are these costs fundamental, or can they be reduced through better algorithmic design? We show that \textit{autocurriculum}—where the model uses its own performance to decide which problems to focus training on—provably improves upon standard training recipes for both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, we show that autocurriculum requires \textit{exponentially} fewer reasoning demonstrations than non-adaptive fine-tuning (Joshi et al., 2025), by focusing teacher supervision on prompts where the current model struggles. For RL fine-tuning, autocurriculum \textit{decouples} the computational cost from the quality of the reference model, reducing the latter to a burn-in cost that is nearly independent of the target accuracy. These improvements arise purely from adaptive data selection, drawing on classical techniques from boosting (Freund and Schapire, 1997) and learning from counterexamples (Angluin, 1987), and requiring no assumption on the distribution or difficulty of prompts.

Cite this Paper


BibTeX
@InProceedings{pmlr-v336-rajaraman26a,
  title = {Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum},
  author = {Rajaraman, Nived and Huang, Audrey and Dudik, Miro and Schapire, Rob and Foster, Dylan and Krishnamurthy, Akshay},
  booktitle = {Proceedings of Thirty Ninth Conference on Learning Theory},
  pages = {5518--5555},
  year = {2026},
  editor = {Hanneke, Steve and Lattimore, Tor},
  volume = {336},
  series = {Proceedings of Machine Learning Research},
  month = {29 Jun--03 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v336/main/assets/rajaraman26a/rajaraman26a.pdf},
  url = {https://proceedings.mlr.press/v336/rajaraman26a.html},
  abstract = {Chain-of-thought reasoning, where language models expend additional computation by producing thinking tokens prior to final responses, has driven significant advances in model capabilities. However, training these reasoning models is extremely costly in terms of both data and compute, as it involves collecting long traces of reasoning behavior from humans or synthetic generators and further post-training the model via reinforcement learning. Are these costs fundamental, or can they be reduced through better algorithmic design? We show that \textit{autocurriculum}—where the model uses its own performance to decide which problems to focus training on—provably improves upon standard training recipes for both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, we show that autocurriculum requires \textit{exponentially} fewer reasoning demonstrations than non-adaptive fine-tuning (Joshi et al., 2025), by focusing teacher supervision on prompts where the current model struggles. For RL fine-tuning, autocurriculum \textit{decouples} the computational cost from the quality of the reference model, reducing the latter to a burn-in cost that is nearly independent of the target accuracy. These improvements arise purely from adaptive data selection, drawing on classical techniques from boosting (Freund and Schapire, 1997) and learning from counterexamples (Angluin, 1987), and requiring no assumption on the distribution or difficulty of prompts.}
}
Endnote
%0 Conference Paper
%T Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum
%A Nived Rajaraman
%A Audrey Huang
%A Miro Dudik
%A Rob Schapire
%A Dylan Foster
%A Akshay Krishnamurthy
%B Proceedings of Thirty Ninth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2026
%E Steve Hanneke
%E Tor Lattimore
%F pmlr-v336-rajaraman26a
%I PMLR
%P 5518--5555
%U https://proceedings.mlr.press/v336/rajaraman26a.html
%V 336
%X Chain-of-thought reasoning, where language models expend additional computation by producing thinking tokens prior to final responses, has driven significant advances in model capabilities. However, training these reasoning models is extremely costly in terms of both data and compute, as it involves collecting long traces of reasoning behavior from humans or synthetic generators and further post-training the model via reinforcement learning. Are these costs fundamental, or can they be reduced through better algorithmic design? We show that \textit{autocurriculum}—where the model uses its own performance to decide which problems to focus training on—provably improves upon standard training recipes for both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, we show that autocurriculum requires \textit{exponentially} fewer reasoning demonstrations than non-adaptive fine-tuning (Joshi et al., 2025), by focusing teacher supervision on prompts where the current model struggles. For RL fine-tuning, autocurriculum \textit{decouples} the computational cost from the quality of the reference model, reducing the latter to a burn-in cost that is nearly independent of the target accuracy. These improvements arise purely from adaptive data selection, drawing on classical techniques from boosting (Freund and Schapire, 1997) and learning from counterexamples (Angluin, 1987), and requiring no assumption on the distribution or difficulty of prompts.
APA
Rajaraman, N., Huang, A., Dudik, M., Schapire, R., Foster, D., & Krishnamurthy, A. (2026). Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:5518-5555. Available from https://proceedings.mlr.press/v336/rajaraman26a.html.
