Language Models as Science Tutors

Alexis Chevalier; Jiayi Geng; Alexander Wettig; Howard Chen; Sebastian Mizera; Toni Annala; Max Aragon; Arturo Rodriguez Fanlo; Simon Frieder; Simon Machado; Akshara Prabhakar; Ellie Thieu; Jiachen T. Wang; Zirui Wang; Xindi Wu; Mengzhou Xia; Wenhan Xia; Jiatong Yu; Junjie Zhu; Zhiyong Ren; Sanjeev Arora; Danqi Chen

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:8310-8335, 2024.

Abstract

NLP has recently made exciting progress toward training language models (LMs) with strong scientific problem-solving skills. However, model development has not focused on real-life use-cases of LMs for science, including applications in education that require processing long scientific documents. To address this, we introduce TutorEval and TutorChat. TutorEval is a diverse question-answering benchmark consisting of questions about long chapters from STEM textbooks, written by experts. TutorEval helps measure real-life usability of LMs as scientific assistants, and it is the first benchmark combining long contexts, free-form generation, and multi-disciplinary scientific knowledge. Moreover, we show that fine-tuning base models with existing dialogue datasets leads to poor performance on TutorEval. Therefore, we create TutorChat, a dataset of 80,000 long synthetic dialogues about textbooks. We use TutorChat to fine-tune Llemma models with 7B and 34B parameters. These LM tutors specialized in math have a 32K-token context window, and they excel at TutorEval while performing strongly on GSM8K and MATH. Our datasets build on open-source materials, and we release our models, data, and evaluations publicly.

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-chevalier24a,
  title = {Language Models as Science Tutors},
  author =       {Chevalier, Alexis and Geng, Jiayi and Wettig, Alexander and Chen, Howard and Mizera, Sebastian and Annala, Toni and Aragon, Max and Fanlo, Arturo Rodriguez and Frieder, Simon and Machado, Simon and Prabhakar, Akshara and Thieu, Ellie and Wang, Jiachen T. and Wang, Zirui and Wu, Xindi and Xia, Mengzhou and Xia, Wenhan and Yu, Jiatong and Zhu, Junjie and Ren, Zhiyong and Arora, Sanjeev and Chen, Danqi},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages = {8310--8335},
  year = {2024},
  editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = {235},
  series = {Proceedings of Machine Learning Research},
  month = {21--27 Jul},
  publisher =    {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/chevalier24a/chevalier24a.pdf},
  url = {https://proceedings.mlr.press/v235/chevalier24a.html},
  abstract = 	 {NLP has recently made exciting progress toward training language models (LMs) with strong scientific problem-solving skills. However, model development has not focused on real-life use-cases of LMs for science, including applications in education that require processing long scientific documents. To address this, we introduce TutorEval and TutorChat. TutorEval is a diverse question-answering benchmark consisting of questions about long chapters from STEM textbooks, written by experts. TutorEval helps measure real-life usability of LMs as scientific assistants, and it is the first benchmark combining long contexts, free-form generation, and multi-disciplinary scientific knowledge. Moreover, we show that fine-tuning base models with existing dialogue datasets leads to poor performance on TutorEval. Therefore, we create TutorChat, a dataset of 80,000 long synthetic dialogues about textbooks. We use TutorChat to fine-tune Llemma models with 7B and 34B parameters. These LM tutors specialized in math have a 32K-token context window, and they excel at TutorEval while performing strongly on GSM8K and MATH. Our datasets build on open-source materials, and we release our models, data, and evaluations publicly.}
}

Endnote

%0 Conference Paper
%T Language Models as Science Tutors
%A Alexis Chevalier
%A Jiayi Geng
%A Alexander Wettig
%A Howard Chen
%A Sebastian Mizera
%A Toni Annala
%A Max Aragon
%A Arturo Rodriguez Fanlo
%A Simon Frieder
%A Simon Machado
%A Akshara Prabhakar
%A Ellie Thieu
%A Jiachen T. Wang
%A Zirui Wang
%A Xindi Wu
%A Mengzhou Xia
%A Wenhan Xia
%A Jiatong Yu
%A Junjie Zhu
%A Zhiyong Ren
%A Sanjeev Arora
%A Danqi Chen
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-chevalier24a
%I PMLR
%P 8310--8335
%U https://proceedings.mlr.press/v235/chevalier24a.html
%V 235
%X NLP has recently made exciting progress toward training language models (LMs) with strong scientific problem-solving skills. However, model development has not focused on real-life use-cases of LMs for science, including applications in education that require processing long scientific documents. To address this, we introduce TutorEval and TutorChat. TutorEval is a diverse question-answering benchmark consisting of questions about long chapters from STEM textbooks, written by experts. TutorEval helps measure real-life usability of LMs as scientific assistants, and it is the first benchmark combining long contexts, free-form generation, and multi-disciplinary scientific knowledge. Moreover, we show that fine-tuning base models with existing dialogue datasets leads to poor performance on TutorEval. Therefore, we create TutorChat, a dataset of 80,000 long synthetic dialogues about textbooks. We use TutorChat to fine-tune Llemma models with 7B and 34B parameters. These LM tutors specialized in math have a 32K-token context window, and they excel at TutorEval while performing strongly on GSM8K and MATH. Our datasets build on open-source materials, and we release our models, data, and evaluations publicly.

APA


Chevalier, A., Geng, J., Wettig, A., Chen, H., Mizera, S., Annala, T., Aragon, M., Fanlo, A.R., Frieder, S., Machado, S., Prabhakar, A., Thieu, E., Wang, J.T., Wang, Z., Wu, X., Xia, M., Xia, W., Yu, J., Zhu, J., Ren, Z., Arora, S. & Chen, D. (2024). Language Models as Science Tutors. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:8310-8335. Available from https://proceedings.mlr.press/v235/chevalier24a.html.
