TypyBench: Evaluating LLM Type Inference for Untyped Python Repositories

Honghua Dong, Jiacheng Yang, Xun Deng, Yuhe Jiang, Gennady Pekhimenko, Fan Long, Xujie Si
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:14203-14230, 2025.

Abstract

Type inference for dynamic languages like Python is a persistent challenge in software engineering. While large language models (LLMs) have shown promise in code understanding, their type inference capabilities remain underexplored. We introduce TypyBench, a benchmark designed to evaluate LLMs’ type inference across entire Python repositories. TypyBench features two novel metrics: TypeSim, which captures nuanced semantic relationships between predicted and ground truth types, and TypeCheck, which assesses type consistency across codebases. Our evaluation of various LLMs on a curated dataset of 50 high-quality Python repositories reveals that, although LLMs achieve decent TypeSim scores, they struggle with complex nested types and exhibit significant type consistency errors. These findings suggest that future research should shift focus from improving type similarity to addressing repository-level consistency. TypyBench provides a foundation for this new direction, offering insights into model performance across different type complexities and usage contexts. Our code and data are available at https://github.com/typybench/typybench
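The abstract does not spell out how TypeSim and TypeCheck are computed. As a rough illustration only, and not the paper's implementation, the Python sketch below shows one plausible way to approximate the two ideas: repository-level type consistency via the off-the-shelf mypy checker, and a toy string-level similarity between a predicted annotation and a ground-truth one. The repository path and the 1.0/0.5/0.0 scoring rules are assumptions made purely for illustration.

# Illustrative sketch only: NOT the paper's TypeCheck/TypeSim implementation.
# It runs mypy (a standard Python type checker) over a repository whose annotations
# were filled in by an LLM, and computes a naive string-level type similarity.
import subprocess

def count_type_errors(repo_path: str) -> int:
    """Count mypy-reported errors as a crude proxy for repository-level type consistency."""
    result = subprocess.run(
        ["mypy", "--ignore-missing-imports", repo_path],
        capture_output=True,
        text=True,
    )
    # mypy emits one "error:" diagnostic per inconsistency it finds.
    return sum(1 for line in result.stdout.splitlines() if ": error:" in line)

def naive_type_similarity(predicted: str, ground_truth: str) -> float:
    """Toy similarity: exact match scores 1.0, matching outer constructor 0.5, else 0.0.
    The paper's TypeSim metric is more nuanced; this is only a stand-in."""
    norm = lambda t: t.replace(" ", "").lower()
    if norm(predicted) == norm(ground_truth):
        return 1.0
    # e.g. "list[int]" vs "list[str]" share the outer "list" constructor.
    if norm(predicted).split("[")[0] == norm(ground_truth).split("[")[0]:
        return 0.5
    return 0.0

if __name__ == "__main__":
    print(count_type_errors("path/to/annotated_repo"))        # hypothetical path
    print(naive_type_similarity("List[int]", "list[int]"))     # 1.0 after normalization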

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-dong25l,
  title     = {{T}ypy{B}ench: Evaluating {LLM} Type Inference for Untyped Python Repositories},
  author    = {Dong, Honghua and Yang, Jiacheng and Deng, Xun and Jiang, Yuhe and Pekhimenko, Gennady and Long, Fan and Si, Xujie},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {14203--14230},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/dong25l/dong25l.pdf},
  url       = {https://proceedings.mlr.press/v267/dong25l.html},
  abstract  = {Type inference for dynamic languages like Python is a persistent challenge in software engineering. While large language models (LLMs) have shown promise in code understanding, their type inference capabilities remain underexplored. We introduce TypyBench, a benchmark designed to evaluate LLMs’ type inference across entire Python repositories. TypyBench features two novel metrics: TypeSim, which captures nuanced semantic relationships between predicted and ground truth types, and TypeCheck, which assesses type consistency across codebases. Our evaluation of various LLMs on a curated dataset of 50 high-quality Python repositories reveals that, although LLMs achieve decent TypeSim scores, they struggle with complex nested types and exhibit significant type consistency errors. These findings suggest that future research should shift focus from improving type similarity to addressing repository-level consistency. TypyBench provides a foundation for this new direction, offering insights into model performance across different type complexities and usage contexts. Our code and data are available at https://github.com/typybench/typybench}
}
Endnote
%0 Conference Paper
%T TypyBench: Evaluating LLM Type Inference for Untyped Python Repositories
%A Honghua Dong
%A Jiacheng Yang
%A Xun Deng
%A Yuhe Jiang
%A Gennady Pekhimenko
%A Fan Long
%A Xujie Si
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-dong25l
%I PMLR
%P 14203--14230
%U https://proceedings.mlr.press/v267/dong25l.html
%V 267
%X Type inference for dynamic languages like Python is a persistent challenge in software engineering. While large language models (LLMs) have shown promise in code understanding, their type inference capabilities remain underexplored. We introduce TypyBench, a benchmark designed to evaluate LLMs’ type inference across entire Python repositories. TypyBench features two novel metrics: TypeSim, which captures nuanced semantic relationships between predicted and ground truth types, and TypeCheck, which assesses type consistency across codebases. Our evaluation of various LLMs on a curated dataset of 50 high-quality Python repositories reveals that, although LLMs achieve decent TypeSim scores, they struggle with complex nested types and exhibit significant type consistency errors. These findings suggest that future research should shift focus from improving type similarity to addressing repository-level consistency. TypyBench provides a foundation for this new direction, offering insights into model performance across different type complexities and usage contexts. Our code and data are available at https://github.com/typybench/typybench
APA
Dong, H., Yang, J., Deng, X., Jiang, Y., Pekhimenko, G., Long, F. & Si, X. (2025). TypyBench: Evaluating LLM Type Inference for Untyped Python Repositories. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:14203-14230. Available from https://proceedings.mlr.press/v267/dong25l.html.