CrossPyEval: Enhancing LLM-based Evaluation of Low-Resource Code via Code Translation

Weijie Wu, Ling-I Wu, Guoqiang Li
Proceedings of the 17th Asian Conference on Machine Learning, PMLR 304:846-861, 2025.

Abstract

Large language models (LLMs) have demonstrated remarkable performance in code generation and evaluation tasks, particularly for Python, which dominates the pre-training corpora. However, the evaluation of code in low-resource programming languages remains challenging due to limited data and suboptimal model alignment. In this paper, we propose CrossPyEval, a novel cross-language code evaluation framework that uses an LLM to translate code from other languages into Python, verifies consistency with an SMT solver, and then analyzes the translated code via abstract syntax trees before performing the final evaluation. Experiments on public benchmarks and our custom low-resource datasets demonstrate that CrossPyEval substantially boosts evaluation accuracy for non-Python languages, achieving up to an 8.83% improvement, and significantly enhances alignment with human judgments, with the Kendall correlation rising to as high as 0.689.
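The abstract's pipeline ends with an abstract-syntax-tree analysis of the translated Python code before the final evaluation. As a minimal, hypothetical sketch of what such an AST pass might extract (the paper does not specify the actual features; `summarize_python_ast` and its feature set are illustrative assumptions), Python's standard-library `ast` module can walk the translated code and collect structural signals:

```python
import ast

def summarize_python_ast(source: str) -> dict:
    """Collect simple structural features from translated Python code.

    Hypothetical helper: the feature set here (function names, loop,
    branch, and call counts) is an assumption for illustration, not
    CrossPyEval's actual analysis.
    """
    tree = ast.parse(source)
    features = {"functions": [], "loops": 0, "branches": 0, "calls": 0}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            features["functions"].append(node.name)
        elif isinstance(node, (ast.For, ast.While)):
            features["loops"] += 1
        elif isinstance(node, ast.If):
            features["branches"] += 1
        elif isinstance(node, ast.Call):
            features["calls"] += 1
    return features

# Example: a snippet as it might look after translation into Python.
translated = """
def gcd(a, b):
    while b:
        a, b = b, a % b
    return a
"""
print(summarize_python_ast(translated))
# → {'functions': ['gcd'], 'loops': 1, 'branches': 0, 'calls': 0}
```

Such a summary could then be passed alongside the translated code to the evaluating LLM; the SMT-based consistency check between source and translation is a separate step and is not sketched here.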

Cite this Paper


BibTeX
@InProceedings{pmlr-v304-wu25c,
  title     = {CrossPyEval: Enhancing LLM-based Evaluation of Low-Resource Code via Code Translation},
  author    = {Wu, Weijie and Wu, Ling-I and Li, Guoqiang},
  booktitle = {Proceedings of the 17th Asian Conference on Machine Learning},
  pages     = {846--861},
  year      = {2025},
  editor    = {Lee, Hung-yi and Liu, Tongliang},
  volume    = {304},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--12 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v304/main/assets/wu25c/wu25c.pdf},
  url       = {https://proceedings.mlr.press/v304/wu25c.html},
  abstract  = {Large language models (LLMs) have demonstrated remarkable performance in code generation and evaluation tasks, particularly for Python, which dominates the pre-training corpora. However, the evaluation of code in low-resource programming languages remains challenging due to limited data and suboptimal model alignment. In this paper, we propose CrossPyEval, a novel cross-language code evaluation framework that uses an LLM to translate code from other languages into Python, verifies consistency with an SMT solver, and then analyzes the translated code via abstract syntax trees before performing the final evaluation. Experiments on public benchmarks and our custom low-resource datasets demonstrate that CrossPyEval substantially boosts evaluation accuracy for non-Python languages, achieving up to an 8.83% improvement, and significantly enhances alignment with human judgments, with the Kendall correlation rising to as high as 0.689.}
}
Endnote
%0 Conference Paper
%T CrossPyEval: Enhancing LLM-based Evaluation of Low-Resource Code via Code Translation
%A Weijie Wu
%A Ling-I Wu
%A Guoqiang Li
%B Proceedings of the 17th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Hung-yi Lee
%E Tongliang Liu
%F pmlr-v304-wu25c
%I PMLR
%P 846--861
%U https://proceedings.mlr.press/v304/wu25c.html
%V 304
%X Large language models (LLMs) have demonstrated remarkable performance in code generation and evaluation tasks, particularly for Python, which dominates the pre-training corpora. However, the evaluation of code in low-resource programming languages remains challenging due to limited data and suboptimal model alignment. In this paper, we propose CrossPyEval, a novel cross-language code evaluation framework that uses an LLM to translate code from other languages into Python, verifies consistency with an SMT solver, and then analyzes the translated code via abstract syntax trees before performing the final evaluation. Experiments on public benchmarks and our custom low-resource datasets demonstrate that CrossPyEval substantially boosts evaluation accuracy for non-Python languages, achieving up to an 8.83% improvement, and significantly enhances alignment with human judgments, with the Kendall correlation rising to as high as 0.689.
APA
Wu, W., Wu, L., & Li, G. (2025). CrossPyEval: Enhancing LLM-based Evaluation of Low-Resource Code via Code Translation. Proceedings of the 17th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 304:846-861. Available from https://proceedings.mlr.press/v304/wu25c.html.