Learning and Evaluating Contextual Embedding of Source Code

Aditya Kanade; Petros Maniatis; Gogul Balakrishnan; Kensen Shi

Learning and Evaluating Contextual Embedding of Source Code

Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, Kensen Shi

Proceedings of the 37th International Conference on Machine Learning, PMLR 119:5110-5121, 2020.

Abstract

Recent research has achieved impressive results on understanding and improving source code by building up on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings, such as BERT, which can be fine-tuned for downstream tasks with less labeled data and training budget, while achieving better accuracies. However, there is no attempt yet to obtain a high-quality contextual embedding of source code, and to evaluate it on multiple program-understanding tasks simultaneously; that is the gap that this paper aims to mitigate. Specifically, first, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; and, second, we create an open-sourced benchmark that comprises five classification tasks and one program-repair task, akin to code-understanding tasks proposed in the literature before. We fine-tune CuBERT on our benchmark tasks, and compare the resulting models to different variants of Word2Vec token embeddings, BiLSTM and Transformer models, as well as published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training, and with fewer labeled examples. Future work on source-code embedding can benefit from reusing our benchmark, and from comparing against CuBERT models as a strong baseline.

Cite this Paper

BibTeX

@InProceedings{pmlr-v119-kanade20a,
  title = 	 {Learning and Evaluating Contextual Embedding of Source Code},
  author =       {Kanade, Aditya and Maniatis, Petros and Balakrishnan, Gogul and Shi, Kensen},
  booktitle = 	 {Proceedings of the 37th International Conference on Machine Learning},
  pages = 	 {5110--5121},
  year = 	 {2020},
  editor = 	 {III, Hal Daumé and Singh, Aarti},
  volume = 	 {119},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--18 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v119/kanade20a/kanade20a.pdf},
  url = 	 {https://proceedings.mlr.press/v119/kanade20a.html},
  abstract = 	 {Recent research has achieved impressive results on understanding and improving source code by building up on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings, such as BERT, which can be fine-tuned for downstream tasks with less labeled data and training budget, while achieving better accuracies. However, there is no attempt yet to obtain a high-quality contextual embedding of source code, and to evaluate it on multiple program-understanding tasks simultaneously; that is the gap that this paper aims to mitigate. Specifically, first, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; and, second, we create an open-sourced benchmark that comprises five classification tasks and one program-repair task, akin to code-understanding tasks proposed in the literature before. We fine-tune CuBERT on our benchmark tasks, and compare the resulting models to different variants of Word2Vec token embeddings, BiLSTM and Transformer models, as well as published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training, and with fewer labeled examples. Future work on source-code embedding can benefit from reusing our benchmark, and from comparing against CuBERT models as a strong baseline.}
}

Endnote

%0 Conference Paper
%T Learning and Evaluating Contextual Embedding of Source Code
%A Aditya Kanade
%A Petros Maniatis
%A Gogul Balakrishnan
%A Kensen Shi
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh	
%F pmlr-v119-kanade20a
%I PMLR
%P 5110--5121
%U https://proceedings.mlr.press/v119/kanade20a.html
%V 119
%X Recent research has achieved impressive results on understanding and improving source code by building up on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings, such as BERT, which can be fine-tuned for downstream tasks with less labeled data and training budget, while achieving better accuracies. However, there is no attempt yet to obtain a high-quality contextual embedding of source code, and to evaluate it on multiple program-understanding tasks simultaneously; that is the gap that this paper aims to mitigate. Specifically, first, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; and, second, we create an open-sourced benchmark that comprises five classification tasks and one program-repair task, akin to code-understanding tasks proposed in the literature before. We fine-tune CuBERT on our benchmark tasks, and compare the resulting models to different variants of Word2Vec token embeddings, BiLSTM and Transformer models, as well as published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training, and with fewer labeled examples. Future work on source-code embedding can benefit from reusing our benchmark, and from comparing against CuBERT models as a strong baseline.

APA

Kanade, A., Maniatis, P., Balakrishnan, G. & Shi, K.. (2020). Learning and Evaluating Contextual Embedding of Source Code. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:5110-5121 Available from https://proceedings.mlr.press/v119/kanade20a.html.

Learning and Evaluating Contextual Embedding of Source Code

Abstract

Cite this Paper

Related Material