Open Vocabulary Learning on Source Code with a Graph-Structured Cache

Milan Cvitkovic; Badal Singh; Animashree Anandkumar

Proceedings of the 36th International Conference on Machine Learning, PMLR 97:1475-1485, 2019.

Abstract

Machine learning models that take computer program source code as input typically use Natural Language Processing (NLP) techniques. However, a major challenge is that code is written using an open, rapidly changing vocabulary due to, e.g., the coinage of new variable and method names. Reasoning over such a vocabulary is not something for which most NLP methods are designed. We introduce a Graph-Structured Cache to address this problem; this cache contains a node for each new word the model encounters with edges connecting each word to its occurrences in the code. We find that combining this graph-structured cache strategy with recent Graph-Neural-Network-based models for supervised learning on code improves the models’ performance on a code completion task and a variable naming task — with over 100% relative improvement on the latter — at the cost of a moderate increase in computation time.
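As a rough illustration of the cache described above (one node per word, with edges to that word's occurrences in the code), the following Python sketch builds such a mapping from a list of identifiers. It is a simplified sketch under stated assumptions, not the authors' implementation: the helper names `split_identifier` and `build_cache`, the camelCase/snake_case splitting heuristic, and the use of list positions as stand-ins for occurrence nodes are all hypothetical choices made for this example.

```python
import re
from collections import defaultdict


def split_identifier(name):
    """Split a snake_case or camelCase identifier into lowercase words.

    The splitting heuristic is an assumption made for this sketch.
    """
    words = []
    for part in name.split("_"):
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]


def build_cache(identifiers):
    """Map each word (cache node) to the positions where it occurs (edges)."""
    cache = defaultdict(list)
    for pos, name in enumerate(identifiers):
        for word in split_identifier(name):
            if pos not in cache[word]:
                cache[word].append(pos)
    return dict(cache)


# Hypothetical identifiers standing in for names found in a source file.
identifiers = ["fileName", "file_path", "readFile"]
cache = build_cache(identifiers)
print(cache)
# {'file': [0, 1, 2], 'name': [0], 'path': [1], 'read': [2]}
```

In this toy version, a word never seen before (such as `path`) still gets its own node, which is the open-vocabulary property the abstract refers to. How these nodes are attached to a program graph and consumed by a Graph Neural Network is not shown here.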

Cite this Paper

BibTeX

@InProceedings{pmlr-v97-cvitkovic19b,
  title = 	 {Open Vocabulary Learning on Source Code with a Graph-Structured Cache},
  author =       {Cvitkovic, Milan and Singh, Badal and Anandkumar, Animashree},
  booktitle = 	 {Proceedings of the 36th International Conference on Machine Learning},
  pages = 	 {1475--1485},
  year = 	 {2019},
  editor = 	 {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume = 	 {97},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {09--15 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v97/cvitkovic19b/cvitkovic19b.pdf},
  url = 	 {https://proceedings.mlr.press/v97/cvitkovic19b.html},
  abstract = 	 {Machine learning models that take computer program source code as input typically use Natural Language Processing (NLP) techniques. However, a major challenge is that code is written using an open, rapidly changing vocabulary due to, e.g., the coinage of new variable and method names. Reasoning over such a vocabulary is not something for which most NLP methods are designed. We introduce a Graph-Structured Cache to address this problem; this cache contains a node for each new word the model encounters with edges connecting each word to its occurrences in the code. We find that combining this graph-structured cache strategy with recent Graph-Neural-Network-based models for supervised learning on code improves the models’ performance on a code completion task and a variable naming task — with over 100% relative improvement on the latter — at the cost of a moderate increase in computation time.}
}

Endnote

%0 Conference Paper
%T Open Vocabulary Learning on Source Code with a Graph-Structured Cache
%A Milan Cvitkovic
%A Badal Singh
%A Animashree Anandkumar
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-cvitkovic19b
%I PMLR
%P 1475--1485
%U https://proceedings.mlr.press/v97/cvitkovic19b.html
%V 97
%X Machine learning models that take computer program source code as input typically use Natural Language Processing (NLP) techniques. However, a major challenge is that code is written using an open, rapidly changing vocabulary due to, e.g., the coinage of new variable and method names. Reasoning over such a vocabulary is not something for which most NLP methods are designed. We introduce a Graph-Structured Cache to address this problem; this cache contains a node for each new word the model encounters with edges connecting each word to its occurrences in the code. We find that combining this graph-structured cache strategy with recent Graph-Neural-Network-based models for supervised learning on code improves the models’ performance on a code completion task and a variable naming task — with over 100% relative improvement on the latter — at the cost of a moderate increase in computation time.

APA

Cvitkovic, M., Singh, B. & Anandkumar, A. (2019). Open Vocabulary Learning on Source Code with a Graph-Structured Cache. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:1475-1485. Available from https://proceedings.mlr.press/v97/cvitkovic19b.html.
