Open Vocabulary Learning on Source Code with a Graph-Structured Cache

Milan Cvitkovic, Badal Singh, Animashree Anandkumar
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:1475-1485, 2019.

Abstract

Machine learning models that take computer program source code as input typically use Natural Language Processing (NLP) techniques. However, a major challenge is that code is written using an open, rapidly changing vocabulary due to, e.g., the coinage of new variable and method names. Reasoning over such a vocabulary is not something for which most NLP methods are designed. We introduce a Graph-Structured Cache to address this problem; this cache contains a node for each new word the model encounters with edges connecting each word to its occurrences in the code. We find that combining this graph-structured cache strategy with recent Graph-Neural-Network-based models for supervised learning on code improves the models’ performance on a code completion task and a variable naming task — with over 100% relative improvement on the latter — at the cost of a moderate increase in computation time.
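As a rough illustration of the cache idea described in the abstract, the following minimal sketch builds one cache entry per out-of-vocabulary word and records where each word occurs. It is not the authors' implementation: the regex tokenizer, the `build_cache` helper, the `known_vocab` set, and the use of token positions in place of graph edges are all assumptions made for this example.

```python
import re
from collections import defaultdict

# Assumption: identifier-like tokens stand in for the "words" of the program.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_cache(source: str, known_vocab: set[str]) -> dict[str, list[int]]:
    """Map each word outside known_vocab to the token positions where it occurs.

    Each key plays the role of a cache node; each stored position plays the
    role of an edge from that node to one occurrence in the code.
    """
    cache = defaultdict(list)
    for position, match in enumerate(IDENTIFIER.finditer(source)):
        word = match.group()
        if word not in known_vocab:
            cache[word].append(position)
    return dict(cache)


# Hypothetical input: "print" is treated as already in the fixed vocabulary.
snippet = "total = price * qty\nprint(total)"
cache = build_cache(snippet, known_vocab={"print"})
```

Here `total` occurs twice, so its cache entry links to two positions. A graph-based model could use such links to relate occurrences of a name it never saw during training.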

Cite this Paper


BibTeX
@InProceedings{pmlr-v97-cvitkovic19b,
  title = {Open Vocabulary Learning on Source Code with a Graph-Structured Cache},
  author = {Cvitkovic, Milan and Singh, Badal and Anandkumar, Animashree},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  pages = {1475--1485},
  year = {2019},
  editor = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume = {97},
  series = {Proceedings of Machine Learning Research},
  month = {09--15 Jun},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v97/cvitkovic19b/cvitkovic19b.pdf},
  url = {https://proceedings.mlr.press/v97/cvitkovic19b.html},
  abstract = {Machine learning models that take computer program source code as input typically use Natural Language Processing (NLP) techniques. However, a major challenge is that code is written using an open, rapidly changing vocabulary due to, e.g., the coinage of new variable and method names. Reasoning over such a vocabulary is not something for which most NLP methods are designed. We introduce a Graph-Structured Cache to address this problem; this cache contains a node for each new word the model encounters with edges connecting each word to its occurrences in the code. We find that combining this graph-structured cache strategy with recent Graph-Neural-Network-based models for supervised learning on code improves the models’ performance on a code completion task and a variable naming task — with over 100% relative improvement on the latter — at the cost of a moderate increase in computation time.}
}
Endnote
%0 Conference Paper
%T Open Vocabulary Learning on Source Code with a Graph-Structured Cache
%A Milan Cvitkovic
%A Badal Singh
%A Animashree Anandkumar
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-cvitkovic19b
%I PMLR
%P 1475--1485
%U https://proceedings.mlr.press/v97/cvitkovic19b.html
%V 97
%X Machine learning models that take computer program source code as input typically use Natural Language Processing (NLP) techniques. However, a major challenge is that code is written using an open, rapidly changing vocabulary due to, e.g., the coinage of new variable and method names. Reasoning over such a vocabulary is not something for which most NLP methods are designed. We introduce a Graph-Structured Cache to address this problem; this cache contains a node for each new word the model encounters with edges connecting each word to its occurrences in the code. We find that combining this graph-structured cache strategy with recent Graph-Neural-Network-based models for supervised learning on code improves the models’ performance on a code completion task and a variable naming task — with over 100% relative improvement on the latter — at the cost of a moderate increase in computation time.
APA
Cvitkovic, M., Singh, B. & Anandkumar, A. (2019). Open Vocabulary Learning on Source Code with a Graph-Structured Cache. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:1475-1485. Available from https://proceedings.mlr.press/v97/cvitkovic19b.html.
