Why do Nearest Neighbor Language Models Work?

Frank F. Xu; Uri Alon; Graham Neubig

Proceedings of the 40th International Conference on Machine Learning, PMLR 202:38325-38341, 2023.

Abstract

Language models (LMs) compute the probability of a text by sequentially computing a representation of an already-seen context and using this representation to predict the next word. Currently, most LMs calculate these representations through a neural network consuming the immediate previous context. Recently, however, retrieval-augmented LMs have been shown to improve over standard neural LMs by accessing information retrieved from a large datastore, in addition to their standard, parametric, next-word prediction. In this paper, we set out to understand why retrieval-augmented language models, and specifically k-nearest neighbor language models (kNN-LMs), perform better than standard parametric LMs, even when the k-nearest neighbor component retrieves examples from the same training set that the LM was originally trained on. To this end, we analyze the various dimensions along which kNN-LM diverges from standard LMs, and investigate these dimensions one by one. Empirically, we identify three main reasons why kNN-LM performs better than standard LMs: using a different input representation for predicting the next tokens, approximate kNN search, and the importance of softmax temperature for the kNN distribution. Further, we incorporate some of these insights into the standard parametric LM, improving performance without the need for an explicit retrieval component. The code is available at https://github.com/frankxu2004/knnlm-why.
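As a rough illustration of the "softmax temperature for the kNN distribution" mentioned in the abstract, the sketch below shows how a kNN-LM is commonly formulated: a softmax over negative neighbor distances gives a distribution over the retrieved next tokens, which is then interpolated with the parametric LM distribution. This is not taken from the paper's code; the function names, the toy distances, and the values of `temperature` and `lam` are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def knn_distribution(distances, next_tokens, temperature=1.0):
    """Turn retrieved neighbors into a next-token distribution.

    distances[i] is the distance to neighbor i, next_tokens[i] is the token
    that followed that neighbor in the datastore. (Hypothetical helper.)
    """
    # Softmax over negative distances, scaled by the temperature.
    logits = [-d / temperature for d in distances]
    max_logit = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - max_logit) for l in logits]
    total = sum(weights)
    probs = defaultdict(float)
    # Neighbors that share the same next token pool their probability mass.
    for w, tok in zip(weights, next_tokens):
        probs[tok] += w / total
    return dict(probs)


def interpolate(p_lm, p_knn, lam=0.25):
    """Mix the kNN and parametric distributions with weight lam (assumed value)."""
    vocab = set(p_lm) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_lm.get(t, 0.0) for t in vocab}


# Toy example: three retrieved neighbors, two of which were followed by "cat".
p_knn_sharp = knn_distribution([1.0, 2.0, 4.0], ["cat", "dog", "cat"], temperature=1.0)
p_knn_flat = knn_distribution([1.0, 2.0, 4.0], ["cat", "dog", "cat"], temperature=10.0)
p_lm = {"cat": 0.5, "dog": 0.3, "bird": 0.2}
p_final = interpolate(p_lm, p_knn_sharp, lam=0.25)
```

In this toy setup, a higher temperature flattens the kNN distribution, so less mass concentrates on the closest neighbor's token; the interpolation weight controls how much the retrieved neighbors can shift the parametric prediction.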

Cite this Paper

BibTeX


@InProceedings{pmlr-v202-xu23a,
  title = 	 {Why do Nearest Neighbor Language Models Work?},
  author =       {Xu, Frank F. and Alon, Uri and Neubig, Graham},
  booktitle = 	 {Proceedings of the 40th International Conference on Machine Learning},
  pages = 	 {38325--38341},
  year = 	 {2023},
  editor = 	 {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = 	 {202},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {23--29 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v202/xu23a/xu23a.pdf},
  url = 	 {https://proceedings.mlr.press/v202/xu23a.html},
  abstract = 	 {Language models (LMs) compute the probability of a text by sequentially computing a representation of an already-seen context and using this representation to predict the next word. Currently, most LMs calculate these representations through a neural network consuming the immediate previous context. However recently, retrieval-augmented LMs have shown to improve over standard neural LMs, by accessing information retrieved from a large datastore, in addition to their standard, parametric, next-word prediction. In this paper, we set out to understand why retrieval-augmented language models, and specifically why k-nearest neighbor language models (kNN-LMs) perform better than standard parametric LMs, even when the k-nearest neighbor component retrieves examples from the same training set that the LM was originally trained on. To this end, we perform analysis of various dimensions over which kNN-LM diverges from standard LMs, and investigate these dimensions one by one. Empirically, we identify three main reasons why kNN-LM performs better than standard LMs: using a different input representation for predicting the next tokens, approximate kNN search, and the importance of softmax temperature for the kNN distribution. Further, we incorporate some insights into the standard parametric LM, improving performance without the need for an explicit retrieval component. The code is available at https://github.com/frankxu2004/knnlm-why.}
}

Endnote

%0 Conference Paper
%T Why do Nearest Neighbor Language Models Work?
%A Frank F. Xu
%A Uri Alon
%A Graham Neubig
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-xu23a
%I PMLR
%P 38325--38341
%U https://proceedings.mlr.press/v202/xu23a.html
%V 202
%X Language models (LMs) compute the probability of a text by sequentially computing a representation of an already-seen context and using this representation to predict the next word. Currently, most LMs calculate these representations through a neural network consuming the immediate previous context. However recently, retrieval-augmented LMs have shown to improve over standard neural LMs, by accessing information retrieved from a large datastore, in addition to their standard, parametric, next-word prediction. In this paper, we set out to understand why retrieval-augmented language models, and specifically why k-nearest neighbor language models (kNN-LMs) perform better than standard parametric LMs, even when the k-nearest neighbor component retrieves examples from the same training set that the LM was originally trained on. To this end, we perform analysis of various dimensions over which kNN-LM diverges from standard LMs, and investigate these dimensions one by one. Empirically, we identify three main reasons why kNN-LM performs better than standard LMs: using a different input representation for predicting the next tokens, approximate kNN search, and the importance of softmax temperature for the kNN distribution. Further, we incorporate some insights into the standard parametric LM, improving performance without the need for an explicit retrieval component. The code is available at https://github.com/frankxu2004/knnlm-why.

APA


Xu, F.F., Alon, U. & Neubig, G. (2023). Why do Nearest Neighbor Language Models Work? Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:38325-38341. Available from https://proceedings.mlr.press/v202/xu23a.html.
