Rissanen Data Analysis: Examining Dataset Characteristics via Description Length

Ethan Perez, Douwe Kiela, Kyunghyun Cho
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:8500-8513, 2021.

Abstract

We introduce a method to determine if a certain capability helps to achieve an accurate model of given data. We view labels as being generated from the inputs by a program composed of subroutines with different capabilities, and we posit that a subroutine is useful if and only if the minimal program that invokes it is shorter than the one that does not. Since minimum program length is uncomputable, we instead estimate the labels’ minimum description length (MDL) as a proxy, giving us a theoretically-grounded method for analyzing dataset characteristics. We call the method Rissanen Data Analysis (RDA) after the father of MDL, and we showcase its applicability on a wide variety of settings in NLP, ranging from evaluating the utility of generating subquestions before answering a question, to analyzing the value of rationales and explanations, to investigating the importance of different parts of speech, and uncovering dataset gender bias.
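The comparison described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes prequential (online) coding as the description-length estimator, substitutes a Laplace-smoothed count model for a learned model, and runs on hypothetical XOR toy data. The names `prequential_codelength`, `without`, and `with_cap` are invented for the example.

```python
import math
from collections import Counter, defaultdict


def prequential_codelength(features, labels, num_classes):
    """Bits needed to send the labels one at a time, coding each label with a
    Laplace-smoothed count model fit only on the examples that preceded it."""
    counts = defaultdict(Counter)  # feature value -> label counts seen so far
    total_bits = 0.0
    for x, y in zip(features, labels):
        seen = counts[x]
        prob = (seen[y] + 1) / (sum(seen.values()) + num_classes)
        total_bits += -math.log2(prob)
        seen[y] += 1  # update the model only after the label has been coded
    return total_bits


# Hypothetical toy data: the label is the XOR of two input bits.
inputs = [(a, b) for _ in range(50) for a in (0, 1) for b in (0, 1)]
labels = [a ^ b for a, b in inputs]

# Without the capability: the model sees only the first bit.
without = prequential_codelength([a for a, _ in inputs], labels, num_classes=2)

# With the capability: an assumed subroutine also exposes the second bit.
with_cap = prequential_codelength(inputs, labels, num_classes=2)

print(f"codelength without capability: {without:.1f} bits")
print(f"codelength with capability:    {with_cap:.1f} bits")
print("capability judged useful:", with_cap < without)
```

On this toy data the codelength with the second bit is 4 * log2(51), about 22.7 bits, which is well below the codelength without it, so the capability would count as useful under the criterion stated above.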

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-perez21a,
  title = {Rissanen Data Analysis: Examining Dataset Characteristics via Description Length},
  author = {Perez, Ethan and Kiela, Douwe and Cho, Kyunghyun},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages = {8500--8513},
  year = {2021},
  editor = {Meila, Marina and Zhang, Tong},
  volume = {139},
  series = {Proceedings of Machine Learning Research},
  month = {18--24 Jul},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v139/perez21a/perez21a.pdf},
  url = {https://proceedings.mlr.press/v139/perez21a.html},
  abstract = {We introduce a method to determine if a certain capability helps to achieve an accurate model of given data. We view labels as being generated from the inputs by a program composed of subroutines with different capabilities, and we posit that a subroutine is useful if and only if the minimal program that invokes it is shorter than the one that does not. Since minimum program length is uncomputable, we instead estimate the labels’ minimum description length (MDL) as a proxy, giving us a theoretically-grounded method for analyzing dataset characteristics. We call the method Rissanen Data Analysis (RDA) after the father of MDL, and we showcase its applicability on a wide variety of settings in NLP, ranging from evaluating the utility of generating subquestions before answering a question, to analyzing the value of rationales and explanations, to investigating the importance of different parts of speech, and uncovering dataset gender bias.}
}
Endnote
%0 Conference Paper
%T Rissanen Data Analysis: Examining Dataset Characteristics via Description Length
%A Ethan Perez
%A Douwe Kiela
%A Kyunghyun Cho
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-perez21a
%I PMLR
%P 8500--8513
%U https://proceedings.mlr.press/v139/perez21a.html
%V 139
%X We introduce a method to determine if a certain capability helps to achieve an accurate model of given data. We view labels as being generated from the inputs by a program composed of subroutines with different capabilities, and we posit that a subroutine is useful if and only if the minimal program that invokes it is shorter than the one that does not. Since minimum program length is uncomputable, we instead estimate the labels’ minimum description length (MDL) as a proxy, giving us a theoretically-grounded method for analyzing dataset characteristics. We call the method Rissanen Data Analysis (RDA) after the father of MDL, and we showcase its applicability on a wide variety of settings in NLP, ranging from evaluating the utility of generating subquestions before answering a question, to analyzing the value of rationales and explanations, to investigating the importance of different parts of speech, and uncovering dataset gender bias.
APA
Perez, E., Kiela, D., & Cho, K. (2021). Rissanen Data Analysis: Examining Dataset Characteristics via Description Length. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:8500-8513. Available from https://proceedings.mlr.press/v139/perez21a.html.
