DAGnosis: Localized Identification of Data Inconsistencies using Structures

Nicolas Huynh, Jeroen Berrevoets, Nabeel Seedat, Jonathan Crabbé, Zhaozhi Qian, Mihaela van der Schaar
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:1864-1872, 2024.

Abstract

Identification and appropriate handling of inconsistencies in data at deployment time is crucial for the reliable use of machine learning models. While recent data-centric methods are able to identify such inconsistencies with respect to the training set, they suffer from two key limitations: (1) suboptimality in settings where features exhibit statistical independencies, due to their use of compressive representations, and (2) lack of localization to pinpoint why a sample might be flagged as inconsistent, which is important to guide future data collection. We address these two limitations by using directed acyclic graphs (DAGs) to encode the probability distribution and independencies of the training set’s features as a structure. Our method, called DAGnosis, leverages these structural interactions to draw data-centric conclusions. DAGnosis enables the causes of inconsistencies to be localized on a DAG, an aspect overlooked by previous approaches. Moreover, we show empirically that leveraging these interactions (1) leads to more accurate detection of inconsistencies and (2) provides more detailed insights into why some samples are flagged.

Cite this Paper


BibTeX
@InProceedings{pmlr-v238-huynh24a,
  title = {{DAGnosis}: Localized Identification of Data Inconsistencies using Structures},
  author = {Huynh, Nicolas and Berrevoets, Jeroen and Seedat, Nabeel and Crabb\'{e}, Jonathan and Qian, Zhaozhi and van der Schaar, Mihaela},
  booktitle = {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics},
  pages = {1864--1872},
  year = {2024},
  editor = {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen},
  volume = {238},
  series = {Proceedings of Machine Learning Research},
  month = {02--04 May},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v238/huynh24a/huynh24a.pdf},
  url = {https://proceedings.mlr.press/v238/huynh24a.html},
  abstract = {Identification and appropriate handling of inconsistencies in data at deployment time is crucial to reliably use machine learning models. While recent data-centric methods are able to identify such inconsistencies with respect to the training set, they suffer from two key limitations: (1) suboptimality in settings where features exhibit statistical independencies, due to their usage of compressive representations and (2) lack of localization to pin-point why a sample might be flagged as inconsistent, which is important to guide future data collection. We solve these two fundamental limitations using directed acyclic graphs (DAGs) to encode the training set’s features probability distribution and independencies as a structure. Our method, called DAGnosis, leverages these structural interactions to bring valuable and insightful data-centric conclusions. DAGnosis unlocks the localization of the causes of inconsistencies on a DAG, an aspect overlooked by previous approaches. Moreover, we show empirically that leveraging these interactions (1) leads to more accurate conclusions in detecting inconsistencies, as well as (2) provides more detailed insights into why some samples are flagged.}
}
Endnote
%0 Conference Paper
%T DAGnosis: Localized Identification of Data Inconsistencies using Structures
%A Nicolas Huynh
%A Jeroen Berrevoets
%A Nabeel Seedat
%A Jonathan Crabbé
%A Zhaozhi Qian
%A Mihaela van der Schaar
%B Proceedings of The 27th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2024
%E Sanjoy Dasgupta
%E Stephan Mandt
%E Yingzhen Li
%F pmlr-v238-huynh24a
%I PMLR
%P 1864--1872
%U https://proceedings.mlr.press/v238/huynh24a.html
%V 238
%X Identification and appropriate handling of inconsistencies in data at deployment time is crucial to reliably use machine learning models. While recent data-centric methods are able to identify such inconsistencies with respect to the training set, they suffer from two key limitations: (1) suboptimality in settings where features exhibit statistical independencies, due to their usage of compressive representations and (2) lack of localization to pin-point why a sample might be flagged as inconsistent, which is important to guide future data collection. We solve these two fundamental limitations using directed acyclic graphs (DAGs) to encode the training set’s features probability distribution and independencies as a structure. Our method, called DAGnosis, leverages these structural interactions to bring valuable and insightful data-centric conclusions. DAGnosis unlocks the localization of the causes of inconsistencies on a DAG, an aspect overlooked by previous approaches. Moreover, we show empirically that leveraging these interactions (1) leads to more accurate conclusions in detecting inconsistencies, as well as (2) provides more detailed insights into why some samples are flagged.
APA
Huynh, N., Berrevoets, J., Seedat, N., Crabbé, J., Qian, Z., & van der Schaar, M. (2024). DAGnosis: Localized Identification of Data Inconsistencies using Structures. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:1864-1872. Available from https://proceedings.mlr.press/v238/huynh24a.html.
