Naive Bayes Classifiers over Missing Data: Decision and Poisoning

Song Bian, Xiating Ouyang, Zhiwei Fan, Paraschos Koutris
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:3913-3934, 2024.

Abstract

We study the certifiable robustness of ML classifiers on dirty datasets that could contain missing values. A test point is certifiably robust for an ML classifier if the classifier returns the same prediction for that test point, regardless of which cleaned version (among exponentially many) of the dirty dataset the classifier is trained on. In this paper, we show theoretically that for Naive Bayes Classifiers (NBC) over dirty datasets with missing values: (i) there exists an efficient polynomial time algorithm to decide whether multiple input test points are all certifiably robust over a dirty dataset; and (ii) the data poisoning attack, which aims to make all input test points certifiably non-robust by inserting missing cells to the clean dataset, is in polynomial time for single test points but NP-complete for multiple test points. Extensive experiments demonstrate that our algorithms are efficient and outperform existing baselines.
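The notion of certifiable robustness in the abstract can be illustrated with a small brute-force sketch: enumerate every completion of the dirty dataset, train a categorical Naive Bayes classifier on each, and check whether the test point's prediction is invariant. This sketch is exponential in the number of missing cells, unlike the polynomial-time decision algorithm the paper contributes; all names below (`nbc_predict`, `certifiably_robust`, the Laplace-smoothing choice, the assumption that only feature cells, not labels, are missing) are illustrative assumptions, not the paper's implementation.

```python
from itertools import product

MISSING = None  # marker for a missing (dirty) cell

def nbc_predict(rows, labels, x, domains, classes, alpha=1.0):
    """Categorical Naive Bayes with Laplace smoothing.

    Ties break toward the class listed first in `classes`.
    """
    n = len(rows)
    best, best_score = None, float("-inf")
    for c in classes:
        idx = [i for i, y in enumerate(labels) if y == c]
        # Smoothed class prior times smoothed per-feature likelihoods.
        score = (len(idx) + alpha) / (n + alpha * len(classes))
        for j, v in enumerate(x):
            cnt = sum(1 for i in idx if rows[i][j] == v)
            score *= (cnt + alpha) / (len(idx) + alpha * len(domains[j]))
        if score > best_score:
            best, best_score = c, score
    return best

def completions(rows, domains):
    """Yield every cleaned version: fill each missing cell from its domain."""
    holes = [(i, j) for i, r in enumerate(rows)
             for j, v in enumerate(r) if v is MISSING]
    for fill in product(*(domains[j] for _, j in holes)):
        world = [list(r) for r in rows]
        for (i, j), v in zip(holes, fill):
            world[i][j] = v
        yield world

def certifiably_robust(rows, labels, x, domains, classes):
    """True iff every completion yields the same prediction for x."""
    preds = {nbc_predict(w, labels, x, domains, classes)
             for w in completions(rows, domains)}
    return len(preds) == 1
```

For example, with one missing cell and binary feature domains there are only two possible worlds to check; the paper's algorithms avoid this exponential enumeration entirely.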

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-bian24b,
  title     = {Naive {B}ayes Classifiers over Missing Data: Decision and Poisoning},
  author    = {Bian, Song and Ouyang, Xiating and Fan, Zhiwei and Koutris, Paraschos},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {3913--3934},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/bian24b/bian24b.pdf},
  url       = {https://proceedings.mlr.press/v235/bian24b.html},
  abstract  = {We study the certifiable robustness of ML classifiers on dirty datasets that could contain missing values. A test point is certifiably robust for an ML classifier if the classifier returns the same prediction for that test point, regardless of which cleaned version (among exponentially many) of the dirty dataset the classifier is trained on. In this paper, we show theoretically that for Naive Bayes Classifiers (NBC) over dirty datasets with missing values: (i) there exists an efficient polynomial time algorithm to decide whether multiple input test points are all certifiably robust over a dirty dataset; and (ii) the data poisoning attack, which aims to make all input test points certifiably non-robust by inserting missing cells to the clean dataset, is in polynomial time for single test points but NP-complete for multiple test points. Extensive experiments demonstrate that our algorithms are efficient and outperform existing baselines.}
}
Endnote
%0 Conference Paper
%T Naive Bayes Classifiers over Missing Data: Decision and Poisoning
%A Song Bian
%A Xiating Ouyang
%A Zhiwei Fan
%A Paraschos Koutris
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-bian24b
%I PMLR
%P 3913--3934
%U https://proceedings.mlr.press/v235/bian24b.html
%V 235
%X We study the certifiable robustness of ML classifiers on dirty datasets that could contain missing values. A test point is certifiably robust for an ML classifier if the classifier returns the same prediction for that test point, regardless of which cleaned version (among exponentially many) of the dirty dataset the classifier is trained on. In this paper, we show theoretically that for Naive Bayes Classifiers (NBC) over dirty datasets with missing values: (i) there exists an efficient polynomial time algorithm to decide whether multiple input test points are all certifiably robust over a dirty dataset; and (ii) the data poisoning attack, which aims to make all input test points certifiably non-robust by inserting missing cells to the clean dataset, is in polynomial time for single test points but NP-complete for multiple test points. Extensive experiments demonstrate that our algorithms are efficient and outperform existing baselines.
APA
Bian, S., Ouyang, X., Fan, Z. & Koutris, P. (2024). Naive Bayes Classifiers over Missing Data: Decision and Poisoning. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:3913-3934. Available from https://proceedings.mlr.press/v235/bian24b.html.