Interactive Anomaly Detection in Mixed Tabular Data using Bayesian Networks
Proceedings of the 10th International Conference on Probabilistic Graphical Models, PMLR 138:185-196, 2020.
Abstract
Improvements in processing power over the last decades
have quickly led to an increasing use of data analyses involving massive
data sets. To retrieve insightful information from any data-driven
approach, ensuring good data quality is pivotal. Manual
correction of massive data sets requires tremendous effort, is prone to
errors, and is very costly. While knowledge of a specific
field can often allow the development of efficient models for anomaly
detection and data correction, this knowledge is sometimes
unavailable and a more generic approach is then needed. This paper
presents a novel approach to anomaly detection and correction in mixed
tabular data using Bayesian Networks. We present an algorithm for
detecting anomalies and offering correction hints based on Jensen scores
computed within the Markov Blankets of the considered variables. We also
discuss the incremental correction of the detection model using user
feedback, as well as additional aspects related to discretization in
mixed data and its effects on detection efficiency. Finally, we
discuss how functional dependencies can be managed to detect errors
while improving faithfulness and computation speed.