- title: '3rd Workshop on Learning with Imbalanced Domains: Preface' volume: 154 URL: https://proceedings.mlr.press/v154/moniz21a.html PDF: https://proceedings.mlr.press/v154/moniz21a/moniz21a.pdf edit: https://github.com/mlresearch//v154/edit/gh-pages/_posts/2021-09-29-moniz21a.md series: 'Proceedings of Machine Learning Research' container-title: 'Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications' publisher: 'PMLR' author: - given: Nuno family: Moniz - given: Paula family: Branco - given: Luís family: Torgo - given: Nathalie family: Japkowicz - given: Michał family: Woźniak - given: Shuo family: Wang editor: - given: Nuno family: Moniz - given: Paula family: Branco - given: Luis family: Torgo - given: Nathalie family: Japkowicz - given: Michał family: Woźniak - given: Shuo family: Wang page: 1-6 id: moniz21a issued: date-parts: - 2021 - 9 - 29 firstpage: 1 lastpage: 6 published: 2021-09-29 00:00:00 +0000 - title: 'Centralised vs decentralised anomaly detection: when local and imbalanced data are beneficial' abstract: 'In this paper, we address the problem of anomaly detection in decentralised settings. We took inspiration from the current edge computing trend, pushing towards the development of decentralised ML algorithms, i.e., the devices that collected or generated data are in charge of collaborating to train the ML models without sharing raw data. The challenges connected to this scenario are (i) data distributions of local datasets might be different, (ii) data is very often unlabelled, and (iii) devices have limited computational resources. We address them by proposing an unsupervised ensemble method for decentralised anomaly detection where the base learners are lightweight autoencoders. We aim to investigate whether an ensemble of lightweight models trained in isolation on non-IID and unlabelled local data can compete with heavier models trained in centralised settings. 
In a task of multi-category anomaly detection, our results show that our method exploits the data imbalance successfully to make accurate predictions.' volume: 154 URL: https://proceedings.mlr.press/v154/nardi21a.html PDF: https://proceedings.mlr.press/v154/nardi21a/nardi21a.pdf edit: https://github.com/mlresearch//v154/edit/gh-pages/_posts/2021-09-29-nardi21a.md series: 'Proceedings of Machine Learning Research' container-title: 'Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications' publisher: 'PMLR' author: - given: Mirko family: Nardi - given: Lorenzo family: Valerio - given: Andrea family: Passarella editor: - given: Nuno family: Moniz - given: Paula family: Branco - given: Luis family: Torgo - given: Nathalie family: Japkowicz - given: Michał family: Woźniak - given: Shuo family: Wang page: 7-20 id: nardi21a issued: date-parts: - 2021 - 9 - 29 firstpage: 7 lastpage: 20 published: 2021-09-29 00:00:00 +0000 - title: 'Online-MC-Queue: Learning from Imbalanced Multi-Class Streams' abstract: 'Online supervised learning from fast-evolving data streams has applications in many areas. The development of techniques for highly skewed class distributions (or ‘class imbalance’) is an important area of research in domains such as manufacturing, the environment, and health. Solutions should not only be able to analyse large repositories in near real-time but also be capable of providing accurate models to describe rare classes that may appear infrequently or in bursts, while continuously accommodating new instances. Although online learning methods have been proposed to handle binary class imbalance, solutions suitable for multi-class streams with varying degrees of imbalance in evolving streams have received limited attention. In order to address this knowledge gap, this paper introduces the Online-MC-Queue (OMCQ) algorithm for online learning in multi-class imbalanced settings. 
Our approach utilises a queue-based resampling method that dynamically creates an instance queue for each class. The number of instances is balanced by maintaining a queue threshold and removing older samples during training. In addition, new and rare classes are dynamically added to the training process as they appear. Our experimental results confirm a noticeable improvement in minority-class detection and in classification performance. A comparative evaluation shows that the OMCQ algorithm outperforms the state-of-the-art.' volume: 154 URL: https://proceedings.mlr.press/v154/sadeghi21a.html PDF: https://proceedings.mlr.press/v154/sadeghi21a/sadeghi21a.pdf edit: https://github.com/mlresearch//v154/edit/gh-pages/_posts/2021-09-29-sadeghi21a.md series: 'Proceedings of Machine Learning Research' container-title: 'Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications' publisher: 'PMLR' author: - given: Farnaz family: Sadeghi - given: Herna L. family: Viktor editor: - given: Nuno family: Moniz - given: Paula family: Branco - given: Luis family: Torgo - given: Nathalie family: Japkowicz - given: Michał family: Woźniak - given: Shuo family: Wang page: 21-34 id: sadeghi21a issued: date-parts: - 2021 - 9 - 29 firstpage: 21 lastpage: 34 published: 2021-09-29 00:00:00 +0000 - title: 'ML-NCA: Multi-label Neighbourhood Component Analysis' abstract: 'In multi-label classification, a datapoint can be assigned to more than one class simultaneously. Input space transformation methods can be used to transform the input space so that classification algorithms can perform better. Although existing algorithms used in binary or multi-class classifications can be used with multi-label datasets, this leads to one transformation per label and hence is very costly. Also, considering each label independently ignores any label associations in the transformation process, which is a missed opportunity. 
In this work, a new input space transformation algorithm, Multi-label Neighbourhood Component Analysis (ML-NCA), is proposed. ML-NCA performs a single supervised linear transformation of the input space, mapping it to a space in which $k$ nearest-neighbour based algorithms are expected to perform well. ML-NCA considers all the labels together while finding the single transformation of the input space, therefore removing the need for per-label transformations. This also implicitly takes advantage of label associations. An extensive set of experiments and detailed analysis demonstrate that the transformation found by ML-NCA is able to significantly improve the performance of multi-label-specific $k$ nearest neighbour algorithms.' volume: 154 URL: https://proceedings.mlr.press/v154/pakrashi21a.html PDF: https://proceedings.mlr.press/v154/pakrashi21a/pakrashi21a.pdf edit: https://github.com/mlresearch//v154/edit/gh-pages/_posts/2021-09-29-pakrashi21a.md series: 'Proceedings of Machine Learning Research' container-title: 'Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications' publisher: 'PMLR' author: - given: Arjun family: Pakrashi - given: Sayel family: Sadhukhan - given: Brian Mac family: Namee editor: - given: Nuno family: Moniz - given: Paula family: Branco - given: Luis family: Torgo - given: Nathalie family: Japkowicz - given: Michał family: Woźniak - given: Shuo family: Wang page: 35-48 id: pakrashi21a issued: date-parts: - 2021 - 9 - 29 firstpage: 35 lastpage: 48 published: 2021-09-29 00:00:00 +0000 - title: 'BayesBoost: Identifying and Handling Bias Using Synthetic Data Generators' abstract: 'Advanced synthetic data generators can model sensitive personal datasets by creating simulated samples of data with realistic correlation structures and distributions, but with a greatly reduced risk of identifying individuals. 
This has huge potential in medicine where sensitive patient data can be simulated and shared, enabling the development and robust validation of new AI technologies for diagnosis and disease management. However, even when massive ground truth datasets are available (such as UK-NHS databases which contain patient records in the order of millions) there is a high risk that biases still exist which are carried over to the data generators. For example, certain cohorts of patients may be under-represented due to cultural sensitivities amongst some communities, or due to institutionalised procedures in data collection. The under-representation of groups is one of the forms in which bias can manifest itself in machine learning, and it is the one we investigate in this work. These factors may also lead to structurally missing data or incorrect correlations and distributions which will be mirrored in the synthetic data generated from biased ground truth datasets. In this paper, we explore methods to improve synthetic data generators by using probabilistic methods to first identify the difficult-to-predict data samples in ground truth data, and then to boost these types of data when generating synthetic samples. The paper explores attempts to create synthetic data that contain more realistic distributions and that lead to predictive models with better performance.' 
volume: 154 URL: https://proceedings.mlr.press/v154/draghi21a.html PDF: https://proceedings.mlr.press/v154/draghi21a/draghi21a.pdf edit: https://github.com/mlresearch//v154/edit/gh-pages/_posts/2021-09-29-draghi21a.md series: 'Proceedings of Machine Learning Research' container-title: 'Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications' publisher: 'PMLR' author: - given: Barbara family: Draghi - given: Zhenchen family: Wang - given: Puja family: Myles - given: Allan family: Tucker editor: - given: Nuno family: Moniz - given: Paula family: Branco - given: Luis family: Torgo - given: Nathalie family: Japkowicz - given: Michał family: Woźniak - given: Shuo family: Wang page: 49-62 id: draghi21a issued: date-parts: - 2021 - 9 - 29 firstpage: 49 lastpage: 62 published: 2021-09-29 00:00:00 +0000 - title: 'Learning to Rank Anomalies: Scalar Performance Criteria and Maximization of Two-Sample Rank Statistics' abstract: 'The ability to collect and store ever more massive databases has been accompanied by the need to process them efficiently. In many cases, most observations have the same behavior, while a probably small proportion of these observations are abnormal. Detecting the latter, defined as outliers, is one of the major challenges for machine learning applications (e.g. in fraud detection or in predictive maintenance). In this paper, we propose a methodology addressing the problem of outlier detection, by learning a data-driven scoring function defined on the feature space which reflects the degree of abnormality of the observations. This scoring function is learnt through a well-designed binary classification problem whose empirical criterion takes the form of a two-sample linear rank statistic for which theoretical results are available. We illustrate our methodology with encouraging preliminary numerical experiments.' 
volume: 154 URL: https://proceedings.mlr.press/v154/limnios21a.html PDF: https://proceedings.mlr.press/v154/limnios21a/limnios21a.pdf edit: https://github.com/mlresearch//v154/edit/gh-pages/_posts/2021-09-29-limnios21a.md series: 'Proceedings of Machine Learning Research' container-title: 'Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications' publisher: 'PMLR' author: - given: Myrto family: Limnios - given: Nathan family: Noiry - given: Stephan family: Clémençon editor: - given: Nuno family: Moniz - given: Paula family: Branco - given: Luis family: Torgo - given: Nathalie family: Japkowicz - given: Michał family: Woźniak - given: Shuo family: Wang page: 63-75 id: limnios21a issued: date-parts: - 2021 - 9 - 29 firstpage: 63 lastpage: 75 published: 2021-09-29 00:00:00 +0000 - title: 'On Oversampling via Generative Adversarial Networks under Different Data Difficulty Factors' abstract: 'Over the last two decades, several approaches have been proposed to tackle the class imbalance problem, which is characterized by the inability of a learner to focus on a relevant but scarcely represented class. The generation of synthetic examples to oversample the training set and thus force the learner to focus on the important cases is one such solution. Recently, generative adversarial networks (GANs) have started to be explored as an oversampling alternative due to their capability of generating samples from an implicit distribution. Still, data difficulty factors such as class overlap, data dimensionality or sample size were shown to also negatively impact the performance of learners under an imbalanced setting. The ability of GANs to deal with the imbalance problem and other data difficulty factors has not yet been assessed. The main goal of this paper is to understand how data difficulty factors impact the performance of GANs when they are used as an oversampling method. 
Namely, we study the performance of conditioned GANs (CGANs) in an image dataset with controlled levels of the following data difficulty factors: sample size, data dimensionality, class overlap and imbalance ratio. We show that CGANs are effective for tackling tasks with multiple data difficulty factors, exhibiting increased gains on the most difficult tasks.' volume: 154 URL: https://proceedings.mlr.press/v154/nazari21a.html PDF: https://proceedings.mlr.press/v154/nazari21a/nazari21a.pdf edit: https://github.com/mlresearch//v154/edit/gh-pages/_posts/2021-09-29-nazari21a.md series: 'Proceedings of Machine Learning Research' container-title: 'Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications' publisher: 'PMLR' author: - given: Ehsan family: Nazari - given: Paula family: Branco editor: - given: Nuno family: Moniz - given: Paula family: Branco - given: Luis family: Torgo - given: Nathalie family: Japkowicz - given: Michał family: Woźniak - given: Shuo family: Wang page: 76-89 id: nazari21a issued: date-parts: - 2021 - 9 - 29 firstpage: 76 lastpage: 89 published: 2021-09-29 00:00:00 +0000 - title: 'Two Ways of Extending BRACID Rule-based Classifiers for Multi-class Imbalanced Data' abstract: 'The number of rule-based classifiers specialized for imbalanced data is quite small so far. In particular, there is no such classifier dedicated to multi-class imbalanced data. Thus, in this work we considered two ways of extending BRACID, an effective algorithm for binary data. In the first approach, BRACID was used in an OVO ensemble along with modifications of the prediction aggregation strategy. The second approach modifies the induction of rules for multiple classes simultaneously, additionally combined with their post-pruning. Experiments showed that both approaches outperformed the baselines. 
Moreover, the second approach turned out to be better than OVO, both in predictive results and in producing a smaller number of rules.' volume: 154 URL: https://proceedings.mlr.press/v154/naklicka21a.html PDF: https://proceedings.mlr.press/v154/naklicka21a/naklicka21a.pdf edit: https://github.com/mlresearch//v154/edit/gh-pages/_posts/2021-09-29-naklicka21a.md series: 'Proceedings of Machine Learning Research' container-title: 'Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications' publisher: 'PMLR' author: - given: Maria family: Naklicka - given: Jerzy family: Stefanowski editor: - given: Nuno family: Moniz - given: Paula family: Branco - given: Luis family: Torgo - given: Nathalie family: Japkowicz - given: Michał family: Woźniak - given: Shuo family: Wang page: 90-103 id: naklicka21a issued: date-parts: - 2021 - 9 - 29 firstpage: 90 lastpage: 103 published: 2021-09-29 00:00:00 +0000 - title: 'GanoDIP - GAN Anomaly Detection through Intermediate Patches: a PCBA Manufacturing Case' abstract: 'Industry 4.0 and recent deep learning progress make it possible to solve problems that traditional methods could not. This is the case for anomaly detection, which has received particular attention from the machine learning community, resulting in the use of generative adversarial networks (GANs). In this work, we propose to use intermediate patches for the inference step, after a WGAN training procedure suitable for highly imbalanced datasets, to make anomaly detection possible on full-size Printed Circuit Board Assembly (PCBA) images. We therefore show that our technique can be used to support or replace actual industrial image processing algorithms, as well as to avoid a waste of time for industries.' 
volume: 154 URL: https://proceedings.mlr.press/v154/bougaham21a.html PDF: https://proceedings.mlr.press/v154/bougaham21a/bougaham21a.pdf edit: https://github.com/mlresearch//v154/edit/gh-pages/_posts/2021-09-29-bougaham21a.md series: 'Proceedings of Machine Learning Research' container-title: 'Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications' publisher: 'PMLR' author: - given: Arnaud family: Bougaham - given: Adrien family: Bibal - given: Isabelle family: Linden - given: Benoit family: Frenay editor: - given: Nuno family: Moniz - given: Paula family: Branco - given: Luis family: Torgo - given: Nathalie family: Japkowicz - given: Michał family: Woźniak - given: Shuo family: Wang page: 104-117 id: bougaham21a issued: date-parts: - 2021 - 9 - 29 firstpage: 104 lastpage: 117 published: 2021-09-29 00:00:00 +0000