<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Proceedings of Machine Learning Research</title>
    <description>Proceedings of the 16th Asian Conference on Machine Learning
  Held in Hanoi, Vietnam on 05-08 December 2024

Published as Volume 260 by the Proceedings of Machine Learning Research on 14 January 2025.

Volume Edited by:
  Vu Nguyen
  Hsuan-Tien Lin

Series Editors:
  Neil D. Lawrence
</description>
    <link>https://proceedings.mlr.press/v260/</link>
    <atom:link href="https://proceedings.mlr.press/v260/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Tue, 14 Jan 2025 06:02:10 +0000</pubDate>
    <lastBuildDate>Tue, 14 Jan 2025 06:02:10 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>RedditEM: Unveiling Diachronic Semantic Shifts in Social Network Discourse</title>
        <description>Humans employ words to convey abstract concepts. The evolution of lexical semantics holds significance not only in Natural Language Processing applications but also in social computing research. However, diachronic word representations remain scarce due to their substantial computational demands, most evident in the absence of large-scale, long-running diachronic word embeddings for social network texts. Herein, we introduce RedditEM, a comprehensive collection of diachronic word representations derived from Reddit English comment texts, featuring one word embedding per month from January 2010 to December 2021. To assess the diachronic semantic shifts of words, we employ cosine distance metrics and compare the embeddings’ neighborhoods. Our experimental findings underscore the utility of RedditEM in detecting changes in word meanings within social networks and in advancing social computing research. Researchers interested in accessing this resource are invited to contact us.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/zou25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/zou25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Graph Neural Networks (with Proper Weights) Can Escape Oversmoothing</title>
        <description>Graph Neural Networks (GNNs) are known to suffer from degraded performance with more layers. Most prior works explained this from the perspective of graph propagation, arguing that it inevitably leads to indistinguishable node features at greater depth, a phenomenon known as oversmoothing. However, we notice that these analyses largely ignore the role of GNN weights, either directly or through unrealistically strong assumptions. In this paper, we rediscover the role of GNN weights in oversmoothing with a systematic study. Notably, contrary to previous findings, we show that when learned freely, there always exist ideal weights such that vanilla GNNs completely avoid oversmoothing, even after infinite propagation steps. This indicates that oversmoothing is a deficiency of learning rather than an inherent flaw of GNNs themselves. To facilitate the learning of proper weights, we propose Weight Reparameterization (WeightRep) as a way to adaptively maintain the ideal weights in vanilla GNNs throughout the learning process. We theoretically show that for linear GNNs, WeightRep can always mitigate oversmoothing (full collapse) as well as dimensional collapse. Extensive experiments on nine benchmark datasets demonstrate its effectiveness and efficiency in practice.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/zhuo25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/zhuo25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Convergence Analysis of Inexact Over-relaxed ADMM via Dissipativity Theory</title>
        <description>We present a new convergence analysis for the over-relaxed alternating direction method of multipliers (ADMM) when the subproblem cannot be solved exactly, i.e., inexact over-relaxed ADMM. Our analysis builds on the dissipativity framework of Hu et al. (2017), which relates the convergence analysis of optimization algorithms to the stability of a discrete-time linear dynamic system. By expressing the inexact over-relaxed ADMM as a discrete-time linear dynamic system, we show that both the linear and sublinear convergence of inexact over-relaxed ADMM can be obtained by solving or verifying the feasibility of a small semidefinite program (SDP). More importantly, we prove that the associated SDP has an analytical solution for various parameters. We demonstrate the theoretical result by applying the inexact over-relaxed ADMM to solve a distributed $\ell_1$-norm regularized logistic regression problem.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/zhou25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/zhou25b.html</guid>
        
        
      </item>
    
      <item>
        <title>HiRAG: A Historical Information-Driven Retrieval-Augmented Generation Framework for Background Summarization</title>
        <description>In an era overwhelmed by a deluge of global information, it is often challenging for people to grasp how an event develops over time. The background summarization (BS) task facilitates a deeper understanding of the relationships between the current background of an event at any given time and its historical backgrounds. To enhance comprehension and help news readers and professionals quickly understand the evolution of events, we introduce a Historical information-driven Retrieval-Augmented Generation framework (HiRAG). This framework is designed to extract the most relevant information from historical backgrounds and use it to generate precise background summaries. HiRAG employs state-of-the-art retrieval-augmented generation techniques to produce relevant background summaries. We implement a multi-strategy similarity calculation and introduce a sliding window mechanism to optimize retrieval construction. Our framework has been rigorously tested through a series of experiments and extensive analyses on recent datasets. The promising results affirm the effectiveness of our proposed HiRAG framework and its retrieval capabilities.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/zhou25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/zhou25a.html</guid>
        
        
      </item>
    
      <item>
        <title>An Efficient Query Optimization Framework Based on MCTS and LTR</title>
        <description>Identifying the optimal query plan is a fundamental task of query optimization in database management systems (DBMS). However, traditional query optimization methods face significant challenges in continuously enhancing query performance due to complex query statements, intricate data distributions, and the exponentially growing search space of table joins. In this paper, we propose a powerful query optimization framework called MRQO (integrating MCTS and LTR for Query Optimization). This framework utilizes the Monte Carlo Tree Search (MCTS) algorithm to find a comprehensive set of join orders for a query and uses these join orders as hints to generate corresponding query plans. Additionally, it employs the Learning-to-Rank (LTR) approach to train a relative ranking model, achieving higher efficiency and accuracy in identifying the optimal query plan among all candidate plans. Experimental results on PostgreSQL demonstrate that the proposed MRQO achieves stable performance and matches or even outperforms both traditional query optimizers and advanced learned optimizers based on Deep Reinforcement Learning (DRL) in terms of query optimization efficiency.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/zhao25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/zhao25b.html</guid>
        
        
      </item>
    
      <item>
        <title>AMG-AVSR: Adaptive Modality Guidance for Audio-Visual Speech Recognition via Progressive Feature Enhancement</title>
        <description>Audio-Visual Speech Recognition (AVSR) is a task that identifies spoken words by analyzing both lip movements and auditory signals. Compared to Automatic Speech Recognition (ASR), AVSR demonstrates greater robustness in noisy environments due to the support of dual modalities. However, the inherent differences between these modalities present a challenge: effectively accounting for their disparities and leveraging their complementary information to extract useful information for AVSR. To address this, we propose the AMG-AVSR model, which utilizes a two-stage curriculum learning strategy and incorporates a feature compression and recovery mechanism. By leveraging the characteristics of different modalities in various scenarios to guide each other, the model extracts refined features from audio-visual data, thereby enhancing recognition performance in both clean and noisy environments. Compared to the baseline model AV-HuBERT, AMG-AVSR demonstrates superior performance on the LRS2 dataset in both noisy and clean environments. AMG-AVSR achieves a word error rate (WER) of 2.9% under clean speech conditions. In various noisy conditions, AMG-AVSR shows a significant reduction in WER compared to previous methods.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/zhao25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/zhao25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Countering Relearning with Perception Revising Unlearning</title>
        <description>Unlearning methods that rely solely on forgetting data typically modify the network’s decision boundary to achieve unlearning. However, these approaches are susceptible to the &quot;relearning&quot; problem, whereby the network may recall the forgotten class upon subsequent updates with the remaining class data. Our experimental analysis reveals that, although these modifications alter the decision boundary, the network’s fundamental perception of the samples remains mostly unchanged. In response to the relearning problem, we introduce the Perception Revising Unlearning (PRU) framework. PRU employs a probability redistribution method, which assigns new labels and more precise supervision information to each forgetting class instance. The PRU actively shifts the network’s perception of forgetting class samples toward other remaining classes. The experimental results demonstrate that PRU not only has good classification effectiveness but also significantly reduces the risk of relearning, suggesting a robust approach to class unlearning tasks that depend solely on forgetting data.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/zhang25d.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/zhang25d.html</guid>
        
        
      </item>
    
      <item>
        <title>LabelPrompt: Effective prompt-based learning for relation classification</title>
        <description>Recently, prompt-based learning has become popular in many Natural Language Processing (NLP) tasks by converting the task into a cloze-style one to smooth out the differences between Pre-trained Language Models (PLMs) and the current task. However, for relation classification, it is challenging to associate the natural language word that fills in the mask token with relation labels, due to the rich semantic information in textual labels, e.g., &quot;org:founded_by&quot;. To address this challenge, this paper presents a novel prompt-based learning method, namely LabelPrompt, for the relation classification task. It is an intuitive approach motivated by the idea of “GIVE MODEL CHOICES!”. Specifically, we first define additional tokens to represent the relation labels, regard these tokens as a verbalizer with semantic initialisation, and explicitly construct them with a prompt template method. Then, we address the inconsistency between predicted relations and given entities by implementing an entity-aware module that employs contrastive learning. Finally, we adopt an attention query strategy to differentiate prompt tokens and sequence tokens. These strategies effectively improve the adaptation capability of prompt-based learning, especially when only a small labelled dataset is available. Extensive experimental results on several benchmark datasets demonstrate the superiority of our method, particularly in the few-shot scenario. Our code can be found at https://github.com/xerrors/Labelprompt.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/zhang25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/zhang25c.html</guid>
        
        
      </item>
    
      <item>
        <title>Multilevel Position-aware Attention Enhanced Network for Skeleton-Based Action Recognition</title>
        <description>Effectively capturing the spatiotemporal dependencies between joints is crucial for skeleton-based action recognition. However, existing methods do not consider the sparsity of skeleton data, which hinders the accurate capture of complex posture information and subtle action variations. Moreover, the locality of temporal features requires the model to focus on certain key features. Yet, most methods overlook the impact of temporal redundancy on feature focus, resulting in ineffective capture of significant temporal features. To address the issue of skeleton sparsity, we propose a Multilevel Position-aware Attention module (MPA) that explicitly leverages the relative positional information of the input data to enrich spatial information. To achieve a more effective focus on local temporal features, we develop a Multi-scale Temporal Excitation module (MTE). By scaling temporal features, the MTE module elevates the prominence of salient features and facilitates the capture of multi-scale features. Furthermore, we propose a Part Partition Encoding module (PPE) to aggregate joint data into part data, thereby providing the model with high-level information carried by the interactions between body parts. The MPA, MTE, and PPE are integrated into a unified framework called MPAE-Net. Extensive experimental results demonstrate that the MPAE-Net achieves state-of-the-art performance on two large-scale datasets, NTU RGB+D and NTU RGB+D 120.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/zhang25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/zhang25b.html</guid>
        
        
      </item>
    
      <item>
        <title>Saliency Maps Give a False Sense of Explainability to Image Classifiers: An Empirical Evaluation across Methods and Metrics</title>
        <description>The interpretability of deep neural networks (DNNs) has emerged as a crucial area of research, particularly in image classification tasks where decisions often lack transparency. Saliency maps have been widely used as a tool to decode the inner workings of these networks by highlighting regions of input images deemed most influential in the classification process. However, recent studies have revealed significant limitations and inconsistencies in the utility of saliency maps as explanations. This paper aims to systematically assess the shortcomings of saliency maps and explore alternative approaches to achieve more reliable and interpretable explanations for image classification models. We carry out a series of experiments to show that 1) existing evaluations provide neither a fair nor a meaningful comparison of existing saliency maps, as they rest on implicit assumptions and are not differentiable; and 2) saliency maps do not provide enough information to explain the accuracy of the network, the relationships between classes, or modifications of the images.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/zhang25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/zhang25a.html</guid>
        
        
      </item>
    
      <item>
        <title>FedLTF: Linear Probing Teaches Fine-tuning to Mitigate Noisy Labels in Federated Learning</title>
        <description>The presence of noisy labels has always been a primary factor affecting the effectiveness of federated learning (FL). Conventional FL approaches relying on Supervised Learning (SL) tend to overfit noisy labels, resulting in a suboptimal Feature Extractor (FE). In this paper, we exploit models obtained via Self-Supervised Learning (SSL) to mitigate the impact of noisy labels in FL. In addition, we explore two popular methods for transferring to downstream tasks: linear probing, which updates only the last classification layers, and fine-tuning, which updates all model parameters. We empirically observe that, although fine-tuning typically yields higher accuracy than linear probing, it is very sensitive to noisy labels and suffers performance degradation in their presence. To achieve the best of both worlds (i.e., high accuracy and robustness against noisy labels), we “teach” fine-tuning to control overfitting. In particular, we leverage SSL to obtain a robust FE that is unaffected by noisy labels, and employ linear probing to train the classifiers. The FE and classifiers are integrated to construct a teacher model, which undergoes knowledge distillation to instruct the fine-tuning process of the student model. Extensive experimental evaluations conducted on multiple datasets demonstrate the effectiveness and robustness of our proposed framework against noisy labels in FL, outperforming state-of-the-art methods.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/zhan25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/zhan25a.html</guid>
        
        
      </item>
    
      <item>
        <title>ColorMamba: Towards High-quality NIR-to-RGB Spectral Translation with Mamba</title>
        <description>Translating NIR to the visible spectrum is challenging due to cross-domain complexities. Current models struggle to balance a broad receptive field with computational efficiency, limiting practical use. Although the Selective Structured State Space Model, especially the improved version, Mamba, excels in generative tasks by capturing long-range dependencies with linear complexity, its default approach of converting 2D images into 1D sequences neglects local context. In this work, we propose a simple but effective backbone, dubbed ColorMamba, which first introduces Mamba into spectral translation tasks. To explore global long-range dependencies and local context for efficient spectral translation, we introduce learnable padding tokens to enhance the distinction of image boundaries and prevent potential confusion within the sequence model. Furthermore, local convolutional enhancement and agent attention are designed to improve the vanilla Mamba. Moreover, we exploit the HSV color space to provide multi-scale guidance in the reconstruction process for more accurate spectral translation. Extensive experiments show that our ColorMamba achieves a 1.02 improvement in terms of PSNR compared with the state-of-the-art method. Our code is available at https://github.com/AlexYangxx/ColorMamba/.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/zhai25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/zhai25a.html</guid>
        
        
      </item>
    
      <item>
        <title>PISDR: Page and Item Sequential Decision for Re-ranking Based on Offline Reinforcement Learning</title>
        <description>Re-ranking is the last stage of a multi-stage recommendation system, involving the reordering of lists based on historical user behavior to better align with user preferences. Offline Reinforcement Learning (RL) has been employed in both the prediction and ranking phases of recommendation systems to align with long-term objectives, surpassing the efficacy of supervised learning. However, extrapolation error is a common problem in offline RL due to the biased distribution of features, which can reduce recommendation accuracy. Considering that as users browse an e-commerce app, their preferences are influenced by previously recommended items or pages, this history can be used to correct the bias of offline RL. This paper uses offline RL to model re-ranking and presents a re-ranking algorithm named Page and Item Sequential Decision for Re-ranking (PISDR) that improves accuracy by correcting bias at two levels (pages and items). PISDR employs sequential RL, leveraging a session-level data structure that encapsulates global information at the page level and item-level interrelationships. Additionally, PISDR utilizes a multi-tower critic network to assess various feedback metrics, including click-through rate and conversion rate, which can guide the actor network with long-term rewards. Experimental results validate the effectiveness of PISDR, which improves Area Under the Curve (AUC), Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain (NDCG) by about 1.4% in generated re-ranking sequences compared to current state-of-the-art re-ranking algorithms. Finally, our method achieves a significant improvement (2.59%) in Click-Through Rate (CTR) over the industrial-level ranking model in online A/B tests.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/yuan25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/yuan25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Cross-Level Feature Relocation: Mitigating Information Loss in Cross-Layer Feature Fusion for Crowd Counting</title>
        <description>In crowd counting, significant challenges persist due to scale variation, occlusion, and complex scene interference. Merging feature maps from different levels of the backbone network is an intuitive and efficient approach to addressing these issues. However, existing multi-scale merging algorithms often overlook a critical aspect: feature maps at different levels typically have varying resolutions, and traditional interpolation-based methods for feature fusion result in significant information loss, limiting the algorithm’s multi-scale perception capability. To address this issue, we propose the Cross-Level Feature Relocation Module (CFRM), which regresses features across different levels into a unified representation space and utilizes a cross-level attention mechanism to transfer complementary information from low-resolution to high-resolution feature maps, significantly enhancing effective information utilization. Based on CFRM, we introduce the Cross-Level Feature Relocation Network (CFRNet), which exhibits strong multi-scale perception capabilities. Extensive experiments on five datasets and comprehensive ablation studies demonstrate the effectiveness of CFRM.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/yin25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/yin25a.html</guid>
        
        
      </item>
    
      <item>
        <title>A Novel Evolutionary Multitasking Feature Selection Approach for Genomic Data Classification</title>
        <description>Microarray-generated genomic data has recently sparked a wave of bioinformatics and data mining research. However, such data presents significant challenges for further analysis due to its high dimensionality and small sample sizes. Feature selection is a standard approach to address this issue, as it can enhance classification performance while reducing dimensionality. This paper introduces an Improved Gray Wolf Optimization-based Evolutionary Multitasking (EMT-IGWO) feature selection approach tailored for high-dimensional classification. It adopts multi-population co-evolutionary search modes, in which each population can be regarded as tackling a typical feature selection task, coupled through a specific information-sharing mechanism. Within the proposed multitasking framework, both the population diversity and the global search capability of EMT-IGWO are improved. Moreover, several enhancements are incorporated into the two search modes to help stagnant individuals escape from local optima with higher probability. Computational results show that EMT-IGWO outperforms the compared algorithms in effectiveness and efficiency across eight public gene expression datasets.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/yifan25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/yifan25a.html</guid>
        
        
      </item>
    
      <item>
        <title>DataFrame QA: A Universal LLM Framework on DataFrame Question Answering Without Data Exposure</title>
        <description>This paper introduces DataFrame question answering (QA), a novel task that utilizes natural language processing (NLP) models to generate Pandas queries for information retrieval and data analysis on dataframes, emphasizing safe and non-revealing data handling. Specifically, our method, leveraging large language models (LLMs) and relying solely on dataframe column names, not only ensures data privacy but also significantly reduces the context window in the prompt, streamlining information processing and addressing major challenges in LLM-based data analysis. We propose DataFrame QA as a comprehensive framework that includes safe Pandas query generation and code execution. Various LLMs are evaluated on the renowned WikiSQL dataset and our newly developed UCI-DataFrameQA, tailored for complex data analysis queries. Our findings indicate that GPT-4 performs well on both datasets, underscoring its capability in securely retrieving and aggregating dataframe values and conducting sophisticated data analyses. This approach, deployable in a zero-shot manner without prior training or adjustments, proves to be highly adaptable and secure for diverse applications. Our code and dataset are available at https://github.com/JunyiYe/dataframe-qa.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/ye25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/ye25a.html</guid>
        
        
      </item>
    
      <item>
        <title>LPNER: Label Prompt for Few-shot Nested Named Entity Recognition</title>
        <description>Few-shot Named Entity Recognition (NER) aims to identify named entities using very little annotated data. Recently, prompt-based few-shot NER methods have demonstrated significant effectiveness. However, most existing methods employ multi-round prompts, which significantly increase time and computational costs. Furthermore, current single-round prompt methods are mainly designed for flat NER tasks and are not effective in handling nested NER tasks. Additionally, these methods do not fully utilize the semantic information of entity labels through prompts. To address these challenges, we propose a novel Label-Prompt-based few-shot nested NER method named LPNER, which not only handles nested NER tasks but also efficiently extracts semantic information of entities through label prompts, thereby achieving more efficient and accurate NER. LPNER first designs a specialized prompt based on a span strategy to enhance label semantics and effectively combines multiple span representations using special marks to obtain enhanced span representations integrated with label semantics. Then, entity prototypes are constructed through a prototype network for classifying candidate entity spans. We conducted extensive experiments on five nested datasets: ACE04, ACE05, GENIA, GermEval, and NEREL. In 1-shot and 5-shot tasks, LPNER’s $F_1$ scores mostly outperform those of baseline models.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/yang25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/yang25a.html</guid>
        
        
      </item>
    
      <item>
        <title>One-Shot Machine Unlearning with Mnemonic Code</title>
        <description>Ethical and privacy issues inherent in artificial intelligence (AI) applications have been a growing concern with the rapid spread of deep learning. Machine unlearning (MU) is the research area that addresses these issues by making a trained AI model forget about undesirable training data. Unfortunately, most existing MU methods incur significant time and computational costs for forgetting. Therefore, it is often difficult to apply these methods to practical datasets and sophisticated architectures, e.g., ImageNet and Transformer. To tackle this problem, we propose a lightweight and effective MU method. Our method identifies the model parameters sensitive to the forgetting targets and adds perturbation to such model parameters. We identify the sensitive parameters by calculating the Fisher Information Matrix (FIM). This approach does not require time-consuming additional training for forgetting. In addition, we introduce class-specific random signals called mnemonic code to reduce the cost of FIM calculation, which generally requires the entire training data and incurs significant computational costs. In our method, we train the model with mnemonic code; when forgetting, we use a small number of mnemonic codes to calculate the FIM and get the effective perturbation for forgetting. Comprehensive experiments demonstrate that our method is faster and better at forgetting than existing MU methods. Furthermore, we show that our method can scale to more practical datasets and sophisticated architectures.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/yamashita25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/yamashita25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Fast Stealthily Biased Sampling Using Sliced Wasserstein Distance</title>
        <description>Ensuring fairness is essential when implementing machine learning models in practical applications. However, recent research has revealed that benchmark datasets can be crafted as fake evidence of fairness from unfair models using a method called Stealthily Biased Sampling (SBS). SBS minimizes the Wasserstein distance to manipulate a fake benchmark so that the distribution of the benchmark closely resembles the true data distribution. This optimization requires superquadratic time relative to the dataset size, making SBS applicable only to small-sized datasets. In this study, we reveal for the first time that the risk of manipulated benchmark datasets exists even for large-sized datasets. This finding indicates the necessity of considering the potential for manipulated benchmarks regardless of their size. To demonstrate this risk, we developed FastSBS, a computationally efficient variant of SBS using the Sliced Wasserstein distance. FastSBS is optimized by a stochastic gradient-based method, which requires only nearly linear time for each update. In experiments with both synthetic and real-world datasets, we show that FastSBS is an order of magnitude faster than the original SBS for large datasets while maintaining the quality of the manipulated benchmark.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/yamamoto25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/yamamoto25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Analyzing Diffusion Models on Synthesizing Training Datasets</title>
        <description>Synthetic samples from diffusion models are promising for training discriminative models as replications or augmentations of real training datasets. However, we found that synthetic datasets degrade classification performance relative to real datasets of the same size. This means that the synthetic samples from modern diffusion models are less informative for training discriminative models. This paper investigates the gap between synthetic and real samples by analyzing the synthetic samples reconstructed from real samples through the noising (diffusion) and denoising (reverse) process of diffusion models. By varying the time step at which the reverse process starts in the reconstruction, we can control the trade-off between the information in the original real data and the information produced by diffusion models. Through assessing the reconstructed samples and the trained models, we found that the synthetic samples concentrate in the modes of the training data distribution as the reverse step increases, and thus they have difficulty covering the outer edges of the distribution. In contrast, we found that these synthetic samples yield significant improvements in the data augmentation setting where both real and synthetic samples are used, indicating that the samples around modes are useful as interpolation for learning classification boundaries. These findings suggest that modern diffusion models are currently insufficient to replicate a real training dataset of the same size but are well suited to interpolating the real training samples as augmented datasets.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/yamaguchi25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/yamaguchi25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Toward Data Efficient Model Merging between Different Datasets without Performance Degradation</title>
        <description>Model merging is attracting attention as a novel method for creating a new model by combining the weights of different trained models. While previous studies reported that model merging works well for models trained on a single dataset with different random seeds, model merging between different datasets remains unsolved. In this paper, we attempt to reveal the difficulty in merging such models trained on different datasets and alleviate it. Our empirical analyses show that, in contrast to the single-dataset scenarios, dataset information needs to be accessed to achieve high accuracy when merging models trained on different datasets. However, the requirement to use full datasets not only incurs significant computational costs but also becomes a major limitation when integrating models developed and shared by others. To address this, we demonstrate that dataset reduction techniques, such as coreset selection and dataset condensation, effectively reduce the data requirement for model merging. In our experiments with SPLIT-CIFAR10 model merging, the accuracy is significantly improved by $31$% when using the full dataset and $24$% when using the sampled subset compared with not using the dataset.  Our code is available at https://github.com/MasanoriYamada/re-basin-merge-pytorch.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/yamada25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/yamada25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Vision Transformer with High Spatial Structure Sensitivity</title>
        <description>The self-attention operation, the core operation of the vision transformer (VT), is position-independent. Therefore, VT uses positional embedding to encode spatial information. However, we found that the role of positional encoding is very limited, and VT is insensitive to spatial structure. We demonstrated a significant sensitivity gap to random block shuffling and masking between VT and convolutional neural networks (CNNs), which indicates that VT does not learn the spatial structure of the target well and focuses too much on small-scale detail features. We argue that self-attention should use position-dependent operations to encode spatial information instead of relying on positional embedding. We replace the linear projection of self-attention with a convolution operation and use a regular receptive field for each feature point, which significantly increases VT’s sensitivity to spatial structure without sacrificing performance.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/xu25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/xu25b.html</guid>
        
        
      </item>
    
      <item>
        <title>Universal Video Face Restoration Method Based on Vision-Language Model</title>
        <description>Video face restoration aims to restore high-quality face video from low-quality face video, but most existing methods focus on a specific, single degradation scene such as denoising or deblurring. However, universal video face restoration should restore face video under various degradation scenes. In this paper, we use a language prompt that describes face information, including gender, appearance, and expression, to guide video face restoration. To enhance applicability, we remove the language prompt via ControlNet and incorporate human-level knowledge from vision-language models into general networks to improve video face restoration performance and enable universal video face restoration. In addition, we construct a degradation dataset that contains multiple degradations in the same scene, together with captions describing the face information. Our extensive experiments show that our approach achieves highly competitive performance in universal video face restoration.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/xu25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/xu25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Hierarchical Global Asynchronous Federated Learning Across Multi-Center</title>
        <description>Federated learning for training machine learning models across geographically distributed regional centers is becoming prevalent. However, because of disparities in location, latency, and computational capabilities, synchronously aggregating models across different sites requires waiting for stragglers, leading to significant delays. Traditional asynchronous aggregation across regional centers still faces issues of stale model parameters and outdated gradients due to the hierarchical aggregation involving local clients within each center. To address this, we propose Hierarchical Global Asynchronous Federated Learning (HGA-FL), which combines global asynchronous model aggregation across regional centers with synchronous aggregation and local consistent regularization alignment within each local center. We theoretically analyze the convergence rate of our method under non-convex optimization settings, demonstrating its stable convergence during the aggregation. Experimental evaluations show that our approach outperforms other baseline two-level aggregation methods in terms of global model generalization ability, particularly under conditions of data heterogeneity, latency, and gradient staleness.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/xie25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/xie25c.html</guid>
        
        
      </item>
    
      <item>
        <title>FedDGL: Federated Dynamic Graph Learning for Temporal Evolution and Data Heterogeneity</title>
        <description>Federated graph learning enhances federated learning by enabling privacy-preserving collaborative training on distributed graph data. While traditional methods are effective in managing data heterogeneity, they typically assume static graph structures, overlooking the dynamic nature of real-world graphs. Integrating federated graph learning with dynamic graph neural networks addresses this issue but often fails to retain previously acquired knowledge, limiting generalization for both global and personalized models. This paper proposes FedDGL, a novel framework designed to address temporal evolution and data heterogeneity in federated dynamic graph learning. Unlike conventional approaches, FedDGL captures temporal dynamics through a global knowledge distillation technique and manages client heterogeneity using a global prototype-based regularization method. The framework employs contrastive learning to generate global prototypes, enhancing feature representation while utilizing a prototype similarity-based personalized aggregation strategy for effective adaptation to local and global data distributions. Experiments on multiple benchmark datasets show that FedDGL achieves significant performance improvements over state-of-the-art methods, with up to 9.02% and 8.77% gains in local and global testing, respectively, compared to FedAvg. These results highlight FedDGL’s effectiveness in improving personalized and global model performance in dynamic, heterogeneous federated graph learning scenarios.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/xie25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/xie25b.html</guid>
        
        
      </item>
    
      <item>
        <title>Multi-level Relational Learning with Synergistic Graphs for Multivariate Time Series Forecasting</title>
        <description>Multivariate Time Series (MTS) forecasting involves analyzing the evolution and interrelationships of multiple variables over time. Effectively mining relationships between MTS variables remains challenging, as variables may imply multiple relational patterns. Recently, graph-based approaches have exhibited substantial effectiveness in capturing relationships between MTS variables. However, these methods often adhere to the paradigm of capturing low-level pairwise relationships, which limits their ability to capture high-level, beyond-pairwise relational patterns. To address this issue, we present a synergistic graph learning framework that combines the modeling advantages of graphs and hypergraphs to uncover more comprehensive relational patterns. This framework mainly consists of two parts. First, we introduce a Synergistic Relation Construction module, which incorporates dynamic graph and hypergraph structures to model low-level pairwise and high-level beyond-pairwise relationships among variables, representing multi-level relational patterns through the obtained adjacency and incidence matrices. Additionally, we develop a Synergistic Relation Learning mechanism that leverages novel synergistic graph and hypergraph convolutional networks to facilitate spatial dependency interactions across multiple levels, along with temporal convolutional networks to capture more comprehensive spatial-temporal dependencies. We conducted comprehensive experiments on four benchmark datasets, and the results demonstrate that our model outperforms state-of-the-art methods. The source code will be available online.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/xie25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/xie25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Causal ATTention Multiple Instance Learning for Whole Slide Image Classification</title>
        <description>We propose a new multiple instance learning (MIL) method called Causal ATTention Multiple Instance Learning (CATTMIL) to alleviate the dataset bias for more accurate classification of whole slide images (WSIs). There are different kinds of dataset bias due to confounders that are rooted in data generation and/or pre-training dataset of MIL. Confounders might mislead MIL models to learn spurious correlations between instances and bag label. Such spurious correlations, in turn, impede the generalization ability of models and hurt the final performance. To fight against the negative impacts of confounders, CATTMIL exploits the causal intervention using the front-door adjustment with a Causal ATTention (CATT) mechanism. This enables CATTMIL to remove the spurious correlations so as to estimate the causal effect of instances on the bag label. Unlike previous deconfounded MIL methods, our CATTMIL does not need to approximate confounder values. Therefore, CATTMIL is able to bring further performance boosting to existing schemes and achieve the state-of-the-art in WSI classification. Extensive experiments on classification of the two widely-used datasets of TCGA-NSCLC and CAMELYON16 show CATTMIL’s effectiveness in suppressing the dataset bias and enhancing the generalization capability as well.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/wu25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/wu25b.html</guid>
        
        
      </item>
    
      <item>
        <title>A Single-Stage Multi-Style License Plate Recognition Method Based on Attention</title>
        <description>Automatic license plate recognition (ALPR) is widely applied in daily life, but it remains a challenging task in open scenarios. Current ALPR methods require multiple recognition passes in multi-license-plate scenarios. Furthermore, they have complex recognition structures and insufficient recognition capabilities for multi-style license plates. To solve the above problems, this paper proposes a single-stage multi-style multi-license-plate recognition method based on the attention mechanism: CLPRNet. We use a spatial attention module based on UNet to separate the license plate character sequence into individual attention heatmaps in order. This approach unifies license plates with different character lengths and different character rows into a single processing logic, thereby enabling CLPRNet to recognize multiple styles without additional style-judgement branches. At the same time, we abandon the traditional method of cropping RoIs from images or features, and instead combine attention to recognize characters directly, which allows CLPRNet to recognize multiple license plates in a single pass. To address the issue of an inadequate number of multi-style license plate samples, this paper also proposes a multi-style license plate generation method. Among single-stage methods, CLPRNet demonstrates better detection performance on the CCPD dataset and better recognition performance on the FN, Rotate, and Tilt subsets of the CCPD dataset. Compared with existing license plate recognition methods, CLPRNet can recognize more styles of license plates. Test results and ablation experiments demonstrate the effectiveness of our proposed method.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/wu25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/wu25a.html</guid>
        
        
      </item>
    
      <item>
        <title>On Learning Frequency-Instance Correlations by Model-Agnostic Training for Synthetic Speech Detection</title>
        <description>The goal of Synthetic Speech Detection (SSD) is to detect spoofed speech synthesized by text-to-speech and voice conversion. Most existing SSD methods focus only on mining frequency-wise dependency by customizing frequency-aggregation modules in SSD models. However, instance-wise dependency is usually under-explored, which is critical for identifying synthetic speech from a global view. In this paper, we propose a novel model-agnostic training strategy for SSD that exploits both local (frequency-wise) and global (instance-wise) contexts; it does not rely on a customized architecture and can be flexibly integrated into previous SSD models. Specifically, we propose an inter-frequency correlation module to capture the local context by reconstructing masked frequency information from the unmasked frequency context. Meanwhile, an inter-instance correlation module explores the global context among different instances by promoting intra-class compactness and inter-class dispersion in the latent space. These two complementary modules operate from distinct contextual perspectives, leading to improvements in SSD performance. Extensive experiments show that our method significantly improves the performance of two state-of-the-art models on the ASVspoof 2019 and ASVspoof 2021 datasets.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/wang25g.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/wang25g.html</guid>
        
        
      </item>
    
      <item>
        <title>Diffusion-based Adversarial Attack to Automatic Speech  Recognition</title>
        <description>Recent studies have exposed the substantial vulnerability of voice-activated smart devices to adversarial examples, predominantly targeting the robustness of automatic speech recognition (ASR) systems. Most adversarial examples are generated by introducing adversarial perturbations within $l_p$-norm bounds to benign audio inputs. However, these attacks are constrained by the parametric bounds of the perturbations or the features of the disturbance, which limits their effectiveness. To improve the acoustic realism of adversarial examples and enhance attack performance, we propose a novel attack framework called Diffusion-based Adversarial Attack, leveraging DiffVC, a diffusion-based voice conversion model, to map audio to a latent space and employing Adversarial Latent Perturbation (ALP) to embed less perceptible and more robust perturbations. Extensive evaluations demonstrate that our method enhances targeted attack performance. Notably, the Word Error Rate (WER) shows an average increase of 101 absolute points over clean speech audio and 25 absolute points over the C&amp;W attack. Additionally, the Success Rate (SR) achieves an average increase of 11 absolute points over the C&amp;W attack and 16 absolute points over the SSA attack. Our approach also stands out for its high audio quality and efficiency.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/wang25f.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/wang25f.html</guid>
        
        
      </item>
    
      <item>
        <title>Counterfactual Fairness for Graph Neural Networks with Limited and Privacy Protected Sensitive Attributes</title>
        <description>Graph Neural Networks (GNNs) have shown outstanding performance in learning graph representations, which increases their application in high-risk areas. However, GNNs may inherit biases from the graph data and make unfair predictions towards protected sub-groups. To eliminate bias, a natural idea is to achieve counterfactual fairness from a causal perspective. However, counterfactual fairness requires sufficient sensitive attributes as guidance, which is infeasible in the real world. The reason is that users with various privacy preferences may selectively publish their sensitive attributes, so only limited sensitive attributes can be collected. Moreover, users who publish sensitive attributes still face privacy risks. In this paper, we first consider the situation in which the sensitive attributes are limited and propose a framework called PCFGR (Partially observed sensitive Attributes in Counterfactual Fair Graph Representation Learning) to learn fair graph representations from limited sensitive attributes. The framework trains a sensitive attribute estimator, which is applied to provide sufficient and accurate sensitive attributes. With these sensitive attributes, it can generate counterfactuals and eliminate bias efficiently. Second, we aim to protect the privacy of the sensitive attributes and further propose PCFGR$\backslash$D. Specifically, PCFGR$\backslash$D first perturbs the sensitive attributes using Local Differential Privacy (LDP). It then employs a forward-correction loss to train an accurate sensitive attribute estimator. We conduct extensive experiments, and the results show that our method outperforms other alternatives in balancing utility and fairness.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/wang25e.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/wang25e.html</guid>
        
        
      </item>
    
      <item>
        <title>Prompting vision-language fusion for Zero-Shot Composed Image Retrieval</title>
        <description>Composed image retrieval (CIR) aims to retrieve a target image given the combination of an image and a textual description as a query. Recently, benefiting from vision-language pretrained (VLP) models and large language models (LLMs), the use of textual inversion or the generation of large-scale datasets has become a novel approach to the zero-shot CIR task (ZS-CIR). However, existing ZS-CIR models overlook the case where the textual description is too brief or inherently inaccurate, making it challenging to effectively integrate the reference image into the query for retrieving the target image. To address this problem, we propose a simple yet effective method, prompting vision-language fusion (PVLF), which adapts representations in VLP models to dynamically fuse the vision and language (V&amp;L) representation spaces. In addition, by injecting learnable context prompt tokens into the Transformer fusion encoder, PVLF promotes comprehensive coupling between the V&amp;L modalities, enriching the semantic representation of the query. We evaluate the effectiveness and robustness of our method on various VLP backbones, and the experimental results show that the proposed PVLF outperforms previous methods and achieves state-of-the-art results on two public ZS-CIR benchmarks (CIRR and FashionIQ).</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/wang25d.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/wang25d.html</guid>
        
        
      </item>
    
      <item>
        <title>IST-YOLO: Infrared Small Target Detector based on Improved YOLOv8</title>
        <description>Compared with natural images, the target in a single-frame infrared small-target image occupies fewer pixels, has fuzzy imaging, less shape and texture information, and a more complex background. This leads to lower detection accuracy and makes accurate target localization difficult. Therefore, in this paper, an infrared small target detection algorithm, IST-YOLO, is proposed based on YOLOv8. First, our algorithm improves the structure of the standard model by adding an upsampling layer and a higher-resolution detection head, which improves its ability to detect small targets. Second, we designed the Adaptive Residual Module (ARM) by combining the residual structure with frequency-adaptive dilated convolution to enhance the capacity to extract deep small-target position information while retaining the rich semantic information in the shallow layers. Finally, the Local and Global Fusion (LGFusion) module is designed to enhance the information interaction between local and global features of the model. Experiments show that IST-YOLO outperforms both the standard model and popular algorithms in accuracy.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/wang25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/wang25c.html</guid>
        
        
      </item>
    
      <item>
        <title>Multi-Scale Dual-Attention Unfolding Network for Compressed Sensing Image Reconstruction</title>
        <description>Deep Unfolding Networks have emerged as a prominent strategy in compressed sensing image reconstruction, effectively merging optimization techniques with deep learning through end-to-end training of truncated inferences. Despite their advantages, these algorithms generally require extensive iterations and parameters, potentially limited by storage capacity. Additionally, the image-level transmission at each iterative step does not optimally harness the inter-scale feature information available. To address these issues, we introduce a novel approach in this paper: the $\textbf{M}$ulti-$\textbf{S}$cale $\textbf{D}$ual-$\textbf{A}$ttention $\textbf{U}$nfolding $\textbf{N}$etwork ($\textbf{MSDAUN}$) for compressed sensing image reconstruction. We propose a cross-stage multi-scale deep reconstruction module $\textbf{D}$ as an iterative process, which is composed of multiple attention sub-modules. These include Cross Attention Transformer ($\textbf{CAT}$) Modules that enhance the reconstruction with multi-channel inertia, thereby facilitating feature-level transmission and robust information exchange. Concurrently, Texture Attention Transformer ($\textbf{TAT}$) Modules are designed to meticulously extract salient reconstruction information, subsequently channeling it into the texture path to effectuate the precise prediction of textural regions, thereby contributing to the meticulous restoration of textural details.  Our comprehensive experimental evaluation across diverse datasets confirms that MSDAUN surpasses existing state-of-the-art methods. This work presents significant potential for further advancements and applications in inverse imaging problems and optimization models.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/wang25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/wang25b.html</guid>
        
        
      </item>
    
      <item>
        <title>When and How to Grow? On Efficient Pre-training via Model Growth</title>
        <description>The remarkable performance of GPT models has attracted widespread attention for large-scale language models. Despite their stunning performance, the huge pre-training cost is prohibitive. Progressive pre-training takes advantage of the faster convergence speed of small models to save computing overhead and shows great potential in accelerating pre-training. This work studies the two key issues in progressive pre-training: growth schedule and growth operation. First, we estimate the optimal growth point in theory. Then, we find in experiments that the growth operation can be performed after the model enters the convergence stage to achieve a high speed-up ratio. On the other hand, we propose progressive dimensionality growth for width expansion and redundant layers for depth expansion. Progressive dimensionality growth is a smoothed operation and improves training stability. Redundant layers implement function-preserving at a small cost and inherit the core parameters of adjacent layers, improving the utilization of knowledge learned by the original model. Our method follows strict function preservation and produces good training dynamics. Experimental results show that our method outperforms the baselines and achieves an acceleration rate of about 1.5 times while achieving the same training effect.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/wang25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/wang25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Constrained Implicit Learning Framework for Neural Network Sparsification</title>
        <description>This paper presents a novel approach to sparsifying neural networks by transforming them into implicit models characterized by an equilibrium equation rather than the conventional hierarchical layer structure. Unlike traditional sparsification techniques that rely on network structure or specific loss functions, our method reduces the process to a constrained least-squares problem with sparsity-inducing constraints or penalties. Additionally, we introduce a scalable algorithm that can be parallelized, addressing the computational complexities associated with this transformation while maintaining efficiency. Experimental results on the CIFAR-100 and 20NewsGroup datasets demonstrate the high effectiveness of our method, particularly in scenarios with high pruning rates. This approach offers a versatile and efficient solution for neural network parameter reduction. Furthermore, we observe that a moderate subset of the training data suffices to achieve competitive performance, highlighting the robustness and information-capturing capability of our approach.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/tsai25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/tsai25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Rethinking Literary Plagiarism in LLMs through the Lens of Copyright Laws</title>
        <description>The swift advancement of Generative Artificial Intelligence (AI) has outstripped the development of corresponding laws and regulations, highlighting the copyright infringement of books as a significant public concern and sparking numerous legal disputes. Although the fair use doctrine may exempt the use of copyrighted materials in training datasets without the copyright holder’s permission, content generated by such AI systems may still violate copyright laws. Previous research on copyright infringement has primarily focused on character-level analysis, which is narrower in scope than the comprehensive requirements of copyright law. To address this challenge, we developed an LLM-based similarity measurement mechanism. We guided the generative AI to produce relevant book content by employing carefully crafted prompts. Subsequently, we created datasets by comparing this generated content with the original texts of famous books. We conducted a range of experiments, including various similarity detection techniques and plot plagiarism detection. The experimental results show that the AI-generated content (AIGC) is 78.72% similar to the original text, confirming that generative AI has the potential to infringe upon copyrights. Moreover, our study examines copyright infringement issues related to content generated by generative AI in other domains, such as code, images, and licensing. Our research will provide valuable insights for refining laws and regulations concerning generative AI.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/tan25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/tan25c.html</guid>
        
        
      </item>
    
      <item>
        <title>Improve Diverse Commonsense Generation by Enhancing Subgraphs</title>
        <description>Commonsense reasoning (CSR) requires rationale beyond the explicit knowledge mentioned in the context. Many existing methods use knowledge graphs (KGs) to generate rationale as additional evidence for CSR. However, rationale extracted from KGs (e.g., ConceptNet) often includes irrelevant information, which easily introduces noise and degrades the quality of the generated evidence. Similar to brainstorming to generate diverse ideas, we introduce a synonym expansion method to expand input concepts, ultimately constructing a task-relevant knowledge subgraph. Additionally, we propose a pruning model that learns to score and prune the knowledge subgraph, removing parts that are not directly related to the input context. The proposed method improves the quality and diversity of rationale, which benefits generative commonsense reasoning tasks. Experiments on two datasets validate the effectiveness of our method, which achieves performance comparable to existing methods.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/tan25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/tan25b.html</guid>
        
        
      </item>
    
      <item>
        <title>Adapting the Attention of Cloud-Based Recognition Model to Client-Side Images without Local Re-Training</title>
        <description>The mainstream workflow of image recognition applications is first training one global model on the cloud for a wide range of classes and then serving numerous clients. Images uploaded by each client typically come from a small subset of classes. Given this cloud-client discrepancy in the range of image classes, the recognition model should be strongly adaptive, intuitively by focusing on each client’s local dynamic class subset, while incurring negligible overhead. In this work, we propose to plug a new intra-client and inter-image attention (ICIIA) module into existing backbone recognition models, requiring only one-time cloud-based training to be client-adaptive. In particular, given an image to be recognized from a certain client, ICIIA introduces multi-head self-attention to retrieve relevant images from the client’s local images, thereby calibrating the focus and the recognition result. We further identify the bottleneck of ICIIA’s overhead being in linear projection, propose to group and shuffle the features before the projections, and allow increasing the number of feature groups to dramatically improve efficiency without sacrificing much accuracy. We extensively evaluate ICIIA and compare its performance against several baselines, demonstrating effectiveness and efficiency. Specifically, for a partitioned version of ImageNet-1K with the backbone models of MobileNetV3-L and Swin-B, ICIIA improves the classification accuracy to 83.37% (+8.11%) and 88.86% (+5.28%), while adding only 1.62% and 0.02% of FLOPs, respectively. Source code is available in the supplementary materials.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/tan25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/tan25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Orchestrating Plasticity and Stability: A Continual Knowledge Graph Embedding Framework with Bio-Inspired Dual-Mask Mechanism</title>
        <description>Learning in biological systems involves the intricate modeling of diverse entities and their interrelations, leading to the evolution of logical knowledge networks with accumulating experience. Analogously, knowledge graphs serve as semantic representations of entity relationships, playing a vital role in natural language processing and graph representation learning. However, contemporary knowledge graph embedding models often neglect real-world event updates, while existing continual knowledge graph research predominantly relies on conventional learning methods that inadequately leverage graph structure, thereby compromising their continual learning capabilities. This study introduces a novel Continual Mask Knowledge Graph Embedding framework (CMKGE), designed to address these limitations. CMKGE integrates semantic attributes, network structure, and continual learning mechanisms to capture the dynamic evolution of knowledge. Inspired by biological signal propagation and Dale’s principle, we introduce a dual-mask mechanism for neuronal inhibition and activation. This mechanism automatically filters critical old knowledge, enhancing model plasticity and stability. Through comprehensive evaluations on four datasets, we demonstrate CMKGE’s superiority over state-of-the-art continual embedding models.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/song25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/song25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Fairness and Privacy Guarantees in Federated Contextual Bandits</title>
        <description>This paper considers the contextual multi-armed bandit (CMAB) problem with fairness and privacy guarantees in a federated environment. We consider merit-based exposure as the desired \emph{fair} outcome, which provides exposure to each action in proportion to its associated reward. We model the algorithm’s effectiveness using fairness regret, which captures the difference between the fair optimal policy and the policy output by the algorithm. Applying a fair CMAB algorithm to each agent individually leads to fairness regret linear in the number of agents. We propose that collaborative, federated learning can be more effective and provide the algorithm Fed-FairX-LinUCB that also ensures differential privacy. The primary challenge in extending the existing privacy framework is designing the communication protocol for communicating required information across agents. A naive protocol can either lead to weaker privacy guarantees or higher regret. We design a novel communication protocol that allows for (i) Sub-linear theoretical bounds on fairness regret for Fed-FairX-LinUCB and comparable bounds for the private counterpart, Priv-FairX-LinUCB (relative to single-agent learning), (ii) Effective use of privacy budget in Priv-FairX-LinUCB. We demonstrate the efficacy of our proposed algorithm with extensive simulation-based experiments. We show that both Fed-FairX-LinUCB and Priv-FairX-LinUCB achieve near-optimal fairness regret.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/solanki25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/solanki25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Non-Oblivious Performance of Random Projections</title>
        <description>Random projections are a cornerstone of high-dimensional computations. However, their analysis has proven both difficult and inadequate in capturing the empirically observed accuracy. To bridge this gap, this paper studies random projections from a novel perspective, focusing on data-dependent, that is, \emph{non-oblivious}, performance. The key contribution is the precise and data-dependent accuracy analysis of Rademacher random projections, achieved through elegant geometric methods of independent interest, namely, \emph{Schur-concavity}. The result formally states the following property: the less spread-out the data is, the better the accuracy. This leads to notable improvements in accuracy guarantees for data characterized by sparsity or distributed with a small spread. The key tool is a novel algebraic framework for proving Schur-concavity properties, which offers an alternative to derivative-based criteria commonly used in related studies. We demonstrate its value by providing an alternative proof for the extension of the celebrated Khintchine inequality.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/skorski25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/skorski25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Knowledge Graph Large Language Model (KG-LLM) for Link Prediction</title>
        <description>The task of multi-hop link prediction within knowledge graphs (KGs) stands as a challenge in the field of knowledge graph analysis, as it requires the model to reason through and understand all intermediate connections before making a prediction. In this paper, we introduce the Knowledge Graph Large Language Model (KG-LLM), a novel framework that leverages large language models (LLMs) for knowledge graph tasks. We first convert structured knowledge graph data into natural language and then use these natural language prompts to fine-tune LLMs to enhance multi-hop link prediction in KGs. By converting the KG to natural language prompts, our framework is designed to learn the latent representations of entities and their interrelations. To show the efficacy of the KG-LLM Framework, we fine-tune three leading LLMs within this framework, including Flan-T5, LLaMa2 and Gemma. Further, we explore the framework’s potential to provide LLMs with zero-shot capabilities for handling previously unseen prompts. Experimental results show that KG-LLM significantly improves the models’ generalization capabilities, leading to more accurate predictions in unfamiliar scenarios.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/shu25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/shu25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Towards Calibrated Losses for Adversarial Robust Reject Option Classification</title>
        <description>Robustness towards adversarial attacks is a vital property for classifiers in several applications such as autonomous driving, medical diagnosis, etc. Also, in such scenarios, where the cost of misclassification is very high, knowing when to abstain from prediction becomes crucial. A natural question is which surrogates can be used to ensure learning in scenarios where the input points are adversarially perturbed and the classifier can abstain from prediction. This paper aims to characterize and design surrogates calibrated in the &quot;Adversarial Robust Reject Option&quot; setting. First, we propose an adversarial robust reject option loss $\ell_{d}^{\gamma}$ and analyze it for the hypothesis set of linear classifiers $\mathcal{H_\text{lin}}$. Next, we provide a complete characterization result for any surrogate to be $(\ell_{d}^{\gamma},\mathcal{H_{\text{lin}}})$-calibrated. To demonstrate the difficulty in designing surrogates for $\ell_{d}^{\gamma}$, we show negative calibration results for convex surrogates and quasi-concave conditional risk cases (these gave positive calibration in the adversarial setting without the reject option). We also empirically argue that Shifted Double Ramp Loss (DRL) and Shifted Double Sigmoid Loss (DSL) satisfy the calibration conditions. Finally, we demonstrate the robustness of shifted DRL and shifted DSL against adversarial perturbations on a synthetically generated dataset.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/shah25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/shah25a.html</guid>
        
        
      </item>
    
      <item>
        <title>GRU-M: A Joint Impute and Learn Approach for Sequential Prediction under Missing Data</title>
        <description>Sequential prediction in the presence of missing data is a long-standing research problem. Classically, researchers have tackled this by imputing data first and then building predictive models. This two-stage process is typically prone to errors; to circumvent this, researchers have proposed a variety of techniques that employ a joint impute-and-learn approach before prediction. Among these, Recurrent Neural Networks (RNNs) have been very popular given their natural ability to tackle sequential data efficiently. Existing state-of-the-art approaches either (i) do not impute, (ii) do not completely factor the available information around a gap, or (iii) ignore position information within a gap. Our approach addresses these gaps by proposing a novel architecture which jointly imputes and learns by taking into account (i) information from either end of the gap, (ii) proximity to the left/right end of a gap, and (iii) the length of the gap. In the context of this work, prediction means either sequence classification or forecasting. In this paper, we demonstrate the utility of the proposed architecture on forecasting tasks. We benchmark against a range of state-of-the-art baselines in scenarios where data is either (a) naturally missing or (b) synthetically masked.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/pachal25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/pachal25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Dude: Dual Distribution-Aware Context Prompt Learning For Large Vision-Language Model</title>
        <description>Prompt learning methods are gaining increasing attention due to their ability to customize large vision-language models to new domains using pre-trained contextual knowledge and minimal training data. However, existing works typically rely on optimizing unified prompt inputs, often struggling with fine-grained classification tasks due to insufficient discriminative attributes. To tackle this, we consider a new framework based on a dual context of both domain-shared and class-specific contexts, where the latter is generated by Large Language Models (LLMs) such as GPTs. Such dual prompt methods enhance the model’s feature representation by joining implicit and explicit factors encoded in LLM knowledge. Moreover, we formulate the Unbalanced Optimal Transport (UOT) theory to quantify the relationships between constructed prompts and visual tokens. Through partial matching, UOT can properly align discrete sets of visual tokens and prompt embeddings under different mass distributions, which is particularly valuable for handling irrelevant or noisy elements, ensuring that the preservation of mass does not restrict transport solutions. Furthermore, UOT’s characteristics integrate seamlessly with image augmentation, expanding the training sample pool while maintaining a reasonable distance between perturbed images and prompt inputs. Extensive experiments across few-shot classification and adapter settings substantiate the superiority of our model over current state-of-the-art baselines.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/nguyen25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/nguyen25c.html</guid>
        
        
      </item>
    
      <item>
        <title>Joint learning of Gaussian graphical models in heterogeneous dependencies of high-dimensional transcriptomic data</title>
        <description>In biology, the construction of a gene co-expression network is a difficult research problem, due to the high dimensionality of the data and the heterogeneity of the samples. Furthermore, observations from two or more groups sharing the same biological variables require the comparison of gene co-expression patterns with some commonalities between the groups. In this context, we propose a mixture of Gaussian graphical models for paired data to estimate heterogeneous dependencies and recover sub-population networks from these complicated biological data with certain sparsity and symmetry constraints of two groups of dependent variables. We develop an efficient generalized expectation-maximization (EM) algorithm for penalized maximum likelihood estimation with the fusion of a graphical lasso penalty. As a result, we demonstrate the numerical performance of our method in simulation studies, with the new method outperforming the classical graphical lasso approach in terms of model fitting. A real-world application for estimating gene networks on a high dimensional ecological transcriptomics data set of nine-spined stickleback has also been provided. Our new approach identified similarities and differences between groups of genes from the brain and liver tissues of samples collected from two habitats. These results show the efficiency of our approach to the identification of complicated interactions from high-dimensional and heterogeneous gene expression data.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/nguyen25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/nguyen25b.html</guid>
        
        
      </item>
    
      <item>
        <title>Asian Conference on Machine Learning: Preface</title>
        <description>Preface to ACML 2024.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/nguyen25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/nguyen25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Robust Transfer Learning for Active Level Set Estimation with Locally Adaptive Gaussian Process Prior</title>
        <description>The objective of active level set estimation for a black-box function is to precisely identify regions where the function values exceed or fall below a specified threshold by iteratively performing function evaluations to gather more information about the function. This becomes particularly important when function evaluations are costly, drastically limiting our ability to acquire large datasets. A promising way to sample-efficiently model the black-box function is by incorporating prior knowledge from a related function. However, this approach risks slowing down the estimation task if the prior knowledge is irrelevant or misleading. In this paper, we present a novel transfer learning method for active level set estimation that safely integrates a given prior knowledge while constantly adjusting it to guarantee a robust performance of a level set estimation algorithm even when the prior knowledge is irrelevant. We theoretically analyze this algorithm to show that it has a better level set convergence compared to standard transfer learning approaches that do not make any adjustment to the prior. Additionally, extensive experiments across multiple datasets confirm the effectiveness of our method when applied to various different level set estimation algorithms as well as different transfer learning scenarios.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/ngo25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/ngo25a.html</guid>
        
        
      </item>
    
      <item>
        <title>How Classification Baseline Works for Deep Metric Learning: A Perspective of Metric Space</title>
        <description>Deep Metric Learning (DML) stands as a powerful technique utilized for training models to capture semantic similarities between data points across various domains, including computer vision, natural language processing, and recommendation systems. Current approaches in DML often prioritize the development of novel network structures or loss functions while overlooking metric properties and the intricate relationship between classification and metric learning. This oversight results in significant time overhead, particularly when the number of categories increases. To address this challenge, we propose extending the loss function used in classification to serve as a metric, thereby imposing constraints on the distances between training samples based on the triangle inequality. This approach is akin to proxy-based methods and aims to enhance the efficiency of DML. Drawing inspiration from metrically convex metrics, we introduce the concept of a &quot;weak-metric&quot; to overcome the limitations associated with certain loss functions that cannot be straightforwardly extended to full metrics. This ensures the effectiveness of DML under various circumstances. Furthermore, we extend the Cross Entropy loss to act as a weak-metric and introduce a novel metric loss derived from Cross Entropy for experimental comparisons with other methods. The results underscore the credibility and reliability of our proposal, showcasing its superiority over state-of-the-art techniques. Notably, our approach also exhibits significantly faster training times as the number of categories increases, making it a compelling choice for large-scale datasets.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/mou25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/mou25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Diffusion Model Based Posterior Sampling for  Noisy Linear Inverse Problems</title>
        <description>With the rapid development of diffusion models and flow-based generative models, there has been a surge of interest in solving noisy linear inverse problems, e.g., super-resolution, deblurring, denoising, and colorization, with generative models. However, while remarkable reconstruction performances have been achieved, their inference time is typically too slow, since most of them rely on the seminal diffusion posterior sampling (DPS) framework and thus require time-consuming gradient calculation through back-propagation to approximate the intractable likelihood score. To address this issue, this paper provides a fast and effective solution by proposing a simple closed-form approximation to the likelihood score. For both diffusion and flow-based models, extensive experiments are conducted on various noisy linear inverse problems such as noisy super-resolution, denoising, deblurring, and colorization. In all these tasks, our method (namely DMPS) demonstrates highly competitive or even better reconstruction performances while being significantly faster than all the baseline methods.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/meng25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/meng25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Multi-Task Network Guided Multimodal Fusion for Fake News Detection</title>
        <description>Fake news detection has become a hot research topic in the multimodal domain. Existing multimodal fake news detection research utilizes a series of feature fusion networks to gather useful information from different modalities of news posts. However, how to form effective cross-modal features, and how cross-modal correlations impact decision-making, remain open questions. This paper introduces MMFND, a multi-task guided multimodal fusion framework for fake news detection, which introduces multi-task modules for feature refinement and fusion. Pairwise CLIP encoders are used to extract modality-aligned deep representations, enabling accurate measurement of cross-modal correlations. Feature fusion is enhanced by weighting multimodal features with normalised cross-modal correlations. Extensive experiments on typical fake news datasets demonstrate that MMFND outperforms state-of-the-art approaches.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/ma25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/ma25a.html</guid>
        
        
      </item>
    
      <item>
        <title>FedLF: Adaptive Logit Adjustment and Feature Optimization in Federated Long-Tailed Learning</title>
        <description>Federated learning offers a solution paradigm to the challenge of preserving privacy in distributed machine learning. However, datasets distributed across each client in the real world are inevitably heterogeneous, and if the datasets can be globally aggregated, they collectively exhibit a long-tailed distribution, which greatly affects the performance of the model. The traditional approach to federated learning primarily addresses the heterogeneity of data among clients, yet it fails to address the phenomenon of class bias in global long-tailed data. This results in the trained model focusing on the head classes while neglecting the equally important tail classes. Consequently, it is essential to develop a methodology that can consider classes holistically. To address the above problems, we propose a new method called FedLF, which introduces three modifications in the local training phase: adaptive logit adjustment, continuous class-centred optimization, and feature decorrelation. We compare seven different methods with varying degrees of data heterogeneity and long-tailed distribution. Extensive experiments on benchmark datasets CIFAR-10-LT and CIFAR-100-LT demonstrate that our approach effectively mitigates the problem of model performance degradation due to data heterogeneity and long-tailed distribution. Our code is available at https://github.com/18sym/FedLF.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/lu25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/lu25a.html</guid>
        
        
      </item>
    
      <item>
        <title>The data-driven transferable adversarial space</title>
        <description>Deep Neural Network (DNN) models are vulnerable to deception through the intentional addition of imperceptible perturbations to benign examples, posing a significant threat to security-sensitive applications. To address this, understanding the underlying causes of this phenomenon is crucial for developing robust models. A key research area involves investigating the characteristics of adversarial directions, which have been found to be perpendicular to decision boundaries and associated with low-density regions of the data. Existing research primarily focuses on adversarial directions for individual examples, while decision boundaries and data distributions are inherently dataset-dependent. This paper explores the space of adversarial perturbations within a dataset. Specifically, we represent adversarial perturbations as a linear combination of adversarial directions, followed by a non-linear projection. Using the proposed greedy algorithm, we train the adversarial space spanned by the set of adversarial directions. Experiments on CIFAR-10 and ImageNet substantiate the existence of the adversarial space as an embedded space within the entire data space. Furthermore, the learned adversarial space enables statistical analysis of decision boundaries. Finally, we observe that the adversarial space learned on one DNN model is model-agnostic, and that the adversarial space learned on a vanilla model is a subset of that learned on a robust model, implicating data distribution as the underlying cause of adversarial examples.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/liu25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/liu25c.html</guid>
        
        
      </item>
    
      <item>
        <title>I Mean I Am a Mouse: meets for Bilingual Multimodal Meme Sarcasm Classification from Large Language Models</title>
        <description>Multimodal image-text memes are widely used on social networks and present significant challenges for high-precision sentiment analysis, social network analysis, and understanding diverse user communities, especially due to their deep cultural and regional influences. However, most existing studies on multimodal memes focus primarily on English-speaking communities and on preliminary tasks, such as harmful meme detection. In this paper, we focus on a more specific challenge: high-precision sarcasm classification in various contexts. We introduce a novel dataset for classifying sarcasm in multimodal memes, covering both Chinese and English languages. This dataset serves as a critical resource for developing and evaluating models that detect sarcasm across different cultural contexts. Furthermore, we propose a framework named Mmeets, which leverages Large Language Models (LLMs) and abductive reasoning to interpret the relationships between images and text, enhancing text understanding. Mmeets employs a pre-trained AltCLIP vision-language model alongside a cross-attention mechanism to effectively fuse image and text data, capturing subtle semantic connections. Our experimental results show that the Mmeets method outperforms state-of-the-art techniques in sarcasm classification tasks.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/liu25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/liu25b.html</guid>
        
        
      </item>
    
      <item>
        <title>Understanding Transcriptional Regulatory Redundancy by Learnable Global Subset Perturbations</title>
        <description>Transcriptional regulation through cis-regulatory elements (CREs) is crucial for numerous biological functions, with its disruption potentially leading to various diseases. These CREs often exhibit redundancy, allowing them to compensate for each other in response to external disturbances, highlighting the need for methods to identify CRE sets that collaboratively regulate gene expression effectively. To address this, we introduce GRIDS, an in silico computational method that approaches the task as a global feature explanation challenge to dissect combinatorial CRE effects in two phases. First, GRIDS constructs a differentiable surrogate function to mirror the complex gene regulatory process, facilitating cross-translations in single-cell modalities. It then employs learnable perturbations within a state transition framework to offer global explanations, efficiently navigating the combinatorial feature landscape. Through comprehensive benchmarks, GRIDS demonstrates superior explanatory capabilities compared to other leading methods. Moreover, GRIDS’s global explanations reveal intricate regulatory redundancy across cell types and states, underscoring its potential to advance our understanding of cellular regulation in biological research.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/liu25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/liu25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Robust Multi-Agent Reinforcement Learning for Autonomous Vehicle in Noisy Highway Environments</title>
        <description>The field of research on multi-agent reinforcement learning (MARL) algorithms in self-driving vehicles is rapidly expanding in mixed-traffic scenarios where autonomous vehicles (AVs) and human-driven vehicles (HDVs) coexist. Most studies assume that all AVs can obtain accurate state information. However, in real-world scenarios, noisy sensor measurements have a significant impact. To address this issue, we propose an effective and robust MARL algorithm, Multi-Agent Proximal Policy Optimization with Curriculum-based Adversarial Learning (CA-MAPPO), for situations where observation perturbations are present. The proposed approach incorporates adversarial samples during training and adopts a curriculum learning approach by gradually increasing the noise intensity. By evaluating the proposed approach in the ideal environment and scenarios under noise attacks with varying intensities, experiment results demonstrate that the proposed algorithm enables AVs to achieve a success rate of over 70% for the multi-lane highway on-ramp merging task, achieving a maximum average speed of over 19 $m/s$ and performing significantly better than state-of-the-art MARL algorithms such as MAPPO and MAACKTR.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/lin25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/lin25b.html</guid>
        
        
      </item>
    
      <item>
        <title>Visible-Infrared Person Re-Identification via Feature Fusion and Deep Mutual Learning</title>
        <description>Visible-Infrared Person Re-Identification (VI-ReID) aims to retrieve a set of person images captured from both visible and infrared camera views. Addressing the challenge of modal differences between visible and infrared images, we propose a VI-ReID network based on Feature Fusion and Deep Mutual Learning (DML). To enhance the model’s robustness to color, we introduce a novel data augmentation method called Random Combination of Channels (RCC), which generates new images by randomly combining the R, G, and B channels of visible images. Furthermore, to capture more informative features of individuals, we fuse the features from the middle layers of the network. To reduce the model’s dependence on global features, we employ a fusion branch as an auxiliary branch, facilitating synchronous learning of the global and fusion branches through DML. Extensive experiments on the SYSU-MM01 and RegDB datasets validate the superiority of our method, showcasing its excellent performance when compared to other state-of-the-art approaches.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/lin25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/lin25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Enhancing Aspect Sentiment Quad Prediction through Dual-Sequence Data Augmentation and Contrastive Learning</title>
        <description>Aspect sentiment quad prediction (ASQP) endeavors to analyze four sentiment elements in sentences. Recent studies utilize generative models to achieve this task, yielding commendable outcomes. However, these studies often fall short of fully leveraging the relationships between sentiment elements and have difficulty effectively handling implicit sentiment expressions. Furthermore, the task also confronts the obstacle of data scarcity stemming from the substantial expense of data annotation. To address these limitations, we propose dual-sequence data augmentation to achieve diverse input and target expressions, and we incorporate contrastive learning to encourage the model to distinctly represent the presence or absence of the pivotal features pertaining to implicit aspects and opinion terms. Additionally, we introduce a prediction normalization strategy to refine the produced results. Empirical findings from experiments on four publicly available datasets show the superiority of our method, which surpasses multiple baseline approaches and achieves state-of-the-art performance on the benchmark.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/li25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/li25b.html</guid>
        
        
      </item>
    
      <item>
        <title>Differentially Private Deep Learning with Importance-based Adaptive Gradient Processing</title>
        <description>In recent years, with the rapid development of neural network technology, the application of deep learning in artificial intelligence has made significant progress. However, training neural network models involves datasets that may contain sensitive user information, and attackers might exploit the well-trained models to gain access to this sensitive information, leading to privacy breaches. Considering this risk, some deep learning algorithms incorporate differential privacy to safeguard the privacy of the trained model; this protection comes at the cost of some model performance, achieved by adding controllable random noise. In this paper, we propose a differentially private deep learning algorithm based on the importance of each layer’s gradients, called DP-AdamILG. DP-AdamILG further mitigates the impact of noise addition on model performance by combining a dynamic privacy-budget allocation strategy with the construction of noisy gradients based on the importance of each layer’s gradients. The algorithm’s privacy is theoretically proven. Experimental results show that DP-AdamILG achieves good neural network performance and exhibits strong robustness.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/li25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/li25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Large Vision-Language Models as Emotion Recognizers in Context Awareness</title>
        <description>Context-aware emotion recognition (CAER) is a complex and significant task that requires perceiving emotions from various contextual cues. Previous approaches primarily focus on designing sophisticated architectures to extract emotional cues from images. However, their knowledge is confined to specific training datasets and may reflect the subjective emotional biases of the annotators. Furthermore, acquiring large amounts of labeled data is often challenging in real-world applications. In this paper, we systematically explore the potential of leveraging Large Vision-Language Models (LVLMs) to empower the CAER task through three paradigms: 1) We fine-tune LVLMs on two CAER datasets, which is the most common way to transfer large models to downstream tasks. 2) We design zero-shot and few-shot patterns to evaluate the performance of LVLMs in scenarios with limited or even entirely unseen data. In this case, a training-free framework is proposed to fully exploit the In-Context Learning (ICL) capabilities of LVLMs. Specifically, we develop an image similarity-based ranking algorithm to retrieve examples; subsequently, the instructions, retrieved examples, and the test example are combined and fed to LVLMs to obtain the corresponding sentiment judgment. 3) To leverage the rich knowledge base of LVLMs, we incorporate Chain-of-Thought (CoT) into our framework to enhance the model’s reasoning ability and provide interpretable results. Extensive experiments and analyses demonstrate that LVLMs achieve competitive performance in the CAER task across different paradigms. Notably, the superior performance in few-shot settings indicates the feasibility of LVLMs for accomplishing specific tasks without extensive training.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/lei25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/lei25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Towards Robust Saliency Maps</title>
        <description>Saliency maps are one of the most popular tools for interpreting the operation of a neural network: they compute the input features deemed relevant to the final prediction, which are often subsets of pixels easily understandable by a human being. However, it is known that relying solely on human assessment to judge a saliency map method can be misleading. In this work, we propose a new neural network verification specification called saliency-robustness, which uses formal methods to prove a relationship between Vanilla Gradient (VG), a simple yet surprisingly effective saliency map method, and the network’s prediction: given a network, if an input $x$ emits a certain VG saliency map, it is mathematically proven (or disproven) that the network must classify $x$ in a certain way. We then introduce a novel method that combines Marabou and Crown, two state-of-the-art neural network verifiers, to solve the proposed specification. Experiments on our synthetic dataset and MNIST show that Vanilla Gradient is surprisingly effective as a certification for the predicted output.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/le25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/le25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Membership Inference Attacks Against Time-Series Models</title>
        <description>Analyzing time-series data that may contain personal information, particularly in the medical field, presents serious privacy concerns. Sensitive health data from patients is often used to train machine-learning models for diagnostics and ongoing care. Assessing the privacy risk of such models is crucial to making knowledgeable decisions on whether to use a model in production, share it with third parties, or deploy it in patients’ homes. Membership Inference Attacks (MIA) are a key method for this kind of evaluation; however, time-series prediction models have not been thoroughly studied in this context. We explore existing MIA techniques on time-series models and introduce new features, focusing on the seasonality and trend components of the data. Seasonality is estimated using a multivariate Fourier transform, and a low-degree polynomial is used to approximate trends. We applied these techniques to various types of time-series models, using datasets from the health domain. Our results demonstrate that these new features enhance the effectiveness of MIAs in identifying membership, improving the understanding of privacy risks in medical data applications.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/koren25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/koren25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Clustering-Augmented Fraud Detection on Graphs Using Label-Aware Feature Aggregation</title>
        <description>Fraud detection has emerged as a pivotal process in different fields (e.g., e-commerce, social networks). Since interactions among entities provide valuable insights into fraudulent activities, such behaviors can be naturally represented as graphs, where graph neural networks (GNNs) have been developed as prominent models to boost the efficacy of fraud detection. However, the application of GNNs in this domain encounters significant challenges, primarily due to class imbalance and the mixture of homophily and heterophily in fraud graphs. To address these challenges, in this paper, we propose LACA, which performs fraud detection on graphs using Label-Aware feature aggregation to advance GNN training, regularized by Clustering-Augmented optimization. Specifically, label-aware feature aggregation simplifies adaptive aggregation in homophily-heterophily mixed neighborhoods, preventing gradient domination by legitimate nodes and mitigating class imbalance in message passing. Clustering-augmented optimization provides fine-grained subclass semantics to improve detection performance and yields an additional benefit in addressing class imbalance. Extensive experiments on four fraud datasets demonstrate that LACA can significantly improve fraud detection performance on graphs with different imbalance ratios and homophily ratios, outperforming state-of-the-art GNN models.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/jing25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/jing25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Middle Code Prediction: Enhancing Code Generation for Uncommon Programming Languages in Robotics</title>
        <description>Generating executable code from natural language instructions to drive robotic movements is considered a crucial step towards achieving embodied intelligence. However, in the robotics domain, the scarcity of programming language data often necessitates manually encapsulating high-level APIs so that Large Language Models (LLMs) can predict code correctly, which is time-consuming and incomplete. This paper therefore proposes a three-stage Middle Code Prediction (MCP) scheme: by injecting appropriate prompts at different stages, the LLM shifts towards predicting middle code that it understands more easily. This middle code can then be converted into the final code through specific scripts, automatically accomplishing code generation in uncommon programming languages without the need for manually encapsulating high-level APIs. We tested our approach on the Hospital Item Transport Dataset (HITD) and found that MCP improves the mean accuracy of various baseline models to varying degrees, with an overall increase of 31%, while also enhancing the noise resistance of fine-tuned models. We conducted real-world experiments on industrial robotic arms, verifying the feasibility of MCP in scenarios with no API encapsulation and with partial API encapsulation. The method proposed in this paper provides a guideline for code generation in uncommon programming languages in the context of LLMs. Our experimental dataset is available at https://github.com/Ghbbbbb/MCP.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/jia25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/jia25c.html</guid>
        
        
      </item>
    
      <item>
        <title>DCoT: Dual Chain-of-Thought Prompting for Large Multimodal Models</title>
        <description>Inference augmentation techniques such as Chain-of-Thought have already made their mark in Large Language Models (LLMs). However, transferring these advances to Large Multimodal Models (LMMs) presents greater challenges. Drawing inspiration from human cognitive processes, this paper proposes a plug-and-play Dual Chain-of-Thought (DCoT) strategy, a novel pipeline that combines visual and textual guidance to improve the performance of LMMs on complex multimodal tasks. DCoT employs a dual guidance mechanism: on the visual side, it uses bounding box markers to direct the model’s attention to the image regions related to the query, achieving fine-grained image guidance; on the textual side, we propose a Fast In-Context Retrieval Framework (FICRF) that dynamically and automatically retrieves the most suitable examples from a well-built demonstration example cluster as context guidance for the current problem. This bimodal approach of combined visual and textual guidance enhances the inference capabilities of LMMs. Extensive experiments on different LMMs and benchmark datasets validate its effectiveness, opening up a new path in multimodal inference: they showcase how the synergistic combination of visual and textual instructions can take the performance of these models to new heights, while demonstrating the potential of Chain-of-Thought and In-Context Learning as a superior alternative to fine-tuning LMMs.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/jia25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/jia25b.html</guid>
        
        
      </item>
    
      <item>
        <title>A More Efficient Inference Model for Multimodal Emotion Recognition</title>
        <description>With the widespread adoption of the Internet and mobile Internet, an increasing number of individuals are expressing their emotions on short-video platforms. Contemporary multimodal emotion analysis technologies facilitate a more comprehensive recognition and understanding of emotions through the analysis of various data sources, including text, facial expressions, audio, and hand gestures. Consequently, the significance of sentiment analysis is becoming increasingly pronounced. However, existing research indicates that most emotion analysis techniques are not sufficiently rapid and efficient in light of the exponential proliferation of short video content. In addition, most sentiment analysis models demonstrate significant differences in the contribution of each modality, with the text and visual modalities often exerting a greater influence than the audio modality. Furthermore, in the pursuit of heightened accuracy, certain models are designed to be exceedingly complex, while others prioritize swift reasoning at the expense of accuracy. This paper proposes a more efficient multimodal sentiment analysis model with three distinct advantages. Firstly, residual-free connectivity modules capable of extracting 3-D attentional weights are proposed to process visual modal features, maintaining accuracy while improving inference efficiency. Secondly, the model adopts multi-scale hierarchical context aggregation (aggregation followed by interaction) for the audio modality, capturing coarse- and fine-grained audio contextual information through multilevel aggregation, thereby enriching audio modality features and minimizing disparities between the modalities’ contributions. Finally, the model attains a superior balance between accuracy and speed, enhancing its adaptability to the fast-paced short-video environment and meeting the burgeoning demand for video content processing.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/jia25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/jia25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Chain Association-based Attacking and Shielding Natural Language Processing Systems</title>
        <description>Association enables people to refer to something without mentioning it in completely straightforward words, while still allowing others to understand what they intend. In this paper, we propose a chain association-based adversarial attack against natural language processing systems, exploiting the comprehension gap between humans and machines. We first generate a chain association graph for Chinese characters based on the association paradigm to build the search space of potential adversarial examples. Then, we introduce a discrete particle swarm optimization algorithm to search for the optimal adversarial examples. We conduct comprehensive experiments and show that advanced natural language processing models and applications, including large language models, are vulnerable to our attack, while humans remain good at understanding the perturbed text. We also explore two methods, adversarial training and associative graph-based recovery, to shield systems from chain association-based attacks. Because a few examples use derogatory terms, this paper contains material that may be offensive or upsetting to some people.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/huang25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/huang25c.html</guid>
        
        
      </item>
    
      <item>
        <title>On the sample complexity of privately learning half-spaces</title>
        <description>We present a differentially private learner for half-spaces over a finite domain $\mathcal{X}^d\subseteq\mathbb{R}^d$ with sample complexity $\tilde{O}(d\cdot\log^*\vert\mathcal{X}\vert)$, which improves over $\tilde{O}(d^{2.5}\cdot 8^{\log^*\vert\mathcal{X}\vert})$, the state-of-the-art result of [Kaplan et al., 2020]. The building block of our result is a novel differentially private algorithm that learns half-spaces by iteratively learning thresholds on angles.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/huang25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/huang25b.html</guid>
        
        
      </item>
    
      <item>
        <title>Refining Visual Perception for Decoration Display: A Self-Enhanced Deep Captioning Model</title>
        <description>Traditional decoration displays usually include renderings and corresponding descriptions to give users a deeper understanding and feeling. Nevertheless, describing massive numbers of renderings undoubtedly requires a lot of manpower. Thanks to the development of artificial intelligence, especially deep learning techniques, image captioning has been developed to automatically generate captions for given images. However, a defect in exploring “perceptive” words (e.g., bright, capacious, and comfortable) is exposed when transferring existing captioning approaches to the decoration display task. To address this issue, in this paper, we propose a self-enhanced deep captioning model, which generates captions with visual perception using the designed Self-Enhanced Transformer (SET). In detail, SET first pre-trains the scene-aware encoder, which employs a multi-task-based multi-modal transformer to enhance the perceptive semantics of the visual representations. Then, SET combines the pre-trained encoder with the transformer decoder for fine-tuning and designs a knowledge-enhanced module on top of the decoder to adaptively fuse the decoded representations and retrieved language cues for more suitable word prediction. In experiments, we first validate SET on the MS-COCO dataset, where we achieve at least a 0.6 improvement in the CIDEr-D score. Furthermore, to assess the effectiveness of SET on the decoration display task, we collect a new dataset called DecorationCap. We present a thorough empirical analysis to verify the generality of SET and find that SET surpasses other comparison methods by at least 6.8 points on the CIDEr-D score.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/huang25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/huang25a.html</guid>
        
        
      </item>
    
      <item>
        <title>SceneWeaver: Text-Driven Scene Generation with Geometry-aware Gaussian Splatting</title>
        <description>With the widespread use of virtual reality applications, 3D scene generation has become a challenging new research frontier. 3D scenes have highly complex structures, so it is crucial to ensure that the output is dense, coherent, and includes all necessary structures. Many current 3D scene generation methods rely on pre-trained text-to-image diffusion models and monocular depth estimators, but they often lack rich geometric constraint information within the scene, leading to geometric distortion in the generated results. Therefore, we propose a two-stage geometry-aware progressive scene generation framework, SceneWeaver, which creates diverse, high-quality 3D scenes from text or image inputs. In the first stage, we introduce a multi-level depth refinement mechanism combined with image inpainting and point cloud updating strategies to construct a high-quality initial point cloud. In the second stage, 3D Gaussians are initialized based on the point cloud and continuously optimized. To address the challenge of insufficient geometric constraints in the Gaussian Splatting optimization process, we utilize the rich appearance and geometry information within the scene to perform a geometry-aware optimization, resulting in high-quality scene generation results. Comprehensive experiments across multiple scenes demonstrate the significant potential and advantages of our framework compared with several baselines.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/hou25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/hou25a.html</guid>
        
        
      </item>
    
      <item>
        <title>InfantCryNet: A Data-driven Framework for Intelligent Analysis of Infant Cries</title>
        <description>Understanding the meaning of infant cries is a significant challenge for young parents in caring for their newborns. The presence of background noise and the lack of labeled data present practical challenges in developing systems that can detect crying and analyze its underlying reasons. In this paper, we present a novel data-driven framework, “InfantCryNet,” for accomplishing these tasks. To address the issue of data scarcity, we employ pre-trained audio models to incorporate prior knowledge into our model. We propose the use of statistical pooling and multi-head attention pooling techniques to extract features more effectively. Additionally, knowledge distillation and model quantization are applied to enhance model efficiency and reduce the model size, better supporting industrial deployment on mobile devices. Experiments on real-life datasets demonstrate the superior performance of the proposed framework, outperforming state-of-the-art baselines by 4.4% in classification accuracy. The model compression effectively reduces the model size by 7% without compromising performance, and by up to 28% with only an 8% decrease in accuracy, offering practical insights for model selection and system design.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/hong25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/hong25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Preserving Spatial-Temporal Relationship with Adaptive Node Sampling in Hierarchical Dynamic Graph Transformers</title>
        <description>Dynamic Graph Transformers (DGTs) have demonstrated remarkable performance in various applications, such as social networks, traffic forecasting, and recommendation systems. Despite their effectiveness in capturing long-range dependencies, training DGTs on large graphs remains a challenge. Mini-batch training is usually used to alleviate this challenge, but this approach often fails to capture complex dependencies or sacrifices performance. To deal with the above problems, we propose the $\underline{A}$daptive Node $\underline{S}$ampling in $\underline{H}$ierarchical $\underline{D}$ynamic $\underline{G}$raph $\underline{T}$ransformers (ASH-DGT) architecture, which focuses on sampling a set of suitable nodes that preserve the spatial-temporal relationships in the dynamic graph for training DGTs. Unlike previous methods that use random or structural sampling, our motivation is that the contribution of nodes to learning performance can be time-sensitive, while we still account for spatial correlation in the dynamic graph with consideration of the global and local structure of the graph. Through extensive evaluations on popular real-world datasets for node classification and link prediction, ASH-DGT consistently outperforms multiple state-of-the-art methods, achieving both higher accuracy and significant improvements in training efficiency.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/hoang25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/hoang25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Pic@Point: Cross-Modal Learning by Local and Global Point-Picture Correspondence</title>
        <description>Self-supervised pre-training has achieved remarkable success in NLP and 2D vision. However, these advances have yet to translate to 3D data. Techniques like masked reconstruction face inherent challenges on unstructured point clouds, while many contrastive learning tasks lack complexity and informative value. In this paper, we present Pic@Point, an effective contrastive learning method based on structural 2D-3D correspondences. We leverage image cues rich in semantic and contextual knowledge to provide a guiding signal for point cloud representations at various abstraction levels. Our lightweight approach outperforms state-of-the-art pre-training methods on several 3D benchmarks.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/herzog25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/herzog25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Zero-Reference Lighting Estimation Diffusion Model for Low-Light Image Enhancement</title>
        <description>Diffusion model-based low-light image enhancement methods rely heavily on paired training data, which limits their broad application. Meanwhile, existing unsupervised methods lack effective bridging capabilities for unknown degradation. To address these limitations, we propose a novel zero-reference lighting estimation diffusion model for low-light image enhancement, called Zero-LED. It utilizes the stable convergence ability of diffusion models to bridge the gap between the low-light domain and the real normal-light domain, and successfully alleviates the dependence on paired training data via zero-reference learning. Specifically, we first design an initial optimization network to preprocess the input image and implement bidirectional constraints between the diffusion model and the initial optimization network through multiple objective functions. Subsequently, the degradation factors of the real-world scene are optimized iteratively to achieve effective light enhancement. In addition, we explore a frequency-domain-based, semantically guided appearance reconstruction module that encourages feature alignment of the recovered image at a fine-grained level and satisfies subjective expectations. Finally, extensive experiments demonstrate the superiority of our approach over other state-of-the-art methods, as well as its stronger generalization capability.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/he25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/he25b.html</guid>
        
        
      </item>
    
      <item>
        <title>Enhancing Textbook Question Answering with Knowledge Graph-Augmented Large Language Models</title>
        <description>Previous works on Textbook Question Answering (TQA) suffer from limited performance due to their small-scale neural network backbones. To alleviate this issue, we propose to utilize LLMs as the backbone for TQA tasks. To this end, we employ two methods: a raw-context-based prompting method and a knowledge graph-based prompting method. Specifically, we introduce the Textbook Question Answering-Knowledge Graph (TQA-KG) method, which first converts textbook content into structural knowledge graphs and then incorporates the knowledge graph into LLM prompting, thereby enhancing the model’s reasoning capabilities and answer accuracy. Extensive experiments conducted on the CK12-QA dataset illustrate the effectiveness of the method, achieving an average improvement of 5.67% in accuracy over current state-of-the-art methods.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/he25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/he25a.html</guid>
        
        
      </item>
    
      <item>
        <title>MLP-Mixer based surrogate model for seismic ground motion with spatial source and geometry parameters</title>
        <description>Seismic motion simulations enable high-precision predictions but are computationally demanding. This study introduces a deep learning surrogate model using the MLP-Mixer architecture to address this challenge. Traditional models using independent Multi-layer Perceptrons (MLPs) fail to capture spatial correlations, while U-shaped Neural Operators (U-NOs) require high computational costs for high-resolution inputs and outputs. Our proposed model, the Multiple MLP-Mixer (Multi-MLP-Mixer), integrates global and local spatial information through multiple MLP-Mixer blocks and dual patch-wise affine transformations. We demonstrate the effectiveness of our method with simulation data from anticipated megathrust earthquakes in the Nankai Trough, achieving performance comparable to state-of-the-art models with significantly improved computational efficiency.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/hachiya25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/hachiya25a.html</guid>
        
        
      </item>
    
      <item>
        <title>TSMCR: Two-stage Supervised Multi-modality Contrastive Representation for Ultrasound-based Breast Cancer Diagnosis</title>
        <description>Contrastive learning has demonstrated great performance in breast cancer diagnosis. However, few existing works exploit label information in contrastive representation learning, especially for multi-modality ultrasound scenes. In this work, a two-stage supervised multi-modality contrastive representation classification network (TSMCR) is proposed to assist breast cancer diagnosis on multi-modality ultrasound. TSMCR consists of two-stage supervised multi-modality contrastive learning (SMCL) and a deep support vector machine (DSVM). Through a novel contrastive loss, SMCL handles both the consistency between modalities and the separability of samples. The two-stage SMCL learns expressive representations by gradually pulling the similar samples of positive pairs closer and pushing the dissimilar samples of negative pairs apart in the projection space. Furthermore, on the fusion of the multi-level contrastive representations, DSVM jointly relearns the representation network and the classifier in a unified framework to improve generalization performance. Experimental results on a multi-modality ultrasound dataset show that the proposed TSMCR achieves superior performance, with an accuracy of 87.51%, sensitivity of 86.67%, and specificity of 88.36%.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/gong25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/gong25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Enhancing DETRs for Small Object Detection via Multi-Scale Refinement and Query-Aided Mining</title>
        <description>Small object detection (SOD) aims to precisely localize and accurately classify objects of limited spatial extent with barely discernible features. Despite significant advancements in object detection driven by CNN-based and Transformer-based methods, SOD remains a significant challenge, primarily because the minimal spatial dimensions and indistinct features of small objects pose difficulties for both computational efficiency and effective supervision. In particular, Transformer-based detectors suffer from the high computational cost introduced by the feature pyramid network (FPN) and from sparse supervision of the encoder output due to insufficient positive queries. Current approaches attempt to mitigate these issues through sparse attention mechanisms and auxiliary one-to-many label assignment strategies. However, these approaches often still process multi-scale information inefficiently and fail to generate adequate positive queries for small objects. To address these issues, we propose MRQM, a novel small object detector that integrates Multi-scale Refinement and Query-aided Mining. The scale-aware encoder strategically refines features across multiple scales from a bi-directional feature pyramid network (BiFPN) through iterative updates. This process not only reduces redundant computation but also significantly enhances the representation of features at various scales. Furthermore, the IoU-aware head integrates a dynamic anchor mining strategy with one-to-many label assignment to fully mine potentially high-quality auxiliary positive queries for small instances and to mitigate the sparse supervision of the encoder. Extensive experiments on the SODA-D and VisDrone datasets consistently demonstrate the superiority and effectiveness of our MRQM method.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/fu25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/fu25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Combinatorial Causal Bandits without Graph Skeleton</title>
        <description>In combinatorial causal bandits (CCB), the learning agent chooses a subset of variables in each round to intervene on and collects feedback from the observed variables to minimize expected regret or sample complexity. Previous works study this problem in both general causal models and binary generalized linear models (BGLMs); however, all of them require prior knowledge of the causal graph structure or rely on unrealistic assumptions. This paper studies the CCB problem without the graph structure on binary general causal models and BGLMs. We first provide an exponential lower bound on cumulative regret for the CCB problem on general causal models. To overcome the exponentially large parameter space, we then consider the CCB problem on BGLMs. We design a regret minimization algorithm for BGLMs that, even without the graph skeleton, still achieves $O(\sqrt{T}\ln T)$ expected regret, as long as the causal graph satisfies a weight gap assumption. This asymptotic regret matches that of state-of-the-art algorithms relying on the graph structure. Moreover, we propose another algorithm with $O(T^{\frac{2}{3}}\ln T)$ regret that removes the weight gap assumption.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/feng25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/feng25a.html</guid>
        
        
      </item>
    
      <item>
        <title>MLCL: A Framework for Reducing Language Imbalance in Sino-Tibetan Languages through Adapter Structures</title>
        <description>Multilingual pre-trained models have been widely applied in natural language processing (NLP) tasks, including text classification. However, due to the varying amounts of language resources, these models exhibit performance imbalance across different languages, a phenomenon known as language imbalance. Existing research on mitigating language imbalance primarily harnesses text and image data, neglecting the auditory aspects of languages. This neglect results in an incomplete solution to language imbalance, as it fails to exploit the rich linguistic nuances conveyed through speech. To address these issues, this paper introduces a novel framework called MultiLingual Contrastive Learning (MLCL) to reduce language imbalance. By incorporating concepts from comparative linguistics into neural networks, we utilize the phonetic similarities among languages within the Sino-Tibetan family to tackle the problem of language imbalance in multilingual pre-trained models.  To evaluate our method’s effectiveness, we conducted tests using two synthetic datasets derived from the Flores200 and mms datasets across various models. The experimental results show that, in terms of language imbalance metrics, our model surpasses all baseline models.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/fang25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/fang25a.html</guid>
        
        
      </item>
    
      <item>
        <title>A Retrieval-Augmented Contrastive Framework for Legal Case Retrieval Based on Event Information</title>
        <description>Similar case retrieval is a crucial aspect of the legal retrieval field, significantly contributing to LegalAI systems. This task aims to retrieve cases that are highly relevant to the query case, thereby enhancing the efficiency of legal practitioners. Recent methods have leveraged the rich semantic knowledge of pre-trained models, greatly improving retrieval performance. However, these methods often overlook key legal elements within the complex language structures of case texts, such as legal event information that can impact case outcomes and judgments. This oversight results in the underutilization of critical case information. To address this issue, we propose RAEvent, a similar case retrieval contrastive framework augmented by legal event information. This framework utilizes an enhanced case event information database to provide auxiliary information for case retrieval and employs contrastive learning techniques to better extract similar features in cases. Our experimental results, compared against a range of baseline approaches, highlight the efficacy of our framework. Moreover, our research provides fresh perspectives and makes a valuable contribution to ongoing studies in similar case retrieval tasks.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/fan25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/fan25b.html</guid>
        
        
      </item>
    
      <item>
        <title>Constructing a Knowledge-Guided Mental Health Chatbot with LLMs</title>
        <description>The global shortage of mental health resources has severely impacted the ability to address psychological distress, affecting approximately 658 million people. Despite the effectiveness of psychotherapy and counseling, less than 35% of those in need receive help. Traditional conversational agents often lack emotional support, leading to mechanical interactions that detract from user experience. This paper introduces the &quot;Mental Health Chatbot,&quot; a conversational agent based on a pre-trained large language model. This chatbot innovatively uses retrieval-augmentation techniques to extract relevant knowledge from psychological diagnostics and treatment manuals, providing tailored psychotherapeutic interventions. It effectively identifies mental disorders and their severity, suggesting appropriate interventions. Evaluated through pre-trained model similarity comparisons, large language model scoring, and expert assessments, results show that the Mental Health Chatbot enhances the accuracy of smaller models and accelerates the inference speed of larger models through retrieval-augmentation. The optimized training process enables more human-like interactions, improving user experience and demonstrating the chatbot’s potential and practical application in addressing mental health challenges.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/fan25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/fan25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Motion meets Attention: Video Motion Prompts</title>
        <description>Videos contain rich spatio-temporal information. Traditional methods for extracting motion, used in tasks such as action recognition, often rely on visual contents rather than precise motion features. We refer to this as ’blind motion extraction’ behavior, which proves inefficient in capturing motions of interest due to a lack of motion-guided cues. Recently, attention mechanisms have enhanced many computer vision tasks by effectively highlighting salient visual areas. Inspired by this, we propose a modified Sigmoid function with learnable slope and shift parameters as an attention mechanism to modulate motion signals from frame differencing maps. This approach generates a sequence of attention maps that enhance the processing of motion-related video content. To ensure temporal continuity and smoothness of the attention maps, we apply pair-wise temporal attention variation regularization to remove unwanted motions (e.g., noise) while preserving important ones. We then compute the Hadamard product between each pair of attention maps and the original video frames to highlight the evolving motions of interest over time. These highlighted motions, termed video motion prompts, are subsequently used as inputs to the model instead of the original video frames. We formalize this process as a motion prompt layer and incorporate the regularization term into the loss function to learn better motion prompts. This layer serves as an adapter between the model and the video data, bridging the gap between traditional ’blind motion extraction’ and the extraction of relevant motions of interest. We show that our lightweight, plug-and-play motion prompt layer seamlessly integrates into models like SlowFast, X3D, and TimeSformer, enhancing performance on benchmarks such as FineGym and MPII Cooking 2.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/chen25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/chen25a.html</guid>
        
        
      </item>
    
      <item>
        <title>FTP: A Human Pose Estimation Method Integrating Temporal and Fine-Grained Feature Fusion</title>
        <description>Human pose estimation is a significant research direction in the field of computer vision, with critical applications in human motion reconstruction and analysis. Currently proposed human pose estimation methods primarily focus on single-modality sensor information, such as RGB images and LiDAR point clouds. While these methods have achieved promising results within their respective domains, they remain limited by the inherent deficiencies of each modality, hindering their applicability across diverse real-world scenarios. With the recent introduction of numerous multi-modality human pose datasets, multi-modality approaches have begun to develop. However, existing multi-modality fusion methods mainly consider the global feature relationships between different modalities, without modeling finer-grained features or the dynamic temporal relationships between modalities. To address this issue, we propose a novel pipeline that integrates point cloud and image features, explicitly encoding fine-grained features and dynamic temporal relationships between the two modalities. Additionally, we employ a discriminator structure for semi-supervised training. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) performance compared to previous methods.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/cai25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/cai25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Exploring Beyond Curiosity Rewards: Language-Driven Exploration in RL</title>
        <description>Sparse rewards pose a significant challenge for many reinforcement learning algorithms, which struggle in the absence of a dense, well-shaped reward function. Drawing inspiration from the curiosity exhibited in animals, intrinsically-driven methods overcome this drawback by incentivizing agents to explore novel states. Yet, in the absence of domain-specific priors, sample efficiency is hindered as most discovered novelty has little relevance to the true task reward. We present iLLM, a curiosity-driven approach that leverages the inductive bias of foundation models — Large Language Models, as a source of information about plausibly useful behaviors. Two tasks are introduced for shaping exploration: 1) action generation and 2) history compression, where the language model is prompted with a description of the state-action trajectory. We further propose a technique for mapping state-action pairs to pretrained token embeddings of the language model in order to alleviate the need for explicit textual descriptions of the environment. By distilling prior knowledge from large language models, iLLM encourages agents to discover diverse and human-meaningful behaviors without requiring direct human intervention. We evaluate the proposed method on BabyAI-Text, MiniHack, Atari games, and Crafter tasks, demonstrating higher sample efficiency compared to prior curiosity-driven approaches.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/bougie25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/bougie25a.html</guid>
        
        
      </item>
    
      <item>
        <title>EVaDE : Event-Based Variational Thompson Sampling for Model-Based Reinforcement Learning</title>
        <description>Posterior Sampling for Reinforcement Learning (PSRL) is a well-known algorithm that augments model-based reinforcement learning (MBRL) algorithms with Thompson sampling. PSRL maintains posterior distributions of the environment transition dynamics and the reward function, which are intractable for tasks with high-dimensional state and action spaces. Recent works show that dropout, used in conjunction with neural networks, induces variational distributions that can approximate these posteriors. In this paper, we propose Event-based Variational Distributions for Exploration (EVaDE), which are variational distributions that are useful for MBRL, especially when the underlying domain is object-based. We leverage the general domain knowledge of object-based domains to design three types of event-based convolutional layers to direct exploration. These layers rely on Gaussian dropouts and are inserted between the layers of the deep neural network model to help facilitate variational Thompson sampling. We empirically show the effectiveness of EVaDE-equipped Simulated Policy Learning (EVaDE-SimPLe) on the 100K Atari game suite.</description>
        <pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v260/aravindan25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v260/aravindan25a.html</guid>
        
        
      </item>
    
  </channel>
</rss>
