Proceedings of Machine Learning Research

Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis

Mon, 29 Jun 2026 00:00:00 +0000

Large language models (LLMs) employ safety mechanisms to prevent harmful outputs, yet these defenses primarily rely on semantic pattern matching. We show that encoding harmful prompts as coherent mathematical problems (using formalisms such as set theory, formal logic, and quantum mechanics) bypasses these filters at high rates, achieving 46–56% average attack success across eight target models and two established benchmarks. Crucially, the effectiveness depends not on mathematical notation itself, but on whether a helper LLM deeply reformulates the harmful content into a genuine mathematical problem: rule-based encodings that apply mathematical formatting without such reformulation perform no better than unencoded baselines. We introduce a novel Formal Logic encoding that achieves attack success comparable to Set Theory, demonstrating that this vulnerability generalizes across mathematical formalisms. Additional experiments with repeat post-processing confirm that these attacks are robust to simple prompt augmentation. Notably, newer models (GPT-5, GPT-5-Mini) show substantially greater robustness than older models, though they remain vulnerable. Our findings highlight fundamental gaps in current safety frameworks and motivate defenses that reason about mathematical structure rather than surface-level semantics.

Reinforcement Learning–Based Wind Farm Layout Optimization Using Neural Surrogate Models and Real Wind Data

Mon, 29 Jun 2026 00:00:00 +0000

This paper presents a data-driven framework for wind farm layout optimization that integrates neural surrogate modeling and reinforcement learning to address complex wake interactions and high computational costs associated with physics-based simulations. A deep neural network surrogate is trained on Reynolds-Averaged Navier–Stokes (RANS) simulation data to predict turbine power output from localized flow velocity features sampled within a 1.5 rotor-diameter neighborhood. Validation on a held-out test dataset shows normalized prediction errors below 5% when benchmarked against both manufacturer-derived power curves and RANS CFD results. The surrogate is embedded within a reinforcement learning framework to perform sequential, wake-aware turbine placement using realistic wind data for Winnipeg, Canada. The proposed approach achieves stable convergence and produces denser, higher-performing layouts than a genetic algorithm baseline under identical constraints, demonstrating its effectiveness for scalable wind farm optimization.

Real-Time Jailbreak Detection via Safety-Weighted Semantic Entropy Probes

Mon, 29 Jun 2026 00:00:00 +0000

Large language models remain vulnerable to jailbreak attacks that bypass safety alignment. Existing defenses often require multi-pass generation or gradient analysis, limiting real-time deployment. We introduce Safety-Weighted Semantic Entropy (SWSE) Probes, a lightweight method for detecting jailbreak attempts at the token-before-generation stage using neural probes on model hidden states. Inspired by semantic entropy approaches for hallucination detection, our method estimates jailbreak likelihood from a single forward pass by training probes on safety-aware entropy scores derived from clustered model responses. Evaluated on Llama-3.2-3B-Instruct using 9,697 harmful and 7,000 benign prompts, our concatenated multi-layer MLP probes achieve ROC AUC of 0.989 and 96.7% accuracy with 100$\times$ less computation than multi-sampling defenses.

A Multimodal Data Extraction Pipeline with Table Layout Correction

Mon, 29 Jun 2026 00:00:00 +0000

Financial documents such as paystubs, invoices, and financial statements contain heterogeneous layouts and visually complex tables, making reliable information extraction challenging for both optical character recognition (OCR) based pipelines and end-to-end vision–language models (VLMs). In this paper, we present a pipeline that unifies layout analysis, one-shot multimodal table correction, and downstream extraction and reasoning without any model fine-tuning. The pipeline converts document images into a hybrid Markdown-HTML representation and applies a multi-modal correction module to rectify layout-level errors in tables, yielding demonstrable improvements in Tree-Edit-Distance-based Similarity (TEDS) scores. Additionally, using this corrected representation, the system performs robust schema-based extraction and document-level question answering. Experimental results across paystub field extraction and Finance question-answering (QA) tasks show that our approach consistently outperforms both OCR-only pipelines and direct VLM baselines. These results demonstrate that incorporating explicit table layout and multimodal table correction provides a scalable and generalizable path toward robust financial document understanding.

ToxiSight: Leveraging Moderator Expertise Through Behavioral Measurement in Gaming Toxicity Annotation

Mon, 29 Jun 2026 00:00:00 +0000

Content moderation systems commonly treat human annotators as interchangeable label sources, resolving disagreements through majority voting or expert arbitration. We present ToxiSight, an annotation platform that reframes this assumption: rather than extracting consensus, the system supports moderator reasoning by treating hesitation, revision, and disagreement as signals revealing where content is genuinely ambiguous and where taxonomic guidelines fail. ToxiSight integrates gaming-specific contextual widgets with behavioral telemetry, capturing the cognitive processes underlying toxicity validation decisions. Through deployment with 10 professional moderators across 60,000 lines of gaming chat, we demonstrate that behavioral patterns expose systematic category failures invisible to traditional inter-annotator metrics. The Controversial category shows 72% revision rates with fast processing times, indicating immediate recognition of definitional breakdown, while Threats (Life-Threatening) exhibits 75% revisions with slow processing, signaling genuine interpretive complexity. Completion rates improved from 60% to 95%, and moderators reported reduced decision stress when permitted to express uncertainty. This case study demonstrates that trustworthy toxicity detection requires annotation systems designed around the irreducible complexity of human judgment, not against it.

Fed-Universe: A Semantic-Geometric-Topological-Human (S-G-T-H) Stack for Negotiated Alignment in Federated Systems

Mon, 29 Jun 2026 00:00:00 +0000

Conventional federated learners implicitly optimize for global means, creating a “one-size-fits-all” paradigm that inevitably suppresses the minority under cross-site heterogeneity. To bridge this critical gap, we present Fed-Universe, a generalizable Semantic–Geometric–Topological–Human (S–G–T–H) architecture that transforms passive averaging into an active negotiation for decentralized alignment. The S-layer deploys an edge-capable LLM as a semantic surrogate to translate heterogeneous inputs into standardized patient profile prompts. The G-layer enforces geometric quality assurance via a continuous cosine similarity gate to prevent the mis-rejection of valid minority features. The T-layer executes topological Pareto control, identifying a sensitivity-based knee point ($\lambda_k$) to balance multi-objective aggregation without majority degradation. Finally, the H-layer operationalizes a cognitive twin dynamic to mirror the user’s real-time psychological state, dynamically adapting human-computer interaction modes via active intent alignment. To ground this theoretical architecture empirically, we validated the S–G–T core on a binational synthetic clinical network simulation. This proof-of-concept demonstrated a “zero-sum escape”: a minority node (<3% data volume) achieved utility parity (loss reduction from 0.857 to 0.340) alongside dominant hubs. Our Fed-Universe framework is actively being expanded to target distinct scales of alignment failure: institutional silos (Fed-Ultra), societal bias (Fed-Urban), and individual cognitive depletion (Fed-Human).

A Flexible Fair Learning Framework via Group-Aware Surrogate Loss Reweighting

Mon, 29 Jun 2026 00:00:00 +0000

This paper presents a new algorithmic fairness framework called $\boldsymbol{\alpha}$-$\boldsymbol{\beta}$ Fair Machine Learning ($\boldsymbol{\alpha}$-$\boldsymbol{\beta}$ FML), designed to optimize fairness levels across sociodemographic attributes. Our framework employs a new family of surrogate loss functions, paired with loss reweighting techniques, allowing precise control over fairness-accuracy trade-offs through tunable hyperparameters $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$. To efficiently solve the learning objective, we propose Parallel Stochastic Gradient Descent with Surrogate Loss (P-SGD-S) and establish convergence guarantees for both convex and nonconvex loss functions. Experimental results demonstrate that our framework improves overall accuracy while reducing fairness violations, offering a smooth trade-off between standard empirical risk minimization and strict minimax fairness. Results across multiple datasets confirm its adaptability, ensuring fairness improvements without excessive performance degradation.

Hierarchy-Aware Supervised Uncertainty Estimation for Black-box LLM Taxonomic Reasoning

Mon, 29 Jun 2026 00:00:00 +0000

Large language models (LLMs) are increasingly used for scientific decision support, yet reliable confidence estimation remains difficult in black-box settings. We study uncertainty estimation for hierarchical taxonomic reasoning generated by a black-box LLM in a long-tailed biodiversity monitoring pipeline. Using proxy features extracted by an open-source tool LLM, we train lightweight supervised estimators with hierarchy-aware supervision to predict rank-wise correctness. Across three tool LLMs, the supervised estimators consistently outperform a token-likelihood baseline for micro discrimination and selective prediction under a single global rejection threshold, improving micro AUROC from 0.57 to 0.75-0.80. The best results are achieved by a rank-specific multi-head design (H3), suggesting that accounting for hierarchical output structure is important when a unified abstention rule is required.

A Hybrid Mathematical–Economic and Artificial Intelligence Framework for Competitive Market Analysis and Strategic Positioning

Mon, 29 Jun 2026 00:00:00 +0000


Diffusion-based Long and Short Term Interest Sequence Recommendation

Mon, 29 Jun 2026 00:00:00 +0000

Sequential recommendation requires modeling both stable long-term preferences and dynamic short-term intents. However, most existing methods rely on static fusion strategies, which cannot adaptively balance these signals. To address this, we propose DiffLSRec, a diffusion-based framework that performs progressive fusion of long- and short-term representations. The long-term embedding is treated as a prior, while short-term intent provides guidance during multi-step denoising, enabling dynamic and fine-grained integration. We further enhance short-term modeling with token-level contextual information and regulate the fusion process using SNR-adaptive guidance. Experiments on three Amazon datasets show that DiffLSRec consistently outperforms representative baselines across multiple metrics.

Clinical Trial Recommendation with LLM-Based Query Generation and Graph-Based Pairwise Re-ranking

Mon, 29 Jun 2026 00:00:00 +0000

Clinical trials are essential for drug development and advancing medical treatments. However, many fail due to the challenges of patient recruitment, as identifying suitable participants is both expensive and time-consuming. Recent advances in large language models have demonstrated strong potential in healthcare settings, offering a promising way to automate this process. In this study, we propose an LLM-based recommendation pipeline that suggests a ranked list of clinical trials based on patient characteristics. In the first stage, a medical LLM generates focused search queries from patient notes via chain-of-thought prompting. These queries are used to retrieve a candidate set from a large-scale clinical trial corpus via dense semantic search. In the second stage, candidates are re-ranked via pairwise re-ranking with graph aggregation. We evaluate our pipeline on the TREC Clinical Trials 2021 and 2022 benchmarks. Query-generated retrieval achieves significant improvements over raw retrieval, with Recall@1000 improving by 33.8% and 46.1% on TREC 2021 and 2022, respectively. Pairwise re-ranking with graph aggregation further improves nDCG@10 by 12.8% and 11.9%, P@10 by 11.2% and 10.1%, and MRR by 18.2% and 12.3% on TREC 2021 and 2022, respectively. All results are obtained using only an open-source 8B-parameter model, without task-specific fine-tuning or closed-source API dependence.
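For readers unfamiliar with the ranking metrics reported above, the following is a minimal sketch of P@k, reciprocal rank (averaged over queries to give MRR), and nDCG@k. It is not the authors' evaluation code: the trial identifiers and relevance grades are hypothetical, and the linear-gain form of DCG is an assumption (some evaluations use an exponential gain instead).

```python
import math

def precision_at_k(relevant, ranked, k):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def reciprocal_rank(relevant, ranked):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(gains, ranked, k):
    """nDCG@k with graded gains (dict: doc -> gain); unlisted docs count as 0."""
    dcg = sum(gains.get(doc, 0) / math.log2(i + 1)
              for i, doc in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical example: two relevant trials, the first retrieved at rank 2.
gains = {"trial_a": 2, "trial_b": 1}          # illustrative relevance grades
ranked = ["trial_x", "trial_a", "trial_b"]    # illustrative system output
relevant = set(gains)

print(precision_at_k(relevant, ranked, 3))
print(reciprocal_rank(relevant, ranked))
print(ndcg_at_k(gains, ranked, 3))
```

Averaging `reciprocal_rank` over all queries yields MRR; the relative improvements quoted in the abstract compare such averages before and after re-ranking.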

Back to the future: revival of evidence theory and modal logic for robust and interpretable AI

Mon, 29 Jun 2026 00:00:00 +0000

Years ago, when AI research was still predominantly a theoretical endeavor due to the scarcity of data, we developed a multi-valued modal logic interpretation of evidence measures. This work attracted some interest within the research community at the time but was gradually forgotten, even by us. Only recently have we revisited this modal logic interpretation of evidence theory, with the aim of developing neuro-symbolic modelling workflows capable of handling imperfect real-world data. The purpose of this position paper is to highlight the largely unexploited potential of these formalisms and to encourage the research community to further develop them toward more reliable, genuinely reasoning, and interpretable AI models. We begin by providing a concise overview of the key concepts underlying multi-valued mappings, evidence theory, and modal logic, along with our proposed multi-valued modal logic interpretation of evidence measures. We then present recent results demonstrating how these interpretations can be used to learn class expressions in weakly supervised learning scenarios. In addition, we show how these class representations can support reasoning under uncertainty in real-world applications. Finally, we discuss the untapped potential of the evidence-based approach for analyzing and quantifying the complexity of learning tasks, and we outline promising directions for future research.

To correct or not to correct?: Assessing the multiple comparisons problem for association rule mining of environmental DNA (eDNA) detection survey datasets

Mon, 29 Jun 2026 00:00:00 +0000

Unsupervised machine learning is a valuable exploratory tool for the small, noisy, and incomplete datasets characteristic of ecological and environmental research, where data limitations often render supervised approaches impractical. Here, we apply association rule mining (ARM) to a brook trout (\textit{Salvelinus fontinalis}) environmental DNA (eDNA) dataset with two objectives: (1) demonstrating ARM as a screening tool to identify environmental correlates and guide targeted metadata collection, and (2) evaluating Bonferroni and Benjamini–Hochberg corrections for managing Type I error rates during significance-based pruning. Our results show that, while both methods retained high-quality rules, the Bonferroni procedure eliminated several ecologically interesting associations that survived Benjamini–Hochberg correction. For small environmental datasets in an exploratory context, we thus favour Benjamini–Hochberg as a more appropriate correction strategy, notwithstanding the need for further external validation.
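The difference between the two corrections can be illustrated with a minimal sketch. The p-values below are invented for illustration and are not taken from the eDNA dataset; the functions implement the standard Bonferroni threshold and the standard Benjamini–Hochberg step-up procedure.

```python
def bonferroni_keep(p_values, alpha=0.05):
    """Keep a rule only if its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg_keep(p_values, alpha=0.05):
    """Step-up procedure: keep the k smallest p-values, where k is the
    largest rank with p_(k) <= (k / m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    keep = [False] * m
    for rank, idx in enumerate(order, start=1):
        keep[idx] = rank <= cutoff
    return keep

# Hypothetical p-values for five candidate association rules.
p_values = [0.001, 0.008, 0.012, 0.030, 0.200]

print(bonferroni_keep(p_values))          # [True, True, False, False, False]
print(benjamini_hochberg_keep(p_values))  # [True, True, True, True, False]
```

In this toy case Bonferroni discards two rules that Benjamini–Hochberg retains, which mirrors the qualitative pattern the abstract describes.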

Cluster-Aware Retrieval-Augmented Generation with Hybrid Retrieval for Faithful Medical Report Summarization

Mon, 29 Jun 2026 00:00:00 +0000

Large Language Models (LLMs) can generate fluent medical summaries but may hallucinate facts not supported by source clinical text, limiting safe clinical adoption. In our prior work, we improved relevance through embedding-based patient clustering and cluster-wise GPT-4.0 summarization; however, summaries could still include unsupported claims due to a lack of explicit evidence grounding. This paper extends that pipeline with a cluster-aware Retrieval-Augmented Generation (RAG) layer to ground summaries in retrieved evidence. For each cluster, we construct two evidence artifacts: (i) a Cluster Profile aggregating clinical statistics (e.g., means, ranges, abnormality rates), and (ii) a Snippet Bank of patient report excerpts. Evidence is retrieved via a hybrid retriever that combines TF-IDF and dense-embedding similarity with weighted scoring. We enforce citation-constrained prompting, requiring each major claim to cite retrieved evidence or be marked as “insufficient evidence”. We evaluate cluster-wise RAG summaries using metrics for faithfulness (supported-claim rate), completeness (coverage of key abnormal indicators), and safety and overreach (diagnostic, medication, and absolute claims). Experiments on a synthetic hypertension dataset (150 patients stratified into low-, average-, and high-risk) show that our approach reduces hallucinations while preserving the personalization benefits of clustering.
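The weighted scoring used by a hybrid retriever can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the min-max normalization, the weight value, and the example scores are all hypothetical, and the sparse (TF-IDF) and dense similarity scores are assumed to have been computed elsewhere.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_rank(sparse_scores, dense_scores, weight=0.5):
    """Combine normalized sparse and dense scores with a convex weight.

    Returns (indices sorted best-first, combined scores)."""
    sparse_n = min_max(sparse_scores)
    dense_n = min_max(dense_scores)
    combined = [weight * s + (1.0 - weight) * d
                for s, d in zip(sparse_n, dense_n)]
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return order, combined

# Hypothetical similarity scores for three snippets against one query.
tfidf_scores = [0.10, 0.40, 0.30]
dense_scores = [0.90, 0.20, 0.60]

order, combined = hybrid_rank(tfidf_scores, dense_scores, weight=0.6)
print(order)  # snippet 2 does reasonably well on both signals and ranks first
```

Normalizing before mixing keeps one signal from dominating simply because its raw scores live on a larger scale.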

CERA: Context-Engineered Reviews Architecture for Synthetic Dataset Generation

Mon, 29 Jun 2026 00:00:00 +0000

Aspect-Based Sentiment Analysis (ABSA) models require large-scale annotated datasets that are scarce, expensive to create, and suffer from class imbalance. While Large Language Models (LLMs) offer promising synthetic data generation, existing approaches lack factual grounding and provide limited aspect-level control. We present CERA (Context-Engineered Reviews Architecture), a training-free framework that generates realistic, controllable synthetic review text for ABSA through structured context engineering, i.e., carefully composing what an LLM receives as input rather than modifying the model itself. CERA’s three-phase pipeline integrates agentic web-search factual grounding with multi-agent verification, demographic-grounded persona diversity, and configurable polarity balance. Evaluated across three review domains and four architectures, CERA achieves Real-data-level corpus diversity (Distinct-2 of 0.736 vs. Real’s 0.776) while heuristic prompting collapses to 0.254, and scales to 8,000 reviews without quality degradation. Human evaluation confirms CERA reviews approach chance-level detection in a triplet Turing test (30% vs. 33% chance), nearly twice the rate of heuristic prompting (18%).
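Distinct-2 is commonly defined as the number of unique bigrams divided by the total number of bigrams in a corpus. The sketch below uses that corpus-level definition with lowercased whitespace tokenization; both choices are assumptions, since the abstract does not specify the exact variant used.

```python
def distinct_n(texts, n=2):
    """Unique n-grams divided by total n-grams across all texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()  # assumption: whitespace tokenization
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

# Hypothetical reviews: 6 bigrams in total, 4 of them unique.
reviews = ["the food was great", "the food was cold"]
print(distinct_n(reviews))
```

Values near 1 indicate that almost every bigram is new, while repetitive, templated output drives the score toward 0.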

Struct-SQL: Distilling Structured Reasoning for Small Text-to-SQL Models

Mon, 29 Jun 2026 00:00:00 +0000

Deploying accurate Text-to-SQL systems at the enterprise level faces a difficult trilemma involving cost, security and performance. Current solutions force enterprises to choose between expensive, proprietary Large Language Models (LLMs) and low-performing Small Language Models (SLMs). Efforts to improve SLMs often rely on distilling reasoning from large LLMs using unstructured Chain-of-Thought (CoT) traces, a process that remains inherently ambiguous. Instead, we hypothesize that a formal, structured reasoning representation provides a clearer, more reliable teaching signal, as the Text-to-SQL task requires explicit and precise logical steps. To evaluate this hypothesis, we propose Struct-SQL, a novel Knowledge Distillation (KD) framework that trains an SLM to emulate a powerful large LLM. Consequently, we adopt a query execution plan as a formal blueprint to derive this structured reasoning. Our SLM, distilled with structured CoT, achieves an absolute improvement of 8.1% over an unstructured CoT distillation baseline. A detailed error analysis reveals that a key factor in this gain is a marked reduction in syntactic errors. This demonstrates that teaching a model to reason using a structured logical blueprint is beneficial for reliable SQL generation in SLMs.

An LLM-based Data Augmentation Method for Different Personas to Enhance Alcohol User Prediction at the Population-Level

Mon, 29 Jun 2026 00:00:00 +0000

Alcohol is one of the most widely consumed psychoactive substances globally and is associated with considerable health, social, and legal consequences. This study presents an automated framework for the early identification of alcohol users by classifying their social media posts, addressing the substantial class imbalance commonly observed in such data. To mitigate the underrepresentation of alcohol users, our framework employs a dual-phase augmentation strategy: we first utilize classical data augmentation techniques, and then significantly enhance this approach by integrating generative AI models to synthesize realistic user data and achieve near-balanced datasets. As the core methodological innovation, we introduce the Persona-driven Data Augmentation Method (P-DAM). This technique leverages well-established psychological theories to generate diverse personas that closely resemble real individuals, thereby substantially enhancing the quality of synthetic training data. Models trained using P-DAM demonstrate highly accurate prediction of alcohol users from unlabelled X posts representative of the Canadian population and yield population-level estimates that align with Health Canada statistics, with a minimal deviation of 1.72%. This work not only validates the effectiveness of psychologically based data augmentation but also demonstrates the potential of persona-driven, LLM-based predictive models as a robust and cost-effective alternative to traditional population surveys for estimating national alcohol use prevalence and, in the future, could be applied to other national health trends.

Optimizing RAG for Academic Advising: A Hybrid Routing and Metadata Filtering Approach for Enhanced Accuracy and Efficiency

Mon, 29 Jun 2026 00:00:00 +0000

Academic advisors are essential to university life, yet they are often overwhelmed by the large number of student questions and the complexity of university policies. While Artificial Intelligence (AI) can help by searching through digital handbooks, current practices often struggle with two main problems: they are too slow and they often pull the wrong information. These systems typically search through thousands of documents for every single question, which not only wastes time but also creates "noise" where the AI gets confused by similar but irrelevant data. For an advisor, receiving the wrong policy information is a risk that cannot be ignored. This paper introduces a Hybrid Routing Layer designed to make AI a reliable assistant for professional advisors. Instead of a "brute-force" search that looks at everything, our system acts as an intelligent filter. It uses two main tools: first, a "Regex Router" that instantly finds specific items like course codes and second, a "Semantic Router" that understands the meaning behind policy questions. By narrowing the search area before the AI even begins to look for answers, we eliminate the noise that causes errors. We tested our system using a diverse set of real-world advising queries collected from the Wilfrid Laurier University (WLU) website. Our results show that this approach reduces the search space by 97%. It also makes the system 7x faster, cutting the wait time from 8.2 seconds down to just 1.3 seconds. Most importantly, it significantly improves the accuracy of the information provided to the advisor. This work provides a fast, accurate, and low-cost framework that allows advisors to support students with greater confidence and efficiency.
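The two-router idea can be sketched as a simple dispatch: try a regular expression for exact identifiers first, and fall back to semantic retrieval otherwise. The course-code pattern, the `route` function, and the returned filter dictionary are hypothetical and are not taken from the paper or from any university's actual code scheme.

```python
import re

# Hypothetical course-code shape: 2-4 letters followed by 3 digits, e.g. "CP104".
COURSE_CODE = re.compile(r"\b([A-Z]{2,4})\s?(\d{3})\b")

def route(query):
    """Return (router_name, metadata_filter) for an advising query."""
    match = COURSE_CODE.search(query.upper())
    if match:
        # Exact identifier found: restrict retrieval to that course's documents.
        return "regex", {"course_code": match.group(1) + match.group(2)}
    # No identifier: defer to a semantic router (not implemented in this sketch).
    return "semantic", {}

print(route("What are the prerequisites for cp 104?"))
print(route("How do I appeal a final grade?"))
```

The metadata filter returned by the regex branch is what would shrink the search space before any embedding search runs.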

Measuring the LEAK: A Fine-Grained Metric for Partial Information Leakage in Attempted Jailbreaking of Large Language Models

Mon, 29 Jun 2026 00:00:00 +0000

Large language model safety evaluation commonly relies on binary attack success rate (ASR) metrics, which fail to capture partial information leakage when models incompletely comply with adversarial prompts. We propose LEAK (Level of Exposed Actionable Knowledge), a fine-grained metric that decomposes attacker goals into objective components (OCs) and assigns weighted scores based on the degree to which each component is exposed. This enables precise identification of leaked behaviours and supports targeted safety improvements. We evaluated four scoring mechanisms (chrF, BERTScore, Sentence Transformers, and LLM-as-a-judge) across five open-source models. Embedding-based metrics struggled to distinguish malicious content from benign discussions, while LLM judges demonstrated superior discrimination. At-scale evaluation across 41 OCs spanning phishing, car theft, malware, and bullying showed that LLM judges achieved 80–98% accuracy in distinguishing OCs that should score high from those that should score low with respect to the amount of exposed actionable knowledge, with Qwen 8B reaching 97.56%.

Calame: An Open Source Transcription Software

Mon, 29 Jun 2026 00:00:00 +0000

While research on automatic speech processing is very active, its outcomes remain mainly inaccessible to people without programming skills or expertise. Moreover, studies focus mostly on high-resource languages and conventional setups, preventing a wider adoption and social impact of these technologies. Automatic speech processing systems can be needed in a variety of use cases, such as automatic transcription of meetings, interviews, or even conferences. They can also be useful for subtitling and dictation, or to interact with voice assistants. Non-experts may rely on commercial solutions, but these typically lack modularity, offer only partial functionalities, increase exposure to cyber threats, and impose significant financial barriers for potential users. As automatic transcription techniques improve, it becomes crucial to make these tools accessible to both the research community and the general public. To make language technology more inclusive, we released Calame, a free, open-source, and accessible software for automatic multilingual speech processing, available for both local and remote use. Its current language coverage includes English and French, with Quebec French and other low-resource languages being gradually incorporated with state-of-the-art fine-tuned models.

A Generalizable AI-Driven Decision Support System for Infectious Disease Modeling

Mon, 29 Jun 2026 00:00:00 +0000

The increasing complexity of infectious disease dynamics highlights the need for integrated and proactive surveillance systems. Traditional approaches remain largely reactive, relying on confirmed case reports and lagging indicators. This work presents a generalizable AI-driven Decision Support System (DSS) for spatiotemporal disease modeling, developed and validated using avian influenza surveillance in Canada. The proposed DSS consists of three core components: a digital surveillance module that leverages online activity for early warning signals; a spatiotemporal risk prediction module that models geographic disease risk using multi-source environmental and ecological data; and an expert system dashboard that integrates analytical outputs into an interactive, user-centered interface. The proposed DSS aims to equip policymakers and emergency responders with the tools needed to mitigate the impact of avian influenza virus (AIV) outbreaks through more informed, timely, and targeted interventions.

Enhancing Stability in Rule-Based Post-Hoc Explanations

Mon, 29 Jun 2026 00:00:00 +0000

For an explanation to be trusted, it must stay the same, or at least remain similar, when repeated. In much of the existing work, this variance, called instability, is caused by random perturbations to the sample being explained. But this is a limited view, so in this work we study stability metrics when the data used to produce perturbations are unstable. We assess powerful explainers which use rules, where explanation instability stemming from training data becomes more apparent, and we theorize why multivariate normal distributions, producing correlated perturbed training data (+P), improve stability and fidelity in our setting. We also balance classes in the training data (+B) to further improve stability, along with exploring the potential of clustering (+C) for locality improvements to explanations. By providing both theoretical reasoning for the improvements and experiments on seven diverse datasets with two different black-box architectures, we found that the rule-based method we employed, BARBE, sharply increased in stability when trained with our modified process. BARBE+PB further exceeded the performance of other methods that improve stability, such as S-LIME and LORE. The final code is available as a package on GitHub at https://github.com/IainNBSmith/Stable-BARBE.

Artificial Intelligence-Enhanced Digital Twin System for a New Generation of Intelligent Battery Management

Mon, 29 Jun 2026 00:00:00 +0000

Battery management systems are essential to estimate internal battery states and regulate current safely under nonlinear dynamics, noisy sensing, and changing operating regimes. We propose a closed-loop digital-twin–AI framework that couples (i) a physics-informed neural network (PINN) observer for real-time state estimation, (ii) a high-fidelity single-particle-model digital twin, and (iii) a reinforcement learning (RL) controller that optimizes discharge current under different C-rates. The digital twin provides physically consistent trajectories and synthetic supervision, while the PINN provides low-latency estimates of state of charge (SOC) and state of health (SOH) from noisy measurements, allowing the RL agent to act with realistic partial observability. We evaluated the framework at different C-rates, for both a single cell and a pack of cells. In addition, we include a particle swarm optimization capacity-identification module as an independent SOH benchmark. The results demonstrate stable AI-driven SOC regulation across C-rates and scalable extension from cell- to pack-level monitoring. Furthermore, this approach demonstrates a clear pathway to enhance capacity-based SOH estimation accuracy through richer physics integration and expanded training coverage.

Anatomically-conditioned Latent Diffusion Model for Data-Efficient Few-Shot Cross-Domain 3D Glioma MRI Synthesis

Mon, 29 Jun 2026 00:00:00 +0000

Accurate classification of diffuse gliomas is often hindered by domain shifts across centers and a lack of large, annotated datasets. We propose the Anatomically-conditioned Latent Diffusion Model (ALDM), a novel framework for data-efficient, few-shot 3D volumetric MRI synthesis. ALDM utilizes a two-stage approach: a 3D variational autoencoder learns anatomical priors from a data-rich source domain, while a conditional latent diffusion model, guided by tumor masks via a ControlNet, generates structurally coherent volumes for a data-scarce target domain. Evaluated in an extreme few-shot setting with only 16 target images, ALDM outperformed GAN and hybrid baselines, achieving a superior Fréchet Inception Distance (FID) of 85.40 and a downstream classification AUC of 0.987. Qualitative results confirm that the model preserves sharp pathology boundaries and cross-modal consistency, with visual fidelity improving progressively during training. By capturing essential diagnostic features, ALDM provides a robust tool for clinical data augmentation in low-resource settings.

Next-Generation AI Vegetation Analytics: Low-Cost PSRI Translation from RGB and NDVI for Precision Crop Monitoring

Mon, 29 Jun 2026 00:00:00 +0000

Plant senescence monitoring is important for crop health assessment and precision agriculture, but senescence-sensitive indices such as the Plant Senescence Reflectance Index (PSRI) are not always readily available in low-cost imaging workflows. This study presents an inexpensive artificial intelligence approach to translate RGB and Normalized Difference Vegetation Index (NDVI) imagery into PSRI maps using supervised image-to-image regression. Using tiled samples from the open-sourced Canadian Cropland Dataset, three models (UNet, pix2pix GAN, and R2AttUNet) and a baseline model (linear regressor) were trained and evaluated under a consistent pipeline using mean absolute error (MAE), peak signal-to-noise ratio (PSNR), and structural similarity index measure (SSIM). The three models all outperformed the baseline across all scores; of those, UNet achieved the best performance on both validation and test sets, producing the lowest MAE (0.0248) and the highest PSNR (19.85) and SSIM (0.9009) on the test set, while GAN showed competitive but weaker results and R2AttUNet underperformed. The close validation and test agreement indicates stable generalization under the current split. These results demonstrate the feasibility of estimating PSRI from inexpensive RGB+NDVI inputs and support the use of lightweight convolutional models for low-cost vegetation monitoring.
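Two of the reported scores, MAE and PSNR, are simple to state in code. The sketch below uses flat lists of pixel values and assumes a data range of 1.0 for PSNR; the abstract does not state the range used, so absolute PSNR values from this sketch are not comparable to the reported ones. The pixel values are invented for illustration.

```python
import math

def mae(pred, target):
    """Mean absolute error between two equally sized pixel lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)

# Hypothetical predicted and reference index values for four pixels.
pred = [0.10, 0.20, 0.30, 0.40]
target = [0.10, 0.25, 0.30, 0.30]

print(mae(pred, target))
print(psnr(pred, target))
```

Lower MAE and higher PSNR both indicate a closer match to the reference map; SSIM additionally compares local structure and is omitted here because it needs windowed statistics.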

On Efficient Computational Methods for Transformer-Based Symbolic Music Generation

Mon, 29 Jun 2026 00:00:00 +0000

Although Transformer models have shown particular promise for symbolic music generation, their quadratic computational complexity with respect to sequence length presents significant challenges for longer musical pieces. In this paper, we describe the goals and progress of an ongoing dissertation addressing these challenges through three interconnected research directions, aiming at the development of (i) novel tokenisation strategies that significantly reduce sequence lengths while maintaining generation quality, (ii) efficient methods for incorporating arbitrary musical information into attention mechanisms through both additive and multiplicative approaches, yielding statistically significant improvements over strong baselines, and (iii) a hierarchical attention architecture that explicitly models the multi-level structure of music across beats, bars, and larger segments using specialised block-sparse attention patterns. Results achieved so far support our central hypothesis that domain-aware architectural choices, informed by music theory, can yield significant improvements over generic sequence-modelling approaches.

Rotary Informational Embeddings for Symbolic Music Generation

Mon, 29 Jun 2026 00:00:00 +0000

In this paper, we present preliminary results on rotary informational embeddings (RotIE), an extension of rotary positional embeddings (RoPE) for Transformer-based symbolic music generation. With RotIE, we adapt the rotary mechanism to encode arbitrary integer-valued information such as pitch, absolute time, or intra-bar positions directly into the attention computation, allowing the model to depend on relative differences in musical attributes rather than on sequential position only. We focus on one representative per-head strategy and evaluate it on the Lakh MIDI and POP909 datasets. The presented results show improved perplexity over a regular Transformer, the Music Transformer, and a RoPE baseline, particularly on longer unseen sequences.
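
The idea of rotating by an arbitrary integer attribute instead of the sequence position can be sketched as follows. This is a minimal illustration that reuses the standard RoPE frequency schedule; the function names, the `base` constant, and the use of pitch as the attribute are assumptions, not the authors' per-head strategy.

```python
import math


def rotate(vec, value, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to an integer `value`.

    `vec` must have even length. `value` can be any integer attribute
    (pitch, absolute time, intra-bar position) rather than the token index.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = value * base ** (-i / d)  # per-pair frequency, as in RoPE
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def score(q, k, q_value, k_value):
    """Attention logit between a query and a key tagged with integer attributes."""
    rq, rk = rotate(q, q_value), rotate(k, k_value)
    return sum(a * b for a, b in zip(rq, rk))
```

Because each pair is rotated by an angle linear in the attribute, `score(q, k, 60, 64)` and `score(q, k, 72, 76)` agree up to floating-point error: the logit depends only on the difference of the two attribute values.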

Efficient Additive Relative Information Attention for Transformer-based Symbolic Music Composition

Mon, 29 Jun 2026 00:00:00 +0000

Symbolic music generation is the task of automatically composing music, treating music as a language whose words represent musical events. In recent years, approaches based on the Transformer architecture using relative positional attention have shown particular promise. However, a drawback common to existing approaches is that they are limited to relative distances between token positions and ignore properties of the elements those tokens represent. To overcome this limitation, we introduce an efficient novel method for additive relative information injection based on block-sparse matrix operations. We evaluate the effectiveness of our approach by comparing it to different network architectures and conducting an array of experiments which show improvements over previous approaches.

blksprs: A Triton Library for Block-Sparse Matrix Operations

Mon, 29 Jun 2026 00:00:00 +0000

In this paper, we introduce blksprs, a Triton-based PyTorch library for block-sparse matrix operations designed for machine-learning approaches. In contrast to existing approaches, blksprs supports a significantly wider range of operations, including but not limited to matrix multiplication, softmax, gather, scatter, transposition, (interleaved) repeat, and more. Furthermore, it supports flexible sparsity specification for all input and output matrices of these operations. These features facilitate applications that would have previously been infeasible. We provide a formal specification and demonstrate that blksprs can consistently outperform standard PyTorch and existing Triton implementations. In a practical evaluation, we were able to reduce training time by up to 35% and memory consumption by up to 45% when employing blksprs for the training of a Transformer neural network, with minimal implementational overhead.

RAG-Safe: A Recall-First Safety Framework Comparing Open-Source and Commercial LLM Moderation Pipelines

Mon, 29 Jun 2026 00:00:00 +0000

False negatives (missed detections of harmful content) remain the dominant risk in safety-critical moderation pipelines. We introduce RAG-Safe, a recall-first framework that integrates distribution-preserving contrastive augmentation, committee-diverse retrieval, and a recall-oriented decision policy into a unified moderation architecture. The framework is evaluated using a compact, fully auditable testbed designed to enforce strict leakage control: original samples alone determine the train–test split, and all paraphrases inherit their parent assignment. Within this controlled setting, conventional retrieval-augmented pipelines, both commercial (API embeddings + hosted LLM) and open-source (FAISS + local LLaMA-3), consistently under-detect unsafe content (FLAGGED recall 0.44). Applying RAG-Safe raises FLAGGED recall to approximately 0.56 across both stacks while preserving overall accuracy (0.66) and macro-F1 (0.65). A non-RAG classifier baseline provided in our public repository shows similar recall-first behaviour, reinforcing that these gains are not architecture-specific. Rather than comparing individual model components, we interpret the results as pipeline-level evidence that boundary-focused augmentation, retrieval diversity, and calibrated thresholds jointly shift LLM moderation into a safer operating regime. We conclude by discussing limitations, particularly domain transferability and adversarial robustness, and outline directions for scaling RAG-Safe to broader moderation contexts. Keywords: Content moderation, Recall-first classification, Distribution-preserving data augmentation, Committee-based retrieval, Retrieval-augmented large language models, Safety-critical AI
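
The leakage-control rule described above (originals decide the split, paraphrases follow their parent) can be sketched as below. The record layout with `id` and `parent` keys, the function name, and the default test fraction are hypothetical choices for illustration, not the testbed's actual format.

```python
import random


def split_by_parent(samples, test_fraction=0.25, seed=0):
    """Split samples so that every paraphrase lands on the same side as its original.

    samples: list of dicts with 'id' and 'parent' (None for an original sample,
    otherwise the id of the original it paraphrases). Hypothetical schema.
    """
    originals = sorted(s["id"] for s in samples if s["parent"] is None)
    rng = random.Random(seed)
    rng.shuffle(originals)
    n_test = max(1, round(len(originals) * test_fraction))
    test_ids = set(originals[:n_test])

    train, test = [], []
    for s in samples:
        root = s["id"] if s["parent"] is None else s["parent"]
        (test if root in test_ids else train).append(s)
    return train, test
```

Only original ids are shuffled, so a paraphrase can never end up in the training set while its parent sits in the test set.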

Toward Autonomous SOC Operations: End-to-End LLM Framework for Threat Detection, Query Generation, and Resolution in Security Operations

Mon, 29 Jun 2026 00:00:00 +0000

Security Operations Centers (SOCs) face mounting operational challenges from increasing threat volumes, heterogeneous SIEM platforms, and time-consuming manual triage workflows. We present an end-to-end threat management framework that integrates ensemble-based detection, syntax-constrained query generation, and retrieval-augmented resolution support to automate critical security workflows. Our detection module evaluates both traditional machine learning classifiers and large language models (LLMs), then combines the three best-performing LLMs to create an ensemble model, achieving 82.8% accuracy while maintaining a false positive rate of 0.120 on SIEM logs. We introduce the SQM (Syntax Query Metadata) architecture for automated evidence collection. It uses platform-specific syntax constraints, metadata-based retrieval, and documentation-grounded prompting to generate executable queries for IBM QRadar and Google SecOps. SQM achieves a BLEU score of 0.384 and a ROUGE-L score of 0.731. These results are more than twice as good as the baseline LLM performance. For incident resolution and recommendation generation, we demonstrate that integrating SQM-derived evidence improves resolution code prediction accuracy from 78.3% to 90.0%, with an overall recommendation quality score of 8.70. In production SOC environments, our framework reduces average incident triage time from hours to under 10 minutes. This work demonstrates that domain-constrained LLM architectures with retrieval augmentation can meet the strict reliability and efficiency requirements of operational security environments at scale.

CompressNAS: A Fast and Efficient Technique for Model Compression using Decomposition

Mon, 29 Jun 2026 00:00:00 +0000

Deep Convolutional Neural Networks (CNNs) are increasingly difficult to deploy on microcontrollers (MCUs) and lightweight NPUs (Neural Processing Units) due to their growing size and compute demands. Low-rank tensor decomposition, such as Tucker factorization, is a promising way to reduce parameters and operations with reasonable accuracy loss. However, existing approaches select ranks locally and often ignore global trade-offs between compression and accuracy. We introduce CompressNAS, a MicroNAS-inspired framework that treats rank selection as a global search problem. CompressNAS employs a fast accuracy estimator to evaluate candidate decompositions, enabling efficient yet exhaustive rank exploration under memory and accuracy constraints. On ImageNet, CompressNAS compresses ResNet-18 by 8$\times$ with less than 4% accuracy drop; on COCO, we achieve 2$\times$ compression of YOLOv5s without any accuracy drop and 2$\times$ compression of YOLOv5n with a 2.5% drop.
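
To make the rank/compression trade-off concrete, the sketch below counts parameters for a convolution before and after a Tucker-2 factorization (a 1x1 convolution, a smaller kxk core convolution, and another 1x1 convolution). This is the common Tucker-2 layout for convolutions and is assumed here for illustration; the abstract does not specify which Tucker variant or ranks CompressNAS uses, and biases are ignored.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def tucker2_params(c_in, c_out, k, r_in, r_out):
    """Weights after Tucker-2: 1x1 (c_in->r_in), k x k core (r_in->r_out), 1x1 (r_out->c_out)."""
    return c_in * r_in + r_in * r_out * k * k + r_out * c_out


def compression_ratio(c_in, c_out, k, r_in, r_out):
    """How many times smaller the factorized layer is than the original."""
    return conv_params(c_in, c_out, k) / tucker2_params(c_in, c_out, k, r_in, r_out)
```

For a 64-to-64 channel 3x3 layer with both ranks set to 16, the count drops from 36,864 to 4,352 weights. The ratio for a whole network depends on the ranks chosen for every layer, which is why rank selection can be posed as a global search.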

STRESNET & STYOLO: A New Family of Compact Classification and Object Detection Models for MCUs

Mon, 29 Jun 2026 00:00:00 +0000

Recent advancements in lightweight neural networks have significantly improved the efficiency of deploying deep learning models on edge hardware. However, most existing architectures still compromise accuracy for latency, which limits their applicability on MCU/NPU-based devices. In this work, we introduce two new model families — STResNet for image classification and STYOLO for object detection — jointly optimized for accuracy, efficiency, and memory footprint on resource-constrained platforms. The proposed STResNet series (ranging from Nano to Tiny variants) achieves competitive ImageNet-1K accuracy within a 4M parameter budget. Specifically, STResNetMilli attains 70.0% Top-1 accuracy with only 3.0M parameters, outperforming MobileNetV1 and ShuffleNetV2 at comparable computational complexity. For object detection, STYOLOMicro and STYOLOMilli achieve 30.5% and 33.6% mAP, respectively, on the MS-COCO dataset, surpassing YOLOv5n and YOLOX-Nano in both accuracy and efficiency. Furthermore, when STResNetMilli is used as a backbone with the Ultralytics detection head, it approaches the performance of the YOLOv11n model under the latest Ultralytics training environment.

CEAR: Certified Ensemble Adversarial Robustness in DNNs

Mon, 29 Jun 2026 00:00:00 +0000

Deep Neural Networks (DNNs) are highly susceptible to adversarial perturbations, leading to extensive research on robustness for safety-critical applications. State-of-the-art empirical defense mechanisms improve the robustness of DNNs through the training phase, but still struggle against adaptive white-box attacks. On the other hand, certified defenses offer provable guarantees of robustness within a specified perturbation bound. These guarantees hold for any perturbation within that bound, even if the attacker is given full knowledge of the model. In this paper, we propose CEAR, an ensemble-based robust method that utilizes a hybrid of empirical and certified defense mechanisms. CEAR trains each network within the ensemble using varying Gaussian noise and temperatures to obfuscate gradients and logits, making the model more resistant to stronger gradient-based attacks. We then use noisy logits and propose two different voting mechanisms to further improve robustness. Furthermore, we extend randomized smoothing to verify the robustness of ensemble-based classifiers. Our experimental evaluations on MNIST, CIFAR10, and TinyImageNet datasets demonstrate superior certified accuracy on average, increased robustness radius, and decreased transferability compared to baseline methods.
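
For context on the robustness radius mentioned above, the sketch below computes the standard L2 certified radius of randomized smoothing for a single smoothed classifier. It shows the well-known single-model formula only; the ensemble extension proposed in the paper is not reproduced, and the function and parameter names are illustrative assumptions.

```python
from statistics import NormalDist


def certified_radius(sigma, p_a_lower, p_b_upper):
    """L2 radius within which the smoothed prediction provably stays the same.

    sigma: standard deviation of the Gaussian smoothing noise.
    p_a_lower: lower bound on the top-class probability under noise.
    p_b_upper: upper bound on the runner-up class probability under noise.
    Both probabilities must lie strictly between 0 and 1.
    """
    if p_a_lower <= p_b_upper:
        return 0.0  # no certificate: the top class is not separated from the runner-up
    inv = NormalDist().inv_cdf  # standard normal quantile function
    return 0.5 * sigma * (inv(p_a_lower) - inv(p_b_upper))
```

With `sigma=0.5`, `p_a_lower=0.9`, and `p_b_upper=0.1`, the radius is about 0.64. A larger noise level or a larger probability margin both increase the radius.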

Evaluation Protocols Under Extreme Class Imbalance: Evidence from a Newborn Screening Case Study

Mon, 29 Jun 2026 00:00:00 +0000

Evaluation protocols, such as cross-validation and bootstrap, are extensively used when experimenting with machine learning and AI models to obtain reliable performance estimates. However, the choice of the specific configurations used (e.g., 5-fold versus 10-fold cross-validation) or the strategies for hyper-parameter tuning is often arbitrary, with researchers relying on frequently used defaults. There is limited knowledge about how these selections influence the reported performance, particularly in scenarios characterized by extreme class imbalance. In such challenging scenarios, researchers often apply resampling strategies, such as random oversampling or SMOTE, to improve the performance on the rare class. However, their effects on performance estimation under such extreme conditions also remain largely unexplored. This paper investigates the implications of multiple evaluation protocol choices in the context of extreme class imbalance, using a real-world case study in newborn screening to illustrate the practical impact on model assessment and reliability. Our findings show that some design choices critically influence the variability of the results, and different configurations can affect the robustness of results, sometimes leading to conflicting conclusions about the best-performing model.

Evaluating Role-Based Prompt Architectures in In-Context Learning

Mon, 29 Jun 2026 00:00:00 +0000

In-context learning (ICL) enables Large Language Models (LLMs) to generate predictions based on prompts without additional fine-tuning. While prompt engineering has been widely studied, the impact of role design within prompts remains underexplored. This study examines the influence of role configurations in zero-shot and few-shot learning scenarios using GPT-3.5 and GPT-4o from OpenAI and Llama2-7b and Llama2-13b from Meta. We evaluate the models’ performance across datasets, focusing on tasks like sentiment analysis, text classification, question answering, and math reasoning. Our findings suggest the potential of role-based prompt structuring to enhance LLM performance.

AdaPrivate-TS: Private Thompson Sampling for Contextual Bandits with Privacy Amplification

Mon, 29 Jun 2026 00:00:00 +0000

We present AdaPrivate-TS, a differentially private contextual bandit algorithm that combines Thompson Sampling with batched zCDP composition. Our key insight is that differential privacy noise inflates the posterior covariance in a structured way: adding $\mathcal{N}(0, \sigma^2 I)$ noise to $b$ yields sampling covariance $v^2 A^{-1} + \sigma^2 A^{-2}$, which Thompson Sampling interprets as increased uncertainty rather than pure corruption. Under event-level privacy (protecting individual interactions) with stochastic contexts, we prove that the privacy cost is only $O(\sqrt{d} \cdot \log T / \sqrt{\rho})$, logarithmic in $T$, because parallel composition amortizes noise across batches. Additionally, we explore privacy amplification via Poisson subsampling, which can reduce effective noise at stringent privacy budgets. Experiments on synthetic and real-world datasets demonstrate: (1) AdaPrivate-TS achieves 93-99% of non-private performance at $\varepsilon \in [0.5, 5]$, outperforming UCB by 0.5-3.7% and up to 18% with tuned adaptive exploration at extreme $\varepsilon$; (2) privacy amplification provides additional 2-5% gains at low $\varepsilon$; (3) on MovieLens and Jester, AdaPrivate-TS achieves the best overall performance among event-level baselines, dominating at $\varepsilon \geq 2$; (4) under DP-SVD private features, TS’s advantage over UCB grows to +11%, confirming noise-as-uncertainty is not limited to reward privacy. We provide rigorous proofs for privacy guarantees under interactive zCDP composition and comprehensive evaluation including convergence curves, 12-seed CIs, and DP-SVD feature ablation. Keywords: Differential Privacy, Thompson Sampling, Contextual Bandits, Privacy Amplification, zCDP
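
The covariance claim above can be checked numerically on a small example. The sketch assumes the usual linear Thompson Sampling setup, in which the estimate is A^{-1} b and a sample is drawn with covariance v^2 A^{-1}; adding Gaussian noise eta to b then contributes A^{-1} eta, whose covariance is sigma^2 A^{-2} for symmetric A. The 2x2 helpers and all names are illustrative assumptions, not the paper's implementation.

```python
import math
import random


def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(x, y):
    """Product of two 2x2 matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def chol2(m):
    """Lower-triangular Cholesky factor of a symmetric positive-definite 2x2 matrix."""
    l00 = math.sqrt(m[0][0])
    l10 = m[1][0] / l00
    return [[l00, 0.0], [l10, math.sqrt(m[1][1] - l10 * l10)]]


def sampling_covariance(A, v, sigma):
    """Analytic covariance v^2 A^{-1} + sigma^2 A^{-2} for symmetric positive-definite A."""
    A_inv = inv2(A)
    A_inv2 = matmul2(A_inv, A_inv)
    return [[v ** 2 * A_inv[i][j] + sigma ** 2 * A_inv2[i][j] for j in range(2)] for i in range(2)]


def empirical_covariance(A, v, sigma, n=20000, seed=0):
    """Monte Carlo covariance of A^{-1}(b + eta) + v L z, with L L^T = A^{-1} and b = 0."""
    rng = random.Random(seed)
    A_inv = inv2(A)
    L = chol2(A_inv)
    xs = []
    for _ in range(n):
        eta = [rng.gauss(0, sigma), rng.gauss(0, sigma)]  # privacy noise added to b
        z = [rng.gauss(0, 1), rng.gauss(0, 1)]            # Thompson Sampling draw
        xs.append([
            A_inv[i][0] * eta[0] + A_inv[i][1] * eta[1] + v * (L[i][0] * z[0] + L[i][1] * z[1])
            for i in range(2)
        ])
    mean = [sum(x[i] for x in xs) / n for i in range(2)]
    return [[sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in xs) / (n - 1) for j in range(2)]
            for i in range(2)]
```

Setting b to zero loses nothing here, since the covariance does not depend on the mean. The empirical matrix should match the analytic one up to Monte Carlo error.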

STRIDE Moves Market Sentiment

Mon, 29 Jun 2026 00:00:00 +0000

Aspect-based sentiment analysis in the financial domain requires models to reason over sparse, entity-centric signals while remaining robust to linguistic variability and conflicting cues. We introduce STRIDE, a reinforcement learning framework that reformulates keyword selection as a sequential decision-making problem, integrating Directional Stimulus Prompting (DSP) with stable reward-driven policy optimization. To address the instability of sparse, high-variance reward signals, STRIDE incorporates exponential moving average (EMA) smoothing into the REINFORCE objective, enabling more reliable gradient estimates for policy learning. We evaluate STRIDE on two benchmark financial sentiment datasets: SEntFiN 1.0 and FinEntity. On SEntFiN 1.0, STRIDE achieves state-of-the-art F1-score (0.946) and near state-of-the-art accuracy (0.950). On FinEntity, STRIDE exceeds the previous state-of-the-art F1-score by 4.2%, achieving state-of-the-art performance on both accuracy (0.942) and F1-score (0.933). Across both datasets, the results demonstrate that EMA-smoothed rewards provide consistent improvements of 2.6% to 4.3% F1 relative to unsmoothed baselines, validating the effectiveness of stability-aware reward formulation for financial aspect-based sentiment analysis. The source code for reproducibility is available at: https://github.com/sujayrittikar/stride_sentiment_analysis.
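
One common way to combine an exponential moving average with REINFORCE is to use the EMA of past rewards as a baseline, which reduces the variance of the gradient estimate. The sketch below shows that variant as an assumption; the abstract does not say exactly how STRIDE applies the smoothing, and `alpha`, `init`, and both function names are hypothetical.

```python
def ema_advantages(rewards, alpha=0.9, init=0.0):
    """Return (advantages, smoothed) using an EMA of earlier rewards as the baseline.

    Each advantage uses the baseline computed before the current reward is seen.
    """
    baseline = init
    advantages, smoothed = [], []
    for r in rewards:
        advantages.append(r - baseline)
        baseline = alpha * baseline + (1 - alpha) * r  # EMA update
        smoothed.append(baseline)
    return advantages, smoothed


def reinforce_loss(log_probs, advantages):
    """Surrogate loss: negative mean of advantage-weighted log-probabilities."""
    return -sum(lp * a for lp, a in zip(log_probs, advantages)) / len(log_probs)
```

A larger `alpha` gives a smoother, slower-moving baseline. In a real training loop the loss would be minimised with an autodiff framework rather than evaluated on plain floats.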

Density-Aware Graph Generation with Learnable Edge Prediction

Mon, 29 Jun 2026 00:00:00 +0000

Generating realistic graph-structured data is challenging due to discrete structures, variable sizes, and class-specific connectivity patterns that resist conventional generative modeling. While recent graph generation methods employ generative adversarial network (GAN) frameworks to handle permutation invariance and irregular topologies, they typically rely on random edge sampling with fixed probabilities, limiting their capacity to capture complex structural dependencies between nodes. We propose a density-aware conditional graph generation framework using Wasserstein GANs (WGAN) that replaces random sampling with a learnable distance-based edge predictor. Our approach embeds nodes into a latent space where proximity correlates with edge likelihood, enabling the generator to learn meaningful connectivity patterns. A differentiable edge predictor determines pairwise relationships directly from node embeddings, while a density-aware selection mechanism adaptively controls edge density to match class-specific sparsity distributions observed in real graphs. We train the model using a WGAN with gradient penalty, employing a GCN-based critic to ensure generated graphs exhibit realistic topology and align with target class distributions. Experiments on benchmark datasets demonstrate that our method produces graphs with superior structural coherence and class-consistent connectivity compared to existing baselines. The learned edge predictor captures complex relational patterns beyond simple heuristics, generating graphs whose density and topology closely match real structural distributions. Our results show improved training stability and controllable synthesis, making the framework effective for realistic graph generation and data augmentation.

Segmentation Expert-Mixture Regularization: An Adaptive Learning Method for Imbalanced Regression Problems

Mon, 29 Jun 2026 00:00:00 +0000

Imbalanced regression poses a significant challenge for models across a diverse set of domains, where rare extreme cases are often the most important. Standard regression methods, which optimize global error objectives, tend to prioritize high-density regions of the target space, resulting in systematically degraded performance in low-density, extreme regions. Although prior work has focused on data-level strategies that modify the target distribution, comparatively little attention has been devoted to modifying the learning process itself, making it imbalance-aware. In this paper, we introduce Segmentation Expert-Mixture Regularization (SER), a novel algorithm-level framework for imbalanced regression. SER partitions the target space into regions of varying density and leverages a mixture-of-experts architecture to promote specialization across these regions. A regularization mechanism ensures smooth transitions between the data partitions and provides global coherence across segment boundaries. This ensures an adaptive and stable learning method over the entire target space. By integrating segmentation, expert specialization, and regularization within a unified learning framework, SER improves robustness and predictive performance, especially in the rare, extreme, and most important target cases. Our experiments show consistent improvements over standard models, particularly in extreme target quantiles. We further analyze the impact of segmentation design, parameter sensitivity, and performance variation across the target distribution. To foster reproducibility and future research, all our code is publicly released.

Uncovering Latent Subgroups: Spectral Clustering for Fairness Analysis in Contrastive Embeddings

Mon, 29 Jun 2026 00:00:00 +0000

Contrastive learning enables scalable representation learning in computer vision and healthcare, yet embedding spaces may encode unequal geometric structure across latent subpopulations, leading to downstream performance disparities. Conventional fairness audits relying on demographic labels often fail to detect such structural bias. This work examines whether contrastive embeddings contain fairness-relevant latent subgroups that can be identified without demographic supervision. We introduce a label-free spectral fairness audit that constructs similarity graphs over CLIP embeddings and applies eigengap-based spectral clustering. Experiments on CheXpert reveal stable latent subgroups with noticeable geometric distortions and performance gaps, exposing hidden fairness risks missed by demographic-based evaluations. This work enables label-free discovery of hidden fairness and reliability risks in contrastive embeddings, supporting safer, more transparent deployment of foundation models in healthcare and other high-stakes domains.
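
The eigengap heuristic used to pick the number of clusters can be sketched as below: sort the eigenvalues of the graph Laplacian and choose the count that precedes the largest jump. Computing the Laplacian and its eigenvalues from the similarity graph is left out, and the function name and `max_k` cap are assumptions for illustration.

```python
def choose_k_by_eigengap(eigenvalues, max_k=None):
    """Number of clusters suggested by the largest gap between sorted eigenvalues.

    eigenvalues: at least two eigenvalues of a graph Laplacian.
    max_k: optional cap on the number of clusters considered.
    """
    vals = sorted(eigenvalues)
    limit = len(vals) - 1 if max_k is None else min(max_k, len(vals) - 1)
    gaps = [vals[i + 1] - vals[i] for i in range(limit)]
    return gaps.index(max(gaps)) + 1  # k eigenvalues lie below the largest gap
```

For eigenvalues `[0.0, 0.01, 0.02, 0.9, 1.0]`, the largest gap follows the third value, so the heuristic suggests three clusters.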

PICAB: A Permutation-Invariant Contextual Attention Bandit for Energy-Constrained Edge AI

Mon, 29 Jun 2026 00:00:00 +0000

The deployment of Deep Neural Networks (DNNs) on resource-constrained edge devices presents a fundamental challenge: high-accuracy models often exceed the compute and energy budgets of local hardware, while full cloud offloading incurs unpredictable network latency. Inference splitting has emerged as a promising solution to this trade-off, enabling a DNN to be partitioned layer-wise across the edge-cloud continuum. However, optimizing these split decisions is non-trivial; the action space comprising valid cut points and available target nodes fluctuates dynamically with each request, rendering standard fixed-output Reinforcement Learning (RL) architectures ineffective. In this paper, we propose a Permutation-Invariant Contextual Attention Bandit (PICAB), a lightweight deep learning framework designed for real-time DNN partitioning. Our architecture employs a Multi-Head Attention mechanism to encode variable-sized sets of candidate execution plans, allowing the agent to generalize across diverse network environments without retraining. By incorporating the target node’s battery state into the attention mechanism, the agent learns energy-aware and robust offloading decisions. We evaluate our approach using heterogeneous workloads with periodic IoT inference streams. Experimental results demonstrate that our algorithm achieves a 23% reduction in makespan compared to the metaheuristic baselines while maintaining similar Energy-Delay Product (EDP), effectively balancing the trade-off between inference latency and energy sustainability.

ADAPTive Input Training for Many-to-One Pre-Training on Time-Series Classification

Mon, 29 Jun 2026 00:00:00 +0000

Recent work on time-series models has leveraged self-supervised training to learn meaningful features and patterns in order to improve performance on downstream tasks and generalize to unseen modalities. While these pretraining methods have shown great promise in one-to-many scenarios, where a model is pre-trained on one dataset and fine-tuned on a downstream dataset, they have struggled to generalize to new datasets when more datasets are added during pre-training. This is a fundamental challenge in building foundation models for time-series data, as it limits the ability to develop models that can learn from the large variety of available datasets. To address this challenge, we present a new pre-training paradigm for time-series data called ADAPT, which can efficiently align the physical properties of data in the time-series domain, enabling mixed-batch pre-training despite the extreme discrepancies in the input sizes and channel dimensions of pre-training data. We trained on 162 time-series classification datasets and set new state-of-the-art performance for classification benchmarks. We successfully train a model within the time-series domain on a wide range of datasets simultaneously, which is a major building block for generalist foundation models in time-series domains.

Enhancing Few-shot Node Classification with High Order Graph Neural Networks

Mon, 29 Jun 2026 00:00:00 +0000

Graph neural networks (GNNs) have achieved significant success in node classification tasks. However, their performance declines when trained on a few examples per class or when applied to unseen classes. Meta-learning tackles this problem by training models across many small learning tasks so that they can quickly adapt to new classes from limited data. In this setting, each task provides a small support set with a few labelled nodes per class, and the model is evaluated on a separate query set of unseen nodes from the same classes. However, applying meta-learning to graphs is particularly challenging due to the interconnected nature of graphs. Existing approaches often enrich tasks with additional contextual information or modify training objectives to better exploit neighbouring nodes and labels through supervised or self-supervised signals. However, the structural complexity of graphs makes it difficult to design stable and transferable tasks, as structural differences across tasks can lead to inconsistent feature representations. We argue that this limitation stems less from the meta-learning framework itself and more from the limited expressive power of standard GNNs. To overcome this, we leverage higher-order GNNs to generate richer node representations during both training and testing, improving the model’s ability to generalize to new classes. Extensive experiments on multiple benchmark datasets demonstrate consistent improvements over state-of-the-art methods. The source code for this project is publicly available at https://github.com/sirajummprince/HIGH-META.

Disentangling dataset size from synthetic diversity in tuberculosis chest X-ray classification

Mon, 29 Jun 2026 00:00:00 +0000

Synthetic data augmentation is often proposed as a remedy for limited and imbalanced medical imaging datasets. We study tuberculosis detection on the Tuberculosis Chest X-ray Database by training a 256$\times$256 WGAN-GP and a 512$\times$512 latent diffusion model fine-tuned from RoentGen-v2. We evaluated both for image quality and downstream utility. On generative metrics, diffusion outperforms WGAN-GP, achieving lower FID (6.56 vs. 9.28) and substantially lower radiology-aligned Rad-Dino FID (117.89 vs. 201.78), along with higher SSIM/MS-SSIM under our deterministic gen–real pairing protocol. However, in controlled DenseNet-121 classifier experiments under a fixed optimization budget (4,000 steps with identical selection criteria), synthetic augmentation does not outperform a count-matched duplicate-real control at matched dataset size. The duplicate-real control yields the best downstream performance despite adding no new information (e.g., 0.9981 $\pm$ 0.0013 test AUPRC at r = 5), while the best synthetic setting is diffusion at low ratio (r = 0.25). Increasing the synthetic-to-real ratio is not beneficial: high synthetic proportions degrade downstream performance, with particularly sharp deterioration for WGAN-GP at large ratios. Overall, the study demonstrates that superiority on generative metrics does not guarantee downstream benefit and highlights the importance of rigorous, count-matched augmentation controls when claiming gains from synthetic data.

SAMURAI: A Two-Stage Foundation Model Pipeline for Robust Optic Nerve Head Segmentation in Fundus Images

Mon, 29 Jun 2026 00:00:00 +0000

Accurate segmentation of the optic nerve head (ONH) is essential for automated glaucoma assessment using the Cup-to-Disc Ratio (CDR). However, conventional convolutional neural networks (CNNs) often exhibit performance degradation under domain shift caused by variations in fundus imaging devices and protocols. Foundation models offer a potential solution due to their large-scale pre-training and intrinsic feature invariance. While the Segment Anything Model (SAM) offers a robust alternative, recent adaptations have resorted to complex, task-specific architectural modifications to handle retinal geometry. In this paper, we propose SAMURAI, a two-stage foundation model pipeline that combines a YOLOv12x-based ONH localizer with a minimally adapted MedSAM foundation model. We rigorously evaluate this supervised baseline against exploratory variants incorporating geometric inductive biases (polar transformations) and semi-supervised learning (SSL). On the REFUGE benchmark, our simplified approach establishes a new state-of-the-art, achieving an Optic Cup Dice of 0.920, significantly outperforming specialized models like FunduSAM (0.867). Furthermore, our ablation study reveals that additional architectural complexity does not confer measurable performance gains over the foundation baseline. These findings suggest that large-scale pre-trained foundation models provide sufficient robustness for ONH segmentation without task-specific architectural modifications.

TAG-BiLSTM: A Temporal Attention-guided BiLSTM for Real-time Rear-end Collision Prediction in Intelligent Transportation Systems

Mon, 29 Jun 2026 00:00:00 +0000

Rear-end collision prediction is a critical component of modern Intelligent Transportation Systems (ITS), enabling the early detection of hazardous driving conditions and supporting advanced driver assistance functions. Although recent deep learning (DL) approaches have shown strong predictive capability, many architectures, especially those relying on heavy attention or graph-based computations, suffer from high inference cost, limiting their suitability for real-time deployment on edge devices. To address this challenge, we propose a lightweight temporal attention-guided bidirectional long short-term memory (TAG-BiLSTM) network that efficiently models historical vehicle-kinematic sequences while capturing long-range temporal dependencies. The BiLSTM backbone encodes forward and backward motion context, while the attention gate selectively emphasizes the most safety-critical frames associated with rapid deceleration, headway compression, or abrupt motion changes. Experimental evaluation on benchmark open-source datasets, including the Next Generation Simulation (NGSIM) and highD, demonstrates the robustness of the proposed model, achieving a mean performance exceeding 94%.

Interpretable Dynamic Rule Attention for Medical Coding

Mon, 29 Jun 2026 00:00:00 +0000

Automatic medical coding maps clinical text to International Classification of Diseases (ICD) codes. While highly accurate, recent neural models operate as black boxes, limiting clinical trust and accountability. We address this by proposing an interpretable, rule-guided attention method for a BioClinicalBERT model fine-tuned on Medical Information Mart for Intensive Care III (MIMIC-III) discharge summaries. Our lightweight approach incorporates domain knowledge via keyword mappings, softly biasing attention toward clinical evidence without restricting the model’s learning capacity. Evaluated on the full ICD-9 task, the model improves micro-F1 (0.330 to 0.384), micro-precision (0.391 to 0.420), and micro-recall (0.285 to 0.353). A McNemar test confirms a statistically significant shift in prediction behaviour ($p < 10^{-10}$), while quantitative analysis shows significantly increased attention mass on diagnostic keywords ($p < 10^{-15}$). This transparency incurs minimal computational overhead, utilizing linear-time matching without altering the core transformer architecture. Qualitative visualizations further demonstrate that this rule guidance yields clearer, evidence-aligned decision patterns without sacrificing predictive accuracy.

Expanding AI Literacy: How Emotional Intelligence Supports Productive Uncertainty

Mon, 29 Jun 2026 00:00:00 +0000

This paper presents a conceptual framework for understanding the role of uncertainty in the context of LLM usage in education. We argue that AI does not eliminate uncertainty but displaces it, often reducing epistemic uncertainty while amplifying its metacognitive and emotional forms. Throughout the paper, we use hypothetical examples to illustrate the theoretical concepts discussed, although these examples are not meant to provide a predictive account of student outcomes. Finally, we present emotional intelligence skills as a vital but often-overlooked component of AI literacy, helping students recognize the uncertainty that arises in co-creation with LLMs not as a cause for unproductive anxiety, but as a catalyst for creativity.

Personalized Stability Triggers: A Model-Agnostic Framework for Adaptive Early Prediction of At-Risk Students in Education

Mon, 29 Jun 2026 00:00:00 +0000

Early prediction of student performance and timely intervention are important in education. A fundamental challenge is the trade-off between temporal earliness (intervening sooner) and predictive reliability (waiting for sufficient data). Conventional machine learning models typically impose static observation windows that do not account for the heterogeneous behaviors of individual students. In this paper, we propose the Personalized Stability Trigger (PST), a novel dynamic framework that identifies the optimal inference moment for each student based on the stochastic convergence of model confidence. By leveraging the ensemble variance of a Random Forest estimator, PST detects an "information plateau" where further data collection yields marginal information gain. We validate this framework on two disparate educational datasets: ASSISTments (micro-scale problem solving) and OULAD (macro-scale course engagement). Experimental results demonstrate that PST reduces observation latency by up to 20% compared to fixed-window baselines while preserving >99.5% of prediction accuracy. These findings indicate that stability-driven triggers offer a scalable, robust, and model-agnostic solution for early prediction of at-risk students.
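
A stability trigger of this kind can be sketched as a stopping rule on ensemble disagreement: predict as soon as the variance of per-tree probabilities has stayed below a threshold for several consecutive steps. The threshold `eps`, the `patience` window, and the toy history are hypothetical; the paper's actual convergence criterion may differ.

```python
from statistics import pvariance

def stability_trigger(tree_probs_over_time, eps=0.01, patience=2):
    """Return the first step at which ensemble variance has been <= eps
    for `patience` consecutive steps, or None if it never stabilizes.

    tree_probs_over_time: one list of per-tree probabilities per time step.
    """
    streak = 0
    for t, probs in enumerate(tree_probs_over_time):
        streak = streak + 1 if pvariance(probs) <= eps else 0
        if streak >= patience:
            return t
    return None

# Toy history for one student: trees disagree early, then converge.
history = [
    [0.10, 0.90, 0.50],
    [0.30, 0.80, 0.60],
    [0.70, 0.75, 0.72],
    [0.71, 0.74, 0.73],
    [0.72, 0.74, 0.73],
]
trigger_step = stability_trigger(history)
```

Because the rule only looks at the spread of ensemble outputs, the same check could in principle wrap any model that exposes multiple member predictions.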

Hardware Accelerated Privacy-Preserving Ensemble Learning for X-Ray Image Diagnostics

Mon, 29 Jun 2026 00:00:00 +0000

The adoption of machine learning (ML) in highly regulated and sensitive domains such as healthcare is constrained by escalating concerns regarding data privacy and stringent legal and regulatory frameworks. Although Privacy-Preserving Machine Learning (PPML) techniques provide strong formal guarantees against data leakage, they frequently incur a non-negligible reduction in predictive performance. This inherent privacy–accuracy trade-off constitutes a primary obstacle to the practical deployment of PPML systems. This research introduces a novel PPML framework that leverages hardware acceleration techniques in conjunction with ensemble learning to alleviate accuracy degradation and improve performance simultaneously. X-ray images are ideal for PPML diagnostics, as they provide clear, high-contrast visualizations that promote fast detection of complex ailments. The resulting system constitutes a robust PPML architecture that attains state-of-the-art performance, achieving an accuracy of 94% relative to existing single-model baselines, while simultaneously reducing false negatives by an average of 62%.

LLM Sycophancy: How Users Flag and Respond

Mon, 29 Jun 2026 00:00:00 +0000

While concerns about LLM sycophancy have grown among researchers and developers, how users themselves experience this behavior remains largely unexplored. We analyze Reddit discussions to investigate how users detect, mitigate, and perceive sycophantic AI. We develop the DCR epistemology that maps user experiences across three stages: observing sycophantic behaviors, detecting sycophancy, and responding to these behaviors. Our findings reveal that users employ various detection techniques, including cross-platform comparison and inconsistency testing. We document diverse mitigation approaches, including persona-based prompts and targeted language patterns in prompt engineering. We find sycophancy’s effects are context-dependent rather than universally harmful. Specifically, vulnerable populations experiencing trauma, mental health challenges, or isolation actively seek and value sycophantic behaviors as emotional support. Users develop both technical and folk explanations for why sycophancy occurs. These findings challenge the assumption that sycophancy should be eliminated universally. We conclude by proposing context-aware AI design that balances risks with benefits of affirmative interaction, while discussing implications for user education and transparency.

Multi-Dimensional Model Integrity and Responsibility Assessment Index and Scoring Framework

Mon, 29 Jun 2026 00:00:00 +0000

Artificial intelligence in high-stakes tabular domains cannot be evaluated by predictive performance alone, yet current practice still assesses explainability, fairness, robustness, privacy, and sustainability mostly in isolation. We propose the Model Integrity and Responsibility Assessment Index (MIRAI), a unified evaluation framework that measures tabular models across these five dimensions under a controlled comparison setting and aggregates them into a single score. MIRAI combines established metrics through normalized and direction-aligned dimension scores, which enables direct comparison across models with different architectural and computational profiles. Experiments on healthcare, financial, and socioeconomic datasets show that higher predictive performance does not necessarily imply better overall integrity and responsibility. In several cases, simpler models achieve a stronger cross-dimensional balance than more complex deep tabular architectures. MIRAI provides a compact and practical basis for responsible model selection in regulated settings.
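
The phrase "normalized and direction-aligned dimension scores" can be illustrated with a small sketch: each metric is min-max normalized across models, flipped when lower is better, and the results are averaged. Min-max scaling, equal weights, the metric names, and the numbers are all assumptions made for illustration; the abstract does not specify MIRAI's actual normalization or weighting.

```python
def aligned_scores(raw, higher_is_better):
    """Min-max normalize one metric across models to [0, 1],
    flipping the scale when lower raw values are better."""
    lo, hi = min(raw.values()), max(raw.values())
    span = hi - lo
    out = {}
    for model, value in raw.items():
        s = 0.5 if span == 0 else (value - lo) / span
        out[model] = s if higher_is_better else 1.0 - s
    return out

def composite(metrics):
    """metrics: {name: (raw_by_model, higher_is_better)}.
    Returns the unweighted mean of aligned scores per model."""
    per_metric = [aligned_scores(raw, hib) for raw, hib in metrics.values()]
    models = list(per_metric[0].keys())
    return {m: sum(pm[m] for pm in per_metric) / len(per_metric) for m in models}

# Invented numbers: the deep model is more accurate but worse elsewhere.
toy = {
    "accuracy": ({"logreg": 0.80, "deep": 0.90}, True),
    "fairness_gap": ({"logreg": 0.02, "deep": 0.10}, False),
    "energy_kwh": ({"logreg": 1.0, "deep": 9.0}, False),
}
scores = composite(toy)
```

In this toy case the simpler model ends up with the higher composite score despite lower accuracy, which mirrors the kind of outcome the abstract reports.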

Does Context Compression Preserve Refusal Alignment?

Mon, 29 Jun 2026 00:00:00 +0000

Context compression reduces inference cost by encoding inputs into compact representations while preserving semantic content. An open question is whether semantic preservation alone is sufficient to maintain downstream behaviours such as refusal alignment. We investigate this question and find that encoder-based compression systematically weakens refusal behaviour in instruction-tuned language models, despite high reconstruction fidelity. This effect persists across model families and compression architectures. Mechanistic analysis shows that compression attenuates activation along the decoder’s learned refusal direction. We further explore Memory Steering, a lightweight inference-time intervention that restores refusal rates to near-baseline levels without retraining and operates entirely in compressed representation space. These results demonstrate that semantic preservation does not guarantee behavioural preservation under compression, highlighting the need to explicitly preserve alignment-relevant features in compression-aware systems.
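
The idea of activation "along a refusal direction" can be made concrete with a projection onto a unit vector, plus a generic steering step that shifts the activation until its projection reaches a target value. This is a sketch of ordinary activation steering under invented two-dimensional vectors; it is not the paper's Memory Steering procedure, whose details are not given in the abstract.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def projection(activation, direction):
    """Scalar component of `activation` along `direction`."""
    d = unit(direction)
    return sum(a * b for a, b in zip(activation, d))

def steer(activation, direction, target):
    """Shift `activation` along `direction` so its projection equals `target`."""
    d = unit(direction)
    delta = target - projection(activation, direction)
    return [a + delta * b for a, b in zip(activation, d)]

refusal_dir = [3.0, 4.0]     # hypothetical learned direction
uncompressed = [3.0, 4.0]    # projection 5.0
compressed = [1.5, 2.0]      # attenuated projection 2.5
restored = steer(compressed, refusal_dir, projection(uncompressed, refusal_dir))
```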

A Meta-Analysis of Evaluation Framework Reliability and Cross-Domain Generalization

Mon, 29 Jun 2026 00:00:00 +0000

We conduct the first systematic meta-analysis comparing 20 Retrieval-Augmented Generation (RAG) evaluation frameworks, spanning traditional metrics and interpretability methods, from 2020 through 2026, using identical samples across three knowledge domains. Applying all twenty frameworks to 200 question-context-answer triples from RAGBench, we obtain Cochran’s Q = 10,055.63 (p < 0.001) with I^2 = 99.81%, indicating that the large majority of score variance reflects true differences between frameworks rather than sampling noise. Pairwise Pearson correlations range from r = -0.28 to r = 0.90 (median r = 0.21), and three distinct clusters emerge: LLM-as-judge methods (within-cluster mean r = 0.55), a mixed-methods group (mean r = 0.63), and an outlier cluster containing BERTScore, GaRAGe, HALT-RAG, QAFactEval, and RAGChecker (mean r = 0.01). Cluster assignments are perfectly stable across four normalization schemes (ARI = 1.0), and bootstrap resampling confirms co-assignment probabilities of at least 0.75 within the LLM-as-judge and mixed-methods clusters, with the outlier cluster ranging from 0.46 to 0.97. A cross-cluster consensus protocol labels 92.5% of samples as contested, with only 6% receiving unanimous faithful verdicts. These results demonstrate that current evaluation frameworks do not measure a unified construct, and we provide empirically grounded selection guidelines for future research.
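
As a quick consistency check on the heterogeneity statistics, I^2 can be recomputed from Cochran's Q using the standard relation I^2 = max(0, (Q - df) / Q). The value of Q is taken from the abstract; df = 19 is an assumption (the number of frameworks minus one).

```python
def i_squared(q, df):
    """I^2 as a percentage: the share of total variability that exceeds
    what would be expected from sampling error alone."""
    return max(0.0, (q - df) / q) * 100.0

# Q from the abstract; df = 20 - 1 is assumed, not stated in the abstract.
i2 = i_squared(10055.63, 19)
```

Under that assumption the result rounds to 99.81%, matching the reported figure.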

Reclaiming the Loop: From the Consensus Trap to Pluralistic Data Annotation

Mon, 29 Jun 2026 00:00:00 +0000

This research challenges the dominant “ground truth” paradigm in machine learning, arguing that current annotation practices suppress meaningful human disagreement in favor of artificial consensus. It identifies two structural failures in annotation pipelines: the allocation gap (mismatch between annotator identity and data context) and the representation gap (erasure of nuance during label aggregation). The proposed solution introduces a pluralistic annotation infrastructure that incorporates identity-aware task assignment and rationale-aware aggregation to preserve lived experience and dissent. By reframing disagreement as a high-fidelity epistemic signal rather than noise, the work advances a model of situated knowledge stewardship aimed at promoting epistemic justice in AI systems.

Are You the A-hole? A Fair, Multi-Perspective Ethical Reasoning Framework

Mon, 29 Jun 2026 00:00:00 +0000

Standard methods for aggregating natural language judgments, such as majority voting, often fail to produce logically consistent results when applied to high-conflict domains, treating differing opinions as noise. We propose a neuro-symbolic aggregation framework that formalizes conflict resolution through Weighted Maximum Satisfiability (MaxSAT). Our pipeline utilizes a language model to map unstructured natural language explanations into interpretable logical predicates and confidence weights. These components are then encoded as soft constraints within the Z3 solver, transforming the aggregation problem into an optimization task that seeks the maximum consistency across conflicting testimony. Using the Reddit r/AmItheAsshole forum as a case study in large-scale moral disagreement, our system generates logically coherent verdicts that diverge from popularity-based labels 62% of the time, corroborated by an 86% agreement rate with independent human evaluators. This study demonstrates the efficacy of coupling neural semantic extraction with formal solvers to enforce logical soundness and explainability in the aggregation of noisy human reasoning.
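
The weighted MaxSAT formulation can be illustrated without a solver by exhaustively scoring every truth assignment on a tiny instance. The sketch below swaps Z3 for brute-force enumeration, which is only feasible for a handful of variables; the predicates, weights, and clause encoding are invented for illustration.

```python
from itertools import product

def weighted_maxsat(variables, soft_clauses):
    """Brute-force weighted MaxSAT.

    soft_clauses: list of (weight, clause); a clause is a list of
    (variable, wanted_value) literals and is satisfied if any literal holds.
    Returns (best_total_weight, best_assignment).
    """
    best_score, best_assign = -1.0, None
    for values in product([False, True], repeat=len(variables)):
        assign = dict(zip(variables, values))
        score = sum(
            w for w, clause in soft_clauses
            if any(assign[v] == want for v, want in clause)
        )
        if score > best_score:
            best_score, best_assign = score, assign
    return best_score, best_assign

variables = ["broke_promise", "apologized", "at_fault"]
clauses = [
    (0.9, [("broke_promise", True)]),
    (0.6, [("apologized", True)]),
    (0.8, [("broke_promise", False), ("at_fault", True)]),  # broke_promise -> at_fault
    (0.5, [("apologized", False), ("at_fault", False)]),    # apologized -> not at_fault
    (0.2, [("at_fault", False)]),
]
score, verdict = weighted_maxsat(variables, clauses)
```

Here the two implications conflict, and the optimizer resolves the conflict by giving up the lower-weight constraint rather than by counting votes.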

Cause-Conditioned Multi-Task Learning for Answerable Question Suggestion in MRC

Mon, 29 Jun 2026 00:00:00 +0000

Machine Reading Comprehension (MRC) systems struggle when user questions are unanswerable given the passage: most simply output “no answer”, leaving users without guidance on how to recover useful information. We introduce a cause-conditioned multi-task learning (MTL) framework that turns failure into follow-up by jointly (1) classifying an input as answerable or as one of six fine-grained unanswerability causes (Entity Swap, Number Swap, Antonym, Negation, Mutual Exclusion, No Information), and (2) generating a revised, context-grounded answerable question conditioned on the predicted cause label and an extracted guidance sentence. Using an ensemble of strong readers plus LLMs-as-judges, we apply majority voting to test whether rewrites become answerable. A human study further assesses fluency, relevance, and usefulness. Our cause-conditioning MTL framework yields better recovery from unanswerable inputs and earns strong human ratings, advancing user-supportive, failure-aware MRC.

Defending RAG Against Knowledge Poisoning Using Cross-Encoder Activation Signals

Mon, 29 Jun 2026 00:00:00 +0000

Retrieval-Augmented Generation (RAG) improves the factuality of large language models (LLMs) by grounding outputs in externally retrieved evidence, but it also inherits security risks from the underlying corpus. In particular, an adversary can poison the knowledge source so that injected passages are retrieved and steer the model toward attacker-chosen targets. We propose Cross-Encoder Guardian RAG (CEG-RAG), a defense framework that leverages the internal activations of a cross-encoder reranker to detect and mitigate knowledge poisoning in RAG pipelines. CEG-RAG uses multi-instance learning (MIL) to jointly (i) detect whether the retrieved context is poisoned and (ii) localize suspicious chunks. Upon detection, it repairs the context by filtering and replacing high-risk chunks prior to answer generation while preserving a fixed context budget. Across three open-domain QA benchmarks, namely MS MARCO, Natural Questions (NQ), and HotpotQA, under a poisoning attack, CEG-RAG achieves high detection and localization performance (TPR >85% and >88.4%, respectively, at very low FPR), reduces the attack success rate (ASR) by an average of 88.74%, and recovers correct answers. Compared to recent baseline defenses, CEG-RAG consistently provides stronger protection, and a reranker sensitivity study demonstrates its robustness across different reranker configurations. These results position cross-encoder reranker activations as a practical foundation for securing RAG against knowledge poisoning. The code and data are available at https://github.com/CyberScienceLab/CEG-RAG.
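
The detect, localize, and repair steps can be sketched with max-pooling, one common MIL aggregator: the retrieved set counts as poisoned if any chunk's risk score crosses a threshold. The risk scores, the threshold, the reserve list, and the function name are hypothetical, and the sketch does not reproduce how CEG-RAG derives scores from reranker activations.

```python
def detect_and_repair(retrieved, reserve, threshold=0.5):
    """retrieved, reserve: lists of (chunk_id, risk_score) in rank order.

    Returns (poisoned, flagged_ids, repaired_ids). The repaired context has
    the same number of chunks as `retrieved` when enough clean backups exist.
    """
    # MIL max-pooling: the bag-level score is the highest chunk-level score.
    bag_score = max(score for _, score in retrieved)
    poisoned = bag_score >= threshold
    flagged = [cid for cid, s in retrieved if s >= threshold]
    kept = [cid for cid, s in retrieved if s < threshold]
    backups = [cid for cid, s in reserve if s < threshold]
    repaired = kept + backups[: len(retrieved) - len(kept)]
    return poisoned, flagged, repaired

retrieved = [("a", 0.1), ("b", 0.9), ("c", 0.2)]
reserve = [("d", 0.7), ("e", 0.1), ("f", 0.05)]
poisoned, flagged, repaired = detect_and_repair(retrieved, reserve)
```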

Interactive Learning from Explanations with Adaptive Guidance

Mon, 29 Jun 2026 00:00:00 +0000

Explanatory Interactive Learning (XIL) has emerged as a promising paradigm to bridge the gap between machine learning models and human understanding by integrating Explainable Artificial Intelligence (XAI) methods directly into the training process. Traditionally, XIL methods in computer vision rely on expert annotations specifying the evidence present in the input, collected before training starts and regardless of the model behaviour during training. This can be detrimental to the interactive nature of XIL and miss out on the opportunity of taking advantage of the intermediate information about the model during training. In this paper, we formalize XIL as an interactive learning paradigm to provide guidance on model explanations through a series of interactions with an expert user during training. Furthermore, we introduce an approach to approximate the evidence from sparse adaptive interactions collected as guiding points indicating where explanations were deemed irrelevant by the expert during training. We evaluate the proposed framework using a simulated interactive loop to explore interactions in an adaptive setting. Our results show that by taking advantage of the information provided by the model explanations during training, the proposed adaptive framework is able to match, or even exceed, the performance and explainability of XIL methods trained with access to the ground-truth evidence with fewer interactions.

NameBERT: Scaling Name-Based Nationality Classification with LLM-Augmented Open Academic Data

Mon, 29 Jun 2026 00:00:00 +0000

Inferring nationality from personal names is a critical capability for equity and bias monitoring, personalization, and a valuable tool in biomedical and sociological research. However, existing name-based nationality classifiers are typically trained on relatively small or source-specific labeled datasets, which can introduce coverage gaps and limit performance for underrepresented countries. While large language models (LLMs) demonstrate strong zero-shot performance for name-based nationality prediction, their computational cost and latency make them impractical for real-time, large-scale deployment. In this work, we create a large-scale name-nationality dataset from the Open Academic Graph (OAG) and introduce a framework that leverages LLMs as dataset enrichers rather than inference engines. We augment low-resource countries with LLM-generated names and evaluate on real and synthetic-tail test sets. We find that augmentation produces large gains when evaluation includes synthetic tail names and still offers a modest lift on tail-country metrics otherwise. Overall, NameBERT models achieve significantly higher accuracy than state-of-the-art baselines across both in- and out-of-domain tasks, while remaining efficient for large-scale inference compared to LLMs.

From Tweets to Model-Based Causal Spans: Noise-Robust Transformers for Social Media Sentiment Analysis in the Age of LLMs

Mon, 29 Jun 2026 00:00:00 +0000

Social media text is short, noisy, and rapidly evolving. Transformer-based sentiment models like BERTweet are brittle under lexical noise and offer limited explainability. We propose the Noise-Robust Causal Transformer (NRCT), which augments BERTweet with a contrastive objective that aligns semantically equivalent but lexically perturbed tweets, and a causal attention head trained to highlight sparse token spans that drive the model’s prediction. On Sentiment140 and TweetEval-Sentiment, NRCT matches clean accuracy, improves macro-F1 under synthetic noise, and produces token rationales that are more faithful than standard attention (higher deletion/insertion AUC). NRCT offers a practical trade-off between accuracy, robustness, and model-based interpretability for social media sentiment analysis.

Fairness Audits of Institutional Risk Models in Deployed ML Pipelines

Mon, 29 Jun 2026 00:00:00 +0000

Fairness audits of institutional risk models are critical for understanding how deployed machine learning pipelines allocate resources. Drawing on multi-year collaboration with Centennial College, where our prior ethnographic work introduced the ASP-HEI Cycle, we present a replica-based audit of a deployed Early Warning System (EWS), replicating its model using institutional training data and design specifications. We evaluate disparities by gender, age, and residency status across the full pipeline (training data, model predictions, and post-processing) using standard fairness metrics. Our audit reveals systematic misallocation: younger, male, and international students are disproportionately flagged for support, even when many ultimately succeed, while older and female students with comparable dropout risk are under-identified. Post-processing amplifies these disparities by collapsing heterogeneous probabilities into percentile-based risk tiers. This work provides a replicable methodology for auditing institutional ML systems and shows how disparities emerge and compound across stages, highlighting the importance of evaluating construct validity alongside statistical fairness. It contributes one empirical thread to a broader program investigating algorithms, student data, and power in higher education.
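
Two of the standard group metrics such an audit might report are the flag rate per group and the false positive rate (students flagged who ultimately succeeded). The sketch below computes both on invented records; the field names and data are hypothetical and do not come from the audited system.

```python
def group_rates(records):
    """records: list of dicts with 'group', 'flagged', and 'dropped_out'.

    Returns per-group flag rate and false positive rate, where a false
    positive is a flagged student who did not drop out.
    """
    out = {}
    for g in sorted({r["group"] for r in records}):
        rows = [r for r in records if r["group"] == g]
        succeeded = [r for r in rows if not r["dropped_out"]]
        out[g] = {
            "flag_rate": sum(r["flagged"] for r in rows) / len(rows),
            "false_positive_rate": (
                sum(r["flagged"] for r in succeeded) / len(succeeded)
                if succeeded else None
            ),
        }
    return out

def rec(group, flagged, dropped_out):
    return {"group": group, "flagged": flagged, "dropped_out": dropped_out}

# Invented toy data, four students per group.
records = [
    rec("younger", True, False), rec("younger", True, False),
    rec("younger", True, True), rec("younger", False, False),
    rec("older", False, False), rec("older", False, True),
    rec("older", True, True), rec("older", False, False),
]
rates = group_rates(records)
```

Comparing these rates across groups at each pipeline stage is one way to see where a disparity first appears and where it grows.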

Toward Believable Health & Wellness Conversational Agents: A Post-LLM Turing-like Evaluation Framework (Position Paper)

Mon, 29 Jun 2026 00:00:00 +0000

Large language model (LLM) conversational agents can be remarkably fluent yet still fail to feel fully “real” to users, especially in multi-session and higher-stakes interactions. This paper argues that the limiting problem is no longer surface language quality but believability: the conditions under which an artificial conversational partner is experienced as a coherent social mind rather than a fluent text generator. We frame believability as an empirical limit case and propose an operational criterion of bounded practical indistinguishability relative to an interaction envelope defined by a judge population, interaction contexts, and a time horizon. We then outline a “post-LLM Turing-like” evaluation approach that stress-tests modern detection cues using contextual scenario families, longitudinal re-contact, and multi-signal measurement combining human judgments with behavioral metrics. Finally, we instantiate the framework for a health and wellness agent being developed with an industry partner (details anonymized), arguing that wellness settings sharply amplify the importance of epistemic calibration, continuity, and boundary management. The goal is not to advocate deceptive deployment, but to make believability mechanistic and measurable so that both capabilities and risks can be assessed with clarity.

NLP-Assisted Case Identification and Interpretable Machine Learning for Long COVID Detection in Primary Care EMRs

Mon, 29 Jun 2026 00:00:00 +0000

Identifying patients with Long COVID Syndrome (LCS) remains a challenge due to various symptoms, heterogeneous clinical presentation, and inconsistent documentation in electronic medical records. In this study, we develop a machine learning framework that uses natural language processing (NLP) to identify confirmed cases of LCS from physician encounter notes and to predict individuals at risk. Using data from the Manitoba COVID-19 Cohort linked to the Manitoba Primary Care Research Network (MaPCReN), we construct a set of characteristics that incorporate demographics, socioeconomic indicators, and pre- and post-COVID symptom profiles. We frame Long COVID identification as an extreme class-imbalance NLP classification problem (4% confirmed cases in the development cohort) and address this challenge using imbalance-aware learning through random under-sampling and over-sampling strategies. Logistic regression with elastic net regularization combined with under-sampling achieves the best performance, with a sensitivity of 0.95, specificity of 0.81, and an AUC of 0.94, identifying 1,124 potential LCS cases among 4,556 COVID-19 positive individuals. These results demonstrate that combining unstructured clinical text with interpretable, imbalance-aware learning enables scalable Long COVID surveillance and risk identification in real-world EMR settings.
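
Random under-sampling, one of the imbalance strategies named in this abstract, can be sketched in a few lines: keep every minority-class row and draw an equally sized random subset of the majority class. The row format, the `label` key, and the seed are assumptions; the toy data only mirrors the roughly 4% positive rate mentioned in the abstract.

```python
import random

def random_undersample(rows, label_key, seed=0):
    """Return a class-balanced subset: all minority rows plus an equally
    sized random sample of majority rows, in shuffled order."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[label_key] == 1]
    neg = [r for r in rows if r[label_key] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced

# Toy cohort: 4 confirmed cases among 100 patients.
cohort = [{"id": i, "label": 1 if i < 4 else 0} for i in range(100)]
balanced = random_undersample(cohort, "label")
```

Under-sampling should be applied to the training split only, so that evaluation still reflects the real class ratio.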

VOLTS: Validated Output through Logit Tree Search for Reliable PDDL Planning with Small Language Models

Mon, 29 Jun 2026 00:00:00 +0000

Autonomous agents that must run on edge hardware cannot afford the compute footprint of frontier LLMs, yet they still need dependable task-planning. We address this gap by showing how a single pass with a 4-bit Llama 3.1 8B Small Language Model (SLM) can generate syntactically correct plans in the symbolic-planning formalism Planning Domain Definition Language (PDDL) while respecting tight memory and latency budgets. VOLTS rests on three ideas. (1) Action-token fine-tuning: the SLM is fine-tuned on a custom vocabulary where every token encodes a complete grounded action, giving the model strong task heuristics without expanding its size. (2) Real-time validator: a lightweight symbolic module checks each candidate token against the current state during decoding, guaranteeing that any plan emitted contains no hallucinated or infeasible actions. (3) Parallel branching search: when several validated actions appear promising, VOLTS explores them in parallel branches within the same forward pass, preserving single-pass efficiency while widening search. Evaluated on 2000 problems (500 each in the IPC Blocksworld, Logistics, DriverLog, and Rover domains), VOLTS returns valid plans for 76% of tasks. Those plans average 1.08× the length of solutions from the classical Fast Downward planner, far outperforming GPT-4o (7% validity) and a fine-tuned baseline without in-loop validation (0.13%). Unlike Tree-Planner or LLM Modulo frameworks, VOLTS validates per token inside a single inference pass, eliminating costly iterative cycles. By coupling resource-aware neural guidance with deterministic symbolic checks, VOLTS opens the door to reliable, on-device planning for robots, drones, and embedded IoT agents where every millisecond and megabyte counts.
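
The real-time validator in idea (2) can be illustrated with a toy decoder that masks out any action whose preconditions fail in the current state and then picks the best-scoring remaining action. The three-action domain, the fact names, and the model scores are invented, and the sketch is greedy: it does not cover action-token fine-tuning or the parallel branching search.

```python
# Toy grounded actions: preconditions, add effects, delete effects.
ACTIONS = {
    "pickup_a": {"pre": {"hand_empty", "a_on_table"},
                 "add": {"holding_a"}, "del": {"hand_empty", "a_on_table"}},
    "stack_a_b": {"pre": {"holding_a"},
                  "add": {"a_on_b", "hand_empty"}, "del": {"holding_a"}},
    "putdown_a": {"pre": {"holding_a"},
                  "add": {"a_on_table", "hand_empty"}, "del": {"holding_a"}},
}

def valid_actions(state):
    """Actions whose preconditions all hold in `state`."""
    return [name for name, a in ACTIONS.items() if a["pre"] <= state]

def apply_action(state, name):
    """Successor state after removing delete effects and adding add effects."""
    a = ACTIONS[name]
    return (state - a["del"]) | a["add"]

def validated_decode(state, goal, step_scores):
    """Greedy decoding restricted to currently valid actions.

    step_scores: one {action: score} dict per step (hypothetical model output).
    Returns the plan, or None if the goal is not reached.
    """
    plan = []
    for scores in step_scores:
        if goal <= state:
            break
        candidates = valid_actions(state)
        if not candidates:
            return None
        choice = max(candidates, key=lambda n: scores.get(n, float("-inf")))
        plan.append(choice)
        state = apply_action(state, choice)
    return plan if goal <= state else None

start = frozenset({"hand_empty", "a_on_table"})
# At step 0 the model prefers an infeasible action; the validator masks it.
scores = [{"stack_a_b": 2.0, "pickup_a": 1.0},
          {"stack_a_b": 3.0, "putdown_a": 0.5}]
plan = validated_decode(start, {"a_on_b"}, scores)
```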

Reducing Representation Bias through Fairness-Driven Sampling in Contrastive Learning

Mon, 29 Jun 2026 00:00:00 +0000

Contrastive learning is a widely applicable self-supervised machine learning approach that has demonstrated state-of-the-art performance, often competing with supervised learning methods. However, its stochastic approach to sampling can inherently amplify representation bias, as over-represented groups are more likely to dominate contrastive pair construction while underrepresented groups receive limited exposure during training, leading to imbalanced subgroup representation and biased downstream performance. To address this issue, we propose a fairness-driven sampling algorithm that leverages latent similarity structure to infer subgroup information and guide positive and negative pair selection without relying on annotated demographic attributes. Our fairness-driven approach is evaluated in terms of both fairness representation and utility. The results show that our fairness-driven sampling strategy not only increases representation across underrepresented latent subgroups, but also maintains competitive accuracy with baseline contrastive learning sampling. This method has the potential to improve fairness in downstream applications such as facial recognition, clinical diagnostics, and language models deployed in demographically diverse or low-resource contexts.

ASHC: Quantum-Inspired Hierarchical Clustering for Priority-Aware Coverage Path Planning

Mon, 29 Jun 2026 00:00:00 +0000

Coverage Path Planning (CPP) is a fundamental challenge in robotics, where the goal is to compute paths that ensure complete traversal of an environment. While classical CPP approaches perform well in structured or small-scale settings, they often struggle with scalability and lack mechanisms to adapt to context in large, complex environments. This study proposes Amplitude Structured Hierarchical Clustering, a quantum-inspired hierarchical CPP framework that integrates amplitude-based contextual encoding into the CPP pipeline. The proposed method constructs an amplitude field inspired by quantum walk dynamics to represent spatial variation across the environment, enabling the principled decomposition of coverage targets into semantically coherent clusters to guide both intra-cluster and inter-cluster traversals. Through formal analysis and experiments across varied environments, this study demonstrates the feasibility and robustness of the approach. These results position ASHC as a promising direction in quantum-inspired planning for robotic coverage tasks.

GlossAdapter: Enhancing word sense disambiguation via LoRA adapters

Mon, 29 Jun 2026 00:00:00 +0000

Word sense disambiguation (WSD) is a long-standing problem in natural language processing (NLP). Recently, large pre-trained models fine-tuned with gloss and other lexical information have been used for WSD. However, these models are parameter-inefficient, as the entire model needs to be trained. To address this problem, we propose GlossAdapter, an approach to WSD based on Low-Rank Adaptation (LoRA) adapter modules and Part of Speech (POS) filtering. LoRA modules are parameter-efficient as they add only a few trainable parameters for a task, keeping the original weights of the pre-trained model frozen while maintaining the model quality. The proposed POS filtering aligns target word context with WordNet lexical categories to construct sentence-gloss pairs for effective model training. We fine-tune our model on the SemCor3.0 dataset and evaluate it on the benchmark datasets Senseval-2, Senseval-3, SemEval-2013, and SemEval-2015. We perform experiments based on BERT-base and RoBERTa-large models. By adding only 0.5% of the parameters for RoBERTa-large, the results show that our LoRA adapter-based model combined with POS filtering outperforms the other state-of-the-art models.
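
The parameter saving from LoRA follows from simple arithmetic: adapting a d_out × d_in weight matrix with rank r adds r × (d_in + d_out) trainable parameters instead of d_out × d_in. The sketch below works this out for a single 1024 × 1024 matrix (1024 is the hidden size of RoBERTa-large) with a hypothetical rank of 8; it does not reproduce the abstract's 0.5% figure, which depends on which layers are adapted and on the rank the authors chose.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for full fine-tuning of one weight matrix
    versus a LoRA update W + B @ A with B: d_out x rank, A: rank x d_in."""
    full = d_out * d_in
    lora = rank * (d_in + d_out)
    return full, lora

# Rank 8 is an illustrative assumption, not the paper's setting.
full, lora = lora_param_counts(1024, 1024, 8)
fraction = lora / full
```

For this one matrix the adapter holds 16,384 parameters against 1,048,576, or about 1.6% of the matrix it adapts.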

Execution Aware A* for Cross Exchange Stablecoin Arbitrage

Mon, 29 Jun 2026 00:00:00 +0000

Cross-exchange cryptocurrency arbitrage enables low-risk profit from price discrepancies across exchanges, yet existing approaches employ negative cycle detection that targets opportunity identification rather than execution feasibility. We introduce an execution-aware pathfinding framework using A* search with domain-specific guidance heuristics, applied to stablecoins, a unique asset class exceeding $300 billion in market capitalization that bridges cryptocurrency and fiat currency, offering a novel dataset for arbitrage research. The problem is modelled as a weighted directed graph where nodes represent (exchange, stablecoin) pairs across 12 centralized exchanges and edges encode real-world costs including fees, slippage, gas, transfer delays, and exchange reliability. Three guidance heuristics and a multi-start strategy are evaluated over 7,200 search instances. Our slippage-aware heuristic h2 reduces node expansions by 29% relative to Dijkstra while matching its profit, demonstrating that domain-specific heuristics can meaningfully improve execution feasibility in real-time arbitrage planning. Code: https://github.com/kevinl03/Stablecoin-CrossExchange-Arbitrage.
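
The comparison between Dijkstra and heuristic-guided search can be illustrated with a generic A* that counts node expansions. The sketch minimizes a non-negative path cost on a toy graph whose nodes stand in for (exchange, stablecoin) pairs; the graph, the costs, and the heuristic table are invented, and the paper's profit objective and its h2 heuristic are not reproduced here.

```python
import heapq

def a_star(graph, start, goal, h):
    """Return (path_cost, expansions). graph: {node: [(neighbor, cost), ...]}
    with non-negative costs; h(node) estimates remaining cost to `goal`."""
    frontier = [(h(start), 0.0, start)]
    best = {start: 0.0}
    expansions = 0
    while frontier:
        _, g, node = heapq.heappop(frontier)
        if g > best.get(node, float("inf")):
            continue  # stale queue entry
        if node == goal:
            return g, expansions
        expansions += 1
        for nxt, cost in graph.get(node, []):
            ng = g + cost
            if ng < best.get(nxt, float("inf")):
                best[nxt] = ng
                heapq.heappush(frontier, (ng + h(nxt), ng, nxt))
    return None, expansions

graph = {
    "S": [("A", 1.0), ("B", 1.0)],
    "A": [("G", 1.0)],
    "B": [("C", 1.0)],
    "C": [("D", 1.0)],
    "D": [("G", 5.0)],
}
# Hypothetical heuristic that never overestimates the true remaining cost.
estimate = {"S": 2.0, "A": 1.0, "B": 3.0, "C": 2.0, "D": 1.0, "G": 0.0}

dijkstra_cost, dijkstra_expansions = a_star(graph, "S", "G", lambda n: 0.0)
astar_cost, astar_expansions = a_star(graph, "S", "G", lambda n: estimate[n])
```

Both runs find the same path cost, but the guided search expands fewer nodes because the heuristic steers it away from the costly branch.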

FIFA-RS: Fine-grained Image-Feature Alignment for Structural Anomaly Reasoning in Remote Sensing

Mon, 29 Jun 2026 00:00:00 +0000

Traditional remote sensing change detection paradigms typically rely on bi-temporal image pairs to identify surface variations. However, in time-critical scenarios such as post-disaster assessment, pre-event images may be unavailable or subject to severe registration errors. To address this limitation, we propose FIFA-RS, a zero-shot framework that formulates change detection as a single-temporal structural anomaly reasoning problem. FIFA-RS enhances the ability of vision–language models to characterize anthropogenic structures without relying on temporal references. Built upon a frozen CLIP backbone, the proposed framework adopts a lightweight two-stage adaptation strategy that combines token-level high-pass adaptation with an image-only 2D spatial high-pass enhancement branch. The former suppresses token-level common bias and emphasizes relative feature differences, while the latter sharpens local geometric structures such as building contours and boundaries. These structurally enhanced features are further aggregated through learnable multi-scale fusion for dense pixel-level anomaly localization. Extensive experiments indicate that FIFA-RS exhibits strong cross-dataset generalization across diverse remote sensing scenarios. When trained on LEVIR-CD using only post-event images and evaluated on the WHU Building Dataset in a zero-shot setting, the proposed method achieves a 95.07% Pixel AUC and a 58.51% F1-score. These results suggest that lightweight structural adaptation provides an effective and efficient solution for single-temporal remote sensing analysis.

Waste-Container Lifting Using Residual Reinforcement Learning On Large-Scale Crane with Underactuated Tools

Mon, 29 Jun 2026 00:00:00 +0000

This paper studies the container lifting phase of an urban waste-container recycling task with a hydraulic loader crane and an underactuated discharge unit. The task requires accurate hook–ring alignment under tight geometric tolerances while suppressing oscillations of the suspended unit. To address this, we propose a residual reinforcement learning framework that combines a nominal Cartesian controller for trajectory tracking and anti-sway control with a learned residual policy for compensating unmodeled dynamics. The residual policy is trained with PPO. Simulation results show improved tracking accuracy, reduced oscillations, and higher lifting success than the nominal controller alone.

Probabilistic TopK Sparse Autoencoder for Interpreting the Activations of Large Language Models

Mon, 29 Jun 2026 00:00:00 +0000

Sparse Autoencoders (SAEs) have emerged as a popular solution for extracting interpretable features from language model activations. However, existing SAE designs suffer from deterministic activations that starve gradients to “dead” components, and produce uncalibrated coefficients that provide no meaningful notion of uncertainty. To address these limitations, we introduce Probabilistic TopK SAEs, a novel approach that augments TopK SAEs with probabilistic gating through the Binary Concrete distribution. This stochastic sampling helps mitigate gradient starvation to dead neurons while producing coefficient magnitudes that are more correlated with the confidence of feature presence. Empirical experiments with GPT-2 and Qwen3 show that our method achieves consistent Pareto improvements over the baselines in high sparsity settings (small number of activated features) while maintaining a larger set of alive dictionary features.
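
A Binary Concrete sample is a relaxed Bernoulli draw: logistic noise is added to a gate logit and the sum is passed through a temperature-scaled sigmoid. The sketch below samples such gates and then keeps the k largest gated activations. Applying the gate before the TopK selection, as well as the logits, temperature, and activations, are assumptions for illustration rather than the paper's exact design.

```python
import math
import random

def binary_concrete(logit, temperature, rng):
    """One relaxed Bernoulli sample in (0, 1):
    sigmoid((logit + log(u) - log(1 - u)) / temperature), u ~ Uniform(0, 1)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep the logs finite
    noise = math.log(u) - math.log(1.0 - u)
    return 1.0 / (1.0 + math.exp(-(logit + noise) / temperature))

def probabilistic_topk(pre_acts, gate_logits, k, temperature, rng):
    """Scale each pre-activation by a sampled gate, then zero all but the
    k largest gated values (assumed ordering: gate first, TopK second)."""
    gated = [a * binary_concrete(l, temperature, rng)
             for a, l in zip(pre_acts, gate_logits)]
    keep = set(sorted(range(len(gated)), key=lambda i: gated[i], reverse=True)[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(gated)]

rng = random.Random(0)
high_mean = sum(binary_concrete(3.0, 0.5, rng) for _ in range(2000)) / 2000
low_mean = sum(binary_concrete(-3.0, 0.5, rng) for _ in range(2000)) / 2000
sparse = probabilistic_topk([1.0, 2.0, 0.5, 1.5], [3.0, -3.0, 3.0, 0.0],
                            k=2, temperature=0.5, rng=rng)
```

Because the gate is sampled, a feature with a low logit is still selected occasionally, which is the mechanism by which stochastic gating can keep rarely used features receiving gradient.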

BARBiE: An Associative Rule-Based Interactive Framework for Explaining Black-Box Model

Mon, 29 Jun 2026 00:00:00 +0000

Post-hoc explainable artificial intelligence is often provided as a product, typically in the form of a static explanation such as a feature-importance ranking or a local surrogate explanation. In contrast, real-world decision workflows demand explanation as a process, characterized by interactivity in which users explore the decision output with what-if questions to develop understanding and trust. Existing explainers are often static, and their output is sensitive to how local samples around the instance are selected. Although rule-based local surrogates can expose feature interactions, user edits often require repeated resampling and retraining, limiting their usability for real-time what-if analysis. To address these gaps, we introduce BARBiE, a model-agnostic framework for instance-level explanation that integrates an association-rule surrogate with an interactive interface. For a given query instance, BARBiE constructs an instance-centered neighborhood, queries the black-box model for labels, and trains a compact association-rule surrogate. Explanations are provided only when the surrogate output matches the black-box decision for the query instance. BARBiE presents IF–THEN rules with support, confidence, and a p-value from Fisher’s exact test. In addition, BARBiE computes rule-grounded, signed feature importance by aggregating instance-aware contributions from the rule base. Importantly, BARBiE supports quick what-if analysis without resampling or retraining the surrogate model. Across four tabular datasets and a user study, we evaluated BARBiE against LIME, SHAP, and BARBE using user ratings of informativeness, understandability, trustworthiness, and satisfaction. Across tasks, BARBiE consistently received higher ratings than the baselines, supporting the claim that process-centric interactive explanations improve informativeness and understandability and contribute to higher trust and user satisfaction.
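
To make the three rule statistics named above concrete, here is a small sketch that computes support, confidence, and a Fisher's exact test p-value for a single IF–THEN rule over a labeled neighborhood. The function name `rule_stats`, the toy data, and the use of a one-sided ("greater") test are assumptions for illustration; the abstract does not state which variant of the test BARBiE uses.

```python
from math import comb


def rule_stats(rows, antecedent, consequent):
    """Return (support, confidence, one-sided Fisher p-value) for a rule.

    rows must be non-empty; antecedent and consequent are predicates on a row.
    """
    a = b = c = d = 0  # cells of the 2x2 contingency table
    for row in rows:
        has_a, has_c = antecedent(row), consequent(row)
        if has_a and has_c:
            a += 1
        elif has_a:
            b += 1
        elif has_c:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    support = a / n
    confidence = a / (a + b) if (a + b) else 0.0
    # Hypergeometric tail: probability of seeing at least `a` co-occurrences
    # given the fixed row and column totals.
    row1, col1 = a + b, a + c
    tail = sum(
        comb(row1, x) * comb(n - row1, col1 - x)
        for x in range(a, min(row1, col1) + 1)
    )
    return support, confidence, tail / comb(n, col1)


# Hypothetical neighborhood labeled by a black-box model.
rows = (
    [{"income": "high", "label": "approve"}] * 3
    + [{"income": "high", "label": "deny"}] * 1
    + [{"income": "low", "label": "approve"}] * 1
    + [{"income": "low", "label": "deny"}] * 3
)
support, confidence, p_value = rule_stats(
    rows,
    antecedent=lambda r: r["income"] == "high",
    consequent=lambda r: r["label"] == "approve",
)
```

For this toy table the rule "IF income = high THEN approve" has support 3/8, confidence 3/4, and a one-sided p-value of 17/70.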

Vulnerability of machine learning models for gender recognition in Virtual Reality

Mon, 29 Jun 2026 00:00:00 +0000

Virtual Reality (VR) systems continuously capture fine-grained behavioral signals such as head motion, hand trajectories, and gaze dynamics. These spatio-temporal signals have been shown to contain distinctive patterns enabling accurate gender classification through machine learning models. While predictive performance under nominal conditions is often high, the robustness of such models to structured behavioral perturbations remains largely unexplored. In this paper, we present a systematic robustness analysis of VR-based gender classification models under a comprehensive catalog of realistic behavioral adversarial attacks. We evaluate multiple model families, including ensemble-based tabular classifiers and neural architectures, using statistical and dynamic motion features extracted from public VR datasets. More than one hundred perturbation scenarios targeting metric coherence, global motion style, multimodal synchronization, and latent behavioral structure are assessed using balanced accuracy, flip rate, and confidence stability metrics. Our results reveal significant vulnerability to coordinated, structurally consistent attacks, particularly those affecting global motion properties or metric integrity, while localized noise-like perturbations exhibit limited impact. These findings demonstrate that high nominal accuracy does not guarantee robustness and highlight the necessity of robustness-aware evaluation frameworks for VR-based behavioral inference systems.

Measuring and Closing the Retrieval Gap in Financial Question Answering

Mon, 29 Jun 2026 00:00:00 +0000

Retrieval-augmented generation (RAG) is increasingly applied to financial question answering over long regulatory documents, yet evaluations typically measure only chunk-level retrieval or end-to-end answer quality, leaving a systematic understanding of where and why pipelines fail out of reach. We introduce an oracle-based evaluation framework that decomposes retrieval performance into document, page, and chunk discovery, providing empirical upper bounds at each granularity and exposing a consistent retrieval gap that persists even when the correct document is found. We systematically evaluate several retrieval strategies on 150 FinanceBench questions, spanning dense, sparse, hybrid, hierarchical, query reformulation, and reranking methods using a shared multi-document index. Our analysis shows that while methods such as Multi-HyDE and cross-encoder reranking improve document recall, page-level retrieval substantially lags behind oracle bounds across all baselines. We further break down performance by question type and document type, revealing that retrieval difficulty varies significantly across these dimensions and that no single strategy closes the gap uniformly. As a targeted intervention, we introduce a domain fine-tuned page scorer that ranks pages before chunk retrieval, achieving strong gains under cross-validation, suggesting that domain-specific and page-level modeling is a promising direction.

Multi-Objective Reference-Aligned Machine Unlearning

Mon, 29 Jun 2026 00:00:00 +0000

Machine unlearning aims to remove the influence of specific training samples while preserving the model’s utility. Existing single-objective approaches, such as gradient ascent or random relabeling, often induce catastrophic forgetting due to conflicting optimization dynamics and unbounded forgetting objectives that cause the model to drift from its pre-trained knowledge. We propose Reference-Aligned UnLearning (RAUL), a multi-objective framework that jointly optimizes forgetting and retention by replacing unbounded loss maximization with a bounded KL alignment of predictions on forgotten samples toward a reference distribution representing unseen data. The reference distribution is instantiated either as a uniform distribution or as an empirical distribution from a held-out reference set. This bounded alignment constrains the forgetting objective and reduces gradient conflict with retention. The resulting multi-objective optimization (MOO) problem is solved via Jacobian descent, which aggregates multiple gradients into a direction that does not conflict. Our results demonstrate that RAUL achieves the smallest gap to full retraining.
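
The contrast between an unbounded and a bounded forgetting objective can be illustrated with a short sketch. It assumes the uniform reference and the direction KL(p || u), which is bounded above by log K for K classes; the abstract does not specify the direction RAUL uses, and the function names here are hypothetical.

```python
import math


def kl_to_uniform(probs):
    """KL(p || u) for a uniform reference u over len(probs) classes.

    Equals log K minus the entropy of p, so it lies in [0, log K].
    """
    k = len(probs)
    return sum(p * math.log(p * k) for p in probs if p > 0.0)


def negative_log_likelihood(probs, true_class):
    """Loss that gradient ascent would maximize; it grows without bound."""
    return -math.log(probs[true_class])


num_classes = 4
confident = [0.97, 0.01, 0.01, 0.01]   # model still "remembers" the sample
uniform = [0.25, 0.25, 0.25, 0.25]     # prediction matches the reference
one_hot = [1.0, 0.0, 0.0, 0.0]         # worst case for the KL objective

kl_confident = kl_to_uniform(confident)
kl_uniform = kl_to_uniform(uniform)
kl_one_hot = kl_to_uniform(one_hot)
nll_tiny = negative_log_likelihood([1e-12, 1.0 - 1e-12], true_class=0)
```

Minimizing `kl_to_uniform` has a well-defined optimum at zero, whereas pushing the negative log-likelihood upward has no finite maximum, which is one way to read the drift problem the abstract attributes to gradient ascent.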

Pure Leveled CKKS for CNN Inference: The Finite Limb Depth Bound and ResNet-20 Stress-Test

Mon, 29 Jun 2026 00:00:00 +0000

This paper presents a systems-level boundary analysis of pure Leveled CKKS homomorphic encryption applied to deep CNN inference, using two algorithmic co-designs as probing mechanisms: Singular Value Decomposition (SVD)-based kernel decomposition and scale-1 integer quantization. We formalize the Finite Limb Depth Bound, showing that scale-1 quantization delays but cannot eliminate rescaling, as the accumulated bit-width is bounded by the RNS prime limb width. A TinyConvNet smoke test confirms pipeline correctness (max noise $2.25\times10^{-4}$). Stress-testing ResNet-20 under a maximized 60-prime chain at $N = 65536$ reveals that residual shortcut additions induce exact linear RNS level divergence $d_k = 3 + 7k$, exhausting the prime budget at the predicted shortcut index $k^* = 8$ after 169,741 rotations and 555.7 s. Under the tested 95% SVD energy threshold, average rank 1.9 on $3\times3$ kernels exceeded the K/2 crossover, producing a 30.7% Ct-Pt overhead. Under the tested parameter regime, algorithmic co-designs alone were insufficient to eliminate bootstrapping; we outline a bootstrap starvation direction that targets reduced bootstrap frequency rather than full elimination.
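
The reported shortcut index can be checked with simple arithmetic. The lines below take the stated linear divergence at face value and assume that the budget is the 60 levels of the prime chain, so that the last feasible shortcut is the largest index whose divergence still fits; how the paper counts levels at the boundary is not given in the abstract.

```latex
d_k = 3 + 7k, \qquad d_8 = 3 + 7 \cdot 8 = 59 \le 60 < 66 = 3 + 7 \cdot 9 = d_9,
\qquad
k^* = \left\lfloor \frac{60 - 3}{7} \right\rfloor = \lfloor 8.14\ldots \rfloor = 8 .
```

Under this reading, the ninth shortcut would require more levels than the chain provides, which is consistent with the exhaustion index stated above.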

Reason and Verify: A Framework for Faithful Retrieval-Augmented Generation

Mon, 29 Jun 2026 00:00:00 +0000

Retrieval-Augmented Generation (RAG) significantly improves the factuality of Large Language Models (LLMs), yet standard pipelines often lack mechanisms to verify intermediate reasoning, leaving them vulnerable to hallucinations in high-stakes domains. To address this, we propose a domain-specific RAG framework that integrates explicit reasoning and faithfulness verification. Our architecture augments standard retrieval with neural query rewriting, BGE-based cross-encoder reranking, and a rationale generation module that grounds sub-claims in specific evidence spans. We further introduce an eight-category verification taxonomy that enables fine-grained assessment of rationale faithfulness, distinguishing between explicit and implicit support patterns to facilitate structured error diagnosis. We evaluate this framework on the BioASQ and PubMedQA benchmarks, specifically analyzing the impact of dynamic in-context learning and reranking under constrained token budgets. Experiments demonstrate that explicit rationale generation improves accuracy over vanilla RAG baselines, while dynamic demonstration selection combined with robust reranking yields further gains in few-shot settings. Using Llama-3-8B-Instruct, our approach achieves 89.1% on BioASQ-Y/N and 73.0% on PubMedQA, competitive with systems using significantly larger models. Additionally, we perform a pilot study combining human expert assessment with LLM-based verification to explore how explicit rationale generation improves system transparency and enables more detailed diagnosis of retrieval failures in biomedical question answering.

Explainable Medical Image Segmentation via Attention-Gated Fusion of Vision Transformers and U-Nets

Mon, 29 Jun 2026 00:00:00 +0000

Medical image segmentation is essential for assisting medical professionals in locating anomalies in images. Current medical image segmentation frameworks lack explainability, leaving clinicians without insight into how segmentation decisions are made when identifying the segmentation target. In this paper, we present a framework that offers an improved approach for assisting medical professionals in locating anomalies while providing visual explanations in the form of heatmaps of the target. We propose a dual encoder architecture using a U-Net encoder and a Vision Transformer to perform accurate segmentation. We employ an attention fusion mechanism to fuse both encoder embeddings and generate an explainability heatmap that offers improved results for highlighting important features. We discuss how our approach advances the state of the art for medical decision making in comparison with other current research, and how it can be of value for distinct healthcare concerns. While our current results focus on how our dual encoder approach yields significant benefit, we also briefly discuss how to integrate textual explanations alongside the heatmaps as a valuable step for future work. Keywords: Explainable AI, Medical Applications of AI, Computer Vision Segmentation, AI for Social Good, Transformers, Attention.

ConTrans: Learning Text-enhanced Local–global Temporal Representations for Zero-shot Temporal Action Localization

Mon, 29 Jun 2026 00:00:00 +0000

Zero-shot Temporal Action Localization (ZS-TAL) aims to detect and locate previously unseen actions in untrimmed videos. However, existing approaches primarily focus on modeling long-range contextual information, often neglecting the critical relative-offset-based local correlations between video frames. Furthermore, their performance is hindered by limited feature representation capabilities due to the shallow nature of their network architectures. In this paper, we address these limitations by introducing a novel local-global multi-scale feature representation module. We propose a novel multi-scale encoder architecture, termed ConTrans, that integrates convolutional (Conv) inductive biases with transformer self-attention to jointly capture fine-grained local dependencies and long-range global context, leading to more comprehensive feature representations than existing methods. Experimental evaluations on the ActivityNet-1.3 and THUMOS14 datasets demonstrate that ConTrans significantly outperforms existing methods, establishing a new benchmark for ZS-TAL.

Query Refinement in Dense Retrieval Using LLM-Driven Relevance Feedback

Mon, 29 Jun 2026 00:00:00 +0000

Dense retrieval methods, which encode queries and documents into a shared semantic embedding space, have achieved strong performance in information retrieval tasks. However, their effectiveness diminishes in scenarios with limited or no domain-specific training data. To mitigate this limitation, recent approaches have leveraged large language models (LLMs) for query refinement in unsupervised dense retriever systems. A promising direction within this line of research involves using LLMs to assess the relevance of initially retrieved documents, and then incorporating the resulting relevance feedback to update the query embedding. Despite promising early results, a systematic investigation of how different prompting strategies and query update mechanisms influence retrieval performance remains absent. In this study, we explore four prompting strategies—Zero-Shot, Few-Shot, Role-Playing, and Chain-of-Thought—to guide LLMs in performing relevance judgments. Furthermore, we evaluate various query update formulas that utilize embeddings of LLM-identified relevant documents to refine query representations. Our experiments, conducted on two datasets and using two open-source LLMs, demonstrate that carefully crafted prompting combined with effective query updates can substantially enhance retrieval performance. These findings provide valuable insights for optimizing LLM-guided relevance feedback in unsupervised dense retrieval. All code and datasets are available at https://github.com/ftmkm97/ReFeed-IR.git.
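
One familiar form that an embedding-level query update can take is a Rocchio-style interpolation between the original query and the centroid of the documents the LLM judged relevant. The sketch below shows that classical formula only as an example; the abstract does not list the specific update formulas evaluated, and the function name and the `alpha` and `beta` weights are assumptions.

```python
def update_query(query, relevant_docs, alpha=1.0, beta=0.5):
    """Rocchio-style update: alpha * query + beta * centroid(relevant_docs).

    query is a list of floats; relevant_docs is a list of same-length lists.
    With no relevant documents the query is returned unchanged.
    """
    if not relevant_docs:
        return list(query)
    count = len(relevant_docs)
    centroid = [
        sum(doc[i] for doc in relevant_docs) / count
        for i in range(len(query))
    ]
    return [alpha * q + beta * c for q, c in zip(query, centroid)]


# Toy 2-dimensional embeddings; real embeddings would come from the retriever.
refined = update_query([1.0, 0.0], [[0.0, 1.0], [0.0, 3.0]])
unchanged = update_query([1.0, 0.0], [])
```

The refined query can then be re-issued against the same index, so the LLM's relevance judgments influence retrieval without any retriever fine-tuning.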

EPAS: Efficient Training with Progressive Activation Sharing

Mon, 29 Jun 2026 00:00:00 +0000

We present a novel method for Efficient training with Progressive Activation Sharing (EPAS). This method bridges the progressive training paradigm with the phenomenon of redundant representation across deeper layers of transformers. EPAS gradually grows an activation sharing region during training by switching decoder layers to activation sharing mode. This results in a throughput increase due to reduced compute. To utilize deeper layer redundancy, the sharing region starts from the deep end of the model and grows towards the shallow end. EPAS-trained models allow for variable activation-sharing region lengths for different compute budgets during inference. Empirical evaluations with attention sharing (Q,K) in LLaMA models ranging from 125M to 7B parameters show up to an 11.1% improvement in training throughput and up to a 29% improvement in inference throughput while maintaining a similar loss curve to the baseline models. Furthermore, applying EPAS in continual pretraining to transform TinyLLaMA into an attention-sharing model yields up to a 10% improvement in average accuracy over state-of-the-art methods, emphasizing the significance of progressive training in activation-sharing models.

Auditing Citation Behavior in AI-Generated Search Summaries: A Framework and a Case Study of Google AI Overviews

Mon, 29 Jun 2026 00:00:00 +0000

Search engines increasingly integrate Large Language Models (LLMs) to generate natural-language summaries with cited sources, while a growing fraction of online content is partially or fully AI-generated. This convergence raises new questions about how generative search systems select citation sources, particularly with respect to document provenance. In this paper, we propose a system-agnostic observational framework for auditing citation behavior in AI-generated search summaries, modeling retrieval and citation as observable processes over query-document pairs and introducing rank- and provenance-conditioned citation measures. We instantiate the framework in a large-scale empirical study of Google AI Overviews on "Your Money or Your Life" queries drawn from the MS MARCO Web Search dataset. Our analysis shows that AI-generated documents are cited more frequently than human-authored documents even after controlling for retrieval rank, with the difference driven primarily by non-retrieved citations and most pronounced at highly ranked positions. These results highlight the importance of transparent, measurement-based auditing for understanding citation behavior in generative search systems.

Just-in-Time Defect Prediction Using Cost-Efficient Boosting Models

Mon, 29 Jun 2026 00:00:00 +0000

Just-in-time (JIT) software defect prediction (SDP) handles imbalanced commit data, reflecting real-world software where bugs are rare. Higher recall and a model’s ability to predict defects are crucial in such settings. Recently, many JIT-SDP approaches have been proposed, predominantly utilizing deep-learning (DL) models. However, tuned XGBoost, a traditional classifier known for its cost-efficiency, has not been explored. Therefore, we explore how hyperparameter (HP)-tuned and SMOTE-rebalanced XGBoost performs on imbalanced datasets, focusing on AUC-ROC and Recall. Our findings indicate that selecting five key features can be as effective as using fourteen features. We further explain how HP tuning and the oversampling method improve XGBoost by 1.19%–6.48% in AUC-ROC and 19.32%–43.70% in Recall. Statistical analysis shows that the final XGBoost model achieves the best average performance among the evaluated baselines, with 0.7442 AUC-ROC, 0.4747 F1-Score, and 0.7099 Recall.

The Reliability Gap: A Multi-Dimensional Technical Audit of Memory Safety and Logical Integrity in LLM-Generated C Code

Mon, 29 Jun 2026 00:00:00 +0000

The integration of Large Language Models (LLMs) into C programming education offers a scalable solution to the systemic enrollment crisis in computer science. However, the non-deterministic nature of these probabilistic engines introduces a critical reliability gap, particularly in C programming - a domain characterized by manual memory management and a high risk of Undefined Behavior (UB). This study presents a multi-dimensional technical audit - integrating static structural analysis with dynamic runtime integrity checks - of four 2026 frontier models: GPT-OSS 120B, Llama 3.3 70B, Moonshot Kimi K2, and Qwen 3 32B. Utilizing an automated evaluation pipeline, 1195 code generations across a gradient of complexity (Pointers, Dynamic Memory, and Data Structures) were subjected to static analysis and dynamic runtime instrumentation. Experimental results reveal a significant "Technical Reliability Gap": while models achieve high Compilation Success Rates (CSR), dynamic analysis identified a frequent incidence of "Definitely Lost" heap memory and "logical hangs". We further identify a correlation between architectural paradigm and safety, finding that sparse Mixture-of-Experts (MoE) architectures exhibit lower Static Error Density (SED) and higher logic density than dense counterparts. We conclude by offering a framework for trust calibration based on these technical breaking points, assisting educators in mitigating the "oracle hazards" of AI-mediated instruction.

An Analytical Framework for Multi-Theoretic Ethical Stress Test (MTEST): Ethical Analytics on Sovereign AI and Artificial General Intelligence

Mon, 29 Jun 2026 00:00:00 +0000

The rise of pervasive computing and the pursuit of Artificial General Intelligence (AGI) have moved AI ethics from philosophical debate to a core requirement for global governance. However, ethical evaluation remains a highly subjective task, largely inaccessible to general technologists, and often ad hoc, due in part to the absence of any structured, pluralistic framework capable of assessing alignment across diverse moral perspectives. This paper presents MTEST11, a Multi-Theoretic Ethical Stress Test that offers a systematic and quantifiable approach to evaluating the ethical soundness of propositions through alignment checks against the most influential ethical theories (THE11), including utilitarianism, deontology, rights-based ethics, Rawlsian justice, virtue ethics, and others. The framework, while intentionally simplified for functional application, offers sufficient structure to support systematic quantitative analysis. It measures (i) ethical alignment of propositions, (ii) cross-theoretic consensus on propositions, and (iii) moral congruence of individual theories on a proposition set, and (iv) shields against the ethical blind spots of any single framework. It also reveals (v) the ethical value anchor set, the set of universally recognized ethical values on which a proposition is supported or contradicted. We demonstrate the utility of MTEST11 by applying it to perform quantitative and qualitative analysis of 14 provocative policy propositions from various sides of the ongoing global debate on artificial general intelligence (AGI).

TASR: A Trustworthy LLM-based Framework for TCFD-Aligned Sustainability Report Analysis

Mon, 29 Jun 2026 00:00:00 +0000

Reliable and transparent assessment of environmental, social, and governance (ESG) disclosures is critical for sustainable finance, regulatory oversight, and risk-aware decision-making. However, existing sustainability reporting evaluations rely on costly manual reviews or third-party ratings, which limit reproducibility. This work proposes a trustworthy large language model (LLM)-based framework for automated sustainability report analysis aligned with the Task Force on Climate-related Financial Disclosures (TCFD). We propose TASR (Trustworthy Analysis for Sustainability Report), a three-stage framework for TCFD-aligned sustainability report analysis that integrates LLM-based scoring, benchmarking against third-party ESG ratings, and downstream predictive modeling. Experiments on 100 sustainability reports from U.S. oil, gas, and mining companies demonstrate strong alignment with Bloomberg Environmental Disclosure scores (Spearman’s $\rho$ = 0.70) and high score stability across repeated evaluations. Furthermore, predictive models trained on the LLM-generated TCFD scores achieve meaningful predictive performance in forecasting disclosure benchmarks, highlighting their practical utility for sustainability rating. The results suggest that LLM-based TCFD scoring offers a potentially scalable and transparent alternative for sustainability disclosure assessment.

Text2Edge: Language-Aware Temporal Graph Transformer for Dynamic Link Prediction

Mon, 29 Jun 2026 00:00:00 +0000

Dynamic link prediction in information networks (e.g., email, citation, social, and Wikipedia graphs) requires jointly modeling evolving topology and node-level semantics. However, incorporating language signals directly into temporal attention remains an open challenge. Many existing approaches either ignore textual information or attach static language features outside the attention mechanism, while naively using LLM-derived embeddings can be computationally costly and unstable without structural grounding. We introduce Text2Edge, a language-aware graph transformer that injects pretrained language representations into edge-sparse temporal attention, enabling semantic signals to influence how dynamic edges are weighted over time while preserving computational efficiency. Unlike purely language-based models or structure-only transformers, Text2Edge integrates semantic and structural information through a gated fusion mechanism, allowing the model to adaptively balance topology and language signals. To understand the role of semantics in dynamic link prediction, we conduct a controlled comparison between structural, semantic, and hybrid approaches. We evaluate Text2Edge alongside a strong structure-only transformer baseline (LPFormer) and language-augmented variants using BERT and LLaMA embeddings across four dynamic graph datasets. Our results show that structure-only models tend to plateau early, while semantic-aware models continue improving, indicating that semantic signals are critical in evolving real-world networks. The unified Text2Edge framework achieves the best overall performance, demonstrating that aligning pretrained language representations with edge-sparse temporal reasoning improves ranking quality and robustness without densifying the graph or fine-tuning the language encoder.

Smaller, Smarter, Greener: Reducing LLM Inference Emissions with RAG

Mon, 29 Jun 2026 00:00:00 +0000

The escalating computational demands of Large Language Models (LLMs) raise significant concerns regarding their environmental sustainability. While prior work has quantified training emissions, inference - which dominates a model’s lifecycle carbon footprint - remains underexplored in holistic evaluations that jointly consider efficiency and effectiveness. This study investigates whether smaller models augmented with Retrieval-Augmented Generation (RAG) can achieve Pareto-optimal configurations that balance accuracy and carbon emissions better than larger, non-RAG models. We conduct experiments across three model families (DeepSeek-r1, Qwen3, Gemma 3) on two question answering datasets (HotpotQA, Natural Questions), measuring end-to-end emissions using CodeCarbon. Our results show that on Natural Questions, RAG enables models as small as 0.6B parameters to outperform 12B-32B models in terms of F1 score with lower carbon emissions, in some cases achieving up to 90% emission reductions. However, on HotpotQA, the efficiency benefits are more nuanced, with RAG consistently improving F1, but not always reducing emissions. Our work provides a systematic analysis of the efficiency-effectiveness trade-off of incorporating RAG, offering practical guidance for environmentally sustainable AI.

Diagnosing and Repairing Factual Errors in RAG under Budget Constraints

Mon, 29 Jun 2026 00:00:00 +0000

Retrieval-Augmented Generation (RAG) improves the factuality of large language models by grounding responses in external evidence, yet real-world deployments remain fragile. Failures often stem from missing or weakly relevant evidence, as well as from generation that does not faithfully reflect the retrieved context. Many existing approaches rely on fine-tuning, privileged access to internal model signals, or resource-insensitive escalation strategies, which limits their practicality in black-box and budget-constrained settings. We propose D2R-RAG (Diagnose-to-Repair RAG), a model-agnostic and resource-aware framework that combines lightweight failure diagnosis with adaptive repair. D2R-RAG derives interpretable failure signatures from observable signals in the query, retrieved evidence, and generated response, and then selects from a small set of corrective actions under explicit latency and VRAM constraints. Experiments on FEVER and HotpotQA show that D2R-RAG improves reliability over recent baselines and achieves better accuracy–efficiency trade-offs across multiple compute budgets. The code is available at https://github.com/CyberScienceLab/D2R-RAG/.

Enhancing Thermal Image Object Detection using Spatial Edge-aware Attention and Self-supervision Pretext

Mon, 29 Jun 2026 00:00:00 +0000

Thermal cameras offer robust sensing for object detection in low-visibility driving conditions, but thermal images often suffer from lower resolution and weaker object boundaries than RGB imagery. This paper presents SEA-YOLO-E (Spatial Edge Attention YOLO-E), an enhanced single-modality thermal object detector that integrates a SEA mechanism and semi-supervised learning to overcome these challenges. First, we introduce the SEA-YOLO architecture, which embeds an Edge Extractor and a novel SEA module into a YOLOv8 backbone to emphasize object boundaries and improve detection accuracy in thermal domains. Building on this, we extend SEA-YOLO with a semi-supervised learning paradigm: a self-supervised rotation prediction pretext task leverages unlabeled infrared images to learn general feature representations, and synthetic thermal data mitigates class imbalance in training. The proposed two-phase training (self-supervised pretraining followed by supervised fine-tuning) significantly boosts detection performance. Experiments on multiple thermal driving datasets demonstrate that SEA-YOLO-E achieves state-of-the-art results, with improvements of up to 9–12% in mAP over existing detectors. Notably, our edge-enhanced attention and rotation-pretrained model outperforms recent multi-modal RGB-thermal detectors while using only thermal input.

Beyond Recency Bias: Combining Sequential and Global Collaborative Signals in LLM-Based Generative Recommendation for Sparse Data

Mon, 29 Jun 2026 00:00:00 +0000

Recommender systems often use multi-stage retrieval and ranking pipelines, where errors made early cannot be fixed later, which hurts overall recommendation quality. LLM-based generative recommendation avoids this by directly generating item identifiers, but many methods represent each item as a token sequence, which creates two concrete problems: generation is slow because tokens must be produced step by step, and it can fail due to beam-search local optima, where the correct item is dropped early because its first token has low probability. SETRec addresses these issues by representing each item as an order-agnostic set of tokens and using query-guided simultaneous token generation, so the item’s CF and semantic tokens are generated in parallel without intra-item token dependency. However, SETRec still uses only one collaborative filtering (CF) token, and when that CF signal comes from a sequence-aware model, it is vulnerable to recency bias, especially in sparse and cold-start settings where recent interactions dominate. CFs-SETRec reduces this recency bias by adding a second CF token chosen to capture long-term preferences, and combining it with the sequential CF signal, which preserves both short-term behavior and long-term affinity and leads to more balanced recommendations under sparse data.

Scaling Limits of Deep Reinforcement Learning: A Stability Analysis with Maximal Update Parametrization

Mon, 29 Jun 2026 00:00:00 +0000

While scaling laws have revolutionized supervised learning, their implications for Deep Reinforcement Learning remain under-explored. This paper investigates the theoretical and practical scaling limits of Deep Q-Networks by controlling network parameterization across varying widths. Our empirical results on CartPole-v1 demonstrate that: (1) The standard Feature Learning regime (Mean-Field Theory, $\alpha=1$) achieves the highest peak performance (Return $79.6$) but suffers from catastrophic divergence and rank collapse at large widths; (2) The Lazy Training regime (NTK, $\alpha=0$) is performant (Return $72.1$) but numerically ill-conditioned; and (3) Maximal Update Parametrization ($\mu P$, $\alpha=0.5$) acts as a robust stabilizer, preventing divergence and rank collapse across the entire hyperparameter spectrum, albeit with more conservative learning dynamics (Return 49.7). These findings suggest that while feature learning is necessary for optimal control, naively scaling width without controlling update dynamics leads to optimization instability.

When Do LLMs Listen? Confidence-Guided Knowledge Acceptance in LLMs

Mon, 29 Jun 2026 00:00:00 +0000

Previous work shows that injecting external knowledge from Knowledge Graphs (KGs) can improve reasoning in Large Language Models on multiple-choice question answering. KGs provide structured factual knowledge that reduces errors and hallucinations without costly model updates. Most studies focus on which knowledge to extract from KGs and how to represent it in prompts to improve task accuracy. In contrast, this study examines knowledge acceptance in LLMs, investigating when models use, ignore, or resist injected knowledge. We introduce a confidence-guided framework that categorizes predictions as high, moderate, or low certainty. High certainty indicates a strong preference for a single answer, moderate reflects several plausible options with similar probabilities, and low corresponds to diffuse predictions with no clear preference. To study knowledge injection, we introduce KG-derived statements into the model’s context and track changes in prediction confidence. Interventions include supportive knowledge (reinforcing the model’s top choice), opposing knowledge (favoring alternatives), and irrelevant or noisy statements. Our analysis reveals consistent patterns: highly confident predictions largely ignore new evidence, while moderate and low-confidence predictions are more sensitive, with the model switching between similarly probable options. Low-confidence choices may gain probability but rarely overturn the initial decision. The model remains robust to noisy or irrelevant information as long as relevant knowledge dominates the context.

An Empirical Study of Attention-Based Cross-Modal Retrieval for Movies

Mon, 29 Jun 2026 00:00:00 +0000

We present an empirical study of attention-based cross-modal retrieval for movies. Our approach combines text overviews, poster images, and trailer thumbnails using a cross-attention fusion module to learn unified item representations. To support this study, we augment MovieLens 1M with metadata from The Movie Database (TMDB), including overview text, poster images, and static trailer thumbnails. We evaluate text-only, image-only, and fused representations on top-K retrieval metrics, and compare them with interaction-only baselines based on Bayesian Personalized Ranking (BPR) and LightGCN. The results show that image-only retrieval achieves the strongest Recall@K and NDCG@K performance, while the fused model produces qualitatively more semantically balanced recommendations but does not outperform the strongest unimodal baseline. These findings suggest that attention-based multimodal fusion can improve recommendation coherence and interpretability, while also highlighting the challenge of translating cross-modal signals into stronger ranking performance.
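
For readers unfamiliar with the top-K metrics mentioned above, the following is a minimal sketch of Recall@K and NDCG@K under binary relevance. The abstract does not state the exact definitions used in the study, so the normalization choices here are assumptions.

```python
import math


def recall_at_k(ranked_items, relevant, k):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_items, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranking = ["a", "b", "c", "d"]  # hypothetical ranked item ids
liked = {"a", "c"}              # hypothetical held-out relevant items
print(recall_at_k(ranking, liked, 3), ndcg_at_k(ranking, liked, 3))
```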

Domain-Adapted Fine-Tuning of ECG Foundation Models for Multi-Label Structural Heart Disease Screening

Mon, 29 Jun 2026 00:00:00 +0000

Transthoracic echocardiography is the reference standard for confirming structural heart disease (SHD), but its use as a first-line screening modality is limited by cost, workflow burden, and specialist availability. This study investigated whether open pretrained electrocardiogram (ECG) models can support echo-confirmed multi-label SHD detection using the public EchoNext Mini-Model benchmark. We focused on six moderate-or-greater echocardiography-derived abnormalities spanning reduced left ventricular ejection fraction, increased left ventricular wall thickness, aortic stenosis, mitral regurgitation, tricuspid regurgitation, and right ventricular systolic dysfunction. Under a common experimental pipeline, we compared engineered ECG features with gradient boosting, end-to-end waveform learning from scratch, and transfer from open ECG foundation models. We then evaluated continued in-domain self-supervised adaptation of ECG-FM on EchoNext waveforms followed by selective supervised fine-tuning, with emphasis on the trade-off between discrimination and adaptation cost. Among the evaluated configurations, the adapted ECG-FM models achieved the strongest overall performance. Across adaptation depths, peak macro-AUROC and macro-AUPRC reached 0.8509 and 0.4297, respectively, while a more parameter-efficient operating point preserved nearly identical AUROC (0.8501) and achieved the highest fixed-threshold macro-F1 (0.3691). Late fusion of the release-provided covariates did not improve threshold-independent discrimination, and the evaluated low-rank adaptation (LoRA) configuration, alternative foundation backbones, and mixture-of-foundation-model strategies did not surpass the best adapted single-backbone operating points. These findings indicate that, for ECG-based case finding and echocardiography triage, the most effective transfer strategy is to combine target-domain self-supervised adaptation with selective supervised updating of a pretrained ECG backbone.

Lightweight Neuro-Symbolic Anomaly Detection of Traffic

Mon, 29 Jun 2026 00:00:00 +0000

Traffic plays a crucial role in modern life, influencing the economy, public safety, environmental health, and overall quality of life. As cities grow and transportation networks become more complex, urban planning and monitoring become even more difficult tasks. For this reason, extensive sensor networks are deployed across road infrastructures to collect valuable data on a continuous basis. Real-world traffic data is highly dynamic and complex, which makes accurate understanding, timely and meaningful forecasting, and actionable monitoring of traffic behaviour a significant challenge. This paper proposes a neuro-symbolic workflow for lightweight, real-time traffic anomaly detection, designed to handle diverse traffic conditions effectively. The proposed approach is validated in two distinct case studies: a dense urban traffic corridor in Brussels, Belgium, and a large-scale highway network in the San Francisco Bay Area, USA. While the Brussels dataset offers fine-grained temporal data over an extended period, the San Francisco dataset covers a vast number of monitored locations. The results demonstrate the effectiveness of our method in identifying anomalous traffic behaviour, providing valuable insights for traffic management and decision-making.

Differentiable Logic Gate Networks for Low-Latency EEG Classification on Edge Devices

Mon, 29 Jun 2026 00:00:00 +0000

Real-time EEG classification on edge devices is bottlenecked by the floating-point arithmetic of conventional neural networks. We investigated Differentiable Logic Gate Networks (Diff-Logic) as a hardware-native alternative that compiles models into pure Boolean circuits executable via bitwise CPU operations. Through rigorous iso-parameter experiments across four EEG datasets spanning two classification tasks, binary dementia detection and 3-class emotion recognition, we compared Diff-Logic against matched-capacity Multi-Layer Perceptron (MLP) and Binarized Neural Network (BNN) baselines at four complexity tiers (50k–500k parameters). On dementia screening, Diff-Logic achieved 80.2% Macro F1, outperforming the MLP baseline by 6.8%. On emotion recognition, the MLP retained a moderate performance advantage but incurred a 2.3$\times$ higher latency and 14$\times$ larger model size when deployed on a power-constrained (7W) Nvidia Jetson Orin Nano CPU (Single-core). Critically, Diff-Logic inference time remained nearly constant across a 10$\times$ increase in model scale, achieving a peak speedup of 2.9$\times$ over MLPs at the largest complexity tier. Our results establish logic-based neural architectures as a practical paradigm for resource-constrained brain–computer interfaces, achieving competitive or superior performance while natively satisfying the latency and memory constraints of portable edge deployment. Code is available on GitHub: https://github.com/Shyamal-Dharia/eeg-difflogic.

Citation Constraints and Reference Hallucinations in Large Language Models

Mon, 29 Jun 2026 00:00:00 +0000

This paper investigates reference hallucinations in large language models (LLMs) under different prompting constraints. Thirty-six academic-style documents were generated across four systems: Gemini 3, ChatGPT 5.1, ChatGPT 4o, and Microsoft 365 Copilot, and evaluated using an automated citation verification method that cross-checks references against Crossref, OpenAlex, and arXiv. The results show that stricter citation requirements are associated with higher rates of invalid or inconsistent references, whereas unconstrained prompts more frequently produce unsupported conceptual claims rather than fabricated citations. These findings indicate that hallucination behaviour depends on task structure rather than simply topic difficulty, highlighting the importance of prompt design and verification when LLMs are used for research-style writing and literature assistance.

Stabilizing Black-Box Prompt Optimization with Textual Regularization and Signal Aggregation

Mon, 29 Jun 2026 00:00:00 +0000

An increasing number of NLP applications interact with large language models (LLMs) through black-box APIs, making prompt engineering critical for controlling model behavior. Recent Automatic Prompt Optimization (APO) methods iteratively refine prompts using model-generated critiques (often called textual gradients), but they predominantly optimize from failures and underutilize information contained in correct predictions, leading to instability and semantic drift. We propose TRAS (Textual Regularization with Aggregated Signals), a feedback-centric framework that is plug-and-play with existing APO search backbones. It retains the standard textual gradient signal from prior work for error correction, and introduces a complementary textual regularizer derived from successful predictions to preserve beneficial prompt components. Because both signals are stochastic and can be noisy, we further introduce Monte Carlo Signal Aggregation (MCSA), which samples multiple gradients or regularizers and aggregates them into a single actionable directive, emphasizing consistent, actionable advice while filtering out outliers. Motivated by rapid model churn, we also formalize Automatic Prompt Migration (APM), the practical problem of adapting an expert prompt across model versions or API providers without losing critical instructions. Across standard APO and APM scenarios, our approach consistently outperforms strong baselines, yielding higher accuracy, faster convergence, and lower query cost, while substantially reducing the degradation observed under naive prompt migration.

Learning Adaptive Wiener Processes for Stochastic Financial Datasets with Physics-Informed Kolmogorov-Arnold Encoder-Decoder Networks

Mon, 29 Jun 2026 00:00:00 +0000

Financial time series are non-stationary, heavy-tailed, and regime dependent, which complicates price forecasting and undermines the robustness of standard machine learning and deep learning models across assets. This work proposes a physics-informed framework, Adaptive Wiener KAN-RNN, that learns stochastic dynamics of the underlying financial dataset by operating directly in stochastic differential equation (SDE) parameter space: a Kolmogorov–Arnold Network (KAN) encoder transforms 120-day price windows into spline-based functional features tailored to drift ($\mu_t$) and log-volatility ($\log \sigma_t$), and a long short-term memory (LSTM) or gated recurrent unit (GRU) decoder models their temporal evolution as latent state processes. A Wiener-process-based loss enforces consistency with geometric Brownian motion (GBM) by aligning the distribution of simulated and realized price paths, ensuring that the learned parameters remain stochastically coherent. Experiments on technology equities, including Apple (AAPL) and Microsoft (MSFT), show that this architecture delivers systematically lower error metrics and near-perfect explanatory power compared with dense KAN and conventional LSTM/GRU baselines, while yielding interpretable time-varying estimates of market drift and volatility.
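
For reference, the textbook form of geometric Brownian motion with time-varying drift and volatility is given below. The abstract does not spell out how the Wiener-process-based loss uses it, so this is only the standard model, with $S_t$ denoting the price, $W_t$ a Wiener process, and $\Delta t$ a step over which the parameters are assumed constant:

$$
dS_t = \mu_t S_t \, dt + \sigma_t S_t \, dW_t
$$

$$
S_{t+\Delta t} = S_t \exp\!\left( \left( \mu_t - \tfrac{1}{2}\sigma_t^2 \right) \Delta t + \sigma_t \sqrt{\Delta t}\, Z \right), \qquad Z \sim \mathcal{N}(0, 1)
$$

The second line is the exact one-step solution under that constant-parameter assumption, which is what one would typically use to simulate price paths from estimated $\mu_t$ and $\sigma_t$.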

Representation Effects in Child and Youth Mental Health Emergency Readmission Predictions

Mon, 29 Jun 2026 00:00:00 +0000

Predicting mental health–related emergency department readmission in youth remains challenging, and the role of data representation is underexplored. Using the National Survey on Drug Use and Health (ages 12–18), we compare three representations: (1) structured tabular features, (2) template-generated clinical text, and (3) LLM-derived sentence embeddings. Classical models are trained on tabular data and embeddings, while LLMs are applied to text. Results show that tabular features consistently yield the best and most stable performance. Templated text introduces a representational bottleneck and is less robust under distribution shift, while embeddings preserve some semantics but do not outperform tabular inputs. Representation choice is thus critical for predictive performance.

A Quantitative Evaluation Protocol for Assessing the Clinical Usefulness of 3D Saliency Explanations for MRI-based Alzheimer’s Classification

Mon, 29 Jun 2026 00:00:00 +0000

While Explainable AI (XAI) is widely considered essential for building clinical trust in MRI-based 3D deep learning models for Alzheimer’s disease (AD) detection, the clinical validation of these explanations is insufficiently rigorous. Current evaluation protocols for assessing clinical usefulness rely mainly on subjective visual inspections or limited ‘top-k’ regional attribution overlap measures. These methods do not offer a standardized benchmark, making it difficult to objectively determine which explanation method most accurately aligns with the complex and distributed nature of neurodegenerative pathology. To address this gap, this paper proposes a quantitative evaluation protocol for assessing the clinical usefulness of 3D saliency maps through metric-based anatomical alignment. We implement a comprehensive scoring system based on AD neuropathology that assigns clinical importance weights to anatomical regions, allowing for mathematical verification of explanation integrity. We employ a variety of ranking and alignment metrics to evaluate five gradient-based XAI methods: Grad-CAM, Grad-CAM++, HiResCAM, Backpropagation, and Guided Backpropagation, applied to a pre-trained 3D DenseNet architecture. Our findings reveal notable disparities in usefulness that visual inspection and the existing regional overlap protocol often fail to detect properly. Among XAI methods, Grad-CAM++ demonstrated considerable instability and poor alignment with clinical relevance, while Backpropagation and Guided Backpropagation displayed superior spatial consistency by effectively prioritizing clinically significant biomarkers. This protocol provides a structured approach for evaluating explanation methods, advancing empirical alignment between XAI outputs and established pathological evidence.

Neuro-Symbolic Adaptive Collaboration of Arena-Based Argumentative LLMs for Contestable Legal Reasoning

Mon, 29 Jun 2026 00:00:00 +0000

Legal reasoning requires not only high accuracy but also the ability to justify decisions through verifiable and contestable arguments. However, existing Large Language Model (LLM) approaches, such as Chain-of-Thought (CoT) and Retrieval-Augmented Generation (RAG), often produce unstructured explanations that lack a formal mechanism for verification or user intervention. To address this limitation, we propose Adaptive Collaboration of Argumentative LLMs (ACAL), a neuro-symbolic framework that integrates adaptive multi-agent collaboration with an Arena-based Quantitative Bipolar Argumentation Framework (A-QBAF). ACAL dynamically deploys expert agent teams to construct arguments, employs a clash resolution mechanism to adjudicate conflicting claims, and utilizes uncertainty-aware escalation for borderline cases. Crucially, our framework supports a Human-in-the-Loop (HITL) contestability workflow, enabling users to directly audit and modify the underlying reasoning graph to influence the final judgment. Empirical evaluations on the LegalBench benchmark demonstrate that ACAL outperforms strong baselines across Gemini-2.5-Flash-Lite and Gemini-2.5-Flash architectures, effectively balancing efficient predictive performance with structured transparency and contestability.

Poly-WaveGC: A Generalized Spectral Wavelet Graph Convolution Network with Adaptive Orthogonal Polynomials

Mon, 29 Jun 2026 00:00:00 +0000

Spectral Graph Neural Networks (GNNs) typically rely on fixed polynomial bases, such as Chebyshev polynomials, to approximate graph filters. While efficient, these bases enforce a rigid weight function that implicitly assumes a specific prior on the graph signal density, often leading to suboptimal fitting on graphs with diverse spectral distributions. In this paper, we propose Poly-WaveGC, a generalized spectral graph wavelet framework. We introduce a novel zero-point shift mechanism to adapt general Jacobi polynomials for wavelet construction. This approach allows the basis parameters ($\alpha$, $\beta$) to flexibly learn the graph’s specific spectral density while strictly enforcing the wavelet admissibility condition ($g(0)=0$) structurally. To address the loss of orthogonality inherent in adaptive bases, we introduce an explicit frame-bound regularization that constrains the filter bank to approximate a tight frame, thereby guaranteeing numerical stability. Extensive experiments on 10 benchmarks demonstrate that Poly-WaveGC significantly outperforms fixed-basis baselines on diverse graph structures and tasks, while maintaining robustness in deep networks. The code is available at https://github.com/weicaocw/Poly-WaveGC-public.

Visualizing the Elimination of Arbitrary Variables in Bayesian Networks as Compound Bayesian Networks

Mon, 29 Jun 2026 00:00:00 +0000

Research on Bayesian network (BN) inference continues to this day along two main fronts: scalable inference and deepening our understanding of the semantics of intermediate inference steps. In this theoretical paper, falling in the latter direction, we give a novel graphical representation of eliminating arbitrary variables from discrete BNs. This includes methods that represent both multiplication and marginalization operations and involves extending classical BNs to compound BNs. Our main result formally establishes a one-to-one correspondence between intermediate numeric factorizations and graphical representations.

Beyond Information Sufficiency: Observation-Action Space Alignment in Robotic Reinforcement Learning

Mon, 29 Jun 2026 00:00:00 +0000

Observation design is a fundamental yet under-specified component of robotic reinforcement learning (RL). While classical theory emphasizes that observations should be informationally sufficient, we show, through a focused reaching case study, that sufficiency alone does not guarantee learnability or sim-to-real transfer. Using PPO on a 6-DOF Kinova Gen3 Lite arm, we demonstrate that two observation spaces with equal dimensionality and theoretically equivalent information content (9D joint-based vs. 9D Cartesian-based) differ by over 60 percentage points in success when paired with Cartesian velocity control. Aligned Cartesian observations consistently learn faster, achieve higher success, and transfer zero-shot to the physical robot, whereas misaligned joint observations fail despite being sufficient in principle. Our findings highlight representational alignment between observations, actions, and rewards as a first-order design constraint in robotic RL, demonstrated through controlled simulation and zero-shot real-world deployment.

Towards Custom AI Benchmarking for the Government of Canada

Mon, 29 Jun 2026 00:00:00 +0000

The Government of Canada (GC) has several options when selecting artificial intelligence (AI) systems to support its operations. At the same time, AI safety issues motivate it to assess these models for various risks that could harm users seeking information about government operations. Thus, evaluation of AI systems has become an area of concern for the GC. Existing AI benchmarks are generally not adequate to inform this evaluation and selection process. To address this problem, we are building CAN-Bench, a bilingual benchmark designed for the Canadian public service context. Based on a dataset compiled from public GC documents, we automatically generate a bilingual set of high-quality questions around government knowledge, safety, and public service values. This paper describes the methodology for benchmark construction and a comparison of various AI models on the benchmark. Our results indicate that while the AI models we tested are good at answering general knowledge questions about government policies, they are not always aligned with public sector values such as non-partisanship, and can potentially provide unsafe responses in some scenarios.

Directional Stock Prediction with Temporal Sentiment

Mon, 29 Jun 2026 00:00:00 +0000

Financial market forecasting is increasingly incorporating textual sentiment cues from news and financial reports alongside traditional market indicators. While prior work has frequently incorporated aggregated daily sentiment indicators, the temporal structure through which sentiment propagates into market movements remains underexplored. Sentiment influence on index price dynamics may continue beyond a single observation period, thereby limiting the ability of point-in-time sentiment measures to capture it. Moreover, predictive performance is often evaluated using regression metrics such as Mean Absolute Error and Mean Absolute Percentage Error. Although these metrics provide valuable insights into prediction error, they fall short of capturing the effectiveness required in financial settings, where accurately predicting the direction of price movements is crucial. To address these limitations, we introduce temporal sentiment features that capture the persistence and evolution of market perception over time rather than relying on the last-day sentiment. In addition, we propose a Transformer-based forecasting architecture specifically designed to model temporal dependencies between sentiment and index returns. Our approach also prioritizes directional evaluation and incorporates an asymmetric custom objective function to better address the risks associated with negative market movements. Findings indicate that, while conventional error metrics are comparable to baseline models, the integration of temporal sentiment significantly enhances overall directional prediction. Furthermore, employing an asymmetric custom objective function, especially in the context of the Transformer-based model, improves the identification of downward trends while ensuring a more effective balance between positive and negative market fluctuations.
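
To make the distinction between error metrics and directional evaluation concrete, here is a minimal sketch. The abstract does not specify the asymmetric objective, so `asymmetric_squared_error` and its `downside_weight` are hypothetical stand-ins that simply weight errors more heavily on days with negative realized returns.

```python
def directional_accuracy(y_true, y_pred):
    """Share of periods where predicted and realized returns have the same direction.

    Zero is treated as 'not up' in this sketch.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for t, p in zip(y_true, y_pred) if (t > 0) == (p > 0))
    return hits / len(y_true)


def asymmetric_squared_error(y_true, y_pred, downside_weight=2.0):
    """Hypothetical asymmetric loss: squared error, up-weighted on down days."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for t, p in zip(y_true, y_pred):
        weight = downside_weight if t < 0 else 1.0
        total += weight * (t - p) ** 2
    return total / len(y_true)


realized = [0.01, -0.02, 0.03, -0.01]   # made-up daily returns
predicted = [0.02, 0.01, 0.01, -0.03]   # made-up forecasts
print(directional_accuracy(realized, predicted))
print(asymmetric_squared_error(realized, predicted))
```

Two forecasts with the same mean absolute error can differ sharply in directional accuracy, which is why the two kinds of metric are reported separately.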

Contextual Stance-Aware Semantic Graph Learning for Fake News Detection

Mon, 29 Jun 2026 00:00:00 +0000

The rapid spread of disinformation on social networks threatens public trust and democratic processes. We propose a unified framework for early detection of emerging false narratives by combining context-aware graph modeling with semantic analysis. Our method builds interaction graphs where posts are nodes connected by stance relations (agreement or disagreement). In parallel, a semantic module extracts fine-grained linguistic cues from each post. These signals are fused via a graph neural network that jointly models early diffusion patterns and content semantics to identify deceptive posts at their inception. Experiments on benchmark datasets show that our approach outperforms existing baselines, highlighting the effectiveness of integrating stance-aware graph representations with semantic understanding for scalable disinformation detection.

Generating Concept Lexicalizations via Dictionary-Based Cross-Lingual Sense Projection

Mon, 29 Jun 2026 00:00:00 +0000

We study the task of automatically expanding WordNet-style lexical resources to new languages through sense generation. We generate senses by associating target-language lemmas with existing lexical concepts via semantic projection. Given a sense-tagged English corpus and its translation, our method projects the annotated synsets onto aligned target-language tokens and assigns the corresponding lemmas to those synsets. To generate alignments and ensure their quality, we augment a pretrained base aligner with a bilingual dictionary, which is also used to filter incorrect sense projections. We evaluate the method on multiple languages, comparing it to prior methods, as well as dictionary-based and large language model baselines. Results show that the proposed project-and-filter strategy improves precision while remaining interpretable and resource-efficient. We release our code, documentation, and generated sense inventories at https://github.com/UAlberta-NLP/ExpandNet.

Optimizing Bayesian Neural Networks for Genomic Prediction: A Study on Feature Selection and Architecture

Mon, 29 Jun 2026 00:00:00 +0000

Genome-wide association studies (GWAS) scan the genome for genetic variants, typically single nucleotide polymorphisms, whose alleles are associated with phenotypic variation across individuals. GWAS and genomic prediction face a core challenge: learning from extremely high-dimensional genotype matrices under limited sample sizes. Bayesian neural networks offer uncertainty-aware prediction and the capacity to represent nonlinear genetic effects, but their practical performance depends on feature selection and architectural choices that interact with the inference mechanism. This paper presents an empirical study that improves a Bayesian neural network pipeline for genomic prediction by tuning input selection strategies, network depth and width, and activation functions under Hamiltonian Monte Carlo inference. We compare three approaches: a deterministic ResNet baseline, a standard “out of the box” Bayesian neural network, and an optimized Bayesian neural network produced through targeted tuning. Results show that feature selection is necessary for stable learning under the large $p$, small $n$ regime and that smooth activations are primary drivers of improved posterior exploration and predictive accuracy. On an Ear Height benchmark from the TASSEL tutorial ecosystem, the optimized BNN achieves a test $R^2$ near $0.68$, outperforming the standard BNN and the deterministic baseline.

A Hybrid ALNS-Based Approach for the Electric Vehicle Routing Problem with Time Windows

Mon, 29 Jun 2026 00:00:00 +0000

Facing the need for low-carbon vehicle routing solutions, this paper addresses the Electric Vehicle Routing Problem with Time Windows (EVRPTW), which involves battery limitations and recharging constraints. In particular, we propose a hybrid metaheuristic approach that combines Adaptive Large Neighborhood Search (ALNS), a Greedy Time-Oriented Nearest Neighborhood Heuristic (GTONNH), and Tabu Search (TS). Experiments on standard benchmark instances show that the proposed GTONNH-ALNS-TS variant outperforms baseline approaches, achieving the best solutions on more than 60% of large-scale instances and reaching the minimum fleet size in nearly 95% of the cases. On average, the proposed approach reduces the required number of vehicles by more than one compared to classical ALNS, while maintaining competitive travel distances. High-quality solutions are obtained within a few seconds on small and medium-sized instances, highlighting the efficiency of the proposed framework.

Output-Distribution Divergence as a Pre-Interpretation Gate for Mental Health AI

Mon, 29 Jun 2026 00:00:00 +0000

Machine learning models in mental health are widely used to generate risk scores from observational data, yet their outputs are frequently interpreted in causal or intervention-oriented terms without explicit checks on whether such interpretations transport across demographic contexts. We propose a simple governance-oriented diagnostic that compares predicted-probability distributions across contexts against within-context baselines using standard divergence measures, functioning as a pre-interpretation gate rather than a causal estimator. We operationalize this protocol using Jensen–Shannon divergence and Wasserstein distance, calibrated via bootstrapped intra-context baselines, and evaluate it on depression risk prediction using PHQ-9 data from the National Health and Nutrition Examination Survey (NHANES) 2017–2020. Across age and sex contexts, we find that cross-context divergence consistently exceeds baseline variation, particularly for age-group transfers, even when discrimination metrics such as AUC remain stable. These results demonstrate that performance-based validation alone can mask substantial distributional instability in predicted probabilities, with implications for calibration and interpretability. We argue that output-distribution divergence provides a low-cost, model-agnostic diagnostic for identifying transportability risk prior to deploying or interpreting mental health prediction models in intervention-relevant settings.
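
The gating idea can be sketched as follows: compare the divergence between two contexts against a bootstrapped within-context baseline. The abstract does not give the binning, resampling scheme, or threshold, so the 10-bin histogram, the 95th-percentile cutoff, and all function names here are illustrative assumptions. Scores are assumed to be predicted probabilities in [0, 1].

```python
import math
import random


def histogram(scores, bins=10):
    """Normalized histogram of probabilities in [0, 1]."""
    counts = [0] * bins
    for s in scores:
        counts[min(int(s * bins), bins - 1)] += 1
    return [c / len(scores) for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so the value lies in [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def wasserstein_binned(p, q, bin_width):
    """1-D Wasserstein-1 distance between two histograms on the same grid."""
    cum_p = cum_q = total = 0.0
    for a, b in zip(p, q):
        cum_p += a
        cum_q += b
        total += abs(cum_p - cum_q)
    return total * bin_width


def bootstrap_baseline(scores, n_boot=200, bins=10, quantile=0.95, seed=0):
    """Upper quantile of JS divergence between bootstrap resamples of ONE context."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_boot):
        a = [rng.choice(scores) for _ in scores]
        b = [rng.choice(scores) for _ in scores]
        values.append(js_divergence(histogram(a, bins), histogram(b, bins)))
    values.sort()
    return values[min(int(quantile * n_boot), n_boot - 1)]


def exceeds_baseline(scores_a, scores_b, bins=10):
    """Gate: flag when cross-context divergence exceeds within-context variation."""
    cross = js_divergence(histogram(scores_a, bins), histogram(scores_b, bins))
    baseline = bootstrap_baseline(scores_a, bins=bins)
    return cross > baseline, cross, baseline


context_a = [0.1] * 50 + [0.2] * 50  # made-up predicted probabilities
context_b = [0.8] * 50 + [0.9] * 50
print(exceeds_baseline(context_a, context_b))
```

A flag from such a gate would not say why the distributions differ; it only signals that interpretation should not be transported across the two contexts without further checks.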

GitHub’s Copilot Code Review: Can AI Spot Security Flaws Before You Commit?

Mon, 29 Jun 2026 00:00:00 +0000

As software development practices increasingly adopt AI-powered tools, ensuring that such tools can support secure coding has become critical. This study evaluates the effectiveness of GitHub Copilot’s recently introduced code review feature in detecting security vulnerabilities. Using a curated set of labeled vulnerable code samples drawn from diverse open-source projects spanning multiple programming languages and application domains, we systematically assessed Copilot’s ability to identify and provide feedback on common security flaws. Contrary to expectations, our results reveal that Copilot’s code review frequently fails to detect critical vulnerabilities such as SQL injection, cross-site scripting (XSS), and insecure deserialization. Instead, its feedback primarily addresses low-severity issues, such as coding style and typographical errors. These findings expose a significant gap between the perceived capabilities of AI-assisted code review and its actual effectiveness in supporting secure development practices. Our results highlight the continued necessity of dedicated security tools and manual code audits to ensure robust software security.

Towards Optimizing Proximal Policy Optimization PPO through Supervised Model-Support

Mon, 29 Jun 2026 00:00:00 +0000

Reinforcement Learning enables agents to learn behaviors by interacting with the environment and maximizing cumulative rewards. Model-free methods are widely used for their simplicity and flexibility, but often suffer from slow convergence as they solely rely on trial-and-error learning without knowledge of environment dynamics. Proximal Policy Optimization (PPO) is a popular on-policy algorithm that collects data using its current policy. However, because PPO relies on freshly sampled trajectories, it has limited ability to reuse past experiences, which can lead to repeatedly exploring suboptimal behaviors and slow policy improvement. To address this, we present Model-Support (MS), a supervised assistant that maintains model-free learning principles while improving efficiency. MS learns state-action pairs from high-return trajectories and serves as a supplementary policy that clones high-performing behaviors. While the agent explores broadly, the MS policy samples meaningful actions based on those behaviors. This combination leads to greater diversity of actions by mixing broad sampling from the agent’s actor with focused sampling from the MS policy. MS acts as a form of local memory, capturing high-reward trajectories and guiding exploration toward promising regions that the agent policy might overwrite or miss. Although PPO uses advantage estimates to emphasize better actions within sampled data, it does not explicitly prioritize high-return trajectories. Consequently, suboptimal experiences still influence learning, weakening valuable signals and slowing convergence. This highlights the role of MS in preserving and cloning high-return behaviors to guide exploration and accelerate convergence.

From Hints to Answers: Uncertainty-Aware LLM-Guided Retrieval for Multi-Hop Question Answering

Mon, 29 Jun 2026 00:00:00 +0000

We propose Generate-Retrieve-Generate (GReG), a training-free pipeline for multi-hop open-domain question answering. GReG uses a strong LLM to generate multiple long-form “hints” that expose implicit intermediate facts in the question, and uses the selected hint as a retrieval query for gathering supporting evidence. To choose among candidate hints, we introduce an uncertainty-aware selection method, which favors lower-entropy generations. By improving retrieval quality, GReG enables a smaller, cost-efficient answer generator to answer complex multi-hop questions more accurately. Experiments on HotpotQA and 2WikiMultihopQA show that GReG achieves state-of-the-art performance under identical retrieval and generation settings.
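
The uncertainty-aware selection step can be illustrated with a short sketch. The abstract only says that lower-entropy generations are favored; scoring each hint by its mean per-token entropy, and the input format of `(hint_text, token_distributions)` pairs, are assumptions made here for illustration.

```python
import math


def mean_token_entropy(token_distributions):
    """Average Shannon entropy (nats) over a hint's per-token distributions."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)


def select_hint(candidates):
    """Return the hint text whose generation had the lowest mean token entropy.

    `candidates` is a list of (hint_text, token_distributions) pairs, where each
    distribution is a list of next-token probabilities (hypothetical format).
    """
    best = min(candidates, key=lambda c: mean_token_entropy(c[1]))
    return best[0]


candidates = [
    ("diffuse hint", [[0.5, 0.5], [0.5, 0.5]]),
    ("confident hint", [[0.9, 0.1], [0.8, 0.2]]),
]
print(select_hint(candidates))
```

The selected hint would then serve as the retrieval query in place of the original question.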

QuantFormer: A Hybrid Quantum Classical Transformer for Hyperspectral Image Classification

Mon, 29 Jun 2026 00:00:00 +0000

Hyperspectral image (HSI) classification is challenging because each pixel has hundreds of spectral bands while only a small number of labelled samples are available. This paper presents QuantFormer, a hybrid quantum–classical transformer that embeds a small variational quantum circuit as a spectral token encoder inside a vision transformer backbone for pixel-wise land-cover mapping. A unified patch-based pipeline with band-wise normalization, principal component analysis, and quantum token encoding is evaluated on four benchmarks: Indian Pines, Pavia University, a 7-class Houston 2013 subset, and EuroSAT_MS. With roughly 35k trainable parameters, QuantFormer attains overall accuracy above 99% on the three airborne hyperspectral datasets and about 89.8% on EuroSAT_MS, competitive with deep 3D CNNs while using substantially fewer weights. Beyond full-data experiments, we also study limited-label regimes and provide practical guidance on when quantum token encoders are a viable alternative to classical projections, without claiming quantum advantage over the strongest classical baselines.

LLM-Enhanced Hypergraph Learning for Review-Based Cross-Domain Recommendation

Mon, 29 Jun 2026 00:00:00 +0000

A major challenge in recommender systems is data sparsity. Cross-domain recommendation (CDR) addresses this issue by transferring knowledge from high-resource (HR) to low-resource (LR) domains, but existing methods largely rely on user ratings that provide only implicit preference signals. In this work, we propose a review-based CDR framework that leverages Large Language Models (LLMs) to extract fine-grained product aspects and associated user sentiments from reviews, capturing explicit and nuanced user preferences. The extracted aspects are aggregated across source and target domains, and the relationships among users, items, and aspect-level features are jointly modeled using a hypergraph representation. In this formulation, each hyperedge explicitly connects a user, an item, and the corresponding extracted aspects, enabling a unified representation of their interdependent relationships. The resulting model is trained with a hypergraph neural network (HGNN) to enable effective preference transfer across domains. Experiments show that our approach significantly improves personalized recommendations in data-sparse settings, outperforming strong baselines while maintaining efficient knowledge transfer through shared semantic representations.
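
To make the hypergraph formulation concrete, the following minimal sketch builds one hyperedge per review, connecting the user, the item, and the aspects extracted from that review. The data layout (a list of `(user, item, aspects)` tuples) and the function name are assumptions for illustration; the abstract does not describe the actual construction or how sentiments are attached.

```python
from typing import Dict, List, Set, Tuple

Node = Tuple[str, str]  # (node type, identifier), e.g. ("user", "u1")


def build_hypergraph(
    reviews: List[Tuple[str, str, List[str]]]
) -> Tuple[Dict[Node, int], List[Set[int]]]:
    """Build one hyperedge per review over user, item, and aspect nodes.

    Hypothetical layout: each review is (user_id, item_id, extracted_aspects).
    Returns a node-to-index map and a list of hyperedges (sets of node indices).
    """
    node_index: Dict[Node, int] = {}
    hyperedges: List[Set[int]] = []
    for user, item, aspects in reviews:
        members: List[Node] = [("user", user), ("item", item)]
        members += [("aspect", a) for a in aspects]
        edge: Set[int] = set()
        for node in members:
            # Assign the next free index the first time a node is seen.
            edge.add(node_index.setdefault(node, len(node_index)))
        hyperedges.append(edge)
    return node_index, hyperedges
```

Aspects that appear in reviews from both the source and target domains map to the same node, which is one plausible way shared aspect nodes could carry preference information across domains.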

Lexicon-Guided Morphological Tag Injection for Low-Resource Filipino-Cebuano Neural Machine Translation

Mon, 29 Jun 2026 00:00:00 +0000

Neural Machine Translation (NMT) remains difficult for low-resource languages, especially those with complex word formation systems. This work focuses on the Filipino–Cebuano language pair, where verbs encode voice and aspect using different morphological patterns. Although the two languages are closely related, their distinct verb formation strategies often create ambiguity and mismatches during translation, leading to errors in predicate interpretation and grammatical alignment. Pretrained multilingual models such as NLLB-200 provide broad language coverage, but they frequently struggle with predicate-level accuracy in closely related Philippine languages due to insufficient explicit morphological grounding. We propose a lexicon-guided morphological tag injection framework that enriches source-side input with structured linguistic cues, including aspect and voice markers derived from a curated morphological lexicon. Rather than modifying the model architecture or introducing new token embeddings, we inject morphological metadata directly into the input sequence and perform parameter-efficient fine-tuning using Low-Rank Adaptation (LoRA). Experimental results show consistent improvements over baseline fine-tuning, particularly in constructions involving complex verbal morphology and one-to-many or many-to-one lexical mappings.

CA2: Code-Aware Agent for Automated Game Testing

Mon, 29 Jun 2026 00:00:00 +0000

Automated game testing is important for verifying game functionality, but it remains a costly and time-consuming process. Manual testing often misses edge cases, and current automated methods struggle to provide full code coverage. Prior work has explored reinforcement learning (RL) for game testing, but without leveraging internal code signals such as the call stack. We present Code-Aware Agent (CA2), which uses call stack information to learn effective testing strategies. The agent receives the current function call trace along with the game state and learns to reach specific target functions. We instrument two types of environments, state-based and image-based, with support for efficient call stack extraction. Through experimental evaluation, we find that CA2 achieves consistent improvement over non-code-aware baselines, which do not leverage call stack information. Our results show that incorporating code signals like the call stack enables more effective and targeted game testing.

Physics-Guided Diffusion Models for Production Forecasting with Limited Well Data

Mon, 29 Jun 2026 00:00:00 +0000

Accurate forecasting of oil and gas well production from limited early-time data is critical for reservoir management and timely land cleanup. We present Physics-SIMS-TS, a novel framework that adapts Self-Improving Diffusion Models with Synthetic Data (SIMS) for time series forecasting and integrates physics-based decline-curve constraints. Given only the first 20% of a well’s production history, our model predicts the remaining 80%. We first adapt the SIMS methodology, originally developed for image generation, to time series by training a conditional diffusion model with negative guidance that steers generation away from synthetic artifacts. Experiments show that domain-aware data augmentation (Inverse Distance Weighting) outperforms generic generative approaches (TimeGAN, TimeVAE) by 1.7x, demonstrating that incorporating domain-specific knowledge improves forecasting performance. Building on this insight, we introduce Physics-SIMS-TS, which integrates Arps decline-curve dynamics through gradient guidance during sampling and monotonicity projection via isotonic regression. Experiments on 16,216 gas wells from British Columbia, Canada, spanning multiple geological formations, demonstrate that Physics-SIMS-TS achieves 1.9-6.2x lower prediction error than traditional machine learning baselines across all dataset sizes, with the largest improvements on small datasets where physics constraints most effectively regularize the learning problem.
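
The two physics components named above can be sketched in a few lines. The first function evaluates the standard Arps decline curve, q(t) = q_i / (1 + b D_i t)^(1/b), with the exponential limit q_i exp(-D_i t) at b = 0. The second performs a least-squares projection onto non-increasing sequences using the pool-adjacent-violators algorithm, which is the classic solver for isotonic regression. How these are wired into the diffusion sampler is not described in the abstract, and treating the forecast rate as non-increasing is an assumption made here for illustration.

```python
import math
from typing import List


def arps_rate(t: float, qi: float, di: float, b: float) -> float:
    """Arps decline rate at time t (initial rate qi, initial decline di, exponent b)."""
    if b == 0.0:
        return qi * math.exp(-di * t)                 # exponential decline
    return qi / (1.0 + b * di * t) ** (1.0 / b)       # hyperbolic (harmonic at b = 1)


def project_nonincreasing(values: List[float]) -> List[float]:
    """Least-squares projection onto non-increasing sequences (pool adjacent violators)."""
    blocks: List[List[float]] = []  # each block stores [sum, count]
    for v in values:
        blocks.append([v, 1.0])
        # Merge while the previous block's mean is below the current block's mean,
        # since that ordering violates the non-increasing constraint.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out: List[float] = []
    for s, c in blocks:
        out.extend([s / c] * int(c))
    return out
```

A sampler could, for example, apply `project_nonincreasing` to each intermediate forecast and use the gap between the forecast and `arps_rate` as a guidance signal, though the paper's exact scheme may differ.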

FogTTA: Online Test-Time Adaptation for Robust Transformer-based Object Detection in Foggy Weather

Mon, 29 Jun 2026 00:00:00 +0000

Object detection models for autonomous driving commonly experience substantial performance drops when deployed under adverse weather due to the domain shift between training data and real-world operating conditions. This degradation is especially evident when models trained on clear-weather images encounter foggy environments with reduced visibility and contrast. To address this challenge, we introduce FogTTA, an online test-time adaptation framework designed to improve the robustness of Transformer-based object detectors in fog. Using RF-DETR as the underlying object detector, FogTTA enables real-time adaptation to the streaming target domain without requiring source data or retraining. The framework follows a teacher–student design, where the deployed model serves as the teacher and generates pseudo labels from weakly augmented target inputs. These predictions are subsequently refined through non-maximum suppression and confidence filtering. The student model then learns from strongly augmented target samples using the Varifocal loss to mitigate pseudo-label noise. The teacher is updated via exponential moving averaging to ensure stable and continuous adaptation. Experiments show that FogTTA outperforms prior baselines, delivering improved detection accuracy and stability while maintaining real-time performance.
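
The exponential-moving-average teacher update mentioned above follows a simple rule: each teacher parameter moves a small step toward the corresponding student parameter. The sketch below shows that rule on plain dictionaries of scalar parameters so it runs without a deep learning framework. The momentum value and the function name are assumptions, not settings reported for FogTTA.

```python
from typing import Dict


def ema_update(teacher: Dict[str, float],
               student: Dict[str, float],
               momentum: float = 0.999) -> Dict[str, float]:
    """Return teacher parameters after one EMA step toward the student.

    teacher_new = momentum * teacher_old + (1 - momentum) * student
    The default momentum is an illustrative assumption.
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    return {
        name: momentum * value + (1.0 - momentum) * student[name]
        for name, value in teacher.items()
    }
```

A momentum close to 1 keeps the teacher nearly fixed between steps, which is the usual reason EMA teachers produce more stable pseudo labels than the rapidly changing student.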

Sequence-Based Evolutionary and Neural Strategies for Reducing Zone Crossings in Toolpaths

Mon, 29 Jun 2026 00:00:00 +0000

Excessive transitions between material zones in 3D printing reduce both efficiency and integrity. We introduce a hybrid optimization framework to minimize these zone crossings in Hamiltonian toolpaths. Our approach combines the local search capabilities of Simulated Annealing (SA) with the global exploration of a sequence-based Genetic Algorithm (GA). Furthermore, we propose a hybrid neural network that models the learned optimization behavior and predicts efficient sequences of operations. Experiments show that our method significantly reduces zone crossings across various complex patterns and proves its effectiveness and scalability for efficient multi-material additive manufacturing.
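
As a small illustration of the objective being minimized, the sketch below counts zone crossings along a toolpath, taking a crossing to be any pair of consecutive path nodes that lie in different material zones. This definition, the node and zone representation, and the function name are assumptions; the abstract does not state the precise cost function used by the SA and GA components.

```python
from typing import Dict, Hashable, Sequence


def count_zone_crossings(path: Sequence[Hashable],
                         zone_of: Dict[Hashable, str]) -> int:
    """Count consecutive node pairs on the path that belong to different zones.

    Hypothetical objective: `path` is an ordered visit sequence and `zone_of`
    maps each node to its material zone label.
    """
    return sum(
        1 for a, b in zip(path, path[1:])
        if zone_of[a] != zone_of[b]
    )
```

An optimizer such as SA or a GA could use this count as the fitness of a candidate visiting order, provided each candidate remains a valid Hamiltonian path on the underlying grid.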