<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Proceedings of Machine Learning Research</title>
    <description>Proceedings of Machine Learning Research
  Held in Singapore, Singapore on 26 January 2026

Published as Volume 303 by the Proceedings of Machine Learning Research on 27 January 2026.

Volume Edited by:
  Dorien Herremans
  Keshav Bhandari
  Abhinaba Roy
  Simon Colton
  Mathieu Barthet

Series Editors:
  Neil D. Lawrence
</description>
    <link>https://proceedings.mlr.press/v303/</link>
    <atom:link href="https://proceedings.mlr.press/v303/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Tue, 27 Jan 2026 17:07:28 +0000</pubDate>
    <lastBuildDate>Tue, 27 Jan 2026 17:07:28 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Prevailing Research Areas for Music AI in the Era of Foundation Models</title>
        <description>Parallel to rapid advancements in foundation model research, the past few years have witnessed a surge in music AI applications. As AI-generated and AI-augmented music become increasingly mainstream, many researchers in the music AI community may wonder: what research frontiers remain unexplored? This paper outlines several key areas within music AI research that present significant opportunities for further investigation. We begin by examining foundational representation models and highlight emerging efforts toward explainability and interpretability. We then discuss the evolution toward multimodal systems, provide an overview of the current landscape of music datasets and their limitations, and address the growing importance of model efficiency in both training and deployment. Next, we explore applied directions, focusing first on generative models. We review recent systems, their computational constraints, and persistent challenges related to evaluation and controllability. We then examine extensions of these generative approaches to multimodal settings and their integration into artists’ workflows, including applications in music editing, captioning, production, transcription, source separation, performance, discovery, and education. Finally, we explore copyright implications of generative music and propose strategies to safeguard artist rights. While not exhaustive, this survey aims to illuminate promising research directions enabled by recent developments in music foundation models.</description>
        <pubDate>Tue, 27 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v303/wei26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v303/wei26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Low-Resource Rhythm Learning of South Asian Beat Structures: Machine Learning Approaches to Nattuvangam</title>
        <description>Semantic representations of rhythmic structures are important for AI-driven music generation and choreography. South Asian classical dance, such as Bharatanatyam, relies on intricate rhythms that guide choreography and improvisation. These rhythms are expressed through Nattuvangam, a vocal and percussive form that uses rhythmic syllables (Solkattus) and cymbal cues (Talam). Despite its pedagogical importance, Nattuvangam is rarely documented in digital form, which limits systematic study and teaching. We present the first curated dataset of Nattuvangam recordings that capture diverse Solkattu patterns and cyclic Talam structures. Each clip is analyzed using handcrafted and learned features, including onset envelopes, inter-onset intervals, tempograms, and Mel-spectrogram embeddings. These representations allow machine learning models to identify, cluster, and retrieve rhythmic motifs across performances. The dataset serves as a pedagogical tool and supports computational exploration of Solkattu patterns in relation to Talam, revealing the structural principles underlying Nattuvangam. This work establishes a foundation for studying Nattuvangam as both a standalone and performative art form, bridging cultural teaching with AI-based rhythm analysis in low-resource contexts.</description>
        <pubDate>Tue, 27 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v303/sudarshan26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v303/sudarshan26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Artificial Dancing Intelligence: Neural Cellular Automata for Visual Performance of Music</title>
        <description>We present Artificial Dancing Intelligence (ADI), an interactive neural music visualizer that is accessed through a web app but performs inference entirely on local devices. Our approach enables anyone to create music-driven visuals while leveraging the expressive and sometimes unpredictable dynamics of self-organized systems. ADI uses an audio stream’s average energy (measured as its root-mean-square, or RMS, level) to modulate a neural cellular automaton (NCA) that produces visual patterns that move and ‘dance’ along with the audio stream in real time. Through the web interface, users can adjust the relationship between the music’s energy and the NCA system to create unique visual performances out of any music audio stream. ADI achieves smooth, real-time responsiveness on modern consumer devices.</description>
        <pubDate>Tue, 27 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v303/salcedo26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v303/salcedo26a.html</guid>
        
        
      </item>
    
      <item>
        <title>The Circle of Fifths as Latent Geometry in Bach’s Well-Tempered Clavier</title>
        <description>Can unsupervised deep learning methods encode fundamental music-theoretic features? We answer this question by training an autoencoder on J.S. Bach’s Well-Tempered Clavier and analyzing its latent space via principal component analysis. Sequences in the first two principal components are clustered hierarchically into pieces and keys that spontaneously arrange in a circle-of-fifths geometry. Quantitatively, relative major-minor key pairs (sharing pitch collections) lie more than three times closer than non-relative pairs, and circle-of-fifths distance correlates strongly with learned distances. This structure emerges entirely from reconstruction loss, with no harmonic labels or supervision. Our results suggest that the circle of fifths is an intrinsic property of tonal relationships, demonstrating that unsupervised representation learning can recover harmonic principles that open the door for interpretable data-driven exploration of latent spaces across diverse musical traditions.</description>
        <pubDate>Tue, 27 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v303/sadek26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v303/sadek26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Silence as Music: Controllable and Interpretable AI for Strategic Silence Placement</title>
        <description>AI music systems increasingly emphasize controllability and interpretable design. We propose a system that treats silence as a first-class compositional element and enables interactive shaping of silence placement through transparent analysis, cultural presets, and steerable controls. Our method constructs multiple candidate rest patterns from phrase boundaries, melodic tension, rhythmic heuristics, and cultural weights, then selects a mask via a quality function balancing rhythmic entropy, groove preservation, and structural coherence. We present baselines (random 10/25%, phrase-only, tension-only, weak-beats), a proxy for a language model without silence prompting, and our hybrid predictor. Across four canonical melodies and three cultural presets, our approach increases rhythmic variety while preserving groove and phrase alignment relative to baselines, offering an interpretable framework for co-creative composition. We release an API, offline demos, audio examples (WAV), and a comprehensive experiment suite to support interactive composition, pedagogy, and performance.</description>
        <pubDate>Tue, 27 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v303/ram26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v303/ram26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Neural Codec Language Model for Controllable Timbre Transfer in Music Synthesis</title>
        <description>Neural codec language models have revolutionized speech synthesis but face significant challenges when adapted to music generation, particularly in achieving precise timbre control while preserving melodic content. We introduce the Neural Codec Language Model for Controllable Timbre Transfer (NCLMCTT), a novel architecture that enables zero-shot instrument cloning through direct audio conditioning without explicit timbre learning. Our approach combines a 385M-parameter transformer for coarse musical structure modeling with a specialized upsampler for fine timbral detail, achieving flexible control through 1-5 second reference audio segments. We establish the first comprehensive benchmark dataset for controllable timbre transfer evaluation, comprising 62,500 high-fidelity samples across 50 synthesizer presets with ground truth targets. Extensive experiments demonstrate substantial improvements over the TokenSynth baseline: 27.1% reduction in SI-SDR, 50.9% in Mel Distance, and 59.4% in STFT Distance, while maintaining strong melodic coherence (Chroma Similarity: 0.85). Our method achieves robust zero-shot generalization, with performance on unseen instrument presets matching that of seen presets. Ablation studies confirm that extended reference audio duration (40.8% improvement), cross-attention mechanisms (11.9% improvement), and increased model capacity contribute meaningfully to overall performance. By separating melodic content from timbral characteristics and enabling implicit timbre control, NCLMCTT provides both immediate practical value for music creators and a methodological foundation for advancing controllable neural audio synthesis.</description>
        <pubDate>Tue, 27 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v303/liu26b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v303/liu26b.html</guid>
        
        
      </item>
    
      <item>
        <title>TS-RaMIA: Membership Inference Attacks for Symbolic Music Generation Models</title>
        <description>Artists and rights holders face growing concerns about unauthorized use of their copyrighted works in training generative models. We introduce TS-RaMIA, a practical auditing framework enabling creators to test whether their symbolic music has been used without authorization. Unlike existing likelihood-based approaches that are confounded by piece length and density, TS-RaMIA exploits structural tokens—bar lines, positions, and tempo markers—encoding musical phrasing through sample-level analysis and rigorous debiasing. Our method combines (i) length matching and conditional calibration to remove spurious confounders, (ii) tail-of-top-k aggregation on structural tokens to amplify sparse memorization signals, and (iii) a lightweight meta-attacker fusing statistical cues via composer-stratified cross-validation. Evaluated on a 67M-parameter REMI Transformer trained on MAESTRO pieces, TS-RaMIA achieves AUC 0.826 and TPR@1%FPR 14.6%, while a debiased baseline drops to AUC 0.563. Cross-representation validation on NotaGen (ABC notation) yields comparable performance (AUC 0.73, TPR@1%FPR 8.9%), demonstrating transferability.</description>
        <pubDate>Tue, 27 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v303/liu26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v303/liu26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Encoder-Only Transformers for Melodic Harmonization: Representation Emergence and Inference Strategies</title>
        <description>This paper addresses the problem of melodic harmonization—the automatic generation of harmonic accompaniments that complement a given melody—using non-autoregressive, encoder-only transformer models operating on a synchronized melody-harmony time grid. The proposed framework allows flexible conditioning, such as fixing chords at specific positions, while maintaining high generative quality. Comparative experiments show that single-encoder models outperform dual-encoder architectures despite using fewer parameters. Interestingly, harmony-related attention patterns emerge even when harmony tokens remain fully masked during training, and models using only cross-attention achieve comparable results, suggesting implicit modeling of harmony-harmony relations. Different inference unmasking strategies further reveal notable effects on harmonic structure and coherence.</description>
        <pubDate>Tue, 27 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v303/kaliakatsos-papakostas26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v303/kaliakatsos-papakostas26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Conditional Vocal Timbral Technique Conversion via Embedding-Guided Dual Attribute Modulation</title>
        <description>Vocal timbral techniques such as whisper, falsetto, and vocal fry scream uniquely shape the spectral properties of the human voice, presenting a complex challenge for converting between them while preserving the original speaker’s identity. Traditional voice conversion methods, while effective at altering speaker identity or broad timbral qualities, often struggle to transform specialized timbral techniques without compromising speaker-specific traits. Similarly, existing style-transfer models, which are designed to capture broad categories like emotional expressiveness or singing styles, lack the necessary granularity to handle technique-specific variations. To address this, we propose FABYOL, a novel framework for timbral technique conversion built upon FACodec. FABYOL leverages supervised contrastive learning to generate embeddings that encode specific timbral techniques. These embeddings are then used to modulate timbre and prosody, enabling authentic technique conversion while preserving speaker identity. Experimental evaluation, using both tailored objective metrics and a user study, demonstrates that FABYOL achieves promising performance and offers significant improvements in fidelity and flexibility compared to state-of-the-art models. To support this task, we also introduce the EMO dataset, a high-quality, paired corpus developed with a specific focus on vocal fry scream. Audio samples, source code, pre-trained checkpoints, and the EMO dataset are available at https://alberthsu0509.github.io/FABYOL/.</description>
        <pubDate>Tue, 27 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v303/hsu26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v303/hsu26a.html</guid>
        
        
      </item>
    
      <item>
        <title>A Novel Diffusion Model Based Approach for Sleep Therapeutic Music Generation</title>
        <description>Sleep disorders, particularly insomnia, and mental health conditions affect a significant fraction of adults worldwide, posing serious mental and physical health risks. Music therapy offers promising, low-cost, and non-invasive treatment, but current approaches rely heavily on expert-curated playlists, limiting scalability and personalization. We propose a low-cost generative system leveraging recent advances in diffusion models to synthesize music for therapy. We focus on insomnia and curate a dataset of waveform sleep music to generate audio tailored to sleep. To ensure real-world feasibility, we optimize our system for training and use on a single GPU, balancing quality and efficiency through extensive ablation studies. We show through subjective human evaluations that our generated music matches or outperforms existing baselines in both perceived quality and relevance to sleep therapy, while using only a fraction of the computational cost.</description>
        <pubDate>Tue, 27 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v303/hromadka26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v303/hromadka26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Investigating Timbre Representations in CLAP Across Modalities via Perturbations</title>
        <description>The transition from feature-based language-audio representations to higher-dimensional ones from pre-trained foundation models has enabled us to map audio content to a significantly broader vocabulary of natural language. However, some interpretability of the alignment between the embedding spaces of the two modalities and their relation to psychoacoustic features is lost as a byproduct. In this study, we investigate timbre representations in CLAP in both the text embedding space and audio embedding space. We identify directions for different timbral qualities in each embedding space and use them as perturbation vectors in uni-modal and cross-modal Text-to-Music (TTM) Retrieval and Generation downstream tasks. We find that although both audio and text embeddings move monotonically along their respective timbre directions, timbral variation is more linearly distributed and therefore more easily exploitable in the audio embedding space. Cross-modal perturbation experiments further reveal that the audio and text embedding spaces form a geometrically aligned subspace with respect to timbre. Additionally, our analysis identifies cases where CLAP’s timbre representations closely align with perceptually grounded spectral features, and cases where such alignment is limited.</description>
        <pubDate>Tue, 27 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v303/hebbar26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v303/hebbar26a.html</guid>
        
        
      </item>
    
      <item>
        <title>SingingSDS: A Singing-Capable Spoken Dialogue System for Conversational Roleplay Applications</title>
        <description>With recent advances in automatic speech recognition (ASR), large language models (LLMs), and text-to-speech (TTS) technologies, spoken dialogue systems (SDS) have become widely accessible. However, most existing SDS are limited to conventional spoken responses. We present SingingSDS, a cascaded SDS that responds through singing rather than speaking, fostering more affective, memorable, and pleasurable interactions in character-based roleplay and interactive entertainment scenarios. SingingSDS employs a modular ASR-LLM-SVS pipeline and supports a wide range of configurations across character personas, ASR and LLM backends, SVS models, melody sources, and voice profiles, tailored to different needs in terms of latency, quality, and musical style. SingingSDS is available as a plug-and-play web demo, featuring modular, open-source code that supports customization and extension.</description>
        <pubDate>Tue, 27 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v303/han26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v303/han26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Postscript on the Musics of Control</title>
        <description>This paper traces how “control” is socially coded in contemporary music technology, and offers an account of the circuit of control to bridge the conceptual gap between the sociopolitical and the technical. Formally, the paper advances its analysis through a triptych (Historical, Logic, Programming) that deliberately echoes Deleuze’s influential “Postscript” in title and cadence. The analysis redirects attention from the controllability of intelligent systems to the social formations through which control is allocated and recognized. To frame this shift, the paper theorizes the gear economy, a lens set alongside the widely used sociological concept of the gig economy, to explain how AI-music tools (e.g., Suno, Udio) are legitimated through existing music gear markets. The claim is coalition rather than addition: AI tools fold into pre-existing markets that price and credential controllability. On this basis, the paper calls for design-oriented ethical interventions that speak to technologists, producer-musicians, and the general public, making the circuit of control visible and reconfigurable, and redirecting control toward humans so that AI music tools augment rather than shrink creative agency.</description>
        <pubDate>Tue, 27 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v303/chen26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v303/chen26a.html</guid>
        
        
      </item>
    
      <item>
        <title>LLMs can read music, but struggle to hear it. An evaluation of core music perception tasks</title>
        <description>Multimodal Large Language Models (MLLMs) claim &quot;musical understanding,&quot; yet most evaluations conflate listening with score reading. We benchmark three SOTA LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, and Qwen2.5-Omni) across three core music skills: Syncopation Scoring (rhythm perception), Transposition Detection (melody perception), and Chord Quality Identification (harmony perception). Moreover, we separate three sources of variability: (i) perceptual limitations (by contrasting audio recordings vs. symbolic MIDI inputs), (ii) exposure to prior examples (zero- vs. few-shot manipulations), and (iii) reasoning strategies (Standalone, Chain of Thought, LogicLM). For the latter we adapt LogicLM, a framework combining LLMs with symbolic solvers to perform structured reasoning. In LogicLM, LLMs act as perceptual formulators, generating strict, machine-checkable schemas (onset grids, interval sequences) that deterministic solvers execute with self-refinement. Our results reveal a clear perceptual gap: models perform near ceiling on MIDI but show substantial accuracy drops on audio. Reasoning and few-shot prompting offer minimal gains. This is expected for MIDI, where performance reaches saturation, but more surprising for audio, where LogicLM, despite near-perfect MIDI accuracy, remains notably brittle. Among models, Gemini Pro achieves the highest performance across most conditions. Transposition yields the highest accuracies across models, while Chord Identification scores slightly below Syncopation. Overall, current systems reason well over symbols (MIDI) but do not yet &quot;listen&quot; reliably from audio, with reasoning strategies having little impact on accuracy. Our method and dataset make the perception-reasoning boundary explicit and offer actionable guidance for building robust audio music systems.</description>
        <pubDate>Tue, 27 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v303/carone26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v303/carone26a.html</guid>
        
        
      </item>
    
      <item>
        <title>Emerging AI Technologies for Music: Towards Controllable, Collaborative, and Creative Systems</title>
        <description></description>
        <pubDate>Tue, 27 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v303/bhandari26a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v303/bhandari26a.html</guid>
        
        
      </item>
    
  </channel>
</rss>
