Neural Codec Language Model for Controllable Timbre Transfer in Music Synthesis
Proceedings of Machine Learning Research, PMLR 303:1-13, 2026.
Abstract
Neural codec language models have revolutionized speech synthesis but face significant challenges when adapted to music generation, particularly in achieving precise timbre control while preserving melodic content. We introduce the Neural Codec Language Model for Controllable Timbre Transfer (NCLMCTT), a novel architecture that enables zero-shot instrument cloning through direct audio conditioning, without explicit timbre learning. Our approach combines a 385M-parameter transformer for coarse musical structure modeling with a specialized upsampler for fine timbral detail, achieving flexible control from reference audio segments of only 1-5 seconds. We establish the first comprehensive benchmark dataset for controllable timbre transfer evaluation, comprising 62,500 high-fidelity samples across 50 synthesizer presets with ground-truth targets. Extensive experiments demonstrate substantial improvements over the TokenSynth baseline: 27.1% in SI-SDR, 50.9% in Mel Distance, and 59.4% in STFT Distance, while maintaining strong melodic coherence (Chroma Similarity: 0.85). Our method achieves robust zero-shot generalization, with performance on unseen instrument presets matching that on seen presets. Ablation studies confirm that extended reference audio duration (40.8% improvement), cross-attention mechanisms (11.9% improvement), and increased model capacity each contribute meaningfully to overall performance. By separating melodic content from timbral characteristics and enabling implicit timbre control, NCLMCTT provides both immediate practical value for music creators and a methodological foundation for advancing controllable neural audio synthesis.
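To make the two-stage design concrete, the PyTorch sketch below pairs a causal coarse-token decoder with a small upsampler stage, both cross-attending to embedded frames from a short reference clip, mirroring the separation of melodic content (tokens) from timbral conditioning (reference audio). This is a minimal sketch under stated assumptions: the class name `NCLMCTTSketch`, the layer counts, and the dimensions are illustrative, and the paper's actual 385M-parameter model and its codec tokenizer are not reproduced here.

```python
import torch
import torch.nn as nn


class NCLMCTTSketch(nn.Module):
    """Illustrative two-stage sketch (not the paper's implementation):
    a causal transformer models coarse codec tokens while cross-attending
    to reference-audio frames; an upsampler stage then refines the result
    toward fine timbral detail, again conditioned on the reference."""

    def __init__(self, vocab_size=1024, d_model=512, n_heads=8,
                 n_coarse_layers=6, n_up_layers=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Decoder layers give us cross-attention over the timbre reference.
        self.coarse_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            n_coarse_layers,
        )
        # Lightweight stand-in for the fine-detail upsampler stage.
        self.upsampler = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            n_up_layers,
        )
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, coarse_tokens, ref_frames):
        # coarse_tokens: (B, T) codec token ids carrying melodic content.
        # ref_frames: (B, R, d_model) embedded frames from a 1-5 s reference.
        x = self.token_emb(coarse_tokens)
        t = x.size(1)
        # Causal mask so each position attends only to earlier tokens.
        causal = torch.triu(
            torch.full((t, t), float("-inf"), device=x.device), diagonal=1)
        x = self.coarse_decoder(x, ref_frames, tgt_mask=causal)
        x = self.upsampler(x, ref_frames)
        return self.head(x)  # (B, T, vocab_size) logits over codec tokens


# Usage with dummy data (shapes are assumptions for illustration).
model = NCLMCTTSketch()
tokens = torch.randint(0, 1024, (1, 128))  # melody as coarse codec tokens
ref = torch.randn(1, 250, 512)             # ~5 s of reference-audio frames
logits = model(tokens, ref)
print(logits.shape)                        # torch.Size([1, 128, 1024])
```

Conditioning through cross-attention, rather than a learned per-instrument embedding table, is what makes the zero-shot setting possible in this sketch: any unseen preset can be supplied at inference time simply as reference frames.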