Neural Codec Language Model for Controllable Timbre Transfer in Music Synthesis

Sheldon Liu, Tianyu Liu, Deepak Dalakoti, Adithya Suresh, Yueying Teng, Xuefeng Liu, Atanu Roy, Randeep Bhatia, Daniel Hatadi, Prabhjeet Ghuman
Proceedings of Machine Learning Research, PMLR 303:1-13, 2026.

Abstract

Neural codec language models have revolutionized speech synthesis but face significant challenges when adapted to music generation, particularly in achieving precise timbre control while preserving melodic content. We introduce the Neural Codec Language Model for Controllable Timbre Transfer (NCLMCTT), a novel architecture that enables zero-shot instrument cloning through direct audio conditioning without explicit timbre learning. Our approach combines a 385M-parameter transformer for coarse musical structure modeling with a specialized upsampler for fine timbral detail, achieving flexible control through 1–5 second reference audio segments. We establish the first comprehensive benchmark dataset for controllable timbre transfer evaluation, comprising 62,500 high-fidelity samples across 50 synthesizer presets with ground truth targets. Extensive experiments demonstrate substantial improvements over the TokenSynth baseline: 27.1% reduction in SI-SDR, 50.9% in Mel Distance, and 59.4% in STFT Distance, while maintaining strong melodic coherence (Chroma Similarity: 0.85). Our method achieves robust zero-shot generalization, with performance on unseen instrument presets matching that of seen presets. Ablation studies confirm that extended reference audio duration (40.8% improvement), cross-attention mechanisms (11.9% improvement), and increased model capacity contribute meaningfully to overall performance. By separating melodic content from timbral characteristics and enabling implicit timbre control, NCLMCTT provides both immediate practical value for music creators and a methodological foundation for advancing controllable neural audio synthesis.

Cite this Paper
BibTeX
@InProceedings{pmlr-v303-liu26b,
  title     = {Neural Codec Language Model for Controllable Timbre Transfer in Music Synthesis},
  author    = {Liu, Sheldon and Liu, Tianyu and Dalakoti, Deepak and Suresh, Adithya and Teng, Yueying and Liu, Xuefeng and Roy, Atanu and Bhatia, Randeep and Hatadi, Daniel and Ghuman, Prabhjeet},
  booktitle = {Proceedings of Machine Learning Research},
  pages     = {1--13},
  year      = {2026},
  editor    = {Herremans, Dorien and Bhandari, Keshav and Roy, Abhinaba and Colton, Simon and Barthet, Mathieu},
  volume    = {303},
  series    = {Proceedings of Machine Learning Research},
  month     = {26 Jan},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v303/main/assets/liu26b/liu26b.pdf},
  url       = {https://proceedings.mlr.press/v303/liu26b.html},
  abstract  = {Neural codec language models have revolutionized speech synthesis but face significant challenges when adapted to music generation, particularly in achieving precise timbre control while preserving melodic content. We introduce the Neural Codec Language Model for Controllable Timbre Transfer (NCLMCTT), a novel architecture that enables zero-shot instrument cloning through direct audio conditioning without explicit timbre learning. Our approach combines a 385M-parameter transformer for coarse musical structure modeling with a specialized upsampler for fine timbral detail, achieving flexible control through 1--5 second reference audio segments. We establish the first comprehensive benchmark dataset for controllable timbre transfer evaluation, comprising 62,500 high-fidelity samples across 50 synthesizer presets with ground truth targets. Extensive experiments demonstrate substantial improvements over the TokenSynth baseline: 27.1% reduction in SI-SDR, 50.9% in Mel Distance, and 59.4% in STFT Distance, while maintaining strong melodic coherence (Chroma Similarity: 0.85). Our method achieves robust zero-shot generalization, with performance on unseen instrument presets matching that of seen presets. Ablation studies confirm that extended reference audio duration (40.8% improvement), cross-attention mechanisms (11.9% improvement), and increased model capacity contribute meaningfully to overall performance. By separating melodic content from timbral characteristics and enabling implicit timbre control, NCLMCTT provides both immediate practical value for music creators and a methodological foundation for advancing controllable neural audio synthesis.}
}
Endnote
%0 Conference Paper
%T Neural Codec Language Model for Controllable Timbre Transfer in Music Synthesis
%A Sheldon Liu
%A Tianyu Liu
%A Deepak Dalakoti
%A Adithya Suresh
%A Yueying Teng
%A Xuefeng Liu
%A Atanu Roy
%A Randeep Bhatia
%A Daniel Hatadi
%A Prabhjeet Ghuman
%B Proceedings of Machine Learning Research
%C Proceedings of Machine Learning Research
%D 2026
%E Dorien Herremans
%E Keshav Bhandari
%E Abhinaba Roy
%E Simon Colton
%E Mathieu Barthet
%F pmlr-v303-liu26b
%I PMLR
%P 1--13
%U https://proceedings.mlr.press/v303/liu26b.html
%V 303
%X Neural codec language models have revolutionized speech synthesis but face significant challenges when adapted to music generation, particularly in achieving precise timbre control while preserving melodic content. We introduce the Neural Codec Language Model for Controllable Timbre Transfer (NCLMCTT), a novel architecture that enables zero-shot instrument cloning through direct audio conditioning without explicit timbre learning. Our approach combines a 385M-parameter transformer for coarse musical structure modeling with a specialized upsampler for fine timbral detail, achieving flexible control through 1-5 second reference audio segments. We establish the first comprehensive benchmark dataset for controllable timbre transfer evaluation, comprising 62,500 high-fidelity samples across 50 synthesizer presets with ground truth targets. Extensive experiments demonstrate substantial improvements over the TokenSynth baseline: 27.1% reduction in SI-SDR, 50.9% in Mel Distance, and 59.4% in STFT Distance, while maintaining strong melodic coherence (Chroma Similarity: 0.85). Our method achieves robust zero-shot generalization, with performance on unseen instrument presets matching that of seen presets. Ablation studies confirm that extended reference audio duration (40.8% improvement), cross-attention mechanisms (11.9% improvement), and increased model capacity contribute meaningfully to overall performance. By separating melodic content from timbral characteristics and enabling implicit timbre control, NCLMCTT provides both immediate practical value for music creators and a methodological foundation for advancing controllable neural audio synthesis.
APA
Liu, S., Liu, T., Dalakoti, D., Suresh, A., Teng, Y., Liu, X., Roy, A., Bhatia, R., Hatadi, D. & Ghuman, P. (2026). Neural Codec Language Model for Controllable Timbre Transfer in Music Synthesis. Proceedings of Machine Learning Research, in Proceedings of Machine Learning Research 303:1-13. Available from https://proceedings.mlr.press/v303/liu26b.html.