Nils Demerlé, PhD candidate within the EDITE doctoral school (ED 130), conducted his doctoral research, entitled “Separating Explicit and Implicit Controls for Expressive Real-Time Neural Synthesis”, as part of the Analysis–Synthesis team at the STMS Laboratory (IRCAM, CNRS, Sorbonne Université, Ministry of Culture), under the supervision of Philippe Esling and the co-supervision of Guillaume Doras.
Jury composition:
Abstract:
Recent advances in machine learning have profoundly transformed our relationship with sound and musical creation. Deep generative models are emerging as powerful tools that can support and extend creative practices, yet their adoption by artists remains limited by the question of control. Current approaches rely either on explicit parameters (notes, instruments, textual descriptions) or on abstract representation spaces that enable the exploration of subjective concepts such as timbre and style but are harder to integrate into musical workflows.

This thesis aims to reconcile these two paradigms of explicit and implicit control to design expressive audio synthesis tools that integrate seamlessly into music production environments. We begin with a systematic study of neural audio codecs, the building blocks of most modern generative models, identifying the design choices that influence both audio quality and controllability. We then explore methods to jointly learn explicit and implicit control spaces, first in a supervised setting and later through AFTER, a framework designed for the unsupervised case. AFTER enables realistic and continuous timbre transfer across a wide range of instruments while preserving control over pitch and rhythm.

Finally, we adapt these models for real-time use through lightweight, streamable diffusion architectures and develop an intuitive interface integrated into digital audio workstations. The thesis concludes with several artistic collaborations, demonstrating the creative potential and practical impact of these generative approaches.
October 31, 2025