Workflow Search Reinforcement Learning over Structured Decompositions

Guangyu Jiang, Shu Hong, Mahdi Imani, Nathaniel D. Bastian, Tian Lan
Proceedings of The 8th Annual Learning for Dynamics and Control Conference, PMLR 331:809-832, 2026.

Abstract

We study workflow search reinforcement learning (RL) for long-horizon tasks that can be decomposed into ordered, semantically interpretable subtasks. A workflow specifies an ordered set of milestones or procedural steps. Rather than learning a library of low-level skills and a meta-controller, we treat the set of feasible workflows as the high-level search domain and train a workflow-conditioned policy in an inner RL loop. We propose a Gaussian process upper confidence bound workflow search (GP-UCB-WS) method, which places a Gaussian process prior over the workflow-to-return map and uses the upper confidence bound rule to adaptively select promising workflows. For each selected workflow, a base RL algorithm optimizes the corresponding conditioned policy using a shaped reward. We derive regret bounds with respect to the optimal workflow and policy that decompose the overall error into (i) a Bayesian optimization error in workflow space and (ii) a policy-learning error for the workflow-conditioned inner loop. On compositional tasks, including an ordered-visit gridworld and the TTCP CAGE Challenge 2 cyber defense environment, GP-UCB-WS significantly accelerates learning and achieves returns higher than or comparable to those of flat proximal policy optimization (PPO), soft actor-critic (SAC), and hierarchical RL (HRL) baselines, particularly when the workflow representation captures latent low-dimensional structure of the learning problem.
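The outer loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the permutation encoding, the RBF kernel, the fixed `beta`, and the function `train_policy_and_evaluate` (a synthetic stand-in for the inner RL loop with shaped reward) are all assumptions made for illustration. It only shows the generic GP-UCB pattern of fitting a Gaussian process to observed workflow returns and picking the next workflow by posterior mean plus a scaled posterior standard deviation.

```python
import itertools
import numpy as np


def encode(workflow, n):
    """Assumed encoding: normalized position of each milestone in the ordering."""
    pos = np.empty(n)
    for rank, milestone in enumerate(workflow):
        pos[milestone] = rank
    return pos / (n - 1)


def rbf_kernel(A, B, length_scale=0.5):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)


def gp_posterior(X_obs, y_obs, X_all, noise=1e-2):
    """Zero-mean GP posterior mean and standard deviation at every row of X_all."""
    K = rbf_kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))
    K_s = rbf_kernel(X_all, X_obs)
    mean = K_s @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.einsum("ij,ji->i", K_s, np.linalg.solve(K, K_s.T))
    return mean, np.sqrt(np.clip(var, 0.0, None))


def train_policy_and_evaluate(workflow, rng):
    """Hypothetical stand-in for the inner RL loop.

    A real system would train a workflow-conditioned policy and report its
    return; here the score is the number of milestones in the 'right' slot
    plus observation noise.
    """
    target = tuple(range(len(workflow)))
    matches = sum(a == b for a, b in zip(workflow, target))
    return matches + 0.1 * rng.standard_normal()


def gp_ucb_workflow_search(n_milestones=4, rounds=15, beta=2.0, seed=0):
    rng = np.random.default_rng(seed)
    workflows = list(itertools.permutations(range(n_milestones)))
    X_all = np.array([encode(w, n_milestones) for w in workflows])

    chosen, returns = [], []
    idx = int(rng.integers(len(workflows)))  # first workflow picked at random
    for _ in range(rounds):
        chosen.append(idx)
        returns.append(train_policy_and_evaluate(workflows[idx], rng))

        # Fit the GP to centered returns, then score every candidate workflow.
        y = np.array(returns)
        mean, std = gp_posterior(X_all[chosen], y - y.mean(), X_all)
        ucb = mean + y.mean() + beta * std
        idx = int(np.argmax(ucb))  # UCB rule selects the next workflow

    best = int(np.argmax(returns))
    return workflows[chosen[best]], returns[best], chosen


if __name__ == "__main__":
    workflow, score, history = gp_ucb_workflow_search()
    print("best workflow found:", workflow, "observed return:", round(score, 3))
```

In this sketch each evaluation is a single noisy number; in the setting the abstract describes, each evaluation is itself an RL training run, which is why the regret analysis carries a separate policy-learning error term alongside the Bayesian optimization error.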

Cite this Paper


BibTeX
@InProceedings{pmlr-v331-jiang26a,
  title     = {Workflow Search Reinforcement Learning over Structured Decompositions},
  author    = {Jiang, Guangyu and Hong, Shu and Imani, Mahdi and Bastian, Nathaniel D. and Lan, Tian},
  booktitle = {Proceedings of The 8th Annual Learning for Dynamics and Control Conference},
  pages     = {809--832},
  year      = {2026},
  editor    = {Sukhatme, Gaurav and Lindemann, Lars and Tu, Stephen and Wierman, Adam and Atanasov, Nikolay},
  volume    = {331},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--19 Jun},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v331/main/assets/jiang26a/jiang26a.pdf},
  url       = {https://proceedings.mlr.press/v331/jiang26a.html},
  abstract  = {We study workflow search reinforcement learning (RL) for long-horizon tasks that can be decomposed into ordered, semantically interpretable subtasks. A workflow specifies an ordered set of milestones or procedural steps. Rather than learning a library of low-level skills and a meta-controller, we treat the set of feasible workflows as the high-level search domain. We then train a workflow-conditioned policy in an inner reinforcement learning loop. We propose a Gaussian process upper confidence bound workflow search (GP-UCB-WS) method. It places a Gaussian process prior over the workflow-to-return map and uses the upper confidence bound rule to adaptively select promising workflows. For each selected workflow, a base RL algorithm optimizes the corresponding conditioned policy using a shaped reward. We derive regret bounds that decompose the overall error into (i) Bayesian optimization error in workflow space and (ii) a policy-learning error for the workflow-conditioned inner loop, yielding provable regret bounds with respect to the optimal workflow and policy. In compositional tasks, including an ordered-visit gridworld and the TTCP CAGE Challenge 2 cyber defense environment, GP-UCB-WS significantly accelerates learning and achieves higher or comparable returns than flat proximal policy optimization (PPO), soft actor critic (SAC), and hierarchical RL (HRL) baselines, particularly when the workflow representation captures latent low-dimensional structure of the learning problems.}
}
Endnote
%0 Conference Paper
%T Workflow Search Reinforcement Learning over Structured Decompositions
%A Guangyu Jiang
%A Shu Hong
%A Mahdi Imani
%A Nathaniel D. Bastian
%A Tian Lan
%B Proceedings of The 8th Annual Learning for Dynamics and Control Conference
%C Proceedings of Machine Learning Research
%D 2026
%E Gaurav Sukhatme
%E Lars Lindemann
%E Stephen Tu
%E Adam Wierman
%E Nikolay Atanasov
%F pmlr-v331-jiang26a
%I PMLR
%P 809--832
%U https://proceedings.mlr.press/v331/jiang26a.html
%V 331
%X We study workflow search reinforcement learning (RL) for long-horizon tasks that can be decomposed into ordered, semantically interpretable subtasks. A workflow specifies an ordered set of milestones or procedural steps. Rather than learning a library of low-level skills and a meta-controller, we treat the set of feasible workflows as the high-level search domain. We then train a workflow-conditioned policy in an inner reinforcement learning loop. We propose a Gaussian process upper confidence bound workflow search (GP-UCB-WS) method. It places a Gaussian process prior over the workflow-to-return map and uses the upper confidence bound rule to adaptively select promising workflows. For each selected workflow, a base RL algorithm optimizes the corresponding conditioned policy using a shaped reward. We derive regret bounds that decompose the overall error into (i) Bayesian optimization error in workflow space and (ii) a policy-learning error for the workflow-conditioned inner loop, yielding provable regret bounds with respect to the optimal workflow and policy. In compositional tasks, including an ordered-visit gridworld and the TTCP CAGE Challenge 2 cyber defense environment, GP-UCB-WS significantly accelerates learning and achieves higher or comparable returns than flat proximal policy optimization (PPO), soft actor critic (SAC), and hierarchical RL (HRL) baselines, particularly when the workflow representation captures latent low-dimensional structure of the learning problems.
APA
Jiang, G., Hong, S., Imani, M., Bastian, N. D., & Lan, T. (2026). Workflow Search Reinforcement Learning over Structured Decompositions. Proceedings of The 8th Annual Learning for Dynamics and Control Conference, in Proceedings of Machine Learning Research 331:809-832. Available from https://proceedings.mlr.press/v331/jiang26a.html.

Related Material