In this paper, we investigate embodied multi-agent cooperation, where decentralized agents must cooperate given only egocentric views of the world. To plan effectively in this setting, unlike learning world dynamics in a single-agent scenario, we must simulate world dynamics conditioned on an arbitrary number of agents' actions, given only partial egocentric visual observations of the world. To address this partial observability, we first train generative models to estimate the overall world state from partial egocentric observations. To accurately simulate multiple sets of actions on this world state, we then propose to learn a compositional world model for multi-agent cooperation by factorizing the naturally composable joint actions of multiple agents and compositionally generating video conditioned on the world state. Leveraging this compositional world model, together with Vision Language Models that infer the actions of other agents, we use a tree search procedure to integrate these modules and enable online cooperative planning. We evaluate our methods on three challenging benchmarks with 2-4 agents. The results show that our compositional world model is effective and that the framework enables embodied agents to cooperate efficiently with different partners across various tasks and an arbitrary number of agents, demonstrating the promise of our approach. Our code and models will be open-sourced upon acceptance.
We propose COMBO, a novel Compositional wOrld Model-based emBOdied multi-agent planning framework, shown in the figure below. After receiving egocentric observations from the last execution step, COMBO first estimates the overall world state to plan on. COMBO then uses Vision Language Models as an Action Proposer to suggest possible actions, an Intent Tracker to infer other agents' intents, and an Outcome Evaluator to score the different possible outcomes. Combined with the compositional world model, which simulates the effect of joint actions on the world state, a tree search procedure integrates these planning sub-modules. COMBO thus enables embodied agents to imagine how different plans, carried out alongside other agents, will affect the world in the long run, and to plan cooperatively.
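The planning loop above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: every function below is a hypothetical stand-in for a learned module (the state estimator, the VLM-based Action Proposer / Intent Tracker / Outcome Evaluator, and the compositional world model), and all names, signatures, and the toy "pick objects" dynamics are assumptions for illustration only.

```python
# Toy stand-ins for COMBO's learned modules; all names/logic are hypothetical.

def estimate_world_state(ego_observations):
    """State estimator: fuse partial egocentric views into one world state."""
    return frozenset().union(*ego_observations)

def propose_actions(state, agent):
    """Action Proposer (a VLM in the paper): candidate actions for our agent."""
    return [f"pick_{obj}" for obj in sorted(state)][:2]

def infer_intents(state, other_agents):
    """Intent Tracker (a VLM in the paper): one predicted action per other agent."""
    return {a: f"pick_{sorted(state)[0]}" for a in other_agents}

def simulate(state, joint_action):
    """Compositional world model: apply each agent's action and compose the
    results (here: the picked objects simply disappear from the scene)."""
    picked = {act.removeprefix("pick_") for act in joint_action.values()}
    return frozenset(state - picked)

def evaluate(state):
    """Outcome Evaluator: fewer objects left = closer to task completion."""
    return -len(state)

def plan(ego_observations, me, others, depth=2):
    """Tree search over our own actions; other agents' actions are filled in
    by the Intent Tracker, and joint outcomes are rolled out by the model."""
    state = estimate_world_state(ego_observations)

    def search(state, d):
        if d == 0 or not state:
            return evaluate(state), None
        best_value, best_action = float("-inf"), None
        for my_action in propose_actions(state, me):
            joint = {me: my_action, **infer_intents(state, others)}
            value, _ = search(simulate(state, joint), d - 1)
            if value > best_value:
                best_value, best_action = value, my_action
        return best_value, best_action

    return search(state, depth)[1]
```

Usage: `plan([{"apple", "bowl"}, {"bowl", "knife"}], me="alice", others=["bob"])` fuses the two partial views, rolls out joint actions to depth 2, and returns the first action of the highest-value branch. In the actual framework, each of these stubs is replaced by a trained generative or vision-language model operating on images rather than symbolic sets.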
Here are video demos showing our COMBO agent cooperating with an arbitrary number of agents to complete the TDW-Cook and TDW-Game tasks.
(a) TDW-Cook
(b) TDW-Game
(c) TDW-Game with 3 agents
*All videos are generated by the Compositional World Model.