Interactive Character Control with Auto-Regressive Motion Diffusion Models

Simon Fraser University, Shanghai AI Lab, Xmov, NVIDIA

BibTeX

@article{shi2024amdm,
      author = {Shi, Yi and Wang, Jingbo and Jiang, Xuekun and Lin, Bingkun and Dai, Bo and Peng, Xue Bin},
      title = {Interactive Character Control with Auto-Regressive Motion Diffusion Models},
      year = {2024},
      issue_date = {July 2024},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      volume = {43},
      number = {4},
      issn = {0730-0301},
      journal = {ACM Trans. Graph.},
      month = {jul},
      articleno = {143},
      numpages = {14},
      keywords = {motion synthesis, diffusion model, reinforcement learning}
    }

Abstract

Real-time character control is an essential component for interactive experiences, with a broad range of applications, including physics simulations, video games, and virtual reality. The success of diffusion models for image synthesis has led to the use of these models for motion synthesis. However, the majority of these motion diffusion models are primarily designed for offline applications, where space-time models are used to synthesize an entire sequence of frames simultaneously with a pre-specified length. To enable real-time motion synthesis with diffusion models that allow time-varying controls, we propose A-MDM (Auto-regressive Motion Diffusion Model). Our conditional diffusion model takes an initial pose as input, and auto-regressively generates successive motion frames conditioned on the previous frame. Despite its streamlined network architecture, which uses simple MLPs, our framework is capable of generating diverse, long-horizon, and high-fidelity motion sequences. Furthermore, we introduce a suite of techniques for incorporating interactive controls into A-MDM, such as task-oriented sampling, in-painting, and hierarchical reinforcement learning. These techniques enable a pre-trained A-MDM to be efficiently adapted for a variety of new downstream tasks. We conduct a comprehensive suite of experiments to demonstrate the effectiveness of A-MDM, and compare its performance against state-of-the-art auto-regressive methods.

Methods

Base A-MDM Model

Unlike common approaches that rely on transformer architectures and a large number of diffusion steps, which are often constrained by limited context windows and slow inference, we demonstrate that an MLP-based diffusion model with fewer denoising steps is sufficient for synthesizing high-quality motions. Our model is auto-regressive: each frame is generated conditioned on the previous frame. Each block in the architecture consists of a fully connected layer, group normalization, and a SiLU activation. The input features include a time embedding, the clean previous motion frame, and the noisy target motion frame from the prior denoising step; the model then predicts the noise in the target frame. Each motion frame includes root linear and angular velocities. We adopt an overparameterized design for the local joint representation, concatenating joint positions, joint rotations, and joint linear velocities. When computing the output pose, only the predicted joint rotations are used, avoiding postprocessing steps such as inverse kinematics that introduce extra errors.
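The sketch below illustrates this architecture in PyTorch. It is a minimal, assumed implementation (module names, hidden sizes, and the time-embedding scheme are our own placeholders, not the released code): an MLP stack of Linear + GroupNorm + SiLU blocks that takes the clean previous frame, the noisy target frame, and a diffusion-step embedding, and predicts the noise in the target frame.

```python
# Minimal A-MDM-style denoiser sketch (assumed layer sizes and names).
import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    """Fully connected layer + group normalization + SiLU, as described in the text."""
    def __init__(self, in_dim, out_dim, groups=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GroupNorm(groups, out_dim),
            nn.SiLU(),
        )

    def forward(self, x):
        return self.net(x)

class AMDMDenoiser(nn.Module):
    def __init__(self, frame_dim, time_dim=64, hidden=1024, n_blocks=4):
        super().__init__()
        # Simple learned embedding of the (normalized) diffusion step.
        self.time_embed = nn.Sequential(
            nn.Linear(1, time_dim), nn.SiLU(), nn.Linear(time_dim, time_dim)
        )
        in_dim = 2 * frame_dim + time_dim  # prev frame + noisy target frame + time embedding
        dims = [in_dim] + [hidden] * n_blocks
        self.blocks = nn.ModuleList(
            [MLPBlock(dims[i], dims[i + 1]) for i in range(n_blocks)]
        )
        self.out = nn.Linear(hidden, frame_dim)  # predicted noise on the target frame

    def forward(self, prev_frame, noisy_frame, t):
        # prev_frame, noisy_frame: (batch, frame_dim); t: (batch, 1) float in [0, 1]
        h = torch.cat([prev_frame, noisy_frame, self.time_embed(t)], dim=-1)
        for block in self.blocks:
            h = block(h)
        return self.out(h)
```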

Character Control via Conditional Inpainting

We show that A-MDM is also amenable to diffusion-based inpainting. This approach enables A-MDM to be applied to a wide range of tasks without further fine-tuning. Users can modify elements within each motion frame, such as assigning root linear and angular velocities to make the character follow a specified trajectory. The interactive denoising process then generates a plausible next motion frame that is consistent with both the user specifications and the previous frame.
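A minimal sketch of this inpainting loop is given below, assuming the denoiser interface from the previous sketch and a simplified DDIM-style sampler; the noise schedule, helper names, and the exact way constraints are injected are assumptions rather than the authors' implementation. Channels selected by `mask` (e.g. root linear and angular velocities) are overwritten with a suitably noised copy of the user-specified target at every denoising step.

```python
# Inpainting-based control during A-MDM sampling (simplified, assumed sampler).
import torch

@torch.no_grad()
def sample_next_frame_inpaint(denoiser, prev_frame, target, mask,
                              alphas_cumprod, n_steps):
    # target: user-specified frame values (e.g. desired root velocities)
    # mask:   bool tensor marking which channels are constrained
    x = torch.randn_like(prev_frame)                       # start from pure noise
    for k in reversed(range(n_steps)):
        a_bar = alphas_cumprod[k]
        t = torch.full((x.shape[0], 1), k / n_steps, device=x.device)
        # Replace constrained channels with a noised copy of the target values.
        noised_target = a_bar.sqrt() * target + (1 - a_bar).sqrt() * torch.randn_like(target)
        x = torch.where(mask, noised_target, x)
        eps = denoiser(prev_frame, x, t)                    # predict noise in target frame
        x0 = (x - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()  # estimate of the clean frame
        if k > 0:
            a_bar_prev = alphas_cumprod[k - 1]
            x = a_bar_prev.sqrt() * x0 + (1 - a_bar_prev).sqrt() * eps  # DDIM-style update
        else:
            x = x0
    return torch.where(mask, target, x)                     # enforce constraints exactly at the end
```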

Character Control via Training Hierarchical Controller

The predicted actions of the controller policy determine how the denoised output is modified at each denoising step; the controller essentially steers the denoising process toward a next motion frame that yields the highest expected future reward. Each action can be dilated to span several denoising steps. We refrain from applying control during the final few denoising steps to preserve the naturalness of the resulting motions. A notable property of our hierarchical controller is that it naturally preserves the stochasticity of the diffusion model.
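The sketch below shows one way such controller-steered sampling could look, under the same assumed denoiser and sampler as above; the `policy` interface, the additive way the action perturbs the denoised estimate, and the `uncontrolled_tail` parameter are illustrative assumptions, not the paper's exact formulation. A single action from the high-level policy is reused (dilated) across the controlled denoising steps, and the last few steps are left unmodified.

```python
# Steering A-MDM denoising with a high-level RL policy (simplified, assumed interfaces).
import torch

@torch.no_grad()
def sample_next_frame_controlled(denoiser, policy, prev_frame, task_obs,
                                 alphas_cumprod, n_steps, uncontrolled_tail=2):
    x = torch.randn_like(prev_frame)
    action = policy(task_obs, prev_frame)      # one action, dilated across several denoising steps
    for k in reversed(range(n_steps)):
        a_bar = alphas_cumprod[k]
        t = torch.full((x.shape[0], 1), k / n_steps, device=x.device)
        eps = denoiser(prev_frame, x, t)
        x0 = (x - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
        if k >= uncontrolled_tail:             # skip control during the final few steps
            x0 = x0 + action                   # shift the denoised estimate toward higher reward
        if k > 0:
            a_bar_prev = alphas_cumprod[k - 1]
            x = a_bar_prev.sqrt() * x0 + (1 - a_bar_prev).sqrt() * eps
        else:
            x = x0
    return x
```

Because control is applied only to intermediate denoising steps and fresh noise still drives the sampler, this scheme keeps the diffusion model's stochasticity while biasing samples toward task-relevant motions.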

Video Clips