Motion Generation

Recent advancements in human motion generation have propelled the field forward, drawing considerable attention to tasks that generate natural, contextually appropriate motion from various control signals. For music-driven dance generation, the objective is to create realistic dance movements that correspond to input music; Bailando addresses this task by quantizing dance movements into a choreographic memory and generating token sequences with a music-conditioned actor-critic GPT. Similarly, co-speech motion generation aims to produce gestures that align seamlessly with a speaker's audio input; EMAGE addresses this task through masked audio-gesture modeling, jointly generating facial, body, and hand motions. In the field of multi-task, multimodal motion generation, the goal is to generate natural motion from multiple control signals; UniMuMo addresses this by unifying text, music, and motion within a single generation framework. Among these tasks, text-to-motion generation has garnered the most attention; its primary objective is to produce natural, coherent motion that accurately corresponds to a given textual description.

Text-to-Motion Generation in Continuous Space

Early approaches to text-to-motion generation primarily mapped textual descriptions to motions within a continuous latent space, using models such as Variational Autoencoders (VAEs). Following the success of denoising diffusion models, recent methods instead gradually refine generated motion in a continuous latent space. Learning in a continuous latent space offers notable advantages: the data can be learned directly without substantial information loss, and the representation captures the smooth transitions and subtle variations present in motion data, allowing complex, high-dimensional motion patterns to be modeled expressively. However, this approach also has drawbacks. A significant limitation is the need for large-scale, high-quality datasets to train models effectively in continuous spaces. In the motion domain, such datasets are scarce, so models may fail to capture the full complexity and diversity of human movement and may struggle to learn the intricate details required for high-quality motion generation.
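To make the continuous-space formulation concrete, the following is a minimal PyTorch sketch of one denoising-diffusion training step on continuous motion latents. The network architecture, feature dimensions, and noise schedule are illustrative assumptions of this sketch, not the design of any particular method.

```python
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    """Predicts the noise added to a continuous motion latent, conditioned on text."""
    def __init__(self, latent_dim=256, text_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + text_dim + 1, 512),
            nn.SiLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, z_t, t, text_emb):
        # The (normalized) timestep is appended as one extra scalar feature.
        return self.net(torch.cat([z_t, text_emb, t[:, None]], dim=-1))

def diffusion_training_step(denoiser, z0, text_emb, alphas_cumprod):
    """One DDPM-style step: noise a clean latent z0, then predict that noise."""
    b, T = z0.shape[0], alphas_cumprod.shape[0]
    t = torch.randint(0, T, (b,))
    a_bar = alphas_cumprod[t][:, None]                   # cumulative schedule
    eps = torch.randn_like(z0)
    z_t = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * eps   # forward noising
    eps_hat = denoiser(z_t, t.float() / T, text_emb)
    return ((eps_hat - eps) ** 2).mean()                 # epsilon-prediction loss

# Usage with an assumed linear beta schedule:
T = 1000
alphas_cumprod = torch.cumprod(1 - torch.linspace(1e-4, 0.02, T), dim=0)
loss = diffusion_training_step(Denoiser(), torch.randn(8, 256),
                               torch.randn(8, 512), alphas_cumprod)
```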

Text-to-Motion Generation in Discrete Space

On the other hand, some methods reformulate generation as a discrete classification problem and achieve notable performance. These approaches typically employ Vector Quantized Variational Autoencoders (VQ-VAEs) to convert motion into tokens, which are then used to generate motion sequences through either autoregressive modeling or generative masked modeling. By casting generation as classification, these methods avoid the large-scale datasets typically required for training in continuous latent spaces. However, they suffer from approximation errors inherent in the quantization process, which can compromise the fidelity of the generated motion. While numerous strategies have been proposed to mitigate the limitations of discrete representations, these efforts remain confined to discrete space, overlooking the inherently continuous nature of the motion modality and thus failing to align their improvements closely with human perception.
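A minimal sketch of the quantization step at the heart of these methods makes the source of this approximation error explicit. The codebook size and feature dimension below are assumptions chosen for illustration.

```python
import torch

def quantize(features, codebook):
    """Map each continuous frame feature to its nearest codebook entry.

    features: (T, D) continuous encoder outputs for T motion frames.
    codebook: (K, D) learned code vectors.
    """
    d = torch.cdist(features, codebook)   # (T, K) pairwise distances
    tokens = d.argmin(dim=-1)             # discrete motion tokens, (T,)
    quantized = codebook[tokens]          # (T, D) quantized features
    # The residual below is exactly the approximation error discussed above:
    # it is discarded by quantization and cannot be recovered by the decoder.
    residual = features - quantized
    return tokens, quantized, residual

codebook = torch.randn(512, 256)          # K=512 codes of dim 256 (assumed)
features = torch.randn(60, 256)           # 60 frames of encoder output
tokens, quantized, residual = quantize(features, codebook)
print(tokens.shape, residual.norm(dim=-1).mean())  # mean quantization error
```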

Rectified Flow

Rectified Flow is a generative model developed to address the transport mapping problem by employing a flow matching algorithm. Flow matching focuses on constructing a transport map, denoted $T: \mathbb{R}^d \rightarrow \mathbb{R}^d$, that transforms a sample $X_0 \sim \pi_0$ from an initial distribution $\pi_0$ on $\mathbb{R}^d$ into a target distribution $\pi_1$ on $\mathbb{R}^d$ such that $T(X_0) \sim \pi_1$. This transport map is formalized via the ordinary differential equation (ODE):

$dX_t = v(X_t, t) \, dt \quad (1)$

In this context, $v$ is typically referred to as a vector field, while $X_t$ represents the forward process parameterized over $t \in [0,1]$. Rectified Flow defines the forward process as a straight path, $X_t = t X_1 + (1 - t) X_0$, with the corresponding vector field $v = X_1 - X_0$. Since the destination $X_1$ is unknown during the generation phase, a causal approximation of $v$, denoted $v_\theta$, is learned by optimizing the following loss function:

$\min_\theta \int_0^1 \mathbb{E}\left[ \left\| (X_1 - X_0) - v_\theta(X_t, t) \right\|^2 \right] dt \quad (2)$
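As a concrete illustration, the sketch below implements objective (2) in PyTorch. Only the loss follows directly from the definitions of $X_t$ and $v$ above; the velocity network is a placeholder MLP of our own choosing, not an architecture prescribed by Rectified Flow.

```python
import torch
import torch.nn as nn

# Placeholder velocity network v_theta(x_t, t); 256-d features assumed.
v_theta = nn.Sequential(nn.Linear(256 + 1, 512), nn.SiLU(), nn.Linear(512, 256))

def rectified_flow_loss(x0, x1):
    """E[ ||(X1 - X0) - v_theta(X_t, t)||^2 ] with X_t = t*X1 + (1-t)*X0."""
    t = torch.rand(x0.shape[0], 1)            # t ~ U[0, 1], one per sample
    x_t = t * x1 + (1 - t) * x0               # point on the straight path
    target = x1 - x0                          # ground-truth velocity
    pred = v_theta(torch.cat([x_t, t], dim=-1))
    return ((pred - target) ** 2).mean()

x0 = torch.randn(32, 256)                     # samples from pi_0 (e.g., noise)
x1 = torch.randn(32, 256)                     # paired samples from pi_1 (data)
loss = rectified_flow_loss(x0, x1)
loss.backward()
```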

Once the vector field $v_\theta$ has been trained, samples from $\pi_1$ can be generated by solving equation (1) with an ODE solver. Because the trajectory connecting the two distributions is almost straight, Rectified Flow can transport samples from one distribution to the other in significantly fewer steps than typical diffusion models, making it an efficient approach whenever distributional alignment or transformation is required.
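Under the same assumptions as the sketch above (the placeholder `v_theta`), sampling reduces to a few fixed-step Euler updates of equation (1):

```python
import torch

@torch.no_grad()
def sample(v_theta, x0, num_steps=4):
    """Transport x0 ~ pi_0 toward pi_1 by integrating dX_t = v_theta(X_t, t) dt."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0], 1), i * dt)
        x = x + v_theta(torch.cat([x, t], dim=-1)) * dt   # Euler update
    return x                                              # approximate sample from pi_1

x1_hat = sample(v_theta, torch.randn(32, 256))  # v_theta from the sketch above
```

The small `num_steps` reflects the near-straight trajectories: the straighter the learned path, the more accurate each Euler step, with a perfectly straight path requiring only one.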

Then, we apply a diffusion process to recover and add the detail discarded by the autoencoder, which also overcomes the limitations of the decoder; if the autoencoder were optimal, this refinement stage would not need to be trained.