Q.

I am studying about human motion synthesis and human-scene interaction.

I have a novel research question.

  1. A human-scene interaction diffusion model is given (for example, LINGO at SIGGRAPH Asia 2024)
  2. A text-to-human-motion diffusion model is given (for example, Human Motion Diffusion Model or Motion Latent Diffusion Model)

Since existing datasets for human-scene interaction (HSI) lack diversity in human motion (the motions are very simple, to be specific), there is room to generate more diverse actions while keeping the scene-aware interaction capability.

Here is the question: how can we leverage the second model's fluent knowledge of human motion to produce diverse, scene-aware motion by integrating these two distinct models?

Let's brainstorm!


A.

Below is an informal brainstorm around how one might integrate a scene-aware diffusion model (which focuses on human-scene interaction) with a text-to-motion diffusion model (which focuses on diverse and coherent human motions). The ultimate goal is to leverage both:

  1. The scene model’s knowledge of geometry, contacts, collisions, object usage, etc.
  2. The text model’s ability to produce a wide array of actions and styles.

This can yield more diverse, contextually relevant, and physically plausible human-scene interactions.


1. Two-Stage Generation Pipelines

  1. Generate motion from text
  2. Refine with scene constraints
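The two stages above can be sketched minimally in code. This is a toy illustration, not an implementation of LINGO or MDM: `text_to_motion` stands in for stage 1 (sampling from a text-to-motion diffusion model), and `scene_refine` stands in for stage 2, here simplified to a geometric projection that pushes penetrating joints out of the scene using a signed distance field (SDF). All function names, the motion representation `(frames, joints, 3)`, and the SDF interface are hypothetical assumptions for this sketch; a real stage 2 would instead re-noise the motion and denoise it with scene conditioning or guidance.

```python
import numpy as np

# Hypothetical stage-1 stub: in practice, sample from a pretrained
# text-to-motion diffusion model (e.g. MDM). Motions are arrays of
# shape (num_frames, num_joints, 3).
def text_to_motion(prompt, num_frames=60, num_joints=22, seed=0):
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_frames, num_joints, 3))

# Hypothetical stage-2 stub: enforce scene constraints by projecting
# penetrating joints to the scene surface along the SDF gradient.
# (A diffusion-based refiner would replace this projection.)
def scene_refine(motion, scene_sdf, iters=5):
    refined = motion.copy()
    for _ in range(iters):
        d, grad = scene_sdf(refined)      # signed distance + outward normals
        penetrating = d < 0.0             # joints inside scene geometry
        # Newton-style step: move each penetrating joint back to the surface
        refined[penetrating] -= d[penetrating][:, None] * grad[penetrating]
    return refined

# Toy scene: a ground plane at z = 0 (sdf(p) = p_z, normal = +z).
def ground_sdf(points):
    d = points[..., 2]
    grad = np.zeros_like(points)
    grad[..., 2] = 1.0
    return d, grad

motion = text_to_motion("a person sits on a chair")   # stage 1: diverse motion
refined = scene_refine(motion, ground_sdf)            # stage 2: scene-aware fix
assert (refined[..., 2] >= -1e-6).all()               # no joint below the ground
```

The point of the sketch is the division of labor: stage 1 is free to be as diverse as the text model allows, while stage 2 only enforces scene constraints, so the refinement should perturb the motion as little as possible to preserve that diversity.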