Q.
I am studying about human motion synthesis and human-scene interaction.
I have a novel research question.
- A human-scene interaction diffusion model is given (for example, LINGO at SIGGRAPH Asia 2024)
- A text-to-human-motion diffusion model is given (for example, Human Motion Diffusion Model or Motion Latent Diffusion Model)
Since existing datasets for human-scene interaction (HSI) lack diversity of human motion (very simple motions, to be specific), there is room to generate more diverse actions while keeping scene-aware interaction capability.
So, how can we leverage the second model's fluent knowledge of human motion and produce diverse scene-aware motion by integrating the two distinct models?
Let's brainstorm!
A.
Below is an informal brainstorm on how one might integrate a scene-aware diffusion model (which focuses on human-scene interaction) with a text-to-motion diffusion model (which focuses on diverse and coherent human motions). The goal is to leverage both:
- The scene model’s knowledge of geometry, contacts, collisions, object usage, etc.
- The text model’s ability to produce a wide array of actions and styles.
This can yield more diverse, contextually relevant, and physically plausible human-scene interactions.
1. Two-Stage Generation Pipelines
- Generate motion from text
- Use a text-driven diffusion model (e.g., Motion Latent Diffusion, MLD) to produce an initial motion sequence.
- This motion can be generic (unaware of the scene yet) but is semantically relevant to the text prompt (e.g., “a person running,” “a person picking something up,” etc.).
- Refine with scene constraints
- Feed that motion into the human-scene interaction (HSI) diffusion model.
- The HSI model injects scene geometry constraints: contact points, collision avoidance, ground alignment, object interactions, etc.
- Optionally, a second diffusion process in the HSI model adjusts local poses/foot placements/limbs so that the motion is physically plausible with respect to the scene.
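The two-stage idea above can be sketched in code. This is a toy, runnable illustration, not either model's real API: `text_to_motion` and `scene_refine` are hypothetical stand-ins for the text-driven model (e.g., MLD) and the HSI model, and the "scene constraint" is reduced to a single floor-plane projection so the control flow is concrete. The partial re-noising in stage 2 follows the common SDEdit-style refinement trick, under the assumption that the HSI model can denoise from an intermediate timestep.

```python
import numpy as np

def text_to_motion(prompt: str, n_frames: int = 60, n_joints: int = 22, seed: int = 0) -> np.ndarray:
    """Stage 1 (hypothetical stand-in for a text-to-motion diffusion model).

    Returns a generic, scene-agnostic motion as a (frames, joints, xyz) array.
    A real model would condition on `prompt`; this stub just samples noise.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_frames, n_joints, 3))

def scene_refine(motion: np.ndarray, floor_height: float = 0.0, t_noise: float = 0.3) -> np.ndarray:
    """Stage 2 (hypothetical stand-in for the HSI diffusion model).

    SDEdit-style refinement: partially re-noise the stage-1 motion, then
    'denoise' it under scene constraints. Here the constraint is a toy one:
    project every joint's height (z) above the floor plane.
    """
    rng = np.random.default_rng(1)
    noisy = motion + t_noise * rng.standard_normal(motion.shape)
    refined = noisy.copy()
    refined[..., 2] = np.maximum(refined[..., 2], floor_height)  # collision with floor removed
    return refined

# Pipeline: text -> generic motion -> scene-constrained motion.
motion = text_to_motion("a person picking something up")
refined = scene_refine(motion, floor_height=0.0)
```

In a real integration, `scene_refine` would replace the projection step with the HSI model's reverse diffusion conditioned on scene geometry (contacts, object meshes), and `t_noise` would correspond to the diffusion timestep at which refinement begins: a small timestep preserves the text-driven semantics, a large one gives the scene model more freedom.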