MAR

Adapting the MAR baseline to random VAE settings
- Reformatting the data so that it fits our motion data
  - Motion data ⇒ $T\times263$, since SMPL has $j=22$ joints: $1+1+1+1+3(j-1)+3j+6(j-1)+4 = 4+63+66+126+4 = 263$
    
    $\dot{r}^{a} \isin \mathbb{R}$ : root angular velocity along the y-axis
    
    $\dot{r}^{x}\isin \mathbb{R}$ : root linear velocity along the x-axis
    
    $\dot{r}^{z}\isin \mathbb{R}$ : root linear velocity along the z-axis
    
    ${r}^{y}\isin \mathbb{R}$ : root height
    
    $\bold{j}^p \isin \mathbb{R}^{(j - 1)*3}$ : local joint position (in the root space)
    
    $\bold{j}^v\isin \mathbb{R}^{j*3}$ : local joint velocity (in the root space)
    
    $\bold{j}^r\isin \mathbb{R}^{(j-1)*6}$ : local joint rotation (in the root space)
    
    $\bold{c}^{f}\isin \mathbb{R}^{4}$ : foot contact
- Text-conditioned motion generation
  - Which text encoder will work best, and which embedding dimension is appropriate?
  - Use the text embedding directly, or pass it through MLPs first?
  - Let's assume a $T'\times d$ motion input and a $1\times d$ text input (or possibly the features before the final layer: $77\times d$)
- Sampling Order
  - Start with sequential (left-to-right) generation terminated by an <EOS> token
    - But how can we add an EOS token when the tokens are continuous rather than discrete?
- Which VAE?
  - A standard temporal VAE
    
    Input $T\times263 \rightarrow$ latent $(T/l)\times263$, i.e. temporal downsampling by a factor $l$
  - Can we extend a transformer VAE?
    
    → Need to read MLD
  - Why did MLD adopt that architecture in the first place?
Question: is motion more like an image or like text? Do we actually need to avoid VQ?
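A minimal sketch of the temporal VAE's $T\times263 \rightarrow (T/l)\times263$ shape change, assuming frames are grouped into non-overlapping windows of $l$ and the last window is padded by repeating the final frame. The averaging is a hypothetical, non-learned stand-in for a temporal encoder (it is not a VAE: no learned weights, no sampling); it only illustrates the length bookkeeping. `latent_length` and `downsample` are illustrative names, not part of MAR or MLD.

```python
import math


def latent_length(T, l):
    """Number of latent steps after downsampling T frames by factor l (tail padded)."""
    return math.ceil(T / l)


def downsample(motion, l):
    """Hypothetical stand-in for a temporal encoder: average each window of l frames.

    motion: list of T frames, each a list of D floats (D = 263 here).
    Returns ceil(T / l) frames of D floats. The tail is padded by repeating
    the final frame so that every window contains exactly l frames.
    """
    if l < 1:
        raise ValueError("l must be >= 1")
    if not motion:
        return []
    T = len(motion)
    pad = latent_length(T, l) * l - T
    padded = motion + [motion[-1]] * pad
    out = []
    for start in range(0, len(padded), l):
        window = padded[start:start + l]
        # zip(*window) iterates over feature columns; average each column.
        out.append([sum(col) / l for col in zip(*window)])
    return out


if __name__ == "__main__":
    # 10 frames of 263 features, each frame filled with its frame index.
    m = [[float(t)] * 263 for t in range(10)]
    z = downsample(m, 4)
    print(len(z), len(z[0]))
```

With $T=10$ and $l=4$ this yields 3 latent steps, the last one built from frames 8, 9, 9, 9. Swapping the averaging for a learned encoder changes the values but not the $\lceil T/l \rceil$ length arithmetic, provided the tail is padded the same way.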
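The 263-dim per-frame motion vector listed under the first bullet can be indexed with a small offset table. This is a minimal sketch assuming the components are concatenated in the order they are listed in these notes; the stored order in the dataset's actual preprocessing code may differ, so verify it before slicing real data. `LAYOUT`, `feature_slices`, and `split_frame` are hypothetical names.

```python
J = 22  # number of SMPL joints used in these notes

# (name, width) pairs. ASSUMPTION: order follows the list in these notes;
# check the dataset's preprocessing code for the real on-disk order.
LAYOUT = [
    ("root_ang_vel_y", 1),          # r_dot^a
    ("root_lin_vel_x", 1),          # r_dot^x
    ("root_lin_vel_z", 1),          # r_dot^z
    ("root_height_y", 1),           # r^y
    ("joint_pos", (J - 1) * 3),     # j^p, root-relative positions
    ("joint_vel", J * 3),           # j^v, root-relative velocities
    ("joint_rot_6d", (J - 1) * 6),  # j^r, 6D rotations
    ("foot_contact", 4),            # c^f
]

FEATURE_DIM = sum(width for _, width in LAYOUT)  # 263 for J = 22


def feature_slices(layout=LAYOUT):
    """Return {name: (start, end)} offsets into the per-frame feature vector."""
    slices, start = {}, 0
    for name, width in layout:
        slices[name] = (start, start + width)
        start += width
    return slices


def split_frame(frame):
    """Split one frame (a sequence of FEATURE_DIM floats) into named components."""
    if len(frame) != FEATURE_DIM:
        raise ValueError(f"expected {FEATURE_DIM} features, got {len(frame)}")
    return {name: list(frame[s:e]) for name, (s, e) in feature_slices().items()}


if __name__ == "__main__":
    parts = split_frame([0.0] * FEATURE_DIM)
    for name, values in parts.items():
        print(name, len(values))
```

Keeping the widths in one table makes the $j=22$ dependence explicit: changing `J` updates every offset and the total dimension together.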

If we push "just doing it without VQ" too hard as the main contribution, reviewers may attack it by asking what is actually different.