1. Transformer Architecture

1-1. Compressive Transformer

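An approach that compresses old memories and keeps them available instead of discarding them.
Expected to give a further performance gain over Transformer-XL.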
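
A minimal sketch of the idea, not the paper's implementation: states evicted from a fixed-size memory are mean-pooled and stored in a second, compressed memory. Mean pooling is only one possible compression function, and the names `mean_pool_compress`, `update_memories`, `mem_len`, `cmem_len`, and `rate` are hypothetical, chosen for illustration. Hidden states are plain Python lists here to keep the example dependency-free.

```python
def mean_pool_compress(old_states, rate):
    """Average every `rate` consecutive state vectors into one vector.

    Leftover states that do not fill a whole group are dropped
    (a simplification made for this sketch).
    """
    compressed = []
    for start in range(0, len(old_states) - rate + 1, rate):
        group = old_states[start:start + rate]
        dim = len(group[0])
        compressed.append([sum(vec[d] for vec in group) / rate for d in range(dim)])
    return compressed


def update_memories(memory, compressed_memory, new_states, mem_len, cmem_len, rate):
    """FIFO update: states pushed out of `memory` are compressed, not discarded.

    Assumes mem_len > 0 and cmem_len > 0.
    """
    memory = memory + new_states
    overflow = max(0, len(memory) - mem_len)
    evicted, memory = memory[:overflow], memory[overflow:]
    compressed_memory = compressed_memory + mean_pool_compress(evicted, rate)
    # Keep only the most recent `cmem_len` compressed slots.
    compressed_memory = compressed_memory[-cmem_len:]
    return memory, compressed_memory
```

With `rate=2`, two evicted states become one compressed slot, so the same number of slots covers twice as much history as an uncompressed memory would.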

1-2. Dynamic Evaluation

During validation, the loss on each segment is used to update the model.
Expected to improve performance as a drop-in replacement.
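
The loop can be sketched as "score a segment, then take a gradient step on that segment's loss before moving to the next one". The example below uses a toy unigram softmax model and plain SGD purely as stand-ins, so everything stays in the standard library; the function names and the learning rate `lr` are assumptions for illustration, not the setup used in any particular experiment.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def segment_loss_and_grad(logits, segment):
    """Mean negative log-likelihood of the token ids in `segment`, and its gradient."""
    probs = softmax(logits)
    loss = -sum(math.log(probs[t]) for t in segment) / len(segment)
    freq = [0.0] * len(logits)
    for t in segment:
        freq[t] += 1.0 / len(segment)
    # For a softmax model, d(loss)/d(logits) = predicted probs - empirical frequencies.
    grad = [p - f for p, f in zip(probs, freq)]
    return loss, grad


def dynamic_eval(logits, segments, lr):
    """Evaluate segment by segment, adapting the parameters after each one."""
    logits = list(logits)
    losses = []
    for segment in segments:
        loss, grad = segment_loss_and_grad(logits, segment)
        losses.append(loss)  # the segment is scored BEFORE the update
        logits = [w - lr * g for w, g in zip(logits, grad)]  # then the model adapts
    return losses, logits
```

Because each segment is scored before the model is updated on it, the reported loss never benefits from having already seen that segment.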

1-3. Block Recurrent Transformer

1) T5 positional encoding (in progress)

<aside> 💡 Instead, we add a T5-style relative position bias to the self-attention matrix in the vertical direction. (Although similar, T5 relative positions differ slightly from the relative positions used in the Transformer-XL paper.)

</aside>

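The positional encoding previously used in Transformer-XL is based on sin and cos.

T5-style positional encoding does not use sin or cos; it encodes the relative position directly.

(For example, plain scalar values such as [0, 1, 2, 3].)

These scalar values are then passed through an embedding, which expands or reduces their dimensionality.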
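
A rough sketch of how such a bias can be built: compute the offset between each query and key position, map the offset to a bucket index, and look up one learned scalar per attention head for that bucket. The bucketing below simply clips the offset, which is a deliberate simplification; T5 itself uses log-spaced buckets for larger distances. `bias_table`, `max_distance`, and the function names are hypothetical.

```python
def relative_positions(q_len, k_len):
    """rel[i][j] = j - i, the offset of key position j from query position i."""
    return [[j - i for j in range(k_len)] for i in range(q_len)]


def clipped_bucket(rel, max_distance):
    """Map a signed offset to a bucket index in [0, 2 * max_distance].

    Simplification for this sketch: offsets are clipped, not log-bucketed.
    """
    return max(-max_distance, min(max_distance, rel)) + max_distance


def relative_position_bias(q_len, k_len, bias_table, max_distance):
    """Return bias[head][i][j], to be added to the attention logits.

    `bias_table[bucket][head]` plays the role of the embedding: each bucket
    index is mapped to one scalar per attention head.
    """
    num_heads = len(bias_table[0])
    rel = relative_positions(q_len, k_len)
    return [
        [
            [bias_table[clipped_bucket(rel[i][j], max_distance)][h] for j in range(k_len)]
            for i in range(q_len)
        ]
        for h in range(num_heads)
    ]
```

Since the bias depends only on the offset `j - i`, every query-key pair at the same distance shares the same value within a head.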

2) Sliding-window attention (not started)

Similar to Transformer-XL, but the sliding is performed twice.
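
For reference, a generic causal sliding-window mask can be sketched as follows. This shows only the basic idea that each query sees a fixed number of recent keys; it does not model the "twice" detail noted above, and `sliding_window_mask` and `window` are hypothetical names.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j.

    A key is visible if it is not in the future (j <= i) and lies within
    the last `window` positions, counting the query position itself.
    """
    return [[(j <= i) and (i - j < window) for j in range(seq_len)] for i in range(seq_len)]
```

With `window=2`, position 3 attends only to positions 2 and 3, so the cost per query stays constant as the sequence grows.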