On generation strategy

Random generation cannot make use of EOS tokens, since we cannot determine how many tokens will be generated in the next stage.

Either causality should be given to the transformer, or we need to specify how many sequences we will generate; providing neither will affect the model a lot.

Also, a minor fix might be useful: apply the padding mask first, and only then the masking.

Currently masking is applied first, so for some sequences only padding tokens might end up masked out.
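A minimal sketch of the padding-first order, assuming token ids in plain Python lists. The names `PAD`, `MASK`, and `mask_non_padding` are hypothetical and only for illustration, not taken from the actual code. Positions to mask are drawn only from non-padding tokens, so a sequence can never end up with only paddings masked.

```python
import random

PAD = 0   # hypothetical padding token id
MASK = 1  # hypothetical mask token id

def mask_non_padding(tokens, mask_ratio, rng):
    """Mask a fraction of the real (non-padding) tokens of one sequence."""
    # Padding mask first: only real tokens are candidates for masking.
    candidates = [i for i, t in enumerate(tokens) if t != PAD]
    if not candidates:
        return list(tokens), []
    # Mask at least one real token, and never more than are available.
    n_mask = min(len(candidates), max(1, round(mask_ratio * len(candidates))))
    chosen = sorted(rng.sample(candidates, n_mask))
    masked = list(tokens)
    for i in chosen:
        masked[i] = MASK
    return masked, chosen

rng = random.Random(0)
seq = [5, 6, 7, PAD, PAD, PAD, PAD, PAD]  # short sequence, mostly padding
masked, chosen = mask_non_padding(seq, 0.3, rng)
print(masked, chosen)
```

With the opposite order (sampling over all 8 positions), a draw could land entirely on the 5 padding positions and the sequence would give no training signal.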

However, causality hurts performance, because each position only knows the information that comes before it.
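To make the restriction concrete, here is a small sketch of a standard causal attention mask (the helper name `causal_mask` is made up for this note). Row i marks which positions j token i may look at; everything to the right of the diagonal is hidden.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

mask = causal_mask(4)
for row in mask:
    # 'x' = visible, '.' = hidden
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

The first token sees only itself, which is the information loss mentioned above compared with a fully bidirectional model.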

[Image: JPEG 이미지-4262-8361-35-0.jpeg]

If we do it like this, couldn't we generate several tokens at a time, causally, at inference as well?