Ego4D clips

Each clip is 5 minutes (300 s) long.

0. Annotations

train: 389 clips (6 of these uids have no 'anno' key)
val: 50 clips
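Since some train uids lack the 'anno' key, it can help to separate them before training. The following is a minimal sketch, assuming the annotations load as a dict that maps each uid to a record dict; that layout, the helper name `split_by_anno`, and the toy uids are all hypothetical and not confirmed by these notes.

```python
def split_by_anno(annotations):
    """Separate uids that carry an 'anno' key from those that do not.

    Assumes `annotations` is a dict of uid -> record dict (hypothetical layout).
    """
    with_anno = {uid: rec for uid, rec in annotations.items() if "anno" in rec}
    missing = sorted(uid for uid, rec in annotations.items() if "anno" not in rec)
    return with_anno, missing


# Toy data standing in for the real annotation file.
toy = {
    "uid_a": {"anno": [0, 1]},
    "uid_b": {},              # no 'anno' key, so it is reported as missing
    "uid_c": {"anno": []},    # empty but present, so it is kept
}
usable, missing = split_by_anno(toy)
```

On the real train split, this kind of filter would be expected to leave 389 - 6 = 383 usable uids.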

Annotations that label the few temporal units immediately before the camera wearer speaks (one temporal unit = 6 frames, as used in TSN)

/scratch/jisoo/{split}_offset_1.pickle

/scratch/jisoo/{split}_offset_2.pickle

/scratch/jisoo/{split}_offset_5.pickle
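To illustrate what "labeling the temporal units just before the camera wearer speaks" could look like, here is a minimal sketch. It assumes that the number in the file name (1, 2, 5) is the count of temporal units labeled before the speech onset; this interpretation is an assumption, not something these notes state. The helper names `offset_path` and `units_before_onset` are hypothetical.

```python
FPS = 30              # frame rate of the clips
FRAMES_PER_UNIT = 6   # one temporal unit = 6 frames


def offset_path(split, offset):
    """Fill in the path template from these notes for a split and an offset."""
    return "/scratch/jisoo/{}_offset_{}.pickle".format(split, offset)


def units_before_onset(onset_sec, offset):
    """Indices of the `offset` temporal units right before a speech onset.

    Assumption: `offset` counts temporal units, not frames or seconds.
    """
    onset_unit = int(onset_sec * FPS) // FRAMES_PER_UNIT
    start = max(0, onset_unit - offset)   # clamp at the start of the clip
    return list(range(start, onset_unit))


labeled = units_before_onset(10.0, 2)   # onset at 10 s falls in unit 50
```

With an onset at 10 s and an offset of 2, the labeled units would be 48 and 49, i.e. the 0.4 s leading up to the utterance.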

Annotations that label every temporal unit as background, camera-wearer speech, or normal-speaker speech

/scratch/jisoo/{split}_perfeature_v2.pickle

1. Video features

At 30 fps, each clip has 9000 frames in total.
Since 6 frames are grouped into one chunk, the video feature has shape [9000/6, D] = [1500, D].
D is usually 1024 or 2048.
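The arithmetic above can be checked with a short sketch, which also maps a temporal unit index back to its time span in the clip. The helper name `unit_to_time_span` is hypothetical.

```python
CLIP_SECONDS = 300    # 5 minutes per clip
FPS = 30
FRAMES_PER_UNIT = 6   # frames grouped into one chunk

total_frames = CLIP_SECONDS * FPS               # 9000
num_units = total_frames // FRAMES_PER_UNIT     # 1500
unit_seconds = FRAMES_PER_UNIT / FPS            # 0.2 s covered by each unit


def unit_to_time_span(unit_idx):
    """Return (start, end) in seconds covered by one temporal unit."""
    start = unit_idx * FRAMES_PER_UNIT / FPS
    return start, start + unit_seconds


def feature_shape(dim):
    """Expected video feature shape for a feature dimension D."""
    return (num_units, dim)
```

For example, `feature_shape(1024)` gives `(1500, 1024)`, and each row of the feature matrix corresponds to 0.2 s of video.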

2. Audio features (Wav2Vec2)

train: 383 clips (excluding the 6 uids without 'anno')
val: 50 clips
Features extracted from both normalized and unnormalized audio are available.

/scratch/jungbin/ego4d_audio_feature_normalized

/scratch/jungbin/ego_audio_feature_unnormalized
dtype: float32