ElasticTok: Adaptive Tokenization for Image and Video


Anonymous Authors

Below, we show examples of the adaptive tokenization capabilities of ElasticTok. For each video, the ground truth is shown on the left and the reconstruction on the right. The bottom plot shows the percentage of tokens used for each frame as the video plays. Typically, simpler scenes, or scenes with less motion, use fewer tokens, while larger transitions such as fast motion or scene cuts cause brief spikes in token usage as our model adaptively encodes.
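To make the per-frame adaptivity concrete, the following is a minimal sketch of how a token budget could be chosen per frame: take the smallest prefix of the token sequence whose reconstruction meets a target MSE. The `encode`/`decode` callables, the `max_tokens` bound, and the linear scan are illustrative assumptions, not the exact procedure used by the model.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays."""
    return float(np.mean((a - b) ** 2))

def adaptive_token_count(frame, encode, decode, max_tokens, threshold=0.007):
    """Return the smallest token count whose reconstruction meets the MSE threshold.

    `encode` and `decode` stand in for the tokenizer's encoder/decoder. A linear
    scan is shown for clarity; a search over prefix lengths also works, since
    reconstruction error shrinks as more tokens are kept.
    """
    tokens = encode(frame)               # full-length token sequence for the frame
    for k in range(1, max_tokens + 1):
        recon = decode(tokens[:k])       # reconstruct from the first k tokens only
        if mse(frame, recon) <= threshold:
            return k                     # simple content -> few tokens suffice
    return max_tokens                    # complex content or a scene cut -> all tokens
```

The token-usage plots below each video simply trace this per-frame count as a percentage of the full budget.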




Shorter Video Examples




Long Video Examples




Different Target Reconstruction Thresholds

The following videos show different MSE reconstruction thresholds applied to the same video. The left video uses a stricter threshold of 0.001, which always requires all tokens due to the complexity of the video. The right video uses a higher threshold of 0.007, which benefits more from the adaptivity of ElasticTok.
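In terms of the sketch above, the two settings differ only in the target passed to the threshold search; the numbers below are illustrative and reuse the hypothetical helper from the earlier snippet.

```python
# Stricter targets trade adaptivity for fidelity: on a complex frame, a 0.001
# target typically saturates the budget, while 0.007 leaves room to adapt.
# max_tokens=2048 is an arbitrary illustrative budget, not the model's setting.
strict_k  = adaptive_token_count(frame, encode, decode, max_tokens=2048, threshold=0.001)
relaxed_k = adaptive_token_count(frame, encode, decode, max_tokens=2048, threshold=0.007)
```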