Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

¹Peking University, ²Kuaishou Technology

Multimodal Generation

Text-to-Image Generation


Text-to-Video Generation

(New!) Trained from scratch without SVD initialization

Trained with SVD initialization

Image-to-Video Generation


Long Video Generation

(New!) Adding Intervention Keyframes

Quantitative Results

Zero-shot text-to-video generation performance.

Multimodal Understanding

Image Understanding Performance

Image understanding performance.

Video Understanding Performance

Zero-shot video question answering performance.