Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

¹Peking University, ²Kuaishou Technology

Multimodal Generation

Text-to-Image Generation


Text-to-Video Generation


Image-to-Video Generation


Long Video Generation


Quantitative Results

Zero-shot text-to-video generation performance.

Multimodal Understanding

Image Understanding Performance

Image understanding performance.

Video Understanding Performance

Zero-shot video question answering performance.