Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation

Although diffusion models have shown a powerful ability to generate photorealistic images, generating realistic and diverse videos remains in its infancy. One key reason is that current methods intertwine spatial content and temporal dynamics, which notably increases the complexity of text-to-video (T2V) generation. In this work, we propose HiGen, a diffusion-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives: the structure level and the content level. At the structure level, we decompose the T2V task into two steps, spatial reasoning and temporal reasoning, using a unified denoiser. Specifically, we generate spatially coherent priors from text during spatial reasoning and then generate temporally coherent motions from these priors during temporal reasoning. At the content level, we extract two subtle cues from the content of the input video that express motion and appearance changes, respectively. These two cues then guide the model's training for video generation, enabling flexible content variations and enhancing temporal stability. Through this decoupled paradigm, HiGen effectively reduces the complexity of the task and generates realistic videos with semantic accuracy and motion stability.

News!!!

[2024.03] HiGen's code and models are now publicly available! Additionally, the high-quality super-resolution code and models of our I2VGen-XL have also been released!

Preparation

Installation

conda create -n vgen python=3.8
conda activate vgen
pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu113
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
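
As a quick sanity check after installation, the following sketch compares the installed package versions against the pins in the pip command above. It only reads package metadata, so it does not import torch itself. The helper names `installed_version` and `find_mismatches` are illustrative and are not part of VGen.

```python
from importlib import metadata

# Versions pinned by the pip command in this README.
# Local build tags such as "+cu113" are ignored in the comparison.
EXPECTED = {"torch": "1.12.0", "torchvision": "0.13.0", "torchaudio": "0.12.0"}


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected):
    """Return {package: installed_version_or_None} for every package that is off-pin."""
    problems = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        base = found.split("+")[0] if found else None
        if base != wanted:
            problems[package] = found
    return problems


if __name__ == "__main__":
    print(find_mismatches(EXPECTED) or "all pinned packages match")
```

An empty result means the three pinned packages are present at the expected base versions; it says nothing about whether the CUDA build matches your driver.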

You also need ffmpeg installed on your system. If it is missing, you can install it with the following command:

sudo apt-get update && sudo apt-get install -y ffmpeg libsm6 libxext6
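
To confirm that ffmpeg is actually reachable before running inference, a minimal check along these lines may help. It only looks for an executable named `ffmpeg` on `PATH`; the function name `find_ffmpeg` is an illustrative choice, not something provided by VGen.

```python
import shutil


def find_ffmpeg():
    """Return the full path of the ffmpeg executable, or None if it is not on PATH."""
    return shutil.which("ffmpeg")


if __name__ == "__main__":
    path = find_ffmpeg()
    if path is None:
        print("ffmpeg not found on PATH; install it before running inference")
    else:
        print(f"ffmpeg found at {path}")
```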

Clone the code

git clone https://github.com/ali-vilab/VGen.git
cd VGen

Download the checkpoints

Here we officially release the HiGen model and the super-resolution model. Other auxiliary models, such as the CLIP text encoder and the AutoEncoder, can be taken from Stable Diffusion. For convenience, they are also provided in this repository.

# Install the ModelScope client first (in a shell, or prefix with "!" in a notebook): pip install modelscope
from modelscope.hub.snapshot_download import snapshot_download
model_dir = snapshot_download('iic/HiGen', cache_dir='models/')

Then you might need the following command to move the checkpoints to the "models/" directory:

mv ./models/iic/HiGen/* ./models/
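
If you prefer to do this step from Python (for example on a system without `mv`), the sketch below is a rough equivalent. It assumes the download landed in `models/iic/HiGen` as in the command above. Unlike `mv`, it skips any name that already exists in the destination instead of overwriting it; like the shell glob, it ignores hidden entries. The helper name `flatten_checkpoints` is hypothetical.

```python
import shutil
from pathlib import Path


def flatten_checkpoints(src, dst):
    """Move every non-hidden entry of src into dst and return the moved names.

    Entries whose name already exists in dst are left where they are.
    """
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    for item in sorted(src.iterdir()):
        if item.name.startswith("."):
            continue  # mirror the shell glob, which skips dotfiles
        target = dst / item.name
        if target.exists():
            continue  # never overwrite an existing checkpoint
        shutil.move(str(item), str(target))
        moved.append(item.name)
    return moved


if __name__ == "__main__":
    source = Path("models/iic/HiGen")
    if source.is_dir():
        print(flatten_checkpoints(source, "models"))
```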

Run inference on your prompts

You can run text-to-video inference with the following command:

python inference.py --cfg configs/higen_infer.yaml

Then you can execute the following command to perform super-resolution on the generated videos:

python inference.py --cfg configs/sr600_infer.yaml

Finally, you can find the videos you generated in the workspace/experiments/text_list_for_t2v_share directory. For specific configurations such as data, models, seed, etc., please refer to the higen_infer.yaml file.
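
The two commands above can also be chained from a small driver script, so that super-resolution only starts if text-to-video inference exits successfully. This is a minimal sketch: the config paths are the two named in this README, while `build_commands` and `run_stages` are illustrative helpers that do not exist in VGen.

```python
import subprocess
import sys

# The two config files named in this README, in the order they should run.
STAGES = ["configs/higen_infer.yaml", "configs/sr600_infer.yaml"]


def build_commands(configs, python=sys.executable):
    """Return one 'python inference.py --cfg <config>' argument list per config."""
    return [[python, "inference.py", "--cfg", cfg] for cfg in configs]


def run_stages(commands, runner=subprocess.run):
    """Run the commands in order; check=True raises on a non-zero exit and stops."""
    for cmd in commands:
        runner(cmd, check=True)


# Usage, from the root of the cloned VGen repository:
# run_stages(build_commands(STAGES))
```

Passing `runner` as a parameter keeps the sequencing logic testable without launching the actual inference.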



Customize your own approach

Our codebase supports essentially all of the commonly used components in video generation. You can manage your experiments flexibly by adding corresponding registration classes, including ENGINE, MODEL, DATASETS, EMBEDDER, AUTO_ENCODER, VISUAL, DIFFUSION, and PRETRAIN. Components registered this way remain compatible with all of our open-source algorithms, so you can combine them according to your own needs. If you have any questions, feel free to give us your feedback at any time.
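
To illustrate the general idea of registration classes, here is a generic registry sketch. It is not VGen's actual implementation: the `Registry` class, its `register` and `build` methods, the `type` config key, and the `TinyDenoiser` component are all hypothetical, and only the registry name `MODEL` comes from the list above. Please consult the repository for the real interface.

```python
class Registry:
    """Maps class names to classes so a config can select a component by name."""

    def __init__(self, name):
        self.name = name
        self._classes = {}

    def register(self, cls):
        """Class decorator that records cls under its own class name."""
        if cls.__name__ in self._classes:
            raise KeyError(f"{cls.__name__} is already registered in {self.name}")
        self._classes[cls.__name__] = cls
        return cls

    def build(self, cfg):
        """Instantiate the class named by cfg['type'], passing the other keys as kwargs."""
        kwargs = dict(cfg)  # copy so the caller's config is not modified
        cls_name = kwargs.pop("type")
        if cls_name not in self._classes:
            raise KeyError(f"{cls_name} is not registered in {self.name}")
        return self._classes[cls_name](**kwargs)


# Hypothetical registry and component, for illustration only.
MODEL = Registry("MODEL")


@MODEL.register
class TinyDenoiser:
    def __init__(self, channels=4):
        self.channels = channels


model = MODEL.build({"type": "TinyDenoiser", "channels": 8})
```

With this pattern, swapping a component means changing only the name in the config, while the training or inference code keeps calling `build`.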

BibTeX

If this repo is useful to you, please cite our corresponding technical paper.

@article{2023i2vgenxl,
  title={I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models},
  author={Zhang, Shiwei and Wang, Jiayu and Zhang, Yingya and Zhao, Kang and Yuan, Hangjie and Qing, Zhiwu and Wang, Xiang and Zhao, Deli and Zhou, Jingren},
  journal={arXiv preprint arXiv:2311.04145},
  year={2023}
}
@article{qing2023higen,
  title={Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation},
  author={Qing, Zhiwu and Zhang, Shiwei and Wang, Jiayu and Wang, Xiang and Wei, Yujie and Zhang, Yingya and Gao, Changxin and Sang, Nong},
  journal={arXiv preprint arXiv:2312.04483},
  year={2023}
}

Acknowledgement

We would like to express our gratitude for the contributions of several previous works to the development of VGen. These include, but are not limited to, Composer, ModelScopeT2V, Stable Diffusion, OpenCLIP, WebVid-10M, LAION-400M, Pidinet, and MiDaS. We are committed to building upon these foundations in a way that respects their original contributions.

Disclaimer

This open-source model was trained using the WebVid-10M and LAION-400M datasets and is intended for RESEARCH/NON-COMMERCIAL USE ONLY.
