Open-source address
https://modelscope.cn/models/iic/Image-to-Video
License
CC-BY-NC-ND

Image-to-Video: a large model for high-definition image-to-video generation

The Image-to-Video project addresses the task of generating high-definition video from an input image. Image-to-Video is one of the high-quality video generation foundation models developed by DAMO Academy. Its core consists of two stages, which address semantic consistency and video clarity respectively, with approximately 3.7 billion parameters in total. The model was pre-trained on a large-scale mixture of video and image data and then fine-tuned on a small amount of high-quality data. The data is widely distributed and diverse in category, and the model generalizes well to different types of data. Compared with existing video generation models, Image-to-Video has clear advantages in clarity, texture, semantics, and temporal continuity.

In addition, many design concepts and details of Image-to-Video (such as the core UNet) are inherited from our publicly available work, VideoComposer. For details, please refer to VideoComposer and the GitHub code repository for this ModelScope project.

Fig.1 Overall framework of I2VGen-XL.

Introduction

As shown in Fig.2, Image-to-Video is a video latent diffusion model (VLDM). It uses our specially designed spatio-temporal UNet (ST-UNet) to perform spatio-temporal modeling in the latent space, and then reconstructs the final video through a decoder (for the model structure, please refer to VideoComposer). To generate 720P videos, we divide Image-to-Video into two stages. The first stage ensures semantic consistency at low resolution, while the second stage uses a new VLDM to denoise the video, raising its resolution and improving temporal and spatial consistency at the same time. Through joint optimization of the model, data, and training, Image-to-Video has the following characteristics:

High definition and widescreen: the model directly generates 720P (1280*720) videos. Compared with existing open-source projects, the resolution is higher, and the widescreen output suits more scenarios.
Continuity: specific training and inference strategies significantly improve the stability of detail generation in both the temporal and spatial dimensions.
Good texture: training on collected video data of specific styles significantly improves the texture of the generated videos, which can be in tech, cinematic color, cartoon, sketch, and other styles.
No watermark: the model is trained on our large-scale internal watermark-free video/image data and fine-tuned on high-quality data. The resulting watermark-free videos can be used on more video platforms with fewer restrictions.

Below are some examples generated by the model:

Fig.2 Architecture of the first stage.

For display purposes, this page shows the examples as low-resolution GIFs. The GIF format reduces video quality; for the 720P results, please refer to the corresponding video links below.


[Example gallery: GIF previews, each paired with an "HQ Video" link]

[Update 2023.08.25] ModelScope released version 1.8.4, and the I2VGen-XL model was updated to model parameter file v1.1.0.

Dependencies

First, make sure that the ffmpeg command is installed on your system. If it is not, you can install it with the following command:

sudo apt-get update && sudo apt-get install -y ffmpeg libsm6 libxext6
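Before installing the Python dependencies, it can help to confirm that ffmpeg is actually visible on the PATH. The following is a minimal sketch using only the Python standard library; the helper name `find_executable` is illustrative and is not part of ModelScope.

```python
import shutil


def find_executable(name):
    """Return the full path of `name` if it is on PATH, otherwise None.

    Hypothetical helper for illustration; it only wraps shutil.which.
    """
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable("ffmpeg")
    if path is None:
        print("ffmpeg not found on PATH; install it before running the model")
    else:
        print(f"ffmpeg found at {path}")
```

A `None` result only means the binary is not on the PATH of the current environment; it may still be installed elsewhere.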

Next, the Image-to-Video project is compatible with the ModelScope codebase. The following are some of the dependencies that need to be installed for this project.

pip install modelscope==1.8.4
pip install xformers==0.0.20
pip install torch==2.0.1
pip install "open_clip_torch>=2.0.2"
pip install opencv-python-headless
pip install opencv-python
pip install "einops>=0.4"
pip install rotary-embedding-torch
pip install fairscale
pip install scipy
pip install imageio
pip install pytorch-lightning
pip install torchsde
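After installation, the pinned versions can be checked against what is actually present in the environment. This is a minimal sketch using the standard-library `importlib.metadata`; the `EXPECTED` table and the `check_versions` helper are illustrative names, and the pins are copied from the install commands above.

```python
from importlib import metadata

# Exact pins taken from the install commands; None means "any version".
EXPECTED = {
    "modelscope": "1.8.4",
    "xformers": "0.0.20",
    "torch": "2.0.1",
    "open_clip_torch": None,
    "einops": None,
}


def check_versions(expected):
    """Map each package name to 'ok', 'missing', or 'mismatch (found X)'."""
    report = {}
    for name, pinned in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Ignore local build suffixes such as "+cu117" when comparing.
        base = found.split("+")[0]
        if pinned is None or base == pinned:
            report[name] = "ok"
        else:
            report[name] = f"mismatch (found {found})"
    return report


if __name__ == "__main__":
    for package, status in check_versions(EXPECTED).items():
        print(f"{package}: {status}")
```

The sketch does not evaluate the `>=` constraints; it only reports whether those packages are installed at all.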

Quick start (Inference)

For more experiments, please stay tuned for our upcoming technical report and open-source code release.

Code example

from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys

# Build the image-to-video pipeline on the first GPU.
pipe = pipeline(task="image-to-video", model='damo/Image-to-Video', model_revision='v1.1.0', device='cuda:0')

# IMG_PATH: your image path (URL or local file); 'test.jpg' is a placeholder.
IMG_PATH = 'test.jpg'
output_video_path = pipe(IMG_PATH, output_video='./output.mp4')[OutputKeys.OUTPUT_VIDEO]
print(output_video_path)
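Since the image path may be either a URL or a local file, a small check before building the pipeline can catch a mistyped path early, before the model weights are loaded. This is a sketch under that assumption; `classify_image_source` is a hypothetical helper, not a ModelScope function.

```python
import os
from urllib.parse import urlparse


def classify_image_source(img_path):
    """Return 'url' for http(s) addresses and 'file' for existing local files.

    Raises FileNotFoundError when the path is neither.
    """
    scheme = urlparse(img_path).scheme.lower()
    if scheme in ("http", "https"):
        return "url"
    if os.path.isfile(img_path):
        return "file"
    raise FileNotFoundError(f"not a URL and not an existing file: {img_path}")
```

The check does not download or decode the image, so a URL that points to a missing resource still passes.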

If you want to generate a super-resolution (higher-resolution) video, use the following code:

from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys

# If you only have one GPU, make sure its memory is larger than 50G;
# otherwise use two GPUs and assign one to each pipeline via `device`.
pipe1 = pipeline(task='image-to-video', model='damo/Image-to-Video', model_revision='v1.1.0', device='cuda:0')
pipe2 = pipeline(task='video-to-video', model='damo/Video-to-Video', model_revision='v1.1.0', device='cuda:0')

# Image to video
output_video_path = pipe1("test.jpg", output_video='./i2v_output.mp4')[OutputKeys.OUTPUT_VIDEO]

# Video super-resolution
p_input = {'video_path': output_video_path}
new_output_video_path = pipe2(p_input, output_video='./v2v_output.mp4')[OutputKeys.OUTPUT_VIDEO]
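The one-GPU versus two-GPU choice mentioned in the code comment can be made explicit with a small helper. This is a sketch only; `pick_devices` is a hypothetical name, and the GPU count is passed in as a plain integer so the logic can be read without PyTorch.

```python
def pick_devices(gpu_count):
    """Return (stage1_device, stage2_device) for the two pipelines.

    With one GPU both stages share 'cuda:0'; with two or more,
    each stage gets its own device.
    """
    if gpu_count < 1:
        raise RuntimeError("at least one CUDA GPU is required")
    if gpu_count == 1:
        return "cuda:0", "cuda:0"
    return "cuda:0", "cuda:1"
```

In practice the count could come from `torch.cuda.device_count()`, and the two returned strings would be passed as the `device` argument of the first and second pipeline respectively. The helper does not check memory, so the 50G requirement for the single-GPU case still has to be verified separately.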

For more details on super-resolution, please visit Video-to-Video. We also provide a user interface: I2VGen-XL-Demo.

Limitations

Currently, we have found that the Image-to-Video method has certain limitations in the following situations:

Limited ability to generate small objects. Errors may occur when generating smaller objects.
Limited ability to generate fast-moving objects. Artifacts and implausible results may appear when generating fast-moving objects.
Slow generation speed. Generating high-definition video noticeably slows down generation.

In addition, our research has found a trade-off between the spatial quality of the generated videos and their rate of temporal change. In this project, we chose a model that balances the two.

If you are trying our model, we suggest first obtaining a video whose semantics match your expectations in the first stage (when running offline, you can modify the Seed in the configuration.json file to generate different videos), and only then trying the second-stage video refinement, since that process is more time-consuming. This improves your efficiency and makes it easier to achieve better results.
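Changing the Seed in configuration.json can be scripted when many first-stage variants are needed. The sketch below is an assumption-heavy illustration: the text does not say where the Seed entry sits inside the file, so the hypothetical helper `set_seed_in_config` searches the whole JSON structure for any key named "seed" (case-insensitive) instead of relying on a fixed location.

```python
import json


def set_seed_in_config(config, seed):
    """Set every key named 'seed' (case-insensitive) in a nested structure.

    Returns the number of keys updated. The key location is not assumed.
    """
    updated = 0
    if isinstance(config, dict):
        for key, value in config.items():
            if isinstance(key, str) and key.lower() == "seed":
                config[key] = seed
                updated += 1
            else:
                updated += set_seed_in_config(value, seed)
    elif isinstance(config, list):
        for item in config:
            updated += set_seed_in_config(item, seed)
    return updated


def rewrite_config_seed(path, seed):
    """Load a JSON config file, replace its seed value(s), and save it."""
    with open(path, "r", encoding="utf-8") as handle:
        config = json.load(handle)
    updated = set_seed_in_config(config, seed)
    if updated == 0:
        raise KeyError(f"no 'seed' key found in {path}")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2, ensure_ascii=False)
    return updated
```

It is safer to keep a backup copy of the original file, because the rewrite normalizes the file's indentation.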

Training Data

Our training data comes from a wide range of sources and has the following attributes:

Mixed training. The model is trained with a 7:1 ratio of video to image data to ensure the quality of video generation.
Wide category distribution. The data totals billions of samples and covers most real-world categories, including people, animals, locomotives, science fiction, scenes, and so on.
Wide source distribution. The data comes from open-source datasets, video websites, and other internal sources, with varying resolutions and aspect ratios.
High-quality data construction. To improve the quality of the generated videos, we constructed approximately 200,000 high-quality data pairs for fine-tuning the pre-trained model.
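One way to picture the 7:1 video-to-image mix is as a weighted choice of modality for each training batch. The text does not describe how the ratio is actually enforced, so the following is only an illustrative sketch with a hypothetical `sample_modalities` function, not the training code of this project.

```python
import random


def sample_modalities(num_batches, video_parts=7, image_parts=1, seed=0):
    """Draw a modality label per batch with a video:image weight ratio.

    Each draw is independent, so the ratio holds on average, not exactly.
    """
    rng = random.Random(seed)
    p_video = video_parts / (video_parts + image_parts)
    return [
        "video" if rng.random() < p_video else "image"
        for _ in range(num_batches)
    ]


if __name__ == "__main__":
    schedule = sample_modalities(8000)
    share = schedule.count("video") / len(schedule)
    print(f"video share: {share:.3f} (target 0.875)")
```

A deterministic alternative would be a repeating cycle of seven video batches followed by one image batch, which matches the ratio exactly at the cost of a fixed ordering.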

More powerful and flexible video generation models will continue to be released, and the accompanying technical report is currently being written. Please stay tuned for updates.

License Agreement

Our code and model weights are available only for personal/academic research use; commercial use is not currently supported.

Contact Us

If you would like to contact our algorithm/product team, or join our algorithm team (internship/full-time), please email us at yingya.zyy@alibaba-inc.com.
