Text-to-video-synthesis Model in Open Domain

Posted by an anonymous user on July 31, 2024

Technical information

Open-source address
https://modelscope.cn/models/iic/text-to-video-synthesis
License
CC-BY-NC-ND

Details

Text-to-video-synthesis Model in Open Domain

This model is based on a multi-stage text-to-video generation diffusion model. It takes a text description as input and returns a video that matches the description. Only English input is supported.

Model Description

The text-to-video generation diffusion model consists of three sub-networks: text feature extraction, a diffusion model from text features to the video latent space, and a mapping from the video latent space to the video visual space. The overall model has about 1.7 billion parameters. Only English input is supported. The diffusion model adopts the UNet3D structure and generates video by iteratively denoising a video of pure Gaussian noise.
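The iterative denoising idea can be illustrated with a minimal DDPM-style sketch. It is not the model's actual sampler: the linear noise schedule, the step count, the flattened latent size, and the `predict_noise` callable (a hypothetical stand-in for the UNet3D denoiser, which in the real model would also be conditioned on text features) are all assumptions made for illustration.

```python
import math
import random


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumption; the real schedule is not given)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    betas = [beta_start + i * step for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def sample_latent(predict_noise, size, num_steps=50, seed=0):
    """Run a DDPM-style reverse process starting from pure Gaussian noise."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]  # pure Gaussian noise
    for t in reversed(range(num_steps)):
        eps = predict_noise(x, t)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        scale = math.sqrt(alphas[t])
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise on the last step
        # Posterior mean from the noise estimate, plus fresh noise.
        x = [(v - coef * e) / scale + sigma * rng.gauss(0.0, 1.0)
             for v, e in zip(x, eps)]
    return x


def toy_predictor(num_steps=50):
    """Toy denoiser that assumes the clean latent is all zeros."""
    _, _, alpha_bars = make_schedule(num_steps)
    return lambda x, t: [v / math.sqrt(1.0 - alpha_bars[t]) for v in x]


# Hypothetical flattened latent: frames * channels * height * width values.
latent = sample_latent(toy_predictor(), size=4 * 4 * 8 * 8)
print(len(latent))
```

With the toy predictor the loop converges to an all-zero latent, which only shows that the update rule removes noise step by step. In the described model, the third sub-network would then map the denoised latent to the video visual space.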

Intended use and applicable scope

This model has a wide range of applications and can run inference on arbitrary English text descriptions to generate videos. Some text-to-video examples follow. In the original layout, each input text appears above its corresponding generated video:


  • Robot dancing in times square.

  • Clown fish swimming through the coral reef.

  • Melting ice cream dripping down the cone.

  • A waterfall flowing through glacier at night.

  • A cat eating food out of a bowl, in style of van Gogh.

  • Tiny plant sprout coming out of the ground.

  • Hyper-realistic photo of an abandoned industrial site during a storm.

  • Balloon full of water exploding in extreme slow motion.

  • Incredibly detailed science fiction scene set on an alien planet, view of a marketplace. Pixel art.

How to use

To make the model easy to try, users can refer to the Aliyun Notebook Tutorial to quickly develop with this text-to-video model. The model is also available on ModelScope Studio and Hugging Face, where it can be tried directly.

This model currently supports inference only on GPU. It requires about 16GB of CPU RAM and 16GB of GPU RAM. Under the ModelScope framework, the model can be used by calling a simple Pipeline. The input must be a dictionary whose only valid key is 'text', with a short piece of text as its value. A code example is given below:
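The input format described above can be sketched as a small helper that builds the dictionary before it is passed to the pipeline. The function name `build_input` is hypothetical, and the ASCII check is only a rough proxy for "English input"; neither is part of ModelScope.

```python
def build_input(prompt):
    """Return the pipeline input dict for a short English prompt.

    The ASCII check is a rough, hypothetical proxy for "English only".
    """
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    if not prompt.isascii():
        raise ValueError("only English input is supported")
    # 'text' is the only valid key according to the model description.
    return {"text": prompt.strip()}


print(build_input("A panda eating bamboo on a rock."))
```

Rejecting non-ASCII prompts early avoids spending GPU time on input the model was not trained for, at the cost of also rejecting English text that contains accented characters.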

[2023.03.21 update] ModelScope released version 1.4.2, and the text-to-video-synthesis model parameter file was updated to v1.1.0.

Operating environment

 pip install modelscope==1.4.2
 pip install pytorch-lightning
 pip install open_clip_torch==2.24.0
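After installation, a quick check of the installed versions can catch a mismatched environment before the model is downloaded. The sketch below uses only the standard library; the helper name `check_environment` is hypothetical, and the pins are copied from the install commands above.

```python
from importlib import metadata

# Pins copied from the install commands; None means "any version".
REQUIRED = {
    "modelscope": "1.4.2",
    "pytorch-lightning": None,
    "open_clip_torch": "2.24.0",
}


def check_environment(required=REQUIRED):
    """Map each package to 'ok', 'missing', or a version-mismatch message."""
    report = {}
    for name, pinned in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if pinned is None or installed == pinned:
            report[name] = "ok"
        else:
            report[name] = f"found {installed}, expected {pinned}"
    return report


for package, status in check_environment().items():
    print(f"{package}: {status}")
```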

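Code example

from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys

# Build the text-to-video pipeline (inference requires a GPU).
p = pipeline('text-to-video-synthesis', 'damo/text-to-video-synthesis')
# The input is a dictionary whose only valid key is 'text'.
test_text = {
    'text': 'A panda eating bamboo on a rock.',
}
# Run inference and read back the path of the saved video.
output_video_path = p(test_text, output_video='./output.mp4')[OutputKeys.OUTPUT_VIDEO]
print('output_video_path:', output_video_path)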

View results

The code above prints the path where the output mp4 video is saved. The current encoding format plays correctly in VLC player; the system default player and some other media players may not play it correctly.
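If a player other than VLC is needed, one possible workaround is to re-encode the file. The sketch below assumes that `ffmpeg` is installed with the `libx264` encoder and that the playback problem comes from the codec or pixel format, which the source does not confirm. The helper names and the output file name are hypothetical.

```python
import shutil
import subprocess


def build_reencode_command(src, dst):
    """ffmpeg arguments that re-encode src to H.264 with the yuv420p pixel format.

    -n makes ffmpeg exit instead of prompting if dst already exists.
    """
    return ["ffmpeg", "-n", "-i", src, "-c:v", "libx264", "-pix_fmt", "yuv420p", dst]


def reencode(src, dst):
    """Run ffmpeg if it is on PATH; return True when the re-encode succeeded."""
    if shutil.which("ffmpeg") is None:
        return False
    result = subprocess.run(build_reencode_command(src, dst), check=False)
    return result.returncode == 0


print(" ".join(build_reencode_command("./output.mp4", "./output_h264.mp4")))
```

Calling `reencode('./output.mp4', './output_h264.mp4')` after the pipeline has run would write a second file and leave the original untouched.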

Model limitations and biases

  • The model is trained on public datasets such as Webvid, and the generated results may show biases related to the distribution of the training data.

  • This model cannot achieve perfect film- and television-quality generation.

  • The model cannot generate clear text.

  • The model is mainly trained on an English corpus and does not support other languages at the moment.

  • The model's performance on complex compositional generation tasks needs improvement.

Misuse, malicious use, and out-of-scope use

  • The model is provided for non-commercial purposes and for research use only.

  • The model was not trained to realistically represent people or events, so using it to generate such content is beyond the model's capabilities.

  • It is prohibited to generate content that is demeaning or harmful to people or their environment, culture, religion, etc.

  • It is prohibited to generate pornographic, violent, or gory content.

  • It is prohibited to generate erroneous or false information.

Training data

The training data includes LAION5B, ImageNet, Webvid, and other public datasets. Images and videos are filtered using criteria such as aesthetic score, watermark score, and deduplication.
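The kind of filtering described above can be sketched as follows. This is an illustration only: the field names, the two thresholds, and the use of an exact SHA-256 hash for deduplication are assumptions, not the procedure or values used to build the actual training set.

```python
import hashlib


def filter_samples(samples, min_aesthetic=5.0, max_watermark=0.5):
    """Keep samples that pass both score thresholds, dropping exact duplicates.

    Thresholds and field names are hypothetical.
    """
    seen = set()
    kept = []
    for sample in samples:
        if sample["aesthetic"] < min_aesthetic:
            continue  # judged not visually pleasing enough
        if sample["watermark"] > max_watermark:
            continue  # likely contains a watermark
        digest = hashlib.sha256(sample["data"]).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier sample
        seen.add(digest)
        kept.append(sample)
    return kept


samples = [
    {"id": "a", "data": b"clip-1", "aesthetic": 6.2, "watermark": 0.1},
    {"id": "b", "data": b"clip-1", "aesthetic": 6.2, "watermark": 0.1},  # duplicate of "a"
    {"id": "c", "data": b"clip-2", "aesthetic": 3.0, "watermark": 0.0},  # low aesthetic score
    {"id": "d", "data": b"clip-3", "aesthetic": 7.5, "watermark": 0.9},  # likely watermarked
    {"id": "e", "data": b"clip-4", "aesthetic": 5.5, "watermark": 0.2},
]
print([s["id"] for s in filter_samples(samples)])  # ['a', 'e']
```

A real pipeline would more likely use near-duplicate detection (for example perceptual hashes or embedding similarity) than exact hashing, since re-encoded copies of the same clip differ at the byte level.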

Related papers and citation information

@article{wang2023modelscope,
  title={Modelscope text-to-video technical report},
  author={Wang, Jiuniu and Yuan, Hangjie and Chen, Dayou and Zhang, Yingya and Wang, Xiang and Zhang, Shiwei},
  journal={arXiv preprint arXiv:2308.06571},
  year={2023}
}

@inproceedings{luo2023videofusion,
  title={VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
  author={Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}

@inproceedings{rombach2022high,
  title={High-resolution image synthesis with latent diffusion models},
  author={Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj{\"o}rn},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={10684--10695},
  year={2022}
}

@inproceedings{Bain21,
  author={Max Bain and Arsha Nagrani and G{\"u}l Varol and Andrew Zisserman},
  title={Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval},
  booktitle={IEEE International Conference on Computer Vision},
  year={2021},
}
