Anonymous user, July 31, 2024

Technical Information

Open-source repository:
https://modelscope.cn/models/iic/dreamtalk

Project Details

DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models

DreamTalk is a diffusion-based, audio-driven expressive talking head generation framework that can produce high-quality talking head videos across diverse speaking styles. DreamTalk exhibits robust performance with a diverse array of inputs, including songs, speech in multiple languages, noisy audio, and out-of-domain portraits.


News

  • [2023.12] Release inference code and pretrained checkpoint.

Install Dependencies

pip install dlib

Installation

Some pre-generated files are already included in the output_video folder; you can run the script below and compare your results against them.

from modelscope.utils.constant import Tasks
from modelscope.pipelines import pipeline

pipe = pipeline(
    task=Tasks.text_to_video_synthesis,
    model='damo/dreamtalk',
    style_clip_path="data/style_clip/3DMM/M030_front_surprised_level3_001.mat",
    pose_path="data/pose/RichardShelby_front_neutral_level1_001.mat",
    model_revision='master',
)
inputs = {
    "output_name": "songbie_yk_male",
    "wav_path": "data/audio/acknowledgement_english.m4a",
    "img_crop": True,
    "image_path": "data/src_img/uncropped/male_face.png",
    "max_gen_len": 20,
}
pipe(input=inputs)
print("end")

wav_path is the path to the input audio.

style_clip_path is the expression reference file, extracted from a video with emotion; it controls the facial expression of the generated video.

pose_path is the head-motion reference file, extracted from a video; it controls the head motion of the generated video.

image_path is the speaker's portrait, ideally a frontal face. Any input resolution is supported in principle; the image will be cropped to $256\times256$.

max_gen_len is the maximum video duration in seconds; audio longer than this will be truncated.

output_name is the output name. The final video is written to the output_video folder; intermediate results go to the tmp folder.

If the input image is already $256\times256$ and properly framed so that no cropping is needed, you can use disable_img_crop to skip the cropping step.
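A minimal sketch of the pipeline call with cropping disabled, assuming the `img_crop` key in the inputs dict is the pipeline counterpart of the `--disable_img_crop` CLI option (the file paths are the sample assets shipped with the model):

```python
# Sketch: same inputs dict as the script above, but with cropping disabled
# for a portrait that is already 256x256. The img_crop flag is assumed to
# mirror the --disable_img_crop CLI option.
inputs = {
    "output_name": "songbie_yk_male",
    "wav_path": "data/audio/acknowledgement_english.m4a",
    "img_crop": False,  # skip the cropping step
    "image_path": "data/src_img/cropped/zp1.png",
    "max_gen_len": 20,
}
# pipe(input=inputs)  # pipe constructed exactly as in the script above
```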

Download Checkpoints

Download the checkpoint of the denoising network:

Download the checkpoint of the renderer:

Put the downloaded checkpoints into the checkpoints folder.

Inference

Run the script:

python inference_for_demo_video.py \
--wav_path data/audio/acknowledgement_english.m4a \
--style_clip_path data/style_clip/3DMM/M030_front_neutral_level1_001.mat \
--pose_path data/pose/RichardShelby_front_neutral_level1_001.mat \
--image_path data/src_img/uncropped/male_face.png \
--cfg_scale 1.0 \
--max_gen_len 30 \
--output_name acknowledgement_english@M030_front_neutral_level1_001@male_face

wav_path specifies the input audio. Audio files with extensions such as wav, mp3, m4a, and mp4 (video with sound) should all be compatible.

style_clip_path specifies the reference speaking style and pose_path specifies the head pose. They are 3DMM parameter sequences extracted from reference videos. You can follow PIRenderer to extract 3DMM parameters from your own videos. Note that the video frame rate should be 25 FPS. Besides, videos used for head pose reference should first be cropped to $256\times256$ using the scripts in FOMM video preprocessing.
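One common way to meet the 25 FPS requirement is to resample the reference video with ffmpeg before extracting 3DMM parameters. The helper below is a hypothetical sketch (ffmpeg is an external tool assumed to be installed; it is not part of DreamTalk):

```python
# Hypothetical helper: build an ffmpeg command that resamples a reference
# video to the 25 FPS expected by the 3DMM extraction step.
def ffmpeg_25fps_cmd(src: str, dst: str) -> list:
    # -r 25 placed after the input sets the output frame rate
    return ["ffmpeg", "-i", src, "-r", "25", dst]

cmd = ffmpeg_25fps_cmd("ref_video.mp4", "ref_video_25fps.mp4")
# run with: subprocess.run(cmd, check=True)
```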

image_path specifies the input portrait. Its resolution should be larger than $256\times256$. Frontal portraits, with the face directly facing forward and not tilted to one side, usually achieve satisfactory results. The input portrait will be cropped to $256\times256$. If your portrait is already cropped to $256\times256$ and you want to disable cropping, use the option --disable_img_crop like this:

python inference_for_demo_video.py \
--wav_path data/audio/acknowledgement_chinese.m4a \
--style_clip_path data/style_clip/3DMM/M030_front_surprised_level3_001.mat \
--pose_path data/pose/RichardShelby_front_neutral_level1_001.mat \
--image_path data/src_img/cropped/zp1.png \
--disable_img_crop \
--cfg_scale 1.0 \
--max_gen_le 30 \
--output_name acknowledgement_chinese@M030_front_surprised_level3_001@zp1
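Since the portrait must be larger than $256\times256$ (or exactly $256\times256$ when cropping is disabled), a quick pre-flight check can save a failed run. This is an assumed helper for illustration, not part of the DreamTalk codebase:

```python
# Hypothetical pre-flight check on the portrait resolution.
def portrait_ok(width: int, height: int, cropped: bool) -> bool:
    if cropped:  # --disable_img_crop: portrait must already be 256x256
        return (width, height) == (256, 256)
    return width >= 256 and height >= 256  # will be cropped to 256x256

# e.g. with Pillow: width, height = Image.open("male_face.png").size
```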

cfg_scale controls the scale of classifier-free guidance. It can adjust the intensity of speaking styles.
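For readers unfamiliar with classifier-free guidance, the generic formulation is sketched below; this is the standard CFG update, not DreamTalk's exact implementation. The conditional denoiser output is pushed away from the unconditional one by cfg_scale:

```python
# Generic classifier-free guidance combination (sketch, not DreamTalk's code).
# cfg_scale = 1.0 reproduces the conditional prediction; larger values
# exaggerate the conditioning signal, i.e. the speaking style.
def cfg_combine(eps_uncond: float, eps_cond: float, cfg_scale: float) -> float:
    return eps_uncond + cfg_scale * (eps_cond - eps_uncond)
```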

max_gen_len is the maximum video generation duration, measured in seconds. If the input audio exceeds this length, it will be truncated.
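The truncation behavior can be stated as a one-liner (an illustration of the description above, not library code):

```python
# Illustration of max_gen_len truncation: audio beyond max_gen_len
# seconds is dropped before generation.
def effective_duration(audio_seconds: float, max_gen_len: float) -> float:
    return min(audio_seconds, max_gen_len)
```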

The generated video will be named $(output_name).mp4 and put in the output_video folder. Intermediate results, including the cropped portrait, will be in the tmp/$(output_name) folder.

Sample inputs are presented in the data folder. Due to copyright issues, we are unable to include the songs we have used in this folder.

Acknowledgements

We extend our heartfelt thanks for the invaluable contributions made by preceding works to the development of DreamTalk. This includes, but is not limited to: PIRenderer, AVCT, StyleTalk, Deep3DFaceRecon_pytorch, Wav2Vec 2.0, diffusion-point-cloud, FOMM video preprocessing. We are dedicated to advancing upon these foundational works with the utmost respect for their original contributions.

Citation

If you find this codebase useful for your research, please use the following entry.

@article{ma2023dreamtalk,
  title={DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models},
  author={Ma, Yifeng and Zhang, Shiwei and Wang, Jiayu and Wang, Xiang and Zhang, Yingya and Deng, Zhidong},
  journal={arXiv preprint arXiv:2312.09767},
  year={2023}
}

