DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models
DreamTalk is a diffusion-based audio-driven expressive talking head generation framework that can produce high-quality talking head videos across diverse speaking styles. DreamTalk exhibits robust performance with a diverse array of inputs, including songs, speech in multiple languages, noisy audio, and out-of-domain portraits.
News
- [2023.12] Release inference code and pretrained checkpoint.
Install dependencies
pip install dlib
Installation
Some pre-generated files are already in the output_video folder. Run the script below and compare the results.
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Build the DreamTalk pipeline; the style and pose references are passed when it is created.
pipe = pipeline(
    task=Tasks.text_to_video_synthesis,
    model='damo/dreamtalk',
    style_clip_path="data/style_clip/3DMM/M030_front_surprised_level3_001.mat",
    pose_path="data/pose/RichardShelby_front_neutral_level1_001.mat",
    model_revision='master',
)

# Per-run inputs: audio, portrait, cropping switch, length limit, and output name.
inputs = {
    "output_name": "songbie_yk_male",
    "wav_path": "data/audio/acknowledgement_english.m4a",
    "img_crop": True,
    "image_path": "data/src_img/uncropped/male_face.png",
    "max_gen_len": 20,
}

pipe(input=inputs)
print("end")
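wav_path is the path to the input audio.
style_clip_path is the expression reference file, extracted from a video with emotional expression. It controls the facial expression of the generated video.
pose_path is the head motion reference file, extracted from a video. It controls the head motion of the generated video.
image_path is the portrait of the speaker, preferably a frontal face. Any input resolution is supported in principle; the image is cropped to $256\times256$.
max_gen_len is the maximum duration of the generated video, in seconds. Input audio longer than this is truncated.
output_name is the name of the output. The final video is written to the output_video folder and intermediate results to the tmp folder.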
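Since several of these parameters are file paths, it can help to check them before starting a run. The following is a minimal sketch, not part of DreamTalk: the helper name find_input_problems and its checks are hypothetical, and only the dictionary keys are taken from the example script.

```python
from pathlib import Path

# Keys copied from the example inputs; the helper itself is hypothetical.
PATH_KEYS = ("wav_path", "image_path")


def find_input_problems(inputs, extra_paths=()):
    """Return a list of problems; an empty list means the inputs look usable."""
    problems = []
    for key in PATH_KEYS:
        value = inputs.get(key)
        if not value:
            problems.append(f"missing key: {key}")
        elif not Path(value).is_file():
            problems.append(f"{key} does not point to a file: {value}")
    # extra_paths is meant for the style_clip_path and pose_path reference files.
    for path in extra_paths:
        if not Path(path).is_file():
            problems.append(f"reference file not found: {path}")
    max_len = inputs.get("max_gen_len")
    if not isinstance(max_len, (int, float)) or max_len <= 0:
        problems.append("max_gen_len should be a positive number of seconds")
    if not inputs.get("output_name"):
        problems.append("output_name is empty")
    return problems
```

Calling find_input_problems(inputs, extra_paths=[style_clip_path, pose_path]) before pipe(input=inputs) would surface a mistyped path early instead of partway through generation.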
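If the input image is already $256\times256$ and properly framed, so that no cropping is needed, the cropping step can be skipped with disable_img_crop (the --disable_img_crop option described in the Inference section).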
Download Checkpoints
Download the checkpoint of the denoising network:
Download the checkpoint of the renderer:
Put the downloaded checkpoints into the checkpoints folder.
Inference
Run the script:
python inference_for_demo_video.py \
--wav_path data/audio/acknowledgement_english.m4a \
--style_clip_path data/style_clip/3DMM/M030_front_neutral_level1_001.mat \
--pose_path data/pose/RichardShelby_front_neutral_level1_001.mat \
--image_path data/src_img/uncropped/male_face.png \
--cfg_scale 1.0 \
--max_gen_len 30 \
--output_name acknowledgement_english@M030_front_neutral_level1_001@male_face
wav_path specifies the input audio. Common audio formats such as wav, mp3, and m4a, as well as mp4 (video with sound), should all be compatible.
style_clip_path specifies the reference speaking style and pose_path specifies the head pose. Both are 3DMM parameter sequences extracted from reference videos. You can follow PIRenderer to extract 3DMM parameters from your own videos. Note that the video frame rate should be 25 FPS. In addition, videos used as head pose references should first be cropped to $256\times256$ using the scripts in FOMM video preprocessing.
image_path specifies the input portrait. Its resolution should be larger than $256\times256$. Frontal portraits, with the face looking straight ahead and not tilted to one side, usually give satisfactory results. The input portrait will be cropped to $256\times256$. If your portrait is already cropped to $256\times256$ and you want to disable cropping, use the --disable_img_crop option like this:
python inference_for_demo_video.py \
--wav_path data/audio/acknowledgement_chinese.m4a \
--style_clip_path data/style_clip/3DMM/M030_front_surprised_level3_001.mat \
--pose_path data/pose/RichardShelby_front_neutral_level1_001.mat \
--image_path data/src_img/cropped/zp1.png \
--disable_img_crop \
--cfg_scale 1.0 \
--max_gen_len 30 \
--output_name acknowledgement_chinese@M030_front_surprised_level3_001@zp1
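The size requirement on the portrait can be checked before running either command. The sketch below reads the width and height from a PNG header using only the standard library. The function names png_size and describe_portrait are hypothetical and not part of DreamTalk, and the check says nothing about whether the face is well framed.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
TARGET = 256  # side length the portrait is cropped to, per the description above


def png_size(path):
    """Return (width, height) read from the IHDR chunk of a PNG file."""
    with open(path, "rb") as f:
        header = f.read(24)
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type, then IHDR data.
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"not a PNG file: {path}")
    width, height = struct.unpack(">II", header[16:24])
    return width, height


def describe_portrait(width, height):
    """Classify a portrait size relative to the 256x256 target."""
    if (width, height) == (TARGET, TARGET):
        return "already 256x256"
    if width < TARGET or height < TARGET:
        return "too small"
    return "will be cropped"
```

A result of "already 256x256" only means the size matches; whether --disable_img_crop is appropriate still depends on the portrait being cropped around the face.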
cfg_scale sets the classifier-free guidance scale, which adjusts the intensity of the speaking style.
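To illustrate what a guidance scale does, here is the commonly used classifier-free guidance blend written for plain Python lists. This is a sketch of the general technique only: the README does not say exactly how DreamTalk applies it, and the function name guided_prediction and the toy numbers are made up for illustration.

```python
def guided_prediction(uncond, cond, cfg_scale):
    """Blend unconditional and style-conditioned predictions element-wise.

    Standard classifier-free guidance: uncond + cfg_scale * (cond - uncond).
    """
    return [u + cfg_scale * (c - u) for u, c in zip(uncond, cond)]


# Toy values standing in for model outputs without and with the style condition.
style_off = [0.0, 1.0]
style_on = [1.0, 3.0]

print(guided_prediction(style_off, style_on, 1.0))  # [1.0, 3.0]
print(guided_prediction(style_off, style_on, 2.0))  # [2.0, 5.0]
```

In this formulation a scale of 1.0 returns the conditioned prediction unchanged, and larger values push the result further in the direction the condition suggests, which matches the described effect of a stronger speaking style.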
max_gen_len is the maximum video generation duration, measured in seconds. If the input audio exceeds this length, it will be truncated.
The generated video will be named $(output_name).mp4 and placed in the output_video folder. Intermediate results, including the cropped portrait, will be in the tmp/$(output_name) folder.
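When running many combinations of audio, style, and portrait, the command and its expected output locations can be assembled programmatically. The sketch below uses only the flags shown in the commands above and mirrors the audio@style@portrait naming seen in their --output_name values. The helper names default_output_name and build_command are hypothetical, and the naming pattern is inferred from the examples rather than required by the script.

```python
from pathlib import Path


def default_output_name(wav_path, style_clip_path, image_path):
    """Join the three file stems with '@', as in the example commands."""
    return "@".join(Path(p).stem for p in (wav_path, style_clip_path, image_path))


def build_command(wav_path, style_clip_path, pose_path, image_path,
                  cfg_scale=1.0, max_gen_len=30, disable_img_crop=False,
                  output_name=None):
    """Return (argv list, expected video path, expected intermediate folder)."""
    name = output_name or default_output_name(wav_path, style_clip_path, image_path)
    cmd = [
        "python", "inference_for_demo_video.py",
        "--wav_path", wav_path,
        "--style_clip_path", style_clip_path,
        "--pose_path", pose_path,
        "--image_path", image_path,
    ]
    if disable_img_crop:
        cmd.append("--disable_img_crop")
    cmd += [
        "--cfg_scale", str(cfg_scale),
        "--max_gen_len", str(max_gen_len),
        "--output_name", name,
    ]
    # Output locations follow the description above.
    return cmd, Path("output_video") / f"{name}.mp4", Path("tmp") / name
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root, and the two returned paths indicate where to look for the video and the intermediate results afterwards.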
Sample inputs are provided in the data folder. Due to copyright issues, we are unable to include the songs we used in this folder.
Acknowledgements
We extend our heartfelt thanks for the invaluable contributions made by preceding works to the development of DreamTalk. This includes, but is not limited to: PIRenderer, AVCT, StyleTalk, Deep3DFaceRecon_pytorch, Wav2vec2.0, diffusion-point-cloud, and FOMM video preprocessing. We are dedicated to advancing upon these foundational works with the utmost respect for their original contributions.
Citation
If you find this codebase useful for your research, please cite it using the following entry.
@article{ma2023dreamtalk,
title={DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models},
author={Ma, Yifeng and Zhang, Shiwei and Wang, Jiayu and Wang, Xiang and Zhang, Yingya and Deng, Zhidong},
journal={arXiv preprint arXiv:2312.09767},
year={2023}
}