In this work, we propose MagicPose, a diffusion-based model for 2D human motion and facial expression transfer on challenging human dance videos. Specifically, we aim to generate human dance videos of any target identity driven by novel pose sequences while keeping the identity unchanged. To this end, we propose a two-stage training strategy to disentangle human motion and appearance (e.g., facial expressions, skin tone, and clothing), consisting of pretraining an appearance-control block and fine-tuning an appearance-pose joint-control block on human dance poses from the same dataset. Our novel design enables robust appearance control with a temporally consistent upper body, facial attributes, and even background. By leveraging the prior knowledge of image diffusion models, the model also generalizes well to unseen human identities and complex motion sequences without any fine-tuning on additional data with diverse human attributes. Moreover, the proposed model is easy to use and can serve as a plug-in module/extension to Stable Diffusion. We also demonstrate the model's ability for zero-shot 2D animation generation, enabling not only appearance transfer from one identity to another but also cartoon-like stylization given only pose inputs. Extensive experiments demonstrate our superior performance on the TikTok dataset.
Overview of the proposed MagicPose pipeline for controllable human dance video generation with motion and facial expression transfer. The Appearance Control Model is a copy of the entire Stable Diffusion UNet, initialized with the same weights. The Stable Diffusion UNet is frozen throughout training. During a) Appearance Control Pretraining, we train the Appearance Control Model and its Multi-Source Self-Attention Module. During b) Appearance-Disentangled Pose Control, we jointly fine-tune the Appearance Control Model, initialized with the weights from a), and the Pose ControlNet. After these steps, we freeze all previously trained modules and fine-tune a motion module initialized from AnimateDiff.
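For readers who want a concrete picture of how the two control branches relate to the frozen backbone, the following is a minimal sketch using the diffusers library. The variable names (base_unet, appearance_net, pose_controlnet) are illustrative and not taken from the released code; this only mirrors the high-level structure described above, not the authors' actual training loop.

# Minimal sketch of the two-stage control setup (illustrative, not the official implementation).
import copy
from diffusers import UNet2DConditionModel, ControlNetModel

# Frozen Stable Diffusion UNet used as the denoising backbone.
base_unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
base_unet.requires_grad_(False)

# Stage a) Appearance Control Pretraining:
# the appearance branch is a trainable copy of the full UNet,
# initialized with the same weights.
appearance_net = copy.deepcopy(base_unet)
appearance_net.requires_grad_(True)
# (The Multi-Source Self-Attention Module that fuses appearance features
# into the frozen UNet's attention layers is also trained in this stage.)

# Stage b) Appearance-Disentangled Pose Control:
# jointly fine-tune the appearance branch (from stage a) together with
# a Pose ControlNet conditioned on pose skeletons and face landmarks.
pose_controlnet = ControlNetModel.from_unet(base_unet)
pose_controlnet.requires_grad_(True)

trainable = list(appearance_net.parameters()) + list(pose_controlnet.parameters())
print(f"trainable parameter tensors: {len(trainable)}")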
Visualization of Human Motion and Facial Expression Transfer. MagicPose generates vivid and realistic motion and expressions conditioned on diverse pose skeletons and face landmarks, while accurately preserving the identity from the reference image.
Visualization of Zero-Shot 2D Animation Generation. MagicPose faithfully preserves the identity of cartoon-style reference images without any further fine-tuning, despite being trained only on real-human dance videos.
Qualitative comparison of human video generation among TPS, Disco, and MagicPose. Previous methods suffer from inconsistent facial expressions and poor identity preservation.
Qualitative comparison of zero-shot motion transfer on out-of-domain images.
Qualitative comparison of zero-shot motion transfer on out-of-domain images against Animate Anyone. Animate Anyone fails to generate a steady and consistent background, while MagicPose does.
Qualitative comparison of MagicPose to MagicAnimate on Facial Expression Editing. MagicAnimate fails to generate diverse facial expressions, while MagicPose is able to.
Qualitative comparison of MagicPose to MagicAnimate on in-the-wild pose retargeting. MagicAnimate fails to generate the back of the human subject, while MagicPose is able to.
Quantitative comparisons of MagicPose with the recent SOTA methods DreamPose and Disco. ↓ indicates that lower is better, and vice versa. Methods with * directly use the target image as input, which contains more information than the OpenPose skeleton. † indicates that Disco is pre-trained on additional datasets beyond those used by our proposed MagicPose, which uses only 335 video sequences from the TikTok dataset for pretraining and fine-tuning. Face-Cos denotes the cosine similarity of the face region between the generated and ground-truth images.
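As a rough illustration of the Face-Cos metric, the sketch below computes a cosine similarity between face-region tensors. How the face region is detected, cropped, and represented is not specified here; the helper name face_cos and the placeholder inputs are hypothetical.

# Illustrative Face-Cos computation (assumed form, not the official evaluation code).
import torch
import torch.nn.functional as F

def face_cos(gen_face: torch.Tensor, gt_face: torch.Tensor) -> float:
    """Cosine similarity between flattened face-region tensors."""
    return F.cosine_similarity(gen_face.flatten(), gt_face.flatten(), dim=0).item()

# Placeholder face crops standing in for the generated and ground-truth face regions.
gen_face = torch.rand(3, 112, 112)
gt_face = torch.rand(3, 112, 112)
print(f"Face-Cos: {face_cos(gen_face, gt_face):.4f}")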
@misc{chang2024magicpose,
title={MagicPose: Realistic Human Poses and Facial Expressions Retargeting with Identity-aware Diffusion},
author={Di Chang and Yichun Shi and Quankai Gao and Jessica Fu and Hongyi Xu and Guoxian Song and Qing Yan and Yizhe Zhu and Xiao Yang and Mohammad Soleymani},
year={2024},
eprint={2311.12052},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
This work was sponsored by the Army Research Office and was accomplished under Cooperative Agreement Number W911NF-20-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.