MagicPose: Realistic Human Poses
and Facial Expressions Retargeting
with Identity-aware Diffusion

¹University of Southern California, ²ByteDance Inc.


We propose MagicPose, a novel and effective approach to realistic human video generation that enables vivid motion and facial-expression transfer, as well as consistent in-the-wild zero-shot generation without any fine-tuning. MagicPose precisely generates appearance-consistent results, whereas the original text-to-image models (e.g., Stable Diffusion and ControlNet) can hardly maintain the subject's identity accurately. Furthermore, our proposed modules can be treated as an extension/plug-in to the original text-to-image model without modifying its pre-trained weights.

Abstract

In this work, we propose MagicPose, a diffusion-based model for 2D human motion and facial expression transfer on challenging human dance videos. Specifically, we aim to generate human dance videos of any target identity driven by novel pose sequences while keeping the identity unchanged. To this end, we propose a two-stage training strategy to disentangle human motion and appearance (e.g., facial expressions, skin tone, and clothing): pretraining of an appearance-control block, followed by fine-tuning of an appearance-pose-joint-control block over human dance poses from the same dataset. Our novel design enables robust appearance control with a temporally consistent upper body, facial attributes, and even background. By leveraging the prior knowledge of image diffusion models, the model also generalizes well to unseen human identities and complex motion sequences without fine-tuning on any additional data with diverse human attributes. Moreover, the proposed model is easy to use and can be considered a plug-in module/extension to Stable Diffusion. We also demonstrate the model's ability for zero-shot 2D animation generation, enabling not only appearance transfer from one identity to another but also cartoon-like stylization given only pose inputs. Extensive experiments demonstrate our superior performance on the TikTok dataset.
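
To make the two-stage schedule concrete, below is a minimal PyTorch-style sketch of the training flow described above: the Stable Diffusion UNet stays frozen, an appearance branch (a trainable copy of that UNet) is pretrained first, and the appearance branch plus a pose ControlNet are then fine-tuned jointly. The ToyUNet class, loss, and all tensor shapes are our own illustrative placeholders, not the released implementation.

    import copy
    import torch
    import torch.nn as nn

    class ToyUNet(nn.Module):
        """Drastically simplified stand-in for the Stable Diffusion UNet."""
        def __init__(self, dim=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(4, dim, 3, padding=1), nn.SiLU(), nn.Conv2d(dim, 4, 3, padding=1))
        def forward(self, x, residual=None):
            out = self.net(x)
            return out + residual if residual is not None else out

    sd_unet = ToyUNet().requires_grad_(False)   # frozen throughout training

    # Stage a) Appearance Control Pretraining: the appearance branch is a
    # trainable copy of the frozen UNet, initialized with the same weights.
    appearance = copy.deepcopy(sd_unet).requires_grad_(True)
    opt_a = torch.optim.AdamW(appearance.parameters(), lr=1e-5)
    # (A stage-a training loop would step opt_a on a denoising loss.)

    # Stage b) Appearance-disentangled Pose Control: jointly fine-tune the
    # appearance branch from stage a) together with a pose ControlNet.
    pose_controlnet = ToyUNet()                 # stand-in for the Pose ControlNet
    opt_b = torch.optim.AdamW(
        list(appearance.parameters()) + list(pose_controlnet.parameters()), lr=1e-5)

    # One illustrative stage-b step, with random tensors in place of real latents:
    noisy_latent, noise = torch.randn(1, 4, 32, 32), torch.randn(1, 4, 32, 32)
    guidance = pose_controlnet(noisy_latent) + appearance(noisy_latent)
    pred = sd_unet(noisy_latent, residual=guidance)
    loss = nn.functional.mse_loss(pred, noise)  # standard noise-prediction loss
    opt_b.zero_grad(); loss.backward(); opt_b.step()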

Video

Method



Overview of the proposed MagicPose pipeline for controllable human dance video generation with motion & facial-expression transfer. The Appearance Control Model is a copy of the entire Stable Diffusion UNet, initialized with the same weights. The Stable Diffusion UNet is frozen throughout training. During a) Appearance Control Pretraining, we train the Appearance Control Model and its Multi-Source Self-Attention Module. During b) Appearance-disentangled Pose Control, we jointly fine-tune the Appearance Control Model, initialized with the weights from a), and the Pose ControlNet. After these steps, we freeze all previously trained modules and fine-tune a motion module initialized from AnimateDiff.
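
The Multi-Source Self-Attention Module mentioned above can be read as self-attention in which the generation branch attends over its own tokens concatenated with tokens from the Appearance Control Model, letting appearance features flow into the generated frames. The sketch below is a hedged, self-contained approximation under that reading; the class name, shapes, and the use of nn.MultiheadAttention are our assumptions rather than the paper's exact layer.

    import torch
    import torch.nn as nn

    class MultiSourceSelfAttention(nn.Module):
        """Self-attention whose keys/values come from both the generation
        branch and the appearance (reference) branch at the same layer."""
        def __init__(self, dim=320, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, gen_tokens, ref_tokens):
            # gen_tokens: (B, N, C) from the main UNet;
            # ref_tokens: (B, M, C) from the Appearance Control Model.
            kv = torch.cat([gen_tokens, ref_tokens], dim=1)  # concatenated sources
            out, _ = self.attn(query=gen_tokens, key=kv, value=kv)
            return out

    # Example: two 16x16 feature maps flattened to 256 tokens of width 320.
    layer = MultiSourceSelfAttention()
    gen = torch.randn(2, 256, 320)   # generation-branch tokens
    ref = torch.randn(2, 256, 320)   # appearance-branch tokens
    print(layer(gen, ref).shape)     # torch.Size([2, 256, 320])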


Results

1. Human Motion and Facial Expression Transfer


Visualization of Human Motion and Facial Expression Transfer. MagicPose generates vivid and realistic motion and expressions conditioned on diverse pose-skeleton and face-landmark inputs, while accurately preserving the identity information from the reference image.


Video Results

Columns: GT, Pose, TPS, Disco, MagicPose.



2. Zero-Shot Animation



Visualization of Zero-Shot 2D Animation Generation. MagicPose precisely preserves identity information from cartoon-style reference images without any further fine-tuning, despite being trained only on real-human dance videos.


Video Results

Columns: Condition, Reference, Generated video.


Comparison to Recent Works

Qualitative Comparison


Columns: GT, Pose, TPS, Disco, MagicPose.


Qualitative comparison of human video generation between TPS, Disco, and MagicPose. Previous methods clearly suffer from inconsistent facial expressions and loss of subject identity.




Qualitative comparison of zero-shot motion transfer on out-of-domain images.


Comparison to Animate Anyone

Columns: Reference, Pose, Animate Anyone, MagicPose.


Qualitative comparison of zero-shot motion transfer on out-of-domain images against Animate Anyone. Animate Anyone fails to generate a steady and consistent background, while MagicPose is able to.


Comparison to MagicAnimate



Qualitative comparison of MagicPose to MagicAnimate on Facial Expression Editing. MagicAnimate fails to generate diverse facial expressions, while MagicPose is able to.




Qualitative comparison of MagicPose to MagicAnimate on in-the-wild pose retargeting. MagicAnimate fails to generate the back of the human subject, while MagicPose is able to.


Quantitative Comparison



Quantitative comparison of MagicPose with the recent SOTA methods DreamPose and Disco. ↓ indicates lower is better, and vice versa. Methods marked with * directly use the target image as input, which contains more information than the OpenPose skeleton. † indicates that Disco is pre-trained on additional datasets beyond those used by MagicPose, which uses only 335 video sequences from the TikTok dataset for pretraining and fine-tuning. Face-Cos denotes the cosine similarity of the face area between the generated and ground-truth images.
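
For readers who want to reproduce a Face-Cos style number, here is a small sketch that crops the face region from both images and takes the cosine similarity of the flattened crops. The face_box argument stands for a hypothetical pre-computed face bounding box, and comparing raw pixel crops is our simplifying assumption; the paper's metric may operate on extracted face features instead.

    import torch
    import torch.nn.functional as F

    def face_cos(generated, ground_truth, face_box):
        """generated / ground_truth: (3, H, W) images in [0, 1].
        face_box: (top, left, height, width) of the detected face region."""
        t, l, h, w = face_box
        gen_face = generated[:, t:t + h, l:l + w].flatten()
        gt_face = ground_truth[:, t:t + h, l:l + w].flatten()
        return F.cosine_similarity(gen_face, gt_face, dim=0).item()

    # Toy usage with random images and a hypothetical face box:
    gen, gt = torch.rand(3, 256, 256), torch.rand(3, 256, 256)
    print(face_cos(gen, gt, face_box=(40, 80, 96, 96)))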

BibTeX

@misc{chang2024magicpose,
      title={MagicPose: Realistic Human Poses and Facial Expressions Retargeting with Identity-aware Diffusion}, 
      author={Di Chang and Yichun Shi and Quankai Gao and Jessica Fu and Hongyi Xu and Guoxian Song and Qing Yan and Yizhe Zhu and Xiao Yang and Mohammad Soleymani},
      year={2024},
      eprint={2311.12052},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}