HairWeaver: Few-Shot Photorealistic Hair Motion Synthesis with Sim-to-Real Guided Video Diffusion

Meta      University of Southern California

Abstract

We present HairWeaver, a diffusion-based pipeline that animates a single human image with realistic and expressive hair dynamics. While existing methods successfully control body pose, they lack specific control over hair and consequently fail to capture intricate hair motion, producing stiff and unrealistic animations. HairWeaver overcomes this limitation with two specialized modules: a Motion-Context-LoRA that integrates motion conditions and a Sim2Real-Domain-LoRA that preserves the subject's photoreal appearance across data domains. These lightweight components guide a video diffusion backbone while maintaining its core generative capabilities. By training on a specialized dataset of dynamic human motion generated with a CG simulator, HairWeaver affords fine-grained control over hair motion and learns to produce highly realistic hair that responds naturally to movement. Comprehensive evaluations demonstrate that our approach sets a new state of the art, producing lifelike human hair animations with unprecedented dynamic detail.
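Both adapters follow the familiar LoRA pattern of a frozen base weight plus a trainable low-rank residual. As a rough illustration, below is a minimal PyTorch sketch of such an adapter; the rank, scaling, and choice of wrapped layers are our assumptions, not details taken from the paper.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank residual:
    y = W x + (alpha / rank) * B(A(x)). Hypothetical defaults, for illustration."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # the backbone weight stays frozen
            p.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B
        nn.init.zeros_(self.up.weight)     # adapter starts as an identity residual
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

Because each adapter is a separate residual on a frozen backbone, it can be trained in one stage, frozen in the next, and removed entirely at inference, which is exactly how the pipeline handles the Sim2Real-Domain-LoRA.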

Method


Figure 2. Overview of the HairWeaver pipeline. a) We use a CG simulator to generate data samples, each containing a video with motion Vgt, a static reference image, a pose condition, and a hair condition. b) In training stage 1, we use a diffusion transformer (DiT) as the backbone model and pre-train the Sim2Real-Domain-LoRA. This stage is trained in an image-to-video manner with a reference image and a text prompt as input. c) In training stage 2, we freeze the Sim2Real-Domain-LoRA and finetune the Motion-Context-LoRA with the pose condition and hair condition as additional guidance. d) During inference, the Sim2Real-Domain-LoRA is discarded, and the trained model generates photorealistic human videos with hair and body motion, taking a photorealistic reference image and CG pose & hair conditions as input. e) Details of the model architecture in c). The Pose Encoder injects the body motion as a trainable residual added to the noisy latent. The hair motion is encoded by a frozen VAE encoder and provided as additional attention context to the DiT blocks. The only trainable modules are the Pose Encoder and the Motion-Context-LoRA.
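To make the conditioning paths in e) concrete, the sketch below is an assumed, simplified PyTorch rendering of one denoising step: the trainable Pose Encoder adds body motion as a residual on the noisy latent, while hair motion is encoded by a frozen encoder and attended to as extra context tokens in each block. All class names, shapes, and the stub attention block are placeholders, not the paper's actual interfaces; the timestep embedding is omitted for brevity.

import torch
import torch.nn as nn

class DiTBlockStub(nn.Module):
    """Stand-in for a frozen DiT block (with Motion-Context-LoRA); the real
    blocks would attend over the latent with hair tokens as extra key/values."""

    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.ff = nn.Linear(dim, dim)

    def forward(self, z: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        kv = torch.cat([z, context], dim=1)   # hair tokens join the attention context
        out, _ = self.attn(z, kv, kv)
        return z + self.ff(out)

class HairWeaverStep(nn.Module):
    def __init__(self, dim: int = 64, n_blocks: int = 2):
        super().__init__()
        self.pose_encoder = nn.Linear(dim, dim)   # trainable residual encoder
        self.vae_encoder = nn.Linear(dim, dim)    # stand-in for the frozen VAE encoder
        for p in self.vae_encoder.parameters():
            p.requires_grad_(False)
        self.blocks = nn.ModuleList(DiTBlockStub(dim) for _ in range(n_blocks))

    def forward(self, noisy_latent, pose_tokens, hair_tokens):
        # Body motion enters as a trainable residual on the noisy latent.
        z = noisy_latent + self.pose_encoder(pose_tokens)
        # Hair motion is encoded (frozen) and attended to as extra context.
        hair_ctx = self.vae_encoder(hair_tokens)
        for block in self.blocks:
            z = block(z, hair_ctx)
        return z

# Usage with toy token shapes (batch=1, 16 latent tokens, 8 hair tokens):
# step = HairWeaverStep()
# out = step(torch.randn(1, 16, 64), torch.randn(1, 16, 64), torch.randn(1, 8, 64))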

Results

Comparison to Previous Works

Photorealistic Video Animation


CG: Source Video from the Simulator, used to provide the Conditions.
Generation: Photorealistic Generation by HairWeaver.
w/ Conditions: Photorealistic Generation together with the Corresponding Conditions.

More Visualizations


From left to right: Reference Image, Body Condition, Hair Condition, Generation.