Dyadic Interaction Modeling
for Social Behavior Generation

ECCV 2024
University of Southern California
*Equal Contribution    

    Portraits source: ViCo Dataset

Overview



We propose Dyadic Interaction Modeling (DIM), a pre-training strategy that jointly models speakers’ and listeners’ motions and learns representations that capture the dyadic context. We then initialize DIM-Listener with the pre-trained weights and feed it multimodal inputs from the speaker; DIM-Listener generates listener motion that can be rendered into photorealistic videos.

Pretraining: Dyadic Interaction Modeling



Dyadic Interaction Modeling learns a unified speaker-listener representation from dyadic interactions. The framework takes both the ground-truth speaker motion \(s\) and the listener motion \(l\) as input. The speaker and listener VQ-Encoders then encode the motions into discrete units (discrete latent codes) \(z^{(s)}\) and \(z^{(l)}\). The masked speaker and listener motions are further encoded and concatenated, and a unified representation is learned with a contrastive loss. The unified representation is then split and, together with the speaker audio feature \(a\), decoded into discrete unit predictions \(z'^{(s)}\) and \(z'^{(l)}\), supervised by a cross-entropy loss. Finally, the generated speaker motion \(s'\) and listener motion \(l'\) are decoded from these discrete unit predictions to optimize the reconstruction loss.
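
The sketch below illustrates this pre-training objective with simplified stand-in modules (linear encoders/decoders and a toy VQ codebook); the module names, dimensions, masking ratio, and loss forms are illustrative assumptions, not the released implementation.

# Minimal sketch of the DIM pre-training losses, assuming toy stand-in modules.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, D, K = 8, 32, 64, 128                  # batch, sequence length, feature dim, codebook size

class ToyVQEncoder(nn.Module):
    """Encodes a motion sequence and maps it to discrete codebook indices (toy VQ stand-in)."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(D, D)
        self.codebook = nn.Embedding(K, D)
    def forward(self, motion):               # motion: (B, T, D)
        h = self.enc(motion)
        dist = torch.cdist(h, self.codebook.weight)   # (B, T, K) distances to codebook entries
        z = dist.argmin(-1)                   # discrete units z
        return h, z

speaker_vq, listener_vq = ToyVQEncoder(), ToyVQEncoder()
joint_encoder = nn.Linear(D, D)               # stand-in for the masked joint encoder
unit_head = nn.Linear(D, K)                   # predicts discrete units z'
motion_decoder = nn.Linear(D, D)              # VQ-Decoder stand-in reconstructing motion

s = torch.randn(B, T, D)                      # ground-truth speaker motion s
l = torch.randn(B, T, D)                      # ground-truth listener motion l

# 1) Encode both streams to continuous features and discrete units.
h_s, z_s = speaker_vq(s)
h_l, z_l = listener_vq(l)

# 2) Randomly mask parts of both streams, encode, and concatenate into a
#    unified speaker-listener representation (audio feature a omitted here).
mask = (torch.rand(B, T, 1) < 0.5).float()
u = joint_encoder(torch.cat([h_s * mask, h_l * mask], dim=1))   # (B, 2T, D)
u_s, u_l = u[:, :T], u[:, T:]

# 3) Contrastive loss pulling matched speaker/listener clips together
#    (a simple sequence-level InfoNCE, purely as an illustration).
logits = F.normalize(u_s.mean(1), dim=-1) @ F.normalize(u_l.mean(1), dim=-1).T
loss_contrastive = F.cross_entropy(logits / 0.07, torch.arange(B))

# 4) Cross-entropy on predicted discrete units z' against the VQ targets z.
loss_units = F.cross_entropy(unit_head(u_s).reshape(-1, K), z_s.reshape(-1)) \
           + F.cross_entropy(unit_head(u_l).reshape(-1, K), z_l.reshape(-1))

# 5) Reconstruction loss on decoded motions s' and l'.
loss_recon = F.l1_loss(motion_decoder(u_s), s) + F.l1_loss(motion_decoder(u_l), l)

loss = loss_contrastive + loss_units + loss_recon
loss.backward()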

DIM-Listener



For fine-tuning on listener motion generation, the speaker input is not masked and the listener input is entirely masked. We train the framework with the same cross-entropy and reconstruction losses from Dyadic Interaction Modeling while keeping the weights of the listener VQ-Encoder fixed.
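
The following is a minimal sketch of this fine-tuning setup; the key points are that the listener stream is fully masked, the speaker stream is not, and the listener VQ-Encoder is frozen. All module names and shapes are illustrative assumptions.

# Minimal sketch of the DIM-Listener fine-tuning setup (illustrative stand-ins).
import torch
import torch.nn as nn

B, T, D = 4, 32, 64

listener_vq_encoder = nn.Linear(D, D)         # pre-trained, kept fixed during fine-tuning
joint_encoder = nn.Linear(D, D)               # fine-tuned

# Freeze the listener VQ-Encoder weights.
for p in listener_vq_encoder.parameters():
    p.requires_grad = False

speaker_feat = torch.randn(B, T, D)           # encoded speaker motion + audio (unmasked)
listener_feat = torch.zeros(B, T, D)          # listener input entirely masked

unified = joint_encoder(torch.cat([speaker_feat, listener_feat], dim=1))
listener_pred = unified[:, T:]                # decoded into listener units / motion and trained
                                              # with the same cross-entropy and reconstruction
                                              # losses as in pre-training

# Only the trainable (non-frozen) parameters are updated.
optimizer = torch.optim.Adam(
    [p for p in joint_encoder.parameters() if p.requires_grad], lr=1e-4)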

DIM-Speaker



Since our dyadic framework captures the contextualized interactions between listeners and speakers, it can also animate speaker motion from the speaker's speech input. For speech-driven speaker generation, we fine-tune the pre-trained weights of the joint decoder and the speaker VQ-Decoder from Dyadic Interaction Modeling. The generated speaker motion from DIM-Speaker is optimized with the same cross-entropy and reconstruction losses as DIM-Listener.
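
A short sketch of which parameter groups are updated for DIM-Speaker follows; the speech feature extractor and module names are illustrative assumptions, not the released code.

# Minimal sketch of the DIM-Speaker fine-tuning setup (illustrative stand-ins).
import torch
import torch.nn as nn

B, T, D = 4, 32, 64

audio_encoder = nn.Linear(D, D)               # frozen speech feature extractor (assumption)
joint_decoder = nn.Linear(D, D)               # fine-tuned from pre-trained weights
speaker_vq_decoder = nn.Linear(D, D)          # fine-tuned from pre-trained weights

for p in audio_encoder.parameters():
    p.requires_grad = False

audio_feat = torch.randn(B, T, D)             # speaker speech features a
speaker_motion_pred = speaker_vq_decoder(joint_decoder(audio_encoder(audio_feat)))

# Optimized with the same cross-entropy (on discrete units) and reconstruction
# losses as DIM-Listener, updating only the decoder parameters.
optimizer = torch.optim.Adam(
    list(joint_decoder.parameters()) + list(speaker_vq_decoder.parameters()), lr=1e-4)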

Photorealistic Visualizations on ViCo

    Portraits source: ViCo Dataset

Quantitative Comparisons

ViCo



Quantitative comparisons of DIM with recent SOTA methods RLHG, L2L, and ELP. ↓ indicates lower is better. Methods marked with * did not release code, so we re-implemented them from scratch. † indicates the method was pre-trained on the CANDOR dataset.


LM_Listener



Quantitative comparisons of DIM with recent SOTA methods RLHG, L2L, and ELP. ↓ indicates lower is better. Methods marked with * did not release code, so we re-implemented them from scratch. † indicates the method was pre-trained on the CANDOR dataset.

BibTeX

@article{tran2024dyadic,
  title={Dyadic Interaction Modeling for Social Behavior Generation},
  author={Tran, Minh and Chang, Di and Siniukov, Maksim and Soleymani, Mohammad},
  journal={arXiv preprint arXiv:2403.09069},
  year={2024}
}

Acknowledgement

This work is supported by the National Science Foundation under Grant No. 2211550. The work was also sponsored by the Army Research Office and was accomplished under Cooperative Agreement Number W911NF-20-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
We thank Haiwen Feng, Quankai Gao, and Hongyi Xu for their suggestions and discussions.