ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions

Di Chang1,2* Mingdeng Cao1,3* Yichun Shi1 Bo Liu1,4 Shengqu Cai1,5 Shijie Zhou6 Weilin Huang1 Gordon Wetzstein5 Mohammad Soleymani2 Peng Wang1
1 ByteDance Seed 2 University of Southern California 3 The University of Tokyo 4 University of California Berkeley 5 Stanford University 6 University of California Los Angeles
TL;DR: We propose a new benchmark for instruction-guided image editing with non-rigid motions, and show that current commercial methods from industry and state-of-the-art methods from academia are not robust to non-rigid motions.
Teaser Image
Our framework enables instruction-guided image editing with complex non-rigid motions, supporting diverse scenarios such as human articulation, object deformation, and viewpoint changes.

Abstract

Editing images with instructions to reflect non-rigid motions—camera viewpoint shifts, object deformations, human articulations, and complex interactions—poses a challenging yet underexplored problem in computer vision. Existing approaches and datasets predominantly focus on static scenes or rigid transformations, limiting their capacity to handle expressive edits involving dynamic motion. To address this gap, we introduce ByteMorph, a comprehensive framework for instruction-based image editing with an emphasis on non-rigid motions. ByteMorph comprises a large-scale dataset, ByteMorph-6M, and a strong baseline model built upon the Diffusion Transformer (DiT), named ByteMorpher. ByteMorph-6M includes over 6 million high-resolution image editing pairs for training, along with a carefully curated evaluation benchmark, ByteMorph-Bench. Both capture a wide variety of non-rigid motion types across diverse environments, human figures, and object categories. The dataset is constructed using motion-guided data generation, layered compositing techniques, and automated captioning to ensure diversity, realism, and semantic coherence. We further conduct a comprehensive evaluation of recent instruction-based image editing methods from both academic and commercial domains. The benchmark is available here.

Main Contributions

ByteMorph Dataset Construction

Our methodology leverages video generation models to produce natural and coherent transitions between source and target images, ensuring edits are both realistic and motion-consistent.

Overview of Synthetic Data Construction. Given a source frame extracted from a real video, our pipeline proceeds in three steps. a) A Vision-Language Model (VLM) composes a Motion Caption from the instruction template database to animate the given frame. b) This caption guides the video generation model Seaweed to create a natural transformation. c) We sample frames uniformly from the generated video at a fixed interval and treat each pair of neighbouring frames as an image-editing pair. The editing instruction is then re-captioned by the same VLM, along with a general description of each sampled frame (not shown in the figure).
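Step (c) amounts to slicing each generated clip into neighbouring-frame pairs. The sketch below illustrates this sampling with OpenCV; the sampling interval, file layout, and function name are illustrative choices rather than the exact pipeline configuration, and the VLM re-captioning step is only indicated by a comment.

import os
import cv2  # pip install opencv-python

def sample_editing_pairs(video_path, out_dir, interval=8):
    """Uniformly sample frames at a fixed interval and pair neighbours."""
    cap = cv2.VideoCapture(video_path)
    sampled, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % interval == 0:
            sampled.append(frame)
        idx += 1
    cap.release()

    os.makedirs(out_dir, exist_ok=True)
    pairs = []
    for i in range(len(sampled) - 1):
        src_path = os.path.join(out_dir, f"pair{i:04d}_src.png")
        tgt_path = os.path.join(out_dir, f"pair{i:04d}_tgt.png")
        cv2.imwrite(src_path, sampled[i])
        cv2.imwrite(tgt_path, sampled[i + 1])
        # In the actual pipeline, the same VLM re-captions the editing
        # instruction (and a general frame description) for every pair;
        # here we only record the file paths.
        pairs.append({"source": src_path, "target": tgt_path})
    return pairs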

Leaderboard

We run each method four times and report the average. In addition to the CLIP similarity metrics, we use Claude-3.7-Sonnet to score overall editing quality (VLM-Eval). We also ask human participants to rate instruction-following quality (Human-Eval-FL) and identity-preservation quality (Human-Eval-ID).
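As a rough illustration of the CLIP-based metrics, the sketch below computes similarity and directional scores with Hugging Face transformers. The exact definitions used by ByteMorph-Bench (e.g., whether CLIP-SIMimg compares the edit to the source image or to the ground-truth frame) are not spelled out on this page, so the formulas here are our assumptions, not the official evaluation code.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def embed_image(path):
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    with torch.no_grad():
        feat = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feat, dim=-1)

def embed_text(text):
    inputs = processor(text=[text], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        feat = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feat, dim=-1)

def clip_metrics(source_img, edited_img, gt_img, source_caption, target_caption):
    e_src, e_edit, e_gt = (embed_image(p) for p in (source_img, edited_img, gt_img))
    t_src, t_tgt = embed_text(source_caption), embed_text(target_caption)
    sim_txt = (e_edit * t_tgt).sum().item()  # edited image vs. target caption
    sim_img = (e_edit * e_src).sum().item()  # edited image vs. source image (preservation)
    # Directional scores: alignment between the embedding shift caused by the
    # edit and (i) the caption shift, (ii) the ground-truth image shift.
    d_edit = torch.nn.functional.normalize(e_edit - e_src, dim=-1)
    d_txt = torch.nn.functional.normalize(t_tgt - t_src, dim=-1)
    d_gt = torch.nn.functional.normalize(e_gt - e_src, dim=-1)
    dir_txt = (d_edit * d_txt).sum().item()  # CLIP-Dtxt (assumed definition)
    dir_img = (d_edit * d_gt).sum().item()   # CLIP-Dimg (assumed definition)
    return {"CLIP-SIMtxt": sim_txt, "CLIP-SIMimg": sim_img,
            "CLIP-Dtxt": dir_txt, "CLIP-Dimg": dir_img}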

Benchmarking Open-Sourced Models

Category Method CLIP-SIMtxt↑ CLIP-Dtxt↑ CLIP-SIMimg↑ CLIP-Dimg↑ VLM-Eval↑
Camera Zoom InstructPix2Pix 0.270 0.021 0.737 0.266 42.37
Camera Zoom MagicBrush 0.311 0.002 0.907 0.202 49.37
Camera Zoom UltraEdit (SD3) 0.299 0.000 0.864 0.249 54.74
Camera Zoom AnySD 0.309 0.001 0.911 0.182 40.92
Camera Zoom InstructMove 0.283 0.027 0.821 0.294 70.66
Camera Zoom OminiControl 0.251 0.022 0.722 0.300 45.79
Camera Zoom †InstructMove 0.301 0.045 0.846 0.425 82.29
Camera Zoom †OminiControl 0.310 0.039 0.801 0.414 74.15
Camera Zoom †ByteMorpher (Ours) 0.301 0.048 0.847 0.463 84.08
Camera Zoom GT 0.317 0.075 0.890 1.000 87.11
Camera Move InstructPix2Pix 0.318 0.010 0.709 0.200 32.20
Camera Move MagicBrush 0.317 0.009 0.913 0.195 52.63
Camera Move UltraEdit (SD3) 0.306 0.012 0.885 0.240 59.01
Camera Move AnySD 0.318 0.010 0.909 0.200 49.37
Camera Move InstructMove 0.305 0.016 0.862 0.291 74.86
Camera Move OminiControl 0.243 0.022 0.687 0.243 16.71
Camera Move †InstructMove 0.304 0.027 0.883 0.412 82.53
Camera Move †OminiControl 0.298 0.025 0.891 0.304 79.26
Camera Move †ByteMorpher (Ours) 0.319 0.041 0.894 0.426 84.18
Camera Move GT 0.320 0.039 0.915 1.000 86.37
Object Motion InstructPix2Pix 0.299 0.026 0.789 0.257 36.47
Object Motion MagicBrush 0.328 0.007 0.901 0.163 47.49
Object Motion UltraEdit (SD3) 0.324 0.012 0.887 0.237 62.13
Object Motion AnySD 0.319 0.008 0.879 0.189 48.31
Object Motion InstructMove 0.325 0.015 0.870 0.318 72.44
Object Motion OminiControl 0.279 0.023 0.753 0.270 34.11
Object Motion †InstructMove 0.328 0.043 0.891 0.481 87.97
Object Motion †OminiControl 0.330 0.036 0.892 0.470 86.48
Object Motion †ByteMorpher (Ours) 0.332 0.044 0.896 0.472 89.07
Object Motion GT 0.335 0.056 0.919 1.000 89.53
Human Motion InstructPix2Pix 0.248 0.012 0.694 0.211 23.60
Human Motion MagicBrush 0.317 0.001 0.911 0.146 46.27
Human Motion UltraEdit (SD3) 0.313 0.011 0.900 0.195 50.64
Human Motion AnySD 0.312 0.003 0.894 0.156 38.12
Human Motion InstructMove 0.308 0.013 0.861 0.278 69.43
Human Motion OminiControl 0.230 0.018 0.660 0.229 25.18
Human Motion †InstructMove 0.314 0.023 0.901 0.442 84.70
Human Motion †OminiControl 0.311 0.016 0.880 0.399 80.78
Human Motion †ByteMorpher (Ours) 0.316 0.022 0.899 0.440 85.66
Human Motion GT 0.321 0.031 0.922 1.000 86.10
Interaction InstructPix2Pix 0.271 0.020 0.732 0.263 31.29
Interaction MagicBrush 0.317 0.004 0.914 0.167 39.98
Interaction UltraEdit (SD3) 0.314 0.018 0.892 0.226 52.24
Interaction AnySD 0.315 0.005 0.909 0.173 37.23
Interaction InstructMove 0.309 0.019 0.855 0.318 67.07
Interaction OminiControl 0.258 0.021 0.689 0.265 32.99
Interaction †InstructMove 0.314 0.043 0.885 0.477 85.83
Interaction †OminiControl 0.295 0.041 0.768 0.433 78.90
Interaction †ByteMorpher (Ours) 0.320 0.045 0.884 0.483 86.61
Interaction GT 0.324 0.046 0.905 1.000 88.84

Quantitative evaluation of open-sourced methods on ByteMorph-Bench. † indicates the method is trained on ByteMorph-6M. Best results are in bold, second best are underlined.

Benchmarking Industrial Models - Editing Category: Camera Zoom

Organization Method CLIP-SIMtxt↑ CLIP-Dtxt↑ CLIP-SIMimg↑ CLIP-Dimg↑ VLM-Eval↑ Human-Eval-FL↑ Human-Eval-ID↑
StepFun AI Step1X-Edit 0.310 0.025 0.943 0.258 59.34 26.60 48.86
HiDream.ai HiDream-E1-FULL 0.304 0.027 0.682 0.287 41.18 33.00 16.50
Google Imagen-3-capability 0.293 0.025 0.846 0.264 53.94 61.34 41.38
Google Gemini-2.0-flash-image 0.305 0.031 0.862 0.297 72.27 61.04 63.09
ByteDance SeedEdit 1.6 0.311 0.029 0.827 0.325 75.00 61.34 83.60
OpenAI GPT-4o-image 0.317 0.015 0.832 0.337 88.14 89.36 61.09
ByteDance BAGEL 0.300 0.031 0.860 0.301 75.55 - -
Black Forest Labs Flux-Kontext-pro 0.312 0.024 0.864 0.334 75.66 - -
Black Forest Labs Flux-Kontext-max 0.307 0.032 0.871 0.373 80.18 - -
ByteDance SeedEdit 3.0 0.296 0.027 0.833 0.370 88.25 - -
ByteDance ByteMorpher (Ours) 0.301 0.048 0.847 0.463 84.08 61.13 74.73
- GT 0.317 0.075 0.890 1.000 87.11 - -

Quantitative results for Camera Zoom. Best results are in bold, second best are underlined.

Benchmarking Industrial Models - Editing Category: Camera Move

Organization Method CLIP-SIMtxt↑ CLIP-Dtxt↑ CLIP-SIMimg↑ CLIP-Dimg↑ VLM-Eval↑ Human-Eval-FL↑ Human-Eval-ID↑
StepFun AI Step1X-Edit 0.315 0.008 0.946 0.208 57.96 33.50 63.39
HiDream.ai HiDream-E1-FULL 0.309 0.029 0.712 0.252 32.76 16.50 18.22
Google Imagen-3-capability 0.282 0.010 0.813 0.238 47.22 17.38 26.51
Google Gemini-2.0-flash-image 0.317 0.020 0.892 0.311 77.96 56.60 75.76
ByteDance SeedEdit 1.6 0.314 0.015 0.866 0.253 78.59 58.30 87.78
OpenAI GPT-4o-image 0.321 0.011 0.865 0.285 84.57 76.74 59.14
ByteDance BAGEL 0.306 0.026 0.883 0.290 76.08 - -
Black Forest Labs Flux-Kontext-pro 0.312 0.016 0.891 0.286 79.14 - -
Black Forest Labs Flux-Kontext-max 0.315 0.019 0.896 0.325 85.97 - -
ByteDance SeedEdit 3.0 0.308 0.020 0.887 0.278 78.00 - -
ByteDance ByteMorpher (Ours) 0.319 0.041 0.894 0.426 84.18 67.60 58.25
- GT 0.320 0.039 0.915 1.000 86.37 - -

Quantitative results for Camera Move. Best results are in bold, second best are underlined.

Benchmarking Industrial Models - Editing Category: Object Motion

Organization Method CLIP-SIMtxt↑ CLIP-Dtxt↑ CLIP-SIMimg↑ CLIP-Dimg↑ VLM-Eval↑ Human-Eval-FL↑ Human-Eval-ID↑
StepFun AI Step1X-Edit 0.323 0.019 0.923 0.260 72.78 72.16 59.39
HiDream.ai HiDream-E1-FULL 0.312 0.028 0.700 0.259 35.00 44.34 49.75
Google Imagen-3-capability 0.324 0.027 0.870 0.261 57.06 62.56 77.84
Google Gemini-2.0-flash-image 0.333 0.040 0.892 0.341 79.08 74.77 86.62
ByteDance SeedEdit 1.6 0.332 0.025 0.874 0.323 80.21 66.50 79.12
OpenAI GPT-4o-image 0.339 0.029 0.861 0.354 90.60 75.19 49.91
ByteDance BAGEL 0.324 0.036 0.920 0.326 74.07 - -
Black Forest Labs Flux-Kontext-pro 0.321 0.018 0.893 0.314 78.41 - -
Black Forest Labs Flux-Kontext-max 0.325 0.025 0.888 0.353 80.42 - -
ByteDance SeedEdit 3.0 0.321 0.036 0.905 0.344 88.11 - -
ByteDance ByteMorpher (Ours) 0.332 0.044 0.896 0.472 89.07 62.16 58.25
- GT 0.335 0.056 0.919 1.000 89.53 - -

Quantitative results for Object Motion. Best results are in bold, second best are underlined.

Benchmarking Industrial Models - Editing Category: Human Motion

Organization Method CLIP-SIMtxt↑ CLIP-Dtxt↑ CLIP-SIMimg↑ CLIP-Dimg↑ VLM-Eval↑ Human-Eval-FL↑ Human-Eval-ID↑
StepFun AI Step1X-Edit 0.315 0.017 0.931 0.212 65.39 44.50 78.80
HiDream.ai HiDream-E1-FULL 0.301 0.017 0.676 0.215 33.21 12.51 38.66
Google Imagen-3-capability 0.295 0.017 0.840 0.233 55.70 33.34 61.17
Google Gemini-2.0-flash-image 0.314 0.017 0.893 0.282 78.72 51.84 63.34
ByteDance SeedEdit 1.6 0.324 0.024 0.878 0.274 80.62 56.23 72.12
OpenAI GPT-4o-image 0.316 0.021 0.850 0.330 87.93 87.56 57.84
ByteDance BAGEL 0.312 0.021 0.929 0.242 74.36 - -
Black Forest Labs Flux-Kontext-pro 0.314 0.017 0.918 0.283 79.15 - -
Black Forest Labs Flux-Kontext-max 0.316 0.016 0.908 0.307 80.78 - -
ByteDance SeedEdit 3.0 0.313 0.025 0.903 0.343 88.13 - -
ByteDance ByteMorpher (Ours) 0.316 0.022 0.899 0.440 85.66 68.38 75.00
- GT 0.321 0.031 0.922 1.000 86.10 - -

Quantitative results for Human Motion. Best results are in bold, second best are underlined.

Benchmarking Industrial Models - Editing Category: Interaction

Organization Method CLIP-SIMtxt↑ CLIP-Dtxt↑ CLIP-SIMimg↑ CLIP-Dimg↑ VLM-Eval↑ Human-Eval-FL↑ Human-Eval-ID↑
StepFun AI Step1X-Edit 0.312 0.020 0.937 0.245 65.99 36.09 64.56
HiDream.ai HiDream-E1-FULL 0.307 0.019 0.679 0.251 35.73 10.60 38.66
Google Imagen-3-capability 0.307 0.023 0.863 0.254 54.78 47.16 61.59
Google Gemini-2.0-flash-image 0.316 0.027 0.889 0.327 76.86 60.70 77.94
ByteDance SeedEdit 1.6 0.326 0.032 0.878 0.316 78.27 49.78 80.10
OpenAI GPT-4o-image 0.318 0.031 0.851 0.351 88.65 81.17 73.72
ByteDance BAGEL 0.312 0.037 0.913 0.301 73.16 - -
Black Forest Labs Flux-Kontext-pro 0.313 0.028 0.898 0.318 78.58 - -
Black Forest Labs Flux-Kontext-max 0.320 0.032 0.894 0.335 80.12 - -
ByteDance SeedEdit 3.0 0.312 0.036 0.894 0.371 86.07 - -
ByteDance ByteMorpher (Ours) 0.320 0.045 0.884 0.483 86.61 69.15 64.73
- GT 0.324 0.046 0.905 1.000 88.84 - -

Quantitative results for Interaction. Best results are in bold, second best are underlined.

Benchmarking Industrial Models: Qualitative Comparison

Qualitative comparison of industrial instruction-guided image editing models on the ByteMorph-Bench benchmark. Our method achieves superior performance across various non-rigid motion scenarios.

Ablation Study

We fine-tune OminiControl and InstructMove on our training set. Both models exhibit notable gains across key metrics after fine-tuning. The following qualitative results demonstrate that InstructMove trained on our dataset achieves substantially better instruction-following ability, particularly for non-rigid motion edits.

Ablation Study
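
For reference, a minimal way to expose the (source image, instruction, target image) triplets to such a fine-tuning run is sketched below. The metadata file format and field names are hypothetical placeholders, not the released ByteMorph-6M schema, and the actual ByteMorpher training code is not reproduced here.

import json
from PIL import Image
from torch.utils.data import Dataset

class EditingPairDataset(Dataset):
    """Wraps (source, instruction, target) triplets for instruction-guided editing."""

    def __init__(self, metadata_path, transform=None):
        # One JSON record per editing pair, e.g.
        # {"source": ".../src.png", "target": ".../tgt.png", "instruction": "..."}
        with open(metadata_path) as f:
            self.records = [json.loads(line) for line in f]
        self.transform = transform

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        source = Image.open(rec["source"]).convert("RGB")
        target = Image.open(rec["target"]).convert("RGB")
        if self.transform is not None:
            source, target = self.transform(source), self.transform(target)
        # The editing model is conditioned on the source image and the text
        # instruction and supervised to reconstruct the target frame.
        return {"source": source, "target": target, "instruction": rec["instruction"]}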

Bibtex

@article{chang2025bytemorph,
  title={ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions},
  author={Chang, Di and Cao, Mingdeng and Shi, Yichun and Liu, Bo and Cai, Shengqu and Zhou, Shijie and Huang, Weilin and Wetzstein, Gordon and Soleymani, Mohammad and Wang, Peng},
  journal={arXiv preprint arXiv:2506.03107},
  year={2025}
}

License

Our dataset ByteMorph-6M and evaluation benchmark ByteMorph-Bench are released under the CC0-1.0 (Creative Commons Zero v1.0 Universal) license. The baseline model ByteMorpher, including code and weights, is released under the FLUX.1-dev Non-Commercial License.

Disclaimer

Your access to and use of this dataset are at your own risk. We do not guarantee the accuracy of this dataset. The dataset is provided “as is” and we make no warranty or representation to you with respect to it, and we expressly disclaim, and hereby expressly waive, all warranties, express, implied, statutory or otherwise. This includes, without limitation, warranties of quality, performance, merchantability or fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not known or discoverable. In no event will we be liable to you on any legal theory (including, without limitation, negligence) or otherwise for any direct, special, indirect, incidental, consequential, punitive, exemplary, or other losses, costs, expenses, or damages arising out of this public license or use of the licensed material. The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability.