ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions

Di Chang1,2* Mingdeng Cao1,3* Yichun Shi1 Bo Liu1,4 Shengqu Cai1,5 Shijie Zhou6 Weilin Huang1 Gordon Wetzstein5 Mohammad Soleymani2 Peng Wang1
1 ByteDance Seed 2 University of Southern California 3 The University of Tokyo 4 University of California Berkeley 5 Stanford University 6 University of California Los Angeles
TL;DR: We propose a new benchmark for instruction-guided image editing with non-rigid motions, and show that current commercial methods from industry and SOTA methods from academia are not robust to such motions.
Our framework enables instruction-guided image editing with complex non-rigid motions, supporting diverse scenarios such as human articulation, object deformation, and viewpoint changes.

Abstract

Editing images with instructions to reflect non-rigid motions (camera viewpoint shifts, object deformations, human articulations, and complex interactions) poses a challenging yet underexplored problem in computer vision. Existing approaches and datasets predominantly focus on static scenes or rigid transformations, limiting their capacity to handle expressive edits involving dynamic motion. To address this gap, we introduce ByteMorph, a comprehensive framework for instruction-based image editing with an emphasis on non-rigid motions. ByteMorph comprises a large-scale dataset, ByteMorph-6M, and a strong baseline model built upon the Diffusion Transformer (DiT), named ByteMorpher. ByteMorph-6M includes over 6 million high-resolution image editing pairs for training, along with a carefully curated evaluation benchmark, ByteMorph-Bench. Both capture a wide variety of non-rigid motion types across diverse environments, human figures, and object categories. The dataset is constructed using motion-guided data generation, layered compositing techniques, and automated captioning to ensure diversity, realism, and semantic coherence. We further conduct a comprehensive evaluation of recent instruction-based image editing methods from both academic and commercial domains. The benchmark is available here.

Main Contributions

ByteMorph Dataset Construction

Our methodology leverages video generation models to produce natural and coherent transitions between source and target images, ensuring edits are both realistic and motion-consistent.

Overview of Synthetic Data Construction. Given a source frame extracted from a real video, our pipeline proceeds in three steps. a) A Vision-Language Model (VLM) creates a Motion Caption from the instruction template database to animate the given frame. b) This caption guides a video generation model, Seaweed, to create a natural transformation. c) We sample frames uniformly from the generated dynamic video at a fixed interval and treat each pair of neighbouring frames as an image editing pair. The same VLM then re-captions the editing instruction, as well as the general description of each sampled frame (not shown in the figure).
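To make step (c) concrete, a minimal sketch of the frame-pair sampling could look like the following; the sampling interval, file handling, and helper name are illustrative assumptions, not the released pipeline code.

```python
# Sketch of step (c): turn a generated video into image-editing pairs by
# sampling frames at a fixed interval and pairing neighbouring samples.
# The interval value here is an assumption for illustration.
import cv2

def sample_editing_pairs(video_path: str, interval: int = 16):
    """Return (source_frame, target_frame) pairs of neighbouring sampled frames."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % interval == 0:  # uniform sampling with a fixed interval
            frames.append(frame)
        idx += 1
    cap.release()
    # Each pair of neighbouring sampled frames becomes one editing example;
    # a VLM would then re-caption the editing instruction for every pair.
    return list(zip(frames[:-1], frames[1:]))
```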

Benchmarking Results

We run each method four times and report the average scores. In addition to CLIP similarity metrics, we use Claude-3.7-Sonnet to evaluate overall editing quality (VLM-Eval). We also ask human participants to rate instruction-following quality (Human-Eval-FL) and identity-preservation quality (Human-Eval-ID).
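For reference, below is a minimal sketch of how such CLIP-based metrics can be computed with the Hugging Face transformers CLIP model. The exact pairings used by ByteMorph-Bench are not spelled out on this page, so the definitions in the comments (e.g., CLIP-SIMtxt as edited-image-to-target-caption similarity) are common conventions and should be read as assumptions.

```python
# Hedged sketch of CLIP similarity / directional metrics; the precise
# pairings in ByteMorph-Bench may differ, so treat these as assumptions.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_image(img: Image.Image) -> torch.Tensor:
    inputs = processor(images=img, return_tensors="pt")
    return F.normalize(model.get_image_features(**inputs), dim=-1)

@torch.no_grad()
def embed_text(text: str) -> torch.Tensor:
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    return F.normalize(model.get_text_features(**inputs), dim=-1)

def clip_metrics(src_img, edit_img, src_caption, tgt_caption):
    s, e = embed_image(src_img), embed_image(edit_img)
    ts, tt = embed_text(src_caption), embed_text(tgt_caption)
    sim_txt = (e @ tt.T).item()                          # CLIP-SIMtxt (assumed)
    d_txt = F.cosine_similarity(e - s, tt - ts).item()   # CLIP-Dtxt (assumed)
    sim_img = (e @ s.T).item()                           # CLIP-SIMimg (assumed)
    # CLIP-Dimg (not computed here) presumably compares (e - s) with the
    # ground-truth edit direction, which is why GT scores 1.000 in the tables.
    return sim_txt, d_txt, sim_img
```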

Benchmarking Open-Sourced Models
Camera Zoom
Method CLIP-SIMtxt CLIP-Dtxt CLIP-SIMimg CLIP-Dimg VLM-Eval↑
InstructPix2Pix 0.270 0.021 0.737 0.266 42.37
MagicBrush 0.311 0.002 0.907 0.202 49.37
UltraEdit (SD3) 0.299 0.000 0.864 0.249 54.74
AnySD 0.309 0.001 0.911 0.182 40.92
InstructMove 0.283 0.027 0.821 0.294 70.66
OminiControl 0.251 0.022 0.722 0.300 45.79
†InstructMove 0.301 0.045 0.846 0.425 82.29
†OminiControl 0.310 0.039 0.801 0.414 74.15
†ByteMorpher (Ours) 0.301 0.048 0.847 0.463 84.08
GT 0.317 0.075 0.890 1.000 87.11

Camera Move
Method CLIP-SIMtxt CLIP-Dtxt CLIP-SIMimg CLIP-Dimg VLM-Eval↑
InstructPix2Pix 0.318 0.010 0.709 0.200 32.20
MagicBrush 0.317 0.009 0.913 0.195 52.63
UltraEdit (SD3) 0.306 0.012 0.885 0.240 59.01
AnySD 0.318 0.010 0.909 0.200 49.37
InstructMove 0.305 0.016 0.862 0.291 74.86
OminiControl 0.243 0.022 0.687 0.243 16.71
†InstructMove 0.304 0.027 0.883 0.412 82.53
†OminiControl 0.298 0.025 0.891 0.304 79.26
†ByteMorpher (Ours) 0.319 0.041 0.894 0.426 84.18
GT 0.320 0.039 0.915 1.000 86.37

Object Motion
Method CLIP-SIMtxt CLIP-Dtxt CLIP-SIMimg CLIP-Dimg VLM-Eval↑
InstructPix2Pix 0.299 0.026 0.789 0.257 36.47
MagicBrush 0.328 0.007 0.901 0.163 47.49
UltraEdit (SD3) 0.324 0.012 0.887 0.237 62.13
AnySD 0.319 0.008 0.879 0.189 48.31
InstructMove 0.325 0.015 0.870 0.318 72.44
OminiControl 0.279 0.023 0.753 0.270 34.11
†InstructMove 0.328 0.043 0.891 0.481 87.97
†OminiControl 0.330 0.036 0.892 0.470 86.48
†ByteMorpher (Ours) 0.332 0.044 0.896 0.472 89.07
GT 0.335 0.056 0.919 1.000 89.53

Human Motion
Method CLIP-SIMtxt CLIP-Dtxt CLIP-SIMimg CLIP-Dimg VLM-Eval↑
InstructPix2Pix 0.248 0.012 0.694 0.211 23.60
MagicBrush 0.317 0.001 0.911 0.146 46.27
UltraEdit (SD3) 0.313 0.011 0.900 0.195 50.64
AnySD 0.312 0.003 0.894 0.156 38.12
InstructMove 0.308 0.013 0.861 0.278 69.43
OminiControl 0.230 0.018 0.660 0.229 25.18
†InstructMove 0.314 0.023 0.901 0.442 84.70
†OminiControl 0.311 0.016 0.880 0.399 80.78
†ByteMorpher (Ours) 0.316 0.022 0.899 0.440 85.66
GT 0.321 0.031 0.922 1.000 86.10

Interaction
Method CLIP-SIMtxt CLIP-Dtxt CLIP-SIMimg CLIP-Dimg VLM-Eval↑
InstructPix2Pix 0.271 0.020 0.732 0.263 31.29
MagicBrush 0.317 0.004 0.914 0.167 39.98
UltraEdit (SD3) 0.314 0.018 0.892 0.226 52.24
AnySD 0.315 0.005 0.909 0.173 37.23
InstructMove 0.309 0.019 0.855 0.318 67.07
OminiControl 0.258 0.021 0.689 0.265 32.99
†InstructMove 0.314 0.043 0.885 0.477 85.83
†OminiControl 0.295 0.041 0.768 0.433 78.90
†ByteMorpher (Ours) 0.320 0.045 0.884 0.483 86.61
GT 0.324 0.046 0.905 1.000 88.84

Quantitative evaluation of open-sourced methods on ByteMorph-Bench. † indicates the method is trained on ByteMorph-6M. Best results are in bold, second best are underlined.

Benchmarking Industrial Models: Camera Zoom

Method CLIP-SIMtxt CLIP-Dtxt CLIP-SIMimg CLIP-Dimg VLM-Eval↑ Human-Eval-FL↑ Human-Eval-ID↑
Step1X-Edit 0.310 0.025 0.943 0.258 59.34 26.60 48.86
HiDream-E1-FULL 0.304 0.027 0.682 0.287 41.18 33.00 16.50
Imagen-3-capability 0.293 0.025 0.846 0.264 53.94 61.34 41.38
Gemini-2.0-flash-image 0.305 0.031 0.862 0.297 72.27 61.04 63.09
SeedEdit 1.6 0.311 0.029 0.827 0.325 75.00 61.34 83.60
GPT-4o-image 0.317 0.015 0.832 0.337 88.14 89.36 61.09
BAGEL 0.300 0.031 0.860 0.301 75.55 - -
Flux-Kontext-pro 0.312 0.024 0.864 0.334 75.66 - -
Flux-Kontext-max 0.307 0.032 0.871 0.373 80.18 - -
ByteMorpher (Ours) 0.301 0.048 0.847 0.463 84.08 61.13 74.73
GT 0.317 0.075 0.890 1.000 87.11 - -

Quantitative results for Camera Zoom. Best results are in bold, second best are underlined.

Benchmarking Industrial Models: Camera Move

Method CLIP-SIMtxt CLIP-Dtxt CLIP-SIMimg CLIP-Dimg VLM-Eval↑ Human-Eval-FL↑ Human-Eval-ID↑
Step1X-Edit 0.315 0.008 0.946 0.208 57.96 33.50 63.39
HiDream-E1-FULL 0.309 0.029 0.712 0.252 32.76 16.50 18.22
Imagen-3-capability 0.282 0.010 0.813 0.238 47.22 17.38 26.51
Gemini-2.0-flash-image 0.317 0.020 0.892 0.311 77.96 56.60 75.76
SeedEdit 1.6 0.314 0.015 0.866 0.253 78.59 58.30 87.78
GPT-4o-image 0.321 0.011 0.865 0.285 84.57 76.74 59.14
BAGEL 0.306 0.026 0.883 0.290 76.08 - -
Flux-Kontext-pro 0.312 0.016 0.891 0.286 79.14 - -
Flux-Kontext-max 0.315 0.019 0.896 0.325 85.97 - -
ByteMorpher (Ours) 0.319 0.041 0.894 0.426 84.18 67.60 58.25
GT 0.320 0.039 0.915 1.000 86.37 - -

Quantitative results for Camera Move. Best results are in bold, second best are underlined.

Benchmarking Industrial Models: Object Motion

Method CLIP-SIMtxt CLIP-Dtxt CLIP-SIMimg CLIP-Dimg VLM-Eval↑ Human-Eval-FL↑ Human-Eval-ID↑
Step1X-Edit 0.323 0.019 0.923 0.260 72.78 72.16 59.39
HiDream-E1-FULL 0.312 0.028 0.700 0.259 35.00 44.34 49.75
Imagen-3-capability 0.324 0.027 0.870 0.261 57.06 62.56 77.84
Gemini-2.0-flash-image 0.333 0.040 0.892 0.341 79.08 74.77 86.62
SeedEdit 1.6 0.332 0.025 0.874 0.323 80.21 66.50 79.12
GPT-4o-image 0.339 0.029 0.861 0.354 90.60 75.19 49.91
BAGEL 0.324 0.036 0.920 0.326 74.07 - -
Flux-Kontext-pro 0.321 0.018 0.893 0.314 78.41 - -
Flux-Kontext-max 0.325 0.025 0.888 0.353 80.42 - -
ByteMorpher (Ours) 0.332 0.044 0.896 0.472 89.07 62.16 58.25
GT 0.335 0.056 0.919 1.000 89.53 - -

Quantitative results for Object Motion. Best results are in bold, second best are underlined.

Benchmarking Industrial Models: Human Motion

Method CLIP-SIMtxt CLIP-Dtxt CLIP-SIMimg CLIP-Dimg VLM-Eval↑ Human-Eval-FL↑ Human-Eval-ID↑
Step1X-Edit 0.315 0.017 0.931 0.212 65.39 44.50 78.80
HiDream-E1-FULL 0.301 0.017 0.676 0.215 33.21 12.51 38.66
Imagen-3-capability 0.295 0.017 0.840 0.233 55.70 33.34 61.17
Gemini-2.0-flash-image 0.314 0.017 0.893 0.282 78.72 51.84 63.34
SeedEdit 1.6 0.324 0.024 0.878 0.274 80.62 56.23 72.12
GPT-4o-image 0.316 0.021 0.850 0.330 87.93 87.56 57.84
BAGEL 0.312 0.021 0.929 0.242 74.36 - -
Flux-Kontext-pro 0.314 0.017 0.918 0.283 79.15 - -
Flux-Kontext-max 0.316 0.016 0.908 0.307 80.78 - -
ByteMorpher (Ours) 0.316 0.022 0.899 0.440 85.66 68.38 75.00
GT 0.321 0.031 0.922 1.000 86.10 - -

Quantitative results for Human Motion. Best results are in bold, second best are underlined.

Benchmarking Industrial Models: Interaction

Method CLIP-SIMtxt CLIP-Dtxt CLIP-SIMimg CLIP-Dimg VLM-Eval↑ Human-Eval-FL↑ Human-Eval-ID↑
Step1X-Edit 0.312 0.020 0.937 0.245 65.99 36.09 64.56
HiDream-E1-FULL 0.307 0.019 0.679 0.251 35.73 10.60 38.66
Imagen-3-capability 0.307 0.023 0.863 0.254 54.78 47.16 61.59
Gemini-2.0-flash-image 0.316 0.027 0.889 0.327 76.86 60.70 77.94
SeedEdit 1.6 0.326 0.032 0.878 0.316 78.27 49.78 80.10
GPT-4o-image 0.318 0.031 0.851 0.351 88.65 81.17 73.72
BAGEL 0.312 0.037 0.913 0.301 73.16 - -
Flux-Kontext-pro 0.313 0.028 0.898 0.318 78.58 - -
Flux-Kontext-max 0.320 0.032 0.894 0.335 80.12 - -
ByteMorpher (Ours) 0.320 0.045 0.884 0.483 86.61 69.15 64.73
GT 0.324 0.046 0.905 1.000 88.84 - -

Quantitative results for Interaction. Best results are in bold, second best are underlined.

Benchmarking Industrial Models: Qualitative Comparison

Qualitative comparison of industrial instruction-guided image editing models on the ByteMorph-Bench benchmark. Our method achieves superior performance across various non-rigid motion scenarios.

Ablation Study

We fine-tune OminiControl and InstructMove on our training set. Both models exhibit notable gains across key metrics after fine-tuning, and qualitative results demonstrate that InstructMove trained on our dataset achieves substantially better instruction-following ability, particularly for non-rigid motion edits. A rough sketch of this setup is given below.
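Since the training code itself is not shown on this page, the following is only a hedged illustration of such a fine-tuning loop over (source image, instruction, target image) triples; edit_model and its training_loss method are hypothetical placeholders, and the hyperparameters are assumptions.

```python
# Illustrative fine-tuning loop over (source, instruction, target) triples.
# `edit_model` and its `training_loss` method are hypothetical placeholders,
# not the actual ByteMorph / OminiControl / InstructMove training code.
import torch
from torch.utils.data import DataLoader, Dataset

def finetune(edit_model: torch.nn.Module, dataset: Dataset,
             epochs: int = 1, lr: float = 1e-5, device: str = "cuda"):
    edit_model.to(device).train()
    opt = torch.optim.AdamW(edit_model.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=4, shuffle=True)
    for _ in range(epochs):
        for batch in loader:
            src = batch["source"].to(device)   # source frame tensor
            tgt = batch["target"].to(device)   # edited (target) frame tensor
            # Diffusion-based editors are commonly trained to denoise the
            # target image conditioned on the source image + instruction.
            loss = edit_model.training_loss(src, tgt, batch["instruction"])
            opt.zero_grad()
            loss.backward()
            opt.step()
    return edit_model
```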


Bibtex

Coming soon...

License

Our dataset ByteMorph-6M and evaluation benchmark ByteMorph-Bench are released under the CC0 1.0 Universal (Creative Commons Zero) License. The baseline model ByteMorpher, including code and weights, is released under the FLUX.1-dev Non-Commercial License.

Disclaimer

Your access to and use of this dataset are at your own risk. We do not guarantee the accuracy of this dataset. The dataset is provided “as is” and we make no warranty or representation to you with respect to it and we expressly disclaim, and hereby expressly waive, all warranties, express, implied, statutory or otherwise. This includes, without limitation, warranties of quality, performance, merchantability or fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not known or discoverable. In no event will we be liable to you on any legal theory (including, without limitation, negligence) or otherwise for any direct, special, indirect, incidental, consequential, punitive, exemplary, or other losses, costs, expenses, or damages arising out of this public license or use of the licensed material. The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability.