TL;DR: We enable zero-shot human-object interaction editing.
Orange indicates the target interaction.
[Teaser: interaction edits among dribble ball, hold ball, kick ball, throw ball, and among hold skateboard, sit on skateboard, jump skateboard, ride skateboard.]
[Qualitative comparison against NTI, InstructPix2Pix, SVDiff, InfEdit, CDS, TurboEdit, and Break-A-Scene on six edits: walk dog → wash dog; walk dog → hug dog; blow cake → hold cake; blow cake → eat cake; hold skateboard → jump skateboard; hold skateboard → ride skateboard.]
This paper presents InteractEdit, a novel framework for zero-shot Human-Object Interaction (HOI) editing, which addresses the challenging task of transforming an existing interaction in an image into a new, desired interaction while preserving the identities of the subject and object. Unlike simpler image editing tasks such as attribute manipulation, object replacement, or style transfer, HOI editing must handle the complex spatial, contextual, and relational dependencies inherent in human-object interactions. Existing methods often overfit to the structure of the source image, limiting their ability to make the substantial structural changes that new interactions demand. To address this, InteractEdit decomposes each scene into subject, object, and background components, then employs Low-Rank Adaptation (LoRA) and selective fine-tuning to preserve pretrained interaction priors while learning the visual identity of the source image. This regularization strategy effectively balances interaction edits with identity consistency. We further introduce IEBench, the most comprehensive benchmark for HOI editing to date, which evaluates both interaction editing and identity preservation. Our extensive experiments show that InteractEdit significantly outperforms existing methods, establishing a strong baseline for future HOI editing research and unlocking new possibilities for creative and practical applications.
HOI components are disassembled into subject, object, and background cues during inversion. LoRA regularization enables non-rigid edits by capturing essential attributes while ignoring fine-grained structural details. Selective fine-tuning preserves interaction priors while adapting to the source image's identity. Editing then reassembles these components with the target interaction, using the trained LoRA weights to guide the diffusion model.
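To make the regularization idea concrete, here is a minimal numpy sketch of a LoRA-adapted linear layer. This is an illustrative toy, not the paper's implementation: the class name, rank, and scaling hyperparameters are assumptions. The key property it demonstrates is that the pretrained weight stays frozen and only a rank-r residual is trainable, so the adapter can absorb coarse identity attributes without the capacity to memorize fine-grained source structure.

```python
import numpy as np

class LoRALinear:
    """Toy low-rank adaptation of a frozen linear layer (illustrative only).

    The pretrained weight W is frozen; only the low-rank factors A and B
    are trained, so the learned update B @ A has rank at most `rank`.
    A small rank acts as a regularizer, limiting how much source-image
    structure the adapter can memorize.
    """

    def __init__(self, weight: np.ndarray, rank: int = 4, alpha: float = 4.0,
                 seed: int = 0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                           # frozen pretrained weight
        self.scale = alpha / rank                      # common LoRA scaling
        self.A = rng.normal(0, 0.02, (rank, in_dim))   # trainable down-projection
        self.B = np.zeros((out_dim, rank))             # trainable up-projection, zero init

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # Frozen base path plus low-rank residual. Because B is zero at
        # initialization, the layer starts out identical to the pretrained one.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

# At initialization the adapted layer reproduces the frozen base layer exactly.
W = np.eye(8)
layer = LoRALinear(W, rank=2)
x = np.ones((1, 8))
assert np.allclose(layer(x), x @ W.T)
```

Zero-initializing B is the standard LoRA trick that lets fine-tuning start from the pretrained behavior and drift only as far as the low-rank update allows.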
If you use our work in your research, please cite:
@misc{hoe2025interactedit,
  title={InteractEdit: Zero-Shot Editing of Human-Object Interactions in Images},
  author={Jiun Tian Hoe and Weipeng Hu and Wei Zhou and Chao Xie and Ziwei Wang and Chee Seng Chan and Xudong Jiang and Yap-Peng Tan},
  year={2025},
  eprint={2503.09130},
  archivePrefix={arXiv},
  primaryClass={cs.GR},
  url={https://arxiv.org/abs/2503.09130},
}