A Think-then-Place paradigm that leverages MLLM chain-of-thought reasoning to orchestrate physically plausible and visually natural video object insertion.
Modern video editing techniques achieve high visual fidelity when inserting objects into videos. However, they optimize for visual fidelity rather than physical causality, producing edits that are physically inconsistent with their environment. In this work, we present Place-it-R1, an end-to-end framework for video object insertion that unlocks the physics-aware reasoning potential of Multimodal Large Language Models (MLLMs). Our framework leverages the Chain-of-Thought (CoT) reasoning of MLLMs to orchestrate video diffusion, following a Think-then-Place paradigm. To bridge cognitive reasoning and generative execution, we introduce three key innovations. First, the MLLM performs physical scene understanding and interaction reasoning, generating environment-aware chain-of-thought tokens and inferring valid insertion regions that explicitly guide the diffusion model toward physically plausible insertion. Second, we introduce MLLM-guided Spatial Direct Preference Optimization (DPO), in which diffusion outputs are fed back to the MLLM for scoring, improving visual naturalness. Third, during inference, the MLLM iteratively triggers refinement cycles and elicits adaptive adjustments from the diffusion model, forming a closed loop that progressively enhances editing quality. Extensive experiments demonstrate that Place-it-R1 achieves more physically coherent video object insertion than state-of-the-art solutions and commercial models.
The MLLM conducts hierarchical reasoning (Analysis → Revision → Planning) and generates environment-aware CoT tokens with automatic insertion trajectories to guide the diffusion model.
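As an illustration only (the paper does not specify its interface here), the planning stage can be thought of as emitting a structured plan whose insertion trajectory is parsed into per-frame boxes that condition the diffusion model. All field names and the JSON format below are hypothetical:

```python
import json

# Hypothetical MLLM plan: the three reasoning stages plus a trajectory of
# per-frame bounding boxes (x, y, w, h) marking the valid insertion region.
PLAN = '''
{
  "analysis": "the table surface is flat and unoccluded",
  "revision": "shift right to avoid the region occluded by the vase",
  "planning": {"trajectory": [[120, 200, 64, 64], [122, 201, 64, 64]]}
}
'''

def parse_insertion_trajectory(plan_text):
    """Extract per-frame insertion boxes from a (hypothetical) MLLM plan."""
    plan = json.loads(plan_text)
    return [tuple(box) for box in plan["planning"]["trajectory"]]

boxes = parse_insertion_trajectory(PLAN)  # one (x, y, w, h) box per frame
```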
Diffusion outputs are scored by the MLLM to construct DPO preference pairs; Spatial DPO then applies fine-grained optimization within the insertion regions to improve visual naturalness.
During inference, the MLLM iteratively evaluates generation quality and triggers refinement cycles, forming a closed loop that progressively enhances editing quality within 2–3 iterations.
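The closed loop amounts to a generate–score–refine cycle capped at a few iterations. A schematic sketch with stand-in components; the function names and the acceptance threshold are our assumptions, not the paper's API:

```python
def refine_loop(generate, score, init_plan, max_iters=3, accept=4.0):
    """Generate -> MLLM-score -> refine, stopping once quality is acceptable."""
    plan, result = init_plan, None
    for _ in range(max_iters):
        result = generate(plan)            # diffusion edit under current plan
        quality, feedback = score(result)  # MLLM evaluation of the edit
        if quality >= accept:              # good enough: stop refining
            break
        plan = feedback                    # otherwise adopt the adjusted plan
    return result

# Demo with stubs: the second attempt clears the quality bar, so the loop
# returns that attempt's output without using the final iteration.
scores = iter([(3.0, "plan-v2"), (4.5, "plan-v3")])
final = refine_loop(generate=lambda p: p,
                    score=lambda r: next(scores),
                    init_plan="plan-v1")
```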
Quantitative comparison. Identity: CLIP-I, DINO-I; Video Quality: Smooth., Aesth.; Physics: PC, PR, PP.

| Benchmark | Method | CLIP-I ↑ | DINO-I ↑ | Smooth. ↑ | Aesth. ↑ | PC ↑ | PR ↑ | PP ↑ |
|---|---|---|---|---|---|---|---|---|
| UNIC | UNIC | 0.5980 | 0.2450 | 0.9610 | 0.5627 | 4.20 | / | 5.33 |
| | Kling (commercial) | 0.6203 | 0.2509 | 0.9540 | 0.5641 | 4.41 | / | 5.93 |
| | PIKA (commercial) | 0.6862 | 0.3752 | 0.9944 | 0.6151 | 4.34 | / | 6.11 |
| | Lucy-edit pro (commercial) | 0.6021 | 0.2629 | 0.9865 | 0.5693 | 4.28 | / | 5.79 |
| | Place-it-R1 (standard) | 0.6043 | 0.2897 | 0.9928 | 0.5684 | 4.53 | / | 6.21 |
| | Place-it-R1 (flexible) | 0.6040 | 0.2895 | 0.9919 | 0.5787 | 4.60 | / | 6.63 |
| FlexInsert | AnyV2V + Anydoor | 0.7853 | 0.3805 | 0.9853 | 0.4833 | 3.87 | 0.66 | 3.38 |
| | VACE + Traj (w/o CoT) | 0.7285 | 0.2541 | 0.9913 | 0.4920 | 4.03 | 0.67 | 5.21 |
| | Place-it-R1 (standard) | 0.7941 | 0.4917 | 0.9918 | 0.5294 | 4.13 | 0.78 | 7.28 |
| | Place-it-R1 (flexible) | 0.7938 | 0.4925 | 0.9906 | 0.5305 | 4.17 | 0.86 | 7.93 |
| HumanSync | VACE | 0.7553 | 0.4210 | 0.9908 | 0.4952 | 4.12 | 0.91 | 6.21 |
| | Place-it-R1 (standard) | 0.7631 | 0.4497 | 0.9929 | 0.5283 | 4.33 | 0.92 | 6.58 |
| | Place-it-R1 (flexible) | 0.7632 | 0.4500 | 0.9926 | 0.5295 | 4.37 | 0.92 | 6.93 |
Three-way preference selection by 10 independent annotators.
@inproceedings{gu2026placeitr1,
  title={Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion},
  author={Gu, Bohai and Wu, Taiyi and Du, Dazhao and Yang, Shuai and Zhao, Xiaotong and Zhao, Alan and Guo, Song},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}