A Think-then-Place paradigm that leverages MLLM chain-of-thought reasoning to orchestrate physically plausible and visually natural video object insertion.
Modern video editing techniques achieve high visual fidelity when inserting objects into videos. However, they optimize for visual fidelity rather than physical causality, producing edits that are physically inconsistent with their environment. In this work, we present Place-it-R1, an end-to-end framework for video object insertion that unlocks the physics-aware reasoning potential of Multimodal Large Language Models (MLLMs). Our framework leverages the Chain-of-Thought (CoT) reasoning of MLLMs to orchestrate video diffusion, following a Think-then-Place paradigm. To bridge cognitive reasoning and generative execution, we introduce three key innovations. First, the MLLM performs physical scene understanding and interaction reasoning, generating environment-aware chain-of-thought tokens and inferring valid insertion regions that explicitly guide the diffusion model toward physically plausible insertion. Second, we introduce MLLM-guided Spatial Direct Preference Optimization (DPO), in which diffusion outputs are fed back to the MLLM for scoring, improving visual naturalness. Third, during inference, the MLLM iteratively triggers refinement cycles and elicits adaptive adjustments from the diffusion model, forming a closed loop that progressively enhances editing quality. Extensive experiments demonstrate that Place-it-R1 achieves more physically coherent video object insertion than state-of-the-art solutions and commercial models.
The MLLM conducts hierarchical reasoning (Analysis → Revision → Planning) and generates environment-aware CoT tokens with automatic insertion trajectories to guide the diffusion model.
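As an illustration only (the paper does not specify its interface here), the planning stage can be thought of as emitting a structured plan whose insertion trajectory is parsed into per-frame boxes that condition the diffusion model. All field names and the JSON format below are hypothetical:

```python
import json

# Hypothetical MLLM plan: the three reasoning stages plus a trajectory of
# per-frame bounding boxes (x, y, w, h) marking the valid insertion region.
PLAN = '''
{
  "analysis": "the table surface is flat and unoccluded",
  "revision": "shift right to avoid the region occluded by the vase",
  "planning": {"trajectory": [[120, 200, 64, 64], [122, 201, 64, 64]]}
}
'''

def parse_insertion_trajectory(plan_text):
    """Extract per-frame insertion boxes from a (hypothetical) MLLM plan."""
    plan = json.loads(plan_text)
    return [tuple(box) for box in plan["planning"]["trajectory"]]

boxes = parse_insertion_trajectory(PLAN)  # one (x, y, w, h) box per frame
```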
Diffusion outputs are scored by the MLLM to construct DPO preference pairs; Spatial DPO then applies fine-grained optimization within the insertion regions to improve visual naturalness.
During inference, the MLLM iteratively evaluates generation quality and triggers refinement cycles, forming a closed loop that progressively enhances editing quality within 2–3 iterations.
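The closed loop amounts to a generate–score–refine cycle capped at a few iterations. A schematic sketch with stand-in components; the function names and the acceptance threshold are our assumptions, not the paper's API:

```python
def refine_loop(generate, score, init_plan, max_iters=3, accept=4.0):
    """Generate -> MLLM-score -> refine, stopping once quality is acceptable."""
    plan, result = init_plan, None
    for _ in range(max_iters):
        result = generate(plan)            # diffusion edit under current plan
        quality, feedback = score(result)  # MLLM evaluation of the edit
        if quality >= accept:              # good enough: stop refining
            break
        plan = feedback                    # otherwise adopt the adjusted plan
    return result

# Demo with stubs: the second attempt clears the quality bar, so the loop
# returns that attempt's output without using the final iteration.
scores = iter([(3.0, "plan-v2"), (4.5, "plan-v3")])
final = refine_loop(generate=lambda p: p,
                    score=lambda r: next(scores),
                    init_plan="plan-v1")
```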
Quantitative comparison. Identity: CLIP-I, DINO-I; Video Quality: Smooth., Aesth.; Physics: PC, PR, PP.

| Benchmark | Method | CLIP-I ↑ | DINO-I ↑ | Smooth. ↑ | Aesth. ↑ | PC ↑ | PR ↑ | PP ↑ |
|---|---|---|---|---|---|---|---|---|
| UNIC | UNIC | 0.5980 | 0.2450 | 0.9610 | 0.5627 | 4.20 | / | 5.33 |
| | Kling (commercial) | 0.6203 | 0.2509 | 0.9540 | 0.5641 | 4.41 | / | 5.93 |
| | PIKA (commercial) | 0.6862 | 0.3752 | 0.9944 | 0.6151 | 4.34 | / | 6.11 |
| | Lucy-edit pro (commercial) | 0.6021 | 0.2629 | 0.9865 | 0.5693 | 4.28 | / | 5.79 |
| | Place-it-R1 (standard) | 0.6043 | 0.2897 | 0.9928 | 0.5684 | 4.53 | / | 6.21 |
| | Place-it-R1 (flexible) | 0.6040 | 0.2895 | 0.9919 | 0.5787 | 4.60 | / | 6.63 |
| FlexInsert | AnyV2V + Anydoor | 0.7853 | 0.3805 | 0.9853 | 0.4833 | 3.87 | 0.66 | 3.38 |
| | VACE + Traj (w/o CoT) | 0.7285 | 0.2541 | 0.9913 | 0.4920 | 4.03 | 0.67 | 5.21 |
| | Place-it-R1 (standard) | 0.7941 | 0.4917 | 0.9918 | 0.5294 | 4.13 | 0.78 | 7.28 |
| | Place-it-R1 (flexible) | 0.7938 | 0.4925 | 0.9906 | 0.5305 | 4.17 | 0.86 | 7.93 |
| HumanSync | VACE | 0.7553 | 0.4210 | 0.9908 | 0.4952 | 4.12 | 0.91 | 6.21 |
| | Place-it-R1 (standard) | 0.7631 | 0.4497 | 0.9929 | 0.5283 | 4.33 | 0.92 | 6.58 |
| | Place-it-R1 (flexible) | 0.7632 | 0.4500 | 0.9926 | 0.5295 | 4.37 | 0.92 | 6.93 |
Three-way preference selection by 10 independent annotators.
@inproceedings{gu2026placeitr1,
  title={Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion},
  author={Gu, Bohai and Wu, Taiyi and Du, Dazhao and Yang, Shuai and Zhao, Xiaotong and Zhao, Alan and Guo, Song},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}