Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion

A Think-then-Place paradigm that leverages MLLM chain-of-thought reasoning to orchestrate physically plausible and visually natural video object insertion.

1 HKUST   2 Tencent Video   3 Peking University

† Corresponding author

Abstract

Modern video editing techniques achieve high visual fidelity when inserting objects into videos, but they optimize for visual fidelity rather than physical causality, producing edits that are physically inconsistent with their environment. In this work, we present Place-it-R1, an end-to-end framework for video object insertion that unlocks the physics-aware reasoning potential of Multimodal Large Language Models (MLLMs). Our framework leverages the Chain-of-Thought (CoT) reasoning of MLLMs to orchestrate video diffusion, following a Think-then-Place paradigm. To bridge cognitive reasoning and generative execution, we introduce three key innovations. First, the MLLM performs physical scene understanding and interaction reasoning, generating environment-aware chain-of-thought tokens and inferring valid insertion regions that explicitly guide the diffusion model toward physically plausible insertion. Second, we introduce MLLM-guided Spatial Direct Preference Optimization (DPO), in which diffusion outputs are fed back to the MLLM for scoring, improving visual naturalness. Third, during inference the MLLM iteratively triggers refinement cycles and elicits adaptive adjustments from the diffusion model, forming a closed loop that progressively enhances editing quality. Extensive experiments demonstrate that Place-it-R1 achieves more physically coherent video object insertion than state-of-the-art solutions and commercial models.

Motivation

Place-it-R1 Motivation
Figure 1. Place-it-R1 can handle environment-aware video object insertion with automatic spatial planning. Top: Hierarchical reasoning (Analysis → Revision → Planning) enables physically plausible insertion — e.g., inferring that a ceramic mug cannot float on water and generating a support structure. Bottom: Automatic trajectory generation predicts realistic physics trajectories including drops, rebounds, spins, and rolls.

Method

🧠→🤖

Brain-to-Hand Command

MLLM conducts hierarchical reasoning (Analysis → Revision → Planning) and generates environment-aware CoT tokens with automatic insertion trajectories to guide the diffusion model.

🤖→🧠

Hand-to-Brain Feedback

Diffusion outputs are scored by MLLM to construct DPO preference pairs. Spatial DPO applies fine-grained optimization within insertion regions for visual naturalness.
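The preference-pair construction described above can be sketched as follows. This is an illustrative sketch only: `sampler` (one diffusion rollout per call) and `score_with_mllm` (the MLLM quality score) are hypothetical callables, and the best-vs-worst pairing strategy is an assumption, not the paper's specified recipe.

```python
# Hypothetical sketch: build DPO preference pairs by sampling several
# diffusion outputs per prompt and letting the MLLM rank them.
def build_preference_pairs(prompts, sampler, score_with_mllm, n_samples=4):
    """For each insertion prompt, sample candidate edits, score them
    with the MLLM, and keep the best/worst as a (chosen, rejected) pair."""
    pairs = []
    for prompt in prompts:
        candidates = [sampler(prompt) for _ in range(n_samples)]
        scored = sorted(candidates, key=score_with_mllm)
        pairs.append({
            "prompt": prompt,
            "chosen": scored[-1],   # highest MLLM score
            "rejected": scored[0],  # lowest MLLM score
        })
    return pairs
```

In practice one would cache the MLLM scores rather than re-query inside `sorted`, but the structure of the resulting preference dataset is the same.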

🔄

Brain-Hand Co-refinement

During inference, MLLM iteratively evaluates generation quality and triggers refinement cycles, forming a closed-loop that progressively enhances editing quality within 2–3 iterations.
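The closed loop above can be sketched as a simple accept-or-retry procedure. The callables `generate` (diffusion model conditioned on optional MLLM feedback) and `evaluate` (MLLM returning a score and a textual critique), the acceptance threshold, and the iteration budget are all assumptions for illustration; only the 2–3 iteration convergence figure comes from the text.

```python
# Minimal sketch of Brain-Hand Co-refinement at inference time.
def co_refine(condition, generate, evaluate, threshold=0.8, max_iters=3):
    """Regenerate until the MLLM accepts the edit or the budget is spent;
    the page reports convergence within 2-3 iterations."""
    video = generate(condition, feedback=None)
    for _ in range(max_iters):
        score, feedback = evaluate(video)
        if score >= threshold:
            break  # MLLM accepts the insertion as-is
        # Feed the MLLM critique back as an adaptive adjustment signal.
        video = generate(condition, feedback=feedback)
    return video
```

The key design point is that the evaluator and the planner are the same MLLM, so the critique is expressed in the same environment-aware terms used to condition the diffusion model in the first place.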

Place-it-R1 Pipeline
Figure 2. Overall pipeline of Place-it-R1. Stage 1 (Brain-to-Hand Command): MLLM performs hierarchical reasoning and automatic trajectory generation, then guides the diffusion model through semantic and spatial conditioning pathways via a connector module. Stage 2 (Hand-to-Brain Feedback): MLLM-guided physical preference dataset construction combined with Spatial DPO post-training (L_total = λ_global · L_DPO^global + λ_local · L_DPO^local). Stage 3 (Brain-Hand Co-refinement): MLLM-guided iterative refinement cycles during inference.
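The two-term Spatial DPO objective from Stage 2 can be sketched numerically. Everything below is a hedged reconstruction: the Bradley–Terry log-sigmoid form, the β temperature, and the unit weights are standard DPO-style assumptions, not the paper's exact formulation; the only structure taken from the figure is that a frame-level (global) term and an insertion-region (local) term are combined as L_total = λ_global·L_DPO^global + λ_local·L_DPO^local.

```python
import math

def logsigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_term(err_chosen, err_rejected, beta=0.1):
    # Preference loss: push the preferred sample toward lower diffusion error.
    return -logsigmoid(-beta * (err_chosen - err_rejected))

def spatial_dpo_loss(global_err_c, global_err_r,
                     local_err_c, local_err_r,
                     lam_global=1.0, lam_local=1.0):
    """global_err_*: diffusion losses averaged over the whole frame;
    local_err_*: the same losses averaged only inside the MLLM-inferred
    insertion region (the 'fine-grained' part of Spatial DPO)."""
    return (lam_global * dpo_term(global_err_c, global_err_r)
            + lam_local * dpo_term(local_err_c, local_err_r))
```

The local term is what makes the optimization spatial: a rejected sample that only fails inside the insertion region still incurs a large penalty there, even if its frame-averaged error is close to the chosen sample's.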

Demos

Results

Table 1. Quantitative comparisons across three benchmarks.
PC: Physical Commonsense, PR: Physical Rule, PP: Physical Plausibility. The UNIC benchmark includes many virtual animated characters as objects, so PR is not applicable there.
Identity: CLIP-I, DINO-I · Video Quality: Smooth., Aesth. · Physics: PC, PR, PP.

| Benchmark | Method | CLIP-I ↑ | DINO-I ↑ | Smooth. ↑ | Aesth. ↑ | PC ↑ | PR ↑ | PP ↑ |
|-----------|--------|----------|----------|-----------|----------|------|------|------|
| UNIC | UNIC | 0.5980 | 0.2450 | 0.9610 | 0.5627 | 4.20 | – | 5.33 |
| UNIC | Kling (commercial) | 0.6203 | 0.2509 | 0.9540 | 0.5641 | 4.41 | – | 5.93 |
| UNIC | PIKA (commercial) | 0.6862 | 0.3752 | 0.9944 | 0.6151 | 4.34 | – | 6.11 |
| UNIC | Lucy-edit pro (commercial) | 0.6021 | 0.2629 | 0.9865 | 0.5693 | 4.28 | – | 5.79 |
| UNIC | Place-it-R1 (standard) | 0.6043 | 0.2897 | 0.9928 | 0.5684 | 4.53 | – | 6.21 |
| UNIC | Place-it-R1 (flexible) | 0.6040 | 0.2895 | 0.9919 | 0.5787 | 4.60 | – | 6.63 |
| FlexInsert | AnyV2V + Anydoor | 0.7853 | 0.3805 | 0.9853 | 0.4833 | 3.87 | 0.66 | 3.38 |
| FlexInsert | VACE + Traj (w/o CoT) | 0.7285 | 0.2541 | 0.9913 | 0.4920 | 4.03 | 0.67 | 5.21 |
| FlexInsert | Place-it-R1 (standard) | 0.7941 | 0.4917 | 0.9918 | 0.5294 | 4.13 | 0.78 | 7.28 |
| FlexInsert | Place-it-R1 (flexible) | 0.7938 | 0.4925 | 0.9906 | 0.5305 | 4.17 | 0.86 | 7.93 |
| HumanSync | VACE | 0.7553 | 0.4210 | 0.9908 | 0.4952 | 4.12 | 0.91 | 6.21 |
| HumanSync | Place-it-R1 (standard) | 0.7631 | 0.4497 | 0.9929 | 0.5283 | 4.33 | 0.92 | 6.58 |
| HumanSync | Place-it-R1 (flexible) | 0.7632 | 0.4500 | 0.9926 | 0.5295 | 4.37 | 0.92 | 6.93 |

User Study on FlexInsert

Three-way preference selection by 10 independent annotators.

Phys. Plausible
Place-it-R1 56.1%
AnyV2V 20.5%
VACE 23.4%
Visual Quality
Place-it-R1 39.0%
AnyV2V 26.6%
VACE 34.4%

1v1: Place-it-R1 vs VACE

Phys. Plausible
Ours 52.4%
Tie 35.1%
VACE 12.5%
Visual Quality
Ours 55.2%
Tie 37.5%
VACE 7.3%

1v1: Place-it-R1 vs AnyV2V

Phys. Plausible
Ours 42.5%
Tie 30.4%
AnyV2V 27.1%
Visual Quality
Ours 49.0%
Tie 21.9%
AnyV2V 29.2%

Citation

@inproceedings{gu2026placeitr1,
    title={Place-it-R1: Unlocking Environment-aware Reasoning
           Potential of MLLM for Video Object Insertion},
    author={Gu, Bohai and Wu, Taiyi and Du, Dazhao and Yang, Shuai
            and Zhao, Xiaotong and Zhao, Alan and Guo, Song},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer
               Vision and Pattern Recognition (CVPR)},
    year={2026}
}