NeurIPS 2026 Under Review

WorldCraft: From Camera Navigation to Object Manipulation in Interactive Video World Models

A trajectory-centric framework that equips interactive video world models with object-level actions, enabling users to manipulate individual objects along sketched paths while simultaneously navigating the camera.

Bohai Gu¹,² Taiyi Wu² Yueyang Yuan³ Jian Liu¹ Xiaocheng Lu¹ Dazhao Du¹ Jie Zhang¹ Jinxiang Lai¹ Shuai Yang⁴ Xiaotong Zhao² Alan Zhao² Song Guo¹†

¹HKUST ²Tencent Video ³Wuhan University ⁴Peking University

† Corresponding author


Abstract

Recent video-based world models have made pixel-space environments interactive at the camera level: users can navigate viewpoints while the model generates coherent visual continuations. Yet their action spaces remain incomplete—users can move the camera, but cannot act on individual objects. We present WorldCraft, a framework that expands interactive video world models from camera navigation to object-level trajectory actions. Given a user click and a sketched path, WorldCraft generates future frames in which the selected object follows the prescribed trajectory while the camera continues to navigate the scene. WorldCraft achieves this through a trajectory-centric control pipeline: Normalized World Trajectory (NWT) lifts the user-drawn path into a camera-invariant world coordinate system; Spatial-Pathway LoRA (SP-LoRA) injects this signal through the model's spatial-control pathway while preserving the pretrained camera controller; and Trajectory-Anchored State Persistence (TASP) uses the world trajectory as a persistent state anchor across autoregressive chunks, maintaining object state even when the camera looks away.

Method

NWT

Normalized World Trajectory

Lifts user-drawn 2D paths into a camera-invariant world-space coordinate system and dynamically re-projects them under the current camera pose, decoupling object motion from ego-motion.
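
The lift-and-re-project step can be illustrated with a standard pinhole camera model. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the per-point depth input, the intrinsics tuple `(fx, fy, cx, cy)`, and the world-to-camera convention `x_cam = R x_world + t` are all hypothetical, and the normalization of world coordinates is omitted.

```python
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]                # row-major 3x3 rotation
Intrinsics = Tuple[float, float, float, float]  # (fx, fy, cx, cy), assumed pinhole


def mat_vec(m: Mat3, v: Sequence[float]) -> Vec3:
    """Multiply a 3x3 matrix by a 3-vector."""
    return (
        sum(m[0][k] * v[k] for k in range(3)),
        sum(m[1][k] * v[k] for k in range(3)),
        sum(m[2][k] * v[k] for k in range(3)),
    )


def transpose(m: Mat3) -> List[List[float]]:
    """Transpose a 3x3 matrix (the inverse of a rotation)."""
    return [[m[j][i] for j in range(3)] for i in range(3)]


def lift_to_world(uv: Tuple[float, float], depth: float, K: Intrinsics,
                  R: Mat3, t: Sequence[float]) -> Vec3:
    """Back-project a pixel with known depth into world coordinates.

    Assumed convention: x_cam = R @ x_world + t.
    """
    fx, fy, cx, cy = K
    u, v = uv
    x_cam = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    return mat_vec(transpose(R), [x_cam[i] - t[i] for i in range(3)])


def reproject(x_world: Vec3, K: Intrinsics, R: Mat3,
              t: Sequence[float]) -> Optional[Tuple[float, float]]:
    """Project a world point into the image of the current camera pose."""
    fx, fy, cx, cy = K
    r = mat_vec(R, x_world)
    x, y, z = r[0] + t[0], r[1] + t[1], r[2] + t[2]
    if z <= 0:
        return None  # point is behind the camera, so it is off-screen
    return (fx * x / z + cx, fy * y / z + cy)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = (500.0, 500.0, 320.0, 240.0)

# Lift a clicked pixel (image centre, 2 units deep) using the first-frame camera.
point_world = lift_to_world((320.0, 240.0), 2.0, K, IDENTITY, (0.0, 0.0, 0.0))

# The camera moves 0.5 units to the right: centre C = (0.5, 0, 0), so t = -R C.
pixel_after_move = reproject(point_world, K, IDENTITY, (-0.5, 0.0, 0.0))
```

In this toy case the world point stays fixed while its image position shifts left as the camera moves right, which is the sense in which a world-space trajectory separates object motion from ego-motion.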

SP-LoRA

Spatial-Pathway LoRA

Adapts only the spatial-control pathway (ProPE + action encoder) with lightweight LoRA, adding object manipulation without overwriting the pretrained camera controller.
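
Pathway-selective adaptation can be sketched in two parts: choosing which modules receive adapters, and the standard LoRA update itself. The sketch below is illustrative only. The module names and the `SPATIAL_PATHWAY_KEYS` fragments are hypothetical, and the tiny list-based layer stands in for a real linear layer; only the LoRA form (frozen weight plus a scaled low-rank product, with one factor initialised to zero) follows the standard technique.

```python
import random
from typing import List, Sequence

# Hypothetical name fragments identifying the spatial-control pathway.
SPATIAL_PATHWAY_KEYS = ("prope", "action_encoder")


def select_lora_targets(module_names: Sequence[str]) -> List[str]:
    """Keep only modules on the (assumed) spatial-control pathway."""
    return [n for n in module_names if any(k in n for k in SPATIAL_PATHWAY_KEYS)]


class LoRALinear:
    """y = W x + (alpha / r) * B (A x); W is frozen, only A and B would train."""

    def __init__(self, weight: List[List[float]], rank: int, alpha: float,
                 seed: int = 0) -> None:
        self.weight = weight  # frozen pretrained weight, shape (d_out, d_in)
        d_out, d_in = len(weight), len(weight[0])
        rng = random.Random(seed)
        # A gets a small random init and B starts at zero, so the adapter
        # contributes nothing until training updates B.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x: Sequence[float]) -> List[float]:
        base = [sum(w * xi for w, xi in zip(row, x)) for row in self.weight]
        ax = [sum(a * xi for a, xi in zip(row, x)) for row in self.A]
        delta = [sum(b * v for b, v in zip(row, ax)) for row in self.B]
        return [y + self.scale * d for y, d in zip(base, delta)]

    def num_trainable(self) -> int:
        """Adapter parameter count: r * (d_in + d_out)."""
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)


# Hypothetical module names; attention and MLP layers are left untouched.
module_names = [
    "blocks.0.attn.qkv",
    "blocks.0.mlp.fc1",
    "blocks.0.prope.proj",
    "action_encoder.fc",
]
targets = select_lora_targets(module_names)

layer = LoRALinear(weight=[[1.0, 2.0], [3.0, 4.0]], rank=1, alpha=2.0)
output_at_init = layer.forward([1.0, 1.0])
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer's output, which is one reason a LoRA confined to the spatial pathway can leave pretrained behaviour intact at the start of training.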

TASP

Trajectory-Anchored State Persistence

Uses the world-space trajectory as a persistent spatial state signal and refreshes stale memories, so moved objects reappear at correct positions when the camera returns.
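
One way to picture the staleness check is as a comparison between the object position recorded with each memory and the position the world trajectory prescribes now. The sketch below is a hypothetical illustration, not the authors' mechanism: `MemoryEntry`, `flag_stale_memories`, the distance tolerance, and the one-waypoint-per-chunk schedule are all assumptions, and it only flags stale entries because the page does not specify how a refresh is carried out.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class MemoryEntry:
    chunk_index: int
    object_pos: Vec3     # object world position when this memory was stored
    stale: bool = False  # True once the object has moved away from object_pos


def flag_stale_memories(bank: List[MemoryEntry], current_pos: Vec3,
                        tol: float = 0.05) -> List[MemoryEntry]:
    """Mark memories whose recorded object position disagrees with the trajectory."""
    for entry in bank:
        entry.stale = math.dist(entry.object_pos, current_pos) > tol
    return [entry for entry in bank if entry.stale]


def run_chunks(world_traj: Sequence[Vec3], tol: float = 0.05) -> List[MemoryEntry]:
    """Advance one trajectory waypoint per chunk, whether or not the object is visible."""
    bank: List[MemoryEntry] = []
    for chunk_index, pos in enumerate(world_traj):
        flag_stale_memories(bank, pos, tol)        # older memories may now be outdated
        bank.append(MemoryEntry(chunk_index, pos))  # store a memory for this chunk
    return bank


# The object moves after chunk 0 and then rests; camera visibility plays no role here.
bank = run_chunks([(0.0, 0.0, 2.0), (0.5, 0.0, 2.0), (0.5, 0.0, 2.0)])
stale_flags = [entry.stale for entry in bank]
```

The point of the sketch is that the object state is driven by the world trajectory rather than by what the camera currently sees, so memories recorded before the object moved can be identified even if the move happened off camera.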

Figure 1. WorldCraft overview. (Top-left) WorldCraft lifts a user-specified 2D trajectory into a camera-decoupled normalized world space and re-projects it into per-frame trajectory conditions under the given camera actions. (Top-right) The trajectory and camera controls are injected through a lightweight pathway-selective LoRA on the spatial-control pathway, while the backbone attention and MLP layers remain frozen. (Bottom) During autoregressive generation, WorldCraft updates the anchor frame and memory bank across chunks, and refreshes outdated memories to support long-horizon out-of-camera object reasoning.

Capability Comparison

Among the methods compared, only WorldCraft combines camera control and object-trajectory control in a composable way, tracks off-camera object state, and generates long videos autoregressively.

Method	Camera	Object Traj.	Composable	Off-Cam State	Autoregressive
DragAnything	×	✓	×	×	×
Wan-Move	×	✓	×	×	×
GameCraft	✓	×	×	×	✓
Genie 3	✓	×	×	×	✓
WorldPlay	✓	×	×	×	✓
WorldCraft (Ours)	✓	✓	✓	✓	✓

Trajectory Control Comparison

Side-by-side comparison on static-camera trajectory control. All methods receive the same first frame and trajectory condition. WorldCraft achieves precise object-level control while maintaining temporal consistency.

Camera: Forward ↑

Input: First frame + user-drawn trajectory

Method	Frames	Duration
WorldCraft (Ours)	61	2.5 s
WorldPlay	61	2.5 s
DragAnything	14	0.6 s
Wan-Move	81	3.8 s

Long-Horizon & Off-Camera

The goose moves right while the camera pans left and then returns. WorldCraft maintains scene consistency and, via TASP, recovers the goose at the position it reached while off camera. The baselines either lose scene coherence or fail to track the object's state.

Camera: Pan ← then →

Input: First frame + object trajectory + camera pan (left then return)

Methods compared: WorldCraft (Ours), WorldPlay, Matrix-Game 2.0, GameCraft, Yume

Quantitative Results

Table 1. Trajectory control under static camera (61 frames, 50 clips)

All methods share the same first frame and trajectory condition. WorldCraft achieves the best score on every metric.

Method	Visual Quality				VBench++			TE↓
Method	PSNR↑	SSIM↑	LPIPS↓	DINO↑	SubjC↑	BgC↑	Temp↑	TE↓
DragAnything	15.97	0.600	0.468	0.777	0.896	0.913	0.938	39.86
Wan-Move	16.42	0.592	0.375	0.782	0.927	0.943	0.985	44.08
WorldCraft (Ours)	17.23	0.616	0.363	0.807	0.942	0.945	0.989	38.90

Table 2. Camera fidelity on camera-only input

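At 61 frames, WorldCraft stays close to its WorldPlay base in camera pose error (RPE_rot 0.131 vs. 0.120) while slightly improving PSNR, SSIM, and LPIPS. At the extended 253-frame horizon, it achieves the lowest RPE_rot, RPE_trans, and RPE_cam of all methods.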

Method	Short-term (61 frames)						Long-term (253 frames)
Method	RPE_rot↓	RPE_trans↓	RPE_cam↓	PSNR↑	SSIM↑	LPIPS↓	RPE_rot↓	RPE_trans↓	RPE_cam↓
Yume	0.261	0.0143	0.0169	12.39	0.2931	0.5718	0.374	0.0247	0.0285
Matrix-Game 2.0	0.342	0.0137	0.0196	12.96	0.3235	0.5326	0.162	0.0243	0.0261
GameCraft	0.252	0.0130	0.0157	12.42	0.2861	0.5529	0.198	0.0243	0.0265
WorldPlay (base)	0.120	0.0155	0.0165	13.77	0.3434	0.4700	0.130	0.0262	0.0276
WorldCraft (Ours)	0.131	0.0161	0.0170	13.95	0.3474	0.4621	0.123	0.0225	0.0233
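
For readers unfamiliar with relative pose error, the rotational part can be sketched as follows. This is a common formulation, stated as an assumption: the page does not give the exact definition, units, or frame spacing behind the RPE columns, and RPE_cam is not covered here. The sketch compares the relative rotation between consecutive frames in the ground-truth and estimated camera paths and averages the angular difference in degrees.

```python
import math
from typing import List, Sequence

Mat3 = List[List[float]]


def matmul3(a: Mat3, b: Mat3) -> Mat3:
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose3(m: Mat3) -> Mat3:
    return [[m[j][i] for j in range(3)] for i in range(3)]


def rot_y(deg: float) -> Mat3:
    """Rotation about the y-axis, used here only to build toy camera paths."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rotation_angle_deg(r: Mat3) -> float:
    """Angle of a rotation matrix, from trace(R) = 1 + 2 cos(theta)."""
    cos_theta = (r[0][0] + r[1][1] + r[2][2] - 1.0) / 2.0
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


def rpe_rot(gt: Sequence[Mat3], est: Sequence[Mat3]) -> float:
    """Mean angular error between consecutive-frame relative rotations."""
    errors = []
    for i in range(len(gt) - 1):
        rel_gt = matmul3(transpose3(gt[i]), gt[i + 1])
        rel_est = matmul3(transpose3(est[i]), est[i + 1])
        errors.append(rotation_angle_deg(matmul3(transpose3(rel_gt), rel_est)))
    return sum(errors) / len(errors)


# Toy paths: the estimate overshoots the middle frame by one degree.
gt_path = [rot_y(0.0), rot_y(10.0), rot_y(20.0)]
est_path = [rot_y(0.0), rot_y(11.0), rot_y(20.0)]
toy_rpe = rpe_rot(gt_path, est_path)
```

Because the metric works on relative motion between frames, a constant offset between the two paths does not count as error; only drift in the frame-to-frame motion does.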

Table 3. Ablation: NWT representation & adaptation strategy

Configuration	#Params	TE↓	RPE_rot↓
(a) NWT Representation
Pixel space (raw user traj.)	—	35.82 / 40.69 / 45.28	—
World space + single-shot depth	—	33.82 / 37.69 / 41.28	—
World space + iterative depth	—	30.82 / 32.10 / 34.65	—
(b) Layer Selection (Static-BI → Dynamic-AR)
Spatial-pathway LoRA (ProPE + action)	~50M	38.90	0.131
Spatial pathway + V + MLP (blocks 28-42)	~120M	46.60	0.136
Q/K/V + MLP (conventional LoRA)	~200M	49.43	0.139
Full fine-tune	8B	37.20	0.237
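
To put the #Params column in proportion, the short calculation below expresses each adapter size as a fraction of the 8B full fine-tune. The counts are the approximate figures from Table 3 (the "~" values), so the resulting percentages are approximate as well; the dictionary keys are just labels for this sketch.

```python
# Approximate trainable-parameter counts taken from Table 3.
FULL_FINE_TUNE_PARAMS = 8e9

adapter_params = {
    "Spatial-pathway LoRA (ProPE + action)": 50e6,
    "Spatial pathway + V + MLP (blocks 28-42)": 120e6,
    "Q/K/V + MLP (conventional LoRA)": 200e6,
}

# Fraction of the full model that each configuration trains.
fractions = {name: n / FULL_FINE_TUNE_PARAMS for name, n in adapter_params.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.3%} of full fine-tune")
```

Under these approximate counts, the spatial-pathway configuration trains well under 1% of the parameters of a full fine-tune.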

Citation

@misc{gu2026worldcraftcameranavigationobject,
    title={WorldCraft: From Camera Navigation to Object Manipulation
           in Interactive Video World Models},
    author={Bohai Gu and Taiyi Wu and Yueyang Yuan and Jian Liu
            and Xiaocheng Lu and Dazhao Du and Jie Zhang and Jinxiang Lai
            and Shuai Yang and Xiaotong Zhao and Alan Zhao and Song Guo},
    year={2026},
    eprint={2605.25077},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2605.25077},
}