NeurIPS 2026 Under Review

WorldCraft: From Camera Navigation to Object Manipulation in Interactive Video World Models

A trajectory-centric framework that equips interactive video world models with object-level actions, enabling users to manipulate individual objects along sketched paths while simultaneously navigating the camera.

¹HKUST ²Tencent Video ³Wuhan University ⁴Peking University

† Corresponding author

Abstract

Recent video-based world models have made pixel-space environments interactive at the camera level: users can navigate viewpoints while the model generates coherent visual continuations. Yet their action spaces remain incomplete: users can move the camera but cannot act on individual objects. We present WorldCraft, a framework that expands interactive video world models from camera navigation to object-level trajectory actions. Given a user click and a sketched path, WorldCraft generates future frames in which the selected object follows the prescribed trajectory while the camera continues to navigate the scene. WorldCraft achieves this through a trajectory-centric control pipeline with three components. Normalized World Trajectory (NWT) lifts the user-drawn path into a camera-invariant world coordinate system. Spatial-Pathway LoRA (SP-LoRA) injects this signal through the model's spatial-control pathway while preserving the pretrained camera controller. Trajectory-Anchored State Persistence (TASP) uses the world trajectory as a persistent state anchor across autoregressive chunks, maintaining object state even when the camera looks away.

Method

01. NWT: Normalized World Trajectory

Lifts user-drawn 2D paths into a camera-invariant world-space coordinate system and dynamically re-projects them under the current camera pose, decoupling object motion from ego-motion.
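
The lift-and-re-project idea can be illustrated with a minimal pinhole-camera sketch. This is not the authors' implementation: the function names, the use of per-point depth, and the choice of median depth as the normalization scale are all assumptions made for illustration, since the page does not specify how the normalization is defined.

```python
import numpy as np

def lift_to_world(path_px, depth, K, cam_to_world):
    """Unproject a 2D pixel path (N, 2) with per-point depth (N,) to world points (N, 3)."""
    ones = np.ones((len(path_px), 1))
    pix_h = np.hstack([path_px, ones])        # homogeneous pixels, (N, 3)
    rays = pix_h @ np.linalg.inv(K).T         # camera-space rays at z = 1
    pts_cam = rays * depth[:, None]           # scale each ray by its depth
    pts_h = np.hstack([pts_cam, ones])        # homogeneous camera points, (N, 4)
    return (pts_h @ cam_to_world.T)[:, :3]

def normalize_trajectory(points_world, depth):
    """Assumed normalization: divide by the median depth so the path is scale-free."""
    scale = float(np.median(depth))
    return points_world / scale, scale

def reproject(points_norm, scale, K, cam_to_world):
    """Project the normalized world path into the image of the current camera pose."""
    points_world = points_norm * scale
    world_to_cam = np.linalg.inv(cam_to_world)
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])
    pts_cam = (pts_h @ world_to_cam.T)[:, :3]
    in_front = pts_cam[:, 2] > 1e-6           # points behind the camera are not visible
    z = np.where(in_front, pts_cam[:, 2], 1.0)  # avoid dividing by zero or negative depth
    proj = pts_cam @ K.T
    return proj[:, :2] / z[:, None], in_front

# Usage: a two-point path drawn in the first frame, re-projected after the
# camera translates 0.2 units to the right (hypothetical numbers).
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
path = np.array([[50.0, 50.0], [60.0, 50.0]])
depth = np.array([2.0, 2.0])
world = lift_to_world(path, depth, K, np.eye(4))
nwt, scale = normalize_trajectory(world, depth)
moved_pose = np.eye(4)
moved_pose[0, 3] = 0.2
pixels_after_move, visible = reproject(nwt, scale, K, moved_pose)
```

Because the path is stored in world coordinates, the same trajectory yields different per-frame pixel conditions as the camera moves, which is the sense in which object motion is decoupled from ego-motion.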

02. SP-LoRA: Spatial-Pathway LoRA

Adapts only the spatial-control pathway (ProPE + action encoder) with lightweight LoRA, adding object manipulation without overwriting the pretrained camera controller.
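
A rough sketch of pathway-selective LoRA, written in plain NumPy for clarity. The module names (`prope.`, `action_encoder.`, `blocks.`), the rank, and the scaling are hypothetical and are not taken from the WorldCraft code; the sketch only shows the general technique of attaching low-rank adapters to a chosen subset of layers while leaving all other weights frozen.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha / r) * B A x."""

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                            # pretrained weight, kept frozen
        self.A = rng.normal(0.0, 0.01, (rank, in_dim))  # trainable down-projection
        self.B = np.zeros((out_dim, rank))              # trainable up-projection, zero at init
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

def select_pathway(module_names, prefixes=("prope.", "action_encoder.")):
    """Keep only modules on the (hypothetically named) spatial-control pathway."""
    return [name for name in module_names if name.startswith(prefixes)]

# Usage: only the two spatial-pathway modules would receive adapters.
names = ["prope.proj", "blocks.0.attn.q", "action_encoder.fc1", "blocks.0.mlp.fc1"]
adapted = select_pathway(names)

W = np.arange(6, dtype=float).reshape(2, 3)
layer = LoRALinear(W)
x = np.ones((1, 3))
y = layer(x)
```

Initializing `B` to zero means the adapted layer reproduces the pretrained output exactly before training, which is one common way to avoid disturbing an existing controller at the start of fine-tuning.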

03. TASP: Trajectory-Anchored State Persistence

Uses the world-space trajectory as a persistent spatial state signal and refreshes stale memories, so moved objects reappear at correct positions when the camera returns.
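
The bookkeeping behind this idea might look like the sketch below. Everything here is an assumption for illustration: the `MemoryEntry` structure, the distance-based staleness test, and the tolerance are hypothetical, and the real system would regenerate or update visual memories rather than just overwrite a stored position.

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    frame_idx: int      # frame at which this memory was stored
    object_pos: tuple   # world-space object position recorded in this memory

def object_position(trajectory, frame_idx):
    """World-space position at a frame; the object rests at the last point afterwards."""
    return trajectory[min(frame_idx, len(trajectory) - 1)]

def stale_entries(memory, trajectory, current_frame, tol=1e-3):
    """Indices of memories whose recorded object position no longer matches the trajectory."""
    now = object_position(trajectory, current_frame)
    stale = []
    for i, entry in enumerate(memory):
        dist = sum((a - b) ** 2 for a, b in zip(entry.object_pos, now)) ** 0.5
        if dist > tol:
            stale.append(i)
    return stale

def refresh(memory, trajectory, current_frame, tol=1e-3):
    """Replace stale memories with entries that reflect the current object state."""
    now = object_position(trajectory, current_frame)
    for i in stale_entries(memory, trajectory, current_frame, tol):
        memory[i] = MemoryEntry(frame_idx=current_frame, object_pos=now)
    return memory

# Usage: the object moved from x=0 to x=2 while the camera looked away.
trajectory = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
memory = [MemoryEntry(0, (0.0, 0.0, 0.0)), MemoryEntry(2, (2.0, 0.0, 0.0))]
outdated = stale_entries(memory, trajectory, current_frame=5)
memory = refresh(memory, trajectory, current_frame=5)
```

The point of the sketch is that the world-space trajectory, not the last rendered view, is the source of truth for where the object is, so memories captured before the object moved can be detected and refreshed.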

WorldCraft Pipeline
Figure 1. WorldCraft overview. (Top-left) WorldCraft lifts a user-specified 2D trajectory into a camera-decoupled normalized world space and re-projects it into per-frame trajectory conditions under the given camera actions. (Top-right) The trajectory and camera controls are injected through a lightweight pathway-selective LoRA on the spatial-control pathway, while the backbone attention and MLP layers remain frozen. (Bottom) During autoregressive generation, WorldCraft updates the anchor frame and memory bank across chunks, and refreshes outdated memories to support long-horizon out-of-camera object reasoning.
Capability Comparison
WorldCraft uniquely supports composable camera-object control with autoregressive long-video generation.
Method Camera Object Traj. Composable Off-Cam State Autoregressive
DragAnything××××
Wan-Move××××
GameCraft×××
Genie 3×××
WorldPlay×××
WorldCraft (Ours)

Trajectory Control Comparison

Side-by-side comparison on static-camera trajectory control. All methods receive the same first frame and trajectory condition. WorldCraft achieves precise object-level control while maintaining temporal consistency.

Input: First frame + user-drawn trajectory
Video panels: WorldCraft (Ours), 61 frames / 2.5 s; WorldPlay, 61 frames / 2.5 s; DragAnything, 14 frames / 0.6 s; Wan-Move, 81 frames / 3.8 s

Long-Horizon & Off-Camera

The goose moves right while the camera pans left and then returns. WorldCraft maintains scene consistency and, via TASP, recovers the object at the updated position it reached while off camera. Baselines either lose scene coherence or fail to track the object's state.

Input: First frame + object trajectory + camera pan (left then return)
Video panels: WorldCraft (Ours); WorldPlay; Matrix-Game 2.0; GameCraft; Yume

Extended Capabilities

Part-level control: The shield follows the trajectory while the body stays still.
Multi-object control: Three objects steered simultaneously along independent paths.
253-frame rollout: ~10.5s autoregressive generation with composable control.

Quantitative Results

Table 1. Trajectory control under static camera (61 frames, 50 clips)
All methods share the same first frame and trajectory condition. Best in bold.
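Column groups: Visual Quality (PSNR, SSIM, LPIPS, DINO); VBench++ (SubjC, BgC, Temp); TE.

| Method | PSNR↑ | SSIM↑ | LPIPS↓ | DINO↑ | SubjC↑ | BgC↑ | Temp↑ | TE↓ |
|---|---|---|---|---|---|---|---|---|
| DragAnything | 15.97 | 0.600 | 0.468 | 0.777 | 0.896 | 0.913 | 0.938 | 39.86 |
| Wan-Move | 16.42 | 0.592 | 0.375 | 0.782 | 0.927 | 0.943 | 0.985 | 44.08 |
| WorldCraft (Ours) | 17.23 | 0.616 | 0.363 | 0.807 | 0.942 | 0.945 | 0.989 | 38.90 |
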
Table 2. Camera fidelity on camera-only input
WorldCraft preserves camera accuracy at 61 frames and outperforms all compared methods on every camera metric at the extended 253-frame horizon.
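Column groups: Short-term (61 frames; marked 61f); Long-term (253 frames; marked 253f).

| Method | RPErot (61f) | RPEtrans (61f) | RPEcam (61f) | PSNR↑ (61f) | SSIM↑ (61f) | LPIPS↓ (61f) | RPErot (253f) | RPEtrans (253f) | RPEcam (253f) |
|---|---|---|---|---|---|---|---|---|---|
| Yume | 0.261 | 0.0143 | 0.0169 | 12.39 | 0.2931 | 0.5718 | 0.374 | 0.0247 | 0.0285 |
| Matrix-Game 2.0 | 0.342 | 0.0137 | 0.0196 | 12.96 | 0.3235 | 0.5326 | 0.162 | 0.0243 | 0.0261 |
| GameCraft | 0.252 | 0.0130 | 0.0157 | 12.42 | 0.2861 | 0.5529 | 0.198 | 0.0243 | 0.0265 |
| WorldPlay (base) | 0.120 | 0.0155 | 0.0165 | 13.77 | 0.3434 | 0.4700 | 0.130 | 0.0262 | 0.0276 |
| WorldCraft (Ours) | 0.131 | 0.0161 | 0.0170 | 13.95 | 0.3474 | 0.4621 | 0.123 | 0.0225 | 0.0233 |
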
Table 3. Ablation: NWT representation & adaptation strategy
(a) NWT Representation

| Configuration | Reported values |
|---|---|
| Pixel space (raw user traj.) | 35.82 / 40.69 / 45.28 |
| World space + single-shot depth | 33.82 / 37.69 / 41.28 |
| World space + iterative depth | 30.82 / 32.10 / 34.65 |

(b) Layer Selection (Static-BI → Dynamic-AR)

| Configuration | #Params | TE↓ | RPErot |
|---|---|---|---|
| Spatial-pathway LoRA (ProPE + action) | ~50M | 38.90 | 0.131 |
| Spatial pathway + V + MLP (blocks 28-42) | ~120M | 46.60 | 0.136 |
| Q/K/V + MLP (conventional LoRA) | ~200M | 49.43 | 0.139 |
| Full fine-tune | 8B | 37.20 | 0.237 |

Citation

@misc{gu2026worldcraftcameranavigationobject,
    title={WorldCraft: From Camera Navigation to Object Manipulation
           in Interactive Video World Models},
    author={Bohai Gu and Taiyi Wu and Yueyang Yuan and Jian Liu
            and Xiaocheng Lu and Dazhao Du and Jie Zhang and Jinxiang Lai
            and Shuai Yang and Xiaotong Zhao and Alan Zhao and Song Guo},
    year={2026},
    eprint={2605.25077},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2605.25077},
}