Coherent Video Inpainting Using Optical Flow-Guided Efficient Diffusion

Abstract

Text-guided video inpainting has significantly improved content generation applications. Diffusion models have become essential for achieving high-quality video inpainting results, yet they still face bottlenecks in temporal consistency and computational efficiency. This motivates us to propose a new video inpainting framework using optical Flow-guided Efficient Diffusion (FloED) for higher video coherence. Specifically, FloED employs a dual-branch architecture, where the time-agnostic flow branch first restores the corrupted flow, and the multi-scale flow adapters then provide motion guidance to the main inpainting branch. In addition, we propose a training-free latent interpolation method that uses flow warping to accelerate the multi-step denoising process. With the flow attention cache mechanism, FloED efficiently reduces the computational cost of incorporating optical flow. Extensive experiments on background restoration and object removal tasks show that FloED outperforms state-of-the-art diffusion-based methods in both quality and efficiency. Our code and models will be made publicly available.

Method Overview

Overview of FloED. FloED employs a dual-branch architecture trained in two stages. In the first stage, we train only the upper branch, optimizing the motion layer to adapt it to the video inpainting domain. Subsequently, we introduce a time-agnostic flow branch together with a multi-scale flow adapter, which provides flow guidance to the up blocks of the primary UNet. During inference, we improve efficiency by integrating the flow attention cache (right part).
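One way to read the flow attention cache: because the flow branch is time-agnostic, its features do not change across denoising steps, so anything computed only from them can be computed once and reused. The sketch below illustrates that idea under stated assumptions and is not FloED's implementation. The names `FlowKVCache`, `toy_project_kv`, and the layer labels are hypothetical, and the "projection" is a toy stand-in for learned attention weights.

```python
from typing import Callable, Dict, List, Tuple

Features = List[float]
KV = Tuple[Features, Features]


class FlowKVCache:
    """Caches per-layer key/value projections of time-agnostic flow features.

    Assumption: the flow features for a clip stay fixed during sampling,
    so the cache is keyed by layer name only.
    """

    def __init__(self, project_kv: Callable[[str, Features], KV]) -> None:
        self._project_kv = project_kv
        self._store: Dict[str, KV] = {}
        self.misses = 0  # how many times the projection was actually computed

    def get(self, layer: str, flow_features: Features) -> KV:
        if layer not in self._store:
            self._store[layer] = self._project_kv(layer, flow_features)
            self.misses += 1
        return self._store[layer]


def toy_project_kv(layer: str, feats: Features) -> KV:
    # Hypothetical stand-in for the learned key/value projections.
    keys = [2.0 * f for f in feats]
    values = [f + 1.0 for f in feats]
    return keys, values


def run_denoising(num_steps: int, layers: List[str], flow_features: Features) -> FlowKVCache:
    cache = FlowKVCache(toy_project_kv)
    for _ in range(num_steps):
        for layer in layers:
            keys, values = cache.get(layer, flow_features)
            # Cross-attention between latent queries and (keys, values)
            # would run here; it is omitted in this sketch.
    return cache


cache = run_denoising(num_steps=25, layers=["up_0", "up_1", "up_2"], flow_features=[0.5, -1.0])
print(cache.misses)  # 3 projections instead of 25 * 3 = 75
```

In this toy setting the projection cost no longer grows with the number of denoising steps. That is the kind of saving a cache over time-agnostic features can offer, although the actual mechanism in FloED may differ.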

We introduce a training-free latent interpolation technique that leverages optical flow to speed up the multi-step denoising process. Complemented by a flow attention cache mechanism, FloED efficiently reduces the additional computational costs introduced by the flow.
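The page does not spell out the interpolation procedure, so the following is only a minimal sketch of the general idea of flow-warped latent interpolation: instead of denoising every frame at a given step, an intermediate frame's latent is approximated by warping its neighbours' latents along the optical flow and blending them. Everything here is an assumption for illustration, including the function names (`bilinear_sample`, `backward_warp`, `interpolate_latent`), the single-channel latent layout, and the blending weight `alpha`. Pure Python is used for readability; a real pipeline would use batched tensor operations.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]                # H x W single-channel latent
Flow = List[List[Tuple[float, float]]]  # H x W of (dx, dy) offsets in pixels


def bilinear_sample(latent: Grid, x: float, y: float) -> float:
    """Sample the latent at a fractional (x, y), clamping to the border."""
    h, w = len(latent), len(latent[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * latent[y0][x0] + ax * latent[y0][x1]
    bottom = (1 - ax) * latent[y1][x0] + ax * latent[y1][x1]
    return (1 - ay) * top + ay * bottom


def backward_warp(source: Grid, flow: Flow) -> Grid:
    """For each target pixel (x, y), fetch the source at (x + dx, y + dy)."""
    return [
        [bilinear_sample(source, x + dx, y + dy) for x, (dx, dy) in enumerate(row)]
        for y, row in enumerate(flow)
    ]


def interpolate_latent(prev: Grid, nxt: Grid, flow_to_prev: Flow,
                       flow_to_next: Flow, alpha: float = 0.5) -> Grid:
    """Blend the two neighbouring latents after warping each to the middle frame."""
    warped_prev = backward_warp(prev, flow_to_prev)
    warped_next = backward_warp(nxt, flow_to_next)
    return [
        [(1 - alpha) * a + alpha * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(warped_prev, warped_next)
    ]


# Toy example: a 1 x 4 ramp that shifts right by one pixel per frame.
prev = [[0.0, 1.0, 2.0, 3.0]]
nxt = [[-2.0, -1.0, 0.0, 1.0]]
flow_to_prev = [[(-1.0, 0.0)] * 4]  # middle-frame pixel x came from x - 1 in prev
flow_to_next = [[(1.0, 0.0)] * 4]   # middle-frame pixel x moves to x + 1 in nxt
mid = interpolate_latent(prev, nxt, flow_to_prev, flow_to_next)
print(mid[0][1:3])  # [0.0, 1.0]
```

The interior values match the ramp shifted by one pixel, which is what the middle frame should contain. Border pixels are only approximate because samples outside the grid are clamped.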

Qualitative Results on Object Removal

"Forest with a stream running through it, surrounded by trees and plants."

"Fire burning in a fireplace, with a log burning on top of it."

"Dark, sleek stairs made of modern materials, set near a green plots."

"A fan of cash floating in front of a white bookshelf with plants and book in the background."

"The sinking sun spills molten gold across the lake's still surface."

“A dusty desert with tire tracks etched into the sandy dunes.”

“A rugged outdoor training ground with sandy hills.”

Qualitative Results on Background Restoration

“The surface of water with gentle waves, reflecting warm golden hues.”

“The blue sky filled with rolling white clouds.”

“Bluish-green sea washes against the cliffs.”

“Icy shores, blanketed in frost and kissed by the relentless waves.”

“Fierce golden-orange flames burning violently on logs.”

"A breathtaking mountain landscape, with distant peaks shrouded in mist, creating a serene and mysterious atmosphere."

Qualitative Comparison on Background Restoration

"Icy shores, blanketed in frost and kissed by the relentless waves."

Qualitative Comparison on Object Removal

"Dark, sleek stairs made of modern materials, set near a green plots."