Video inpainting, which aims to fill missing regions with visually coherent content, has emerged as a crucial technique for creative applications such as video editing. While existing approaches achieve either visual consistency or text-guided generation, they often struggle to balance coherence with creative diversity. In this work, we introduce VideoRepainter, a two-stage framework that first allows users to inpaint a keyframe using established image-level techniques, and then propagates the corresponding changes to the remaining frames. Our approach can leverage state-of-the-art image diffusion models for keyframe manipulation, thereby easing the burden on the video-inpainting process. To this end, we integrate an image-to-video model with a symmetric condition mechanism to address the ambiguity caused by direct mask downsampling. We further explore efficient strategies for mask synthesis and parameter tuning to reduce costs in data processing and model training. Evaluations demonstrate that our method achieves superior results in both visual fidelity and content diversity compared to existing approaches, providing a practical solution for creative video manipulation.
We start with the observation that video inpainting can be formulated as a conditional generation task, where the model synthesizes content based on partially observed pixels. This formulation closely aligns with image-to-video (I2V) generation, where synthesis is conditioned on the initial frame. Given this structural similarity and shared underlying requirements, we opt to leverage pretrained I2V priors for video inpainting through efficient model repurposing, substantially reducing computational and data requirements compared to training from scratch.
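As a minimal sketch of this analogy (the tensor shapes and helper names below are illustrative, not the actual VideoRepainter interface), both tasks reduce to denoising conditioned on partially observed content:

```python
import torch

def i2v_condition(video: torch.Tensor) -> torch.Tensor:
    """I2V conditioning: only the first frame is observed.
    video: (B, T, C, H, W) in pixel space."""
    cond = torch.zeros_like(video)
    cond[:, 0] = video[:, 0]          # keep only the initial frame
    return cond

def inpaint_condition(video: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Video-inpainting conditioning: pixels outside the mask are observed.
    mask: (B, T, 1, H, W), 1 = missing region, 0 = known region."""
    return video * (1.0 - mask)       # zero out the missing regions

# Both produce a partially observed video that a conditional diffusion model
# completes, which is why pretrained I2V priors transfer naturally.
```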
To address the mask ambiguity caused by resolution downsampling in the Latent Diffusion Model (LDM), we propose symmetric mask encoding. Our approach encodes two variants of the masked image through the VAE, one with black-filled mask regions and another with white-filled regions, and concatenates both encoded latents as UNet conditions. This enables precise pixel classification: identical values across the two variants indicate unmasked regions, while differences denote masked areas. Moreover, this design remains compatible with the VAE latent space and thus facilitates model repurposing.
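A minimal sketch of this symmetric encoding is given below, assuming an SD-style VAE with an `encode` method that maps pixels in [-1, 1] to latents; the exact fill values and interfaces are assumptions for illustration rather than the released implementation:

```python
import torch

def symmetric_mask_condition(frame: torch.Tensor, mask: torch.Tensor, vae) -> torch.Tensor:
    """Encode black- and white-filled variants of the masked frame and
    concatenate their latents as the UNet condition.
    frame: (B, C, H, W) in [-1, 1]; mask: (B, 1, H, W), 1 = masked."""
    black_fill = frame * (1 - mask) + (-1.0) * mask   # masked pixels -> black
    white_fill = frame * (1 - mask) + ( 1.0) * mask   # masked pixels -> white

    z_black = vae.encode(black_fill)   # assumed encode() returning latents
    z_white = vae.encode(white_fill)

    # Where the two latents (nearly) agree, the underlying pixels were
    # unmasked; where they differ, the region was masked. The mask is thus
    # recoverable in latent space without naive low-resolution downsampling.
    return torch.cat([z_black, z_white], dim=1)
```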