We propose Long Context Tuning (LCT) for scene-level video generation to bridge the gap between current single-shot generation capabilities and real-world narrative video productions such as movies. In this framework, a scene comprises a series of single-shot videos capturing coherent events that unfold over time with semantic and temporal consistency.
We address this challenge with a data-driven approach by directly learning scene-level consistency from data while minimizing architectural inductive bias that might compromise scalability. The core innovation of our method is expanding the scope of MMDiT's context window from a single shot to encompass the entire scene. This modification is compatible with MMDiT-based single-shot video diffusion models and introduces no additional parameters, functioning as a subsequent training stage after single-shot video generation pre-training.
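To make the idea concrete, here is a minimal PyTorch sketch of scene-level full attention, assuming each shot has already been tokenized; the module name `SceneSelfAttention` and the tensor shapes are illustrative choices, not our production implementation.

```python
# Minimal sketch: widen full attention from one shot to the whole scene by
# concatenating every shot's token sequence into a single context window.
import torch
import torch.nn as nn

class SceneSelfAttention(nn.Module):
    """Illustrative module: standard multi-head self-attention applied to the
    concatenated token sequences of all shots in a scene."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, shot_tokens: list) -> list:
        # shot_tokens: list of [batch, tokens_per_shot, dim] tensors, one per shot.
        lengths = [t.shape[1] for t in shot_tokens]
        scene = torch.cat(shot_tokens, dim=1)      # [batch, total_tokens, dim]
        scene, _ = self.attn(scene, scene, scene)  # every token attends across the whole scene
        return list(scene.split(lengths, dim=1))   # back to per-shot sequences

# Example: a 4-shot scene with 256 tokens per shot and 128-dim features.
shots = [torch.randn(1, 256, 128) for _ in range(4)]
outputs = SceneSelfAttention(dim=128)(shots)
```

Because only the attention span changes, the single-shot model's weights carry over unchanged, which is what allows LCT to act as a pure post-training stage.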
We employ an interleaved 3D Rotary Position Embedding (RoPE) to encode scene-level video order, along with an asynchronous diffusion timestep strategy that unifies visual conditions and diffusion samples. We further explore a context-causal architecture that enables efficient auto-regressive shot generation with KV-cache. Our experiments are conducted with an internal video model at the 3B-parameter scale. For more technical details, please refer to our Research Paper.
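As a rough illustration of the interleaved layout, the hypothetical sketch below builds position indices for a scene: each shot contributes its text tokens followed by its video tokens, and the temporal coordinate keeps advancing across shots so their order is encoded. The exact interleaving and offsets in our model may differ; `scene_position_ids` is purely illustrative.

```python
# Hypothetical sketch of interleaved position indices for scene-level 3D RoPE.
import torch

def scene_position_ids(num_shots: int, text_len: int,
                       frames: int, height: int, width: int) -> torch.Tensor:
    """Return (t, h, w) position triples for the interleaved scene sequence.
    Assumption: text tokens reuse their shot's starting temporal index with
    zeroed spatial coordinates; the real scheme may differ."""
    positions, t_offset = [], 0
    for _ in range(num_shots):
        # Text tokens of this shot share the shot's starting temporal index.
        for _ in range(text_len):
            positions.append((t_offset, 0, 0))
        # Video tokens of this shot: a dense (t, h, w) grid, shifted in time.
        for t in range(frames):
            for h in range(height):
                for w in range(width):
                    positions.append((t_offset + t, h, w))
        t_offset += frames  # the next shot continues along the temporal axis
    return torch.tensor(positions)  # [total_tokens, 3], consumed by 3D RoPE

ids = scene_position_ids(num_shots=3, text_len=16, frames=4, height=8, width=8)
print(ids.shape)  # torch.Size([816, 3])
```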
Notably, beyond its superior scene-level video generation capabilities, Long Context Tuning unlocks several emergent abilities, including single- and multi-shot extension and compositional generation.
Our model enables interactive multi-shot development, facilitating an iterative production workflow. Directors can progressively shape content shot-by-shot based on previously generated footage, eliminating the need for comprehensive upfront prompting and allowing creative decision-making with immediate visual feedback.
The diagram above illustrates how we can extend a Sora single-shot video by interactively providing text prompts. The complete video is presented below:
Our model also enables interactive shot extension without cuts. This approach can extend a single shot to minute-long durations by auto-regressively generating 10-second segments while maintaining visual consistency throughout.
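The following Python sketch outlines this auto-regressive loop; `generate_next_segment` is a hypothetical stand-in for the model's actual sampling call, and only the loop structure is meant to be taken literally.

```python
# Sketch of auto-regressive shot extension: each new ~10-second segment is
# conditioned on all previously generated segments of the same shot.
from typing import Callable, List

def extend_shot(initial_segment, prompts: List[str],
                generate_next_segment: Callable) -> List:
    """Grow one continuous shot by repeatedly sampling short segments."""
    history = [initial_segment]
    for prompt in prompts:
        # Earlier segments serve as clean visual context; with a context-causal
        # architecture their key/value states can be cached and reused.
        segment = generate_next_segment(context=history, prompt=prompt)
        history.append(segment)
    return history  # concatenating the segments yields a minute-long shot
```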
The diagram above demonstrates how we can extend the narrative of an initial Sora video by interactively providing text prompts. The complete extended video is presented below:
Remarkably, despite no explicit training for this capability, our model supports compositional generation: given separate identity and environment images, it synthesizes coherent videos that integrate these distinct elements. This ability emerges naturally from the scene-level visual relationships the model learns from the training corpus, where scenes typically contain establishing environmental shots, character close-ups, and integrated shots depicting character-environment interactions.
Based on the above generation results, we can provide the model with another identity image along with the previous content to continue the generation chain. The diagram below illustrates this process.
Below are the additional identity image and the resulting video. Note the exceptional scene consistency maintained throughout; in particular, the arrangement of paintings on the wall exactly matches the initial environment image.
Our asynchronous diffusion timestep strategy unifies conditional inputs and diffusion samples during training, enabling arbitrary images or videos to serve as additional visual conditions, as demonstrated in the previous sections.
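A minimal sketch of this per-shot timestep assignment is shown below, assuming one diffusion timestep per shot; the function name and the convention that conditioning shots receive timestep 0 are illustrative assumptions rather than the exact training recipe.

```python
# Minimal sketch of asynchronous (per-shot) timesteps: shots used as visual
# conditions stay clean (timestep 0) while shots being generated carry
# independently sampled diffusion timesteps.
import torch

def assign_shot_timesteps(is_condition: torch.Tensor,
                          num_train_timesteps: int = 1000) -> torch.Tensor:
    """is_condition: bool tensor [num_shots]; True marks a clean conditioning shot.
    Returns one diffusion timestep per shot."""
    num_shots = is_condition.shape[0]
    timesteps = torch.randint(0, num_train_timesteps, (num_shots,))
    return torch.where(is_condition, torch.zeros_like(timesteps), timesteps)

# Example: shot 0 is a reference image/video, shots 1-2 are being denoised.
mask = torch.tensor([True, False, False])
print(assign_shot_timesteps(mask))  # e.g. tensor([  0, 613, 187])
```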
In the example below, we illustrate a practical workflow enabled by our model: first designing the character, costume, and environment using external tools (such as FLUX), then integrating them into a video to achieve precise control over character appearance and scene composition.
Our bidirectional model accepts visual conditions in arbitrary order and location, supporting "scene interpolation" applications. As shown below, given the first and last shots, we can generate intermediate scenes with semantic coherence.
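As a small illustration, the sketch below lays out a per-shot condition mask for interpolation, where only the first and last shots are provided as clean conditions; it pairs with the per-shot timestep assignment sketched earlier and is again purely illustrative.

```python
# Illustrative condition layout for "scene interpolation": the first and last
# shots are supplied as conditions and the shots in between are generated.
import torch

def interpolation_condition_mask(num_shots: int) -> torch.Tensor:
    """True = provided shot (kept clean), False = shot to be generated."""
    mask = torch.zeros(num_shots, dtype=torch.bool)
    mask[0] = True    # given first shot
    mask[-1] = True   # given last shot
    return mask

print(interpolation_condition_mask(5))
# tensor([ True, False, False, False,  True])
```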
Using our model, we generate two possible interpolation results shown below. By providing specific text prompts, we can guide the interpolated narrative while maintaining visual and semantic coherence. In the first example, Dr. Shirley is listening to music. In the second, Tony Lip is making a phone call.
Finally, we demonstrate the versatility of our model. It excels not only in generating human-centric content (Example 1) but also in producing nature documentaries (Example 2), showcasing its broad applicability across diverse visual domains.