Long Context Tuning for Video Generation

Yuwei Guo1,2 Ceyuan Yang2,† Ziyan Yang2 Zhibei Ma2 Zhijie Lin2 Zhenheng Yang3 Dahua Lin1 Lu Jiang2
1The Chinese University of Hong Kong 2ByteDance Seed 3ByteDance
†Corresponding Author
Research Paper
Teaser video: a 27-shot scene generated by our model, with per-shot prompts: (1) forest flying view, (2) low-angle view, (3) get lost in forest, (4) girl walks out, (5) walks towards camera, (6) POV of forest, (7) meet and discuss, (8) close-up to serious face, (9) worrying expression, (10) continue exploration, (11) high-angle view, (12) wide shot, (13) arrive at a clearing, (14) find an abandoned house, (15) focus on the house, (16) cautiously discuss, (17) close-up to the door, (18) tries to open the door, (19) interior of the house, (20) enter and explore, (21) continue exploration, (22) into a dusty room, (23) showing room's layout, (24) close-up to the shelf, (25) find something on desk, (26) magic ball appears, (27) their shocked faces.

We propose Long Context Tuning (LCT) for scene-level video generation to bridge the gap between current single-shot generation capabilities and real-world narrative video productions such as movies. In this framework, a scene comprises a series of single-shot videos capturing coherent events that unfold over time with semantic and temporal consistency.

Architecture diagram
Architecture Designs for Long Context Tuning.

We address this challenge with a data-driven approach, learning scene-level consistency directly from data while minimizing architectural inductive bias that might compromise scalability. The core innovation of our method is expanding the scope of MMDiT's context window from a single shot to the entire scene. This modification is compatible with MMDiT-based single-shot video diffusion models and introduces no additional parameters, functioning as a training stage that follows single-shot video generation pre-training.
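Conceptually, the change amounts to letting attention run over the concatenated token sequences of all shots in a scene rather than over each shot in isolation. The PyTorch sketch below is purely illustrative (the helper names and tensor shapes are placeholders, not the actual implementation), but it captures the essence of the context expansion:

```python
import torch
import torch.nn.functional as F

def single_shot_attention(q, k, v):
    # Pre-training: q, k, v hold the tokens of ONE shot,
    # shaped [batch, heads, tokens_per_shot, head_dim].
    return F.scaled_dot_product_attention(q, k, v)

def scene_attention(qs, ks, vs):
    # LCT-style expansion (illustrative): qs, ks, vs are lists of per-shot
    # tensors with the same shapes. Concatenating them along the token axis
    # widens the attention context from one shot to the whole scene without
    # adding any parameters.
    q, k, v = (torch.cat(x, dim=2) for x in (qs, ks, vs))
    out = F.scaled_dot_product_attention(q, k, v)  # full bidirectional attention
    # Split the result back into per-shot chunks for the rest of the block.
    return list(torch.split(out, [x.shape[2] for x in qs], dim=2))
```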

We employ an interleaved 3D Rotary Position Embedding (RoPE) to encode scene-level video order, along with an asynchronous diffusion timestep strategy that unifies visual conditions and diffusion samples. We further explore a context-causal architecture that enables efficient auto-regressive shot generation with a KV-cache. Our experiments are conducted with an internal video model at the 3B parameter scale. For more technical details, please refer to our Research Paper.
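To give a rough sense of the interleaved position design, the toy function below assigns 3D RoPE indices (time, height, width) so that temporal coordinates continue across shots, which is one simple way to expose scene-level order. The exact interleaving of text and video tokens in the model differs; treat this as an assumption-laden sketch rather than the precise scheme:

```python
def scene_position_ids(shots, shot_gap=0):
    """Toy assignment of 3D RoPE indices (t, h, w) across a scene.

    `shots` is a list of dicts with latent sizes 'frames', 'height', 'width'.
    Assumption: temporal indices simply continue from shot to shot (optionally
    separated by `shot_gap`), while spatial indices restart for every shot.
    """
    position_ids = []          # one (t, h, w) triple per video token
    t_offset = 0
    for shot in shots:
        for t in range(shot["frames"]):
            for h in range(shot["height"]):
                for w in range(shot["width"]):
                    position_ids.append((t_offset + t, h, w))
        t_offset += shot["frames"] + shot_gap
    return position_ids

# Two small shots: the second shot's temporal indices start where the first ends.
ids = scene_position_ids([{"frames": 2, "height": 1, "width": 2},
                          {"frames": 1, "height": 1, "width": 2}])
```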

Notably, beyond superior scene-level video generation, Long Context Tuning gives rise to several emergent abilities, including single- and multi-shot extension and compositional generation.

Interactively Direct a Multi-shot Story

Our model enables interactive multi-shot development, facilitating an iterative production workflow. Directors can progressively shape content shot-by-shot based on previously generated footage, eliminating the need for comprehensive upfront prompting and allowing creative decision-making with immediate visual feedback.

Multi-shot development diagram
Interactive Multi-shot Development

The diagram above illustrates how we can extend a Sora single-shot video by interactively providing text prompts. The complete video is presented below:

Interactively directing a multi-shot story
Interactive Single-shot Extension

Our model also enables interactive shot extension without cuts. This approach can extend a single shot to minute-long durations by auto-regressively generating 10-second segments while maintaining visual consistency throughout.
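In pseudocode, the extension workflow looks roughly like the loop below. `sample_next_shot` is a hypothetical interface standing in for one conditional sampling pass of the model; the key point is that each generated segment is appended to the clean context that conditions the next one, and a context-causal variant can additionally reuse the KV-cache of finished segments:

```python
def extend_scene(model, history_shots, shot_prompts, segment_seconds=10):
    """Illustrative auto-regressive extension loop (hypothetical API).

    `model.sample_next_shot` is a stand-in for one conditional sampling pass:
    it denoises a new segment while attending to the clean `context` shots.
    """
    scene = list(history_shots)              # e.g. a single initial shot
    for prompt in shot_prompts:              # one text prompt per new segment
        new_shot = model.sample_next_shot(
            context=scene,                   # previously generated, kept clean
            prompt=prompt,
            duration=segment_seconds,        # 10-second segments in our demos
        )
        scene.append(new_shot)               # the new segment joins the context
    return scene
```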

Single-shot extension diagram
Interactive Single-shot Extension

The diagram above demonstrates how we can extend the narrative of an initial Sora video by interactively providing text prompts. The complete extended video is presented below:

Extending a single-shot video to minute-long duration
Compositional Generation

Remarkably, despite never being explicitly trained for this capability, our model supports compositional generation: given separate identity and environment images, it synthesizes a coherent video that integrates both. This ability emerges naturally from the scene-level visual relationships learned from the training corpus, where scenes typically contain establishing environment shots, character close-ups, and shots depicting character-environment interactions.

Identity image
Identity Image 1
Environment image
Environment Image
Composed Video

Building on the results above, we can provide the model with another identity image, together with the previously generated content, to continue the generation chain. The diagram below illustrates this process.

Compositional generation process
Compositional Generation Process

Below are the additional identity image and the resulting video. Note the exceptional scene consistency maintained throughout, particularly how the arrangement of paintings on the wall exactly matches that in the initial environment image.

Second identity image
History + Identity Image 2
Composed Video
Visual Conditioning

Our asynchronous diffusion timestep strategy unifies conditional inputs and diffusion samples, enabling arbitrary images or videos to serve as additional visual conditions, as demonstrated in the previous sections.
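One simple way to realize such a strategy during training is to draw an independent diffusion timestep for every shot and pin condition shots to timestep 0 so they stay clean. The helper below is a sketch under exactly that assumption (discrete timesteps, a boolean condition mask) rather than the actual training code:

```python
import torch

def sample_shot_timesteps(num_shots, condition_mask, num_train_steps=1000):
    """Simplified asynchronous timestep assignment for one training scene.

    Each shot draws its own timestep instead of sharing one scene-wide value;
    shots flagged in `condition_mask` are pinned to 0 and receive no noise,
    so any image or video can later serve as a clean visual condition.
    """
    t = torch.randint(1, num_train_steps, (num_shots,))
    return torch.where(condition_mask, torch.zeros_like(t), t)

# Example: a 4-shot scene whose first shot is a provided reference image/video.
timesteps = sample_shot_timesteps(4, torch.tensor([True, False, False, False]))
```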

In the example below, we illustrate a practical workflow enabled by our model: first designing the character, costume, and environment using external tools (such as FLUX), then integrating them into a video to achieve precise control over character appearance and scene composition.

Character design
Character Design (source)
Environment design
Environment Design (source)
Composed Video
Controllable Scene Interpolation

Our bidirectional model accepts visual conditions at arbitrary positions within a scene, supporting "scene interpolation" applications. As shown below, given the first and last shots, we can generate the intermediate shots with semantic coherence.

Shot 1 (source: Green Book)
Shot 2 (unknown)
Shot 3

Using our model, we generate two possible interpolation results shown below. By providing specific text prompts, we can guide the interpolated narrative while maintaining visual and semantic coherence. In the first example, Dr. Shirley is listening to music. In the second, Tony Lip is making a phone call.

Possible Interpolation 1
Possible Interpolation 2
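Viewed through the asynchronous-timestep lens, interpolation is just a particular conditioning layout at inference time: the given first and last shots stay pinned to timestep 0 throughout sampling, wherever they sit in the scene, while the shots between them follow the normal denoising schedule. The snippet below illustrates this layout under the same simplifying assumptions as the sketch in the Visual Conditioning section:

```python
import torch

def interpolation_timesteps(condition_mask, step_t):
    """Per-shot timesteps at one sampling step (illustrative only).

    Condition shots, wherever they sit in the scene, stay at timestep 0 for
    the whole trajectory; the remaining shots follow the usual schedule.
    """
    t = torch.full_like(condition_mask, step_t, dtype=torch.long)
    return torch.where(condition_mask, torch.zeros_like(t), t)

# First and last shots are given; the two middle shots are generated.
mask = torch.tensor([True, False, False, True])
print(interpolation_timesteps(mask, step_t=999))   # tensor([  0, 999, 999,   0])
```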
Generalizability

Finally, we demonstrate the versatility of our model. It excels not only in generating human-centric content (Example 1) but also in producing nature documentaries (Example 2), showcasing its broad applicability across diverse visual domains.

Example 1: Two characters meet in a café and start a conversation
Example 2: An exploration through a beautiful coral reef