Long Context Tuning for Video Generation

Yuwei Guo1,2 Ceyuan Yang2,† Ziyan Yang2 Zhibei Ma2 Zhijie Lin2 Zhenheng Yang3 Dahua Lin1 Lu Jiang2
1The Chinese University of Hong Kong 2ByteDance Seed 3ByteDance
†Corresponding Author
Research Paper
Teaser video: a 27-shot scene generated by our model, with per-shot prompts: (1) forest flying view, (2) low-angle view, (3) get lost in forest, (4) girl walks out, (5) walks towards camera, (6) POV of forest, (7) meet and discuss, (8) close-up to serious face, (9) worrying expression, (10) continue exploration, (11) high-angle view, (12) wide shot, (13) arrive at a clearing, (14) find an abandoned house, (15) focus on the house, (16) cautiously discuss, (17) close-up to the door, (18) tries to open the door, (19) interior of the house, (20) enter and explore, (21) continue exploration, (22) into a dusty room, (23) showing room's layout, (24) close-up to the shelf, (25) find something on desk, (26) magic ball appears, (27) their shocked faces.

We propose Long Context Tuning (LCT) for scene-level video generation to bridge the gap between current single-shot generation capabilities and real-world narrative video productions such as movies. In this framework, a scene comprises a series of single-shot videos capturing coherent events that unfold over time with semantic and temporal consistency.

Architecture diagram
Architecture Designs for Long Context Tuning.

We address this challenge with a data-driven approach, learning scene-level consistency directly from data while minimizing architectural inductive bias that might compromise scalability. The core innovation of our method is expanding the scope of MMDiT's context window from a single shot to the entire scene. This modification is compatible with MMDiT-based single-shot video diffusion models and introduces no additional parameters, functioning as a training stage that follows single-shot video generation pre-training.
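Conceptually, the change amounts to letting attention run over the concatenated token sequences of all shots in a scene rather than over each shot in isolation. The PyTorch sketch below is purely illustrative (the helper names and tensor shapes are placeholders, not the actual implementation), but it captures the essence of the context expansion:

```python
import torch
import torch.nn.functional as F

def single_shot_attention(q, k, v):
    # Pre-training: q, k, v hold the tokens of ONE shot,
    # shaped [batch, heads, tokens_per_shot, head_dim].
    return F.scaled_dot_product_attention(q, k, v)

def scene_attention(qs, ks, vs):
    # LCT-style expansion (illustrative): qs, ks, vs are lists of per-shot
    # tensors with the same shapes. Concatenating them along the token axis
    # widens the attention context from one shot to the whole scene without
    # adding any parameters.
    q, k, v = (torch.cat(x, dim=2) for x in (qs, ks, vs))
    out = F.scaled_dot_product_attention(q, k, v)  # full bidirectional attention
    # Split the result back into per-shot chunks for the rest of the block.
    return list(torch.split(out, [x.shape[2] for x in qs], dim=2))
```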

We employ an interleaved 3D Rotary Position Embedding (RoPE) to encode scene-level video order, along with an asynchronous diffusion timestep strategy that unifies visual conditions and diffusion samples. We further explore a context-causal architecture that enables efficient auto-regressive shot generation with a KV-cache. Our experiments are conducted with an internal video model at the 3B parameter scale. For more technical details, please refer to our Research Paper.
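To give a rough sense of the interleaved position design, the toy function below assigns 3D RoPE indices (time, height, width) so that temporal coordinates continue across shots, which is one simple way to expose scene-level order. The exact interleaving of text and video tokens in the model differs; treat this as an assumption-laden sketch rather than the precise scheme:

```python
def scene_position_ids(shots, shot_gap=0):
    """Toy assignment of 3D RoPE indices (t, h, w) across a scene.

    `shots` is a list of dicts with latent sizes 'frames', 'height', 'width'.
    Assumption: temporal indices simply continue from shot to shot (optionally
    separated by `shot_gap`), while spatial indices restart for every shot.
    """
    position_ids = []          # one (t, h, w) triple per video token
    t_offset = 0
    for shot in shots:
        for t in range(shot["frames"]):
            for h in range(shot["height"]):
                for w in range(shot["width"]):
                    position_ids.append((t_offset + t, h, w))
        t_offset += shot["frames"] + shot_gap
    return position_ids

# Two small shots: the second shot's temporal indices start where the first ends.
ids = scene_position_ids([{"frames": 2, "height": 1, "width": 2},
                          {"frames": 1, "height": 1, "width": 2}])
```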

Notably, beyond superior scene-level video generation, Long Context Tuning gives rise to several emergent abilities, including single- and multi-shot extension and compositional generation.

Interactively Direct a Multi-shot Story

Our model enables interactive multi-shot development, facilitating an iterative production workflow. Directors can progressively shape content shot-by-shot based on previously generated footage, eliminating the need for comprehensive upfront prompting and allowing creative decision-making with immediate visual feedback.

Multi-shot development diagram
Interactive Multi-shot Development

The diagram above illustrates how we can extend a Sora single-shot video by interactively providing text prompts. The complete video is presented below:

Interactively directing a multi-shot story
Interactive Single-shot Extension

Our model also enables interactive shot extension without cuts. This approach can extend a single shot to minute-long durations by auto-regressively generating 10-second segments while maintaining visual consistency throughout.
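In pseudocode, the extension workflow looks roughly like the loop below. `sample_next_shot` is a hypothetical interface standing in for one conditional sampling pass of the model; the key point is that each generated segment is appended to the clean context that conditions the next one, and a context-causal variant can additionally reuse the KV-cache of finished segments:

```python
def extend_scene(model, history_shots, shot_prompts, segment_seconds=10):
    """Illustrative auto-regressive extension loop (hypothetical API).

    `model.sample_next_shot` is a stand-in for one conditional sampling pass:
    it denoises a new segment while attending to the clean `context` shots.
    """
    scene = list(history_shots)              # e.g. a single initial shot
    for prompt in shot_prompts:              # one text prompt per new segment
        new_shot = model.sample_next_shot(
            context=scene,                   # previously generated, kept clean
            prompt=prompt,
            duration=segment_seconds,        # 10-second segments in our demos
        )
        scene.append(new_shot)               # the new segment joins the context
    return scene
```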

Single-shot extension diagram
Interactive Single-shot Extension

The diagram above demonstrates how we can extend the narrative of an initial Sora video by interactively providing text prompts. The complete extended video is presented below:

Extending a single-shot video to minute-long duration
Compositional Generation

Remarkably, despite never being explicitly trained for this capability, our model supports compositional generation: given separate identity and environment images, it synthesizes a coherent video that integrates both. This ability emerges naturally from the scene-level visual relationships learned from the training corpus, where scenes typically contain establishing environment shots, character close-ups, and shots depicting character-environment interactions.

Identity image
Identity Image 1
Environment image
Environment Image
Composed Video

Building on the results above, we can provide the model with another identity image, together with the previously generated content, to continue the generation chain. The diagram below illustrates this process.

Compositional generation process
Compositional Generation Process

Below are the additional identity image and the resulting video. Note the exceptional scene consistency maintained throughout, particularly how the arrangement of paintings on the wall exactly matches that in the initial environment image.

Second identity image
History + Identity Image 2
Composed Video
Visual Conditioning

Our asynchronous diffusion timestep strategy unifies conditional inputs and diffusion samples, enabling arbitrary images or videos to serve as additional visual conditions, as demonstrated in the previous sections.
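One simple way to realize such a strategy during training is to draw an independent diffusion timestep for every shot and pin condition shots to timestep 0 so they stay clean. The helper below is a sketch under exactly that assumption (discrete timesteps, a boolean condition mask) rather than the actual training code:

```python
import torch

def sample_shot_timesteps(num_shots, condition_mask, num_train_steps=1000):
    """Simplified asynchronous timestep assignment for one training scene.

    Each shot draws its own timestep instead of sharing one scene-wide value;
    shots flagged in `condition_mask` are pinned to 0 and receive no noise,
    so any image or video can later serve as a clean visual condition.
    """
    t = torch.randint(1, num_train_steps, (num_shots,))
    return torch.where(condition_mask, torch.zeros_like(t), t)

# Example: a 4-shot scene whose first shot is a provided reference image/video.
timesteps = sample_shot_timesteps(4, torch.tensor([True, False, False, False]))
```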

In the example below, we illustrate a practical workflow enabled by our model: first designing the character, costume, and environment using external tools (such as FLUX), then integrating them into a video to achieve precise control over character appearance and scene composition.

Character design
Character Design (source)
Environment design
Environment Design (source)
Composed Video
Controllable Scene Interpolation

Our bidirectional model accepts visual conditions at arbitrary positions within a scene, supporting "scene interpolation" applications. As shown below, given the first and last shots, we can generate the intermediate shots with semantic coherence.

Shot 1 (source: Green Book)
Shot 2 (unknown)
Shot 3

Using our model, we generate two possible interpolation results shown below. By providing specific text prompts, we can guide the interpolated narrative while maintaining visual and semantic coherence. In the first example, Dr. Shirley is listening to music. In the second, Tony Lip is making a phone call.

Possible Interpolation 1
Possible Interpolation 2
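Viewed through the asynchronous-timestep lens, interpolation is just a particular conditioning layout at inference time: the given first and last shots stay pinned to timestep 0 throughout sampling, wherever they sit in the scene, while the shots between them follow the normal denoising schedule. The snippet below illustrates this layout under the same simplifying assumptions as the sketch in the Visual Conditioning section:

```python
import torch

def interpolation_timesteps(condition_mask, step_t):
    """Per-shot timesteps at one sampling step (illustrative only).

    Condition shots, wherever they sit in the scene, stay at timestep 0 for
    the whole trajectory; the remaining shots follow the usual schedule.
    """
    t = torch.full_like(condition_mask, step_t, dtype=torch.long)
    return torch.where(condition_mask, torch.zeros_like(t), t)

# First and last shots are given; the two middle shots are generated.
mask = torch.tensor([True, False, False, True])
print(interpolation_timesteps(mask, step_t=999))   # tensor([  0, 999, 999,   0])
```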
Generalizability

Finally, we demonstrate the versatility of our model. It excels not only in generating human-centric content (Example 1) but also in producing nature documentaries (Example 2), showcasing its broad applicability across diverse visual domains.

Example 1: Two characters meet in a café and start a conversation
Example 2: An exploration through a beautiful coral reef