We introduce a novel method for converting pretrained text-to-video models into continuous video editors while preserving identity and background, a task that remains challenging for existing approaches.
Given a pretrained text-to-video generation model, TokenDial learns lightweight slider controls for continuous video editing without modifying the backbone model weights.
Each slider corresponds to a semantic attribute, such as appearance or motion, and allows users to smoothly adjust the strength of an edit from weaker to stronger effects.
Instead of changing the model through full fine-tuning, TokenDial operates in the intermediate video token space.
Our method learns additive token offsets that are injected into spatiotemporal visual patch tokens inside the video diffusion transformer.
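The injection mechanism can be illustrated with a minimal sketch. All names here (`apply_slider`, the tensor shapes, the scalar `alpha`) are our assumptions for illustration, not the actual TokenDial implementation: a frozen backbone produces spatiotemporal patch tokens, and a learned offset direction is broadcast-added to them, scaled by the user-chosen slider strength.

```python
import numpy as np

def apply_slider(tokens: np.ndarray, offset: np.ndarray, alpha: float) -> np.ndarray:
    """Inject a learned additive token offset, scaled by slider strength alpha.

    tokens: (frames, patches, dim) spatiotemporal patch tokens (hypothetical shape)
    offset: (dim,) learned control direction, broadcast over frames and patches
    """
    return tokens + alpha * offset

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 256, 64))   # 8 frames, 256 patches, dim 64 (stand-in values)
offset = rng.standard_normal(64) * 0.01      # stand-in for a learned control direction

edited = apply_slider(tokens, offset, alpha=0.5)
print(edited.shape)  # (8, 256, 64)
```

Because the offset is purely additive, `alpha = 0` recovers the original tokens exactly, which is what makes the edit strength continuous and predictable.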
For appearance sliders, we apply semantic direction matching in a visual embedding space, steering the edited output toward the desired attribute direction.
For motion sliders, we use motion magnitude scaling to supervise controlled changes in temporal dynamics.
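The two training signals above can be sketched as simple losses. This is a schematic reconstruction under stated assumptions, not the paper's objective: we assume the appearance loss rewards cosine alignment between the embedding change and a target semantic direction, and that the motion loss matches the edited frame-to-frame change to the original motion magnitude scaled by the slider value.

```python
import numpy as np

def appearance_direction_loss(emb_edit: np.ndarray, emb_orig: np.ndarray,
                              direction: np.ndarray) -> float:
    """Semantic direction matching (our sketch): minimize 1 - cos(delta, direction)
    so the embedding change aligns with the target attribute direction."""
    delta = emb_edit - emb_orig
    cos = delta @ direction / (np.linalg.norm(delta) * np.linalg.norm(direction) + 1e-8)
    return 1.0 - cos

def motion_magnitude_loss(frames_edit: np.ndarray, frames_orig: np.ndarray,
                          alpha: float) -> float:
    """Motion magnitude scaling (our sketch): the edited clip's mean frame-to-frame
    change should be alpha times the original's."""
    m_edit = np.abs(np.diff(frames_edit, axis=0)).mean()
    m_orig = np.abs(np.diff(frames_orig, axis=0)).mean()
    return (m_edit - alpha * m_orig) ** 2

rng = np.random.default_rng(1)
emb_orig = rng.standard_normal(32)
direction = rng.standard_normal(32)
emb_edit = emb_orig + direction          # edit perfectly aligned with the direction
frames_orig = rng.standard_normal((8, 16, 16))
frames_edit = 2.0 * frames_orig          # motion magnitude exactly doubled

a_loss = appearance_direction_loss(emb_edit, emb_orig, direction)
m_loss = motion_magnitude_loss(frames_edit, frames_orig, alpha=2.0)
print(a_loss, m_loss)  # both near zero for these perfectly matched examples
```

In both cases the loss shapes gradients for the token offsets only, consistent with the frozen-backbone design described above.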
These learned offsets act as reusable control directions, enabling predictable and progressive changes while largely preserving the identity, background, and overall scene layout of the original generation.
TokenDial contributes to creative workflows and research on controllable video generation by enabling more precise, interpretable, and flexible control over appearance and motion.
Such capabilities may benefit content creation, human-AI collaboration, and the study of more transparent generative video systems.
However, like other advanced video editing technologies, these capabilities may also be misused to produce misleading or deceptive media.
We encourage responsible deployment, appropriate disclosure of edited content, and further work on safety measures and governance for generative video tools.
Acknowledgments
We are grateful for the valuable feedback and insightful discussions provided by
Xuanlei Zhao and Sheng Li.