We introduce a novel method for converting pretrained text-to-video models into continuous video editors while preserving identity and background, a task that remains challenging for existing approaches.
Given a pretrained text-to-video generation model, TokenDial learns lightweight slider controls for continuous video editing without modifying the backbone model weights.
Each slider corresponds to a semantic attribute, such as appearance or motion, and allows users to smoothly adjust the strength of an edit from weaker to stronger effects.
Instead of changing the model through full fine-tuning, TokenDial operates in the intermediate video token space.
Our method learns additive token offsets that are injected into spatiotemporal visual patch tokens inside the video diffusion transformer.
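The injection mechanism can be illustrated with a minimal sketch. All names here (`apply_slider`, the tensor shapes, the scalar `alpha`) are our assumptions for illustration, not the actual TokenDial implementation: a frozen backbone produces spatiotemporal patch tokens, and a learned offset direction is broadcast-added to them, scaled by the user-chosen slider strength.

```python
import numpy as np

def apply_slider(tokens: np.ndarray, offset: np.ndarray, alpha: float) -> np.ndarray:
    """Inject a learned additive token offset, scaled by slider strength alpha.

    tokens: (frames, patches, dim) spatiotemporal patch tokens (hypothetical shape)
    offset: (dim,) learned control direction, broadcast over frames and patches
    """
    return tokens + alpha * offset

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 256, 64))   # 8 frames, 256 patches, dim 64 (stand-in values)
offset = rng.standard_normal(64) * 0.01      # stand-in for a learned control direction

edited = apply_slider(tokens, offset, alpha=0.5)
print(edited.shape)  # (8, 256, 64)
```

Because the offset is purely additive, `alpha = 0` recovers the original tokens exactly, which is what makes the edit strength continuous and predictable.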
For appearance sliders, we apply semantic direction matching in a visual embedding space, steering the edited output toward the desired attribute direction.
For motion sliders, we use motion magnitude scaling to supervise controlled changes in temporal dynamics.
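The two training signals above can be sketched as simple losses. This is a schematic reconstruction under stated assumptions, not the paper's objective: we assume the appearance loss rewards cosine alignment between the embedding change and a target semantic direction, and that the motion loss matches the edited frame-to-frame change to the original motion magnitude scaled by the slider value.

```python
import numpy as np

def appearance_direction_loss(emb_edit: np.ndarray, emb_orig: np.ndarray,
                              direction: np.ndarray) -> float:
    """Semantic direction matching (our sketch): minimize 1 - cos(delta, direction)
    so the embedding change aligns with the target attribute direction."""
    delta = emb_edit - emb_orig
    cos = delta @ direction / (np.linalg.norm(delta) * np.linalg.norm(direction) + 1e-8)
    return 1.0 - cos

def motion_magnitude_loss(frames_edit: np.ndarray, frames_orig: np.ndarray,
                          alpha: float) -> float:
    """Motion magnitude scaling (our sketch): the edited clip's mean frame-to-frame
    change should be alpha times the original's."""
    m_edit = np.abs(np.diff(frames_edit, axis=0)).mean()
    m_orig = np.abs(np.diff(frames_orig, axis=0)).mean()
    return (m_edit - alpha * m_orig) ** 2

rng = np.random.default_rng(1)
emb_orig = rng.standard_normal(32)
direction = rng.standard_normal(32)
emb_edit = emb_orig + direction          # edit perfectly aligned with the direction
frames_orig = rng.standard_normal((8, 16, 16))
frames_edit = 2.0 * frames_orig          # motion magnitude exactly doubled

a_loss = appearance_direction_loss(emb_edit, emb_orig, direction)
m_loss = motion_magnitude_loss(frames_edit, frames_orig, alpha=2.0)
print(a_loss, m_loss)  # both near zero for these perfectly matched examples
```

In both cases the loss shapes gradients for the token offsets only, consistent with the frozen-backbone design described above.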
These learned offsets act as reusable control directions, enabling predictable and progressive changes while largely preserving the identity, background, and overall scene layout of the original generation.
TokenDial contributes to creative workflows and research on controllable video generation by enabling more precise, interpretable, and flexible control over appearance and motion.
Such capabilities may benefit content creation, human-AI collaboration, and the study of more transparent generative video systems.
However, like other advanced video editing technologies, these capabilities may also be misused to produce misleading or deceptive media.
We encourage responsible deployment, appropriate disclosure of edited content, and further work on safety measures and governance for generative video tools.
Acknowledgments
We are grateful for the valuable feedback and insightful discussions provided by
Xuanlei Zhao and Sheng Li.