ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation

Abstract

Recent works have successfully extended large-scale text-to-image models to the video domain, producing promising results, but at a high computational cost and with a need for large amounts of video data. In this work, we introduce ConditionVideo, a training-free approach to text-to-video generation based on a provided condition, a video, and input text, which leverages off-the-shelf text-to-image generation methods (e.g., Stable Diffusion). ConditionVideo generates realistic dynamic videos from random noise or from given scene videos. Our method explicitly disentangles the motion representation into condition-guided and scenery motion components. To this end, the ConditionVideo model is designed with a UNet branch and a control branch. To improve temporal coherence, we introduce sparse bi-directional spatial-temporal attention (sBiST-Attn). The 3D control network extends the conventional 2D ControlNet model, aiming to strengthen conditional generation accuracy by additionally leveraging the bi-directional frames in the temporal domain. Our method outperforms the compared methods in frame consistency, CLIP score, and conditional accuracy.
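The idea behind sBiST-Attn can be illustrated with a minimal NumPy sketch. The abstract does not specify how frames are selected, so the rule used here (each frame attends to itself plus every `stride`-th frame of the clip, both past and future) is an assumption made for illustration, and the function name `sparse_bidirectional_attention` is hypothetical rather than taken from the ConditionVideo code.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sparse_bidirectional_attention(q, k, v, stride=3):
    """Attention over a sparse, bi-directional set of frames.

    q, k, v: float arrays of shape (frames, tokens, dim).
    Each frame's queries attend to the keys/values of that frame plus every
    `stride`-th frame of the clip, whether earlier or later. The stride-based
    sampling is an assumed rule, not the documented sBiST-Attn scheme.
    """
    num_frames, _, dim = q.shape
    ref_frames = np.arange(0, num_frames, stride)
    out = np.empty_like(q)
    for i in range(num_frames):
        # Reference frames (past and future) plus the current frame.
        idx = np.union1d(ref_frames, [i])
        k_i = k[idx].reshape(-1, dim)  # (len(idx) * tokens, dim)
        v_i = v[idx].reshape(-1, dim)
        weights = softmax(q[i] @ k_i.T / np.sqrt(dim))
        out[i] = weights @ v_i
    return out


rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))
k = rng.standard_normal((8, 4, 16))
v = rng.standard_normal((8, 4, 16))
out = sparse_bidirectional_attention(q, k, v, stride=3)
print(out.shape)  # (8, 4, 16)
```

With 8 frames and `stride=3`, every frame attends to frames 0, 3, and 6 in addition to itself, so frame 0 already sees information from later frames. Attending to a sparse subset rather than all frames keeps the key/value set small as the clip grows.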

Results

Customized Videos

Ironman in the sea.

Batman in front of the earth.

A red robot in front of a stone stair in snow.

Ironman on beach.

A girl with black hair in the blue sky and green meadows in wind, Van Gogh style.

Ironman in the sea, Van Gogh style.

Pose

The Knight, in a medieval castle, oil painting style.

Batman in front of the earth.

The Astronaut, brown background.

The Cyborg, in a virtual reality world, digital art style.

The Cowboy, in a desert ghost town, Western painting style.

The Astronaut, brown background.

Canny

Road at night, oil painting style.

Road in space, Van Gogh style.

Road in the mountains, Van Gogh style.

Road under cosmic galaxies, oil painting style.

Spiderman is running.

A man is running.

Depth

A man playing cello.

A purple jellyfish.

A walking puppy.

An ostrich.

Ice coffee.

A horse under a blue sky.

Segment

A horse, watercolor.

A red jellyfish, pastel colours.

A white duck.

A blue duck, oil painting.

A red swan, concept art.

The Explorer, in a dense jungle, documentary style.

Multi-people Half-body

Three spiderman at night.

BibTeX

@misc{peng2023conditionvideo,
  title={ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation},
  author={Bo Peng and Xinyuan Chen and Yaohui Wang and Chaochao Lu and Yu Qiao},
  year={2023},
  eprint={2310.07697},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}