¹Peking University  ²Li Auto  ³Harbin Institute of Technology
Illustration of the motivation. (a) shows sampled frames from videos generated with the prompt “running”, where existing works struggle to generate human videos with reasonable structures. (b) compares existing human video datasets with our Movid: existing datasets mostly focus on facial or upper-body regions, or consist of vertically oriented dance videos. More samples of Movid are provided in the supplementary materials.
Existing video generation models predominantly emphasize appearance fidelity while exhibiting limited ability to
synthesize complex human motions, such as whole-body movements, long-range dynamics, and fine-grained human-environment
interactions. This often leads to unrealistic or physically implausible movements with inadequate structural coherence.
To address these challenges, we propose MoSA, which decouples human video generation into two components,
i.e., structure generation and appearance generation. MoSA first employs a 3D structure transformer to generate a human
motion sequence from the text prompt. The video appearance is then synthesized under the guidance of this
structural sequence. We achieve fine-grained control over the sparse human structures by introducing Human-Aware Dynamic
Control modules with a dense tracking constraint during training. The modeling of human-environment interactions is
improved through the proposed contact constraint. These two components work in concert to ensure structural and
appearance fidelity across the generated videos. This paper also contributes Movid, a large-scale human video dataset
featuring more complex and diverse motions than existing human video datasets. We conduct comprehensive comparisons
between MoSA and a variety of approaches, including general video generation models, human video generation models, and
human animation models. Experiments demonstrate that MoSA substantially outperforms existing approaches across the
majority of evaluation metrics.
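To make this decoupled design concrete, below is a minimal PyTorch sketch of the two-stage inference pipeline. Every name and shape here (StructureTransformer, generate_video, appearance_model.sample, 24 joints, 49 frames) is a hypothetical placeholder of ours, not MoSA's released implementation.

```python
import torch
import torch.nn as nn

class StructureTransformer(nn.Module):
    """Stage 1 (sketch): map a text embedding to a human structure sequence.

    Hypothetical stand-in for MoSA's 3D structure transformer: learnable
    frame queries are shifted by the prompt embedding, refined by a
    transformer, and a linear head emits T frames of J joints in 3D.
    """
    def __init__(self, text_dim=768, hidden=512, num_joints=24, num_frames=49):
        super().__init__()
        self.num_frames, self.num_joints = num_frames, num_joints
        self.text_proj = nn.Linear(text_dim, hidden)
        self.queries = nn.Parameter(torch.randn(num_frames, hidden))
        layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(hidden, num_joints * 3)

    def forward(self, text_emb):                       # (B, text_dim)
        b = text_emb.shape[0]
        tokens = self.queries.unsqueeze(0).expand(b, -1, -1)
        tokens = tokens + self.text_proj(text_emb).unsqueeze(1)
        feats = self.backbone(tokens)                  # (B, T, hidden)
        return self.head(feats).view(b, self.num_frames, self.num_joints, 3)


def generate_video(text_emb, structure_model, appearance_model, steps=50):
    """Stage 2 (sketch): synthesize appearance under structural guidance.

    `appearance_model.sample` is an assumed interface to a video diffusion
    model conditioned on both the text and the generated structure sequence.
    """
    with torch.no_grad():
        structure_seq = structure_model(text_emb)      # (B, T, J, 3)
        return appearance_model.sample(text_emb, structure_seq, steps=steps)
```

Keeping the structure stage separate means the appearance model never has to infer body kinematics from text alone, which is the failure mode shown in the motivation figure.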
Method
Overview of the proposed MoSA. Given a text prompt, we first employ a 3D structure transformer to generate a structure sequence, which is subsequently encoded into structural features that guide the appearance generation. To further enhance motion consistency, we introduce Human-Aware Dynamic Control modules. For brevity, the Gate modules in the blocks are omitted.
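As a rough illustration of how such a gated control module could inject structural features into a video generation block, consider the sketch below. The cross-attention fusion, the zero-initialized gate, and all names are assumptions made for illustration; the paper's actual Human-Aware Dynamic Control design may differ.

```python
import torch
import torch.nn as nn

class HumanAwareDynamicControl(nn.Module):
    """Hypothetical sketch of a Human-Aware Dynamic Control (HADC) block.

    Video tokens cross-attend to encoded human structure features, and a
    learnable, zero-initialized gate scales the injected residual so the
    base generator is unperturbed at the start of training.
    """
    def __init__(self, dim):
        super().__init__()
        self.struct_proj = nn.Linear(dim, dim)
        self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # the Gate module omitted in the figure

    def forward(self, video_tokens, struct_tokens):
        # video_tokens:  (B, N, dim) latent video features
        # struct_tokens: (B, M, dim) encoded human structure features
        s = self.struct_proj(struct_tokens)
        ctrl, _ = self.fuse(video_tokens, s, s)    # cross-attend to structure
        return video_tokens + torch.tanh(self.gate) * ctrl
```

A zero-initialized gate is a common way to attach a new control branch to a pretrained generator without disturbing it early in training; whether MoSA uses this exact scheme is an assumption here.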
Comparison with Existing Text-to-Video Generation Models
We show visual comparisons with ModelScope, VideoCrafter2, LaVie, Mochi 1, CogVideoX-5B, Hunyuan, and Wan2.1.
Comparison with Text-Driven Human Video Generation Models
Since existing text-driven human video generation methods (Move-in-2D and HumanDreamer) are not yet open source, we perform visual comparisons based on their released videos, where available.
Text prompt:"A man hitting a tennis return."
Move-in-2D
Ours
Text prompt:"A man is performing a yoga pose on a mat, and he is seen moving his legs ..."
HumanDreamer
Ours
Comparison with Commercial Models Kling and Seedance
Text prompt:"A man stretches on a wooden pier, placing one hand on his foot and the other reaching towards his leg."
Seedance
Kling
Ours
More Generated Videos of Our MoSA
Videos with dynamic camera trajectories focusing on the human, generated by MoSA.
Videos with dynamic camera trajectories focusing on the background, generated by MoSA.
Half-body videos generated by MoSA.
Multi-person videos generated by MoSA.
More generation results of MoSA (12 videos in total).
Diverse Generated Videos from the Same Motion Prompt
Text prompt:"A solitary snowboarder in a jacket and black pants is skiing, with the serene, snow-covered landscape and the misty mountains in the background. The scene is quiet and devoid of other individuals or artificial elements, highlighting the tranquility of the winter sports environment."
Text prompt:"A woman with wet hair, wearing a neon green swimsuit is surfing, against a backdrop of a clear sky and lush greenery. After that, she continues her water surfing adventure, now on a large wave. The scene is set against a serene lake and dense greenery, highlighting the exhilaration of the sport."
BibTex
@article{wang2025mosa,
title={MoSA: Motion-Consistent Human Video Generation via Structure-Appearance Decoupling},
author={Haoyu Wang and Hao Tang and Donglin Di and Zhilu Zhang and Wangmeng Zuo and Feng Gao and Siwei Ma and Shiliang Zhang},
journal={arXiv preprint arXiv:2508.17404},
year={2025}
}
The project page template is borrowed from DreamBooth.