¹Peking University  ²Li Auto  ³Harbin Institute of Technology
Illustration of the motivation. (a) shows sampled frames from videos generated with the prompt “running”, where existing works struggle to generate human videos with reasonable structures. (b) compares existing human video datasets with our Movid: existing datasets mostly focus on facial or upper-body regions, or consist of vertically oriented dance videos. More samples of Movid are provided in the supplementary materials.
Existing video generation models predominantly emphasize appearance fidelity while exhibiting limited ability to
synthesize complex human motions, such as whole-body movements, long-range dynamics, and fine-grained human-environment
interactions. This often leads to unrealistic or physically implausible movements with inadequate structural coherence.
To address these challenges, we propose MoSA, which decouples human video generation into two components,
i.e., structure generation and appearance generation. MoSA first employs a 3D structure transformer to generate a human
motion sequence from the text prompt. The video appearance is then synthesized under the guidance of this
structural sequence. We achieve fine-grained control over the sparse human structures by introducing Human-Aware Dynamic
Control modules with a dense tracking constraint during training. The modeling of human-environment interactions is
improved through the proposed contact constraint. These two components work in concert to ensure structural and
appearance fidelity across the generated videos. This paper also contributes Movid, a large-scale human video dataset
featuring more complex and diverse motions than existing human video datasets. We conduct comprehensive comparisons
between MoSA and a variety of approaches, including general video generation models, human video generation models, and
human animation models. Experiments demonstrate that MoSA substantially outperforms existing approaches across the
majority of evaluation metrics.
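To make this decoupled design concrete, below is a minimal PyTorch sketch of the two-stage inference pipeline. Every name and shape here (StructureTransformer, generate_video, appearance_model.sample, 24 joints, 49 frames) is a hypothetical placeholder of ours, not MoSA's released implementation.

```python
import torch
import torch.nn as nn

class StructureTransformer(nn.Module):
    """Stage 1 (sketch): map a text embedding to a human structure sequence.

    Hypothetical stand-in for MoSA's 3D structure transformer: learnable
    frame queries are shifted by the prompt embedding, refined by a
    transformer, and a linear head emits T frames of J joints in 3D.
    """
    def __init__(self, text_dim=768, hidden=512, num_joints=24, num_frames=49):
        super().__init__()
        self.num_frames, self.num_joints = num_frames, num_joints
        self.text_proj = nn.Linear(text_dim, hidden)
        self.queries = nn.Parameter(torch.randn(num_frames, hidden))
        layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(hidden, num_joints * 3)

    def forward(self, text_emb):                       # (B, text_dim)
        b = text_emb.shape[0]
        tokens = self.queries.unsqueeze(0).expand(b, -1, -1)
        tokens = tokens + self.text_proj(text_emb).unsqueeze(1)
        feats = self.backbone(tokens)                  # (B, T, hidden)
        return self.head(feats).view(b, self.num_frames, self.num_joints, 3)


def generate_video(text_emb, structure_model, appearance_model, steps=50):
    """Stage 2 (sketch): synthesize appearance under structural guidance.

    `appearance_model.sample` is an assumed interface to a video diffusion
    model conditioned on both the text and the generated structure sequence.
    """
    with torch.no_grad():
        structure_seq = structure_model(text_emb)      # (B, T, J, 3)
        return appearance_model.sample(text_emb, structure_seq, steps=steps)
```

Keeping the structure stage separate means the appearance model never has to infer body kinematics from text alone, which is the failure mode shown in the motivation figure.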
Method
Overview of the proposed MoSA. Given a text prompt, we first employ a 3D structure transformer to generate a structure sequence, which is subsequently encoded into structural features that guide the appearance generation. To further enhance motion consistency, we introduce Human-Aware Dynamic Control modules. For brevity, the Gate modules in the blocks are omitted.
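As a rough illustration of how such a gated control module could inject structural features into a video generation block, consider the sketch below. The cross-attention fusion, the zero-initialized gate, and all names are assumptions made for illustration; the paper's actual Human-Aware Dynamic Control design may differ.

```python
import torch
import torch.nn as nn

class HumanAwareDynamicControl(nn.Module):
    """Hypothetical sketch of a Human-Aware Dynamic Control (HADC) block.

    Video tokens cross-attend to encoded human structure features, and a
    learnable, zero-initialized gate scales the injected residual so the
    base generator is unperturbed at the start of training.
    """
    def __init__(self, dim):
        super().__init__()
        self.struct_proj = nn.Linear(dim, dim)
        self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # the Gate module omitted in the figure

    def forward(self, video_tokens, struct_tokens):
        # video_tokens:  (B, N, dim) latent video features
        # struct_tokens: (B, M, dim) encoded human structure features
        s = self.struct_proj(struct_tokens)
        ctrl, _ = self.fuse(video_tokens, s, s)    # cross-attend to structure
        return video_tokens + torch.tanh(self.gate) * ctrl
```

A zero-initialized gate is a common way to attach a new control branch to a pretrained generator without disturbing it early in training; whether MoSA uses this exact scheme is an assumption here.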
Comparison with Existing Text-to-Video Generation Models
We show visual comparisons with ModelScope, VideoCrafter2, LaVie, Mochi 1, CogVideoX-5B, Hunyuan, and Wan2.1.
Comparison with Text-Driven Human Video Generation Models
Since existing text-driven human video generation methods (Move-in-2D and HumanDreamer) are not yet open source, we perform visual comparisons based on their released videos, where available.
Text prompt:"A man hitting a tennis return."
Move-in-2D
Ours
Text prompt:"A man is performing a yoga pose on a mat, and he is seen moving his legs ..."
HumanDreamer
Ours
Comparison with Commercial Models Kling and Seedance
Text prompt:"A man stretches on a wooden pier, placing one hand on his foot and the other reaching towards his leg."
Seedance
Kling
Ours
More Generated Videos of Our MoSA
Videos with dynamic camera trajectories focusing on the human, generated by MoSA.
Videos with dynamic camera trajectories focusing on the background, generated by MoSA.
Half-body videos generated by MoSA.
Multi-person videos generated by MoSA.
More generation results of MoSA (12 videos in total).
Diverse Generated Videos from the Same Motion Prompt
Text prompt:"A solitary snowboarder in a jacket and black pants is skiing, with the serene, snow-covered landscape and the misty mountains in the background. The scene is quiet and devoid of other individuals or artificial elements, highlighting the tranquility of the winter sports environment."
Text prompt:"A woman with wet hair, wearing a neon green swimsuit is surfing, against a backdrop of a clear sky and lush greenery. After that, she continues her water surfing adventure, now on a large wave. The scene is set against a serene lake and dense greenery, highlighting the exhilaration of the sport."
BibTex
@article{wang2025mosa,
title={MoSA: Motion-Consistent Human Video Generation via Structure-Appearance Decoupling},
author={Haoyu Wang and Hao Tang and Donglin Di and Zhilu Zhang and Wangmeng Zuo and Feng Gao and Siwei Ma and Shiliang Zhang},
journal={arXiv preprint arXiv:2508.17404},
year={2025}
}
The project page template is borrowed from DreamBooth.