PersonaBooth: Personalized Text-to-Motion Generation

¹University of Birmingham, ²Korea Electronics Technology Institute, ³Dankook University, ⁴Kwangwoon University

Motion Personalization generates text-driven, personalized motions based on the persona embedded in atomic input motions. We define a persona as the unique style expression of an individual.

Abstract

This paper introduces Motion Personalization, a new task that generates personalized motions aligned with text descriptions, using several basic motions containing a persona. To support this novel task, we introduce a new large-scale motion dataset called PerMo (PersonaMotion), which captures the unique personas of multiple actors. We also propose PersonaBooth, a multi-modal finetuning method for a pretrained motion diffusion model. PersonaBooth addresses two main challenges: i) a significant distribution gap between the persona-focused PerMo dataset and the pretraining datasets, which lack persona-specific data, and ii) the difficulty of capturing a consistent persona from motions that vary in content (action type). To tackle the dataset distribution gap, we introduce a persona token to accept new persona features and perform multi-modal adaptation for both text and visuals during finetuning. To capture a consistent persona, we incorporate a contrastive learning technique that enhances intra-cohesion among samples sharing the same persona. Furthermore, we introduce a context-aware fusion mechanism to maximize the integration of persona cues from multiple input motions. PersonaBooth outperforms state-of-the-art motion style transfer methods, establishing a new benchmark for motion personalization.
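As a rough illustration of the intra-cohesion idea, the sketch below shows a supervised-contrastive-style loss that pulls together persona features extracted from motions sharing the same persona label. The function name, the temperature value, and the batching convention are illustrative assumptions, not the paper's implementation.

# Minimal sketch (illustrative only, not the authors' code): a supervised
# contrastive-style loss encouraging intra-cohesion among persona features
# of motions that share the same persona label.
import torch
import torch.nn.functional as F

def persona_cohesion_loss(features, persona_ids, temperature=0.07):
    # features:    (B, D) persona features from the extractor
    # persona_ids: (B,)   integer persona labels
    feats = F.normalize(features, dim=-1)
    logits = feats @ feats.t() / temperature                # pairwise similarities
    self_mask = torch.eye(len(feats), dtype=torch.bool, device=feats.device)
    logits = logits.masked_fill(self_mask, float("-inf"))   # exclude self-pairs

    pos = (persona_ids[:, None] == persona_ids[None, :]) & ~self_mask
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    # Average log-likelihood of positive pairs per anchor; anchors without
    # any same-persona partner in the batch are skipped.
    pos_counts = pos.sum(dim=1)
    valid = pos_counts > 0
    loss = -(pos.float() * log_prob).sum(dim=1)[valid] / pos_counts[valid]
    return loss.mean()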

Framework

The overall framework of PersonaBooth. PersonaBooth has two adaptation paths (visual and text) for finetuning the motion diffusion model ($\mathcal{D}$). The Persona Extractor extracts both a visual persona feature ($V^*$) and a persona token ($P^*$) from the input motions. $V^*$ is fed into the adaptive layer of $\mathcal{D}$, while $P^*$ is processed together with the input prompt by the Personalized Text Encoder, producing a personalized text feature that is then input to $\mathcal{D}$. The entire model is trained in a classifier-free manner, together with a Persona Cohesion Loss. During inference, Context-Aware Fusion is applied when multiple input motions are provided.
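The context-aware fusion step can be pictured with the hedged sketch below, which assumes a simple relevance-weighted average: each input motion's persona feature is weighted by its similarity to the encoded prompt before being fused into a single conditioning feature. The exact mechanism in PersonaBooth may differ, and all names here are hypothetical.

# Hedged sketch of context-aware fusion (assumptions, not the paper's exact
# mechanism): persona features from several input motions are averaged with
# weights given by their similarity to the encoded text prompt.
import torch
import torch.nn.functional as F

def context_aware_fusion(persona_feats, prompt_feat, temperature=0.1):
    # persona_feats: (N, D), one feature per input motion
    # prompt_feat:   (D,),   feature of the encoded text prompt
    scores = F.normalize(persona_feats, dim=-1) @ F.normalize(prompt_feat, dim=-1)
    weights = torch.softmax(scores / temperature, dim=0)   # (N,) relevance weights
    return (weights[:, None] * persona_feats).sum(dim=0)   # fused feature (D,)

With, say, three input motions of the same persona, the clip whose content is most relevant to the prompt would contribute most to the fused feature used to condition $\mathcal{D}$.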

Dataset

We collected the large-scale PerMo dataset, which captures personas from multiple actors. To ensure variety, we hired five professional motion-capture actors of diverse genders and body types. Each actor performed 34 styles, categorized into Age, Character, Condition, Emotion, Traits, and Surroundings, resulting in a total of 170 personas. For every style, each actor performed 10 distinct contents, carefully selected to engage different body parts. Notably, PerMo is the first motion style dataset to collect data from multiple actors. It also offers the highest number of total clips and content categories among trainable datasets and is the only one that includes mesh data, marker data, and text descriptions.
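For reference, the combinatorial structure described above works out as in the short sketch below; it only counts (actor, style, content) combinations, since the number of takes per combination is not stated on this page.

# Combinatorial structure of PerMo as described above (a lower bound on
# the clip count, since per-combination repetitions are not listed here).
NUM_ACTORS = 5
NUM_STYLES = 34          # grouped into Age, Character, Condition, Emotion, Traits, Surroundings
CONTENTS_PER_STYLE = 10  # distinct contents (action types) per style

num_personas = NUM_ACTORS * NUM_STYLES                  # 5 x 34 = 170 personas
num_combinations = num_personas * CONTENTS_PER_STYLE    # 170 x 10 = 1,700 combinations

print(num_personas, num_combinations)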

Results of PersonaBooth

BibTeX

@article{kim2025personabooth,
      title={PersonaBooth: Personalized Text-to-Motion Generation},
      author={Kim, Boeun and Jeong, Hea In and Sung, JungHoon and Cheng, Yihua and Lee, Jeongmin and Chang, Ju Yong and Choi, Sang-Il and Choi, Younggeun and Shin, Saim and Kim, Jungho and Chang, Hyung Jin},
      journal={arXiv preprint arXiv:2503.07390},
      year={2025}
}