
Abstract

We introduce PlayerOne, the first egocentric realistic world simulator, enabling immersive and unrestricted exploration within vividly dynamic environments. Given an egocentric scene image from the user, PlayerOne can accurately construct the corresponding world and generate egocentric videos that are strictly aligned with the user's real-world motion captured by an exocentric camera. PlayerOne is trained in a coarse-to-fine pipeline that first performs pretraining on large-scale egocentric text-video pairs for coarse-level egocentric understanding, followed by finetuning on synchronous motion-video data extracted from egocentric-exocentric video datasets with our automatic construction pipeline. Moreover, considering the varying importance of different motion components, we design a part-disentangled motion injection scheme that enables precise control of part-level movements. In addition, we devise a joint reconstruction framework that progressively models both the 4D scene and the video frames, ensuring scene consistency in long-form video generation. Experimental results demonstrate strong generalization in precisely controlling varied human movements and in world-consistent modeling of diverse scenarios. PlayerOne marks the first endeavor into egocentric real-world simulation and can pave the way for the community to explore new frontiers of world modeling and its diverse applications.
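To make the coarse-to-fine recipe concrete, below is a minimal sketch of the two-stage schedule: pretraining on text-video pairs, then finetuning on motion-video pairs. All names here (WorldSimulator, ToyPairs, train_stage) are hypothetical placeholders rather than the released implementation, and the noise-prediction loss is a generic stand-in for the actual diffusion objective.

import torch
from torch.utils.data import DataLoader, Dataset

class WorldSimulator(torch.nn.Module):
    """Stand-in for the video diffusion backbone; a single linear layer here."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = torch.nn.Linear(dim, dim)

    def forward(self, noisy_latents, cond):
        return self.net(noisy_latents + cond)

class ToyPairs(Dataset):
    """Toy stand-in for (conditioning, video-latent) pairs of either stage."""
    def __init__(self, n=32, dim=64):
        self.items = [{"video": torch.randn(dim), "cond": torch.randn(dim)} for _ in range(n)]
    def __len__(self):
        return len(self.items)
    def __getitem__(self, i):
        return self.items[i]

def train_stage(model, loader, epochs, lr):
    """One training stage with a generic noise-prediction objective."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in loader:
            noise = torch.randn_like(batch["video"])
            pred = model(batch["video"] + noise, batch["cond"])
            loss = torch.nn.functional.mse_loss(pred, noise)
            opt.zero_grad(); loss.backward(); opt.step()

model = WorldSimulator()
# Stage 1: coarse pretraining on large-scale egocentric text-video pairs.
train_stage(model, DataLoader(ToyPairs(), batch_size=4), epochs=1, lr=1e-4)
# Stage 2: finetuning on synchronous motion-video pairs for precise motion control.
train_stage(model, DataLoader(ToyPairs(), batch_size=4), epochs=1, lr=1e-5)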

Overall Framework of PlayerOne


PlayerOne first converts the egocentric first frame into visual tokens. The human motion sequence is split into part groups and fed into separate motion encoders to generate part-wise motion latents, while the head parameters are converted into a rotation-only camera sequence. This camera sequence is then encoded via a camera encoder, and its output is injected into the noised video latents to improve view-change alignment. Next, we render a 4D scene point-map sequence from the ground-truth video, which is processed by a point-map encoder with an adapter to produce scene latents. We then feed the concatenation of these latents into the DiT model and perform noising and denoising on both the video and scene latents to ensure world-consistent generation. Finally, the denoised latents are decoded by VAE decoders to produce the final results. Note that only the first frame and the human motion sequence are needed for inference.
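The sketch below illustrates how these conditioning streams could be wired together. Every module (the frame tokenizer, part-wise motion encoders, camera encoder, point-map encoder and adapter, and the DiT) is a minimal placeholder with illustrative dimensions assumed for this example, not the actual architecture; the real tokenizers, latent shapes, and diffusion schedule differ.

import torch
import torch.nn as nn

D = 64  # shared latent width, illustrative only

class Enc(nn.Module):
    """Generic encoder placeholder: projects per-step features to D-dim tokens."""
    def __init__(self, in_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, D)
    def forward(self, x):          # x: (B, T, in_dim)
        return self.proj(x)        # -> (B, T, D)

class PlayerOneSketch(nn.Module):
    def __init__(self, n_parts=3, motion_dim=24, cam_dim=9, pmap_dim=192):
        super().__init__()
        self.frame_tokenizer = Enc(3 * 16 * 16)   # egocentric first frame -> visual tokens
        self.motion_encoders = nn.ModuleList(Enc(motion_dim) for _ in range(n_parts))
        self.camera_encoder = Enc(cam_dim)        # rotation-only head-camera sequence
        self.pointmap_encoder = Enc(pmap_dim)     # 4D scene point-map sequence
        self.pointmap_adapter = nn.Linear(D, D)
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.dit = nn.TransformerEncoder(layer, num_layers=2)
        self.video_head = nn.Linear(D, D)         # stand-in for the video VAE decoder
        self.scene_head = nn.Linear(D, D)         # stand-in for the scene decoder

    def forward(self, frame_patches, part_motions, head_cams, noisy_video, pointmaps):
        frame_tok = self.frame_tokenizer(frame_patches)                     # visual tokens
        motion_tok = torch.cat(                                             # part-wise motion latents
            [enc(m) for enc, m in zip(self.motion_encoders, part_motions)], dim=1)
        video_tok = noisy_video + self.camera_encoder(head_cams)            # camera latents injected
        scene_lat = self.pointmap_adapter(self.pointmap_encoder(pointmaps))
        noisy_scene = scene_lat + torch.randn_like(scene_lat)               # scene latents are noised too
        tokens = torch.cat([frame_tok, motion_tok, video_tok, noisy_scene], dim=1)
        out = self.dit(tokens)                                              # joint denoising
        n_vid, n_scn = video_tok.shape[1], noisy_scene.shape[1]
        video_out = self.video_head(out[:, -(n_vid + n_scn):-n_scn])
        scene_out = self.scene_head(out[:, -n_scn:])
        return video_out, scene_out

# Toy shapes only, to show how the pieces connect.
B, T = 1, 8
model = PlayerOneSketch()
video_out, scene_out = model(
    torch.randn(B, 1, 3 * 16 * 16),             # first-frame patches
    [torch.randn(B, T, 24) for _ in range(3)],  # grouped motion parameters
    torch.randn(B, T, 9),                       # flattened head-camera rotations
    torch.randn(B, T, D),                       # noised video latents
    torch.randn(B, T, 192))                     # rendered point maps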

Dataset Construction


By seamlessly integrating detection and human pose estimation models, we extract motion-video pairs from existing egocentric-exocentric video datasets and retain only high-quality data through our automatic filtering scheme.
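A schematic version of this extraction-and-filtering loop is sketched below. The detector, pose estimator, and the valid-frame threshold are generic placeholders introduced for illustration, not the specific models or filtering criteria used in the actual pipeline.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MotionVideoPair:
    ego_frames: list     # synchronized egocentric frames (the generation target)
    motion_params: list  # per-frame human pose parameters recovered from the exo view

def detect_person(exo_frame):
    """Placeholder person detector: should return a bounding box or None."""
    ...

def estimate_pose(exo_frame, box):
    """Placeholder whole-body pose estimator: should return pose parameters or None."""
    ...

def extract_pair(ego_frames: List, exo_frames: List,
                 min_valid_ratio: float = 0.9) -> Optional[MotionVideoPair]:
    """Run detection + pose estimation on the exocentric view; keep the clip only
    if enough frames yield a pose (a simple stand-in for automatic filtering)."""
    motions, n_valid = [], 0
    for exo in exo_frames:
        box = detect_person(exo)
        pose = estimate_pose(exo, box) if box is not None else None
        motions.append(pose)
        n_valid += pose is not None
    if n_valid < min_valid_ratio * max(len(exo_frames), 1):
        return None  # discard low-quality or poorly tracked clips
    return MotionVideoPair(ego_frames=ego_frames, motion_params=motions)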

Experimental Results

More simulated videos

Ablation study on core components

Comparison with other works




Ours | Cosmos-7B | Cosmos-14B | Aether



Descriptions: First-person perspective, stretch out the left hand to high-five the man opposite




Descriptions: First-person perspective, I stand up and extend my right hand to shake hands with the man opposite me



Descriptions: First-person perspective, I stroll forward



Descriptions: First-person perspective, I extended my left hand to shake hands with the woman opposite me



Descriptions: First-person perspective, I squatted down and put my hands on both sides of the chair



Descriptions: First-person perspective, squat down and stretch out my hands to touch the dog's head



Descriptions: First-person perspective, I stretch out my right hand to pick up the triangular rice ball below



Descriptions: First-person perspective, I stretched out my left hand and high-fived the golden retriever.



Descriptions: First-person perspective, I reach out my left hand and stroke the golden retriever's head



Descriptions: First-person perspective, I stretch out my left hand and high-five the man opposite me


Reference

      
@article{tu2025PlayerOne,
  title   = {PlayerOne: Egocentric World Simulator},
  author  = {Tu, Yuanpeng and Luo, Hao and Chen, Xi and Bai, Xiang and Wang, Fan and Zhao, Hengshuang},
  journal = {arXiv preprint},
  year    = {2025}
}