Achieving Artificial General Intelligence (AGI) requires agents that learn and interact adaptively, with interactive world models providing scalable environments for perception, reasoning, and action. Yet current research still lacks large-scale datasets and unified benchmarks to evaluate their physical interaction capabilities. To address this, we propose iWorld-Bench, a comprehensive benchmark for training and testing world models on interaction-related abilities such as distance perception and memory.
We construct a diverse dataset with 330k video clips and select 2.1k high-quality samples covering varied perspectives, weather, and scenes. As existing world models differ in interaction modalities, we introduce an Action Generation Framework to unify evaluation and design six task types, generating 4.9k test samples. These tasks jointly assess model performance across visual generation, trajectory following, and memory. Evaluating 14 representative world models, we identify key limitations and provide insights for future research.
| Benchmark | Field | Multiple Inputs | Interactive Task | Camera Control | Memory Ability | Multi-Scene | Multi-Perspective | All-Weather | # Examples |
|---|---|---|---|---|---|---|---|---|---|
| EWMBench | Manipulation Policies | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 2,100 |
| Movebench | Motion-Controllable Video | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | 1,018 |
| WorldEval | Manipulation Policies | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 1,400 |
| VMbench | Motion-Controllable Video | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | 1,050 |
| WorldModelBench | General World Model | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 67,000 |
| WorldBench | Motion-Controllable Video | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 425 |
| WorldScore | General World Model | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | 3,000 |
| iWorld-Bench (Ours) | Interactive World Model | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 4,900 |
Our benchmark focuses on interactive world modeling with comprehensive capabilities. Unlike existing benchmarks that are mostly designed for general-purpose world models or embodied world models, iWorld-Bench specifically evaluates interactive world models' responsiveness to external action sequences. It supports multiple input modalities, action control, camera control, memory ability evaluation, multi-scene coverage, multi-perspective observations, and all-weather adaptability.
We introduce a unified, comprehensive Action Generation Framework that accepts inputs in any world-model modality, enabling the design of action tasks and guiding world-model generation.
The generation process of a world model can be decoupled as V_{t+1} = W(I_t, C_t), where V_{t+1} is the world model's output, W is the specific world model, I_t is the current scene frame, and C_t = [D, T, R, V] is the control quadruple applied to the current frame I_t.
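This decoupling can be sketched as a rollout loop in which each output frame is fed back as the next input. Everything below is illustrative: the paper does not expand the quadruple [D, T, R, V], so the `Control` field names (direction, translation, rotation, velocity) are one plausible reading, and `toy_model` is a stand-in, not a real world model.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Control:
    # Hypothetical reading of the control quadruple C_t = [D, T, R, V];
    # the paper does not expand the symbols, so these names are assumptions.
    direction: int      # D: discrete action ID (see the action tables)
    translation: tuple  # T: per-step translation
    rotation: tuple     # R: per-step rotation
    velocity: float     # V: motion speed

def rollout(world_model, frame, controls):
    # Iterate V_{t+1} = W(I_t, C_t), feeding each output back as input.
    frames = []
    for c in controls:
        frame = world_model(frame, c)
        frames.append(frame)
    return frames

# Toy stand-in "world model": brightens the frame by the commanded velocity.
toy_model = lambda img, c: np.clip(img + c.velocity, 0, 255)
ctrl = Control(direction=1, translation=(0, 0, 0), rotation=(0, 0, 0), velocity=1.0)
out = rollout(toy_model, np.zeros((2, 2)), [ctrl] * 3)
```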
| Difficulty | ID | Direction | Keys |
|---|---|---|---|
| 1 | 0 | Stationary | - |
| 1 | 1 | Forward | W |
| 1 | 2 | Backward | S |
| 1 | 3 | Left | A |
| 1 | 4 | Right | D |
| 2 | 5 | Forward+Left | W+A |
| 2 | 6 | Forward+Right | W+D |
| 2 | 7 | Backward+Left | S+A |
| 2 | 8 | Backward+Right | S+D |
| 1 | 9 | Upward | - |
| 1 | 10 | Downward | - |
| 2 | 11 | Forward+Upward | - |
| 2 | 12 | Forward+Downward | - |
| 2 | 13 | Backward+Upward | - |
| 2 | 14 | Backward+Downward | - |
| 2 | 15 | Left+Upward | - |
| 2 | 16 | Left+Downward | - |
| 2 | 17 | Right+Upward | - |
| 2 | 18 | Right+Downward | - |
| 3 | 19 | Forward+Left+Upward | - |
| 3 | 20 | Forward+Right+Upward | - |
| 3 | 21 | Forward+Left+Downward | - |
| 3 | 22 | Forward+Right+Downward | - |
| 3 | 23 | Backward+Left+Upward | - |
| 3 | 24 | Backward+Right+Upward | - |
| 3 | 25 | Backward+Left+Downward | - |
| 3 | 26 | Backward+Right+Downward | - |
| Difficulty | ID | Direction | Keys |
|---|---|---|---|
| 1 | 0 | Stationary | - |
| 1 | 1 | Camera Up | ↑ |
| 1 | 2 | Camera Down | ↓ |
| 1 | 3 | Camera Right | → |
| 1 | 4 | Camera Left | ← |
| 2 | 5 | Camera Up+Right | ↑+→ |
| 2 | 6 | Camera Up+Left | ↑+← |
| 2 | 7 | Camera Down+Right | ↓+→ |
| 2 | 8 | Camera Down+Left | ↓+← |
| 1 | 9 | Clockwise | - |
| 1 | 10 | Counterclockwise | - |
| 2 | 11 | Camera Up+Clockwise | - |
| 2 | 12 | Camera Up+Counterclockwise | - |
| 2 | 13 | Camera Down+Clockwise | - |
| 2 | 14 | Camera Down+Counterclockwise | - |
| 2 | 15 | Camera Left+Clockwise | - |
| 2 | 16 | Camera Left+Counterclockwise | - |
| 2 | 17 | Camera Right+Clockwise | - |
| 2 | 18 | Camera Right+Counterclockwise | - |
| 3 | 19 | Camera Up+Right+Clockwise | - |
| 3 | 20 | Camera Up+Right+Counterclockwise | - |
| 3 | 21 | Camera Up+Left+Clockwise | - |
| 3 | 22 | Camera Up+Left+Counterclockwise | - |
| 3 | 23 | Camera Down+Right+Clockwise | - |
| 3 | 24 | Camera Down+Right+Counterclockwise | - |
| 3 | 25 | Camera Down+Left+Clockwise | - |
| 3 | 26 | Camera Down+Left+Counterclockwise | - |
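Both tables follow one convention: difficulty equals the number of simultaneously active degrees of freedom, with Stationary assigned difficulty 1. A minimal sketch of that convention, with only the difficulty-1 movement rows reproduced for illustration:

```python
# Sketch of the shared table convention: difficulty = number of simultaneous
# degrees of freedom, with Stationary assigned difficulty 1. The lookup below
# reproduces only the difficulty-1 movement rows for illustration.
MOVEMENT_ACTIONS = {
    0: "Stationary", 1: "Forward", 2: "Backward", 3: "Left", 4: "Right",
    9: "Upward", 10: "Downward",
}

def difficulty(direction: str) -> int:
    # "Forward+Left+Upward" has two '+' signs, hence three active axes.
    return 1 if direction == "Stationary" else direction.count("+") + 1
```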
Based on the Action Generation Framework, we designed six types of tasks to comprehensively evaluate the interaction capabilities of world models:
| Task Type | Description | Difficulty | # Tasks |
|---|---|---|---|
| Action Control Difficulty 1 | Basic tasks including stationary and 9 basic actions | D = 1 | 1,000 |
| Action Control Difficulty 2 | Two-degree-of-freedom tasks covering 24 actions | D = 2 | 1,000 |
| Action Control Difficulty 3 | Three-degree-of-freedom tasks covering 32 actions | D = 3 | 1,000 |
| Action Control Difficulty 4 | Four-degree-of-freedom complex tasks covering 16 actions | D = 4 | 1,000 |
| Memory Ability | Cyclic paths requiring the model to revisit the same location | - | 200 |
| Camera Following | Trajectory following using camera parameter files | - | 700 |
We evaluate low-level visual distortions by calculating the normalized average MUSIQ score across all frames to reflect fundamental rendering fidelity.
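A minimal sketch of this metric, assuming per-frame MUSIQ scores have already been computed by an external IQA model. MUSIQ scores nominally fall in [0, 100], which is used here as the normalization range; the paper does not state its exact normalization constants, so that choice is an assumption.

```python
import numpy as np

def image_quality_score(musiq_scores, max_score=100.0):
    """Normalized average MUSIQ over all frames.

    Assumes per-frame scores in [0, max_score]; clips before averaging
    so the result always lands in [0, 1].
    """
    s = np.asarray(musiq_scores, dtype=float)
    return float(np.clip(s / max_score, 0.0, 1.0).mean())
```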
We utilize motion priors from frame interpolation models to evaluate the reconstruction quality of sampled frames via LPIPS, SSIM, and MSE, effectively identifying unnatural jitters or physical inconsistencies.
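The idea can be illustrated with a crude linear proxy: predict each interior frame from its neighbors and score how well the actual frame matches. The benchmark uses a learned frame interpolation model scored with LPIPS, SSIM, and MSE; the linear interpolation and MSE-only score below are simplifications for illustration.

```python
import numpy as np

def motion_smoothness(frames):
    """Crude stand-in for the interpolation-based smoothness check.

    Predicts each interior frame as the average of its two neighbors and
    returns 1 - MSE (frames assumed uint8-range, normalized to [0, 1]).
    A learned interpolation model plus LPIPS/SSIM would replace this proxy.
    """
    f = np.asarray(frames, dtype=float) / 255.0
    pred = (f[:-2] + f[2:]) / 2.0          # linear interpolation of neighbors
    mse = ((pred - f[1:-1]) ** 2).mean()
    return float(1.0 - mse)
```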
To quantify logical loop-closure in reciprocal tasks, we evaluate the pixel-wise consistency of symmetric frame pairs relative to the temporal midpoint. This metric effectively captures memory decay or structural logic failure during extended temporal reasoning.
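One way to realize this, sketched with a pixel-MSE proxy (the exact consistency measure is not specified here): mirror the sequence about its temporal midpoint and compare each frame with its mirrored counterpart, so an out-and-back path that truly closes the loop scores 1.

```python
import numpy as np

def memory_symmetry(frames):
    """Pixel-wise consistency of frame pairs mirrored about the temporal
    midpoint, as a proxy for loop-closure in reciprocal (out-and-back) paths.

    Returns 1.0 for a perfectly symmetric sequence, lower as mirrored
    frames diverge (frames assumed uint8-range, normalized to [0, 1]).
    """
    f = np.asarray(frames, dtype=float) / 255.0
    mirrored = f[::-1]                      # pair frame t with frame T-1-t
    mse = ((f - mirrored) ** 2).mean()
    return float(1.0 - mse)
```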
We established a comprehensive data processing protocol to clean and standardize 12 high-quality datasets, unifying their original coordinate systems and intrinsic/extrinsic parameter formats. We also designed an automated collection and filtering pipeline that gathers 100k 1080p video clips from 18 high-quality environments across 4 simulators. Finally, we used vision-language models to annotate all clips uniformly, yielding a high-quality dataset of 330k video clips.
| Dataset | Year | Domain | Pose Representation | Viewpoint | Clip Count | Selected [N1, N2] |
|---|---|---|---|---|---|---|
| KITTI | 2012 | Autonomous driving | External parameter matrix | UGV | 281 | [20, 5] |
| NuScenes | 2019 | Autonomous driving | Seven-element | UGV | 1,000+ | [15, 5] |
| Waymo | 2019 | Autonomous driving | External parameter matrix | UGV | 453 | [15, 5] |
| TUM-RGB-D | 2011 | 3D reconstruction | Seven-element | Human, UGV | 405 | [15, 5] |
| 7-Scenes | 2013 | 3D reconstruction | External parameter matrix | Human | 516 | [30, 10] |
| RealEstate-10K | 2018 | 3D reconstruction | External parameter matrix | Human, UGV | 20,000+ | [200, 147] |
| Princeton365 | 2019 | 3D reconstruction | External parameter matrix | Human | 365 | [29, 2] |
| DL3DV-10K | 2023 | 3D reconstruction | External parameter matrix | Human | 10,000+ | [60, 10] |
| NCLT Dataset | 2016 | Robotics inspection | 6-DoF | UGV | 10,000+ | [60, 21] |
| TartanGround | 2024 | Robotics inspection | Seven-element | UGV | 3,000+ | [56, 10] |
| TartanAir-V2 | 2024 | Drone inspection | Seven-element | UAV | 2,000+ | [100, 40] |
| SpatialVid | 2025 | World model | 6-DoF | Human, UGV, UAV | 180,000 | [100, 40] |
| Total | - | - | - | - | 230,000+ | [700, 300] |
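Unifying pose formats, e.g. converting the seven-element (translation + quaternion) representation listed above into the 4x4 external parameter matrix used by other datasets, can be sketched as follows. The (w, x, y, z) quaternion ordering is an assumption: real datasets disagree on ordering and on camera-to-world vs. world-to-camera conventions, which is exactly why the protocol standardizes them.

```python
import numpy as np

def pose7_to_matrix(t, q):
    """Convert a seven-element pose (translation t, unit quaternion q) into a
    4x4 homogeneous transform.

    Quaternion order is assumed (w, x, y, z); check each dataset's convention
    before use, as orderings and frame directions vary.
    """
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    M = np.eye(4)
    M[:3, :3] = R
    M[:3, 3] = t
    return M
```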
We selected 14 representative interactive world models for evaluation, including five text-conditioned camera control models, two one-hot-conditioned camera control models, and seven models with camera control via explicit intrinsics and extrinsics.
Metrics span three groups: Generation Quality, Trajectory Following, and Memory Ability.

| Method | Rank | Avg. ↑ | Image Quality ↑ | Brightness Consistency ↑ | Color Temperature ↑ | Sharpness Retention ↑ | Motion Smoothness ↑ | Trajectory Accuracy ↑ | Memory Symmetry ↑ | Trajectory Alignment ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| **Camera Control via Text** | | | | | | | | | | |
| NVIDIA Cosmos | 7 | 0.6275 | 0.6778 | 0.6952 | 0.7170 | 0.4363 | 0.9907 | 0.4955 | 0.3738 | 0.6419 |
| HunyuanVideo-1.5 | 3 | 0.7188 | 0.7128 | 0.7027 | 0.7477 | 0.5545 | 0.9908 | 0.6844 | 0.6336 | 0.6449 |
| WAN 2.2 | 12 | 0.5731 | 0.5545 | 0.3886 | 0.3411 | 0.3428 | 0.9557 | 0.6514 | 0.4480 | 0.5703 |
| CogVideoX-I2V | 5 | 0.6963 | 0.6521 | 0.8988 | 0.8129 | 0.7951 | 0.9938 | 0.5950 | 0.6010 | 0.4084 |
| YUME 1.5 | 8 | 0.6209 | 0.6232 | 0.3810 | 0.4165 | 0.4023 | 0.9765 | 0.7113 | 0.5276 | 0.5988 |
| **Camera Control via One-hot Encoding** | | | | | | | | | | |
| Matrix-game 2.0 | 13 | 0.5663 | 0.4851 | 0.2963 | 0.2937 | 0.4149 | 0.9848 | 0.7008 | 0.3311 | 0.6362 |
| HY-World 1.5 | 1 | 0.7873 | 0.6675 | 0.8051 | 0.7819 | 0.6634 | 0.9921 | 0.7472 | 0.8481 | 0.6776 |
| **Camera Control via Intrinsics and Extrinsics** | | | | | | | | | | |
| CameraCtrl | 11 | 0.5762 | 0.4473 | 0.3717 | 0.2511 | 0.4545 | 0.9796 | 0.6778 | 0.4279 | 0.6097 |
| MotionCtrl | 14 | 0.5486 | 0.4562 | 0.3980 | 0.2012 | 0.4294 | 0.9735 | 0.6730 | 0.3098 | 0.5932 |
| CamI2V | 10 | 0.5765 | 0.5284 | 0.4343 | 0.3568 | 0.4297 | 0.9861 | 0.6314 | 0.3631 | 0.6038 |
| RealCam-I2V | 6 | 0.6865 | 0.6227 | 0.4130 | 0.5547 | 0.6269 | 0.9860 | 0.5630 | 0.7948 | 0.6668 |
| videox-fun-Wan | 2 | 0.7474 | 0.6410 | 0.5972 | 0.5473 | 0.5998 | 0.9858 | 0.7172 | 0.9009 | 0.6876 |
| AC3D | 4 | 0.7149 | 0.4573 | 0.7307 | 0.6524 | 0.5332 | 0.9919 | 0.5785 | 0.9068 | 0.6250 |
| ASTRA | 9 | 0.5980 | 0.5335 | 0.5091 | 0.4338 | 0.5488 | 0.9799 | 0.6115 | 0.4323 | 0.5518 |
| Method | Image Quality ↑ | Brightness Consistency ↑ | Color Temperature ↑ | Sharpness Retention ↑ | Motion Smoothness ↑ | Trajectory Tolerance ↑ |
|---|---|---|---|---|---|---|
| **Camera Control via Intrinsics and Extrinsics** | | | | | | |
| CameraCtrl | 0.3980 | 0.3497 | 0.2008 | 0.4211 | 0.9659 | 0.7099 |
| MotionCtrl | 0.4270 | 0.3924 | 0.1810 | 0.4256 | 0.9622 | 0.7120 |
| CamI2V | 0.4046 | 0.3674 | 0.2605 | 0.3830 | 0.9766 | 0.7143 |
| RealCam-I2V | 0.5889 | 0.4777 | 0.5521 | 0.6838 | 0.9783 | 0.7480 |
| videox-fun-Wan | 0.5701 | 0.5584 | 0.3659 | 0.4925 | 0.9604 | 0.7381 |
| AC3D | 0.5208 | 0.8927 | 0.7404 | 0.6472 | 0.9919 | 0.9091 |
| ASTRA | 0.4743 | 0.3972 | 0.2819 | 0.4171 | 0.9615 | 0.4286 |
If you find this work useful, please consider citing:
```bibtex
@inproceedings{iworldbench2026,
  title        = {iWorld-Bench: A Benchmark for Interactive World Models with a Unified Action Generation Framework},
  author       = {Anonymous Authors},
  year         = {2026},
  organization = {Anonymous Organization}
}
```