ICML 2026

iWorld-Bench

330K video clips
4.9K test tasks for evaluation
9 comprehensive metrics

Overview

Achieving Artificial General Intelligence (AGI) requires agents that learn and interact adaptively, with interactive world models providing scalable environments for perception, reasoning, and action. Yet current research still lacks large-scale datasets and unified benchmarks to evaluate their physical interaction capabilities. To address this, we propose iWorld-Bench, a comprehensive benchmark for training and testing world models on interaction-related abilities such as distance perception and memory.

iWorld-Bench Overview

We construct a diverse dataset with 330k video clips and select 2.1k high-quality samples covering varied perspectives, weather, and scenes. As existing world models differ in interaction modalities, we introduce an Action Generation Framework to unify evaluation and design six task types, generating 4.9k test samples. These tasks jointly assess model performance across visual generation, trajectory following, and memory. Evaluating 14 representative world models, we identify key limitations and provide insights for future research.

Comparison

| Benchmark | Field | # Examples |
|---|---|---|
| EWMBench | Manipulation Policies | 2,100 |
| Movebench | Motion-Controllable Video | 1,018 |
| WorldEval | Manipulation Policies | 1,400 |
| VMbench | Motion-Controllable Video | 1,050 |
| WorldModelBench | General World Model | 67,000 |
| WorldBench | Motion-Controllable Video | 425 |
| WorldScore | General World Model | 3,000 |
| iWorld-Bench (Ours) | Interactive World Model | 4,900 |

Our benchmark focuses on interactive world modeling with comprehensive capabilities. Unlike existing benchmarks that are mostly designed for general-purpose world models or embodied world models, iWorld-Bench specifically evaluates interactive world models' responsiveness to external action sequences. It supports multiple input modalities, action control, camera control, memory ability evaluation, multi-scene coverage, multi-perspective observations, and all-weather adaptability.

Framework

We introduce a unified, comprehensive framework that accepts inputs in whichever modality a given world model expects, enabling consistent action-task design and guiding generation across heterogeneous models.

Action Space Definition

The generation process of a world model can be decoupled as V_{t+1} = W(I_t, C_t), where V_{t+1} is the model's output, W denotes the specific world model, I_t is the current scene frame, and C_t = [D, T, R, V] is the action quadruple applied to frame I_t.
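This decoupling can be sketched as a minimal interface. The field meanings assumed for C_t (difficulty, translation, rotation, velocity) are our reading of the action tables that follow, and the `Control` and `step` names are hypothetical, not an official API:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass(frozen=True)
class Control:
    """Hypothetical encoding of the control quadruple C_t = [D, T, R, V]."""
    difficulty: int              # D: number of simultaneous degrees of freedom
    translation: Sequence[str]   # T: e.g. ("Forward", "Left")
    rotation: Sequence[str]      # R: e.g. ("Camera Up",)
    velocity: float              # V: assumed motion magnitude per step

def step(world_model: Callable, frame, control: Control):
    """One generation step: V_{t+1} = W(I_t, C_t)."""
    return world_model(frame, control)
```

Any world model that can be wrapped as a callable of (frame, control) fits this interface, which is what lets a single benchmark drive models with different native conditioning.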

Translation Motion

| Difficulty | ID | Direction | Keys |
|---|---|---|---|
| 1 | 0 | Stationary | - |
| 1 | 1 | Forward | W |
| 1 | 2 | Backward | S |
| 1 | 3 | Left | A |
| 1 | 4 | Right | D |
| 2 | 5 | Forward+Left | W+A |
| 2 | 6 | Forward+Right | W+D |
| 2 | 7 | Backward+Left | S+A |
| 2 | 8 | Backward+Right | S+D |
| 1 | 9 | Upward | - |
| 1 | 10 | Downward | - |
| 2 | 11 | Forward+Upward | - |
| 2 | 12 | Forward+Downward | - |
| 2 | 13 | Backward+Upward | - |
| 2 | 14 | Backward+Downward | - |
| 2 | 15 | Left+Upward | - |
| 2 | 16 | Left+Downward | - |
| 2 | 17 | Right+Upward | - |
| 2 | 18 | Right+Downward | - |
| 3 | 19 | Forward+Left+Upward | - |
| 3 | 20 | Forward+Right+Upward | - |
| 3 | 21 | Forward+Left+Downward | - |
| 3 | 22 | Forward+Right+Downward | - |
| 3 | 23 | Backward+Left+Upward | - |
| 3 | 24 | Backward+Right+Upward | - |
| 3 | 25 | Backward+Left+Downward | - |
| 3 | 26 | Backward+Right+Downward | - |
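The translation action space can be regenerated programmatically, which makes its counts easy to check: one optional choice per axis gives 3^3 = 27 actions, and difficulty D is simply the number of active axes. A sketch following the table's naming, not official tooling:

```python
from itertools import product

# One optional direction per axis; None means the axis is inactive.
AXES = [("Forward", "Backward"), ("Left", "Right"), ("Upward", "Downward")]

actions = []
for combo in product(*[(None,) + axis for axis in AXES]):
    parts = [c for c in combo if c is not None]
    name = "+".join(parts) if parts else "Stationary"
    actions.append((name, len(parts)))  # (direction, difficulty D)

# Sanity checks against the table: 27 actions total, of which
# 6 have D=1, 12 have D=2, 8 have D=3, plus one stationary action.
assert len(actions) == 27
assert [sum(1 for _, d in actions if d == k) for k in (1, 2, 3)] == [6, 12, 8]
```

The same construction applies to the rotation table by swapping in the pitch/yaw/roll axis options.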

Rotation Motion

| Difficulty | ID | Direction | Keys |
|---|---|---|---|
| 1 | 0 | Stationary | - |
| 1 | 1 | Camera Up | ↑ |
| 1 | 2 | Camera Down | ↓ |
| 1 | 3 | Camera Right | → |
| 1 | 4 | Camera Left | ← |
| 2 | 5 | Camera Up+Right | ↑+→ |
| 2 | 6 | Camera Up+Left | ↑+← |
| 2 | 7 | Camera Down+Right | ↓+→ |
| 2 | 8 | Camera Down+Left | ↓+← |
| 1 | 9 | Clockwise | - |
| 1 | 10 | Counterclockwise | - |
| 2 | 11 | Camera Up+Clockwise | - |
| 2 | 12 | Camera Up+Counterclockwise | - |
| 2 | 13 | Camera Down+Clockwise | - |
| 2 | 14 | Camera Down+Counterclockwise | - |
| 2 | 15 | Camera Left+Clockwise | - |
| 2 | 16 | Camera Left+Counterclockwise | - |
| 2 | 17 | Camera Right+Clockwise | - |
| 2 | 18 | Camera Right+Counterclockwise | - |
| 3 | 19 | Camera Up+Right+Clockwise | - |
| 3 | 20 | Camera Up+Right+Counterclockwise | - |
| 3 | 21 | Camera Up+Left+Clockwise | - |
| 3 | 22 | Camera Up+Left+Counterclockwise | - |
| 3 | 23 | Camera Down+Right+Clockwise | - |
| 3 | 24 | Camera Down+Right+Counterclockwise | - |
| 3 | 25 | Camera Down+Left+Clockwise | - |
| 3 | 26 | Camera Down+Left+Counterclockwise | - |

Task Design

Based on the Action Generation Framework, we designed six types of tasks to comprehensively evaluate the interaction capabilities of world models:

| Task Type | Description | Difficulty | # Tasks |
|---|---|---|---|
| Action Control Difficulty 1 | Basic tasks including stationary and 9 basic actions | D = 1 | 1,000 |
| Action Control Difficulty 2 | Two-degree-of-freedom tasks covering 24 actions | D = 2 | 1,000 |
| Action Control Difficulty 3 | Three-degree-of-freedom tasks covering 32 actions | D = 3 | 1,000 |
| Action Control Difficulty 4 | Four-degree-of-freedom complex tasks covering 16 actions | D = 4 | 1,000 |
| Memory Ability | Cyclic paths requiring the model to revisit the same location | - | 200 |
| Camera Following | Trajectory following using camera parameter files | - | 700 |

Evaluation Metrics (Coming Soon)

Generation Quality

We evaluate low-level visual distortions by calculating the normalized average MUSIQ score across all frames to reflect fundamental rendering fidelity.
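As a sketch, the per-frame scores could be aggregated as below. We assume the MUSIQ scores come from a pretrained scorer (for example the `pyiqa` implementation) and that normalization maps the usual 0-100 range to [0, 1]; neither detail is stated above:

```python
import numpy as np

def generation_quality(per_frame_musiq, score_range=(0.0, 100.0)):
    # Normalize raw per-frame MUSIQ scores to [0, 1], then average over
    # all frames of the generated clip. The 0-100 normalization range is
    # an assumption, not a value given by the benchmark.
    lo, hi = score_range
    scores = (np.asarray(per_frame_musiq, dtype=np.float64) - lo) / (hi - lo)
    return float(np.clip(scores, 0.0, 1.0).mean())
```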

Trajectory Following

We utilize motion priors from frame interpolation models to evaluate the reconstruction quality of sampled frames via LPIPS, SSIM, and MSE, effectively identifying unnatural jitters or physical inconsistencies.
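A stripped-down stand-in for this check: reconstruct held-out frames from their neighbors and score the residual. The benchmark uses a learned frame-interpolation model with LPIPS, SSIM, and MSE; the sketch below substitutes naive neighbor averaging and MSE only:

```python
import numpy as np

def interpolation_error(frames):
    # Predict every other interior frame as the average of its two
    # neighbors and measure MSE against the real frame. Large errors
    # flag unnatural jitter; a learned interpolator would replace the
    # naive average in the real pipeline.
    errors = []
    for i in range(1, len(frames) - 1, 2):
        pred = (frames[i - 1].astype(np.float64) + frames[i + 1].astype(np.float64)) / 2.0
        errors.append(np.mean((pred - frames[i].astype(np.float64)) ** 2))
    return float(np.mean(errors))
```

A clip with perfectly linear motion scores zero under this proxy, while abrupt jitter between consecutive frames inflates the residual.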

Memory Ability

To quantify logical loop-closure in reciprocal tasks, we evaluate the pixel-wise consistency of symmetric frame pairs relative to the temporal midpoint. This metric effectively captures memory decay or structural logic failure during extended temporal reasoning.
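Concretely, for a clip of T frames the symmetric pairs are (i, T-1-i). A hypothetical pixel-wise version of the score, with mean absolute difference as an assumed choice of distance (the exact distance is not specified above):

```python
import numpy as np

def memory_symmetry(frames):
    # Compare each frame with its mirror around the temporal midpoint;
    # 1.0 means a perfect loop closure, lower values indicate memory
    # decay or structural drift on the return path.
    T = len(frames)
    scores = []
    for i in range(T // 2):
        a = np.asarray(frames[i], dtype=np.float64) / 255.0
        b = np.asarray(frames[T - 1 - i], dtype=np.float64) / 255.0
        scores.append(1.0 - float(np.mean(np.abs(a - b))))
    return float(np.mean(scores))
```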


Dataset

We established a comprehensive data processing protocol to clean and standardize 12 high-quality datasets, unifying the original coordinate systems and intrinsic/extrinsic parameter formats. Additionally, we designed an automated collection and filtering pipeline to gather 100k 1080P video clips from 18 high-quality environments across 4 simulators. Combined with vision-language models, all video clips were uniformly annotated, resulting in a high-quality dataset containing 330k video clips.
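One step of this unification is converting the "seven-element" poses (translation plus unit quaternion) listed in the table below into 4x4 extrinsic matrices. A sketch assuming [tx, ty, tz, qx, qy, qz, qw] ordering; actual datasets differ in element order and coordinate convention:

```python
import numpy as np

def seven_element_to_extrinsic(pose):
    # Convert a seven-element pose [tx, ty, tz, qx, qy, qz, qw] into a
    # 4x4 extrinsic matrix via the standard quaternion-to-rotation
    # formula. The scalar-last element ordering is an assumption.
    tx, ty, tz, qx, qy, qz, qw = pose
    n = np.sqrt(qx*qx + qy*qy + qz*qz + qw*qw)
    qx, qy, qz, qw = qx/n, qy/n, qz/n, qw/n  # enforce a unit quaternion
    R = np.array([
        [1 - 2*(qy*qy + qz*qz), 2*(qx*qy - qz*qw),     2*(qx*qz + qy*qw)],
        [2*(qx*qy + qz*qw),     1 - 2*(qx*qx + qz*qz), 2*(qy*qz - qx*qw)],
        [2*(qx*qz - qy*qw),     2*(qy*qz + qx*qw),     1 - 2*(qx*qx + qy*qy)],
    ])
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = [tx, ty, tz]
    return T
```

Once every dataset's poses are in this matrix form, camera trajectories can be compared and resampled in a single coordinate convention.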

Dataset Overview
| Dataset | Year | Domain | Pose Representation | Viewpoint | Clip Count | Selected [N1, N2] |
|---|---|---|---|---|---|---|
| KITTI | 2012 | Autonomous driving | External parameter matrix | UGV | 281 | [20, 5] |
| NuScenes | 2019 | Autonomous driving | Seven-element | UGV | 1,000+ | [15, 5] |
| Waymo | 2019 | Autonomous driving | External parameter matrix | UGV | 453 | [15, 5] |
| TUM-RGB-D | 2011 | 3D reconstruction | Seven-element | Human, UGV | 405 | [15, 5] |
| 7-Scenes | 2013 | 3D reconstruction | External parameter matrix | Human | 516 | [30, 10] |
| RealEstate-10K | 2018 | 3D reconstruction | External parameter matrix | Human, UGV | 20,000+ | [200, 147] |
| Princeton365 | 2019 | 3D reconstruction | External parameter matrix | Human | 365 | [29, 2] |
| DL3DV-10K | 2023 | 3D reconstruction | External parameter matrix | Human | 10,000+ | [60, 10] |
| NCLT Dataset | 2016 | Robotics inspection | 6-DoF | UGV | 10,000+ | [60, 21] |
| TartanGround | 2024 | Robotics inspection | Seven-element | UGV | 3,000+ | [56, 10] |
| TartanAir-V2 | 2024 | Drone inspection | Seven-element | UAV | 2,000+ | [100, 40] |
| SpatialVid | 2025 | World model | 6-DoF | Human, UGV, UAV | 180,000 | [100, 40] |
| Total | - | - | - | - | 230,000+ | [700, 200] |

Results

We selected 14 representative interactive world models for evaluation, including five text-conditioned camera control models, two one-hot-conditioned camera control models, and seven models with camera control via explicit intrinsics and extrinsics.

Action Control and Memory Ability

Generation Quality covers Image Quality through Sharpness Retention; Trajectory Following covers Motion Smoothness and Trajectory Accuracy; Memory Ability covers Memory Symmetry and Trajectory Alignment.

| Method | Rank | Avg. ↑ | Image Quality ↑ | Brightness Consistency ↑ | Color Temperature ↑ | Sharpness Retention ↑ | Motion Smoothness ↑ | Trajectory Accuracy ↑ | Memory Symmetry ↑ | Trajectory Alignment ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Camera Control via Text | | | | | | | | | | |
| NVIDIA Cosmos | 7 | 0.6275 | 0.6778 | 0.6952 | 0.7170 | 0.4363 | 0.9907 | 0.4955 | 0.3738 | 0.6419 |
| HunyuanVideo-1.5 | 3 | 0.7188 | 0.7128 | 0.7027 | 0.7477 | 0.5545 | 0.9908 | 0.6844 | 0.6336 | 0.6449 |
| WAN 2.2 | 12 | 0.5731 | 0.5545 | 0.3886 | 0.3411 | 0.3428 | 0.9557 | 0.6514 | 0.4480 | 0.5703 |
| CogVideoX-I2V | 5 | 0.6963 | 0.6521 | 0.8988 | 0.8129 | 0.7951 | 0.9938 | 0.5950 | 0.6010 | 0.4084 |
| YUME 1.5 | 8 | 0.6209 | 0.6232 | 0.3810 | 0.4165 | 0.4023 | 0.9765 | 0.7113 | 0.5276 | 0.5988 |
| Camera Control via One-hot Encoding | | | | | | | | | | |
| Matrix-game 2.0 | 13 | 0.5663 | 0.4851 | 0.2963 | 0.2937 | 0.4149 | 0.9848 | 0.7008 | 0.3311 | 0.6362 |
| HY-World 1.5 | 1 | 0.7873 | 0.6675 | 0.8051 | 0.7819 | 0.6634 | 0.9921 | 0.7472 | 0.8481 | 0.6776 |
| Camera Control via Intrinsics and Extrinsics | | | | | | | | | | |
| CameraCtrl | 11 | 0.5762 | 0.4473 | 0.3717 | 0.2511 | 0.4545 | 0.9796 | 0.6778 | 0.4279 | 0.6097 |
| MotionCtrl | 14 | 0.5486 | 0.4562 | 0.3980 | 0.2012 | 0.4294 | 0.9735 | 0.6730 | 0.3098 | 0.5932 |
| CamI2V | 10 | 0.5765 | 0.5284 | 0.4343 | 0.3568 | 0.4297 | 0.9861 | 0.6314 | 0.3631 | 0.6038 |
| RealCam-I2V | 6 | 0.6865 | 0.6227 | 0.4130 | 0.5547 | 0.6269 | 0.9860 | 0.5630 | 0.7948 | 0.6668 |
| videox-fun-Wan | 2 | 0.7474 | 0.6410 | 0.5972 | 0.5473 | 0.5998 | 0.9858 | 0.7172 | 0.9009 | 0.6876 |
| AC3D | 4 | 0.7149 | 0.4573 | 0.7307 | 0.6524 | 0.5332 | 0.9919 | 0.5785 | 0.9068 | 0.6250 |
| ASTRA | 9 | 0.5980 | 0.5335 | 0.5091 | 0.4338 | 0.5488 | 0.9799 | 0.6115 | 0.4323 | 0.5518 |

Camera Following Results

Generation Quality covers Image Quality through Sharpness Retention; Trajectory Following covers Motion Smoothness and Trajectory Tolerance.

| Method | Image Quality ↑ | Brightness Consistency ↑ | Color Temperature ↑ | Sharpness Retention ↑ | Motion Smoothness ↑ | Trajectory Tolerance ↑ |
|---|---|---|---|---|---|---|
| Camera Control via Intrinsics and Extrinsics | | | | | | |
| CameraCtrl | 0.3980 | 0.3497 | 0.2008 | 0.4211 | 0.9659 | 0.7099 |
| MotionCtrl | 0.4270 | 0.3924 | 0.1810 | 0.4256 | 0.9622 | 0.7120 |
| CamI2V | 0.4046 | 0.3674 | 0.2605 | 0.3830 | 0.9766 | 0.7143 |
| RealCam-I2V | 0.5889 | 0.4777 | 0.5521 | 0.6838 | 0.9783 | 0.7480 |
| videox-fun-Wan | 0.5701 | 0.5584 | 0.3659 | 0.4925 | 0.9604 | 0.7381 |
| AC3D | 0.5208 | 0.8927 | 0.7404 | 0.6472 | 0.9919 | 0.9091 |
| ASTRA | 0.4743 | 0.3972 | 0.2819 | 0.4171 | 0.9615 | 0.4286 |
Performance Radar Chart

BibTeX

If you find this work useful, please consider citing:

@inproceedings{iworldbench2026,
    title={iWorld-Bench: A Benchmark for Interactive World Models with a Unified Action Generation Framework},
    author={Anonymous Authors},
    year={2026},
    organization={Anonymous Organization}
}