StreamForest

NeurIPS 2025

Spotlight

StreamForest: Efficient Online Video Understanding with Persistent Event Memory

Xiangyu Zeng1, Kefan Qiu1, Qingyu Zhang1, Xinhao Li1, Jing Wang1, Jiaxin Li1, Ziang Yan3,2, Kun Tian4, Meng Tian5, Xinhai Zhao4, Yi Wang2, Limin Wang1

1Nanjing University  2Shanghai AI Laboratory 3Zhejiang University 4Noah’s Ark Lab, Huawei 5Yinwang Intelligent Tech.
Co-author

Abstract

Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in video understanding. However, their effectiveness in real-time streaming scenarios remains limited by the cost of storing historical visual features and by insufficient real-time spatiotemporal reasoning. To address these challenges, we propose StreamForest, a novel architecture specifically designed for streaming video understanding. Central to StreamForest is the Persistent Event Memory Forest, a memory mechanism that adaptively organizes video frames into multiple event-level tree structures. This process is guided by penalty functions based on temporal distance, content similarity, and merge frequency, enabling efficient long-term memory retention under limited computational resources. To enhance real-time perception, we introduce a Fine-grained Spatiotemporal Window, which captures detailed short-term visual cues to improve perception of the current scene. Additionally, we present OnlineIT, an instruction-tuning dataset tailored for streaming video tasks. OnlineIT significantly boosts MLLM performance in both real-time perception and future prediction. To evaluate generalization in practical applications, we introduce ODV-Bench, a new benchmark focused on real-time streaming video understanding in autonomous driving scenarios. Experimental results demonstrate that StreamForest achieves state-of-the-art performance, with accuracies of 77.3% on StreamingBench, 60.5% on OVBench, and 55.6% on OVO-Bench. In particular, even under extreme visual token compression (limited to 1024 tokens), the model retains 96.8% of its average accuracy across eight benchmarks relative to the default setting. These results underscore the robustness, efficiency, and generalizability of StreamForest for streaming video understanding.
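The penalty-guided merging described in the abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration, not the released implementation: the names `EventNode`, `penalty`, `merge`, and `compress`, the weights, the additive form of the penalty, the reading of "temporal distance" as distance from the current time, and feature averaging as the merge rule. The sketch only shows the general idea of repeatedly merging the adjacent pair of event nodes with the lowest penalty until the memory fits a budget.

```python
import math
from dataclasses import dataclass, field


@dataclass
class EventNode:
    """One tree root in the memory: a pooled feature plus bookkeeping."""
    time: float                 # representative timestamp in seconds
    feature: list               # pooled visual feature (toy, low-dimensional)
    merges: int = 0             # number of merge levels below this node
    children: list = field(default_factory=list)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def penalty(a, b, now, w_time=1.0, w_sim=1.0, w_merge=0.1):
    """Hypothetical merge penalty; a lower value means 'merge this pair first'."""
    recency = 1.0 / (1.0 + now - max(a.time, b.time))   # recent events cost more
    dissimilarity = 1.0 - cosine(a.feature, b.feature)  # unlike content costs more
    frequency = a.merges + b.merges                     # already-merged nodes cost more
    return w_time * recency + w_sim * dissimilarity + w_merge * frequency


def merge(a, b):
    """Combine two adjacent nodes into a parent (assumed rule: average features)."""
    feature = [(x + y) / 2 for x, y in zip(a.feature, b.feature)]
    return EventNode(time=(a.time + b.time) / 2, feature=feature,
                     merges=max(a.merges, b.merges) + 1, children=[a, b])


def compress(roots, budget, now):
    """Merge the lowest-penalty adjacent pair until at most `budget` roots remain."""
    roots = list(roots)
    while len(roots) > max(budget, 1):
        i = min(range(len(roots) - 1),
                key=lambda k: penalty(roots[k], roots[k + 1], now))
        roots[i:i + 2] = [merge(roots[i], roots[i + 1])]
    return roots


# Four frames: two from one scene, two from another.
frames = [EventNode(0.0, [1.0, 0.0]), EventNode(1.0, [1.0, 0.0]),
          EventNode(2.0, [0.0, 1.0]), EventNode(3.0, [0.0, 1.0])]
forest = compress(frames, budget=2, now=4.0)
```

With these toy inputs the two similar, older frames are merged first, and the scene boundary between them and the later frames is preserved because crossing it carries the highest dissimilarity penalty.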

Leaderboard

Static Target

RTP: Real-time Traffic Perception; HD: Hallucination Detection; KIE: Key Information Extraction; TCD: Traffic Change Detection; DDM: Driving Decision-Making; PTM: Past Traffic Memory

Dynamic Target

AP: Action Prediction; LP: Location Prediction; DP: Distance Prediction

Event Oriented

RP: Risk Prediction; RA: Risk Analysis; ARA: Accident Reason Answering

By default, this leaderboard is sorted by overall accuracy.

| Model | Source | Size | #Frames | RTP | HD | KIE | TCD | DDM | PTM | Static Avg. | AP | LP | DP | Dynamic Avg. | RP | RA | ARA | Event Avg. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ⭐StreamForest (FT-drive) | Ours | 7B | 1fps | 70.1 | 17.1 | 100.0 | 60.0 | 32.7 | 83.6 | 64.6 | 64.0 | 96.6 | 59.6 | 70.7 | 71.8 | 93.4 | 58.3 | 78.5 | 71.2 |
| ⭐StreamForest | Ours | 7B | 1fps | 51.4 | 15.5 | 54.7 | 56.4 | 38.6 | 65.3 | 51.5 | 72.6 | 83.2 | 46.0 | 62.3 | 60.2 | 73.3 | 47.4 | 63.8 | 59.9 |
| Qwen2.5-VL | Alibaba | 7B | 1fps | 51.8 | 8.1 | 79.3 | 49.1 | 36.0 | 57.3 | 48.3 | 50.4 | 82.6 | 46.9 | 57.5 | 47.6 | 78.6 | 52.6 | 59.4 | 55.6 |
| ⭐VideoChat-Online | NJU | 4B | 1fps | 36.9 | 0.8 | 62.3 | 49.1 | 21.5 | 47.0 | 36.1 | 70.2 | 86.7 | 46.4 | 62.9 | 51.2 | 69.4 | 45.5 | 57.4 | 54.5 |
| VideoChat-Flash | Shanghai AI Lab | 7B | 256 | 29.6 | 15.5 | 45.3 | 76.4 | 26.1 | 36.1 | 32.2 | 73.5 | 75.3 | 47.2 | 61.0 | 67.1 | 64.8 | 46.2 | 64.3 | 54.4 |
| InternVL2.5 | Shanghai AI Lab | 8B | 32 | 40.1 | 16.3 | 37.7 | 52.7 | 30.4 | 40.9 | 37.2 | 64.1 | 84.6 | 49.5 | 62.5 | 54.0 | 60.6 | 50.6 | 56.1 | 54.2 |
| LLaVA-OneVision | Bytedance | 7B | 64 | 36.0 | 4.9 | 22.6 | 60.0 | 31.4 | 39.0 | 34.2 | 53.6 | 70.3 | 47.4 | 55.1 | 57.9 | 72.2 | 47.4 | 62.2 | 51.6 |
| MiniCPM-V 2.6 | OpenBMB | 7B | 64 | 20.0 | 87.8 | 15.1 | 49.1 | 26.4 | 20.6 | 27.3 | 71.2 | 73.4 | 47.2 | 60.0 | 73.4 | 33.3 | 16.7 | 53.6 | 49.8 |
| LongVA | LMMs-Lab | 7B | 64 | 29.9 | 7.3 | 37.7 | 47.3 | 38.0 | 33.6 | 31.8 | 66.6 | 58.6 | 50.9 | 56.6 | 57.5 | 58.1 | 46.2 | 56.7 | 50.2 |
| ⭐Dispider | CUHK | 7B | 1fps | 31.1 | 7.3 | 34.0 | 63.6 | 34.0 | 35.4 | 32.5 | 43.2 | 73.1 | 45.8 | 52.7 | 38.2 | 55.4 | 36.5 | 44.3 | 45.2 |
| ⭐Flash-Vstream | THU | 7B | 1fps | 25.4 | 1.6 | 11.3 | 50.9 | 36.0 | 22.1 | 24.8 | 25.5 | 39.8 | 47.2 | 40.2 | 32.4 | 48.6 | 30.1 | 38.1 | 35.7 |

⭐: indicates the input is streaming video

Benchmark

ODV-Bench

[Figure: data composition of ODV-Bench]

Model Architecture

Overview of StreamForest.

The Fine-grained Spatiotemporal Window captures instance-level spatiotemporal features, while the Persistent Event Memory Forest adaptively organizes event-level representations into a set of tree structures. Dashed arrows and feature tokens illustrate potential operations performed during each memory update iteration.
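The two-tier design in the overview can be sketched as a small streaming loop. This is a minimal sketch under stated assumptions rather than the authors' implementation: the class names `Tree` and `StreamingMemory`, the window size, the per-frame token counts, and the placeholder merge rule (merge the adjacent pair of trees covering the shortest time span) are all hypothetical. It only illustrates the data flow: recent frames stay in a fine-grained window, frames leaving the window become single-node trees, and trees are merged so the total visual token count stays bounded however long the stream runs.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Tree:
    """An event-level tree root covering the time span [start, end]."""
    start: float
    end: float
    tokens: int
    children: list = field(default_factory=list)


class StreamingMemory:
    """Toy model of a short-term window feeding a bounded long-term forest."""

    def __init__(self, window_size=4, max_trees=3,
                 window_tokens=64, memory_tokens=16):
        self.window_size = window_size
        self.max_trees = max(max_trees, 1)
        self.window_tokens = window_tokens    # tokens kept per recent frame (assumed)
        self.memory_tokens = memory_tokens    # tokens kept per tree root (assumed)
        self.window = deque()                 # timestamps of recent frames
        self.forest = []                      # tree roots in temporal order

    def add_frame(self, timestamp):
        """Admit a new frame; move the oldest window frame into the forest."""
        self.window.append(timestamp)
        if len(self.window) > self.window_size:
            old = self.window.popleft()
            self.forest.append(Tree(old, old, self.memory_tokens))
            self._shrink()

    def _shrink(self):
        """Placeholder policy: merge the adjacent pair with the shortest span."""
        while len(self.forest) > self.max_trees:
            i = min(range(len(self.forest) - 1),
                    key=lambda k: self.forest[k + 1].end - self.forest[k].start)
            a, b = self.forest[i], self.forest[i + 1]
            self.forest[i:i + 2] = [Tree(a.start, b.end, self.memory_tokens, [a, b])]

    def visual_tokens(self):
        """Total visual tokens that would be passed to the language model."""
        return (len(self.window) * self.window_tokens
                + sum(tree.tokens for tree in self.forest))


memory = StreamingMemory()
for t in range(10):          # ten frames at 1 fps
    memory.add_frame(t)
spans = [(tree.start, tree.end) for tree in memory.forest]
```

In this toy run the window holds the four newest frames while the six older frames are folded into three trees, so the token count depends on the configured sizes and not on the stream length.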

Citation


@article{zeng2025streamforest,
  title={StreamForest: Efficient Online Video Understanding with Persistent Event Memory},
  author={Zeng, Xiangyu and Qiu, Kefan and Zhang, Qingyu and Li, Xinhao and Wang, Jing and Li, Jiaxin and Yan, Ziang and Tian, Kun and Tian, Meng and Zhao, Xinhai and others},
  journal={arXiv preprint arXiv:2509.24871},
  year={2025}
}