StreamForest

Abstract

Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in video understanding. However, their effectiveness in real-time streaming scenarios remains limited by the storage constraints on historical visual features and by insufficient real-time spatiotemporal reasoning. To address these challenges, we propose StreamForest, a novel architecture specifically designed for streaming video understanding. Central to StreamForest is the Persistent Event Memory Forest, a memory mechanism that adaptively organizes video frames into multiple event-level tree structures. This process is guided by penalty functions based on temporal distance, content similarity, and merge frequency, enabling efficient long-term memory retention under limited computational resources. To enhance real-time perception, we introduce a Fine-grained Spatiotemporal Window, which captures detailed short-term visual cues to improve current scene perception. Additionally, we present OnlineIT, an instruction-tuning dataset tailored for streaming video tasks. OnlineIT significantly boosts MLLM performance in both real-time perception and future prediction. To evaluate generalization in practical applications, we introduce ODV-Bench, a new benchmark focused on real-time streaming video understanding in autonomous driving scenarios. Experimental results demonstrate that StreamForest achieves state-of-the-art performance, with accuracies of 77.3% on StreamingBench, 60.5% on OVBench, and 55.6% on OVO-Bench. In particular, even under extreme visual token compression (limited to 1024 tokens), the model retains 96.8% of its average accuracy across eight benchmarks relative to the default setting. These results underscore the robustness, efficiency, and generalizability of StreamForest for streaming video understanding.
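To make the penalty-guided merging idea concrete, here is a minimal sketch of how adjacent event nodes could be merged under a fixed node budget. It is not the authors' implementation. The `EventNode` class, the weighted-sum form of `merge_penalty`, the reading of "temporal distance" as distance from the current time, the feature averaging in `merge`, and all weights are assumptions made for illustration only.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class EventNode:
    """Hypothetical root of one event tree (illustrative only)."""
    feature: List[float]   # pooled visual feature of the covered frames
    timestamp: float       # mean timestamp of the covered frames
    merge_count: int = 0   # number of merge steps that produced this node


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_penalty(left, right, now, w_time=1.0, w_sim=1.0, w_freq=1.0):
    """Assumed penalty: a lower value marks a better merge candidate."""
    # Recent pairs are penalized so that fresh events keep more detail.
    recency = 1.0 / (1.0 + now - max(left.timestamp, right.timestamp))
    # Dissimilar pairs are penalized so that distinct events stay apart.
    dissimilarity = 1.0 - cosine(left.feature, right.feature)
    # Heavily merged nodes are penalized to avoid over-compressing one event.
    frequency = left.merge_count + right.merge_count
    return w_time * recency + w_sim * dissimilarity + w_freq * frequency


def merge(left: EventNode, right: EventNode) -> EventNode:
    """Combine two adjacent nodes by averaging (an assumed pooling rule)."""
    feature = [(x + y) / 2 for x, y in zip(left.feature, right.feature)]
    timestamp = (left.timestamp + right.timestamp) / 2
    return EventNode(feature, timestamp,
                     max(left.merge_count, right.merge_count) + 1)


def compress(nodes: List[EventNode], budget: int, now: float) -> List[EventNode]:
    """Merge the lowest-penalty adjacent pair until the budget is met."""
    nodes = list(nodes)
    while len(nodes) > budget and len(nodes) > 1:
        i = min(range(len(nodes) - 1),
                key=lambda k: merge_penalty(nodes[k], nodes[k + 1], now))
        nodes[i:i + 2] = [merge(nodes[i], nodes[i + 1])]
    return nodes


# Four single-frame nodes: two similar old frames, then two similar new ones.
frames = [
    EventNode([1.0, 0.0], 0.0),
    EventNode([1.0, 0.0], 1.0),
    EventNode([0.0, 1.0], 2.0),
    EventNode([0.0, 1.0], 3.0),
]
memory = compress(frames, budget=3, now=4.0)
```

With these toy inputs the oldest similar pair is merged first, so older content is stored more coarsely while the most recent frames stay separate. Changing the weights shifts that trade-off.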

Leaderboard

Static Target

RTP: Real-time Traffic Perception; HD: Hallucination Detection; KIE: Key Information Extraction

TCD: Traffic Change Detection; DDM: Driving Decision-Making; PTM: Past Traffic Memory

Dynamic Target

AP: Action Prediction; LP: Location Prediction; DP: Distance Prediction

Event Oriented

RP: Risk Prediction; RA: Risk Analysis; ARA: Accident Reason Answering

By default, this leaderboard is sorted by overall accuracy.

#	Model (Organization)	Size	#Frames	RTP	HD	KIE	TCD	DDM	PTM	Static Avg.	AP	LP	DP	Dynamic Avg.	RP	RA	ARA	Event Avg.	Overall
	⭐StreamForest (FT-drive) Ours	7B	1fps	70.1	17.1	100.0	60.0	32.7	83.6	64.6	64.0	96.6	59.6	70.7	71.8	93.4	58.3	78.5	71.2
	⭐StreamForest Ours	7B	1fps	51.4	15.5	54.7	56.4	38.6	65.3	51.5	72.6	83.2	46.0	62.3	60.2	73.3	47.4	63.8	59.9
	Qwen2.5-VL Alibaba	7B	1fps	51.8	8.1	79.3	49.1	36.0	57.3	48.3	50.4	82.6	46.9	57.5	47.6	78.6	52.6	59.4	55.6
	⭐VideoChat-Online NJU	4B	1fps	36.9	0.8	62.3	49.1	21.5	47.0	36.1	70.2	86.7	46.4	62.9	51.2	69.4	45.5	57.4	54.5
	VideoChat-Flash Shanghai AI Lab	7B	256	29.6	15.5	45.3	76.4	26.1	36.1	32.2	73.5	75.3	47.2	61.0	67.1	64.8	46.2	64.3	54.4
	InternVL2.5 Shanghai AI Lab	8B	32	40.1	16.3	37.7	52.7	30.4	40.9	37.2	64.1	84.6	49.5	62.5	54.0	60.6	50.6	56.1	54.2
	LLaVA-OneVision Bytedance	7B	64	36.0	4.9	22.6	60.0	31.4	39.0	34.2	53.6	70.3	47.4	55.1	57.9	72.2	47.4	62.2	51.6
	MiniCPM-V 2.6 OpenBMB	7B	64	20.0	87.8	15.1	49.1	26.4	20.6	27.3	71.2	73.4	47.2	60.0	73.4	33.3	16.7	53.6	49.8
	LongVA LMMs-Lab	7B	64	29.9	7.3	37.7	47.3	38.0	33.6	31.8	66.6	58.6	50.9	56.6	57.5	58.1	46.2	56.7	50.2
	⭐Dispider CUHK	7B	1fps	31.1	7.3	34.0	63.6	34.0	35.4	32.5	43.2	73.1	45.8	52.7	38.2	55.4	36.5	44.3	45.2
	⭐Flash-Vstream THU	7B	1fps	25.4	1.6	11.3	50.9	36.0	22.1	24.8	25.5	39.8	47.2	40.2	32.4	48.6	30.1	38.1	35.7

⭐ indicates that the model takes streaming video as input.
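For readers who want to re-rank the leaderboard by a column other than overall accuracy, the following is a small sketch that parses tab-separated rows like those above and sorts them by any score column. The helper names `parse_row` and `sort_by` and the `COLUMNS` labels are illustrative choices, and only three rows from the table are included to keep the example short.

```python
# Score columns in table order (group averages are given explicit names here).
COLUMNS = [
    "RTP", "HD", "KIE", "TCD", "DDM", "PTM", "Static Avg.",
    "AP", "LP", "DP", "Dynamic Avg.",
    "RP", "RA", "ARA", "Event Avg.", "Overall",
]

# Three rows copied from the leaderboard; the first cell (rank) is empty.
RAW_ROWS = [
    "\t⭐StreamForest Ours\t7B\t1fps\t51.4\t15.5\t54.7\t56.4\t38.6\t65.3\t51.5"
    "\t72.6\t83.2\t46.0\t62.3\t60.2\t73.3\t47.4\t63.8\t59.9",
    "\tQwen2.5-VL Alibaba\t7B\t1fps\t51.8\t8.1\t79.3\t49.1\t36.0\t57.3\t48.3"
    "\t50.4\t82.6\t46.9\t57.5\t47.6\t78.6\t52.6\t59.4\t55.6",
    "\tMiniCPM-V 2.6 OpenBMB\t7B\t64\t20.0\t87.8\t15.1\t49.1\t26.4\t20.6\t27.3"
    "\t71.2\t73.4\t47.2\t60.0\t73.4\t33.3\t16.7\t53.6\t49.8",
]


def parse_row(line: str) -> dict:
    """Split one tab-separated leaderboard row into named fields."""
    cells = line.rstrip("\n").split("\t")
    model, size, frames = cells[1:4]          # cells[0] is the empty rank cell
    scores = dict(zip(COLUMNS, map(float, cells[4:])))
    return {"model": model, "size": size, "frames": frames, **scores}


def sort_by(rows: list, column: str) -> list:
    """Return rows ordered from highest to lowest score in `column`."""
    return sorted(rows, key=lambda row: row[column], reverse=True)


rows = [parse_row(line) for line in RAW_ROWS]
by_overall = [row["model"] for row in sort_by(rows, "Overall")]
by_hd = [row["model"] for row in sort_by(rows, "HD")]
```

Sorting by a single task column can reorder models considerably, which is why the per-task columns are worth reading alongside the overall score.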

ODV-Bench

Overview of StreamForest.

The Fine-grained Spatiotemporal Window captures instance-level spatiotemporal features, while the Persistent Event Memory Forest adaptively organizes event-level representations into a set of tree structures. Dashed arrows and feature tokens illustrate potential operations performed during each memory update iteration.
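The split between a short-term window and a long-term memory can be pictured with a small sketch. It assumes, for illustration only, that the window keeps the most recent frame features at full detail and hands older ones to a long-term store through a callback. The `StreamingContext` class, the `on_evict` hook, and the eviction rule are hypothetical and are not taken from the StreamForest code.

```python
from collections import deque
from typing import Callable, List


class StreamingContext:
    """Toy short-term window that forwards evicted frames to long-term memory."""

    def __init__(self, window_size: int, on_evict: Callable[[str], None]):
        self.window_size = window_size
        self.on_evict = on_evict      # assumed hook into the long-term store
        self.window = deque()         # most recent frame features, oldest first

    def push(self, frame_feature: str) -> None:
        """Add the newest frame and evict the oldest ones past the window size."""
        self.window.append(frame_feature)
        while len(self.window) > self.window_size:
            self.on_evict(self.window.popleft())

    def recent(self) -> List[str]:
        """Frames currently held at full detail."""
        return list(self.window)


long_term: List[str] = []             # stand-in for the event memory
context = StreamingContext(window_size=3, on_evict=long_term.append)
for t in range(5):
    context.push(f"frame{t}")
```

After five frames, the window holds the three newest entries and the two oldest have been passed to `long_term`, where a memory mechanism could then organize them into event-level structures.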


@article{zeng2025streamforest,
  title={StreamForest: Efficient Online Video Understanding with Persistent Event Memory},
  author={Zeng, Xiangyu and Qiu, Kefan and Zhang, Qingyu and Li, Xinhao and Wang, Jing and Li, Jiaxin and Yan, Ziang and Tian, Kun and Tian, Meng and Zhao, Xinhai and others},
  journal={arXiv preprint arXiv:2509.24871},
  year={2025}
}

NeurIPS 2025 Spotlight

StreamForest: Efficient Online Video Understanding with Persistent Event Memory

