VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models

Apple ML Research · AI · May 22, 2026

Streaming vision-language models (VLMs) continuously generate responses given an instruction prompt and an online stream of input frames. This is a core mechanism for real-time visual assistants. Existing VLM frameworks predominantly assess models in offline settings. In contrast, the performance of a streaming VLM depends on additional metrics beyond pure video understanding, including proactiveness, which reflects the timeliness of the model’s responses, and consistency, which captures the rob...
