Qwen3-VL can scan two-hour videos and pinpoint nearly every detail
SMRTR summary
Alibaba's newest AI model, Qwen3-VL, can locate a single frame hidden in a two-hour video with 99.5 percent accuracy in what researchers call "needle-in-a-haystack" tests. The model works within a 256,000-token context window, which lets it take in hours of video footage or hundreds of document pages in a single pass.
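To make the "needle-in-a-haystack" idea concrete, here is a minimal Python sketch of how such a test could be scored. It is an illustration only, not Alibaba's evaluation code: the frames are toy strings, `find_needle` is a hypothetical stand-in for a real model call, and the trial and frame counts are arbitrary assumptions.

```python
import random

def make_haystack(num_frames, needle_pos, needle="NEEDLE"):
    """Build a toy 'video': a list of frame labels with one needle frame."""
    frames = [f"frame-{i}" for i in range(num_frames)]
    frames[needle_pos] = needle
    return frames

def find_needle(frames, needle="NEEDLE"):
    """Hypothetical stand-in for a model: returns the index it thinks holds
    the needle. A real test would query the vision-language model here."""
    return frames.index(needle)

def needle_accuracy(num_trials, num_frames, locate=find_needle, seed=0):
    """Hide the needle at a random position each trial and report the
    fraction of trials in which `locate` returns the correct index."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_trials):
        pos = rng.randrange(num_frames)
        frames = make_haystack(num_frames, pos)
        if locate(frames) == pos:
            hits += 1
    return hits / num_trials
```

Under this framing, a reported accuracy of 99.5 percent would correspond to the model returning the correct position in 199 of every 200 trials.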
In benchmark comparisons, the model often outperforms proprietary systems such as OpenAI's GPT-5 and Google's Gemini. It does especially well in visual mathematics, where it scored 85.8 percent against GPT-5's 81.3 percent.
The system was trained on one trillion tokens using up to 10,000 GPUs, drawing on web scrapes, millions of PDFs, and more than 60 million STEM problems. Three technical advances underpin its capabilities: a new mathematical positioning system for long videos, deeper access to visual processing layers, and simple text timestamps in place of more complex time-coding methods.
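The third advance, plain-text timestamps, can be pictured as writing a readable time marker next to each sampled video frame in the model's input. The Python sketch below is a rough illustration under stated assumptions: the `<0.5 seconds>` marker format, the `<frame>` placeholder, and the function name are hypothetical and are not taken from Qwen3-VL's actual input format.

```python
def interleave_timestamps(num_frames, fps, placeholder="<frame>"):
    """Return a prompt string in which every frame placeholder is preceded
    by a plain-text timestamp.

    The marker format and placeholder are assumptions for illustration,
    not the model's real tokens.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    parts = []
    for i in range(num_frames):
        seconds = i / fps  # time at which frame i was sampled
        parts.append(f"<{seconds:.1f} seconds>")
        parts.append(placeholder)
    return "".join(parts)

# Three frames sampled at two frames per second.
print(interleave_timestamps(3, 2.0))
```

Because the time information is ordinary text, the model can read and quote it like any other token, which is the simplification the summary describes.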
Unlike its commercial competitors, Alibaba released all model weights freely under the Apache 2.0 license, which could accelerate open-source development of multimodal AI applications.
SMRTR provides this summary for quick context. The original article belongs to Daily.dev.