LLM 'benchmark' as a 1v1 RTS game where models write code controlling the units
SMRTR summary
A new benchmark tests large language models by pitting them against each other in 1v1 real-time strategy games, in which each model writes JavaScript code to control its 9 units in tactical combat scenarios. The models refine their strategies through an iterative write-code, play, review cycle before competing in round-robin tournaments, where Gemini 3.1 Pro dominated by winning 46 out of 50 games.
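As a rough illustration of the round-robin format, the sketch below plays every pair of entrants against each other and tallies wins. It is not the benchmark's actual harness: the model names, the `strength` table, the `play_match` function, and the `games_per_pair` parameter are all hypothetical stand-ins, and a real match would run the models' JavaScript in the game engine rather than compare fixed numbers.

```python
from collections import Counter
from itertools import combinations


def round_robin(players, play_match, games_per_pair=1):
    """Play every pair of players against each other and tally wins."""
    wins = Counter({p: 0 for p in players})
    for a, b in combinations(players, 2):
        for _ in range(games_per_pair):
            winner = play_match(a, b)  # returns a, b, or None for a draw
            if winner is not None:
                wins[winner] += 1
    return wins


# Hypothetical stand-in for a real game: the "stronger" entrant always wins.
strength = {"model_a": 3, "model_b": 2, "model_c": 1}


def play_match(a, b):
    if strength[a] == strength[b]:
        return None
    return a if strength[a] > strength[b] else b


standings = round_robin(list(strength), play_match, games_per_pair=2)
for name, won in standings.most_common():
    print(name, won)
```

Keeping the match function separate from the scheduling loop means the same tally logic would work whether a game is simulated, as here, or actually played out by the submitted unit-control code.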
SMRTR provides this summary for quick context. The original article belongs to lobste.rs.