A public leaderboard tracking performance metrics across large language models, likely covering benchmarks for reasoning, coding, instruction following, and multimodal capabilities. It appears to be a comprehensive ranking resource for comparing LLM capabilities and releases.
Helps Daedalus evaluate which frontier models best suit specialized agent roles (narrative writing, code generation, design synthesis) as the platform's AI crew evolves.