CoLLAB: A Framework for Designing Scalable Benchmarks for Agentic LLMs
GitHub Repo · Paper (OpenReview) · 🚀 Demo
Agents that make decisions from complex, unstructured instructions have surged with the rise of large language models (LLMs). However, their ability to coordinate with other agents while following such instructions remains an active area of research.

To facilitate research in this area, we introduce the Coordinating LLM Agents Benchmark (CoLLAB), a framework for designing scalable environments that evaluate coordination in agentic LLM networks. CoLLAB adapts Distributed Constraint Optimization Problems (DCOPs), a widely used classical framework for cooperative multi-agent problem solving, and extends it with unstructured instructions and communication, making it directly relevant to studying coordination among LLM agents. We provide a design blueprint for how CoLLAB environments can scale across multiple dimensions. Finally, we implement three case-study environments within the framework and evaluate a range of LLM-based agents on them. Performance is compared quantitatively against well-established symbolic solvers, allowing us to assess the quality of LLM-generated solutions relative to a proven baseline. We also study how scaling environment complexity affects agent performance across these environments.
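To give a concrete picture of the underlying formalism, the sketch below shows a minimal DCOP: agents control discrete variables, binary constraints assign costs to joint value choices, and a solver searches for the lowest-cost joint assignment. This is an illustrative example only; the names (`variables`, `mismatch_cost`, `total_cost`) are hypothetical and do not reflect the actual CoLLAB API, and the brute-force search merely stands in for the symbolic solvers used as baselines.

```python
# Minimal sketch of a Distributed Constraint Optimization Problem (DCOP),
# the classical formalism CoLLAB builds on. Illustrative only -- names here
# are hypothetical and not part of the CoLLAB codebase.
from itertools import product

# Each agent controls one variable with a small discrete domain.
variables = {"a1": [0, 1], "a2": [0, 1], "a3": [0, 1]}

def mismatch_cost(x, y):
    """Cost of 1 whenever two neighbouring agents pick different values."""
    return 0 if x == y else 1

# Binary constraints link pairs of agents via a cost function.
constraints = [
    (("a1", "a2"), mismatch_cost),
    (("a2", "a3"), mismatch_cost),
]

def total_cost(assignment):
    """Sum constraint costs for a complete assignment of values to agents."""
    return sum(
        cost_fn(assignment[u], assignment[v])
        for (u, v), cost_fn in constraints
    )

# Brute-force search over joint assignments stands in for a symbolic solver;
# in CoLLAB, LLM agents would instead coordinate their choices through
# unstructured natural-language communication.
best = min(
    (dict(zip(variables, values)) for values in product(*variables.values())),
    key=total_cost,
)
print(best, "cost:", total_cost(best))
```

In a CoLLAB-style environment, the same structure is expressed through unstructured instructions given to each agent, and the resulting joint assignment can be scored against the symbolic-solver optimum shown above.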