LangMARL Documentation
======================

**LangMARL** is a language-space multi-agent reinforcement learning library that applies credit assignment and policy gradient optimization from classical MARL into natural language space. It enables principled autonomous optimization of multi-agent LLM-based systems via **Centralized Training with Decentralized Execution (CTDE)**.

.. code-block:: python

   import langmarl

   langmarl.train("configs/language_task/qa_central_credit.json")

Core Concepts
-------------

LangMARL treats **natural language as a first-class optimization space**:

* **Policies are Language** -- Each agent's policy is a natural language instruction (system prompt), not a numeric parameter vector.
* **Credits are Language** -- A centralized critic assigns agent-level credit via trajectory-level language analysis, producing causal and interpretable feedback.
* **Optimization is Language Evolution** -- Policies are updated via language gradients (improvement instructions) instead of numeric gradients.

Architecture
------------

LangMARL implements the CTDE paradigm with four components:

1. **LLM Actors** -- Each agent is an LLM whose behavior is governed by a natural language policy. During execution, each agent observes only its local information and acts independently.
2. **Centralized Critic** -- Used only during training. Evaluates full episode trajectories and generates per-agent credit using LLM-as-judge.
3. **Policy Gradient Optimizer** -- Converts credit signals into language gradients (concrete improvement instructions) and applies them to agent policies.
4. **Monte Carlo Trainer** -- Orchestrates the training loop: collect trajectories, evaluate, generate gradients, aggregate, and update policies.

Training Paradigms
------------------

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Paradigm
     - Description
   * - ``central_global``
     - A shared critic evaluates overall team performance. All agents receive the same shared gradient signal.
   * - ``central_credit``
     - A shared critic evaluates each agent's individual contribution to team success. Each agent receives a targeted per-agent gradient.

Supported Environments
----------------------

.. list-table::
   :header-rows: 1
   :widths: 25 20 55

   * - Environment
     - Agents
     - Description
   * - Language Tasks
     - 2+
     - Sequential collaboration on QA (HotPotQA), Math, Creative Writing, and Coding (HumanEval).
   * - Overcooked-AI
     - 2
     - Cooperative cooking with sparse team rewards and role differentiation.
   * - Pistonball
     - 10--20
     - Large-scale cooperative control with partial observability.

Custom environments can be registered via the ``@langmarl.register_env`` decorator.

.. note::

   This project is under active development.

Contents
--------

.. toctree::

   usage
   api
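Illustrative Sketch
-------------------

The training loop described under Architecture (collect trajectories, evaluate, generate gradients, aggregate, update policies) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not LangMARL's implementation: ``Actor``, ``collect_trajectory``, ``centralized_credit``, ``language_gradient``, and ``train_step`` are hypothetical names, the LLM, critic, and optimizer are replaced by stub functions, and the policy update simply appends the aggregated improvement instructions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for any LLM call: prompt string in, reply string out.
LLM = Callable[[str], str]


@dataclass
class Actor:
    name: str
    policy: str  # the policy is a natural language instruction (system prompt)

    def act(self, llm: LLM, local_obs: str) -> str:
        # Decentralized execution: the agent sees only its policy and local observation.
        return llm(f"{self.policy}\nObservation: {local_obs}")


def collect_trajectory(actors: List[Actor], llm: LLM, task: str) -> List[Dict[str, str]]:
    trajectory, obs = [], task
    for actor in actors:
        action = actor.act(llm, obs)
        trajectory.append({"agent": actor.name, "obs": obs, "action": action})
        obs = action  # sequential collaboration: the next agent sees this output
    return trajectory


def centralized_credit(critic: LLM, trajectory: List[Dict[str, str]]) -> Dict[str, str]:
    # The critic reads the full trajectory and returns one credit text per agent.
    transcript = "\n".join(f"{s['agent']}: {s['action']}" for s in trajectory)
    return {
        s["agent"]: critic(f"Assess the contribution of {s['agent']}.\n{transcript}")
        for s in trajectory
    }


def language_gradient(optimizer: LLM, policy: str, credit: str) -> str:
    # Turn a credit signal into a concrete improvement instruction.
    return optimizer(f"Policy: {policy}\nCredit: {credit}\nSuggest one improvement.")


def train_step(actors: List[Actor], llm: LLM, critic: LLM, optimizer: LLM,
               tasks: List[str]) -> None:
    grads: Dict[str, List[str]] = {a.name: [] for a in actors}
    for task in tasks:
        trajectory = collect_trajectory(actors, llm, task)
        credits = centralized_credit(critic, trajectory)
        for a in actors:
            grads[a.name].append(language_gradient(optimizer, a.policy, credits[a.name]))
    for a in actors:
        # Aggregate by dropping duplicate instructions (order preserved), then update.
        merged = list(dict.fromkeys(grads[a.name]))
        a.policy = a.policy + "\n" + "\n".join(merged)


# Stub models so the sketch runs without any LLM backend.
def stub_llm(prompt: str) -> str:
    return f"reply({len(prompt)})"


def stub_critic(prompt: str) -> str:
    return "credit: " + prompt.splitlines()[0]


def stub_optimizer(prompt: str) -> str:
    return "Be more specific."


actors = [Actor("retriever", "Find relevant facts."), Actor("answerer", "Answer concisely.")]
train_step(actors, stub_llm, stub_critic, stub_optimizer, ["Q1", "Q2"])
print(actors[0].policy)
```

In this sketch each agent receives its own credit text, which mirrors the ``central_credit`` paradigm; passing the same critic output to every agent would correspond to ``central_global``.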