LangMARL Documentation

LangMARL is a multi-agent reinforcement learning (MARL) library that operates in language space: it applies credit assignment and policy gradient optimization from classical MARL to natural language. It enables principled, autonomous optimization of multi-agent LLM-based systems under the Centralized Training with Decentralized Execution (CTDE) paradigm.

import langmarl

langmarl.train("configs/language_task/qa_central_credit.json")

Core Concepts

LangMARL treats natural language as a first-class optimization space:

  • Policies are Language – Each agent’s policy is a natural language instruction (system prompt), not a numeric parameter vector.

  • Credits are Language – A centralized critic assigns agent-level credit via trajectory-level language analysis, producing causal and interpretable feedback.

  • Optimization is Language Evolution – Policies are updated via language gradients (improvement instructions) instead of numeric gradients.
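To make the three ideas above concrete, here is a minimal sketch of a policy stored as text and updated by a language gradient. The names `LanguagePolicy` and `apply_language_gradient` are hypothetical and are not taken from LangMARL's API. Appending the instruction is a simple stand-in for the rewrite step, which in practice would presumably be carried out by an LLM.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LanguagePolicy:
    """Hypothetical container: an agent's policy is just its instruction text."""
    agent_id: str
    instruction: str  # the system prompt that governs the agent's behavior


def apply_language_gradient(policy: LanguagePolicy, gradient: str) -> LanguagePolicy:
    """Return an updated policy that incorporates an improvement instruction.

    Assumption: the update is a plain append. A real optimizer would likely
    ask an LLM to rewrite the instruction instead.
    """
    revised = f"{policy.instruction}\nAdditional guidance: {gradient}"
    return replace(policy, instruction=revised)


old = LanguagePolicy("retriever", "Find supporting passages.")
new = apply_language_gradient(old, "Cite the passage title.")
print(new.instruction)
```

The policy is immutable here so that each update yields a new version, which makes it easy to keep a history of how an instruction evolved.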

Architecture

LangMARL implements the CTDE paradigm with four components:

  1. LLM Actors – Each agent is an LLM whose behavior is governed by a natural language policy. During execution, each agent observes only its local information and acts independently.

  2. Centralized Critic – Used only during training. Evaluates full episode trajectories and generates per-agent credit using an LLM as judge.

  3. Policy Gradient Optimizer – Converts credit signals into language gradients (concrete improvement instructions) and applies them to agent policies.

  4. Monte Carlo Trainer – Orchestrates the training loop: collect trajectories, evaluate, generate gradients, aggregate, and update policies.
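The loop run by the trainer (collect trajectories, evaluate, generate gradients, aggregate, update) can be sketched as follows. This is an illustration under stated assumptions, not LangMARL's implementation: every function name is hypothetical, and the actor, critic, and optimizer are deterministic stubs standing in for LLM calls.

```python
from typing import Dict, List

Policies = Dict[str, str]          # agent name -> natural language policy
Trajectory = List[Dict[str, str]]  # one recorded step per agent


def collect_trajectory(policies: Policies, task: str) -> Trajectory:
    # Stub actor: a real actor would call an LLM with the policy as system prompt.
    return [{"agent": name, "action": f"{name} handles '{task}'"} for name in policies]


def critic_credit(trajectory: Trajectory) -> Dict[str, str]:
    # Stub centralized critic: sees the full trajectory, returns per-agent credit.
    return {s["agent"]: f"{s['agent']} gave no justification for its step" for s in trajectory}


def to_gradient(credit: str) -> str:
    # Stub optimizer: turn a credit statement into an improvement instruction.
    return f"Address this feedback: {credit}"


def aggregate(gradients: List[str]) -> str:
    # Merge gradients from several episodes, dropping duplicates but keeping order.
    return "; ".join(dict.fromkeys(gradients))


def train(policies: Policies, tasks: List[str]) -> Policies:
    policies = dict(policies)  # do not mutate the caller's dict
    pending: Dict[str, List[str]] = {name: [] for name in policies}
    for task in tasks:  # Monte Carlo style: finish whole episodes before updating
        trajectory = collect_trajectory(policies, task)
        for agent, credit in critic_credit(trajectory).items():
            pending[agent].append(to_gradient(credit))
    for agent, grads in pending.items():
        policies[agent] += "\nRevision notes: " + aggregate(grads)
    return policies


updated = train({"planner": "Plan the steps.", "solver": "Solve the task."}, ["q1", "q2"])
print(updated["planner"])
```

The critic is invoked only inside `train`, mirroring the CTDE split: `collect_trajectory` is the only part that would remain at execution time.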

Training Paradigms

Paradigm        Description
--------------  -----------
central_global  A shared critic evaluates overall team performance. All agents receive the same shared gradient signal.
central_credit  A shared critic evaluates each agent’s individual contribution to team success. Each agent receives a targeted per-agent gradient.
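The difference between the two paradigms comes down to which feedback each agent receives. The sketch below illustrates this with a hypothetical helper, `assign_gradients`, which is not part of LangMARL. Only the paradigm names come from the table above, and the feedback strings are made up for the example.

```python
from typing import Dict


def assign_gradients(
    paradigm: str,
    team_feedback: str,
    agent_feedback: Dict[str, str],
) -> Dict[str, str]:
    """Map each agent to the gradient it would receive under a paradigm."""
    if paradigm == "central_global":
        # Every agent gets the same team-level signal.
        return {agent: team_feedback for agent in agent_feedback}
    if paradigm == "central_credit":
        # Each agent gets feedback on its own contribution.
        return dict(agent_feedback)
    raise ValueError(f"unknown paradigm: {paradigm!r}")


team = "The final answer missed the second hop."
per_agent = {
    "retriever": "Retrieve a passage for each entity in the question.",
    "answerer": "Quote the retrieved evidence before answering.",
}

print(assign_gradients("central_global", team, per_agent))
print(assign_gradients("central_credit", team, per_agent))
```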

Supported Environments

Environment     Agents  Description
--------------  ------  -----------
Language Tasks  2+      Sequential collaboration on QA (HotPotQA), Math, Creative Writing, and Coding (HumanEval).
Overcooked-AI   2       Cooperative cooking with sparse team rewards and role differentiation.
Pistonball      10–20   Large-scale cooperative control with partial observability.

Custom environments can be registered via the @langmarl.register_env decorator.
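The signature of the decorator and the interface an environment must implement are not shown here, so the following is only a standalone sketch of how a decorator-based registry generally works. The local `register_env`, the `ENV_REGISTRY` dict, and the `reset`/`step` methods of `EchoEnv` are all assumptions made for illustration, not LangMARL's actual API.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry: maps an environment name to its class.
ENV_REGISTRY: Dict[str, type] = {}


def register_env(name: str) -> Callable[[type], type]:
    """Local stand-in for a registration decorator (not langmarl.register_env)."""
    def decorator(cls: type) -> type:
        if name in ENV_REGISTRY:
            raise ValueError(f"environment {name!r} is already registered")
        ENV_REGISTRY[name] = cls
        return cls  # return the class unchanged so it stays usable directly
    return decorator


@register_env("echo")
class EchoEnv:
    """Toy cooperative task: the team is rewarded when all agents agree."""

    def __init__(self, num_agents: int = 2) -> None:
        self.agents = [f"agent_{i}" for i in range(num_agents)]

    def reset(self) -> Dict[str, str]:
        # Each agent receives only its own local observation.
        return {agent: "Say the same word as your teammates." for agent in self.agents}

    def step(self, actions: Dict[str, str]) -> Tuple[float, bool]:
        # Shared team reward, single-step episode.
        reward = 1.0 if len(set(actions.values())) == 1 else 0.0
        return reward, True


env = ENV_REGISTRY["echo"]()
observations = env.reset()
print(env.step({agent: "hello" for agent in observations}))
```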

Note: This project is under active development.
