API Reference

Convenience Functions

langmarl.train(config_path, **overrides)

One-line training: load config, create components, run training.

Parameters:

config_path (str) – Path to a JSON config file.
overrides – Keyword arguments (key=value) that override fields of the loaded config.

Returns:

Training metrics.

import langmarl

langmarl.train("configs/language_task/qa_central_credit.json")
langmarl.train("configs/qa.json", num_iterations=10, paradigm="central_global")
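
As a rough illustration of how keyword overrides might be layered over a loaded JSON config, the sketch below uses only the standard library. `load_with_overrides` is a hypothetical helper and not part of langmarl; the assumption that an override replaces the top-level key of the same name is ours, not something this reference states.

```python
import json
import os
import tempfile


def load_with_overrides(path, **overrides):
    """Hypothetical helper: read a JSON config and apply top-level overrides."""
    with open(path, "r", encoding="utf-8") as f:
        config = json.load(f)
    # Assumption: an override simply replaces the top-level key of the same name.
    config.update(overrides)
    return config


# Write a throwaway config so the example runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = os.path.join(tmp, "qa.json")
    with open(cfg_path, "w", encoding="utf-8") as f:
        json.dump({"num_iterations": 5, "paradigm": "central_credit"}, f)
    merged = load_with_overrides(
        cfg_path, num_iterations=10, paradigm="central_global"
    )
```

Keys that are not overridden keep the value from the file.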

langmarl.load_config(path, overrides=None)

Load a configuration from a JSON file.

Parameters:

path (str) – Path to JSON config file.
overrides (dict or None) – Optional dictionary of overrides.

Returns:

BaseConfig or subclass instance (LanguageTaskConfig, etc.).

langmarl.make_env(name, config)

Create an environment instance by registered name.

Parameters:

name (str) – Environment name (e.g., "language", "overcooked", "pistonball").
config (BaseConfig) – Configuration object.

Returns:

Environment instance.

Return type:

BaseEnvironment

Raises:

ValueError – If the environment name is not registered.

langmarl.register_env(name)

Decorator to register a custom environment class.

Parameters: name (str) – Name to register the environment under.

import langmarl

@langmarl.register_env("my_env")
class MyEnv(langmarl.BaseEnvironment):
    ...

langmarl.list_envs()

List all registered environment names.

Returns: List of environment name strings.
Return type: list[str]
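
The three registry functions (register_env, make_env, list_envs) follow a common decorator-registry pattern. The sketch below is a self-contained stand-in for that pattern, not langmarl's actual implementation: the `_ENV_REGISTRY` dict, the sorted return order, and the error message are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Stand-in registry; the real one lives inside langmarl and may differ.
_ENV_REGISTRY: Dict[str, type] = {}


def register_env(name: str) -> Callable[[type], type]:
    """Class decorator that records the decorated class under `name`."""
    def decorator(cls: type) -> type:
        _ENV_REGISTRY[name] = cls
        return cls  # return the class unchanged so it stays usable directly
    return decorator


def make_env(name: str, config: dict):
    """Instantiate a registered class, or raise ValueError for unknown names."""
    if name not in _ENV_REGISTRY:
        raise ValueError(f"Unknown environment: {name!r}")
    return _ENV_REGISTRY[name](config)


def list_envs() -> List[str]:
    """Return the registered names (sorted here for a stable order)."""
    return sorted(_ENV_REGISTRY)


@register_env("my_env")
class MyEnv:
    def __init__(self, config: dict):
        self.config = config


env = make_env("my_env", {"num_agents": 2})
```

Because the decorator returns the class unchanged, a registered class can still be instantiated directly in tests.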

Configuration

BaseConfig

class langmarl.BaseConfig

Shared configuration fields across all environments.

Parameters:

exp_name (str) – Experiment name (default: "experiment").
paradigm (str) – Training paradigm, either "central_global" or "central_credit" (default: "central_credit").
num_iterations (int) – Number of training iterations (default: 5).
trajectories_per_iteration (int) – Episodes per iteration (default: 10).
mini_batch_size (int or None) – If set, subsample this many trajectories for gradient computation (default: None, which uses all trajectories).
start_iteration (int) – Iteration to start from (default: 0).
llm (LLMConfig or None) – Default LLM config, used as fallback for actor/critic/optimizer (default: None).
actor_llm (LLMConfig or None) – LLM config for actors. Falls back to llm if not set.
critic_llm (LLMConfig or None) – LLM config for the critic. Falls back to llm if not set.
optimizer_llm (LLMConfig or None) – LLM config for the optimizer. Falls back to llm if not set.
num_agents (int) – Number of agents (default: 2).
experiment_dir (str) – Directory to save experiment data (default: "./experiments").
checkpoint_dir (str) – Directory to save policy checkpoints (default: "./ckpt_policy").
max_workers (int) – Maximum parallel workers for trajectory generation (default: 5).
log_level (str) – Logging level (default: "INFO").

get_actor_llm()

Get LLM config for actors. Returns actor_llm if set, otherwise llm.

Return type: LLMConfig

get_critic_llm()

Get LLM config for the critic. Returns critic_llm if set, otherwise llm.

Return type: LLMConfig

get_optimizer_llm()

Get LLM config for the optimizer. Returns optimizer_llm if set, otherwise llm.

Return type: LLMConfig
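
The fallback rule shared by the three getters can be sketched with plain dataclasses. `LLMConfigSketch` and `ConfigSketch` are hypothetical stand-ins that carry only the fields needed to show the rule; they are not the langmarl classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LLMConfigSketch:
    name: str


@dataclass
class ConfigSketch:
    llm: Optional[LLMConfigSketch] = None
    actor_llm: Optional[LLMConfigSketch] = None
    critic_llm: Optional[LLMConfigSketch] = None

    def get_actor_llm(self) -> Optional[LLMConfigSketch]:
        # Role-specific config wins; otherwise fall back to the shared default.
        return self.actor_llm if self.actor_llm is not None else self.llm

    def get_critic_llm(self) -> Optional[LLMConfigSketch]:
        return self.critic_llm if self.critic_llm is not None else self.llm


cfg = ConfigSketch(
    llm=LLMConfigSketch("gpt-4o-mini"),
    critic_llm=LLMConfigSketch("gpt-4o"),
)
actor_name = cfg.get_actor_llm().name    # falls back to the shared llm
critic_name = cfg.get_critic_llm().name  # uses the critic-specific config
```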

classmethod from_json(path, overrides=None)

Load config from a JSON file with optional overrides. LLM fields can be specified as predefined model name strings or full config dicts.

Parameters:

path (str) – Path to JSON config file.
overrides (dict or None) – Optional dictionary of overrides.

Returns:

Config instance.

to_json(path)

Save config to a JSON file.

Parameters: path (str) – Output file path.

LanguageTaskConfig

class langmarl.LanguageTaskConfig

Configuration for language task environments. Inherits all fields from BaseConfig.

Parameters:

task_type (str) – Task type. One of "qa", "math", "writing", "coding" (default: "qa").
benchmark_path (str) – Path to the benchmark dataset directory.
data_limit (int or None) – Limit the number of tasks loaded from the benchmark (default: None, which loads all tasks).
use_verified_reward (bool) – Enable LLM-as-judge for reward verification (default: False).
episode_generation_workers (int) – Parallel workers for episode generation (default: 8).
optimizer_workers (int) – Parallel workers for gradient generation (default: 1).

OvercookedConfig

class langmarl.OvercookedConfig

Configuration for the Overcooked environment. Inherits all fields from BaseConfig.

Parameters:

layout (str) – Kitchen layout name (default: "cramped_room").
episode_horizon (int) – Maximum timesteps per episode (default: 400).
p0_agent (str) – Agent 0 type (default: "ProAgent").
p1_agent (str) – Agent 1 type (default: "ProAgent").

PistonballConfig

class langmarl.PistonballConfig

Configuration for the Pistonball environment. Inherits all fields from BaseConfig.

Parameters:

num_pistons (int) – Number of pistons (default: 20).
max_cycles (int) – Maximum cycles per episode (default: 125).
frame_size (int) – Observation frame size (default: 84).
action_mode (str) – "discrete" or "continuous" (default: "discrete").

LLMConfig

class langmarl.LLMConfig

Configuration for a single LLM. All providers use the OpenAI-compatible API format.

Parameters:

name (str) – Human-readable model identifier (e.g., "gpt-4o").
model_string (str) – Model string passed to the API (e.g., "gpt-4o").
base_url (str or None) – Custom API base URL. None means default OpenAI endpoint.
api_key (str or None) – API key. None means use environment variable.
api_key_env_var (str) – Environment variable name for the API key (default: "OPENAI_API_KEY").
is_multimodal (bool) – Whether the model supports multimodal input (default: False).
max_tokens (int) – Maximum tokens per response (default: 4096).
input_price_per_million (float) – Input token price per million (USD).
output_price_per_million (float) – Output token price per million (USD).
extra_params (dict) – Additional provider-specific parameters.

get_api_key()

Get API key from config or environment variable.

Returns: API key string.
Raises: ValueError – If no API key is found.
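
A minimal sketch of the lookup order described for get_api_key(), assuming an explicit api_key takes priority over the environment variable. `resolve_api_key` is a hypothetical free function written for illustration, and `DEMO_LLM_KEY` is a made-up variable name used only so the example does not touch a real key.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str],
                    api_key_env_var: str = "OPENAI_API_KEY") -> str:
    """Hypothetical stand-in for LLMConfig.get_api_key()."""
    if api_key:  # assumption: an explicit key in the config takes priority
        return api_key
    env_value = os.environ.get(api_key_env_var)
    if env_value:
        return env_value
    raise ValueError(
        f"No API key found; set {api_key_env_var} or pass api_key."
    )


os.environ["DEMO_LLM_KEY"] = "from-env"  # demo-only variable name
explicit = resolve_api_key("from-config", "DEMO_LLM_KEY")
fallback = resolve_api_key(None, "DEMO_LLM_KEY")
```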

classmethod from_preset(name)

Create an LLMConfig from a predefined model name.

Parameters: name (str) – Predefined model name (e.g., "gpt-4o-mini").
Returns: LLMConfig instance.

classmethod from_dict(data): Create an LLMConfig from a dictionary.

to_dict(): Convert to a dictionary for JSON serialization.

langmarl.get_llm_config(name_or_path)

Get LLM config by predefined name or load from a JSON file path.

Parameters: name_or_path (str) – Predefined model name or path to JSON config file.
Returns: LLMConfig instance.
Raises: ValueError – If the name is not recognized and the path does not exist.

langmarl.list_available_models()

List all available predefined models with their descriptions.

Returns: Dict mapping model name to description string.
Return type: dict[str, str]

Core Abstractions

Trajectory

class langmarl.Trajectory

A single episode trajectory (dataclass).

Parameters:

task (dict) – Task description dict (e.g., {"question": "...", "ground_truth": "..."}).
steps (list[dict]) – List of step dicts, each containing agent_id, observation, action, etc.
reward (float) – Episode reward.
metadata (dict) – Optional metadata dict (default: {}).
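
The four fields above map naturally onto a Python dataclass. The sketch below is a stand-in named `TrajectorySketch`, not the langmarl class itself; using `default_factory` for the empty-dict default is our assumption about how a `{}` default would be implemented safely.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class TrajectorySketch:
    task: Dict[str, Any]
    steps: List[Dict[str, Any]]
    reward: float
    # default_factory gives each instance its own dict instead of a shared one.
    metadata: Dict[str, Any] = field(default_factory=dict)


traj = TrajectorySketch(
    task={"question": "2 + 2?", "ground_truth": "4"},
    steps=[{"agent_id": "agent_0", "observation": "2 + 2?", "action": "4"}],
    reward=1.0,
)
```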

BaseEnvironment

class langmarl.BaseEnvironment

Abstract base class for environment adapters. Subclass this to add new environments.

abstract reset(task)

Reset the environment with the given task.

Parameters: task (dict) – Task description dict.
Returns: Initial observations dict.

abstract step(agent_id, action)

Execute an agent’s action.

Parameters:

agent_id – Agent identifier string.
action – Action string.

Returns:

Tuple of (observation, reward, done, info).

abstract collect_trajectory(policies, task)

Run a full episode with the given policies and return a trajectory.

Parameters:

policies (dict[str, str]) – Dict mapping agent name to policy text.
task (dict) – Task description dict.

Returns:

Episode trajectory.

Return type:

Trajectory
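
To show how the three abstract methods fit together, here is a toy subclass built on a local stand-in base class. Everything in it is an assumption made for illustration: `EnvSketch` mirrors only the method names listed above, the toy "agent" replies with its policy text verbatim instead of calling an LLM, and a plain dict stands in for the Trajectory return value.

```python
from abc import ABC, abstractmethod


class EnvSketch(ABC):
    """Stand-in with the three abstract methods listed above."""

    @abstractmethod
    def reset(self, task): ...

    @abstractmethod
    def step(self, agent_id, action): ...

    @abstractmethod
    def collect_trajectory(self, policies, task): ...


class EchoEnv(EnvSketch):
    """Toy single-turn environment: reward 1.0 for an exact-match answer."""

    def reset(self, task):
        self.task = task
        return {"agent_0": task["question"]}

    def step(self, agent_id, action):
        reward = 1.0 if action == self.task["ground_truth"] else 0.0
        # (observation, reward, done, info); the episode ends after one step.
        return "", reward, True, {}

    def collect_trajectory(self, policies, task):
        observations = self.reset(task)
        steps, reward = [], 0.0
        for agent_id, policy in policies.items():
            obs = observations.get(agent_id, "")
            action = policy  # placeholder: a real agent would query an LLM here
            _, reward, done, _ = self.step(agent_id, action)
            steps.append(
                {"agent_id": agent_id, "observation": obs, "action": action}
            )
            if done:
                break
        # A plain dict stands in for the Trajectory dataclass.
        return {"task": task, "steps": steps, "reward": reward}


result = EchoEnv().collect_trajectory(
    {"agent_0": "4"}, {"question": "2 + 2?", "ground_truth": "4"}
)
```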

BaseAgent

class langmarl.BaseAgent

Abstract base class for agents with language policies.

abstract act(observation, policy)

Given an observation and a policy, produce an action.

Parameters:

observation (str) – Text observation.
policy (str) – Language policy (system prompt).

Returns:

Action text.

Return type:

str

BaseCritic

class langmarl.BaseCritic

Abstract base class for trajectory evaluators.

abstract evaluate(trajectory, policies)

Evaluate a trajectory and return per-agent credits.

Parameters:

trajectory (Trajectory) – Episode trajectory.
policies (dict[str, str]) – Current agent policies.

Returns:

Evaluation dict with raw_response and per_agent credits.

Return type:

dict

BaseReward

class langmarl.BaseReward

Abstract base class for reward computation.

abstract compute(trajectory)

Compute reward for a trajectory.

Parameters: trajectory (Trajectory) – Episode trajectory.
Returns: Reward value.
Return type: float

BaseOptimizer

class langmarl.BaseOptimizer

Abstract base class for language gradient optimizers.

abstract generate_gradient(policy, evaluation, context)

Generate a language gradient (improvement instruction) for one agent.

Parameters:

policy – Current policy text.
evaluation – Evaluation feedback text.
context – Task context text.

Returns:

Gradient text.

Return type:

str

abstract apply_gradient(policy, gradient)

Apply a gradient to a policy, returning the updated policy.

Parameters:

policy – Current policy text.
gradient – Gradient text.

Returns:

Updated policy text.

Return type:

str

abstract aggregate_gradients(gradients)

Aggregate multiple gradients into a single gradient.

Parameters: gradients (list[str]) – List of gradient texts.
Returns: Aggregated gradient text.
Return type: str

Concrete Implementations

CentralizedCritic

class langmarl.CentralizedCritic(config, prompts_dir=None)

Centralized critic supporting central_global and central_credit paradigms. Evaluates multi-agent sequential collaboration trajectories using LLM-as-judge.

Parameters:

config (BaseConfig) – Configuration with paradigm, num_agents, and LLM fields.
prompts_dir (Path or None) – Optional custom path to evaluation prompt templates.

evaluate(trajectory, policies)

Evaluate a trajectory. For central_credit, returns per-agent causal credits. For central_global, returns the same evaluation for all agents.

Parameters:

trajectory (Trajectory) – Episode trajectory.
policies (dict[str, str]) – Current agent policies.

Returns:

Dict with keys raw_response, paradigm, and per_agent.

Return type:

dict

PolicyGradientOptimizer

class langmarl.PolicyGradientOptimizer(llm_config)

LLM-based policy gradient optimizer. Generates language gradients (concrete improvement instructions) and applies them to agent policies.

Parameters: llm_config (LLMConfig) – LLM configuration for the optimizer model.

generate_gradient(policy, evaluation, context, agent_name='agent')

Generate a per-agent improvement instruction based on evaluation feedback.

Parameters:

policy – Agent’s current policy text.
evaluation – Evaluation feedback for this agent.
context – Task context (truncated to 800 chars).
agent_name – Agent name for logging.

Returns:

Gradient text (2-4 sentences of specific improvement advice).

Return type:

str

generate_shared_gradient(evaluation, task_context)

Generate a shared gradient for all agents (used in central_global paradigm).

Parameters:

evaluation – Team-level evaluation feedback.
task_context – Task context (truncated to 800 chars).

Returns:

Shared gradient text.

Return type:

str

static apply_gradient(base_policy, gradient)

Apply a gradient to a base policy. Appends the gradient as a [CASE-SPECIFIC FEEDBACK] section. The base policy is never modified; the feedback section is replaced on every call.

Parameters:

base_policy – Original policy text.
gradient – Gradient text to append.

Returns:

Updated policy text.

Return type:

str
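
The replace-on-every-call behaviour of apply_gradient can be sketched as follows. This is an illustrative reimplementation, not the library's code: the exact whitespace around the [CASE-SPECIFIC FEEDBACK] header and the way an earlier feedback section is detected are assumptions.

```python
FEEDBACK_HEADER = "[CASE-SPECIFIC FEEDBACK]"


def apply_gradient(base_policy: str, gradient: str) -> str:
    """Sketch: keep the base policy, replace any earlier feedback section."""
    # Drop a previously appended feedback section, if one is present.
    base = base_policy.split(FEEDBACK_HEADER, 1)[0].rstrip()
    return f"{base}\n\n{FEEDBACK_HEADER}\n{gradient.strip()}"


first = apply_gradient("Answer concisely.", "Cite the source passage.")
second = apply_gradient(first, "Double-check arithmetic.")
```

Applying a second gradient to an already updated policy leaves the base text intact and keeps only the newest feedback, so feedback does not accumulate across iterations.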

static aggregate_gradients(gradients)

Aggregate multiple gradients (from multiple episodes) into one by joining with separator markers.

Parameters: gradients (list[str]) – List of gradient texts.
Returns: Aggregated gradient text.
Return type: str
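
A minimal sketch of aggregation by joining, assuming a numbered separator line before each gradient. The `--- Gradient N ---` marker format and the skipping of empty gradients are assumptions; this reference does not specify the separator langmarl actually uses.

```python
from typing import List


def aggregate_gradients(gradients: List[str]) -> str:
    """Sketch: join per-episode gradients under numbered separator markers."""
    blocks = []
    for g in gradients:
        text = g.strip()
        if text:  # skip empty gradients
            # Number by kept blocks so the markers stay consecutive.
            blocks.append(f"--- Gradient {len(blocks) + 1} ---\n{text}")
    return "\n\n".join(blocks)


combined = aggregate_gradients(["Be more specific.", "  ", "Verify units."])
```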

static parse_credit_response(response, agent_names)

Parse per-agent evaluations from a credit-assignment LLM response. Handles JSON format (language tasks), bracket markers (Overcooked, Pistonball), and falls back to returning the full response for all agents.

Parameters:

response (str) – Raw LLM evaluation response.
agent_names (list[str]) – List of agent name strings.

Returns:

Dict mapping agent name to evaluation text.

Return type:

dict[str, str]
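
The three-stage parsing strategy (JSON, then bracket markers, then full-response fallback) might look roughly like the sketch below. It is a hedged illustration only: the JSON shape (one key per agent name) and the `[agent_name]` marker syntax are assumptions, and the real method may accept other formats.

```python
import json
import re
from typing import Dict, List


def parse_credit_response(response: str, agent_names: List[str]) -> Dict[str, str]:
    # 1) JSON object keyed by agent name (assumed shape).
    try:
        data = json.loads(response)
        if isinstance(data, dict) and all(n in data for n in agent_names):
            return {n: str(data[n]) for n in agent_names}
    except json.JSONDecodeError:
        pass

    # 2) Bracket markers such as "[agent_0] ..." (marker syntax is an assumption).
    pattern = re.compile(
        r"\[(" + "|".join(map(re.escape, agent_names)) + r")\]"
    )
    parts = pattern.split(response)
    # With one capture group, split yields [prefix, name, text, name, text, ...].
    found = {
        parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)
    }
    if all(n in found for n in agent_names):
        return {n: found[n] for n in agent_names}

    # 3) Fallback: every agent receives the full response.
    return {n: response for n in agent_names}


names = ["agent_0", "agent_1"]
from_json = parse_credit_response('{"agent_0": "good", "agent_1": "slow"}', names)
from_markers = parse_credit_response(
    "[agent_0] Good start. [agent_1] Missed the check.", names
)
from_fallback = parse_credit_response("Team did fine.", names)
```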

TrajectoryFormatter

class langmarl.TrajectoryFormatter

Formats episode trajectories for LLM evaluation.

static format_trajectory(episode)

Format an episode into a text string for global evaluation.

Parameters: episode – Episode dict with task, transitions, reward.
Returns: Formatted trajectory string.
Return type: str

static format_for_credit_assignment(episode)

Format an episode for per-agent credit assignment, with structured per-agent contribution sections.

Parameters: episode – Episode dict.
Returns: Formatted trajectory string.
Return type: str

static format_trajectory_minimal(episode)

Format a concise episode summary.

Parameters: episode – Episode dict.
Returns: Minimal trajectory string.
Return type: str

Training

MonteCarloTrainer

class langmarl.MonteCarloTrainer(config, env, critic, optimizer, reward_fn=None, store=None, callbacks=None)

Generic Monte Carlo trainer for any LangMARL environment. Implements a five-phase training iteration: load policies, generate trajectories, evaluate and generate gradients, aggregate and apply gradients, save checkpoint.

Parameters:

config (BaseConfig) – Training configuration.
env (BaseEnvironment) – Environment instance.
critic (BaseCritic) – Critic instance.
optimizer (BaseOptimizer) – Optimizer instance.
reward_fn (BaseReward or None) – Optional reward function for verified rewards.
store (BaseStore or None) – Storage backend (default: LocalStore).
callbacks (list[Callback] or None) – List of training callbacks.

train(num_iterations=None)

Main training loop. Automatically resumes from the latest checkpoint.

Parameters: num_iterations (int or None) – Number of iterations to train. Defaults to config.num_iterations.
Returns: Training metrics from the store.

train_one_iteration(iteration)

Run a single training iteration (five phases).

Parameters: iteration (int) – Iteration number.
Returns: Statistics dict with keys: paradigm, num_episodes, avg_reward, min_reward, max_reward, rewards, input_tokens, output_tokens, total_tokens, cost_usd.
Return type: dict
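
The reward-related keys of the statistics dict can be derived from the per-episode rewards alone. The sketch below shows one plausible way to compute them; `summarize_rewards` is a hypothetical helper, and the token and cost keys are left out because they depend on the LLM client.

```python
from statistics import mean
from typing import Dict, List


def summarize_rewards(rewards: List[float], paradigm: str) -> Dict[str, object]:
    """Hypothetical helper: build the reward portion of the stats dict."""
    return {
        "paradigm": paradigm,
        "num_episodes": len(rewards),
        "avg_reward": mean(rewards),
        "min_reward": min(rewards),
        "max_reward": max(rewards),
        "rewards": list(rewards),
    }


stats = summarize_rewards([0.0, 1.0, 0.5, 0.5], "central_credit")
```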

Callbacks

class langmarl.Callback

Abstract base class for training callbacks. Override any of the following methods:

on_iteration_start(iteration, trainer)

on_iteration_end(iteration, stats, trainer)

on_episode_complete(trajectory, trainer)

on_policy_update(agent_id, old_policy, new_policy)

class langmarl.LoggingCallback: Callback that logs iteration start/end events via the trainer’s RunLogger.

class langmarl.CheckpointCallback: Callback for checkpoint management (saving is handled by the trainer by default).

class langmarl.EarlyStoppingCallback(patience=3, min_delta=0.01)

Stop training when the average reward stops improving.

Parameters:

patience (int) – Number of iterations to wait without improvement before stopping.
min_delta (float) – Minimum improvement to qualify as progress.
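
The interaction between patience and min_delta can be sketched with a small stand-in class. This is not the langmarl implementation: the `should_stop` flag and the choice to compare against the best average seen so far are assumptions about how such a callback is commonly written.

```python
class EarlyStoppingSketch:
    """Stand-in showing how patience and min_delta interact."""

    def __init__(self, patience: int = 3, min_delta: float = 0.01):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.stale = 0          # iterations since the last real improvement
        self.should_stop = False

    def on_iteration_end(self, iteration, stats, trainer=None):
        avg = stats["avg_reward"]
        if avg > self.best + self.min_delta:
            self.best = avg
            self.stale = 0
        else:
            self.stale += 1
        if self.stale >= self.patience:
            self.should_stop = True


stopper = EarlyStoppingSketch(patience=2, min_delta=0.01)
history = [0.50, 0.60, 0.605, 0.60, 0.70]
stopped_at = None
for i, avg in enumerate(history):
    stopper.on_iteration_end(i, {"avg_reward": avg})
    if stopper.should_stop:
        stopped_at = i
        break
```

In this run the gain from 0.60 to 0.605 is smaller than min_delta, so it counts as no progress and training stops before the final 0.70 is reached. A larger patience trades extra iterations for robustness against such plateaus.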

LLM Client

LLMClient

class langmarl.LLMClient(llm_config)

Unified LLM client wrapping OpenAI-compatible APIs.

Parameters: llm_config (LLMConfig) – LLM configuration.

chat(system_prompt, user_input, max_tokens=None)

Send a chat completion request and return the response text.

Parameters:

system_prompt (str) – System message.
user_input (str) – User message.
max_tokens (int or None) – Maximum tokens (default: from config).

Returns:

Response text.

Return type:

str

chat_with_usage(system_prompt, user_input, max_tokens=None)

Chat and return both response text and token usage.

Returns: Tuple of (response_text, {"input": N, "output": N}).
Return type: tuple[str, dict[str, int]]

raw_client: Access the underlying OpenAI client instance.

TokenTracker

class langmarl.TokenTracker(model='gpt-4o-mini', input_price=None, output_price=None)

Tracks token usage and estimates costs for LLM API calls. Has built-in pricing for common models.

Parameters:

model (str) – Model name for pricing lookup.
input_price (float or None) – Custom input price per million tokens (overrides lookup).
output_price (float or None) – Custom output price per million tokens (overrides lookup).

add_usage(input_tokens, output_tokens)

Add token usage counts.

Parameters:

input_tokens – Number of input tokens.
output_tokens – Number of output tokens.

get_stats()

Get usage statistics and cost estimate.

Returns: Dict with model, input_tokens, output_tokens, total_tokens, input_cost_usd, output_cost_usd, cost_usd.
Return type: dict

estimate_cost(input_tokens, output_tokens)

Estimate cost for given token counts without adding to the tracker.

Returns: Estimated cost in USD.
Return type: float
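
Since prices are quoted per million tokens, the cost estimate reduces to simple arithmetic. The free function below is a sketch of that formula rather than the TokenTracker method itself, and the prices in the example are placeholders, not real model pricing.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price: float, output_price: float) -> float:
    """Cost in USD, with both prices given per million tokens."""
    input_cost = input_tokens / 1_000_000 * input_price
    output_cost = output_tokens / 1_000_000 * output_price
    return input_cost + output_cost


# Placeholder prices chosen for easy arithmetic, not real pricing.
cost = estimate_cost(200_000, 50_000, input_price=0.15, output_price=0.60)
```

Here 0.2 million input tokens and 0.05 million output tokens each contribute about 0.03 USD.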

reset(): Reset all token counters to zero.

get_summary_string()

Get a human-readable summary of token usage and costs.

Returns: Formatted summary string.
Return type: str

Storage

LocalStore

class langmarl.LocalStore(base_dir)

Filesystem-backed storage backend. Manages training run data including trajectories, checkpoints, evaluations, gradients, metrics, and logs.

Parameters: base_dir (str) – Base directory for all storage.

PolicyCheckpoint

class langmarl.PolicyCheckpoint(store, run_id, num_agents)

Manages versioned policy snapshots. Stores one text file per agent per iteration.

Parameters:

store (BaseStore) – Storage backend.
run_id (str) – Training run identifier.
num_agents (int) – Number of agents.

get_policies(iteration=None)

Load policies from the latest (or a specific) iteration. Generates default policies if no checkpoint exists.

Parameters: iteration (int or None) – Specific iteration to load. None loads the latest.
Returns: Dict mapping agent name to policy text.
Return type: dict[str, str]

save_policies(iteration, policies, stats=None)

Save policies as a new iteration checkpoint.

Parameters:

iteration – Iteration number.
policies – Dict mapping agent name to policy text.
stats – Optional training statistics.

diff_policies(iter_a, iter_b)

Compare policies across two iterations.

Parameters:

iter_a – First iteration number.
iter_b – Second iteration number.

Returns:

Dict mapping agent name to diff text.
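
One way to produce per-agent diff text is the standard library's difflib. The sketch below takes two policy dicts directly instead of iteration numbers; the unified-diff format is an assumption, since this reference does not say which diff format diff_policies returns.

```python
import difflib
from typing import Dict


def diff_policy_dicts(policies_a: Dict[str, str], policies_b: Dict[str, str],
                      label_a: str = "iter_a", label_b: str = "iter_b") -> Dict[str, str]:
    """Sketch: unified diff per agent; an empty string means no change."""
    diffs = {}
    for agent in sorted(set(policies_a) | set(policies_b)):
        old = policies_a.get(agent, "").splitlines()
        new = policies_b.get(agent, "").splitlines()
        diff_lines = difflib.unified_diff(
            old, new, fromfile=label_a, tofile=label_b, lineterm=""
        )
        diffs[agent] = "\n".join(diff_lines)
    return diffs


result = diff_policy_dicts(
    {"agent_0": "Be brief.", "agent_1": "Check facts."},
    {"agent_0": "Be brief.\nCite sources.", "agent_1": "Check facts."},
)
```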

TrajectoryStore

class langmarl.TrajectoryStore(store, run_id)

Manages episode trajectory persistence.

save(iteration, episode_id, trajectory): Save a trajectory for a given iteration and episode.

load(iteration, limit=None)

Load trajectories for a given iteration.

Parameters:

iteration – Iteration number.
limit – Maximum number of trajectories to load.

Returns:

List of Trajectory objects.

Return type:

list[Trajectory]

count(iteration)

Count stored trajectories for a given iteration.

Returns: Number of stored trajectories.
Return type: int

RunLogger

class langmarl.RunLogger(store, run_id)

Structured training logger. Writes to both file and console.

info(message)

debug(message)

warning(message)

error(message)

iteration_start(iteration, policies)

iteration_end(iteration, stats)

episode_saved(iteration, episode_id, reward)

evaluation_done(iteration, episode_id, paradigm, response)

gradient_saved(iteration, agent_name, num_gradients)