API Reference

Convenience Functions

langmarl.train(config_path, **overrides)

Train in a single call: loads the config, creates the components, and runs training.

Parameters:
  • config_path (str) – Path to a JSON config file.

  • overrides – Keyword arguments (key=value) applied as overrides to the loaded config.

Returns:

Training metrics.

langmarl.train("configs/language_task/qa_central_credit.json")
langmarl.train("configs/qa.json", num_iterations=10, paradigm="central_global")

langmarl.load_config(path, overrides=None)

Load a configuration from a JSON file.

Parameters:
  • path (str) – Path to JSON config file.

  • overrides (dict or None) – Optional dictionary of overrides.

Returns:

BaseConfig or subclass instance (LanguageTaskConfig, etc.).
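A minimal sketch of how overrides might be layered on top of a loaded JSON file. It does not call langmarl: load_config_dict is a hypothetical helper, and the assumption that overrides replace top-level keys is not confirmed by this reference.

```python
import json
import os
import tempfile


def load_config_dict(path, overrides=None):
    """Hypothetical stand-in: read a JSON file and apply top-level overrides."""
    with open(path, "r", encoding="utf-8") as fh:
        data = json.load(fh)
    # Assumption: overrides replace top-level keys; the real merge rule may differ.
    data.update(overrides or {})
    return data


# Write a throwaway config so the sketch runs without any project files.
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = os.path.join(tmp, "qa.json")
    with open(cfg_path, "w", encoding="utf-8") as fh:
        json.dump({"exp_name": "demo", "num_iterations": 5}, fh)
    merged = load_config_dict(cfg_path, overrides={"num_iterations": 10})
```

Keys absent from overrides keep the values stored in the file.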

langmarl.make_env(name, config)

Create an environment instance by registered name.

Parameters:
  • name (str) – Environment name (e.g., "language", "overcooked", "pistonball").

  • config (BaseConfig) – Configuration object.

Returns:

Environment instance.

Return type:

BaseEnvironment

Raises:

ValueError – If the environment name is not registered.

langmarl.register_env(name)

Decorator to register a custom environment class.

Parameters:

name (str) – Name to register the environment under.

@langmarl.register_env("my_env")
class MyEnv(langmarl.BaseEnvironment):
    ...

langmarl.list_envs()

List all registered environment names.

Returns:

List of environment name strings.

Return type:

list[str]
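The three registry functions above follow a common decorator-registry pattern. The sketch below is a self-contained stand-in, not langmarl's code: the _ENV_REGISTRY dict, the sorted return order, and the error message are all assumptions made for illustration.

```python
_ENV_REGISTRY = {}  # hypothetical module-level registry: name -> class


def register_env(name):
    """Return a class decorator that stores the class under `name`."""
    def decorator(cls):
        _ENV_REGISTRY[name] = cls
        return cls
    return decorator


def make_env(name, config):
    """Instantiate a registered class, or raise ValueError for unknown names."""
    if name not in _ENV_REGISTRY:
        raise ValueError(f"Unknown environment: {name!r}")
    return _ENV_REGISTRY[name](config)


def list_envs():
    """Return registered names (sorted here; the real order is not documented)."""
    return sorted(_ENV_REGISTRY)


@register_env("my_env")
class MyEnv:
    def __init__(self, config):
        self.config = config


env = make_env("my_env", config={"num_agents": 2})
```

Because the decorator returns the class unchanged, MyEnv remains directly importable and instantiable as well as available by name.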

Configuration

BaseConfig

class langmarl.BaseConfig

Shared configuration fields across all environments.

Parameters:
  • exp_name (str) – Experiment name (default: "experiment").

  • paradigm (str) – Training paradigm. Either "central_global" or "central_credit" (default: "central_credit").

  • num_iterations (int) – Number of training iterations (default: 5).

  • trajectories_per_iteration (int) – Episodes per iteration (default: 10).

  • mini_batch_size (int or None) – If set, subsample this many trajectories for gradient computation (default: None, which uses all trajectories).

  • start_iteration (int) – Iteration to start from (default: 0).

  • llm (LLMConfig or None) – Default LLM config, used as fallback for actor/critic/optimizer (default: None).

  • actor_llm (LLMConfig or None) – LLM config for actors. Falls back to llm if not set.

  • critic_llm (LLMConfig or None) – LLM config for the critic. Falls back to llm if not set.

  • optimizer_llm (LLMConfig or None) – LLM config for the optimizer. Falls back to llm if not set.

  • num_agents (int) – Number of agents (default: 2).

  • experiment_dir (str) – Directory to save experiment data (default: "./experiments").

  • checkpoint_dir (str) – Directory to save policy checkpoints (default: "./ckpt_policy").

  • max_workers (int) – Maximum parallel workers for trajectory generation (default: 5).

  • log_level (str) – Logging level (default: "INFO").

get_actor_llm()

Get LLM config for actors. Returns actor_llm if set, otherwise llm.

Return type:

LLMConfig

get_critic_llm()

Get LLM config for the critic. Returns critic_llm if set, otherwise llm.

Return type:

LLMConfig

get_optimizer_llm()

Get LLM config for the optimizer. Returns optimizer_llm if set, otherwise llm.

Return type:

LLMConfig
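The fallback rule shared by the three getters can be sketched as follows. ConfigSketch and LLMConfigSketch are hypothetical stand-ins that keep only the fields needed to show the rule; they are not the library's classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LLMConfigSketch:
    name: str


@dataclass
class ConfigSketch:
    llm: Optional[LLMConfigSketch] = None
    actor_llm: Optional[LLMConfigSketch] = None
    critic_llm: Optional[LLMConfigSketch] = None
    optimizer_llm: Optional[LLMConfigSketch] = None

    def get_actor_llm(self):
        # Role-specific config wins; otherwise fall back to the shared default.
        return self.actor_llm if self.actor_llm is not None else self.llm

    def get_critic_llm(self):
        return self.critic_llm if self.critic_llm is not None else self.llm

    def get_optimizer_llm(self):
        return self.optimizer_llm if self.optimizer_llm is not None else self.llm


cfg = ConfigSketch(
    llm=LLMConfigSketch("gpt-4o-mini"),
    critic_llm=LLMConfigSketch("gpt-4o"),
)
```

Here only the critic is overridden, so the actor and optimizer getters both return the shared llm entry.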

classmethod from_json(path, overrides=None)

Load config from a JSON file with optional overrides. LLM fields can be specified as predefined model name strings or full config dicts.

Parameters:
  • path (str) – Path to JSON config file.

  • overrides (dict or None) – Optional dictionary of overrides.

Returns:

Config instance.

to_json(path)

Save config to a JSON file.

Parameters:

path (str) – Output file path.

LanguageTaskConfig

class langmarl.LanguageTaskConfig

Configuration for language task environments. Inherits all fields from BaseConfig.

Parameters:
  • task_type (str) – Task type. One of "qa", "math", "writing", "coding" (default: "qa").

  • benchmark_path (str) – Path to the benchmark dataset directory.

  • data_limit (int or None) – Limit the number of tasks loaded from the benchmark (default: None, which loads all tasks).

  • use_verified_reward (bool) – Enable LLM-as-judge for reward verification (default: False).

  • episode_generation_workers (int) – Parallel workers for episode generation (default: 8).

  • optimizer_workers (int) – Parallel workers for gradient generation (default: 1).

OvercookedConfig

class langmarl.OvercookedConfig

Configuration for the Overcooked environment. Inherits all fields from BaseConfig.

Parameters:
  • layout (str) – Kitchen layout name (default: "cramped_room").

  • episode_horizon (int) – Maximum timesteps per episode (default: 400).

  • p0_agent (str) – Agent 0 type (default: "ProAgent").

  • p1_agent (str) – Agent 1 type (default: "ProAgent").

PistonballConfig

class langmarl.PistonballConfig

Configuration for the Pistonball environment. Inherits all fields from BaseConfig.

Parameters:
  • num_pistons (int) – Number of pistons (default: 20).

  • max_cycles (int) – Maximum cycles per episode (default: 125).

  • frame_size (int) – Observation frame size (default: 84).

  • action_mode (str) – "discrete" or "continuous" (default: "discrete").

LLMConfig

class langmarl.LLMConfig

Configuration for a single LLM. All providers use the OpenAI-compatible API format.

Parameters:
  • name (str) – Human-readable model identifier (e.g., "gpt-4o").

  • model_string (str) – Model string passed to the API (e.g., "gpt-4o").

  • base_url (str or None) – Custom API base URL. None means the default OpenAI endpoint.

  • api_key (str or None) – API key. None means the key is read from the environment variable named by api_key_env_var.

  • api_key_env_var (str) – Environment variable name for the API key (default: "OPENAI_API_KEY").

  • is_multimodal (bool) – Whether the model supports multimodal input (default: False).

  • max_tokens (int) – Maximum tokens per response (default: 4096).

  • input_price_per_million (float) – Input token price per million (USD).

  • output_price_per_million (float) – Output token price per million (USD).

  • extra_params (dict) – Additional provider-specific parameters.

get_api_key()

Get API key from config or environment variable.

Returns:

API key string.

Raises:

ValueError – If no API key is found.
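A sketch of the lookup order described above, assuming an explicit api_key takes precedence over the environment variable. resolve_api_key and its environ parameter are hypothetical; the parameter exists only so the example runs without touching real credentials.

```python
import os


def resolve_api_key(api_key=None, api_key_env_var="OPENAI_API_KEY", environ=None):
    """Hypothetical stand-in for the documented lookup: config first, then env."""
    environ = os.environ if environ is None else environ
    if api_key:
        return api_key
    value = environ.get(api_key_env_var)
    if not value:
        # Mirrors the documented ValueError when no key can be found.
        raise ValueError(f"No API key found; set {api_key_env_var} or pass api_key.")
    return value


explicit = resolve_api_key(api_key="sk-explicit", environ={})
from_env = resolve_api_key(environ={"OPENAI_API_KEY": "sk-from-env"})
```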

classmethod from_preset(name)

Create an LLMConfig from a predefined model name.

Parameters:

name (str) – Predefined model name (e.g., "gpt-4o-mini").

Returns:

LLMConfig instance.

classmethod from_dict(data)

Create an LLMConfig from a dictionary.

to_dict()

Convert to a dictionary for JSON serialization.

langmarl.get_llm_config(name_or_path)

Get LLM config by predefined name or load from a JSON file path.

Parameters:

name_or_path (str) – Predefined model name or path to JSON config file.

Returns:

LLMConfig instance.

Raises:

ValueError – If the name is not recognized and the path does not exist.

langmarl.list_available_models()

List all available predefined models with their descriptions.

Returns:

Dict mapping model name to description string.

Return type:

dict[str, str]

Core Abstractions

Trajectory

class langmarl.Trajectory

A single episode trajectory (dataclass).

Parameters:
  • task (dict) – Task description dict (e.g., {"question": "...", "ground_truth": "..."}).

  • steps (list[dict]) – List of step dicts, each containing agent_id, observation, action, etc.

  • reward (float) – Episode reward.

  • metadata (dict) – Optional metadata dict (default: {}).
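For orientation, a dataclass with the four documented fields could look like the sketch below. TrajectorySketch is a stand-in, not the library class; using default_factory for metadata is an assumption about how the {} default is realized.

```python
from dataclasses import dataclass, field


@dataclass
class TrajectorySketch:
    task: dict
    steps: list
    reward: float
    # default_factory gives each instance its own dict instead of a shared one.
    metadata: dict = field(default_factory=dict)


traj = TrajectorySketch(
    task={"question": "2 + 2?", "ground_truth": "4"},
    steps=[{"agent_id": "agent_0", "observation": "2 + 2?", "action": "4"}],
    reward=1.0,
)
other = TrajectorySketch(task={}, steps=[], reward=0.0)
```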

BaseEnvironment

class langmarl.BaseEnvironment

Abstract base class for environment adapters. Subclass this to add new environments.

abstract reset(task)

Reset the environment with the given task.

Parameters:

task (dict) – Task description dict.

Returns:

Initial observations dict.

abstract step(agent_id, action)

Execute an agent’s action.

Parameters:
  • agent_id – Agent identifier string.

  • action – Action string.

Returns:

Tuple of (observation, reward, done, info).

abstract collect_trajectory(policies, task)

Run a full episode with the given policies and return a trajectory.

Parameters:
  • policies (dict[str, str]) – Dict mapping agent name to policy text.

  • task (dict) – Task description dict.

Returns:

Episode trajectory.

Return type:

Trajectory
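To show how the three abstract methods fit together, here is a toy environment written against a stand-in base class. Everything in it is hypothetical: BaseEnvironmentSketch only mirrors the method names above, act_fn replaces the LLM call, the reward rule is invented, and the episode is returned as a plain dict rather than a Trajectory.

```python
from abc import ABC, abstractmethod


class BaseEnvironmentSketch(ABC):
    """Stand-in mirroring the three abstract methods documented above."""

    @abstractmethod
    def reset(self, task): ...

    @abstractmethod
    def step(self, agent_id, action): ...

    @abstractmethod
    def collect_trajectory(self, policies, task): ...


class EchoEnv(BaseEnvironmentSketch):
    """Toy environment: the episode ends once every agent has acted once."""

    def __init__(self, agent_ids, act_fn):
        self.agent_ids = list(agent_ids)
        # act_fn(policy, observation) -> action; stand-in for an LLM call.
        self.act_fn = act_fn
        self.task = None
        self.pending = []

    def reset(self, task):
        self.task = task
        self.pending = list(self.agent_ids)
        return {agent_id: task["question"] for agent_id in self.agent_ids}

    def step(self, agent_id, action):
        self.pending.remove(agent_id)
        done = not self.pending
        # Invented reward rule: 1.0 when the action matches the ground truth.
        reward = 1.0 if action == self.task["ground_truth"] else 0.0
        return self.task["question"], reward, done, {}

    def collect_trajectory(self, policies, task):
        observations = self.reset(task)
        steps, reward = [], 0.0
        for agent_id, policy in policies.items():
            observation = observations[agent_id]
            action = self.act_fn(policy, observation)
            _, reward, done, _ = self.step(agent_id, action)
            steps.append({"agent_id": agent_id, "observation": observation, "action": action})
            if done:
                break
        # The episode reward is the reward of the last step taken.
        return {"task": task, "steps": steps, "reward": reward}


env = EchoEnv(["agent_0", "agent_1"], act_fn=lambda policy, observation: "4")
result = env.collect_trajectory(
    {"agent_0": "Answer briefly.", "agent_1": "Verify the answer."},
    {"question": "2 + 2?", "ground_truth": "4"},
)
```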

BaseAgent

class langmarl.BaseAgent

Abstract base class for agents with language policies.

abstract act(observation, policy)

Given an observation and a policy, produce an action.

Parameters:
  • observation (str) – Text observation.

  • policy (str) – Language policy (system prompt).

Returns:

Action text.

Return type:

str

BaseCritic

class langmarl.BaseCritic

Abstract base class for trajectory evaluators.

abstract evaluate(trajectory, policies)

Evaluate a trajectory and return per-agent credits.

Parameters:
  • trajectory (Trajectory) – Episode trajectory.

  • policies (dict[str, str]) – Current agent policies.

Returns:

Evaluation dict with raw_response and per_agent credits.

Return type:

dict

BaseReward

class langmarl.BaseReward

Abstract base class for reward computation.

abstract compute(trajectory)

Compute reward for a trajectory.

Parameters:

trajectory (Trajectory) – Episode trajectory.

Returns:

Reward value.

Return type:

float

BaseOptimizer

class langmarl.BaseOptimizer

Abstract base class for language gradient optimizers.

abstract generate_gradient(policy, evaluation, context)

Generate a language gradient (improvement instruction) for one agent.

Parameters:
  • policy – Current policy text.

  • evaluation – Evaluation feedback text.

  • context – Task context text.

Returns:

Gradient text.

Return type:

str

abstract apply_gradient(policy, gradient)

Apply a gradient to a policy, returning the updated policy.

Parameters:
  • policy – Current policy text.

  • gradient – Gradient text.

Returns:

Updated policy text.

Return type:

str

abstract aggregate_gradients(gradients)

Aggregate multiple gradients into a single gradient.

Parameters:

gradients (list[str]) – List of gradient texts.

Returns:

Aggregated gradient text.

Return type:

str

Concrete Implementations

CentralizedCritic

class langmarl.CentralizedCritic(config, prompts_dir=None)

Centralized critic supporting the central_global and central_credit paradigms. Evaluates multi-agent sequential collaboration trajectories using an LLM as judge.

Parameters:
  • config (BaseConfig) – Configuration with paradigm, num_agents, and LLM fields.

  • prompts_dir (Path or None) – Optional custom path to evaluation prompt templates.

evaluate(trajectory, policies)

Evaluate a trajectory. For central_credit, returns per-agent causal credits. For central_global, returns the same evaluation for all agents.

Parameters:
  • trajectory (Trajectory) – Episode trajectory.

  • policies (dict[str, str]) – Current agent policies.

Returns:

Dict with keys raw_response, paradigm, and per_agent.

Return type:

dict

PolicyGradientOptimizer

class langmarl.PolicyGradientOptimizer(llm_config)

LLM-based policy gradient optimizer. Generates language gradients (concrete improvement instructions) and applies them to agent policies.

Parameters:

llm_config (LLMConfig) – LLM configuration for the optimizer model.

generate_gradient(policy, evaluation, context, agent_name='agent')

Generate a per-agent improvement instruction based on evaluation feedback.

Parameters:
  • policy – Agent’s current policy text.

  • evaluation – Evaluation feedback for this agent.

  • context – Task context (truncated to 800 chars).

  • agent_name – Agent name for logging.

Returns:

Gradient text (2-4 sentences of specific improvement advice).

Return type:

str

generate_shared_gradient(evaluation, task_context)

Generate a shared gradient for all agents (used in central_global paradigm).

Parameters:
  • evaluation – Team-level evaluation feedback.

  • task_context – Task context (truncated to 800 chars).

Returns:

Shared gradient text.

Return type:

str

static apply_gradient(base_policy, gradient)

Apply a gradient to a base policy. Appends the gradient as a [CASE-SPECIFIC FEEDBACK] section. The base policy is never modified; the feedback section is replaced on every call.

Parameters:
  • base_policy – Original policy text.

  • gradient – Gradient text to append.

Returns:

Updated policy text.

Return type:

str
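The replace-on-every-call behaviour can be sketched as below. The marker string comes from the description above; the exact layout (blank line before the marker, gradient on the following line) is an assumption, and apply_gradient_sketch is not the library function.

```python
FEEDBACK_MARKER = "[CASE-SPECIFIC FEEDBACK]"


def apply_gradient_sketch(base_policy, gradient):
    """Append a feedback section, replacing any section from an earlier call."""
    # Keep only the text before an existing marker so sections never stack.
    base = base_policy.split(FEEDBACK_MARKER, 1)[0].rstrip()
    return f"{base}\n\n{FEEDBACK_MARKER}\n{gradient.strip()}"


first = apply_gradient_sketch("Answer concisely.", "Cite the passage you used.")
second = apply_gradient_sketch(first, "Double-check arithmetic.")
```

Applying a second gradient to the output of the first call leaves the base text intact and keeps a single feedback section containing only the newest gradient.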

static aggregate_gradients(gradients)

Aggregate multiple gradients (from multiple episodes) into one by joining with separator markers.

Parameters:

gradients (list[str]) – List of gradient texts.

Returns:

Aggregated gradient text.

Return type:

str
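A sketch of joining gradients with separator markers. The "--- Gradient N ---" separator and the skipping of blank entries are assumptions; this reference does not specify the marker format.

```python
def aggregate_gradients_sketch(gradients):
    """Join non-empty gradients, each under a numbered separator line."""
    kept = [g.strip() for g in gradients if g and g.strip()]
    return "\n".join(
        f"--- Gradient {i} ---\n{g}" for i, g in enumerate(kept, start=1)
    )


aggregated = aggregate_gradients_sketch(["Be concise.", "  ", "Cite sources."])
```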

static parse_credit_response(response, agent_names)

Parse per-agent evaluations from a credit-assignment LLM response. Handles JSON format (language tasks) and bracket markers (Overcooked, Pistonball); if neither can be parsed, falls back to returning the full response for every agent.

Parameters:
  • response (str) – Raw LLM evaluation response.

  • agent_names (list[str]) – List of agent name strings.

Returns:

Dict mapping agent name to evaluation text.

Return type:

dict[str, str]
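The three-stage parse can be sketched as follows. The concrete shapes are assumptions: a JSON object keyed by agent name, and "[agent_name] text" sections that each start on their own line. parse_credit_sketch is illustrative and may not match the formats the library accepts.

```python
import json
import re


def parse_credit_sketch(response, agent_names):
    """Try JSON, then bracket markers, then fall back to the full response."""
    # 1) JSON object with one entry per agent.
    try:
        data = json.loads(response)
        if isinstance(data, dict) and all(name in data for name in agent_names):
            return {name: str(data[name]) for name in agent_names}
    except json.JSONDecodeError:
        pass
    # 2) Bracket markers: "[agent_0] ..." up to the next "[...]" line or the end.
    parsed = {}
    for name in agent_names:
        pattern = rf"\[{re.escape(name)}\]\s*(.*?)(?=\n\[|\Z)"
        match = re.search(pattern, response, re.DOTALL)
        if match:
            parsed[name] = match.group(1).strip()
    if len(parsed) == len(agent_names):
        return parsed
    # 3) Fallback: every agent receives the whole response.
    return {name: response for name in agent_names}


names = ["agent_0", "agent_1"]
from_json = parse_credit_sketch('{"agent_0": "ok", "agent_1": "slow"}', names)
from_brackets = parse_credit_sketch(
    "[agent_0] Good retrieval.\n[agent_1] Missed the unit.", names
)
fallback = parse_credit_sketch("Both agents did fine.", names)
```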

TrajectoryFormatter

class langmarl.TrajectoryFormatter

Formats episode trajectories for LLM evaluation.

static format_trajectory(episode)

Format an episode into a text string for global evaluation.

Parameters:

episode – Episode dict with task, transitions, reward.

Returns:

Formatted trajectory string.

Return type:

str

static format_for_credit_assignment(episode)

Format an episode for per-agent credit assignment, with structured per-agent contribution sections.

Parameters:

episode – Episode dict.

Returns:

Formatted trajectory string.

Return type:

str

static format_trajectory_minimal(episode)

Format a concise episode summary.

Parameters:

episode – Episode dict.

Returns:

Minimal trajectory string.

Return type:

str

Training

MonteCarloTrainer

class langmarl.MonteCarloTrainer(config, env, critic, optimizer, reward_fn=None, store=None, callbacks=None)

Generic Monte Carlo trainer for any LangMARL environment. Implements a five-phase training iteration: load policies, generate trajectories, evaluate and generate gradients, aggregate and apply gradients, save checkpoint.

Parameters:
  • config (BaseConfig) – Training configuration.

  • env (BaseEnvironment) – Environment instance.

  • critic (BaseCritic) – Critic instance.

  • optimizer (BaseOptimizer) – Optimizer instance.

  • reward_fn (BaseReward or None) – Optional reward function for verified rewards.

  • store (BaseStore or None) – Storage backend (default: LocalStore).

  • callbacks (list[Callback] or None) – List of training callbacks.

train(num_iterations=None)

Main training loop. Automatically resumes from the latest checkpoint.

Parameters:

num_iterations (int or None) – Number of iterations to train. Defaults to config.num_iterations.

Returns:

Training metrics from the store.

train_one_iteration(iteration)

Run a single training iteration (five phases).

Parameters:

iteration (int) – Iteration number.

Returns:

Statistics dict with keys: paradigm, num_episodes, avg_reward, min_reward, max_reward, rewards, input_tokens, output_tokens, total_tokens, cost_usd.

Return type:

dict
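The five phases can be pictured with the sketch below, in which every component is a caller-supplied stub. It is not MonteCarloTrainer: the function name, the callable signatures, and the dict-shaped trajectories are hypothetical, checkpoint loading and saving are omitted, and only the reward-related statistics are computed.

```python
def train_one_iteration_sketch(policies, tasks, collect, evaluate,
                               make_gradient, aggregate, apply_gradient):
    """Illustrative five-phase iteration; every callable is a stub."""
    # Phase 1: load policies (here they are simply passed in).
    # Phase 2: generate one trajectory per task.
    trajectories = [collect(policies, task) for task in tasks]
    # Phase 3: evaluate each trajectory and turn credits into gradients.
    gradients = {name: [] for name in policies}
    for trajectory in trajectories:
        credits = evaluate(trajectory, policies)
        for name, policy in policies.items():
            gradients[name].append(make_gradient(policy, credits[name]))
    # Phase 4: aggregate per agent, then apply to that agent's policy.
    updated = {name: apply_gradient(policy, aggregate(gradients[name]))
               for name, policy in policies.items()}
    # Phase 5: a real trainer would save a checkpoint; the sketch just returns.
    rewards = [trajectory["reward"] for trajectory in trajectories]
    stats = {
        "num_episodes": len(rewards),
        "avg_reward": sum(rewards) / len(rewards),
        "min_reward": min(rewards),
        "max_reward": max(rewards),
        "rewards": rewards,
    }
    return updated, stats


updated, stats = train_one_iteration_sketch(
    policies={"agent_0": "Be brief."},
    tasks=[{"reward": 0.0}, {"reward": 1.0}],
    collect=lambda policies, task: {"task": task, "reward": task["reward"]},
    evaluate=lambda trajectory, policies: {
        n: f"reward={trajectory['reward']}" for n in policies
    },
    make_gradient=lambda policy, credit: f"improve given {credit}",
    aggregate=lambda grads: " | ".join(grads),
    apply_gradient=lambda policy, gradient: f"{policy} [{gradient}]",
)
```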

Callbacks

class langmarl.Callback

Abstract base class for training callbacks. Override any of the following methods:

on_iteration_start(iteration, trainer)
on_iteration_end(iteration, stats, trainer)
on_episode_complete(trajectory, trainer)
on_policy_update(agent_id, old_policy, new_policy)

class langmarl.LoggingCallback

Callback that logs iteration start/end events via the trainer’s RunLogger.

class langmarl.CheckpointCallback

Callback for checkpoint management (saving is handled by the trainer by default).

class langmarl.EarlyStoppingCallback(patience=3, min_delta=0.01)

Stop training when the average reward stops improving.

Parameters:
  • patience (int) – Number of iterations to wait without improvement before stopping.

  • min_delta (float) – Minimum improvement to qualify as progress.
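A sketch of the patience rule, assuming an iteration counts as progress only when the average reward exceeds the best value so far by more than min_delta. EarlyStoppingSketch and its update method are hypothetical and independent of the Callback hooks.

```python
class EarlyStoppingSketch:
    """Track the best average reward and count iterations without progress."""

    def __init__(self, patience=3, min_delta=0.01):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.stale = 0

    def update(self, avg_reward):
        """Record one iteration; return True when training should stop."""
        if avg_reward > self.best + self.min_delta:
            self.best = avg_reward
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience


stopper = EarlyStoppingSketch(patience=2, min_delta=0.01)
decisions = [stopper.update(r) for r in [0.50, 0.60, 0.605, 0.60]]
```

With patience=2, the third and fourth rewards both fail to beat 0.60 by more than 0.01, so the stop signal fires on the fourth iteration.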

LLM Client

LLMClient

class langmarl.LLMClient(llm_config)

Unified LLM client wrapping OpenAI-compatible APIs.

Parameters:

llm_config (LLMConfig) – LLM configuration.

chat(system_prompt, user_input, max_tokens=None)

Send a chat completion request and return the response text.

Parameters:
  • system_prompt (str) – System message.

  • user_input (str) – User message.

  • max_tokens (int or None) – Maximum tokens (default: from config).

Returns:

Response text.

Return type:

str

chat_with_usage(system_prompt, user_input, max_tokens=None)

Chat and return both response text and token usage.

Returns:

Tuple of (response_text, usage), where usage is a dict of the form {"input": <input tokens>, "output": <output tokens>}.

Return type:

tuple[str, dict[str, int]]

raw_client

Access the underlying OpenAI client instance.

TokenTracker

class langmarl.TokenTracker(model='gpt-4o-mini', input_price=None, output_price=None)

Tracks token usage and estimates costs for LLM API calls. Has built-in pricing for common models.

Parameters:
  • model (str) – Model name for pricing lookup.

  • input_price (float or None) – Custom input price per million tokens (overrides lookup).

  • output_price (float or None) – Custom output price per million tokens (overrides lookup).

add_usage(input_tokens, output_tokens)

Add token usage counts.

Parameters:
  • input_tokens – Number of input tokens.

  • output_tokens – Number of output tokens.

get_stats()

Get usage statistics and cost estimate.

Returns:

Dict with model, input_tokens, output_tokens, total_tokens, input_cost_usd, output_cost_usd, cost_usd.

Return type:

dict

estimate_cost(input_tokens, output_tokens)

Estimate cost for given token counts without adding to the tracker.

Returns:

Estimated cost in USD.

Return type:

float
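Since prices are quoted per million tokens, the estimate reduces to a simple weighted sum. The function below is a stand-in with hypothetical names, and the prices in the example are illustrative values, not real model pricing.

```python
def estimate_cost_sketch(input_tokens, output_tokens, input_price, output_price):
    """Cost in USD given per-million-token prices for input and output."""
    input_cost = input_tokens / 1_000_000 * input_price
    output_cost = output_tokens / 1_000_000 * output_price
    return input_cost + output_cost


# 2M input tokens at $0.50 per million plus 0.5M output tokens at $2.00 per million.
cost = estimate_cost_sketch(2_000_000, 500_000, input_price=0.5, output_price=2.0)
```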

reset()

Reset all token counters to zero.

get_summary_string()

Get a human-readable summary of token usage and costs.

Returns:

Formatted summary string.

Return type:

str

Storage

LocalStore

class langmarl.LocalStore(base_dir)

Filesystem-backed storage backend. Manages training run data including trajectories, checkpoints, evaluations, gradients, metrics, and logs.

Parameters:

base_dir (str) – Base directory for all storage.

PolicyCheckpoint

class langmarl.PolicyCheckpoint(store, run_id, num_agents)

Manages versioned policy snapshots. Stores one text file per agent per iteration.

Parameters:
  • store (BaseStore) – Storage backend.

  • run_id (str) – Training run identifier.

  • num_agents (int) – Number of agents.

get_policies(iteration=None)

Load policies from the latest (or a specific) iteration. Generates default policies if no checkpoint exists.

Parameters:

iteration (int or None) – Specific iteration to load. None loads the latest.

Returns:

Dict mapping agent name to policy text.

Return type:

dict[str, str]

save_policies(iteration, policies, stats=None)

Save policies as a new iteration checkpoint.

Parameters:
  • iteration – Iteration number.

  • policies – Dict mapping agent name to policy text.

  • stats – Optional training statistics.

diff_policies(iter_a, iter_b)

Compare policies across two iterations.

Parameters:
  • iter_a – First iteration number.

  • iter_b – Second iteration number.

Returns:

Dict mapping agent name to diff text.
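One plausible way to produce a per-agent diff is the standard library's difflib, sketched below. Whether PolicyCheckpoint uses a unified diff is not stated in this reference; diff_policies_sketch takes two already-loaded policy dicts and is hypothetical.

```python
import difflib


def diff_policies_sketch(policies_a, policies_b):
    """Return a unified diff per agent present in both policy dicts."""
    diffs = {}
    for name in sorted(set(policies_a) & set(policies_b)):
        lines = difflib.unified_diff(
            policies_a[name].splitlines(),
            policies_b[name].splitlines(),
            fromfile=f"{name}@a",
            tofile=f"{name}@b",
            lineterm="",
        )
        # Identical policies yield no diff lines, so the text is empty.
        diffs[name] = "\n".join(lines)
    return diffs


diffs = diff_policies_sketch(
    {"agent_0": "Answer concisely.", "agent_1": "Verify the answer."},
    {"agent_0": "Answer concisely.\nCite sources.", "agent_1": "Verify the answer."},
)
```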

TrajectoryStore

class langmarl.TrajectoryStore(store, run_id)

Manages episode trajectory persistence.

save(iteration, episode_id, trajectory)

Save a trajectory for a given iteration and episode.

load(iteration, limit=None)

Load trajectories for a given iteration.

Parameters:
  • iteration – Iteration number.

  • limit – Maximum number of trajectories to load.

Returns:

List of Trajectory objects.

Return type:

list[Trajectory]

count(iteration)

Count stored trajectories for a given iteration.

Returns:

Number of stored trajectories.

Return type:

int

RunLogger

class langmarl.RunLogger(store, run_id)

Structured training logger. Writes to both file and console.

info(message)
debug(message)
warning(message)
error(message)
iteration_start(iteration, policies)
iteration_end(iteration, stats)
episode_saved(iteration, episode_id, reward)
evaluation_done(iteration, episode_id, paradigm, response)
gradient_saved(iteration, agent_name, num_gradients)