API Reference
Convenience Functions
- langmarl.train(config_path, **overrides)
One-line training: load config, create components, run training.
- Parameters:
config_path (str) – Path to a JSON config file.
overrides – Key=value overrides applied to the loaded config.
- Returns:
Training metrics.
langmarl.train("configs/language_task/qa_central_credit.json")
langmarl.train("configs/qa.json", num_iterations=10, paradigm="central_global")
- langmarl.load_config(path, overrides=None)
Load a configuration from a JSON file.
- langmarl.make_env(name, config)
Create an environment instance by registered name.
- Parameters:
name (str) – Environment name (e.g., "language", "overcooked", "pistonball").
config (BaseConfig) – Configuration object.
- Returns:
Environment instance.
- Return type:
- Raises:
ValueError – If the environment name is not registered.
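Lookup by registered name, with a ValueError for unknown names, can be sketched as a small registry. This is a minimal illustration and not langmarl's implementation: `_ENV_REGISTRY`, `register_env`, `ToyLanguageEnv`, and `make_env_sketch` are hypothetical names, and a plain dict stands in for the config object.

```python
from typing import Callable, Dict

# Hypothetical registry; how langmarl stores registered environments is not documented here.
_ENV_REGISTRY: Dict[str, Callable[[dict], object]] = {}


def register_env(name: str):
    """Decorator that records an environment factory under a name."""
    def decorator(factory):
        _ENV_REGISTRY[name] = factory
        return factory
    return decorator


@register_env("language")
class ToyLanguageEnv:
    """Stand-in environment that only remembers its config."""
    def __init__(self, config: dict):
        self.config = config


def make_env_sketch(name: str, config: dict):
    """Look up a registered factory; raise ValueError for unknown names."""
    try:
        factory = _ENV_REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(_ENV_REGISTRY))
        raise ValueError(f"Unknown environment {name!r}; registered: {known}") from None
    return factory(config)


env = make_env_sketch("language", {"num_agents": 2})
```

Listing the registered names in the error message is a design choice of this sketch; it makes a typo in the environment name easier to spot.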
Configuration
BaseConfig
- class langmarl.BaseConfig
Shared configuration fields across all environments.
- Parameters:
exp_name (str) – Experiment name (default: "experiment").
paradigm (str) – Training paradigm. One of "central_global", "central_credit" (default: "central_credit").
num_iterations (int) – Number of training iterations (default: 5).
trajectories_per_iteration (int) – Episodes per iteration (default: 10).
mini_batch_size (int or None) – If set, subsample this many trajectories for gradient computation (default: None = use all).
start_iteration (int) – Iteration to start from (default: 0).
llm (LLMConfig or None) – Default LLM config, used as fallback for actor/critic/optimizer (default: None).
actor_llm (LLMConfig or None) – LLM config for actors. Falls back to llm if not set.
critic_llm (LLMConfig or None) – LLM config for the critic. Falls back to llm if not set.
optimizer_llm (LLMConfig or None) – LLM config for the optimizer. Falls back to llm if not set.
num_agents (int) – Number of agents (default: 2).
experiment_dir (str) – Directory to save experiment data (default: "./experiments").
checkpoint_dir (str) – Directory to save policy checkpoints (default: "./ckpt_policy").
max_workers (int) – Maximum parallel workers for trajectory generation (default: 5).
log_level (str) – Logging level (default: "INFO").
- get_actor_llm()
Get LLM config for actors. Returns actor_llm if set, otherwise llm.
- Return type:
- get_critic_llm()
Get LLM config for the critic. Returns critic_llm if set, otherwise llm.
- Return type:
- get_optimizer_llm()
Get LLM config for the optimizer. Returns optimizer_llm if set, otherwise llm.
- Return type:
- classmethod from_json(path, overrides=None)
Load config from a JSON file with optional overrides. LLM fields can be specified as predefined model name strings or full config dicts.
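The override and fallback behaviour described for BaseConfig can be illustrated with a stand-in dataclass. This is a sketch under stated assumptions, not the library's code: `SketchConfig` is hypothetical, it covers only a few of the documented fields, LLM settings are plain strings rather than LLMConfig objects, and rejecting unknown keys is an assumption about behaviour the reference does not specify.

```python
import json
import tempfile
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SketchConfig:
    # Field names and defaults follow the parameter list above; the class is a stand-in.
    exp_name: str = "experiment"
    paradigm: str = "central_credit"
    num_iterations: int = 5
    llm: Optional[str] = None
    actor_llm: Optional[str] = None

    def get_actor_llm(self) -> Optional[str]:
        # Prefer the actor-specific setting, otherwise fall back to the shared one.
        return self.actor_llm if self.actor_llm is not None else self.llm

    @classmethod
    def from_json(cls, path, overrides=None):
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
        data.update(overrides or {})  # overrides win over values from the file
        unknown = set(data) - {f.name for f in fields(cls)}
        if unknown:  # assumption: the real loader may treat unknown keys differently
            raise ValueError(f"Unknown config keys: {sorted(unknown)}")
        return cls(**data)


# Write a small config file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as fh:
    json.dump({"exp_name": "qa_run", "llm": "gpt-4o-mini"}, fh)
    config_path = fh.name

cfg = SketchConfig.from_json(config_path, overrides={"num_iterations": 10})
```

Here `cfg.num_iterations` comes from the override, `cfg.paradigm` keeps its default, and `cfg.get_actor_llm()` falls back to the shared `llm` value because `actor_llm` was not set.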
LanguageTaskConfig
- class langmarl.LanguageTaskConfig
Configuration for language task environments. Inherits all fields from BaseConfig.
- Parameters:
task_type (str) – Task type. One of "qa", "math", "writing", "coding" (default: "qa").
benchmark_path (str) – Path to the benchmark dataset directory.
data_limit (int or None) – Limit the number of tasks loaded from the benchmark (default: None = all).
use_verified_reward (bool) – Enable LLM-as-judge for reward verification (default: False).
episode_generation_workers (int) – Parallel workers for episode generation (default: 8).
optimizer_workers (int) – Parallel workers for gradient generation (default: 1).
OvercookedConfig
- class langmarl.OvercookedConfig
Configuration for the Overcooked environment. Inherits all fields from BaseConfig.
PistonballConfig
- class langmarl.PistonballConfig
Configuration for the Pistonball environment. Inherits all fields from BaseConfig.
LLMConfig
- class langmarl.LLMConfig
Configuration for a single LLM model. All providers use the OpenAI-compatible API format.
- Parameters:
name (str) – Human-readable model identifier (e.g., "gpt-4o").
model_string (str) – Model string passed to the API (e.g., "gpt-4o").
base_url (str or None) – Custom API base URL. None means default OpenAI endpoint.
api_key (str or None) – API key. None means use environment variable.
api_key_env_var (str) – Environment variable name for the API key (default: "OPENAI_API_KEY").
is_multimodal (bool) – Whether the model supports multimodal input (default: False).
max_tokens (int) – Maximum tokens per response (default: 4096).
input_price_per_million (float) – Input token price per million (USD).
output_price_per_million (float) – Output token price per million (USD).
extra_params (dict) – Additional provider-specific parameters.
- get_api_key()
Get API key from config or environment variable.
- Returns:
API key string.
- Raises:
ValueError – If no API key is found.
- classmethod from_preset(name)
Create an LLMConfig from a predefined model name.
- Parameters:
name (str) – Predefined model name (e.g., "gpt-4o-mini").
- Returns:
LLMConfig instance.
- classmethod from_dict(data)
Create an LLMConfig from a dictionary.
- to_dict()
Convert to a dictionary for JSON serialization.
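The key resolution described for get_api_key() (explicit key first, then the named environment variable, otherwise ValueError) might look roughly like the following. `SketchLLMConfig` is a hypothetical stand-in with only three of the documented fields, and `SKETCH_DEMO_KEY` is a made-up variable name used so the example does not touch a real credential.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchLLMConfig:
    # Stand-in with a subset of the documented LLMConfig fields.
    model_string: str
    api_key: Optional[str] = None
    api_key_env_var: str = "OPENAI_API_KEY"

    def get_api_key(self) -> str:
        # An explicit key wins; otherwise consult the named environment variable.
        if self.api_key:
            return self.api_key
        key = os.environ.get(self.api_key_env_var)
        if not key:
            raise ValueError(
                f"No API key found: set api_key or the {self.api_key_env_var} variable"
            )
        return key


# Hypothetical variable name, set here only so the example runs without real credentials.
os.environ["SKETCH_DEMO_KEY"] = "from-env"
explicit = SketchLLMConfig("gpt-4o", api_key="from-config")
from_env = SketchLLMConfig("gpt-4o", api_key_env_var="SKETCH_DEMO_KEY")
```

Keeping the key out of the config file and pointing `api_key_env_var` at a per-provider variable is one way to avoid writing credentials into JSON configs.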
- langmarl.get_llm_config(name_or_path)
Get LLM config by predefined name or load from a JSON file path.
- Parameters:
name_or_path (str) – Predefined model name or path to JSON config file.
- Returns:
LLMConfig instance.
- Raises:
ValueError – If the name is not recognized and the path does not exist.
Core Abstractions
Trajectory
- class langmarl.Trajectory
A single episode trajectory (dataclass).
BaseEnvironment
- class langmarl.BaseEnvironment
Abstract base class for environment adapters. Subclass this to add new environments.
- abstract reset(task)
Reset the environment with the given task.
- Parameters:
task (dict) – Task description dict.
- Returns:
Initial observations dict.
- abstract step(agent_id, action)
Execute an agent’s action.
- Parameters:
agent_id – Agent identifier string.
action – Action string.
- Returns:
Tuple of (observation, reward, done, info).
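A new environment adapter implements reset() and step(). The sketch below uses a local abstract class with the same two methods instead of importing langmarl, so it runs on its own. `SketchBaseEnvironment` and `EchoEnv` are hypothetical, and the task keys "question" and "answer" and the per-agent observation dict are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Tuple


class SketchBaseEnvironment(ABC):
    """Stand-in mirroring the two abstract methods documented above."""

    @abstractmethod
    def reset(self, task: dict) -> dict: ...

    @abstractmethod
    def step(self, agent_id: str, action: str) -> Tuple[Any, float, bool, dict]: ...


class EchoEnv(SketchBaseEnvironment):
    """Toy environment: the episode ends when an agent gives the expected answer."""

    def __init__(self, agent_ids):
        self.agent_ids = list(agent_ids)
        self.task = {}

    def reset(self, task):
        # Every agent initially observes the question (assumed task layout).
        self.task = task
        return {agent_id: task["question"] for agent_id in self.agent_ids}

    def step(self, agent_id, action):
        correct = action.strip() == self.task["answer"]
        reward = 1.0 if correct else 0.0
        info = {"agent_id": agent_id}
        return self.task["question"], reward, correct, info


env = EchoEnv(["agent_0", "agent_1"])
observations = env.reset({"question": "2 + 2 = ?", "answer": "4"})
obs, reward, done, info = env.step("agent_0", "4")
```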
BaseAgent
- class langmarl.BaseAgent
Abstract base class for agents with language policies.
BaseCritic
- class langmarl.BaseCritic
Abstract base class for trajectory evaluators.
BaseReward
- class langmarl.BaseReward
Abstract base class for reward computation.
- abstract compute(trajectory)
Compute reward for a trajectory.
- Parameters:
trajectory (Trajectory) – Episode trajectory.
- Returns:
Reward value.
- Return type:
BaseOptimizer
- class langmarl.BaseOptimizer
Abstract base class for language gradient optimizers.
- abstract generate_gradient(policy, evaluation, context)
Generate a language gradient (improvement instruction) for one agent.
- Parameters:
policy – Current policy text.
evaluation – Evaluation feedback text.
context – Task context text.
- Returns:
Gradient text.
- Return type:
- abstract apply_gradient(policy, gradient)
Apply a gradient to a policy, returning the updated policy.
- Parameters:
policy – Current policy text.
gradient – Gradient text.
- Returns:
Updated policy text.
- Return type:
Concrete Implementations
CentralizedCritic
- class langmarl.CentralizedCritic(config, prompts_dir=None)
Centralized critic supporting central_global and central_credit paradigms. Evaluates multi-agent sequential collaboration trajectories using LLM-as-judge.
- Parameters:
config (BaseConfig) – Configuration with paradigm, num_agents, and LLM fields.
prompts_dir (Path or None) – Optional custom path to evaluation prompt templates.
- evaluate(trajectory, policies)
Evaluate a trajectory. For central_credit, returns per-agent causal credits. For central_global, returns the same evaluation for all agents.
- Parameters:
trajectory (Trajectory) – Episode trajectory.
- Returns:
Dict with keys raw_response, paradigm, and per_agent.
- Return type:
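To make the shape of the evaluate() result concrete, the following sketch builds a dict with the three documented keys from a raw judge response. It is an assumption-laden illustration: `build_evaluation_sketch` is hypothetical, the JSON layout (agent name mapped to feedback text) is assumed, and no LLM is called.

```python
import json


def build_evaluation_sketch(raw_response, paradigm, agent_names):
    """Return a dict with the keys raw_response, paradigm, and per_agent."""
    # central_global: every agent receives the same team-level evaluation.
    per_agent = {name: raw_response for name in agent_names}
    if paradigm == "central_credit":
        try:
            parsed = json.loads(raw_response)
        except json.JSONDecodeError:
            parsed = None  # fall back to the full response for all agents
        if isinstance(parsed, dict):
            per_agent = {
                name: str(parsed.get(name, raw_response)) for name in agent_names
            }
    return {"raw_response": raw_response, "paradigm": paradigm, "per_agent": per_agent}


agents = ["agent_0", "agent_1"]
credit = build_evaluation_sketch(
    '{"agent_0": "Good retrieval.", "agent_1": "Missed the date."}',
    "central_credit",
    agents,
)
shared = build_evaluation_sketch("The team answered too vaguely.", "central_global", agents)
```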
PolicyGradientOptimizer
- class langmarl.PolicyGradientOptimizer(llm_config)
LLM-based policy gradient optimizer. Generates language gradients (concrete improvement instructions) and applies them to agent policies.
- Parameters:
llm_config (LLMConfig) – LLM configuration for the optimizer model.
- generate_gradient(policy, evaluation, context, agent_name='agent')
Generate a per-agent improvement instruction based on evaluation feedback.
- Parameters:
policy – Agent’s current policy text.
evaluation – Evaluation feedback for this agent.
context – Task context (truncated to 800 chars).
agent_name – Agent name for logging.
- Returns:
Gradient text (2-4 sentences of specific improvement advice).
- Return type:
Generate a shared gradient for all agents (used in central_global paradigm).
- Parameters:
evaluation – Team-level evaluation feedback.
task_context – Task context (truncated to 800 chars).
- Returns:
Shared gradient text.
- Return type:
- static apply_gradient(base_policy, gradient)
Apply a gradient to a base policy. Appends the gradient as a [CASE-SPECIFIC FEEDBACK] section. The base policy is never modified; the feedback section is replaced on every call.
- Parameters:
base_policy – Original policy text.
gradient – Gradient text to append.
- Returns:
Updated policy text.
- Return type:
- static aggregate_gradients(gradients)
Aggregate multiple gradients (from multiple episodes) into one by joining with separator markers.
- static parse_credit_response(response, agent_names)
Parse per-agent evaluations from a credit-assignment LLM response. Handles JSON format (language tasks), bracket markers (Overcooked, Pistonball), and falls back to returning the full response for all agents.
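The "replace, do not stack" behaviour of apply_gradient and the joining done by aggregate_gradients can be sketched with plain string operations. The section marker comes from the description above; the exact layout, the separator string, and the function names ending in `_sketch` are assumptions, not the library's actual output format.

```python
FEEDBACK_MARKER = "[CASE-SPECIFIC FEEDBACK]"
SEPARATOR = "\n---\n"  # hypothetical separator marker


def apply_gradient_sketch(base_policy: str, gradient: str) -> str:
    """Append the gradient as a feedback section, replacing any earlier one."""
    # Drop an existing feedback section so repeated calls replace rather than stack.
    base = base_policy.split(FEEDBACK_MARKER, 1)[0].rstrip()
    return f"{base}\n\n{FEEDBACK_MARKER}\n{gradient.strip()}"


def aggregate_gradients_sketch(gradients):
    """Join the non-empty gradients from several episodes with a separator."""
    return SEPARATOR.join(g.strip() for g in gradients if g.strip())


policy = "Answer concisely and cite the passage you used."
once = apply_gradient_sketch(policy, "Double-check dates before answering.")
twice = apply_gradient_sketch(once, "Prefer quoting the passage verbatim.")
combined = aggregate_gradients_sketch(["Check units.", "", "Show your work."])
```

After the second call, `twice` still starts with the original policy text and contains exactly one feedback section holding only the newer gradient.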
TrajectoryFormatter
- class langmarl.TrajectoryFormatter
Formats episode trajectories for LLM evaluation.
- static format_trajectory(episode)
Format an episode into a text string for global evaluation.
- Parameters:
episode – Episode dict with task, transitions, reward.
- Returns:
Formatted trajectory string.
- Return type:
- static format_for_credit_assignment(episode)
Format an episode for per-agent credit assignment, with structured per-agent contribution sections.
- Parameters:
episode – Episode dict.
- Returns:
Formatted trajectory string.
- Return type:
Training
MonteCarloTrainer
- class langmarl.MonteCarloTrainer(config, env, critic, optimizer, reward_fn=None, store=None, callbacks=None)
Generic Monte Carlo trainer for any LangMARL environment. Implements a five-phase training iteration: load policies, generate trajectories, evaluate and generate gradients, aggregate and apply gradients, save checkpoint.
- Parameters:
config (BaseConfig) – Training configuration.
env (BaseEnvironment) – Environment instance.
critic (BaseCritic) – Critic instance.
optimizer (BaseOptimizer) – Optimizer instance.
reward_fn (BaseReward or None) – Optional reward function for verified rewards.
store (BaseStore or None) – Storage backend (default: LocalStore).
callbacks (list[Callback] or None) – List of training callbacks.
- train(num_iterations=None)
Main training loop. Automatically resumes from the latest checkpoint.
- Parameters:
num_iterations (int or None) – Number of iterations to train. Defaults to config.num_iterations.
- Returns:
Training metrics from the store.
- train_one_iteration(iteration)
Run a single training iteration (five phases).
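The five phases can be laid out as a single function to show how data flows between them. This is a schematic sketch only: `train_one_iteration_sketch`, its parameters, the "[FEEDBACK]" marker, and the in-memory `checkpoints` dict are hypothetical stand-ins for the trainer, critic, optimizer, and store, and plain callables replace the LLM calls.

```python
from typing import Callable, Dict, List


def train_one_iteration_sketch(
    iteration: int,
    checkpoints: Dict[int, Dict[str, str]],
    run_episode: Callable[[Dict[str, str]], dict],
    evaluate: Callable[[dict], Dict[str, str]],
    make_gradient: Callable[[str, str], str],
    episodes_per_iteration: int = 2,
) -> Dict[str, str]:
    # Phase 1: load the policies saved by the previous iteration.
    policies = dict(checkpoints[iteration - 1])
    # Phase 2: generate trajectories with the current policies.
    trajectories = [run_episode(policies) for _ in range(episodes_per_iteration)]
    # Phase 3: evaluate each trajectory and turn feedback into per-agent gradients.
    gradients: Dict[str, List[str]] = {name: [] for name in policies}
    for trajectory in trajectories:
        feedback = evaluate(trajectory)
        for name in policies:
            gradients[name].append(make_gradient(policies[name], feedback[name]))
    # Phase 4: aggregate the gradients and apply them to each policy.
    updated = {
        name: policies[name] + "\n[FEEDBACK]\n" + "\n---\n".join(gradients[name])
        for name in policies
    }
    # Phase 5: save the updated policies as this iteration's checkpoint.
    checkpoints[iteration] = updated
    return updated


checkpoints = {0: {"agent_0": "Draft an answer.", "agent_1": "Review the draft."}}
updated = train_one_iteration_sketch(
    iteration=1,
    checkpoints=checkpoints,
    run_episode=lambda policies: {"reward": 0.5, "agents": list(policies)},
    evaluate=lambda traj: {name: f"{name} reward {traj['reward']}" for name in traj["agents"]},
    make_gradient=lambda policy, feedback: f"Improve given: {feedback}",
)
```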
Callbacks
- class langmarl.Callback
Abstract base class for training callbacks. Override any of the following methods:
- on_iteration_start(iteration, trainer)
- on_iteration_end(iteration, stats, trainer)
- on_episode_complete(trajectory, trainer)
- on_policy_update(agent_id, old_policy, new_policy)
- class langmarl.LoggingCallback
Callback that logs iteration start/end events via the trainer’s RunLogger.
- class langmarl.CheckpointCallback
Callback for checkpoint management (saving is handled by the trainer by default).
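A custom callback overrides only the hooks it needs. The sketch below defines a local base class with the four documented hook signatures so that it runs without langmarl installed. `SketchCallback` and `RewardHistoryCallback` are hypothetical, and the "mean_reward" key in `stats` is an assumption about what the trainer reports.

```python
class SketchCallback:
    """Stand-in base class with the four no-op hooks listed above."""

    def on_iteration_start(self, iteration, trainer):
        pass

    def on_iteration_end(self, iteration, stats, trainer):
        pass

    def on_episode_complete(self, trajectory, trainer):
        pass

    def on_policy_update(self, agent_id, old_policy, new_policy):
        pass


class RewardHistoryCallback(SketchCallback):
    """Records per-iteration reward and which agents had their policy changed."""

    def __init__(self):
        self.history = []
        self.changed_agents = set()

    def on_iteration_end(self, iteration, stats, trainer):
        # "mean_reward" is an assumed key; adjust to whatever the trainer reports.
        self.history.append((iteration, stats.get("mean_reward")))

    def on_policy_update(self, agent_id, old_policy, new_policy):
        if old_policy != new_policy:
            self.changed_agents.add(agent_id)


# Drive the hooks by hand to show the call order a trainer might use.
callback = RewardHistoryCallback()
callback.on_iteration_start(0, trainer=None)
callback.on_policy_update("agent_0", "old text", "new text")
callback.on_policy_update("agent_1", "same", "same")
callback.on_iteration_end(0, {"mean_reward": 0.75}, trainer=None)
```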
LLM Client
LLMClient
- class langmarl.LLMClient(llm_config)
Unified LLM client wrapping OpenAI-compatible APIs.
- Parameters:
llm_config (LLMConfig) – LLM configuration.
- chat(system_prompt, user_input, max_tokens=None)
Send a chat completion request and return the response text.
- chat_with_usage(system_prompt, user_input, max_tokens=None)
Chat and return both response text and token usage.
- raw_client
Access the underlying OpenAI client instance.
TokenTracker
- class langmarl.TokenTracker(model='gpt-4o-mini', input_price=None, output_price=None)
Tracks token usage and estimates costs for LLM API calls. Has built-in pricing for common models.
- Parameters:
- add_usage(input_tokens, output_tokens)
Add token usage counts.
- Parameters:
input_tokens – Number of input tokens.
output_tokens – Number of output tokens.
- get_stats()
Get usage statistics and cost estimate.
- Returns:
Dict with model, input_tokens, output_tokens, total_tokens, input_cost_usd, output_cost_usd, cost_usd.
- Return type:
- estimate_cost(input_tokens, output_tokens)
Estimate cost for given token counts without adding to the tracker.
- Returns:
Estimated cost in USD.
- Return type:
- reset()
Reset all token counters to zero.
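The cost arithmetic behind a tracker like this is a pair of multiplications. The sketch assumes prices are given in USD per million tokens, matching the LLMConfig price fields; whether TokenTracker uses the same unit is an assumption. `SketchTokenTracker` is a stand-in, and the prices 1.0 and 2.0 are illustrative values, not real model pricing.

```python
class SketchTokenTracker:
    """Stand-in tracker; prices are assumed to be USD per million tokens."""

    def __init__(self, model, input_price, output_price):
        self.model = model
        self.input_price = input_price
        self.output_price = output_price
        self.input_tokens = 0
        self.output_tokens = 0

    def add_usage(self, input_tokens, output_tokens):
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens

    def estimate_cost(self, input_tokens, output_tokens):
        # Pure calculation: does not change the running counters.
        total = input_tokens * self.input_price + output_tokens * self.output_price
        return total / 1_000_000

    def get_stats(self):
        input_cost = self.estimate_cost(self.input_tokens, 0)
        output_cost = self.estimate_cost(0, self.output_tokens)
        return {
            "model": self.model,
            "input_tokens": self.input_tokens,
            "output_tokens": self.output_tokens,
            "total_tokens": self.input_tokens + self.output_tokens,
            "input_cost_usd": input_cost,
            "output_cost_usd": output_cost,
            "cost_usd": input_cost + output_cost,
        }

    def reset(self):
        self.input_tokens = 0
        self.output_tokens = 0


tracker = SketchTokenTracker("demo-model", input_price=1.0, output_price=2.0)
tracker.add_usage(input_tokens=1_000_000, output_tokens=500_000)
stats = tracker.get_stats()
```

With these illustrative prices, one million input tokens and half a million output tokens each cost 1.0 USD, so `stats["cost_usd"]` is 2.0.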
Storage
LocalStore
PolicyCheckpoint
- class langmarl.PolicyCheckpoint(store, run_id, num_agents)
Manages versioned policy snapshots. Stores one text file per agent per iteration.
- Parameters:
- get_policies(iteration=None)
Load policies from the latest (or a specific) iteration. Generates default policies if no checkpoint exists.
- save_policies(iteration, policies, stats=None)
Save policies as a new iteration checkpoint.
- Parameters:
iteration – Iteration number.
policies – Dict mapping agent name to policy text.
stats – Optional training statistics.
- diff_policies(iter_a, iter_b)
Compare policies across two iterations.
- Parameters:
iter_a – First iteration number.
iter_b – Second iteration number.
- Returns:
Dict mapping agent name to diff text.
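A per-agent diff between two iterations could be produced with the standard library's difflib. This is one plausible approach, not a description of what diff_policies does internally. `diff_policies_sketch` is hypothetical and takes the two policy dicts directly instead of loading them from a store.

```python
import difflib


def diff_policies_sketch(policies_a, policies_b, label_a="iter_a", label_b="iter_b"):
    """Return a dict mapping agent name to unified-diff text ('' when unchanged)."""
    diffs = {}
    for agent in sorted(set(policies_a) | set(policies_b)):
        before = policies_a.get(agent, "").splitlines()
        after = policies_b.get(agent, "").splitlines()
        lines = difflib.unified_diff(
            before, after, fromfile=label_a, tofile=label_b, lineterm=""
        )
        diffs[agent] = "\n".join(lines)
    return diffs


iteration_0 = {"agent_0": "Answer briefly.", "agent_1": "Review the draft."}
iteration_1 = {"agent_0": "Answer briefly.\nCite sources.", "agent_1": "Review the draft."}
diffs = diff_policies_sketch(iteration_0, iteration_1, "iteration_0", "iteration_1")
```

An unchanged agent maps to an empty string here, which makes it easy to filter for the agents whose policy actually moved between the two checkpoints.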
TrajectoryStore
- class langmarl.TrajectoryStore(store, run_id)
Manages episode trajectory persistence.
- save(iteration, episode_id, trajectory)
Save a trajectory for a given iteration and episode.
- load(iteration, limit=None)
Load trajectories for a given iteration.
- Parameters:
iteration – Iteration number.
limit – Maximum number of trajectories to load.
- Returns:
List of Trajectory objects.
- Return type:
RunLogger
- class langmarl.RunLogger(store, run_id)
Structured training logger. Writes to both file and console.
- info(message)
- debug(message)
- warning(message)
- error(message)
- iteration_start(iteration, policies)
- iteration_end(iteration, stats)
- episode_saved(iteration, episode_id, reward)
- evaluation_done(iteration, episode_id, paradigm, response)
- gradient_saved(iteration, agent_name, num_gradients)