Usage

Installation

Install LangMARL:

$ pip install langmarl

To install optional, environment-specific dependencies (the quotes keep shells such as zsh from expanding the square brackets):

$ pip install "langmarl[pettingzoo]"   # Pistonball / PettingZoo support
$ pip install "langmarl[all]"          # All optional dependencies

Set up your LLM API key as an environment variable:

$ export OPENAI_API_KEY="your-api-key"

For other providers, set the corresponding key:

$ export GOOGLE_API_KEY="your-key"     # Google Gemini
$ export TOGETHER_API_KEY="your-key"   # Together (Llama, Qwen)
$ export DEEPSEEK_API_KEY="your-key"   # DeepSeek
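
Before starting a run, it can help to confirm which of these keys are visible to Python. The following is a minimal sketch using only the standard library; the variable names come from the export commands above, while the helper name configured_providers is hypothetical and not part of LangMARL.

```python
import os

# Environment variable names taken from the export commands above.
PROVIDER_KEYS = {
    "OpenAI": "OPENAI_API_KEY",
    "Google Gemini": "GOOGLE_API_KEY",
    "Together": "TOGETHER_API_KEY",
    "DeepSeek": "DEEPSEEK_API_KEY",
}


def configured_providers(environ=None):
    """Return the providers whose API key variable is set and non-empty."""
    env = os.environ if environ is None else environ
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var)]


print("Configured providers:", configured_providers())
```

An empty list means none of the four keys is set in the current shell session.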

Quick Start

One-line training

Train from a JSON config file with a single call:

import langmarl

langmarl.train("configs/language_task/qa_central_credit.json")

Programmatic usage

For full control over the training pipeline:

import langmarl

# 1. Configure
config = langmarl.LanguageTaskConfig(
    task_type="qa",
    paradigm="central_credit",
    llm=langmarl.LLMConfig.from_preset("gpt-4o-mini"),
    num_agents=2,
    num_iterations=5,
    trajectories_per_iteration=10,
)

# 2. Create components
env = langmarl.make_env("language", config)
critic = langmarl.CentralizedCritic(config)
optimizer = langmarl.PolicyGradientOptimizer(config.get_optimizer_llm())

# 3. Train
trainer = langmarl.MonteCarloTrainer(
    config=config,
    env=env,
    critic=critic,
    optimizer=optimizer,
)
metrics = trainer.train()

Custom environment

Register a custom environment by subclassing BaseEnvironment:

import langmarl

@langmarl.register_env("my_env")
class MyEnv(langmarl.BaseEnvironment):
    def __init__(self, config):
        self.num_agents = config.num_agents
        self.llm_client = langmarl.LLMClient(config.get_actor_llm())

    def reset(self, task: dict) -> dict:
        return {"task": task}

    def step(self, agent_id: str, action: str):
        return {}, 0.0, False, {}

    def sample_tasks(self, num_samples: int) -> list[dict]:
        # Build a fresh dict per task so that mutating one task does not affect the others
        return [{"question": "What is 2+2?", "ground_truth": "4"} for _ in range(num_samples)]

    def collect_trajectory(self, policies, task) -> langmarl.Trajectory:
        steps = []
        # Query each agent in turn with its current policy
        for agent, policy in policies.items():
            response = self.llm_client.chat(policy, task["question"])
            steps.append({"agent_id": agent, "input": task["question"], "output": response})
        # Score only the last agent's output, using a substring match
        reward = 1.0 if task["ground_truth"] in steps[-1]["output"] else 0.0
        return langmarl.Trajectory(task=task, steps=steps, reward=reward)

# Use the custom environment
config = langmarl.BaseConfig(
    paradigm="central_credit",
    llm=langmarl.LLMConfig.from_preset("gpt-4o-mini"),
)
env = langmarl.make_env("my_env", config)
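
The reward in collect_trajectory above is a plain substring test, which also fires when the ground truth appears inside a longer token (for example "4" inside "14"). The sketch below contrasts it with a stricter standalone-token match. Both helper names are hypothetical and independent of LangMARL; which rule suits a given task is a design choice, not something the library prescribes.

```python
import re


def substring_reward(ground_truth: str, output: str) -> float:
    """Same rule as the example environment: any substring occurrence counts."""
    return 1.0 if ground_truth in output else 0.0


def token_reward(ground_truth: str, output: str) -> float:
    """Reward 1.0 only when ground_truth is not glued to other word characters."""
    pattern = r"(?<!\w)" + re.escape(ground_truth) + r"(?!\w)"
    return 1.0 if re.search(pattern, output) else 0.0


print(substring_reward("4", "The answer is 14."))  # substring rule accepts "14"
print(token_reward("4", "The answer is 14."))      # stricter rule rejects it
print(token_reward("4", "2+2 = 4."))               # standalone "4" is accepted
```

The stricter rule still treats punctuation as a boundary, so "4" would match inside "4.5"; tasks with numeric answers may need a dedicated parser.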

Using callbacks

Add callbacks to hook into the training loop:

# Reuses config, env, critic, and optimizer from the programmatic usage example
trainer = langmarl.MonteCarloTrainer(
    config=config,
    env=env,
    critic=critic,
    optimizer=optimizer,
    callbacks=[
        langmarl.LoggingCallback(),
        langmarl.EarlyStoppingCallback(patience=3, min_delta=0.01),
    ],
)
trainer.train()
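
To make the patience and min_delta arguments concrete, here is a minimal sketch of the usual early-stopping rule: stop once the tracked metric has failed to improve by more than min_delta for patience consecutive iterations. This is an illustration under that assumption, not the implementation of EarlyStoppingCallback; the class name EarlyStopper and the higher-is-better metric are hypothetical.

```python
class EarlyStopper:
    """Hypothetical patience/min_delta tracker for a higher-is-better metric."""

    def __init__(self, patience: int = 3, min_delta: float = 0.01):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.stale = 0  # consecutive iterations without sufficient improvement

    def update(self, value: float) -> bool:
        """Record one iteration's metric; return True when training should stop."""
        if value > self.best + self.min_delta:
            self.best = value
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience


stopper = EarlyStopper(patience=3, min_delta=0.01)
decisions = [stopper.update(v) for v in [0.50, 0.60, 0.605, 0.60, 0.609]]
print(decisions)  # → [False, False, False, False, True]
```

In this run the metric peaks at 0.60; the next three values never exceed 0.61, so the third stale iteration triggers the stop.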

Using different LLMs per role

Assign different models to actors, critics, and optimizers:

config = langmarl.LanguageTaskConfig(
    task_type="qa",
    paradigm="central_credit",
    actor_llm=langmarl.LLMConfig.from_preset("gpt-4o-mini"),    # cheap for actors
    critic_llm=langmarl.LLMConfig.from_preset("gpt-4o"),         # strong for critic
    optimizer_llm=langmarl.LLMConfig.from_preset("gpt-4o-mini"), # cheap for optimizer
    num_agents=2,
    num_iterations=5,
)

If only llm is set, it is used as the fallback for all three roles.
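
The fallback rule can be pictured as follows. This is a sketch of the behaviour described above, not LangMARL's code: the RoleLLMs class and its resolve method are hypothetical, and model names stand in for full LLMConfig objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RoleLLMs:
    """Hypothetical stand-in for the LLM-related fields of a config."""

    llm: Optional[str] = None
    actor_llm: Optional[str] = None
    critic_llm: Optional[str] = None
    optimizer_llm: Optional[str] = None

    def resolve(self, role: str) -> str:
        """Return the role-specific model if set, otherwise the shared llm."""
        if role not in ("actor", "critic", "optimizer"):
            raise ValueError(f"unknown role: {role!r}")
        specific = getattr(self, f"{role}_llm")
        chosen = specific if specific is not None else self.llm
        if chosen is None:
            raise ValueError(f"no LLM configured for role {role!r}")
        return chosen


shared = RoleLLMs(llm="gpt-4o-mini")
mixed = RoleLLMs(llm="gpt-4o-mini", critic_llm="gpt-4o")
print(shared.resolve("critic"))  # falls back to the shared model
print(mixed.resolve("critic"))   # role-specific setting wins
```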

Configuration

JSON config files

Training runs can be configured via JSON files:

{
    "exp_name": "qa_central_credit",
    "paradigm": "central_credit",
    "task_type": "qa",
    "benchmark_path": "env/lang_benchmark/HotPotQA",
    "num_agents": 2,
    "num_iterations": 5,
    "trajectories_per_iteration": 10,
    "llm": "gpt-4o-mini",
    "log_level": "INFO"
}
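
When several runs differ in only a few fields, the JSON can be generated rather than edited by hand. The sketch below uses only the standard library and the field names from the example above; the helper make_variant and the variant's experiment name are hypothetical.

```python
import json

# Base settings copied from the example config above.
BASE_CONFIG = {
    "exp_name": "qa_central_credit",
    "paradigm": "central_credit",
    "task_type": "qa",
    "benchmark_path": "env/lang_benchmark/HotPotQA",
    "num_agents": 2,
    "num_iterations": 5,
    "trajectories_per_iteration": 10,
    "llm": "gpt-4o-mini",
    "log_level": "INFO",
}


def make_variant(base: dict, **overrides) -> dict:
    """Return a copy of base with overrides applied; base is not modified."""
    return {**base, **overrides}


# Hypothetical variant: same task, three agents, its own experiment name.
variant = make_variant(BASE_CONFIG, exp_name="qa_central_credit_3agents", num_agents=3)
config_text = json.dumps(variant, indent=4)
print(config_text)
```

Saving config_text to a file (for instance under configs/, a path chosen here only for illustration) yields a config that can be passed to langmarl.train.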

The llm field accepts either a predefined model name (string) or a full LLM config object:

{
    "llm": {
        "name": "my-model",
        "model_string": "Qwen/Qwen2.5-72B-Instruct",
        "base_url": "https://api.together.xyz/v1",
        "api_key_env_var": "TOGETHER_API_KEY"
    }
}
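
One way to think about the two accepted forms is as a normalization step that turns either into a full config dictionary. The sketch below illustrates that idea only; it is not how LangMARL parses the field. The function normalize_llm_field, the two-entry PRESETS excerpt, and the choice of name and model_string as required keys are all assumptions.

```python
from typing import Union

# Small excerpt of predefined names; model strings are those listed under Supported Models.
PRESETS = {
    "gpt-4o-mini": {"name": "gpt-4o-mini", "model_string": "gpt-4o-mini"},
    "qwen-72b": {"name": "qwen-72b", "model_string": "Qwen/Qwen2.5-72B-Instruct-Turbo"},
}


def normalize_llm_field(value: Union[str, dict]) -> dict:
    """Turn a predefined name or a config object into a config dictionary."""
    if isinstance(value, str):
        if value not in PRESETS:
            raise ValueError(f"unknown predefined model: {value!r}")
        return dict(PRESETS[value])
    if isinstance(value, dict):
        # Assumed minimum: a display name and the provider's model identifier.
        missing = {"name", "model_string"} - value.keys()
        if missing:
            raise ValueError(f"llm config is missing keys: {sorted(missing)}")
        return dict(value)
    raise TypeError(f"llm must be a string or an object, got {type(value).__name__}")


print(normalize_llm_field("gpt-4o-mini"))
print(normalize_llm_field({
    "name": "my-model",
    "model_string": "Qwen/Qwen2.5-72B-Instruct",
    "base_url": "https://api.together.xyz/v1",
    "api_key_env_var": "TOGETHER_API_KEY",
}))
```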

Supported Models

All providers use the OpenAI-compatible API format. Predefined models:

Name               Model String                             Provider
-----------------  ---------------------------------------  ------------
gpt-4o             gpt-4o                                   OpenAI
gpt-4o-mini        gpt-4o-mini                              OpenAI
gpt-5              gpt-5                                    OpenAI
o1                 o1                                       OpenAI
o1-mini            o1-mini                                  OpenAI
gemini-pro         gemini-1.5-pro                           Google
gemini-flash       gemini-1.5-flash                         Google
gemini-2.0-flash   gemini-2.0-flash                         Google
llama-3.1-70b      meta-llama/llama-3.1-70b-instruct        Together
llama-3.1-8b       meta-llama/llama-3.1-8b-instruct         Together
llama-3.3-70b      meta-llama/Llama-3.3-70B-Instruct-Turbo  Together
qwen-72b           Qwen/Qwen2.5-72B-Instruct-Turbo          Together
qwen-7b            Qwen/Qwen2.5-7B-Instruct-Turbo           Together
qwen-coder-32b     Qwen/Qwen2.5-Coder-32B-Instruct          Together
deepseek-chat      deepseek-chat                            DeepSeek
deepseek-reasoner  deepseek-reasoner                        DeepSeek
ollama-llama3      llama3                                   Local Ollama
ollama-qwen2       qwen2                                    Local Ollama

You can also create a custom LLMConfig for any OpenAI-compatible endpoint:

custom_llm = langmarl.LLMConfig(
    name="my-custom-model",
    model_string="my-model-id",
    base_url="http://localhost:8000/v1",
    api_key="my-key",
    max_tokens=4096,
    input_price_per_million=0.0,
    output_price_per_million=0.0,
)
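
The two price fields suggest per-token cost accounting. As a rough sketch, assuming both prices are quoted per one million tokens in the same currency, the cost of a call could be estimated as below. The function estimate_cost is hypothetical, and the prices in the example are made-up round numbers, not any provider's rates.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimate cost from token counts and per-million-token prices."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# Illustrative prices only: 0.5 per million input tokens, 2.0 per million output tokens.
print(estimate_cost(2_000_000, 500_000, 0.5, 2.0))  # → 2.0
# A local endpoint configured with zero prices costs nothing.
print(estimate_cost(2_000_000, 500_000, 0.0, 0.0))  # → 0.0
```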

Output Structure

Training runs produce the following directory structure:

experiments/
+-- {exp_name}_{timestamp}/
    +-- config.json
    +-- run.log
    +-- metrics.json
    +-- trajectories/
    |   +-- iter_0/
    |   |   +-- episode_0.json
    |   |   +-- episode_1.json
    |   +-- iter_1/
    +-- checkpoints/
    |   +-- iter_0/
    |   |   +-- agent_1.txt
    |   |   +-- agent_2.txt
    |   |   +-- metadata.json
    |   +-- iter_1/
    +-- gradients/
    |   +-- iter_0/
    |       +-- agent_1_gradients.json
    |       +-- agent_1_aggregated.txt
    +-- evaluations/
        +-- iter_0/
            +-- episode_0_eval.txt
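
When post-processing a run, note that directory names such as iter_10 sort before iter_2 alphabetically. The sketch below orders iteration directories numerically and locates the most recent checkpoint. It assumes the iter_N naming shown above with no zero padding; the helper names are hypothetical and not part of LangMARL.

```python
import re
from pathlib import Path
from typing import Iterable, List, Optional

_ITER_RE = re.compile(r"iter_(\d+)")


def iteration_index(name: str) -> Optional[int]:
    """Return N for a name of the form iter_N, or None for anything else."""
    match = _ITER_RE.fullmatch(name)
    return int(match.group(1)) if match else None


def sort_iterations(names: Iterable[str]) -> List[str]:
    """Keep only iter_N names and order them by N rather than alphabetically."""
    indexed = [(iteration_index(n), n) for n in names]
    return [n for _, n in sorted((i, n) for i, n in indexed if i is not None)]


def latest_checkpoint(run_dir) -> Optional[Path]:
    """Return the highest-numbered checkpoints/iter_N directory, if any."""
    ckpt_root = Path(run_dir) / "checkpoints"
    if not ckpt_root.is_dir():
        return None
    names = sort_iterations(p.name for p in ckpt_root.iterdir() if p.is_dir())
    return ckpt_root / names[-1] if names else None


print(sort_iterations(["iter_10", "iter_2", "iter_0", "metadata.json"]))
# → ['iter_0', 'iter_2', 'iter_10']
```

The same ordering helper applies to the trajectories, gradients, and evaluations subdirectories, which follow the same iter_N layout.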