Usage
=====

.. _installation:

Installation
------------

Install LangMARL:

.. code-block:: console

   $ pip install langmarl

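To add environment-specific dependencies, install the matching extra. The quotes keep shells such as zsh from expanding the square brackets:

.. code-block:: console

   $ pip install "langmarl[pettingzoo]"   # Pistonball / PettingZoo support
   $ pip install "langmarl[all]"          # All optional dependencies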

Set your LLM API key as an environment variable. For OpenAI models:

.. code-block:: console

   $ export OPENAI_API_KEY="your-api-key"

For other providers, set the corresponding key:

.. code-block:: console

   $ export GOOGLE_API_KEY="your-key"     # Google Gemini
   $ export TOGETHER_API_KEY="your-key"   # Together (Llama, Qwen)
   $ export DEEPSEEK_API_KEY="your-key"   # DeepSeek
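
Before launching a run, it can help to confirm that the key for your chosen provider is actually visible to Python. The following is a minimal sketch using only the standard library. The helper ``missing_api_keys`` and the ``PROVIDER_KEYS`` mapping are hypothetical and not part of LangMARL; only the variable names come from the commands above.

.. code-block:: python

   import os

   # Environment variable names taken from the export commands above.
   PROVIDER_KEYS = {
       "OpenAI": "OPENAI_API_KEY",
       "Google Gemini": "GOOGLE_API_KEY",
       "Together": "TOGETHER_API_KEY",
       "DeepSeek": "DEEPSEEK_API_KEY",
   }

   def missing_api_keys(providers, env=None):
       """Return the variable names that are unset or empty for the given providers."""
       env = os.environ if env is None else env
       return [PROVIDER_KEYS[p] for p in providers if not env.get(PROVIDER_KEYS[p])]

   # Example: check the provider you plan to use before training.
   missing = missing_api_keys(["OpenAI"])
   if missing:
       print("Missing API keys:", ", ".join(missing))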

Quick Start
-----------

One-line training
~~~~~~~~~~~~~~~~~

Train from a JSON config file with a single call:

.. code-block:: python

   import langmarl

   langmarl.train("configs/language_task/qa_central_credit.json")

Programmatic usage
~~~~~~~~~~~~~~~~~~

For full control over the training pipeline:

.. code-block:: python

   import langmarl

   # 1. Configure
   config = langmarl.LanguageTaskConfig(
       task_type="qa",
       paradigm="central_credit",
       llm=langmarl.LLMConfig.from_preset("gpt-4o-mini"),
       num_agents=2,
       num_iterations=5,
       trajectories_per_iteration=10,
   )

   # 2. Create components
   env = langmarl.make_env("language", config)
   critic = langmarl.CentralizedCritic(config)
   optimizer = langmarl.PolicyGradientOptimizer(config.get_optimizer_llm())

   # 3. Train
   trainer = langmarl.MonteCarloTrainer(
       config=config,
       env=env,
       critic=critic,
       optimizer=optimizer,
   )
   metrics = trainer.train()

Custom environment
~~~~~~~~~~~~~~~~~~

Register a custom environment by subclassing ``BaseEnvironment`` and decorating the class with ``register_env``:

.. code-block:: python

   import langmarl

   @langmarl.register_env("my_env")
   class MyEnv(langmarl.BaseEnvironment):
       def __init__(self, config):
           self.num_agents = config.num_agents
           self.llm_client = langmarl.LLMClient(config.get_actor_llm())

       def reset(self, task: dict) -> dict:
           return {"task": task}

       def step(self, agent_id: str, action: str):
           return {}, 0.0, False, {}

       def sample_tasks(self, num_samples: int) -> list[dict]:
           # Build a fresh dict per sample so tasks do not share one mutable object.
           return [{"question": "What is 2+2?", "ground_truth": "4"} for _ in range(num_samples)]

       def collect_trajectory(self, policies, task) -> langmarl.Trajectory:
           # Query each agent in turn with its current policy.
           steps = []
           for agent_id, policy in policies.items():
               response = self.llm_client.chat(policy, task["question"])
               steps.append(
                   {"agent_id": agent_id, "input": task["question"], "output": response}
               )
           # Binary reward: 1.0 if the last agent's output contains the ground truth.
           reward = 1.0 if task["ground_truth"] in steps[-1]["output"] else 0.0
           return langmarl.Trajectory(task=task, steps=steps, reward=reward)

   # Use the custom environment
   config = langmarl.BaseConfig(
       paradigm="central_credit",
       llm=langmarl.LLMConfig.from_preset("gpt-4o-mini"),
   )
   env = langmarl.make_env("my_env", config)

Using callbacks
~~~~~~~~~~~~~~~

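Pass callbacks to the trainer to hook into the training loop. This example reuses ``config``, ``env``, ``critic``, and ``optimizer`` from the programmatic example above: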

.. code-block:: python

   trainer = langmarl.MonteCarloTrainer(
       config=config,
       env=env,
       critic=critic,
       optimizer=optimizer,
       callbacks=[
           langmarl.LoggingCallback(),
           langmarl.EarlyStoppingCallback(patience=3, min_delta=0.01),
       ],
   )
   trainer.train()
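
To give a feel for what ``patience`` and ``min_delta`` usually mean, here is a standalone sketch of a common early-stopping rule: stop once ``patience`` consecutive iterations fail to improve on the best reward by more than ``min_delta``. This is an assumption about the conventional meaning of these parameters, not a description of how ``EarlyStoppingCallback`` is implemented. The function ``should_stop`` is hypothetical.

.. code-block:: python

   def should_stop(rewards, patience=3, min_delta=0.01):
       """Return True if `patience` consecutive rewards fail to beat the best by `min_delta`."""
       best = float("-inf")
       stale = 0  # consecutive iterations without sufficient improvement
       for reward in rewards:
           if reward > best + min_delta:
               best = reward
               stale = 0
           else:
               stale += 1
               if stale >= patience:
                   return True
       return False

   # Per-iteration mean rewards (illustrative values only).
   history = [0.10, 0.20, 0.205, 0.20, 0.209]
   print(should_stop(history, patience=3, min_delta=0.01))

Under this rule, small fluctuations below ``min_delta`` count as no progress, which keeps noisy rewards from resetting the counter.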

Using different LLMs per role
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Assign different models to actors, critics, and optimizers:

.. code-block:: python

   config = langmarl.LanguageTaskConfig(
       task_type="qa",
       paradigm="central_credit",
       actor_llm=langmarl.LLMConfig.from_preset("gpt-4o-mini"),    # cheap for actors
       critic_llm=langmarl.LLMConfig.from_preset("gpt-4o"),         # strong for critic
       optimizer_llm=langmarl.LLMConfig.from_preset("gpt-4o-mini"), # cheap for optimizer
       num_agents=2,
       num_iterations=5,
   )

If only ``llm`` is set, all three roles fall back to it.
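
The fallback rule can be pictured with a small standalone sketch. It assumes that a role-specific setting takes precedence over ``llm`` when both are given, which the word "fallback" suggests but which should be checked against the library. The function ``resolve_role_llms`` is hypothetical and uses plain strings in place of ``LLMConfig`` objects.

.. code-block:: python

   def resolve_role_llms(llm=None, actor_llm=None, critic_llm=None, optimizer_llm=None):
       """Pick the model for each role, using `llm` wherever no role-specific value is set."""
       specific = {"actor": actor_llm, "critic": critic_llm, "optimizer": optimizer_llm}
       resolved = {
           role: (value if value is not None else llm)
           for role, value in specific.items()
       }
       unset = [role for role, value in resolved.items() if value is None]
       if unset:
           raise ValueError(f"No LLM configured for: {', '.join(unset)}")
       return resolved

   # Only the critic is overridden; actor and optimizer fall back to `llm`.
   print(resolve_role_llms(llm="gpt-4o-mini", critic_llm="gpt-4o"))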

Configuration
-------------

JSON config files
~~~~~~~~~~~~~~~~~

Training runs can be configured via JSON files:

.. code-block:: json

   {
       "exp_name": "qa_central_credit",
       "paradigm": "central_credit",
       "task_type": "qa",
       "benchmark_path": "env/lang_benchmark/HotPotQA",
       "num_agents": 2,
       "num_iterations": 5,
       "trajectories_per_iteration": 10,
       "llm": "gpt-4o-mini",
       "log_level": "INFO"
   }

The ``llm`` field accepts either a predefined model name (string) or a full LLM config object:

.. code-block:: json

   {
       "llm": {
           "name": "my-model",
           "model_string": "Qwen/Qwen2.5-72B-Instruct",
           "base_url": "https://api.together.xyz/v1",
           "api_key_env_var": "TOGETHER_API_KEY"
       }
   }
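
If you generate or validate config files yourself, the two accepted shapes of ``llm`` can be told apart with a simple type check. The sketch below uses only the standard library. ``describe_llm_field`` is a hypothetical helper, not a LangMARL function, and it does not verify that a preset name actually exists.

.. code-block:: python

   import json

   def describe_llm_field(config_text):
       """Classify the `llm` field of a JSON config as a preset name or a custom object."""
       config = json.loads(config_text)
       llm = config["llm"]
       if isinstance(llm, str):
           return ("preset", llm)
       if isinstance(llm, dict):
           return ("custom", llm["model_string"])
       raise TypeError(f"'llm' must be a string or an object, got {type(llm).__name__}")

   print(describe_llm_field('{"llm": "gpt-4o-mini"}'))
   print(describe_llm_field('{"llm": {"name": "my-model", "model_string": "Qwen/Qwen2.5-72B-Instruct"}}'))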

Supported Models
----------------

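All providers are accessed through an OpenAI-compatible API. The following presets are predefined. Pass the name to ``LLMConfig.from_preset`` or use it as the ``llm`` string in a JSON config: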

.. list-table::
   :header-rows: 1
   :widths: 20 35 20

   * - Name
     - Model String
     - Provider
   * - ``gpt-4o``
     - gpt-4o
     - OpenAI
   * - ``gpt-4o-mini``
     - gpt-4o-mini
     - OpenAI
   * - ``gpt-5``
     - gpt-5
     - OpenAI
   * - ``o1``
     - o1
     - OpenAI
   * - ``o1-mini``
     - o1-mini
     - OpenAI
   * - ``gemini-pro``
     - gemini-1.5-pro
     - Google
   * - ``gemini-flash``
     - gemini-1.5-flash
     - Google
   * - ``gemini-2.0-flash``
     - gemini-2.0-flash
     - Google
   * - ``llama-3.1-70b``
     - meta-llama/llama-3.1-70b-instruct
     - Together
   * - ``llama-3.1-8b``
     - meta-llama/llama-3.1-8b-instruct
     - Together
   * - ``llama-3.3-70b``
     - meta-llama/Llama-3.3-70B-Instruct-Turbo
     - Together
   * - ``qwen-72b``
     - Qwen/Qwen2.5-72B-Instruct-Turbo
     - Together
   * - ``qwen-7b``
     - Qwen/Qwen2.5-7B-Instruct-Turbo
     - Together
   * - ``qwen-coder-32b``
     - Qwen/Qwen2.5-Coder-32B-Instruct
     - Together
   * - ``deepseek-chat``
     - deepseek-chat
     - DeepSeek
   * - ``deepseek-reasoner``
     - deepseek-reasoner
     - DeepSeek
   * - ``ollama-llama3``
     - llama3
     - Local Ollama
   * - ``ollama-qwen2``
     - qwen2
     - Local Ollama

You can also create a custom ``LLMConfig`` for any OpenAI-compatible endpoint:

.. code-block:: python

   custom_llm = langmarl.LLMConfig(
       name="my-custom-model",
       model_string="my-model-id",
       base_url="http://localhost:8000/v1",
       api_key="my-key",
       max_tokens=4096,
       input_price_per_million=0.0,
       output_price_per_million=0.0,
   )
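
The two price fields suggest a per-token cost model. Assuming they are prices per one million input and output tokens, a run's cost can be estimated as in the sketch below. ``estimate_cost`` is a hypothetical helper, and the token counts and prices in the example are made up for illustration.

.. code-block:: python

   def estimate_cost(input_tokens, output_tokens,
                     input_price_per_million, output_price_per_million):
       """Estimate cost from token counts and per-million-token prices."""
       return (
           input_tokens * input_price_per_million
           + output_tokens * output_price_per_million
       ) / 1_000_000

   # Illustrative numbers only: 2M input tokens and 0.5M output tokens.
   cost = estimate_cost(2_000_000, 500_000,
                        input_price_per_million=0.15,
                        output_price_per_million=0.60)
   print(f"Estimated cost: {cost:.2f}")

With both prices set to ``0.0``, as in the local endpoint example above, the estimate is always zero.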

Output Structure
----------------

Each training run writes its outputs to its own ``{exp_name}_{timestamp}`` directory under ``experiments/``:

.. code-block:: text

   experiments/
   +-- {exp_name}_{timestamp}/
       +-- config.json
       +-- run.log
       +-- metrics.json
       +-- trajectories/
       |   +-- iter_0/
       |   |   +-- episode_0.json
       |   |   +-- episode_1.json
       |   +-- iter_1/
       +-- checkpoints/
       |   +-- iter_0/
       |   |   +-- agent_1.txt
       |   |   +-- agent_2.txt
       |   |   +-- metadata.json
       |   +-- iter_1/
       +-- gradients/
       |   +-- iter_0/
       |       +-- agent_1_gradients.json
       |       +-- agent_1_aggregated.txt
       +-- evaluations/
           +-- iter_0/
               +-- episode_0_eval.txt
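
Given this layout, the most recent run can be located and its metrics loaded with the standard library. The sketch below makes two assumptions: runs are compared by directory modification time because the timestamp format is not specified here, and ``metrics.json`` is treated as opaque JSON because its schema is not documented in this section. Both helper names are hypothetical.

.. code-block:: python

   import json
   from pathlib import Path

   def latest_run_dir(root="experiments", exp_name=None):
       """Return the most recently modified run directory under `root`."""
       pattern = f"{exp_name}_*" if exp_name else "*"
       runs = [p for p in Path(root).glob(pattern) if p.is_dir()]
       if not runs:
           raise FileNotFoundError(f"No run directories matching {pattern!r} in {root}")
       return max(runs, key=lambda p: p.stat().st_mtime)

   def load_metrics(run_dir):
       """Load metrics.json from a run directory as plain Python data."""
       with open(Path(run_dir) / "metrics.json", encoding="utf-8") as f:
           return json.load(f)

   # Usage, once at least one run exists:
   #   run = latest_run_dir("experiments", exp_name="qa_central_credit")
   #   metrics = load_metrics(run)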