API Reference
=============

.. module:: langmarl

Convenience Functions
---------------------

.. function:: train(config_path, **overrides)

   Train in a single call: loads the config, creates the components, and runs training.

   :param config_path: Path to a JSON config file.
   :type config_path: str
   :param overrides: Keyword arguments that override fields of the loaded config.
   :returns: Training metrics.

   .. code-block:: python

      import langmarl

      langmarl.train("configs/language_task/qa_central_credit.json")
      langmarl.train("configs/qa.json", num_iterations=10, paradigm="central_global")

.. function:: load_config(path, overrides=None)

   Load a configuration from a JSON file.

   :param path: Path to JSON config file.
   :type path: str
   :param overrides: Optional dictionary of overrides.
   :type overrides: dict or None
   :returns: ``BaseConfig`` or subclass instance (``LanguageTaskConfig``, etc.).
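
One plausible reading of "overrides" is a shallow merge on top of the values read from the JSON file. The sketch below illustrates that reading with the standard library only. The helper name ``load_with_overrides`` is hypothetical, and the real ``load_config`` returns a config object rather than a plain dict and may merge differently.

.. code-block:: python

   import json
   import os
   import tempfile


   def load_with_overrides(path, overrides=None):
       """Read a JSON file and apply top-level overrides (hypothetical helper)."""
       with open(path, encoding="utf-8") as f:
           data = json.load(f)
       # Shallow merge: an override replaces the whole top-level value.
       data.update(overrides or {})
       return data


   # Write a throwaway config so the example is self-contained.
   with tempfile.TemporaryDirectory() as tmp:
       path = os.path.join(tmp, "qa.json")
       with open(path, "w", encoding="utf-8") as f:
           json.dump({"exp_name": "qa", "num_iterations": 5}, f)
       merged = load_with_overrides(path, {"num_iterations": 10})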

.. function:: make_env(name, config)

   Create an environment instance by registered name.

   :param name: Environment name (e.g., ``"language"``, ``"overcooked"``, ``"pistonball"``).
   :type name: str
   :param config: Configuration object.
   :type config: BaseConfig
   :returns: Environment instance.
   :rtype: BaseEnvironment
   :raises ValueError: If the environment name is not registered.

.. function:: register_env(name)

   Decorator to register a custom environment class.

   :param name: Name to register the environment under.
   :type name: str

   .. code-block:: python

      import langmarl

      @langmarl.register_env("my_env")
      class MyEnv(langmarl.BaseEnvironment):
          ...

.. function:: list_envs()

   List all registered environment names.

   :returns: List of environment name strings.
   :rtype: list[str]
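
The relationship between ``register_env``, ``make_env``, and ``list_envs`` can be pictured as a decorator-based registry. The following is a minimal stand-alone sketch of that pattern, not langmarl's actual implementation: the ``_REGISTRY`` dict and the ``EchoEnv`` class are hypothetical, and sorting the names in ``list_envs`` is an assumption.

.. code-block:: python

   # Hypothetical name -> class registry; not langmarl's internals.
   _REGISTRY = {}


   def register_env(name):
       """Return a class decorator that records the class under ``name``."""
       def decorator(cls):
           _REGISTRY[name] = cls
           return cls  # hand the class back unchanged so it stays usable directly
       return decorator


   def make_env(name, config):
       """Instantiate a registered environment, or raise ValueError."""
       if name not in _REGISTRY:
           raise ValueError(f"Unknown environment: {name!r}")
       return _REGISTRY[name](config)


   def list_envs():
       """Return the registered names (sorted here by assumption)."""
       return sorted(_REGISTRY)


   @register_env("echo")
   class EchoEnv:
       def __init__(self, config):
           self.config = config


   env = make_env("echo", {"num_agents": 2})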

Configuration
-------------

BaseConfig
~~~~~~~~~~

.. class:: BaseConfig

   Shared configuration fields across all environments.

   :param exp_name: Experiment name (default: ``"experiment"``).
   :type exp_name: str
   :param paradigm: Training paradigm. Either ``"central_global"`` or ``"central_credit"`` (default: ``"central_credit"``).
   :type paradigm: str
   :param num_iterations: Number of training iterations (default: ``5``).
   :type num_iterations: int
   :param trajectories_per_iteration: Episodes per iteration (default: ``10``).
   :type trajectories_per_iteration: int
   :param mini_batch_size: If set, subsample this many trajectories for gradient computation (default: ``None``, which uses all trajectories).
   :type mini_batch_size: int or None
   :param start_iteration: Iteration to start from (default: ``0``).
   :type start_iteration: int
   :param llm: Default LLM config, used as fallback for actor/critic/optimizer (default: ``None``).
   :type llm: LLMConfig or None
   :param actor_llm: LLM config for actors. Falls back to ``llm`` if not set.
   :type actor_llm: LLMConfig or None
   :param critic_llm: LLM config for the critic. Falls back to ``llm`` if not set.
   :type critic_llm: LLMConfig or None
   :param optimizer_llm: LLM config for the optimizer. Falls back to ``llm`` if not set.
   :type optimizer_llm: LLMConfig or None
   :param num_agents: Number of agents (default: ``2``).
   :type num_agents: int
   :param experiment_dir: Directory to save experiment data (default: ``"./experiments"``).
   :type experiment_dir: str
   :param checkpoint_dir: Directory to save policy checkpoints (default: ``"./ckpt_policy"``).
   :type checkpoint_dir: str
   :param max_workers: Maximum parallel workers for trajectory generation (default: ``5``).
   :type max_workers: int
   :param log_level: Logging level (default: ``"INFO"``).
   :type log_level: str

   .. method:: get_actor_llm()

      Get LLM config for actors. Returns ``actor_llm`` if set, otherwise ``llm``.

      :rtype: LLMConfig

   .. method:: get_critic_llm()

      Get LLM config for the critic. Returns ``critic_llm`` if set, otherwise ``llm``.

      :rtype: LLMConfig

   .. method:: get_optimizer_llm()

      Get LLM config for the optimizer. Returns ``optimizer_llm`` if set, otherwise ``llm``.

      :rtype: LLMConfig

   .. method:: from_json(path, overrides=None)
      :classmethod:

      Load config from a JSON file with optional overrides. LLM fields can be
      specified as predefined model name strings or full config dicts.

      :param path: Path to JSON config file.
      :type path: str
      :param overrides: Optional dictionary of overrides.
      :type overrides: dict or None
      :returns: Config instance.

   .. method:: to_json(path)

      Save config to a JSON file.

      :param path: Output file path.
      :type path: str
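
The fallback rule shared by ``get_actor_llm``, ``get_critic_llm``, and ``get_optimizer_llm`` can be sketched as below. ``LLMConfigSketch`` and ``ConfigSketch`` are simplified, hypothetical stand-ins that model only the four LLM fields; they are not the library's classes.

.. code-block:: python

   from dataclasses import dataclass
   from typing import Optional


   @dataclass
   class LLMConfigSketch:  # hypothetical stand-in with a single field
       name: str


   @dataclass
   class ConfigSketch:  # hypothetical stand-in; only the LLM fields are modeled
       llm: Optional[LLMConfigSketch] = None
       actor_llm: Optional[LLMConfigSketch] = None
       critic_llm: Optional[LLMConfigSketch] = None
       optimizer_llm: Optional[LLMConfigSketch] = None

       def get_actor_llm(self):
           # A role-specific config wins; otherwise use the shared default.
           return self.actor_llm if self.actor_llm is not None else self.llm

       def get_critic_llm(self):
           return self.critic_llm if self.critic_llm is not None else self.llm

       def get_optimizer_llm(self):
           return self.optimizer_llm if self.optimizer_llm is not None else self.llm


   config = ConfigSketch(
       llm=LLMConfigSketch("gpt-4o-mini"),
       critic_llm=LLMConfigSketch("gpt-4o"),
   )

Here only the critic has its own model, so the actor and optimizer getters both return the shared ``llm`` entry.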

LanguageTaskConfig
~~~~~~~~~~~~~~~~~~

.. class:: LanguageTaskConfig

   Configuration for language task environments. Inherits all fields from :class:`BaseConfig`.

   :param task_type: Task type. One of ``"qa"``, ``"math"``, ``"writing"``, ``"coding"`` (default: ``"qa"``).
   :type task_type: str
   :param benchmark_path: Path to the benchmark dataset directory.
   :type benchmark_path: str
   :param data_limit: Limit the number of tasks loaded from the benchmark (default: ``None``, which loads all tasks).
   :type data_limit: int or None
   :param use_verified_reward: Enable LLM-as-judge for reward verification (default: ``False``).
   :type use_verified_reward: bool
   :param episode_generation_workers: Parallel workers for episode generation (default: ``8``).
   :type episode_generation_workers: int
   :param optimizer_workers: Parallel workers for gradient generation (default: ``1``).
   :type optimizer_workers: int

OvercookedConfig
~~~~~~~~~~~~~~~~

.. class:: OvercookedConfig

   Configuration for the Overcooked environment. Inherits all fields from :class:`BaseConfig`.

   :param layout: Kitchen layout name (default: ``"cramped_room"``).
   :type layout: str
   :param episode_horizon: Maximum timesteps per episode (default: ``400``).
   :type episode_horizon: int
   :param p0_agent: Agent 0 type (default: ``"ProAgent"``).
   :type p0_agent: str
   :param p1_agent: Agent 1 type (default: ``"ProAgent"``).
   :type p1_agent: str

PistonballConfig
~~~~~~~~~~~~~~~~

.. class:: PistonballConfig

   Configuration for the Pistonball environment. Inherits all fields from :class:`BaseConfig`.

   :param num_pistons: Number of pistons (default: ``20``).
   :type num_pistons: int
   :param max_cycles: Maximum cycles per episode (default: ``125``).
   :type max_cycles: int
   :param frame_size: Observation frame size (default: ``84``).
   :type frame_size: int
   :param action_mode: ``"discrete"`` or ``"continuous"`` (default: ``"discrete"``).
   :type action_mode: str

LLMConfig
~~~~~~~~~

.. class:: LLMConfig

   Configuration for a single LLM model. All providers use the OpenAI-compatible API format.

   :param name: Human-readable model identifier (e.g., ``"gpt-4o"``).
   :type name: str
   :param model_string: Model string passed to the API (e.g., ``"gpt-4o"``).
   :type model_string: str
   :param base_url: Custom API base URL. ``None`` selects the default OpenAI endpoint.
   :type base_url: str or None
   :param api_key: API key. ``None`` means the key is read from the environment variable named by ``api_key_env_var``.
   :type api_key: str or None
   :param api_key_env_var: Environment variable name for the API key (default: ``"OPENAI_API_KEY"``).
   :type api_key_env_var: str
   :param is_multimodal: Whether the model supports multimodal input (default: ``False``).
   :type is_multimodal: bool
   :param max_tokens: Maximum tokens per response (default: ``4096``).
   :type max_tokens: int
   :param input_price_per_million: Price in USD per million input tokens.
   :type input_price_per_million: float
   :param output_price_per_million: Price in USD per million output tokens.
   :type output_price_per_million: float
   :param extra_params: Additional provider-specific parameters.
   :type extra_params: dict

   .. method:: get_api_key()

      Get API key from config or environment variable.

      :returns: API key string.
      :raises ValueError: If no API key is found.

   .. method:: from_preset(name)
      :classmethod:

      Create an ``LLMConfig`` from a predefined model name.

      :param name: Predefined model name (e.g., ``"gpt-4o-mini"``).
      :type name: str
      :returns: LLMConfig instance.

   .. method:: from_dict(data)
      :classmethod:

      Create an ``LLMConfig`` from a dictionary.

   .. method:: to_dict()

      Convert to a dictionary for JSON serialization.
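
The key lookup described for ``get_api_key`` (explicit key first, then the environment variable, otherwise ``ValueError``) can be sketched as a plain function. The name ``resolve_api_key`` and the ``DEMO_LLM_KEY`` variable are hypothetical, and the lookup order is an assumption based on the parameter descriptions above.

.. code-block:: python

   import os


   def resolve_api_key(api_key=None, api_key_env_var="OPENAI_API_KEY"):
       """Return the explicit key if given, else the environment variable's value."""
       if api_key:
           return api_key
       value = os.environ.get(api_key_env_var)
       if not value:
           raise ValueError(
               f"No API key found; set {api_key_env_var} or pass api_key."
           )
       return value


   # Use a made-up variable name so no real credentials are touched.
   os.environ["DEMO_LLM_KEY"] = "sk-demo"
   from_env = resolve_api_key(api_key_env_var="DEMO_LLM_KEY")
   explicit = resolve_api_key(api_key="sk-explicit", api_key_env_var="DEMO_LLM_KEY")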

.. function:: get_llm_config(name_or_path)

   Get LLM config by predefined name or load from a JSON file path.

   :param name_or_path: Predefined model name or path to JSON config file.
   :type name_or_path: str
   :returns: LLMConfig instance.
   :raises ValueError: If the name is not recognized and the path does not exist.

.. function:: list_available_models()

   List all available predefined models with their descriptions.

   :returns: Dict mapping model name to description string.
   :rtype: dict[str, str]

Core Abstractions
-----------------

Trajectory
~~~~~~~~~~

.. class:: Trajectory

   A single episode trajectory (dataclass).

   :param task: Task description dict (e.g., ``{"question": "...", "ground_truth": "..."}``).
   :type task: dict
   :param steps: List of step dicts, each containing ``agent_id``, ``observation``, ``action``, etc.
   :type steps: list[dict]
   :param reward: Episode reward.
   :type reward: float
   :param metadata: Optional metadata dict (default: ``{}``).
   :type metadata: dict
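
As an illustration of the fields above, the sketch below defines an equivalent dataclass and builds one instance. ``TrajectorySketch`` is a hypothetical stand-in that mirrors the documented fields; the real class may carry extra behaviour.

.. code-block:: python

   from dataclasses import dataclass, field


   @dataclass
   class TrajectorySketch:  # hypothetical stand-in mirroring the documented fields
       task: dict
       steps: list
       reward: float
       # default_factory gives every instance its own empty dict
       metadata: dict = field(default_factory=dict)


   traj = TrajectorySketch(
       task={"question": "2 + 2?", "ground_truth": "4"},
       steps=[{"agent_id": "agent_0", "observation": "2 + 2?", "action": "4"}],
       reward=1.0,
   )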

BaseEnvironment
~~~~~~~~~~~~~~~

.. class:: BaseEnvironment

   Abstract base class for environment adapters. Subclass this to add new environments.

   .. method:: reset(task)
      :abstractmethod:

      Reset the environment with the given task.

      :param task: Task description dict.
      :type task: dict
      :returns: Initial observations dict.

   .. method:: step(agent_id, action)
      :abstractmethod:

      Execute an agent's action.

      :param agent_id: Agent identifier string.
      :param action: Action string.
      :returns: Tuple of ``(observation, reward, done, info)``.

   .. method:: collect_trajectory(policies, task)
      :abstractmethod:

      Run a full episode with the given policies and return a trajectory.

      :param policies: Dict mapping agent name to policy text.
      :type policies: dict[str, str]
      :param task: Task description dict.
      :type task: dict
      :returns: Episode trajectory.
      :rtype: Trajectory

BaseAgent
~~~~~~~~~

.. class:: BaseAgent

   Abstract base class for agents with language policies.

   .. method:: act(observation, policy)
      :abstractmethod:

      Given an observation and a policy, produce an action.

      :param observation: Text observation.
      :type observation: str
      :param policy: Language policy (system prompt).
      :type policy: str
      :returns: Action text.
      :rtype: str

BaseCritic
~~~~~~~~~~

.. class:: BaseCritic

   Abstract base class for trajectory evaluators.

   .. method:: evaluate(trajectory, policies)
      :abstractmethod:

      Evaluate a trajectory and return per-agent credits.

      :param trajectory: Episode trajectory.
      :type trajectory: Trajectory
      :param policies: Current agent policies.
      :type policies: dict[str, str]
      :returns: Evaluation dict with ``raw_response`` and ``per_agent`` credits.
      :rtype: dict

BaseReward
~~~~~~~~~~

.. class:: BaseReward

   Abstract base class for reward computation.

   .. method:: compute(trajectory)
      :abstractmethod:

      Compute reward for a trajectory.

      :param trajectory: Episode trajectory.
      :type trajectory: Trajectory
      :returns: Reward value.
      :rtype: float

BaseOptimizer
~~~~~~~~~~~~~

.. class:: BaseOptimizer

   Abstract base class for language gradient optimizers.

   .. method:: generate_gradient(policy, evaluation, context)
      :abstractmethod:

      Generate a language gradient (improvement instruction) for one agent.

      :param policy: Current policy text.
      :param evaluation: Evaluation feedback text.
      :param context: Task context text.
      :returns: Gradient text.
      :rtype: str

   .. method:: apply_gradient(policy, gradient)
      :abstractmethod:

      Apply a gradient to a policy, returning the updated policy.

      :param policy: Current policy text.
      :param gradient: Gradient text.
      :returns: Updated policy text.
      :rtype: str

   .. method:: aggregate_gradients(gradients)
      :abstractmethod:

      Aggregate multiple gradients into a single gradient.

      :param gradients: List of gradient texts.
      :type gradients: list[str]
      :returns: Aggregated gradient text.
      :rtype: str

Concrete Implementations
------------------------

CentralizedCritic
~~~~~~~~~~~~~~~~~

.. class:: CentralizedCritic(config, prompts_dir=None)

   Centralized critic supporting ``central_global`` and ``central_credit`` paradigms.
   Evaluates multi-agent sequential collaboration trajectories using LLM-as-judge.

   :param config: Configuration with ``paradigm``, ``num_agents``, and LLM fields.
   :type config: BaseConfig
   :param prompts_dir: Optional custom path to evaluation prompt templates.
   :type prompts_dir: Path or None

   .. method:: evaluate(trajectory, policies)

      Evaluate a trajectory. For ``central_credit``, returns per-agent causal credits.
      For ``central_global``, returns the same evaluation for all agents.

      :param trajectory: Episode trajectory.
      :type trajectory: Trajectory
      :param policies: Current agent policies.
      :type policies: dict[str, str]
      :returns: Dict with keys ``raw_response``, ``paradigm``, and ``per_agent``.
      :rtype: dict

PolicyGradientOptimizer
~~~~~~~~~~~~~~~~~~~~~~~

.. class:: PolicyGradientOptimizer(llm_config)

   LLM-based policy gradient optimizer. Generates language gradients (concrete
   improvement instructions) and applies them to agent policies.

   :param llm_config: LLM configuration for the optimizer model.
   :type llm_config: LLMConfig

   .. method:: generate_gradient(policy, evaluation, context, agent_name="agent")

      Generate a per-agent improvement instruction based on evaluation feedback.

      :param policy: Agent's current policy text.
      :param evaluation: Evaluation feedback for this agent.
      :param context: Task context (truncated to 800 characters).
      :param agent_name: Agent name for logging.
      :returns: Gradient text (2-4 sentences of specific improvement advice).
      :rtype: str

   .. method:: generate_shared_gradient(evaluation, task_context)

      Generate a shared gradient for all agents (used in ``central_global`` paradigm).

      :param evaluation: Team-level evaluation feedback.
      :param task_context: Task context (truncated to 800 characters).
      :returns: Shared gradient text.
      :rtype: str

   .. staticmethod:: apply_gradient(base_policy, gradient)

      Apply a gradient to a base policy. Appends the gradient as a
      ``[CASE-SPECIFIC FEEDBACK]`` section. The base policy is never modified;
      the feedback section is replaced on every call.

      :param base_policy: Original policy text.
      :param gradient: Gradient text to append.
      :returns: Updated policy text.
      :rtype: str

   .. staticmethod:: aggregate_gradients(gradients)

      Aggregate multiple gradients (from multiple episodes) into one by
      joining with separator markers.

      :param gradients: List of gradient texts.
      :type gradients: list[str]
      :returns: Aggregated gradient text.
      :rtype: str

   .. staticmethod:: parse_credit_response(response, agent_names)

      Parse per-agent evaluations from a credit-assignment LLM response.
      Handles JSON output (language tasks) and bracket markers (Overcooked, Pistonball);
      otherwise falls back to returning the full response for every agent.

      :param response: Raw LLM evaluation response.
      :type response: str
      :param agent_names: List of agent name strings.
      :type agent_names: list[str]
      :returns: Dict mapping agent name to evaluation text.
      :rtype: dict[str, str]
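
The behaviour described for ``apply_gradient`` (one ``[CASE-SPECIFIC FEEDBACK]`` section that is replaced rather than stacked) and ``aggregate_gradients`` can be sketched with plain string handling. This is an illustration under assumptions, not the library's code: the exact whitespace layout and the ``--- Episode N ---`` separator are invented for the example.

.. code-block:: python

   FEEDBACK_MARKER = "[CASE-SPECIFIC FEEDBACK]"


   def apply_gradient(base_policy, gradient):
       """Append ``gradient`` under the marker, dropping any earlier feedback."""
       # Keep only the text before an existing marker so repeated calls do not stack.
       base = base_policy.split(FEEDBACK_MARKER, 1)[0].rstrip()
       return f"{base}\n\n{FEEDBACK_MARKER}\n{gradient.strip()}"


   def aggregate_gradients(gradients):
       """Join per-episode gradients; the separator format is an assumption."""
       parts = [
           f"--- Episode {i} ---\n{g.strip()}"
           for i, g in enumerate(gradients, start=1)
       ]
       return "\n\n".join(parts)


   policy = "You are a careful QA agent."
   once = apply_gradient(policy, "Cite the source passage.")
   twice = apply_gradient(once, "Answer in one sentence.")
   combined = aggregate_gradients(["Be concise.", "Check units."])

After the second call, ``twice`` still starts with the original policy and contains only the newer feedback.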

TrajectoryFormatter
~~~~~~~~~~~~~~~~~~~

.. class:: TrajectoryFormatter

   Formats episode trajectories for LLM evaluation.

   .. staticmethod:: format_trajectory(episode)

      Format an episode into a text string for global evaluation.

      :param episode: Episode dict with ``task``, ``transitions``, ``reward``.
      :returns: Formatted trajectory string.
      :rtype: str

   .. staticmethod:: format_for_credit_assignment(episode)

      Format an episode for per-agent credit assignment, with structured
      per-agent contribution sections.

      :param episode: Episode dict.
      :returns: Formatted trajectory string.
      :rtype: str

   .. staticmethod:: format_trajectory_minimal(episode)

      Format a concise episode summary.

      :param episode: Episode dict.
      :returns: Minimal trajectory string.
      :rtype: str

Training
--------

MonteCarloTrainer
~~~~~~~~~~~~~~~~~

.. class:: MonteCarloTrainer(config, env, critic, optimizer, reward_fn=None, store=None, callbacks=None)

   Generic Monte Carlo trainer for any LangMARL environment. Implements a five-phase
   training iteration: load policies, generate trajectories, evaluate and generate
   gradients, aggregate and apply gradients, save checkpoint.

   :param config: Training configuration.
   :type config: BaseConfig
   :param env: Environment instance.
   :type env: BaseEnvironment
   :param critic: Critic instance.
   :type critic: BaseCritic
   :param optimizer: Optimizer instance.
   :type optimizer: BaseOptimizer
   :param reward_fn: Optional reward function for verified rewards.
   :type reward_fn: BaseReward or None
   :param store: Storage backend (default: ``LocalStore``).
   :type store: BaseStore or None
   :param callbacks: List of training callbacks.
   :type callbacks: list[Callback] or None

   .. method:: train(num_iterations=None)

      Main training loop. Automatically resumes from the latest checkpoint.

      :param num_iterations: Number of iterations to train. Defaults to ``config.num_iterations``.
      :type num_iterations: int or None
      :returns: Training metrics from the store.

   .. method:: train_one_iteration(iteration)

      Run a single training iteration (five phases).

      :param iteration: Iteration number.
      :type iteration: int
      :returns: Statistics dict with keys: ``paradigm``, ``num_episodes``, ``avg_reward``,
         ``min_reward``, ``max_reward``, ``rewards``, ``input_tokens``, ``output_tokens``,
         ``total_tokens``, ``cost_usd``.
      :rtype: dict
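
To make the shape of the statistics dict concrete, the sketch below assembles one from raw per-episode rewards and token counts. ``summarize_iteration`` is a hypothetical helper and the numbers are made up; it assumes ``total_tokens`` is the sum of input and output tokens.

.. code-block:: python

   def summarize_iteration(paradigm, rewards, input_tokens, output_tokens, cost_usd):
       """Build a statistics dict with the documented keys (hypothetical helper)."""
       if not rewards:
           raise ValueError("rewards must contain at least one episode reward")
       return {
           "paradigm": paradigm,
           "num_episodes": len(rewards),
           "avg_reward": sum(rewards) / len(rewards),
           "min_reward": min(rewards),
           "max_reward": max(rewards),
           "rewards": list(rewards),
           "input_tokens": input_tokens,
           "output_tokens": output_tokens,
           "total_tokens": input_tokens + output_tokens,
           "cost_usd": cost_usd,
       }


   stats = summarize_iteration(
       "central_credit", [0.0, 1.0, 0.5, 0.5], 12000, 3000, 0.0036
   )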

Callbacks
~~~~~~~~~

.. class:: Callback

   Abstract base class for training callbacks. Override any of the following methods:

   .. method:: on_iteration_start(iteration, trainer)
   .. method:: on_iteration_end(iteration, stats, trainer)
   .. method:: on_episode_complete(trajectory, trainer)
   .. method:: on_policy_update(agent_id, old_policy, new_policy)

.. class:: LoggingCallback

   Callback that logs iteration start/end events via the trainer's ``RunLogger``.

.. class:: CheckpointCallback

   Callback for checkpoint management (saving is handled by the trainer by default).

.. class:: EarlyStoppingCallback(patience=3, min_delta=0.01)

   Stop training when the average reward stops improving.

   :param patience: Number of iterations to wait without improvement before stopping.
   :type patience: int
   :param min_delta: Minimum improvement to qualify as progress.
   :type min_delta: float
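
The interaction of ``patience`` and ``min_delta`` can be sketched as a small counter. This is one common way to implement early stopping, shown under the assumption that an improvement must exceed the best average reward by more than ``min_delta``; ``EarlyStoppingSketch`` and its ``update`` method are hypothetical and do not use the ``Callback`` hooks.

.. code-block:: python

   class EarlyStoppingSketch:
       def __init__(self, patience=3, min_delta=0.01):
           self.patience = patience
           self.min_delta = min_delta
           self.best = None
           self.stale = 0  # iterations since the last qualifying improvement

       def update(self, avg_reward):
           """Record one iteration's average reward; return True to stop."""
           if self.best is None or avg_reward > self.best + self.min_delta:
               self.best = avg_reward
               self.stale = 0
           else:
               self.stale += 1
           return self.stale >= self.patience


   stopper = EarlyStoppingSketch(patience=2, min_delta=0.05)
   decisions = [stopper.update(r) for r in [0.50, 0.60, 0.62, 0.61]]

With these values the third and fourth rewards do not beat 0.60 by more than 0.05, so the stop signal fires on the fourth iteration.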

LLM Client
----------

LLMClient
~~~~~~~~~

.. class:: LLMClient(llm_config)

   Unified LLM client wrapping OpenAI-compatible APIs.

   :param llm_config: LLM configuration.
   :type llm_config: LLMConfig

   .. method:: chat(system_prompt, user_input, max_tokens=None)

      Send a chat completion request and return the response text.

      :param system_prompt: System message.
      :type system_prompt: str
      :param user_input: User message.
      :type user_input: str
      :param max_tokens: Maximum tokens (default: from config).
      :type max_tokens: int or None
      :returns: Response text.
      :rtype: str

   .. method:: chat_with_usage(system_prompt, user_input, max_tokens=None)

      Chat and return both response text and token usage.

      :returns: Tuple of ``(response_text, usage)``, where ``usage`` is a dict with integer ``"input"`` and ``"output"`` token counts.
      :rtype: tuple[str, dict[str, int]]

   .. attribute:: raw_client

      Access the underlying ``OpenAI`` client instance.

TokenTracker
~~~~~~~~~~~~

.. class:: TokenTracker(model="gpt-4o-mini", input_price=None, output_price=None)

   Tracks token usage and estimates costs for LLM API calls. Has built-in pricing
   for common models.

   :param model: Model name for pricing lookup.
   :type model: str
   :param input_price: Custom input price per million tokens (overrides lookup).
   :type input_price: float or None
   :param output_price: Custom output price per million tokens (overrides lookup).
   :type output_price: float or None

   .. method:: add_usage(input_tokens, output_tokens)

      Add token usage counts.

      :param input_tokens: Number of input tokens.
      :param output_tokens: Number of output tokens.

   .. method:: get_stats()

      Get usage statistics and cost estimate.

      :returns: Dict with ``model``, ``input_tokens``, ``output_tokens``, ``total_tokens``,
         ``input_cost_usd``, ``output_cost_usd``, ``cost_usd``.
      :rtype: dict

   .. method:: estimate_cost(input_tokens, output_tokens)

      Estimate cost for given token counts without adding to the tracker.

      :returns: Estimated cost in USD.
      :rtype: float

   .. method:: reset()

      Reset all token counters to zero.

   .. method:: get_summary_string()

      Get a human-readable summary of token usage and costs.

      :returns: Formatted summary string.
      :rtype: str
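
Because prices are quoted per million tokens, a cost estimate is each token count divided by one million and multiplied by the matching price. The sketch below shows that arithmetic as a free function; its signature differs from ``TokenTracker.estimate_cost`` and the prices are hypothetical round numbers, not real provider pricing.

.. code-block:: python

   def estimate_cost(input_tokens, output_tokens, input_price, output_price):
       """Return the USD cost given per-million-token prices."""
       input_cost = input_tokens / 1_000_000 * input_price
       output_cost = output_tokens / 1_000_000 * output_price
       return input_cost + output_cost


   # Hypothetical prices chosen for easy arithmetic.
   cost = estimate_cost(2_000_000, 500_000, input_price=1.0, output_price=4.0)

Here 2 million input tokens at 1.0 USD and 0.5 million output tokens at 4.0 USD each contribute 2.0 USD, for a total of 4.0 USD.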

Storage
-------

LocalStore
~~~~~~~~~~

.. class:: LocalStore(base_dir)

   Filesystem-backed storage backend. Manages training run data including
   trajectories, checkpoints, evaluations, gradients, metrics, and logs.

   :param base_dir: Base directory for all storage.
   :type base_dir: str

PolicyCheckpoint
~~~~~~~~~~~~~~~~

.. class:: PolicyCheckpoint(store, run_id, num_agents)

   Manages versioned policy snapshots. Stores one text file per agent per iteration.

   :param store: Storage backend.
   :type store: BaseStore
   :param run_id: Training run identifier.
   :type run_id: str
   :param num_agents: Number of agents.
   :type num_agents: int

   .. method:: get_policies(iteration=None)

      Load policies from the latest (or a specific) iteration. Generates default
      policies if no checkpoint exists.

      :param iteration: Specific iteration to load. ``None`` loads the latest.
      :type iteration: int or None
      :returns: Dict mapping agent name to policy text.
      :rtype: dict[str, str]

   .. method:: save_policies(iteration, policies, stats=None)

      Save policies as a new iteration checkpoint.

      :param iteration: Iteration number.
      :param policies: Dict mapping agent name to policy text.
      :param stats: Optional training statistics.

   .. method:: diff_policies(iter_a, iter_b)

      Compare policies across two iterations.

      :param iter_a: First iteration number.
      :param iter_b: Second iteration number.
      :returns: Dict mapping agent name to diff text.
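
A per-agent policy diff of the kind ``diff_policies`` returns can be produced with the standard ``difflib`` module. The sketch below works on two in-memory policy dicts rather than stored iterations, and the unified-diff format is an assumption about what "diff text" looks like.

.. code-block:: python

   import difflib


   def diff_policies(old, new):
       """Return a unified diff per agent for two {agent_name: policy_text} dicts."""
       diffs = {}
       for agent in sorted(set(old) | set(new)):
           before = old.get(agent, "").splitlines()
           after = new.get(agent, "").splitlines()
           lines = difflib.unified_diff(
               before, after, "iter_a", "iter_b", lineterm=""
           )
           # Identical policies produce no diff lines, hence an empty string.
           diffs[agent] = "\n".join(lines)
       return diffs


   result = diff_policies(
       {"agent_0": "Answer briefly.", "agent_1": "Verify the answer."},
       {"agent_0": "Answer briefly.\nCite sources.", "agent_1": "Verify the answer."},
   )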

TrajectoryStore
~~~~~~~~~~~~~~~

.. class:: TrajectoryStore(store, run_id)

   Manages episode trajectory persistence.

   .. method:: save(iteration, episode_id, trajectory)

      Save a trajectory for a given iteration and episode.

   .. method:: load(iteration, limit=None)

      Load trajectories for a given iteration.

      :param iteration: Iteration number.
      :param limit: Maximum number of trajectories to load.
      :returns: List of Trajectory objects.
      :rtype: list[Trajectory]

   .. method:: count(iteration)

      Count stored trajectories for a given iteration.

      :returns: Number of stored trajectories.
      :rtype: int

RunLogger
~~~~~~~~~

.. class:: RunLogger(store, run_id)

   Structured training logger. Writes to both file and console.

   .. method:: info(message)
   .. method:: debug(message)
   .. method:: warning(message)
   .. method:: error(message)
   .. method:: iteration_start(iteration, policies)
   .. method:: iteration_end(iteration, stats)
   .. method:: episode_saved(iteration, episode_id, reward)
   .. method:: evaluation_done(iteration, episode_id, paradigm, response)
   .. method:: gradient_saved(iteration, agent_name, num_gradients)