Custom Reward

This page documents quickstart/custom_reward.py in mint-quickstart.

What this recipe does

  • Shows the standard MinT custom reward pattern: sample on the server, score on the client, convert rewards into advantages, then train with importance_sampling.
  • Uses arithmetic prompts so the reward stays easy to inspect.
  • Replaces binary exact-match scoring with a shaped reward that gives partial credit for well-formed and numerically close answers.

Reward breakdown

The script computes a bounded reward in [0, 1]:

Component        Range       Meaning
format_reward    0.0 or 0.2  The response contains a parseable integer
distance_reward  0.0 to 0.5  The prediction is numerically close to the target
exact_bonus      0.0 or 0.3  The parsed integer exactly matches the target

This keeps the loop simple while providing a denser learning signal than the fully binary 0/1 reward in the basic quickstart.
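A few worked cases make the shaping concrete. The sketch below reimplements the breakdown locally; the integer-extraction regex is an assumption about what the script's extract_prediction helper does, but the component weights mirror the table above:

```python
import re

def extract_prediction(response: str):
    # Assumed helper: pull the first integer out of the model's text.
    match = re.search(r"-?\d+", response)
    return int(match.group()) if match else None

def shaped_reward(response: str, correct_answer: int) -> float:
    prediction = extract_prediction(response)
    if prediction is None:
        return 0.0  # no parseable integer: every component is zero
    format_reward = 0.2                          # well-formed output
    error = abs(prediction - correct_answer)
    distance_scale = max(abs(correct_answer), 20)
    closeness = max(0.0, 1.0 - min(error / distance_scale, 1.0))
    distance_reward = 0.5 * closeness            # partial credit for being close
    exact_bonus = 0.3 if prediction == correct_answer else 0.0
    return min(1.0, format_reward + distance_reward + exact_bonus)

print(shaped_reward("The answer is 42.", 42))  # full credit: format + distance + exact
print(shaped_reward("maybe 40?", 42))          # partial credit for a near miss
print(shaped_reward("no idea", 42))            # no parseable integer
```

Note that a near miss still earns the format component plus most of the distance component, which is exactly the gradient signal the binary reward lacks.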

How to run

export MINT_API_KEY=sk-...
python quickstart/custom_reward.py

Parameters (env vars)

  • MINT_BASE_MODEL: default Qwen/Qwen3-0.6B
  • MINT_LORA_RANK: default 16
  • MINT_CUSTOM_REWARD_STEPS: default 8
  • MINT_CUSTOM_REWARD_BATCH: default 8
  • MINT_CUSTOM_REWARD_GROUP: default 6
  • MINT_CUSTOM_REWARD_LR: default 2e-5
  • MINT_CUSTOM_REWARD_MAX_TOKENS: default 16
  • MINT_CUSTOM_REWARD_TEMPERATURE: default 0.8
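These are ordinary environment variables, so the script can read them with stdlib defaults. A minimal sketch of that pattern (the Python variable names here are assumptions, except RL_LR, which appears in the key code on this page):

```python
import os

# Fall back to the documented defaults when the env var is unset.
BASE_MODEL = os.environ.get("MINT_BASE_MODEL", "Qwen/Qwen3-0.6B")
LORA_RANK = int(os.environ.get("MINT_LORA_RANK", "16"))
STEPS = int(os.environ.get("MINT_CUSTOM_REWARD_STEPS", "8"))
BATCH_SIZE = int(os.environ.get("MINT_CUSTOM_REWARD_BATCH", "8"))
GROUP_SIZE = int(os.environ.get("MINT_CUSTOM_REWARD_GROUP", "6"))
RL_LR = float(os.environ.get("MINT_CUSTOM_REWARD_LR", "2e-5"))
MAX_TOKENS = int(os.environ.get("MINT_CUSTOM_REWARD_MAX_TOKENS", "16"))
TEMPERATURE = float(os.environ.get("MINT_CUSTOM_REWARD_TEMPERATURE", "0.8"))
```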

Key code

def compute_reward_breakdown(response: str, correct_answer: int) -> RewardBreakdown:
    prediction = extract_prediction(response)
    if prediction is None:
        # No parseable integer: every component is zero.
        return RewardBreakdown(total=0.0, format_reward=0.0, distance_reward=0.0, exact_bonus=0.0)

    format_reward = 0.2
    error = abs(prediction - correct_answer)
    # Scale the error by the target's magnitude (floored at 20) so small
    # targets do not make the distance term needlessly harsh.
    distance_scale = max(abs(correct_answer), 20)
    closeness = max(0.0, 1.0 - min(error / distance_scale, 1.0))
    distance_reward = 0.5 * closeness
    exact_bonus = 0.3 if prediction == correct_answer else 0.0
    total = min(1.0, format_reward + distance_reward + exact_bonus)
    return RewardBreakdown(total, format_reward, distance_reward, exact_bonus)

# Group-mean baseline: center each reward within its sampling group.
mean_reward = sum(group_rewards) / len(group_rewards)
advantages = [reward - mean_reward for reward in group_rewards]
 
training_client.forward_backward(
    training_datums,
    loss_fn="importance_sampling",
).result()
training_client.optim_step(types.AdamParams(learning_rate=RL_LR)).result()
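The advantage step is just mean-centering within each group of samples for the same prompt. A quick numeric check (the reward values are made up for illustration):

```python
# Hypothetical shaped rewards for one prompt's group of 6 samples.
group_rewards = [1.0, 0.7, 0.2, 0.0, 1.0, 0.5]

mean_reward = sum(group_rewards) / len(group_rewards)
advantages = [reward - mean_reward for reward in group_rewards]

# Centering makes the advantages sum to (numerically) zero: the update
# reinforces above-average samples and suppresses below-average ones.
assert abs(sum(advantages)) < 1e-9
print(advantages)
```

Because the baseline is computed per group rather than per batch, a prompt where every sample fails still produces zero advantages instead of uniformly negative ones.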

Expected output

  • Prints avg_reward, exact_rate, format_rate, and datums every step.
  • Finishes with Saved: ....

When to use this pattern

  • You can score outputs directly in Python.
  • You want partial credit, style penalties, or tool-cost penalties before moving to more complex RL pipelines.
  • You do not need a server-hosted reward callback; the reward logic lives entirely in the client script.