Custom Reward

This page documents quickstart/custom_reward.py in mint-quickstart.

What this recipe does

  • Shows the standard MinT custom reward pattern: sample on the server, score on the client, convert rewards into advantages, then train with importance_sampling.
  • Uses arithmetic prompts so the reward stays easy to inspect.
  • Replaces binary exact-match scoring with a shaped reward that gives partial credit for well-formed and numerically close answers.

Reward breakdown

The script computes a bounded reward in [0, 1]:

Component        Range       Meaning
format_reward    0.0 or 0.2  The response contains a parseable integer
distance_reward  0.0 to 0.5  The prediction is numerically close to the target
exact_bonus      0.0 or 0.3  The parsed integer exactly matches the target

This keeps the loop simple while providing a denser learning signal than the fully binary 0/1 reward in the basic quickstart.
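A few worked cases make the shaping concrete. The sketch below reimplements the breakdown locally; the integer-extraction regex is an assumption about what the script's extract_prediction helper does, but the component weights mirror the table above:

```python
import re

def extract_prediction(response: str):
    # Assumed helper: pull the first integer out of the model's text.
    match = re.search(r"-?\d+", response)
    return int(match.group()) if match else None

def shaped_reward(response: str, correct_answer: int) -> float:
    prediction = extract_prediction(response)
    if prediction is None:
        return 0.0  # no parseable integer: every component is zero
    format_reward = 0.2                          # well-formed output
    error = abs(prediction - correct_answer)
    distance_scale = max(abs(correct_answer), 20)
    closeness = max(0.0, 1.0 - min(error / distance_scale, 1.0))
    distance_reward = 0.5 * closeness            # partial credit for being close
    exact_bonus = 0.3 if prediction == correct_answer else 0.0
    return min(1.0, format_reward + distance_reward + exact_bonus)

print(shaped_reward("The answer is 42.", 42))  # full credit: format + distance + exact
print(shaped_reward("maybe 40?", 42))          # partial credit for a near miss
print(shaped_reward("no idea", 42))            # no parseable integer
```

Note that a near miss still earns the format component plus most of the distance component, which is exactly the gradient signal the binary reward lacks.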

How to run

export MINT_API_KEY=sk-...
python quickstart/custom_reward.py

Parameters (env vars)

  • MINT_BASE_MODEL: default Qwen/Qwen3-0.6B
  • MINT_LORA_RANK: default 16
  • MINT_CUSTOM_REWARD_STEPS: default 8
  • MINT_CUSTOM_REWARD_BATCH: default 8
  • MINT_CUSTOM_REWARD_GROUP: default 6
  • MINT_CUSTOM_REWARD_LR: default 2e-5
  • MINT_CUSTOM_REWARD_MAX_TOKENS: default 16
  • MINT_CUSTOM_REWARD_TEMPERATURE: default 0.8
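These are ordinary environment variables, so the script can read them with stdlib defaults. A minimal sketch of that pattern (the Python variable names here are assumptions, except RL_LR, which appears in the key code on this page):

```python
import os

# Fall back to the documented defaults when the env var is unset.
BASE_MODEL = os.environ.get("MINT_BASE_MODEL", "Qwen/Qwen3-0.6B")
LORA_RANK = int(os.environ.get("MINT_LORA_RANK", "16"))
STEPS = int(os.environ.get("MINT_CUSTOM_REWARD_STEPS", "8"))
BATCH_SIZE = int(os.environ.get("MINT_CUSTOM_REWARD_BATCH", "8"))
GROUP_SIZE = int(os.environ.get("MINT_CUSTOM_REWARD_GROUP", "6"))
RL_LR = float(os.environ.get("MINT_CUSTOM_REWARD_LR", "2e-5"))
MAX_TOKENS = int(os.environ.get("MINT_CUSTOM_REWARD_MAX_TOKENS", "16"))
TEMPERATURE = float(os.environ.get("MINT_CUSTOM_REWARD_TEMPERATURE", "0.8"))
```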

Key code

def compute_reward_breakdown(response: str, correct_answer: int) -> RewardBreakdown:
    prediction = extract_prediction(response)
    if prediction is None:
        # No parseable integer: every component is zero.
        return RewardBreakdown(total=0.0, format_reward=0.0, distance_reward=0.0, exact_bonus=0.0)

    format_reward = 0.2
    error = abs(prediction - correct_answer)
    # Scale the error by the target's magnitude (floored at 20) so small
    # targets do not make the distance term needlessly harsh.
    distance_scale = max(abs(correct_answer), 20)
    closeness = max(0.0, 1.0 - min(error / distance_scale, 1.0))
    distance_reward = 0.5 * closeness
    exact_bonus = 0.3 if prediction == correct_answer else 0.0
    total = min(1.0, format_reward + distance_reward + exact_bonus)
    return RewardBreakdown(total, format_reward, distance_reward, exact_bonus)

# Group-mean baseline: center each reward within its sampling group.
mean_reward = sum(group_rewards) / len(group_rewards)
advantages = [reward - mean_reward for reward in group_rewards]
 
training_client.forward_backward(
    training_datums,
    loss_fn="importance_sampling",
).result()
training_client.optim_step(types.AdamParams(learning_rate=RL_LR)).result()
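The advantage step is just mean-centering within each group of samples for the same prompt. A quick numeric check (the reward values are made up for illustration):

```python
# Hypothetical shaped rewards for one prompt's group of 6 samples.
group_rewards = [1.0, 0.7, 0.2, 0.0, 1.0, 0.5]

mean_reward = sum(group_rewards) / len(group_rewards)
advantages = [reward - mean_reward for reward in group_rewards]

# Centering makes the advantages sum to (numerically) zero: the update
# reinforces above-average samples and suppresses below-average ones.
assert abs(sum(advantages)) < 1e-9
print(advantages)
```

Because the baseline is computed per group rather than per batch, a prompt where every sample fails still produces zero advantages instead of uniformly negative ones.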

Expected output

  • Prints avg_reward, exact_rate, format_rate, and datums every step.
  • Finishes with Saved: ....

When to use this pattern

  • You can score outputs directly in Python.
  • You want partial credit, style penalties, or tool-cost penalties before moving to more complex RL pipelines.
  • You do not need a server-hosted reward callback; the reward logic lives entirely in the client script.