# Custom Reward

This page documents `quickstart/custom_reward.py` in mint-quickstart.

## What this recipe does
- Shows the standard MinT custom reward pattern: sample on the server, score on the client, convert rewards into advantages, then train with `importance_sampling`.
- Uses arithmetic prompts so the reward stays easy to inspect.
- Replaces binary exact-match scoring with a shaped reward that gives partial credit for well-formed and numerically close answers.
## Reward breakdown
The script computes a bounded reward in [0, 1]:
| Component | Range | Meaning |
|---|---|---|
| `format_reward` | 0.0 or 0.2 | The response contains a parseable integer |
| `distance_reward` | 0.0 to 0.5 | The prediction is numerically close to the target |
| `exact_bonus` | 0.0 or 0.3 | The parsed integer exactly matches the target |
This keeps the loop simple while avoiding the fully binary 0/1 signal from the basic quickstart.
## How to run
```bash
export MINT_API_KEY=sk-...
python quickstart/custom_reward.py
```

## Parameters (env vars)

- `MINT_BASE_MODEL`: default `Qwen/Qwen3-0.6B`
- `MINT_LORA_RANK`: default `16`
- `MINT_CUSTOM_REWARD_STEPS`: default `8`
- `MINT_CUSTOM_REWARD_BATCH`: default `8`
- `MINT_CUSTOM_REWARD_GROUP`: default `6`
- `MINT_CUSTOM_REWARD_LR`: default `2e-5`
- `MINT_CUSTOM_REWARD_MAX_TOKENS`: default `16`
- `MINT_CUSTOM_REWARD_TEMPERATURE`: default `0.8`
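A minimal sketch of reading these variables with their documented defaults (the variable names come from the list above; the exact constant names and parsing are an assumption, not the script's own code):

```python
import os

# Hyperparameters from environment variables, falling back to the
# documented defaults. Constant names here are illustrative.
BASE_MODEL = os.environ.get("MINT_BASE_MODEL", "Qwen/Qwen3-0.6B")
LORA_RANK = int(os.environ.get("MINT_LORA_RANK", "16"))
NUM_STEPS = int(os.environ.get("MINT_CUSTOM_REWARD_STEPS", "8"))
BATCH_SIZE = int(os.environ.get("MINT_CUSTOM_REWARD_BATCH", "8"))
GROUP_SIZE = int(os.environ.get("MINT_CUSTOM_REWARD_GROUP", "6"))
RL_LR = float(os.environ.get("MINT_CUSTOM_REWARD_LR", "2e-5"))
MAX_TOKENS = int(os.environ.get("MINT_CUSTOM_REWARD_MAX_TOKENS", "16"))
TEMPERATURE = float(os.environ.get("MINT_CUSTOM_REWARD_TEMPERATURE", "0.8"))
```

Because every knob is an env var, you can sweep settings (e.g. `MINT_CUSTOM_REWARD_GROUP=12 python quickstart/custom_reward.py`) without editing the script.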
## Key code
```python
def compute_reward_breakdown(response: str, correct_answer: int) -> RewardBreakdown:
    prediction = extract_prediction(response)
    if prediction is None:
        return RewardBreakdown(total=0.0, format_reward=0.0, distance_reward=0.0, exact_bonus=0.0)
    format_reward = 0.2
    error = abs(prediction - correct_answer)
    distance_scale = max(abs(correct_answer), 20)
    closeness = max(0.0, 1.0 - min(error / distance_scale, 1.0))
    distance_reward = 0.5 * closeness
    exact_bonus = 0.3 if prediction == correct_answer else 0.0
    total = min(1.0, format_reward + distance_reward + exact_bonus)
    return RewardBreakdown(total, format_reward, distance_reward, exact_bonus)
```

```python
mean_reward = sum(group_rewards) / len(group_rewards)
advantages = [reward - mean_reward for reward in group_rewards]
training_client.forward_backward(
    training_datums,
    loss_fn="importance_sampling",
).result()
training_client.optim_step(types.AdamParams(learning_rate=RL_LR)).result()
```

## Expected output
- Prints `avg_reward`, `exact_rate`, `format_rate`, and `datums` every step.
- Finishes with `Saved: ...`.
## When to use this pattern
- You can score outputs directly in Python.
- You want partial credit, style penalties, or tool-cost penalties before moving to more complex RL pipelines.
- You do not need a server-hosted reward callback; the reward logic lives entirely in the client script.
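Since the whole reward-to-advantage step is plain Python, it is easy to check numerically. A sketch with made-up rewards for one sampled group (not real script output):

```python
# Client-side group scoring: rewards become mean-centered advantages,
# so above-average completions get positive weight and below-average
# completions get negative weight. Rewards below are illustrative:
# exact match, near miss, format-only, unparseable.
group_rewards = [1.0, 0.67, 0.2, 0.0]

mean_reward = sum(group_rewards) / len(group_rewards)
advantages = [reward - mean_reward for reward in group_rewards]

print(mean_reward)  # 0.4675
print(advantages)   # exact match positive, unparseable most negative
print(sum(advantages))  # ~0.0: the group mean acts as the baseline
```

Because the group mean is the baseline, a group where every completion scores identically produces all-zero advantages and contributes no gradient, which is one reason shaped rewards (more score diversity per group) can help over a binary signal.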