Resume Training from a Checkpoint

This page documents the resume command of advanced/checkpoint.py in mint-quickstart.

Two resume modes

  • Weights only: the script first tries create_training_client_from_state(path), which auto-detects the model settings from the checkpoint metadata. If the metadata lookup returns 404 for a raw checkpoint path, the script falls back to create_lora_training_client(...) plus load_state(path), using MINT_BASE_MODEL and MINT_LORA_RANK (or their defaults). The optimizer state is reset in this mode.
  • With optimizer: create_lora_training_client(...) plus load_state_with_optimizer(path) preserves optimizer momentum. This mode requires MINT_BASE_MODEL and MINT_LORA_RANK to be set to values that match the checkpoint.
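
The environment handling for the two modes can be sketched as follows. This is a minimal sketch, not the script's actual code: the helper name resolve_lora_config is hypothetical, and the default values are assumptions taken from the sample values shown elsewhere on this page, so the real defaults may differ.

```python
import os
from typing import Mapping, Optional, Tuple

# Assumed defaults, copied from the sample values on this page;
# the real script may use different ones.
ASSUMED_DEFAULT_MODEL = "Qwen/Qwen3-0.6B"
ASSUMED_DEFAULT_RANK = 16


def resolve_lora_config(
    with_optimizer: bool,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, int]:
    """Hypothetical helper: pick base model and LoRA rank from the environment."""
    env = os.environ if env is None else env
    model = env.get("MINT_BASE_MODEL")
    rank = env.get("MINT_LORA_RANK")

    if with_optimizer and (not model or not rank):
        # Optimizer resume needs an explicit model and rank that match the checkpoint.
        raise ValueError(
            "--with-optimizer requires MINT_BASE_MODEL and MINT_LORA_RANK"
        )

    # Weights-only resume may fall back to the (assumed) defaults.
    return (
        model or ASSUMED_DEFAULT_MODEL,
        int(rank) if rank else ASSUMED_DEFAULT_RANK,
    )
```

Failing early when --with-optimizer is requested without both variables surfaces a configuration mistake before any remote call is made.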

Use the MinT endpoint that matches your region:

  • Mainland China: https://mint-cn.macaron.xin/
  • Outside Mainland China: https://mint.macaron.xin/
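
If the endpoint is chosen in code, a small lookup keeps the two URLs in one place. This is a sketch only: the region keys "cn" and "global" and the function name endpoint_for_region are hypothetical labels, and how the URL is handed to the MinT client is not covered on this page.

```python
# Endpoint URLs are copied from this page; the region keys are hypothetical labels.
MINT_ENDPOINTS = {
    "cn": "https://mint-cn.macaron.xin/",
    "global": "https://mint.macaron.xin/",
}


def endpoint_for_region(region: str) -> str:
    """Return the MinT endpoint for a region key, or raise on an unknown key."""
    try:
        return MINT_ENDPOINTS[region]
    except KeyError:
        known = ", ".join(sorted(MINT_ENDPOINTS))
        raise ValueError(f"unknown region {region!r}; expected one of: {known}") from None
```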

Commands

# Weights only
export MINT_API_KEY=sk-...
python advanced/checkpoint.py resume tinker://<run-id>/weights/<checkpoint-name>
 
# Preserve optimizer state
export MINT_API_KEY=sk-...
export MINT_BASE_MODEL=Qwen/Qwen3-0.6B
export MINT_LORA_RANK=16
python advanced/checkpoint.py resume tinker://<run-id>/weights/<checkpoint-name> --with-optimizer --steps 3

Useful flags:

  • --steps: number of post-resume SFT steps to run
  • --lr: learning rate for those steps
  • --save-name: name of the checkpoint saved after the resume steps finish
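
The command-line surface described above could be declared with argparse roughly as follows. This is an illustrative sketch, not the script's source: the function name build_parser is hypothetical, and every default value here is an assumption (3 steps and "resumed-checkpoint" mirror the sample output on this page, while the learning rate is a placeholder).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented resume flags."""
    parser = argparse.ArgumentParser(prog="checkpoint.py")
    sub = parser.add_subparsers(dest="command", required=True)

    resume = sub.add_parser("resume", help="resume training from a checkpoint")
    resume.add_argument("path", help="checkpoint path to resume from")
    resume.add_argument(
        "--with-optimizer",
        action="store_true",
        help="also restore optimizer state",
    )
    # All defaults below are assumptions, not confirmed script defaults.
    resume.add_argument("--steps", type=int, default=3,
                        help="number of post-resume SFT steps")
    resume.add_argument("--lr", type=float, default=1e-4,
                        help="learning rate for those steps")
    resume.add_argument("--save-name", default="resumed-checkpoint",
                        help="name of the checkpoint saved afterwards")
    return parser
```

Parsing ["resume", "tinker://<run-id>/weights/<checkpoint-name>", "--with-optimizer", "--steps", "3"] with this parser yields with_optimizer=True and steps=3, matching the second command above.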

Core APIs

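# Weights only: auto-detect from checkpoint metadata (optimizer resets)
tc = service_client.create_training_client_from_state(resume_path)

# Weights only: explicit fallback when the metadata lookup returns 404
tc = service_client.create_lora_training_client(base_model=model, rank=rank)
tc.load_state(resume_path).result()

# Weights plus optimizer state
tc = service_client.create_lora_training_client(base_model=model, rank=rank)
tc.load_state_with_optimizer(resume_path).result()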
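
The three calls above can be combined into one selection routine. The sketch below uses only the method names shown on this page and treats service_client as a duck-typed object. How the client reports a 404 is not documented here, so the is_not_found check (an exception carrying a status_code attribute) is an assumption, and both function names are hypothetical.

```python
from typing import Any


def is_not_found(exc: Exception) -> bool:
    """Assumption: the client error exposes an HTTP status as `status_code`."""
    return getattr(exc, "status_code", None) == 404


def open_training_client(
    service_client: Any,
    resume_path: str,
    model: str,
    rank: int,
    with_optimizer: bool,
) -> Any:
    """Hypothetical helper: return a training client loaded from resume_path."""
    if with_optimizer:
        # Explicit client, then restore weights and optimizer state together.
        tc = service_client.create_lora_training_client(base_model=model, rank=rank)
        tc.load_state_with_optimizer(resume_path).result()
        return tc

    try:
        # Weights only: let the service auto-detect from checkpoint metadata.
        return service_client.create_training_client_from_state(resume_path)
    except Exception as exc:
        if not is_not_found(exc):
            raise  # Only a 404 on the metadata lookup triggers the fallback.

    # Fallback: explicit model and rank, then load weights only.
    tc = service_client.create_lora_training_client(base_model=model, rank=rank)
    tc.load_state(resume_path).result()
    return tc
```

Re-raising anything other than a 404 keeps authentication or availability errors visible instead of masking them behind the fallback.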

Expected output

[resume] path=tinker://.../weights/my-ckpt-state with_optimizer=False steps=3
[resume] creating training client from state (optimizer resets)...
[resume] auto-detect state metadata lookup returned 404; retrying with explicit model/rank from env/defaults
[resume] fallback to explicit training client: model=Qwen/Qwen3-0.6B rank=16
[resume] loading state from tinker://.../weights/my-ckpt-state...
[resume] loaded, running 3 SFT step(s)...
[resume] step 1/3 done
[resume] saved: tinker://.../weights/resumed-checkpoint
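
When chaining runs, the saved checkpoint path can be pulled out of this output. The helper below is a hypothetical sketch that assumes the "[resume] saved: <path>" line keeps the format shown above; it is not part of checkpoint.py.

```python
import re
from typing import Optional

# Matches the final line of the sample output, e.g. "[resume] saved: tinker://..."
_SAVED_LINE = re.compile(r"^\[resume\] saved: (\S+)$")


def extract_saved_path(log_text: str) -> Optional[str]:
    """Return the path from the first '[resume] saved:' line, or None if absent."""
    for line in log_text.splitlines():
        match = _SAVED_LINE.match(line.strip())
        if match:
            return match.group(1)
    return None
```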

Common failure cases

  • the checkpoint path is missing or invalid
  • --with-optimizer is used without matching MINT_BASE_MODEL / MINT_LORA_RANK
  • the checkpoint was saved for a different adapter shape than the new client
  • the base model is unavailable for your account

See also

  • Generate a checkpoint to resume from: Save
  • Pull a server-side checkpoint archive to local disk: Download
  • Push a local archive back to MinT: Upload