ResumableSequentialLR
- class ResumableSequentialLR(optimizer, schedulers, milestones, last_epoch=-1)
Bases: SequentialLR
SequentialLR with a corrected load_state_dict.
- Bug (PyTorch <= 2.10, confirmed on 2.10.0+cu129):
SequentialLR.load_state_dict correctly deserializes its internal bookkeeping (_last_lr, sub-scheduler states, last_epoch) but never writes the restored learning rates back to optimizer.param_groups. As a result, after loading a checkpoint, the optimizer silently continues with the LR that the freshly constructed scheduler initialized (typically the warmup start value) rather than the LR that training had reached when the checkpoint was saved. In practice, this means the LR schedule restarts from zero on every job resume.
- Fix:
After the parent load_state_dict finishes, copy _last_lr into the optimizer's param_groups so that the next optimizer.step() uses the restored learning rate.
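The Bug and Fix entries above can be sketched as follows. This is a minimal illustration, assuming the override needs nothing beyond copying _last_lr into the param groups; the class body is a stand-in and not the project's actual source. The build helper, the LinearLR and ExponentialLR pairing, and all hyperparameters are assumptions chosen only for the demo. The same scheduler state is restored into a plain SequentialLR and into the override so the two optimizer LRs can be compared.

```python
import copy

import torch
from torch.optim.lr_scheduler import ExponentialLR, LinearLR, SequentialLR


class ResumableSequentialLR(SequentialLR):
    """Illustrative stand-in for the documented class (assumed body)."""

    def load_state_dict(self, state_dict):
        super().load_state_dict(state_dict)
        # Push the restored LRs into the optimizer's param groups.
        for group, lr in zip(self.optimizer.param_groups, self._last_lr):
            group["lr"] = lr


def build(scheduler_cls):
    """Fresh model, optimizer, and warmup-then-decay schedule (hypothetical settings)."""
    model = torch.nn.Linear(2, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    warmup = LinearLR(optimizer, start_factor=0.1, total_iters=5)
    decay = ExponentialLR(optimizer, gamma=0.9)
    scheduler = scheduler_cls(optimizer, [warmup, decay], milestones=[5])
    return model, optimizer, scheduler


# Run a few steps past the warmup milestone, then snapshot the scheduler state.
model, optimizer, scheduler = build(ResumableSequentialLR)
for _ in range(8):
    optimizer.zero_grad()
    model(torch.randn(4, 2)).sum().backward()
    optimizer.step()
    scheduler.step()
saved_lr = optimizer.param_groups[0]["lr"]
state = scheduler.state_dict()

# Restore the same state into freshly built objects, once per scheduler class.
_, plain_opt, plain_sched = build(SequentialLR)
plain_sched.load_state_dict(copy.deepcopy(state))
_, fixed_opt, fixed_sched = build(ResumableSequentialLR)
fixed_sched.load_state_dict(copy.deepcopy(state))

print("LR when the state was saved:", saved_lr)
print("plain SequentialLR after load:", plain_opt.param_groups[0]["lr"])
print("override after load:", fixed_opt.param_groups[0]["lr"])
```

On PyTorch versions affected by the bug, the plain SequentialLR line is expected to show the warmup start LR, while the override line should match the saved LR. The printed values are not reproduced here because the plain result depends on the installed PyTorch version.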
See tests/test_checkpoint_resume.py::TestResumableSequentialLR for round-trip verification and a sentinel test that confirms the upstream bug still exists.
- Parameters:
- load_state_dict(state_dict)
Load scheduler state and propagate restored LRs to the optimizer.
Calls the parent SequentialLR.load_state_dict, then immediately copies each value in self._last_lr into the corresponding optimizer.param_groups[i]["lr"]. This ensures that the first optimizer.step() after a resume uses the learning rate that was active when the checkpoint was saved, rather than the freshly initialized (warmup-start) value.
- Parameters:
state_dict (dict) – Scheduler state dictionary as produced by state_dict(). Typically loaded from a checkpoint with torch.load and passed directly to this method.
- Return type:
None
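A hedged usage sketch of the resume flow described for load_state_dict: save the scheduler state with torch.save, reload it with torch.load, and pass it straight to the method. The import path of the real class is not shown on this page, so an assumed stand-in is defined inline. The make_training_objects and train_step helpers, the in-memory buffer used in place of a checkpoint file, and the hyperparameters are all hypothetical. Only the scheduler state is checkpointed to keep the example short; a real checkpoint would normally also hold model and optimizer state.

```python
import io

import torch
from torch.optim.lr_scheduler import ExponentialLR, LinearLR, SequentialLR


class ResumableSequentialLR(SequentialLR):
    """Inline stand-in for the documented class (assumed body, not the real source)."""

    def load_state_dict(self, state_dict):
        super().load_state_dict(state_dict)
        for group, lr in zip(self.optimizer.param_groups, self._last_lr):
            group["lr"] = lr


def make_training_objects():
    """Hypothetical setup: linear warmup until step 5, then exponential decay."""
    model = torch.nn.Linear(2, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    warmup = LinearLR(optimizer, start_factor=0.1, total_iters=5)
    decay = ExponentialLR(optimizer, gamma=0.9)
    scheduler = ResumableSequentialLR(optimizer, [warmup, decay], milestones=[5])
    return model, optimizer, scheduler


def train_step(model, optimizer, scheduler):
    """One dummy optimization step followed by a scheduler step."""
    optimizer.zero_grad()
    model(torch.randn(4, 2)).sum().backward()
    optimizer.step()
    scheduler.step()


# First job: train past the warmup milestone, then write a checkpoint.
model, optimizer, scheduler = make_training_objects()
for _ in range(8):
    train_step(model, optimizer, scheduler)
lr_at_save = optimizer.param_groups[0]["lr"]
checkpoint_file = io.BytesIO()  # in-memory stand-in for a checkpoint path
torch.save({"scheduler": scheduler.state_dict()}, checkpoint_file)

# Resumed job: rebuild everything, then restore the scheduler state.
checkpoint_file.seek(0)
checkpoint = torch.load(checkpoint_file)
model, optimizer, scheduler = make_training_objects()
scheduler.load_state_dict(checkpoint["scheduler"])
lr_after_resume = optimizer.param_groups[0]["lr"]

# The next step continues the decay from the restored LR.
train_step(model, optimizer, scheduler)
lr_next = optimizer.param_groups[0]["lr"]
print(lr_at_save, lr_after_resume, lr_next)
```

With the override in place, lr_after_resume should equal lr_at_save, and the following step should continue the decay from that value instead of from the warmup start.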