WalltimeCheckpointer

class WalltimeCheckpointer(*args, **kwargs)

Bases: Callback

Checkpoints and stops the training when a walltime limit is reached.

This is useful for Slurm jobs that have a hard time limit. Stopping short of the limit leaves enough time to save a checkpoint and exit gracefully, so that the job can be resumed automatically.
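The timing rule described above can be sketched as follows. This is a minimal illustration, not the library's implementation: it assumes the stop point is the start time plus the time limit minus the buffer, and the helper name `seconds_until_stop` and its `now` argument are hypothetical.

```python
import time


def seconds_until_stop(start_time, time_limit_hours=4.0, buffer_minutes=2.0, now=None):
    """Return the seconds left before a checkpoint-and-stop would be due.

    Assumed rule: stop once elapsed time >= time limit - buffer.
    """
    if now is None:
        now = time.time()
    # Convert hours and minutes to seconds so everything shares one unit.
    deadline = start_time + time_limit_hours * 3600.0 - buffer_minutes * 60.0
    return deadline - now


# A job that started 3 h 57 min ago, with a 4 h limit and a 2 min buffer,
# has 60 s left before the stop point.
start = 1_000_000.0
remaining = seconds_until_stop(start, now=start + 3 * 3600 + 57 * 60)
print(remaining)  # 60.0
```

A value of zero or less means the buffer window has been entered and the checkpoint should be written.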

Parameters:
  • start_time (float)

  • checkpoint_dir (str | Path)

  • time_limit_hours (float)

  • buffer_minutes (float)

  • checkpoint_filename (str)

  • stop_trainer_after_time_limit (bool)

__init__(
start_time,
checkpoint_dir,
time_limit_hours=4.0,
buffer_minutes=2.0,
checkpoint_filename='last.ckpt',
stop_trainer_after_time_limit=True,
)

WalltimeCheckpointer constructor.

Parameters:
  • start_time (float) – Unix timestamp, in seconds, at which the job started (e.g. from time.time() or date +%s).

  • checkpoint_dir (str | Path) – Directory in which to save the checkpoint.

  • time_limit_hours (float) – The allocation time limit in hours.

  • buffer_minutes (float) – How many minutes before the limit to stop.

  • checkpoint_filename (str) – Name of the checkpoint file to save.

  • stop_trainer_after_time_limit (bool) – Whether to stop the trainer once the time limit is reached.
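One way to assemble the constructor arguments inside a batch job is sketched below. The environment variable names (`JOB_START_TIME`, `JOB_TIME_LIMIT_HOURS`, `CHECKPOINT_DIR`) and the helper `walltime_kwargs` are hypothetical. They are assumed to be exported by the job script itself, for example with `export JOB_START_TIME=$(date +%s)`. The import path of `WalltimeCheckpointer` is not given here, so the final construction is left as a comment.

```python
import os
import time
from pathlib import Path


def walltime_kwargs(env=os.environ):
    """Collect keyword arguments matching the documented constructor.

    All environment variable names are assumptions made for this example.
    """
    return {
        # Fall back to "now" if the job script did not record a start time.
        "start_time": float(env.get("JOB_START_TIME", time.time())),
        "checkpoint_dir": Path(env.get("CHECKPOINT_DIR", "checkpoints")),
        "time_limit_hours": float(env.get("JOB_TIME_LIMIT_HOURS", 4.0)),
        "buffer_minutes": 2.0,
        "checkpoint_filename": "last.ckpt",
        "stop_trainer_after_time_limit": True,
    }


kwargs = walltime_kwargs({"JOB_START_TIME": "1700000000", "JOB_TIME_LIMIT_HOURS": "12"})
# callback = WalltimeCheckpointer(**kwargs)  # import path depends on your installation
# trainer = pytorch_lightning.Trainer(callbacks=[callback])
```

Recording the start time in the job script rather than in Python means that time spent on imports and data loading still counts against the limit.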

on_val_batch_end(
trainer,
pl_module,
outputs,
batch,
batch_idx,
)

Check the elapsed time after every validation batch.

Parameters:
  • trainer (pytorch_lightning.Trainer)

  • pl_module (pytorch_lightning.LightningModule)

on_train_batch_end(
trainer,
pl_module,
outputs,
batch,
batch_idx,
)

Check the elapsed time after every training batch.

Parameters:
  • trainer (pytorch_lightning.Trainer)

  • pl_module (pytorch_lightning.LightningModule)
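How the two batch-end hooks might share a single time check is sketched below. This is an illustrative stand-in under stated assumptions, not the actual class. `WalltimeCheckSketch`, its `clock` argument, and the fake trainer are hypothetical, and the class deliberately does not subclass `Callback` so that it runs without Lightning installed. It relies only on the trainer exposing `save_checkpoint(path)` and a `should_stop` attribute.

```python
import time
from pathlib import Path
from types import SimpleNamespace


class WalltimeCheckSketch:
    """Illustrative stand-in; the real callback may behave differently."""

    def __init__(self, start_time, checkpoint_dir, time_limit_hours=4.0,
                 buffer_minutes=2.0, checkpoint_filename="last.ckpt",
                 stop_trainer_after_time_limit=True, clock=time.time):
        # Assumed rule: the stop point is the limit minus the buffer.
        self.deadline = start_time + time_limit_hours * 3600 - buffer_minutes * 60
        self.checkpoint_path = Path(checkpoint_dir) / checkpoint_filename
        self.stop_trainer = stop_trainer_after_time_limit
        self.clock = clock  # hypothetical injection point, used for testing
        self.saved = False

    def _check_time(self, trainer):
        # Do nothing before the deadline, and save at most once.
        if self.saved or self.clock() < self.deadline:
            return
        trainer.save_checkpoint(str(self.checkpoint_path))
        self.saved = True
        if self.stop_trainer:
            trainer.should_stop = True

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        self._check_time(trainer)

    def on_val_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        self._check_time(trainer)


# Exercise the sketch with a fake trainer and a frozen clock.
saved_paths = []
fake_trainer = SimpleNamespace(save_checkpoint=saved_paths.append, should_stop=False)
cb = WalltimeCheckSketch(start_time=0.0, checkpoint_dir="ckpts",
                         time_limit_hours=1.0, buffer_minutes=10.0,
                         clock=lambda: 3000.0)
cb.on_train_batch_end(fake_trainer, None, None, None, 0)
cb.on_val_batch_end(fake_trainer, None, None, None, 0)
```

Checking in both hooks keeps the callback responsive during long validation loops, and the `saved` flag prevents the checkpoint from being rewritten on every later batch.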