WalltimeCheckpointer
- class WalltimeCheckpointer(*args, **kwargs)
Bases: Callback
Checkpoints and stops the training when a walltime limit is reached.
This is useful for Slurm jobs that have a hard time limit. We want to stop short of the limit to save a checkpoint and exit gracefully, so that the job can be automatically resumed.
- __init__(
- start_time,
- checkpoint_dir,
- time_limit_hours=4.0,
- buffer_minutes=2.0,
- checkpoint_filename='last.ckpt',
- stop_trainer_after_time_limit=True,
- )
WalltimeCheckpointer constructor.
- Parameters:
start_time (float) – Unix timestamp, in seconds, at which the job started (e.g. from time.time() or date +%s).
checkpoint_dir (str | Path) – Directory in which to save the checkpoint.
time_limit_hours (float) – Time limit of the job allocation, in hours.
buffer_minutes (float) – Number of minutes before the time limit at which to checkpoint and stop.
checkpoint_filename (str) – Name of the checkpoint file to save.
stop_trainer_after_time_limit (bool) – Whether to stop the trainer after the time limit.
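The parameters above suggest that the cutoff is the start time plus the time limit, minus the buffer. The following is a minimal sketch of that arithmetic under that assumption, not the library's actual implementation. The helper `seconds_until_cutoff` and the environment variable `JOB_START_TIME` are hypothetical names introduced only for illustration.

```python
import os
import time


def seconds_until_cutoff(start_time, time_limit_hours=4.0, buffer_minutes=2.0, now=None):
    """Return the seconds left before the assumed cutoff (limit minus buffer).

    Hypothetical helper: it mirrors the documented parameters, but the exact
    formula used by WalltimeCheckpointer is an assumption.
    """
    if now is None:
        now = time.time()
    cutoff = start_time + time_limit_hours * 3600.0 - buffer_minutes * 60.0
    return cutoff - now


# JOB_START_TIME is a hypothetical variable, e.g. exported in the Slurm batch
# script with `export JOB_START_TIME=$(date +%s)`. Fall back to "now" if unset.
start_time = float(os.environ.get("JOB_START_TIME", time.time()))

remaining = seconds_until_cutoff(start_time)
print(f"{remaining / 60:.1f} minutes until the cutoff")
```

Recording the start time in the batch script rather than in Python means that time spent on environment setup and imports also counts against the limit.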
- on_val_batch_end(
- trainer,
- pl_module,
- outputs,
- batch,
- batch_idx,
- )
Check the elapsed time after every validation batch.
- Parameters:
trainer (pytorch_lightning.Trainer)
pl_module (pytorch_lightning.LightningModule)
- on_train_batch_end(
- trainer,
- pl_module,
- outputs,
- batch,
- batch_idx,
- )
Check the elapsed time after every training batch.
- Parameters:
trainer (pytorch_lightning.Trainer)
pl_module (pytorch_lightning.LightningModule)
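The per-batch check described by the two hooks can be sketched as follows. This is an illustrative stand-in rather than the library's code: `WalltimeStopSketch` and `FakeTrainer` are hypothetical classes, and the sketch assumes the callback saves through `trainer.save_checkpoint(...)` and requests a stop by setting `trainer.should_stop`, both of which exist on the PyTorch Lightning `Trainer`. A fake trainer is used so that the example runs without Lightning installed.

```python
import time
from pathlib import Path


class WalltimeStopSketch:
    """Hypothetical stand-in that mirrors the documented constructor and hooks."""

    def __init__(self, start_time, checkpoint_dir, time_limit_hours=4.0,
                 buffer_minutes=2.0, checkpoint_filename="last.ckpt",
                 stop_trainer_after_time_limit=True):
        # Assumed cutoff: start + limit - buffer, all in seconds.
        self.cutoff = start_time + time_limit_hours * 3600.0 - buffer_minutes * 60.0
        self.checkpoint_path = Path(checkpoint_dir) / checkpoint_filename
        self.stop_trainer = stop_trainer_after_time_limit
        self.triggered = False

    def _check_time(self, trainer):
        # Do nothing before the cutoff, and act only once after it.
        if self.triggered or time.time() < self.cutoff:
            return
        self.triggered = True
        trainer.save_checkpoint(self.checkpoint_path)
        if self.stop_trainer:
            trainer.should_stop = True

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        self._check_time(trainer)

    def on_val_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        self._check_time(trainer)


class FakeTrainer:
    """Minimal stand-in exposing only the two Trainer members the sketch uses."""

    def __init__(self):
        self.should_stop = False
        self.saved_to = None

    def save_checkpoint(self, filepath):
        self.saved_to = Path(filepath)


trainer = FakeTrainer()
# Pretend the job started four hours ago, so the cutoff has already passed.
callback = WalltimeStopSketch(start_time=time.time() - 4 * 3600,
                              checkpoint_dir="checkpoints")
callback.on_train_batch_end(trainer, pl_module=None, outputs=None,
                            batch=None, batch_idx=0)
print(trainer.saved_to, trainer.should_stop)
```

Checking after every batch keeps the overhead to one clock read per step while bounding the overshoot to the duration of a single batch, so buffer_minutes should comfortably exceed the time one batch plus one checkpoint save takes.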