IterationSpeedCallback

class IterationSpeedCallback(*args, **kwargs)

Bases: Callback

Logs iteration throughput, a forward/backward timing breakdown, and GPU memory usage to wandb.

Provides two families of metrics:

  • Wall-clock (perf/wc_*): cumulative counters that track true training throughput. The timer pauses during validation and resumes when training continues, so validation time is excluded from these numbers no matter how often validation runs.

  • Windowed (perf/iter_per_sec, perf/fwd_ms, etc.): rolling averages over the last window_size batches. These respond faster to local changes but can be noisy, especially during torch.compile warmup; the first warmup_batches steps are excluded from these averages.
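The pause/resume behavior of the wall-clock metrics can be illustrated with a small cumulative timer. This is a minimal sketch, not the callback's source: PausableTimer and wall_clock_iter_per_sec are hypothetical names, and the injectable clock argument is an assumption added so the example behaves deterministically.

```python
import time
from typing import Callable, Optional


class PausableTimer:
    """Hypothetical cumulative wall-clock timer; paused intervals are not counted."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self._elapsed = 0.0  # seconds accumulated over finished running intervals
        self._started_at: Optional[float] = None  # None while paused

    def start(self) -> None:
        # Begin or resume counting; calling it while already running is a no-op.
        if self._started_at is None:
            self._started_at = self._clock()

    def pause(self) -> None:
        # Fold the current running interval into the total, then stop counting.
        if self._started_at is not None:
            self._elapsed += self._clock() - self._started_at
            self._started_at = None

    @property
    def elapsed(self) -> float:
        # Total running time, including the in-progress interval if not paused.
        if self._started_at is None:
            return self._elapsed
        return self._elapsed + (self._clock() - self._started_at)


def wall_clock_iter_per_sec(steps_done: int, timer: PausableTimer) -> float:
    """Cumulative throughput: training steps per second of unpaused time."""
    seconds = timer.elapsed
    return steps_done / seconds if seconds > 0 else 0.0


if __name__ == "__main__":
    now = [0.0]
    timer = PausableTimer(clock=lambda: now[0])
    timer.start()
    now[0] = 10.0  # 10 s of training
    timer.pause()  # validation begins
    now[0] = 40.0  # 30 s of validation, not counted
    timer.start()  # training resumes
    now[0] = 50.0  # 10 more s of training
    print(timer.elapsed)  # 20.0
    print(wall_clock_iter_per_sec(100, timer))  # 5.0
```

Because paused time never enters the total, a run that validates every 50 steps and one that validates every 5,000 steps report comparable wall-clock throughput.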

GPU memory is sampled from torch.cuda.max_memory_allocated and torch.cuda.memory_allocated and logged as perf/peak_gpu_mb and perf/current_gpu_mb, respectively.
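A hedged sketch of the memory sampling, assuming "MB" means MiB (2**20 bytes); the source does not state the divisor. bytes_to_mb and sample_gpu_memory are hypothetical helpers. Both torch.cuda functions return byte counts; max_memory_allocated reports the peak since program start unless torch.cuda.reset_peak_memory_stats is called, and whether the callback resets it between logs is not stated.

```python
from typing import Dict

BYTES_PER_MB = 1024 ** 2  # assumption: "MB" here means MiB


def bytes_to_mb(num_bytes: int) -> float:
    return num_bytes / BYTES_PER_MB


def sample_gpu_memory() -> Dict[str, float]:
    """Return peak/current allocated GPU memory in MB, or {} when CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return {}
    if not torch.cuda.is_available():
        return {}
    return {
        # Highest tensor allocation on the current device since start or last peak reset.
        "perf/peak_gpu_mb": bytes_to_mb(torch.cuda.max_memory_allocated()),
        # Memory currently occupied by tensors on the current device.
        "perf/current_gpu_mb": bytes_to_mb(torch.cuda.memory_allocated()),
    }


if __name__ == "__main__":
    print(bytes_to_mb(3 * 1024 ** 2))  # 3.0
    print(sample_gpu_memory())
```

Note that these counters track memory held by tensors through PyTorch's caching allocator, so they can be lower than what tools such as nvidia-smi report for the process.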

Parameters:
  • log_every_n_steps (int) – How often, in training steps, to log speed metrics. Defaults to 10.

  • window_size (int | None) – Number of most recent batch times to average over. If None, defaults to log_every_n_steps.

  • batch_size_per_gpu (int | None) – Batch size on each GPU, used to compute samples/sec. If None, the callback attempts to read it from trainer.datamodule.
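The windowed metrics and the parameters above can be sketched as a bounded queue of batch durations, assuming each duration is measured elsewhere and passed in. WindowedThroughput, its warmup_batches argument, and the perf/samples_per_sec key are hypothetical; the source mentions warmup_batches but does not list it as a constructor parameter, and it does not say whether samples/sec is per GPU or multiplied by the number of devices. This sketch reports a per-GPU figure.

```python
from collections import deque
from typing import Deque, Dict, Optional


class WindowedThroughput:
    """Hypothetical rolling iter/sec and samples/sec over the last window_size batches."""

    def __init__(
        self,
        log_every_n_steps: int = 10,
        window_size: Optional[int] = None,
        batch_size_per_gpu: Optional[int] = None,
        warmup_batches: int = 0,
    ) -> None:
        # window_size falls back to log_every_n_steps, as documented above.
        self.window_size = window_size if window_size is not None else log_every_n_steps
        self.batch_size_per_gpu = batch_size_per_gpu
        self.warmup_batches = warmup_batches
        self._seen = 0
        # A deque with maxlen discards the oldest duration automatically.
        self._durations: Deque[float] = deque(maxlen=self.window_size)

    def record(self, batch_seconds: float) -> None:
        # Skip the first warmup_batches batches (e.g. torch.compile warmup).
        self._seen += 1
        if self._seen > self.warmup_batches:
            self._durations.append(batch_seconds)

    def metrics(self) -> Dict[str, float]:
        if not self._durations:
            return {}
        mean_seconds = sum(self._durations) / len(self._durations)
        out = {"perf/iter_per_sec": 1.0 / mean_seconds}
        if self.batch_size_per_gpu is not None:
            # Per-GPU samples/sec; scaling by world size is deliberately left out.
            out["perf/samples_per_sec"] = self.batch_size_per_gpu / mean_seconds
        return out


if __name__ == "__main__":
    w = WindowedThroughput(log_every_n_steps=4, batch_size_per_gpu=32, warmup_batches=2)
    for seconds in [5.0, 5.0, 0.5, 0.5, 0.5, 0.5, 0.5]:
        w.record(seconds)  # the two 5.0 s warmup batches are dropped
    print(w.metrics())  # {'perf/iter_per_sec': 2.0, 'perf/samples_per_sec': 64.0}
```

Inverting the mean duration gives the same value as dividing the number of batches in the window by their total time, and the deque keeps memory bounded at window_size entries.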

__init__(log_every_n_steps=10, window_size=None, batch_size_per_gpu=None)

on_train_batch_start(trainer, pl_module, batch, batch_idx)
on_before_backward(trainer, pl_module, loss)
on_after_backward(trainer, pl_module)
on_train_batch_end(trainer, pl_module, outputs, batch, batch_idx)
on_validation_start(trainer, pl_module)
on_validation_end(trainer, pl_module)
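The hooks listed above suggest how the forward/backward breakdown can be derived: timestamps taken in on_train_batch_start, on_before_backward, and on_after_backward roughly bracket the forward pass (training_step and loss) and the backward pass, while on_validation_start and on_validation_end bracket time to exclude from wall-clock totals. The sketch below is a hypothetical stand-alone class whose method names and arguments mirror Lightning's Callback hooks; it does not subclass Callback so it runs without Lightning installed, and the bwd_ms, batch_ms, and paused_seconds names are illustrative. Because CUDA kernels launch asynchronously, host-side timestamps taken without torch.cuda.synchronize() can attribute GPU work to a later interval; the source does not say how the actual callback handles this.

```python
import time
from typing import Any, Callable, Dict, List


class FwdBwdTimer:
    """Hypothetical sketch: per-batch fwd/bwd timings from hook timestamps."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self._t_batch_start = 0.0
        self._t_before_bwd = 0.0
        self._t_after_bwd = 0.0
        self._t_val_start = 0.0
        self.paused_seconds = 0.0  # total time spent in validation
        self.records: List[Dict[str, float]] = []

    def on_train_batch_start(self, trainer: Any, pl_module: Any, batch: Any, batch_idx: int) -> None:
        self._t_batch_start = self._clock()

    def on_before_backward(self, trainer: Any, pl_module: Any, loss: Any) -> None:
        # Everything since batch start (training_step, loss) counts as "forward".
        self._t_before_bwd = self._clock()

    def on_after_backward(self, trainer: Any, pl_module: Any) -> None:
        self._t_after_bwd = self._clock()

    def on_train_batch_end(
        self, trainer: Any, pl_module: Any, outputs: Any, batch: Any, batch_idx: int
    ) -> None:
        end = self._clock()
        self.records.append(
            {
                "fwd_ms": (self._t_before_bwd - self._t_batch_start) * 1e3,
                "bwd_ms": (self._t_after_bwd - self._t_before_bwd) * 1e3,
                # Whole batch, including the optimizer step and other hooks.
                "batch_ms": (end - self._t_batch_start) * 1e3,
            }
        )

    def on_validation_start(self, trainer: Any, pl_module: Any) -> None:
        self._t_val_start = self._clock()

    def on_validation_end(self, trainer: Any, pl_module: Any) -> None:
        # Accumulate validation time so it can be excluded from wall-clock totals.
        self.paused_seconds += self._clock() - self._t_val_start


if __name__ == "__main__":
    ticks = iter([0.0, 1.0, 3.0, 4.0, 10.0, 25.0])
    cb = FwdBwdTimer(clock=lambda: next(ticks))
    # Simulate one training batch followed by one validation run.
    cb.on_train_batch_start(None, None, batch=None, batch_idx=0)
    cb.on_before_backward(None, None, loss=None)
    cb.on_after_backward(None, None)
    cb.on_train_batch_end(None, None, outputs=None, batch=None, batch_idx=0)
    cb.on_validation_start(None, None)
    cb.on_validation_end(None, None)
    print(cb.records[0])  # {'fwd_ms': 1000.0, 'bwd_ms': 2000.0, 'batch_ms': 4000.0}
    print(cb.paused_seconds)  # 15.0
```

Windowed values such as perf/fwd_ms could then be obtained by averaging the most recent records, as in a rolling window over batch durations.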