compute_relative_error

compute_relative_error(tensor1, tensor2)

Compute the relative error between two tensors: ||t1 - t2|| / ||t1||, where t1 is tensor1 (the reference) and t2 is tensor2.

This metric is scale-invariant (multiplying both tensors by the same nonzero constant leaves it unchanged), which makes it useful for comparing gradients or activations in distributed training, where magnitudes vary widely across layers. Following the TTrace methodology (arXiv:2506.09280), it helps distinguish floating-point round-off errors from actual bugs in distributed implementations.
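To illustrate the round-off versus bug distinction, the sketch below compares a bfloat16 round-trip against a simulated scaling bug. It is not taken from the library or from TTrace: `rel_err` is a hypothetical stand-in for the documented function, the L2 norm over all elements is an assumption, and the "bug" (a gradient summed over two ranks but never averaged) is invented for illustration.

```python
import torch


def rel_err(ref: torch.Tensor, test: torch.Tensor) -> float:
    # Hypothetical helper: ||ref - test|| / ||ref||, computed in float64.
    ref, test = ref.double(), test.double()
    return (torch.linalg.vector_norm(ref - test) / torch.linalg.vector_norm(ref)).item()


torch.manual_seed(0)
grad_ref = torch.randn(256, 256)

# Round-off only: cast to bfloat16 and back, with no logic change.
grad_bf16 = grad_ref.to(torch.bfloat16).to(torch.float32)
roundoff_err = rel_err(grad_ref, grad_bf16)

# Simulated bug (assumption): gradient summed over 2 ranks but never averaged.
grad_buggy = 2.0 * grad_ref
bug_err = rel_err(grad_ref, grad_buggy)

# Round-to-nearest keeps each element within half of machine epsilon,
# so the relative error of the whole tensor obeys the same bound.
bf16_bound = torch.finfo(torch.bfloat16).eps / 2

print(f"round-off: {roundoff_err:.2e} (bound {bf16_bound:.2e})")
print(f"bug:       {bug_err:.2e}")
```

The round-off error stays at or below the precision-dependent bound, while the simulated bug yields a relative error of 1.0, more than two orders of magnitude larger. The bound depends on the dtype and on how much computation separates the two tensors, so any threshold used in practice needs to be chosen per setup.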

Parameters:
  • tensor1 (Tensor) – Reference tensor

  • tensor2 (Tensor) – Tensor to compare against reference

Returns:

Relative error as a Python float. If ||t1|| is near zero, the ratio is ill-defined, so the absolute difference is returned instead.

Return type:

float
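A minimal sketch of how a function with this contract could be written is shown here. It is not the library's implementation: the name `relative_error_sketch`, the `eps` threshold of 1e-12, the upcast to float64, and the choice of the L2 norm over all elements are all assumptions made for illustration.

```python
import torch


def relative_error_sketch(
    tensor1: torch.Tensor, tensor2: torch.Tensor, eps: float = 1e-12
) -> float:
    """Hypothetical sketch of ||t1 - t2|| / ||t1|| with a near-zero fallback."""
    # Upcast so low-precision inputs (e.g. bfloat16) do not add extra round-off
    # to the metric itself.
    ref = tensor1.detach().double()
    test = tensor2.detach().double()

    # With dim=None, vector_norm flattens the tensor and returns the 2-norm.
    diff_norm = torch.linalg.vector_norm(ref - test)
    ref_norm = torch.linalg.vector_norm(ref)

    # Assumed fallback: report the absolute difference when the reference
    # norm is too small for the ratio to be meaningful.
    if ref_norm.item() < eps:
        return diff_norm.item()
    return (diff_norm / ref_norm).item()
```

Returning a Python float via `.item()` forces a device synchronization on GPU tensors, which is acceptable for a debugging check but worth keeping out of hot training loops.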

Example

>>> import torch
>>> grad_ref = torch.randn(100, 100)
>>> grad_test = grad_ref + 0.001 * torch.randn(100, 100)  # Add 0.1% noise
>>> rel_err = compute_relative_error(grad_ref, grad_test)
>>> assert rel_err < 0.01  # Less than 1% relative error

Reference:

TTrace: Lightweight Error Checking and Diagnosis for Distributed Training https://arxiv.org/abs/2506.09280