compute_relative_error
- compute_relative_error(tensor1, tensor2)
Compute the relative error between two tensors: ||t1 - t2|| / ||t1||, where t1 is tensor1 and t2 is tensor2.
Because the difference is normalized by ||t1||, this metric is scale-invariant, which makes it useful for comparing gradients or activations in distributed training. Following the TTrace methodology (arXiv:2506.09280), it helps distinguish floating-point round-off errors from actual bugs in distributed implementations.
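- Parameters:
tensor1: Reference tensor (t1 in the formula); its norm is the denominator.
tensor2: Tensor compared against the reference (t2 in the formula).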
- Returns:
Relative error as a float scalar. If ||t1|| is near zero, the absolute difference is returned instead, which avoids dividing by a vanishing norm.
- Return type:
float
Example
>>> import torch
>>> grad_ref = torch.randn(100, 100)
>>> grad_test = grad_ref + 0.001 * torch.randn(100, 100)  # Add 0.1% noise
>>> rel_err = compute_relative_error(grad_ref, grad_test)
>>> assert rel_err < 0.01  # Less than 1% relative error
- Reference:
TTrace: Lightweight Error Checking and Diagnosis for Distributed Training https://arxiv.org/abs/2506.09280
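The formula and the near-zero fallback described above can be illustrated with a minimal sketch. This is not the library's source: the name `relative_error_sketch`, the choice of the L2 (Frobenius) norm, the `eps` threshold of 1e-12, and the float64 promotion are all assumptions made for illustration, since the documentation does not specify them.

```python
import torch


def relative_error_sketch(t1: torch.Tensor, t2: torch.Tensor, eps: float = 1e-12) -> float:
    """Hypothetical re-implementation of ||t1 - t2|| / ||t1|| (not the library's code)."""
    # Assumption: promote to float64 so the comparison itself adds little round-off.
    t1 = t1.detach().double()
    t2 = t2.detach().double()

    # With no `ord` or `dim`, torch.linalg.norm flattens the input and takes the 2-norm.
    diff_norm = torch.linalg.norm(t1 - t2).item()
    ref_norm = torch.linalg.norm(t1).item()

    # Assumption: "near zero" means below `eps`; fall back to the absolute difference.
    if ref_norm < eps:
        return diff_norm
    return diff_norm / ref_norm


torch.manual_seed(0)
ref = torch.randn(100, 100)
test = ref + 0.001 * torch.randn(100, 100)

err = relative_error_sketch(ref, test)                  # on the order of 1e-3
scaled = relative_error_sketch(1000 * ref, 1000 * test)  # nearly identical to err
zero_case = relative_error_sketch(torch.zeros(4), torch.ones(4))  # fallback path
```

Scaling both tensors by the same factor leaves the result essentially unchanged, which is the scale-invariance property noted in the description. The fallback path is the exception: it returns an absolute quantity, so it does depend on the tensors' scale.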