I'm experimenting with FasterViT in an MMDetection project. In this project the validation data augmentation pipeline does not crop the image, and simply pads it to the minimum size. This minimum size is calculated per GPU, meaning each GPU can have a different height and width for the images in its batch.
For all the timm models I've worked with this is fine, including Swin. However, with FasterViT this causes a tricky NCCL timeout due to the `self.relative_bias` buffer, which is cached to support the `self.deploy` switch.
Because each GPU sees a different image size, the number of carrier tokens differs, and so does the sequence length of this buffer. That on its own is fine, but when training restarts for the next epoch this buffer gets synced between the GPUs, and it no longer has the same size on all of them. This causes an NCCL timeout, and what's worse is that there's no indication of what's happening; the timeout actually surfaces during the next synchronization op on the GPU, e.g. something like SyncBatchNorm.
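For reference, here is a minimal repro sketch of the failure mode. It is not FasterViT code, just a stand-in module with a cached, input-size-dependent buffer; it assumes a 2-GPU `torchrun` launch and is expected to hang (that's the point) rather than raise a useful error:

```python
import torch
import torch.distributed as dist
import torch.nn as nn


class CachedBiasBlock(nn.Module):
    """Stand-in for an attention block that caches a bias whose shape
    depends on the input size (hypothetical, simplified)."""

    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))  # DDP needs at least one parameter
        self.register_buffer("relative_bias", torch.zeros(1))

    def forward(self, x):
        seq_len = x.shape[1]  # depends on this GPU's padded image size
        if self.relative_bias.shape[-1] != seq_len:
            # re-cache at the local sequence length; shapes now diverge per rank
            self.relative_bias = torch.zeros(seq_len, device=x.device)
        return self.scale * x + self.relative_bias


def main():
    # launch with: torchrun --nproc_per_node=2 repro.py
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    model = nn.parallel.DistributedDataParallel(
        CachedBiasBlock().cuda(), device_ids=[rank], broadcast_buffers=True
    )
    # each rank pads to a different minimum size, so seq_len differs per rank
    x = torch.randn(2, 64 + 16 * rank, device="cuda")
    model(x)  # caches a buffer with a different shape on every rank
    model(x)  # DDP re-broadcasts buffers here: shape mismatch, NCCL hang/timeout


if __name__ == "__main__":
    main()
```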
The solution here is simply to force `self.relative_bias` to be re-calculated every time during training (which already happens when `self.deploy` is False), and also to re-calculate it during validation whenever the image size changes. This would require some dynamic checks, which would probably break TorchScript, but maybe it would work with `torch.compile`?
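Something along these lines is what I have in mind. This is a hedged sketch, not FasterViT's actual code: `_compute_relative_bias` is a hypothetical stand-in for the real MLP-based positional bias computation, and I've kept the bias as a plain attribute (my own choice here) so DDP's buffer broadcast never sees mismatched shapes:

```python
import torch
import torch.nn as nn


class ShapeAwareBias(nn.Module):
    """Sketch of a shape-aware relative-bias cache (not the upstream code)."""

    def __init__(self, num_heads: int = 4, deploy: bool = False):
        super().__init__()
        self.num_heads = num_heads
        self.deploy = deploy
        self.relative_bias = None  # plain attribute: nothing for DDP to sync

    def _compute_relative_bias(self, seq_len: int, device) -> torch.Tensor:
        # placeholder for the real positional-bias computation
        return torch.zeros(self.num_heads, seq_len, seq_len, device=device)

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: (B, num_heads, seq_len, seq_len)
        seq_len = attn.shape[-1]
        stale = self.relative_bias is None or self.relative_bias.shape[-1] != seq_len
        # Recompute every step in training (matching the existing non-deploy
        # behaviour) and, in eval, whenever the padded per-GPU image size
        # (and hence the carrier-token count) has changed. The shape check is
        # the "dynamic check" mentioned above: TorchScript would likely trip
        # on it, while torch.compile should just guard on the shape and
        # recompile for new sizes.
        if (self.training and not self.deploy) or stale:
            self.relative_bias = self._compute_relative_bias(seq_len, attn.device)
        return attn + self.relative_bias
```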
I'm open to submitting a PR in the future if there's a clear path forward. I'm mostly adding this issue in case other folks run into the same NCCL timeout.
Just to be clear, my temporary solution here was to:
- Change lines like `self.register_buffer("relative_bias", relative_bias)` to `self.relative_bias = relative_bias` (this doesn't need to be a buffer if it's always re-computed)
- Comment out lines like `self.grid_exists = True` so `self.grid_exists` stays False and the relative bias is always re-computed (this only affects validation, since it just disables the caching mechanism); see the sketch after this list
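Here's a simplified sketch of what the patched caching logic ends up looking like with both changes applied. The module and helper names are mine, not the upstream ones; only the two commented edits correspond to the actual changes:

```python
import torch
import torch.nn as nn


class PatchedBiasCache(nn.Module):
    """Simplified stand-in for the bias-caching logic in FasterViT's attention
    blocks, with the two workaround edits applied (names are hypothetical)."""

    def __init__(self, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.grid_exists = False
        # (1) plain attribute instead of register_buffer("relative_bias", ...),
        #     so DDP's broadcast_buffers has nothing to sync across GPUs
        self.relative_bias = None

    def _compute_relative_bias(self, seq_len: int, device) -> torch.Tensor:
        # placeholder for the real positional-bias computation
        return torch.zeros(self.num_heads, seq_len, seq_len, device=device)

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: (B, num_heads, seq_len, seq_len)
        if not self.grid_exists:
            # (2) the original code flips the flag here to cache the bias;
            #     leaving it commented out forces a recompute on every forward,
            #     so the shape always matches this GPU's current input
            # self.grid_exists = True
            self.relative_bias = self._compute_relative_bias(attn.shape[-1], attn.device)
        return attn + self.relative_bias
```

It trades a bit of redundant computation during validation for never having to worry about mismatched buffer shapes across GPUs.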