Fix test_eager_matches_sdpa_inference for XPU backend #34889
Conversation
Force-pushed from c472e44 to be8177c
Looks like Friday evening is not the best time to run CI. I pushed the same code 3 times and saw different errors on each run :). Not related to the change, I think. Will continue on Monday :).
Clarified XPU backend behavior for
# As of PyTorch 2.5 XPU backend supports only torch.nn.attention.SDPBackend.MATH
# which is implemented on PyTorch level using aten operators and is
# device agnostic with respect to implementation of each aten operator.
atol = atols["cuda", False, torch_dtype]
nit:
Although XPU only supports MATH, does that mean the results from XPU will be the same as, say, CUDA? Otherwise, I don't see the reason to reuse the CUDA thresholds.
In my understanding, at least for recent versions of PyTorch (2.5 and the upcoming 2.6), MATH should give identical results on any hardware, including different GPU devices and CPU, because the algorithm is implemented at the torch level and is device agnostic (well, up to the implementation of each aten operator, which is device specific, but those should still give the same results). The only exception here is MPS, which has a separate branch in the code, though that is also implemented at the torch level. Here are the relevant places in the sources (a small sanity-check sketch follows the links):
- https://github.com/pytorch/pytorch/blob/5ececd4caa4ec1534c567ec9db6efb861b760685/aten/src/ATen/native/transformers/attention.cpp#L776 (see the separate code branch for MPS right above this code)
- https://github.com/pytorch/pytorch/blob/5ececd4caa4ec1534c567ec9db6efb861b760685/aten/src/ATen/native/transformers/attention.cpp#L795
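To illustrate that claim, here is a small standalone sanity check (a sketch only, assuming a PyTorch 2.5 build with an available XPU device; not part of this PR):

import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Random query/key/value tensors in (batch, heads, seq_len, head_dim) layout.
q, k, v = (torch.randn(2, 4, 16, 32) for _ in range(3))

with sdpa_kernel(SDPBackend.MATH):
    ref = F.scaled_dot_product_attention(q, k, v)  # CPU reference
    out = F.scaled_dot_product_attention(
        q.to("xpu"), k.to("xpu"), v.to("xpu")  # assumes an XPU device is present
    ).cpu()

# MATH is built from device-agnostic aten ops, so the two results should agree
# within ordinary floating-point tolerances.
print((ref - out).abs().max())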
So, per the above, I think the current reuse of the CUDA thresholds is reasonable... That being said, the question is whether this is sustainable in the longer term, or whether we will soon need to adjust the thresholds for XPU. Well, we might need to. It's likely that upstream PyTorch XPU will get an implementation of one or both of the attention algorithms, and I am not sure those will behave the same as CUDA. Plus, there is also IPEX for XPU, which might behave differently. And all of that might be version dependent....
Anyhow, whether we reuse the CUDA thresholds or not is a call we need to make here. In any case, I would start by copying the CUDA thresholds to an XPU-specific location. Note that we might need a few iterations to settle everything down.
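For illustration, that copy step could be as simple as the following sketch (the key layout of the tolerance dict here is an assumption, not the exact structure used in the test):

import torch

# Hypothetical tolerance table keyed by (device, enable_kernels, dtype),
# mirroring the lookup pattern used in the test.
atols = {
    ("cuda", False, torch.float32): 1e-6,
    ("cuda", False, torch.bfloat16): 1e-2,
    ("cpu", False, torch.float32): 1e-6,
}

# Seed XPU-specific entries from the CUDA ones, so they can later be tuned
# independently if XPU-native SDPA kernels or IPEX start to diverge from the
# MATH results.
for (device, enable_kernels, dtype), value in list(atols.items()):
    if device == "cuda":
        atols[("xpu", enable_kernels, dtype)] = value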
Sounds reasonable to reuse the (MATH) thresholds for now. Thank you for explaining!
I ran
Above being said, we synced with @faaany offline and she did try this PR on her side. I believe she saw the tests passing for her without IPEX, but she mentioned that some tests are failing for her with IPEX. The latter point is different from the test results I have. I suspect that's due to the different IPEX versions we tried: I think @faaany tried a later IPEX version (she's offline now, I will check with her later on that). Note that without this PR
Thank you again!
@faaany Do you have any further comments?
For myself: need to follow up with #34941 (SDPA tests for BEiT), which is being reviewed in parallel.
The failure on CI seems unrelated. I also see some flakiness in the CI results on main. I tried to rebase a couple of times, but this did not help. No changes since the last review.
Yes, we do have some flaky tests (trying to fix them, but at a slow pace).
…ds.cuda.sdp_kernel Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
As of PyTorch 2.5 XPU backend supports only torch.nn.attention.SDPBackend.MATH which is implemented on PyTorch level using aten operators and is device agnostic with respect to implementation of each aten operator. Thus, we can reuse CUDA (or CPU) MATH weights for XPU. Fixes: huggingface#34888 Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
…in nemotron Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
There is no need to have a green CI if you think some failures are irrelevant. Ping me for a double check, then we can wait for a core maintainer's review.
Yeah, I think that's the case here. I rebased today without any changes, and another test failed this time, I believe unrelated as well. Sorry, I am just paranoid about a green CI - I consider it my responsibility to achieve that on my PRs.
Feel free to merge if it's alright with you @ydshieh
@@ -607,7 +607,7 @@ def get_mean_reldiff(failcase, x, ref, atol, rtol):

# TODO: test gradients as well (& for FA2 as well!)
with torch.no_grad():
-    with torch.backends.cuda.sdp_kernel(
+    with sdpa_kernel(
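For reference, a minimal before/after sketch of that API change (a standalone example, not the actual test code):

import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

q, k, v = (torch.randn(1, 8, 128, 64) for _ in range(3))

# Old, deprecated form:
#   with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True,
#                                       enable_mem_efficient=False):
#       ...
# New form: select the backend(s) explicitly via torch.nn.attention.sdpa_kernel.
with torch.no_grad(), sdpa_kernel(SDPBackend.MATH):
    out = F.scaled_dot_product_attention(q, k, v)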
thanks for updating!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
…e#34889) * Use torch.nn.attention.sdpa_kernel instead of deprecated torch.backends.cuda.sdp_kernel Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com> * Fix test_eager_matches_sdpa_inference for XPU backend As of PyTorch 2.5 XPU backend supports only torch.nn.attention.SDPBackend.MATH which is implemented on PyTorch level using aten operators and is device agnostic with respect to implementation of each aten operator. Thus, we can reuse CUDA (or CPU) MATH weights for XPU. Fixes: huggingface#34888 Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com> * Use torch.amp.autocast instead of deprecated torch.cuda.amp.autocast in nemotron Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com> --------- Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
Included fixes:
- Use torch.nn.attention.sdpa_kernel instead of deprecated torch.backends.cuda.sdp_kernel
- Use torch.amp.autocast instead of deprecated torch.cuda.amp.autocast in nemotron (see the sketch below)
- Fix test_eager_matches_sdpa_inference for XPU backend (as of PyTorch 2.5, XPU supports only torch.nn.attention.SDPBackend.MATH)

Fixes: #34888
CC: @amyeroberts @ydshieh
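A minimal sketch of the autocast change mentioned for nemotron (illustrative only; the real model code differs, and a CUDA device is assumed - "xpu" works analogously):

import torch

x = torch.randn(8, 16, device="cuda")  # assumption: a CUDA device is available

# Deprecated form (emits a FutureWarning on recent PyTorch):
#   with torch.cuda.amp.autocast(dtype=torch.bfloat16):
#       y = x @ x.T
# Replacement: pass the device type explicitly to the generic torch.amp.autocast.
with torch.amp.autocast("cuda", dtype=torch.bfloat16):
    y = x @ x.T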