Fix FSDP resume Initialization issue #34032
Conversation
Thanks for the PR! Could you explain a bit more why this PR fixes the issue that you linked? Thanks
Yeah, sure. There is a similar issue in PyTorch (pytorch/pytorch#113496) that causes the same error: an initialization error in the forward pass makes FSDP fail. The fix seems fairly simple, as we just have to run a forward pass once with dummy values before initializing FSDP.
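For context, here is a minimal sketch of the workaround being described. The helper name, input names (`input_ids`, `attention_mask`), and shape are illustrative assumptions, not the exact implementation in this PR:

```python
import torch

def init_fsdp_with_dummy_forward(model, device):
    """Run one forward pass with dummy inputs before FSDP wraps the model.

    Sketch of the workaround for pytorch/pytorch#113496: lazily initialized
    modules need to materialize their parameters before sharding, otherwise
    FSDP fails when resuming from a checkpoint.
    """
    model = model.to(device)
    dummy_inputs = {
        "input_ids": torch.ones((1, 512), dtype=torch.long, device=device),
        "attention_mask": torch.ones((1, 512), dtype=torch.long, device=device),
    }
    with torch.no_grad():
        model(**dummy_inputs)  # triggers any deferred parameter initialization
    return model
```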
Thanks! The fix makes sense to me, thanks for the explanation. Could you document this and add the link to that issue on top of the _init_fsdp func so we can fully trace why this is needed?
Also, please run pip install -e .[quality]; make fixup, which will fix the quality tests.
@muellerzr @SunMarc It would be great if you could take a look and let me know if there's anything else I need to do. Thanks!
LGTM! I left a comment showing what to fix in order to pass the CI!
@@ -4911,3 +4911,33 @@ def test_get_optimizer_group(self):
        param = next(model.parameters())
        group = trainer.get_optimizer_group(param)
        self.assertIn(param, group["params"])
You need to add a @require_cuda decorator to this test!
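For illustration, a sketch of how such a decorator is applied (the later commits in this PR mention using the torch_gpu decorator; the test name and body here are placeholders):

```python
import unittest

from transformers.testing_utils import require_torch_gpu

class FsdpInitTest(unittest.TestCase):
    @require_torch_gpu  # skips the test on machines without a CUDA device
    def test_fsdp_resume_init(self):
        # ... exercise the FSDP initialization path here ...
        pass
```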
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
This PR broke the test in tests/trainer/test_trainer_fsdp.py, which actually tests initializing a trainer using FSDP.
Given that there are also some flaws in the logic of this PR, it might be worth reverting this so it can be properly relanded.
    dtype=torch.long,
    device=device,
)
for name in model.forward.__code__.co_varnames
These are the variable names inside of forward... not the parameters to forward. I think you probably meant to do something like inspect.signature.
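A quick standalone illustration of the difference (not code from this PR): co_varnames lists the parameters followed by all local variables, while inspect.signature yields only the actual parameters.

```python
import inspect

def forward(input_ids=None, attention_mask=None):
    hidden = input_ids  # a local variable, not a parameter
    return hidden

print(forward.__code__.co_varnames)
# ('input_ids', 'attention_mask', 'hidden')  <- locals included

print(list(inspect.signature(forward).parameters))
# ['input_ids', 'attention_mask']  <- parameters only
```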
name: torch.ones(
    (1, 512),
    dtype=torch.long,
    device=device,
)
for name in model.forward.__code__.co_varnames
Not every parameter to forward is a tensor, but you are sending in a tensor for every value.
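One way to address both review comments would be to use inspect.signature and restrict the dummy inputs to an allowlist of parameters known to take integer tensors. This is a hedged sketch; the allowlist, helper name, and shape are assumptions for illustration:

```python
import inspect

import torch

# Assumed allowlist: forward parameters that are safe to feed as long tensors.
TENSOR_INPUT_NAMES = {"input_ids", "attention_mask", "token_type_ids"}

def build_dummy_inputs(model, device):
    params = inspect.signature(model.forward).parameters
    return {
        name: torch.ones((1, 512), dtype=torch.long, device=device)
        for name in params
        if name in TENSOR_INPUT_NAMES  # skip non-tensor args like return_dict
    }
```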
Regression as a result of this merge for
Thanks for the heads-up @Qubitium @ringohoffman! I will revert this PR!
Thanks for the info on the PR, @Qubitium @ringohoffman. I'll try to resolve the errors.
* Fix FSDP Initialization for resume training
* Added init_fsdp function to work with dummy values
* Fix FSDP initialization for resuming training
* Added CUDA decorator for tests
* Added torch_gpu decorator to FSDP tests
* Fixup for failing code quality tests

Revert "Fix FSDP resume Initialization issue (huggingface#34032)"
This reverts commit 4de1bdb.
This PR addresses an issue with Fully Sharded Data Parallel (FSDP) initialization when resuming training from a checkpoint, by adding a dummy forward pass during the initialization process.
Fixes #31892
Added tests in test_trainer.py to ensure proper FSDP initialization.
@muellerzr @SunMarc I am creating a draft PR; let me know if there are any more changes that I can make.