
Fix FSDP resume Initialization issue #34032

Merged: 6 commits merged into huggingface:main on Oct 15, 2024

Conversation

Itssshikhar (Contributor)

This PR addresses an issue with Fully Sharded Data Parallel (FSDP) initialization when resuming training from a checkpoint. It fixes the problem by running a dummy forward pass during the initialization process.

Fixes #31892

Tests were added in test_trainer.py to ensure proper FSDP initialization.

@muellerzr @SunMarc I am opening this as a draft PR; let me know if there are any more changes I should make.

SunMarc (Member) commented Oct 9, 2024

Thanks for the PR! Could you explain a bit more why this PR fixes the issue you linked? Thanks

Itssshikhar (Contributor, Author)

Yeah, sure.

There is a similar issue in PyTorch (pytorch/pytorch#113496) that produces the same error: an initialization error in the forward pass causes FSDP to fail.

The fix seems fairly simple: we just run the forward pass once with dummy values before initializing FSDP.
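A minimal sketch of the idea (the `(1, 512)` shape and the `input_ids` key are assumptions for a standard causal LM, not the exact code merged in this PR):

```python
import torch

def run_dummy_forward(model, device):
    # Run one forward pass with dummy values so that any lazily-initialized
    # state is materialized before FSDP wraps the model.
    # `input_ids` with shape (1, 512) is an assumption for a causal LM.
    dummy_input = {"input_ids": torch.ones((1, 512), dtype=torch.long, device=device)}
    with torch.no_grad():
        _ = model(**dummy_input)
    return model
```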

muellerzr (Contributor) left a comment

Thanks! The fix makes sense to me, and thanks for the explanation. Could you document this and add a link to that issue above the _init_fsdp function so we can fully trace why it is needed?

Also, please run `pip install -e .[quality]; make fixup`, which will fix the quality checks.

Itssshikhar marked this pull request as ready for review on October 15, 2024 at 07:04.
Itssshikhar (Contributor, Author)

@muellerzr @SunMarc All the tests have passed except for tests_non_models, which requires CUDA.

It would be great if you could take a look and let me know whether there is anything else I need to do on my end.

Thanks!

SunMarc (Member) left a comment

LGTM! I left a comment showing what to fix in order to pass the CI.

@@ -4911,3 +4911,33 @@ def test_get_optimizer_group(self):
        param = next(model.parameters())
        group = trainer.get_optimizer_group(param)
        self.assertIn(param, group["params"])


SunMarc (Member) commented:

You need to add a @require_cuda decorator to this test!
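For illustration, a sketch of what that could look like (the follow-up commits mention a torch_gpu decorator; `require_torch_gpu` from `transformers.testing_utils` is shown here, and the test name is hypothetical):

```python
from transformers.testing_utils import require_torch_gpu

@require_torch_gpu  # skips the test on machines without a CUDA device
def test_fsdp_dummy_forward_initialization(self):
    # hypothetical test body; the actual test was added in test_trainer.py
    ...
```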

SunMarc merged commit 4de1bdb into huggingface:main on Oct 15, 2024 (24 of 25 checks passed).
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ringohoffman (Contributor) left a comment

This PR broke the test in tests/trainer/test_trainer_fsdp.py, which actually tests initializing a trainer using FSDP.

Given that there are also some flaws in the logic of this PR, it might be worth reverting this so it can be properly relanded.

@SunMarc

name: torch.ones(
    (1, 512),
    dtype=torch.long,
    device=device,
)
for name in model.forward.__code__.co_varnames
ringohoffman (Contributor) commented Oct 16, 2024

These are the local variable names inside forward, not the parameters of forward. I think you probably meant to use something like inspect.signature.
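A quick sketch of the difference (not code from this PR):

```python
import inspect

def forward_parameter_names(model):
    # __code__.co_varnames lists every local variable in forward (and, when
    # forward is wrapped by a *args/**kwargs decorator, names like "args" and
    # "kwargs"), not only its parameters.
    via_code = model.forward.__code__.co_varnames

    # inspect.signature reports the actual call parameters of forward.
    via_signature = list(inspect.signature(model.forward).parameters)
    return via_code, via_signature
```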

Comment on lines +296 to +301
name: torch.ones(
    (1, 512),
    dtype=torch.long,
    device=device,
)
for name in model.forward.__code__.co_varnames
Contributor:

Not every parameter to forward is a tensor, but you are sending in a tensor for every value.
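One way to avoid that (a sketch assuming a transformers PreTrainedModel, which exposes a `main_input_name` attribute; the shape is arbitrary):

```python
import torch

def build_dummy_input(model, device):
    # Feed only the model's main tensor input (e.g. "input_ids" for most LMs)
    # rather than fabricating a tensor for every name found in forward.
    name = getattr(model, "main_input_name", "input_ids")
    return {name: torch.ones((1, 512), dtype=torch.long, device=device)}
```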

Qubitium (Contributor)

Regression as a result of this merge for trl/sft + FSDP training on 2x GPUs:

TypeError: LlamaForCausalLM.forward() got an unexpected keyword argument 'args'

trl 0.11.4
accelerate 1.0.1
transformers 4.46.0.dev
File "/python/ai/train/sft_trainer.py", line 380, in <module>
    trainer = SFTTrainer(
              ^^^^^^^^^^^
  File "/python/ai/train/sft_trainer.py", line 380, in <module>
    trainer = SFTTrainer(
              ^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 401, in __init__
    super().__init__(
  File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 401, in __init__
    super().__init__(
  File "/root/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 639, in __init__
    self.model = _init_fsdp(self.model, self.accelerator, self.args.device)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 305, in _init_fsdp
    _ = model(**dummy_input)
        ^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 639, in __init__
    self.model = _init_fsdp(self.model, self.accelerator, self.args.device)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 305, in _init_fsdp
    _ = model(**dummy_input)
        ^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 863, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 863, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/accelerate/utils/operations.py", line 820, in forward
    return model_forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/accelerate/utils/operations.py", line 808, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/accelerate/utils/operations.py", line 820, in forward
    return model_forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
TypeError: LlamaForCausalLM.forward() got an unexpected keyword argument 'args'
  File "/root/miniconda3/lib/python3.11/site-packages/accelerate/utils/operations.py", line 808, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
TypeError: LlamaForCausalLM.forward() got an unexpected keyword argument 'args'
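A plausible explanation for where the unexpected `args` keyword comes from (a sketch, not a confirmed root-cause analysis): when forward is wrapped by a decorator that forwards `*args, **kwargs`, `__code__.co_varnames` reports the wrapper's locals rather than the real parameter names, so the dummy-input dict ends up with keys like "args".

```python
import functools

def some_decorator(func):
    # Stand-in for any decorator that forwards *args/**kwargs.
    @functools.wraps(func)
    def wrapped_func(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapped_func

@some_decorator
def forward(input_ids=None, attention_mask=None):
    return input_ids

# functools.wraps copies __name__ and __doc__ but not __code__, so co_varnames
# here is ('args', 'kwargs') instead of the real parameter names.
print(forward.__code__.co_varnames)  # ('args', 'kwargs')
```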

SunMarc (Member) commented Oct 16, 2024

Thanks for the heads-up @Qubitium @ringohoffman! I will revert this PR!

SunMarc added a commit that referenced this pull request Oct 16, 2024
Itssshikhar (Contributor, Author)

Thanks for the info on the PR, @Qubitium @ringohoffman. I'll try to resolve the errors.

muellerzr pushed a commit that referenced this pull request Oct 16, 2024
Revert "Fix FSDP resume Initialization issue (#34032)"

This reverts commit 4de1bdb.
NielsRogge pushed commits to NielsRogge/transformers that referenced this pull request on Oct 21, 2024, and BernardZach pushed commits to BernardZach/transformers (Dec 5, 2024) and innovationcore/transformers (Dec 6, 2024) that referenced it as well. In each case the pair of commits mirrors the original change and its revert:

* Fix FSDP Initialization for resume training
* Added init_fsdp function to work with dummy values
* Fix FSDP initialization for resuming training
* Added CUDA decorator for tests
* Added torch_gpu decorator to FSDP tests
* Fixup for failing code quality tests

Revert "Fix FSDP resume Initialization issue (huggingface#34032)"

This reverts commit 4de1bdb.
Successfully merging this pull request may close these issues: Load fsdp+lora checkpoint error.

7 participants