
Fix hyperparameter search when optuna+deepseed #34642

Merged · 2 commits into huggingface:main · Nov 20, 2024

Conversation

corentin-ryr (Contributor)

Problem Overview

In a distributed DeepSpeed training setting, using Optuna for hyperparameter optimization (HPO) causes hyperparameter inconsistencies across processes. The root of the problem is that trainer._hp_search_setup, which applies the hyperparameters for each trial, only executes on the main process. This causes discrepancies in settings such as gradient_accumulation_steps across processes, leading to timeout errors due to process misalignment during training.
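
As a toy illustration of the failure mode (the numbers below are made up; this is not trainer code): with mismatched gradient_accumulation_steps, the ranks reach the gradient-synchronization collective a different number of times per epoch, so the collectives stop lining up and eventually hit the NCCL timeout.

micro_batches_per_epoch = 100
accum_on_main_rank = 4    # value applied by _hp_search_setup on the main process
accum_on_other_ranks = 1  # stale default on ranks that never saw the trial

print(micro_batches_per_epoch // accum_on_main_rank)    # 25 optimizer steps / syncs per epoch
print(micro_batches_per_epoch // accum_on_other_ranks)  # 100 optimizer steps / syncs per epoch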

The current code broadcasts the trainer arguments and applies them to the auxiliary processes, but it does not run the code that updates the DeepSpeed/Accelerate configuration. I also think the current code does not account for hyperparameters used in model_init on the auxiliary processes.

Solution

To ensure consistent hyperparameter settings across all processes, the following changes were implemented:

  1. Broadcast a FixedTrial object: After sampling a trial on the main process, we create a FixedTrial object with the sampled hyperparameters and broadcast it to the other processes.

  2. Update _report_to_hp_search: The _report_to_hp_search method was modified to skip updating the study on processes that received a FixedTrial (a FixedTrial has no reference to the study object, which is needed to report back to Optuna).

Implementation Details

In integration_utils.py:

After initializing the trial on the main process, we create a FixedTrial object with the same hyperparameters, serialize it, and broadcast it to all other processes. The train method is then called with the trial argument.

On the main process, we sample a trial (using the hp_space function) and create the FixedTrial object (line 240):

if trainer.args.world_size > 1:
    if trainer.args.parallel_mode != ParallelMode.DISTRIBUTED:
        raise RuntimeError("only support DDP optuna HPO for ParallelMode.DISTRIBUTED currently.")
    trainer.hp_space(trial)  # sample the search space so trial.params is populated
    fixed_trial = optuna.trial.FixedTrial(trial.params, trial.number)
    trial_main_rank_list = [fixed_trial]
    torch.distributed.broadcast_object_list(trial_main_rank_list, src=0)  # share with auxiliary ranks
    trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
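
For intuition, here is a standalone sketch, independent of the Trainer and using a hypothetical hp_space, of why replaying the broadcast FixedTrial yields identical hyperparameters on every rank: the suggest_* calls on a FixedTrial simply return the values recorded on the main process.

import optuna

def hp_space(trial):  # hypothetical search space, standing in for trainer.hp_space
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "gradient_accumulation_steps": trial.suggest_int("gradient_accumulation_steps", 1, 8),
    }

study = optuna.create_study(direction="minimize")
trial = study.ask()   # what the main process samples
hp_space(trial)       # populates trial.params
fixed = optuna.trial.FixedTrial(trial.params, trial.number)
# An auxiliary rank that deserializes `fixed` resolves exactly the same values:
assert hp_space(fixed) == trial.params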

On the auxiliary processes, we retrieve the FixedTrial and call train, passing the trial as an argument (line 269):

for i in range(n_trials):
    trainer.objective = None
    trial_main_rank_list = [None]
    if trainer.args.parallel_mode != ParallelMode.DISTRIBUTED:
        raise RuntimeError("only support DDP optuna HPO for ParallelMode.DISTRIBUTED currently.")
    # Receive the FixedTrial broadcast by the main process for this trial.
    torch.distributed.broadcast_object_list(trial_main_rank_list, src=0)
    trainer.train(resume_from_checkpoint=None, trial=trial_main_rank_list[0])
    # If there hasn't been any evaluation during the training loop.
    if getattr(trainer, "objective", None) is None:
        metrics = trainer.evaluate()
        trainer.objective = trainer.compute_objective(metrics)
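
For context, a hedged end-to-end usage sketch (the checkpoint, dataset, and ds_config.json path are placeholders, not taken from this PR): every rank runs the same script, e.g. deepspeed --num_gpus=2 run_hpo.py, and with this fix each trial's hyperparameters stay consistent across ranks.

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)

dataset = load_dataset("glue", "mrpc")
dataset = dataset.map(
    lambda batch: tokenizer(batch["sentence1"], batch["sentence2"], truncation=True),
    batched=True,
)

def model_init(trial):
    # Re-instantiated for every trial so per-trial hyperparameters apply to a fresh model.
    return AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "gradient_accumulation_steps": trial.suggest_int("gradient_accumulation_steps", 1, 8),
    }

args = TrainingArguments(
    output_dir="hpo-out",
    eval_strategy="epoch",
    num_train_epochs=1,
    deepspeed="ds_config.json",  # assumed DeepSpeed config file
)
trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
)
best_trial = trainer.hyperparameter_search(
    hp_space=hp_space, backend="optuna", n_trials=5, direction="minimize"
)
print(best_trial)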

Other changes:

The _hp_search_setup method now checks if it receives a dictionary of hyperparameters directly, allowing each process to configure itself independently (line 1751):

- if not trial.study._is_multi_objective():
+ if hasattr(trial, "study") and not trial.study._is_multi_objective():
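
A minimal sketch of what this guard achieves (illustrative function, not the literal trainer code): a FixedTrial received over the broadcast carries no study attribute, so only the main process reports intermediate values and handles pruning.

import optuna

def report_to_hp_search(trial, step, objective):
    # Auxiliary ranks hold a FixedTrial without a study, so they skip reporting.
    if hasattr(trial, "study") and not trial.study._is_multi_objective():
        trial.report(objective, step)
        if trial.should_prune():
            raise optuna.TrialPruned()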

In hp_params, the check on the trial object's class is broadened to BaseTrial (line 211):

- if isinstance(trial, optuna.Trial):
+ if isinstance(trial, optuna.trial.BaseTrial):
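
For reference, a small standalone check (assuming Optuna is installed) of why the broader isinstance is needed: the broadcast FixedTrial is not an optuna.Trial, but both classes derive from optuna.trial.BaseTrial, so the widened check also accepts the trials seen on auxiliary ranks.

import optuna

fixed = optuna.trial.FixedTrial({"learning_rate": 3e-5})
assert not isinstance(fixed, optuna.Trial)        # the old check would reject it
assert isinstance(fixed, optuna.trial.BaseTrial)  # the broadened check accepts it
assert fixed.suggest_float("learning_rate", 1e-6, 1e-4, log=True) == 3e-5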

Additional Notes

This change should ensure consistent hyperparameter application and prevent deadlocks during distributed training.

Let me know if there is a more efficient or cleaner way to handle the broadcast and reinitialization across processes. I think Marc Sun worked on the Optuna integration before.

@SunMarc (Member) left a comment

LGTM! Thanks for fixing this! Just a nit: could you check whether the Optuna tests from the trainer are passing? @sywangyi, would you like to take a look, since you contributed a lot to the Optuna integration?

Comment on lines +1728 to +1730

self.accelerator.free_memory()

Member:

Why have you added that here?

corentin-ryr (Contributor, Author):

I had an issue where the memory allocated by the DeepSpeed engine was not properly released. I found that calling engine.destroy() partially fixed the problem (accelerator.free_memory() calls engine.destroy()).

At the moment we still have some memory leakage, but I believe the issue is on the DeepSpeed side. I added some del statements to the destroy function of the DeepSpeed engine, which solved the issue, and I will try to propose a PR there.

I put the call to free_memory here because the engine is reset just after that (in self.create_accelerator_and_postprocess()).

@corentin-ryr (Contributor, Author)

Hey,
Thanks for the review.

The tests in tests/trainer/test_trainer.py run without issues, but I'm having trouble running the DeepSpeed tests (it seems to be an NCCL issue).

@SunMarc requested a review from @LysandreJik on November 18, 2024.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@LysandreJik (Member) left a comment

LGTM, except for a reservation around the free_memory call.

@@ -1725,6 +1725,9 @@ def _hp_search_setup(self, trial: Union["optuna.Trial", Dict[str, Any]]):
         if self.is_deepspeed_enabled:
             if self.args.deepspeed is None:
                 raise ValueError("For sweeps with deepspeed, `args.deepspeed` must be set")

+            self.accelerator.free_memory()
Member:

@muellerzr is this statement here safe?

Contributor:

Yep, it's safe and being used as it should be.

@muellerzr (Contributor) left a comment

Thanks for this solution!

@SunMarc merged commit bf42c3b into huggingface:main on Nov 20, 2024.
24 checks passed
@corentin-ryr deleted the fix-hp-search-ddp branch on November 21, 2024.
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
* Fix hyperparameter search when optuna+deepseed

* Adding free_memory to the search setup

---------

Co-authored-by: Corentin-Royer <corentin.royer@ibm.com>
BernardZach pushed a commit to innovationcore/transformers that referenced this pull request Dec 6, 2024