
Fix hyperparameter search when optuna+deepseed #34642

Merged · 2 commits into huggingface:main · Nov 20, 2024

Conversation

corentin-ryr (Contributor)

Problem Overview

In a distributed DeepSpeed training setting, using Optuna for hyperparameter optimization (HPO) causes hyperparameter inconsistencies across processes. The root of the problem is that trainer._hp_search_setup, which applies the hyperparameters for each trial, only executes on the main process. This causes discrepancies in settings such as gradient_accumulation_steps across processes, leading to timeout errors due to process misalignment during training.
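
As a toy illustration of the failure mode (the numbers below are made up; this is not trainer code): with mismatched gradient_accumulation_steps, the ranks reach the gradient-synchronization collective a different number of times per epoch, so the collectives stop lining up and eventually hit the NCCL timeout.

micro_batches_per_epoch = 100
accum_on_main_rank = 4    # value applied by _hp_search_setup on the main process
accum_on_other_ranks = 1  # stale default on ranks that never saw the trial

print(micro_batches_per_epoch // accum_on_main_rank)    # 25 optimizer steps / syncs per epoch
print(micro_batches_per_epoch // accum_on_other_ranks)  # 100 optimizer steps / syncs per epoch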

The current code broadcasts the trainer arguments and applies them to the auxiliary processes, but it does not run the code that updates the DeepSpeed/Accelerate configuration. I also think the current code does not account for hyperparameters used in model_init on the auxiliary processes.

Solution

To ensure consistent hyperparameter settings across all processes, the following changes were implemented:

  1. Broadcast a FixedTrial object: After sampling a trial on the main process, we create a FixedTrial object with the sampled hyperparameters and broadcast it to the other processes.

  2. Update _report_to_hp_search: The _report_to_hp_search method was modified to skip updating the study on processes that received a FixedTrial (a FixedTrial has no reference to the study object, which is needed to report back to Optuna).

Implementation Details

In integration_utils.py:

After initializing the trial on the main process, we create a FixedTrial object with the same hyperparameters, serialize it, and broadcast it to all other processes. The train method is then called with the trial argument.

On the main process, we sample a trial (using the hp_space function) and create the FixedTrial object (line 240):

if trainer.args.world_size > 1:
    if trainer.args.parallel_mode != ParallelMode.DISTRIBUTED:
        raise RuntimeError("only support DDP optuna HPO for ParallelMode.DISTRIBUTED currently.")
    trainer.hp_space(trial)  # sample the search space so trial.params is populated
    fixed_trial = optuna.trial.FixedTrial(trial.params, trial.number)
    trial_main_rank_list = [fixed_trial]
    torch.distributed.broadcast_object_list(trial_main_rank_list, src=0)  # share with auxiliary ranks
    trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
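
For intuition, here is a standalone sketch, independent of the Trainer and using a hypothetical hp_space, of why replaying the broadcast FixedTrial yields identical hyperparameters on every rank: the suggest_* calls on a FixedTrial simply return the values recorded on the main process.

import optuna

def hp_space(trial):  # hypothetical search space, standing in for trainer.hp_space
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "gradient_accumulation_steps": trial.suggest_int("gradient_accumulation_steps", 1, 8),
    }

study = optuna.create_study(direction="minimize")
trial = study.ask()   # what the main process samples
hp_space(trial)       # populates trial.params
fixed = optuna.trial.FixedTrial(trial.params, trial.number)
# An auxiliary rank that deserializes `fixed` resolves exactly the same values:
assert hp_space(fixed) == trial.params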

On the auxiliary processes, we retrieve the FixedTrial and call train, passing the trial as an argument (line 269):

for i in range(n_trials):
    trainer.objective = None
    trial_main_rank_list = [None]
    if trainer.args.parallel_mode != ParallelMode.DISTRIBUTED:
        raise RuntimeError("only support DDP optuna HPO for ParallelMode.DISTRIBUTED currently.")
    # Receive the FixedTrial broadcast by the main process for this trial.
    torch.distributed.broadcast_object_list(trial_main_rank_list, src=0)
    trainer.train(resume_from_checkpoint=None, trial=trial_main_rank_list[0])
    # If there hasn't been any evaluation during the training loop.
    if getattr(trainer, "objective", None) is None:
        metrics = trainer.evaluate()
        trainer.objective = trainer.compute_objective(metrics)
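
For context, a hedged end-to-end usage sketch (the checkpoint, dataset, and ds_config.json path are placeholders, not taken from this PR): every rank runs the same script, e.g. deepspeed --num_gpus=2 run_hpo.py, and with this fix each trial's hyperparameters stay consistent across ranks.

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)

dataset = load_dataset("glue", "mrpc")
dataset = dataset.map(
    lambda batch: tokenizer(batch["sentence1"], batch["sentence2"], truncation=True),
    batched=True,
)

def model_init(trial):
    # Re-instantiated for every trial so per-trial hyperparameters apply to a fresh model.
    return AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "gradient_accumulation_steps": trial.suggest_int("gradient_accumulation_steps", 1, 8),
    }

args = TrainingArguments(
    output_dir="hpo-out",
    eval_strategy="epoch",
    num_train_epochs=1,
    deepspeed="ds_config.json",  # assumed DeepSpeed config file
)
trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
)
best_trial = trainer.hyperparameter_search(
    hp_space=hp_space, backend="optuna", n_trials=5, direction="minimize"
)
print(best_trial)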

Other changes:

The _hp_search_setup method now checks if it receives a dictionary of hyperparameters directly, allowing each process to configure itself independently (line 1751):

- if not trial.study._is_multi_objective():
+ if hasattr(trial, "study") and not trial.study._is_multi_objective():
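
A minimal sketch of what this guard achieves (illustrative function, not the literal trainer code): a FixedTrial received over the broadcast carries no study attribute, so only the main process reports intermediate values and handles pruning.

import optuna

def report_to_hp_search(trial, step, objective):
    # Auxiliary ranks hold a FixedTrial without a study, so they skip reporting.
    if hasattr(trial, "study") and not trial.study._is_multi_objective():
        trial.report(objective, step)
        if trial.should_prune():
            raise optuna.TrialPruned()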

In hp_params, the check on the trial object's class is broadened to BaseTrial (line 211):

- if isinstance(trial, optuna.Trial):
+ if isinstance(trial, optuna.trial.BaseTrial):
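
For reference, a small standalone check (assuming Optuna is installed) of why the broader isinstance is needed: the broadcast FixedTrial is not an optuna.Trial, but both classes derive from optuna.trial.BaseTrial, so the widened check also accepts the trials seen on auxiliary ranks.

import optuna

fixed = optuna.trial.FixedTrial({"learning_rate": 3e-5})
assert not isinstance(fixed, optuna.Trial)        # the old check would reject it
assert isinstance(fixed, optuna.trial.BaseTrial)  # the broadened check accepts it
assert fixed.suggest_float("learning_rate", 1e-6, 1e-4, log=True) == 3e-5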

Additional Notes

This change should ensure consistent hyperparameter application and prevent deadlocks during distributed training.

Let me know if there is a more efficient or cleaner way to handle the broadcast and reinitialization across processes. I think Marc Sun worked on the Optuna integration before.

@SunMarc (Member) left a comment

LGTM! Thanks for fixing this! Just a nit: could you check whether the Optuna tests from the trainer are passing? @sywangyi, would you like to take a look, since you contributed a lot to the Optuna integration?

Comment on lines +1728 to +1730

self.accelerator.free_memory()

Member:

Why have you added that here?

corentin-ryr (Contributor, Author):

I had an issue where the memory allocated by the DeepSpeed engine was not properly released. I found that calling engine.destroy() partially fixed the problem (accelerator.free_memory() calls engine.destroy()).

At the moment we still have some memory leakage, but I believe the issue is on the DeepSpeed side. I added some del statements to the destroy function of the DeepSpeed engine, which solved the issue, and I will try to propose a PR there.

I put the call to free_memory here because the engine is reset just after that (in self.create_accelerator_and_postprocess()).

@corentin-ryr (Contributor, Author)

Hey,
Thanks for the review.

The tests in tests/trainer/test_trainer.py run without issues, but I'm having trouble running the DeepSpeed tests (it seems to be an NCCL issue).

@SunMarc requested a review from @LysandreJik on November 18, 2024.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@LysandreJik (Member) left a comment

LGTM, except for a reservation around the free_memory call.

@@ -1725,6 +1725,9 @@ def _hp_search_setup(self, trial: Union["optuna.Trial", Dict[str, Any]]):
         if self.is_deepspeed_enabled:
             if self.args.deepspeed is None:
                 raise ValueError("For sweeps with deepspeed, `args.deepspeed` must be set")

+            self.accelerator.free_memory()
Member:

@muellerzr is this statement here safe?

Contributor:

Yep, it's safe and being used as it should be.

@muellerzr (Contributor) left a comment

Thanks for this solution!

@SunMarc merged commit bf42c3b into huggingface:main on Nov 20, 2024.
24 checks passed
@corentin-ryr deleted the fix-hp-search-ddp branch on November 21, 2024.
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
* Fix hyperparameter search when optuna+deepseed

* Adding free_memory to the search setup

---------

Co-authored-by: Corentin-Royer <corentin.royer@ibm.com>
BernardZach pushed a commit to innovationcore/transformers that referenced this pull request Dec 6, 2024