Fix hyperparameter search when optuna+deepseed #34642
Conversation
Force-pushed from 3864814 to 6a9d459
LGTM! Thanks for fixing this! Just a nit: could you check whether the Optuna tests from the trainer are passing? @sywangyi, would you like to take a look, since you contributed a lot to the Optuna integration?
`self.accelerator.free_memory()`
Why have you added that here?
I had an issue where the memory allocated by the DeepSpeed engine was not properly released. I found that calling `engine.destroy()` partially fixed the problem (`accelerator.free_memory()` calls `engine.destroy()`).
At the moment we still have some memory leakage, but I believe that part of the issue is on the DeepSpeed side: adding some `del` statements in the `destroy` function of the DeepSpeed engine solved it, and I will try to propose a PR there.
I put the call to `free_memory` here because the engine is reset just after that (in `self.create_accelerator_and_postprocess()`).
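A minimal sketch of the cleanup order being described, using a standalone `accelerate.Accelerator` rather than the actual Trainer internals (the object names below are illustrative):

```python
# Illustrative only: mirrors the order described above, not the Trainer code.
# free_memory() drops the references the Accelerator holds to prepared objects
# (including the DeepSpeed engine) and triggers garbage collection / CUDA cache
# cleanup, so a fresh engine can be built for the next trial.
from accelerate import Accelerator

accelerator = Accelerator()

# ... model/optimizer would normally be prepared (and trained) here ...

accelerator.free_memory()  # release prepared objects before re-creating the engine
# the Trainer then rebuilds the engine in create_accelerator_and_postprocess()
```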
Force-pushed from 6a9d459 to 5704d08
Hey, the tests in …
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
LGTM except for a reservation around the `free_memory` call.
```diff
@@ -1725,6 +1725,9 @@ def _hp_search_setup(self, trial: Union["optuna.Trial", Dict[str, Any]]):
         if self.is_deepspeed_enabled:
             if self.args.deepspeed is None:
                 raise ValueError("For sweeps with deepspeed, `args.deepspeed` must be set")
+
+            self.accelerator.free_memory()
```
@muellerzr is this statement here safe?
Yep, it's safe and being used how it should be.
Thanks for this solution!
* Fix hyperparameter search when optuna+deepseed
* Adding free_memory to the search setup

Co-authored-by: Corentin-Royer <corentin.royer@ibm.com>
Problem Overview
In a distributed DeepSpeed training setting, using Optuna for hyperparameter optimization (HPO) causes hyperparameter inconsistencies across processes. The root of the problem is that `trainer._hp_search_setup`, which applies the hyperparameters for each trial, only executes on the main process. This causes discrepancies in settings such as `gradient_accumulation_steps` across processes, leading to timeout errors due to process misalignment during training.

The current code broadcasts the trainer arguments and applies them to the auxiliary processes, but it does not execute the code that updates the DeepSpeed/Accelerate configuration. I also think the current code would not account for hyperparameters used in `model_init` on the auxiliary processes.

Solution
To ensure consistent hyperparameter settings across all processes, the following changes were implemented:

1. Broadcast a `FixedTrial` object: after sampling a trial on the main process, we create a `FixedTrial` object and broadcast it to the other processes.
2. Update `_report_to_hp_search`: the method was modified to not update the study on processes that received a `FixedTrial` (a `FixedTrial` does not hold a reference to the study object, which is necessary to report to Optuna); see the sketch after this list.
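A minimal sketch of that guard, with simplified, illustrative names (the real method is `Trainer._report_to_hp_search` and also tracks the trainer's objective):

```python
import optuna
from optuna.trial import BaseTrial, FixedTrial


def report_to_hp_search(trial: BaseTrial, step: int, objective: float) -> None:
    """Illustrative guard: auxiliary processes only hold a broadcast FixedTrial,
    which has no reference to the study, so they skip reporting and pruning."""
    if isinstance(trial, FixedTrial):
        return
    trial.report(objective, step)
    if trial.should_prune():
        raise optuna.TrialPruned()
```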
Implementation Details

In `integration_utils.py`:

After initializing the trial on the main process, we create a `FixedTrial` object with the same hyperparameters, serialize it, and broadcast it to all other processes. The `train` method is then called with the trial argument: on the main process we sample a trial (using the `hp_space` function) and create the `FixedTrial` object (line 240); on the auxiliary processes we retrieve the `FixedTrial` and call `train`, passing the trial as an argument (line 269). Both code paths are sketched below.
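A condensed, illustrative sketch of the two paths, assuming a `torch.distributed` process group is already initialized; `run_hp_trial` and its arguments are hypothetical names, not the actual integration code:

```python
import pickle

import torch.distributed as dist
from optuna.trial import FixedTrial


def run_hp_trial(trainer, trial=None):
    """Hypothetical helper: share the sampled hyperparameters from rank 0 so
    every process trains with the same trial."""
    payload = [None]
    if dist.get_rank() == 0:
        # Main process: `trial` is the optuna.Trial sampled via the user's
        # `hp_space`; wrap its parameters in a FixedTrial and broadcast it.
        payload[0] = pickle.dumps(FixedTrial(trial.params, trial.number))
        dist.broadcast_object_list(payload, src=0)
        trainer.train(trial=trial)
    else:
        # Auxiliary processes: receive the FixedTrial and train with it, so
        # `_hp_search_setup` applies the same hyperparameters everywhere.
        dist.broadcast_object_list(payload, src=0)
        trainer.train(trial=pickle.loads(payload[0]))
```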
Other changes:

The `_hp_search_setup` method now checks whether it receives a dictionary of hyperparameters directly, allowing each process to configure itself independently (line 1751); a sketch of this check follows.
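A simplified, illustrative version of that branch (the real logic lives inside `Trainer._hp_search_setup` and mutates `self.args`):

```python
from typing import Any, Dict, Union

from optuna.trial import BaseTrial


def apply_trial_to_args(args: Any, trial: Union[BaseTrial, Dict[str, Any]]) -> Any:
    """Illustrative: accept either a trial object or a plain dict of
    hyperparameters and copy the values onto the training arguments."""
    params = trial if isinstance(trial, dict) else trial.params
    for key, value in params.items():
        if not hasattr(args, key):
            raise AttributeError(f"Trying to set {key}, which is not a training argument.")
        setattr(args, key, value)
    return args
```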
In `hp_params`, the check on the trial object's class is broadened to `BaseTrial` (line 211), roughly as sketched below.
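A simplified sketch of the broadened check (the real `hp_params` helper in `integration_utils.py` may differ in details):

```python
from typing import Any, Dict

from optuna.trial import BaseTrial


def hp_params(trial: Any) -> Dict[str, Any]:
    """Simplified sketch: accept any BaseTrial subclass (Trial, FixedTrial,
    FrozenTrial), not just optuna.Trial, and fall back to plain dicts."""
    if isinstance(trial, BaseTrial):
        return trial.params
    if isinstance(trial, dict):
        return trial
    raise RuntimeError(f"Unknown type for trial {trial!r}")
```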
Additional Notes

This change should ensure consistent hyperparameter application and prevent deadlocks during distributed training.

Let me know whether there might be a more efficient or cleaner way to handle the broadcast and reinitialization across processes. I think Marc Sun worked on the Optuna integration before.