When running the code on a single machine with multiple GPUs, I hit the same error with both DeepSpeed ZeRO and ZeRO-3. It occurs while Accelerate initializes the distributed model and ends in a RecursionError. For context, I used Llama3 as both the policy model and the critic model, and my only goal was to run the code end-to-end.
Code Modifications
I only modified part of the data preprocessing code; I did not alter the model class code.
Troubleshooting Steps Tried
I tried changing the DeepSpeed version, but that did not resolve the problem. The full traceback is below:
```
Traceback (most recent call last):
File "train_ppo.py", line 228, in <module>
main(opt)
File "train_ppo.py", line 220, in main
trainer = PPOTrainer(opt, policy_model, ref_model, critic_model, reward_model, accelerator)
File "/hy-tmp/My_MOSS-RLHF/ppo/ppo_trainer.py", line 116, in __init__
self.model, self.optimizer, self.scheduler = self.accelerator.prepare(self.model, self.optimizer, self.scheduler)
File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/accelerate/accelerator.py", line 1344, in prepare
result = self._prepare_deepspeed(*args)
File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/accelerate/accelerator.py", line 1851, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = ds_initialize(**kwargs)
File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/deepspeed/__init__.py", line 193, in initialize
engine = DeepSpeedEngine(args=args,
File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 271, in __init__
self._configure_distributed_model(model)
File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1160, in _configure_distributed_model
self.module.bfloat16()
File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 856, in bfloat16
return self._apply(lambda t: t.bfloat16() if t.is_floating_point() else t)
File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
[Previous line repeated 982 more times]
File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 663, in _apply
with torch.no_grad():
File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 133, in __enter__
torch.set_grad_enabled(False)
File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 228, in __init__
self.prev = torch.is_grad_enabled()
RecursionError: maximum recursion depth exceeded while calling a Python object
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58428 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58430 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58431 closing signal SIGTERM
```
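The traceback bottoms out in `nn.Module._apply()`, which recurses over `children()` with no cycle detection. So `[Previous line repeated 982 more times]` against Python's default recursion limit of 1000 usually means one of two things: either the module tree handed to `accelerator.prepare()` is unusually deep, or (more likely) some submodule indirectly contains one of its own ancestors, e.g. a wrapper that registers a model which in turn holds a reference back to the wrapper. Below is a minimal diagnostic sketch, not code from this repo; `max_module_depth` and `find_module_cycles` are placeholder helpers, and `model` stands for whatever object is passed to `prepare()`.

```python
import sys
import torch.nn as nn

# Quick check: a Llama-style model is nowhere near 1000 module levels deep, so
# if a higher limit does not make the error go away, the module graph almost
# certainly contains a cycle rather than just being deep.
sys.setrecursionlimit(10_000)

def max_module_depth(model: nn.Module) -> int:
    # named_modules() keeps a memo of visited modules, so it is safe to call
    # even if the graph does contain a back-reference.
    return max(name.count(".") + 1 for name, _ in model.named_modules())

def find_module_cycles(model: nn.Module) -> list:
    # Walk children() the same way _apply() does, but track the ancestor path
    # and stop whenever a child turns out to be one of its own ancestors.
    cycles = []

    def visit(module: nn.Module, ancestors: list) -> None:
        for name, child in module.named_children():
            if any(child is a for a in ancestors):
                cycles.append(
                    " -> ".join(a.__class__.__name__ for a in ancestors) + f" -> {name}"
                )
                continue  # this is exactly where _apply() would recurse forever
            visit(child, ancestors + [child])

    visit(model, [model])
    return cycles

# Example usage right before self.accelerator.prepare(...):
#   print(sys.getrecursionlimit())
#   print(max_module_depth(self.model))
#   print(find_module_cycles(self.model))
```

If `find_module_cycles` reports anything, breaking that back-reference (for example, keeping it out of the `nn.Module` attribute machinery via `object.__setattr__`, or holding a weak reference instead) should let `self.module.bfloat16()` complete; if the tree is merely deep, raising the recursion limit before `prepare()` is usually enough.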