
RecursionError: maximum recursion depth exceeded while calling a Python object #59

Open
KKKe2922 opened this issue Jan 16, 2025 · 0 comments


When running the code on a single machine with multiple GPUs, I encounter a RecursionError, whether using DeepSpeed ZeRO or ZeRO-3. Specifically, the error occurs when Accelerate initializes the distributed model. For context, I used Llama3 as both the critic model and the policy model, aiming simply to run the entire code end-to-end.

Code Modifications

I only made changes to a portion of the data preprocessing code and did not alter the model class code.

Troubleshooting Steps Tried

I tried changing the DeepSpeed version, but this did not resolve the problem.

```
Traceback (most recent call last):
  File "train_ppo.py", line 228, in <module>
    main(opt)
  File "train_ppo.py", line 220, in main
    trainer = PPOTrainer(opt, policy_model, ref_model, critic_model, reward_model, accelerator)
  File "/hy-tmp/My_MOSS-RLHF/ppo/ppo_trainer.py", line 116, in __init__
    self.model, self.optimizer, self.scheduler = self.accelerator.prepare(self.model, self.optimizer, self.scheduler)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/accelerate/accelerator.py", line 1344, in prepare
    result = self._prepare_deepspeed(*args)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/accelerate/accelerator.py", line 1851, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = ds_initialize(**kwargs)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/deepspeed/__init__.py", line 193, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 271, in __init__
    self._configure_distributed_model(model)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1160, in _configure_distributed_model
    self.module.bfloat16()
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 856, in bfloat16
    return self._apply(lambda t: t.bfloat16() if t.is_floating_point() else t)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  [Previous line repeated 982 more times]
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 663, in _apply
    with torch.no_grad():
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 133, in __enter__
    torch.set_grad_enabled(False)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 228, in __init__
    self.prev = torch.is_grad_enabled()
RecursionError: maximum recursion depth exceeded while calling a Python object
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58428 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58430 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58431 closing signal SIGTERM
```
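For anyone debugging this: the traceback shows `torch.nn.Module._apply` recursing through the module tree without terminating, which typically means the module graph contains a cycle (a module that, directly or indirectly, holds a reference to itself or to one of its ancestors as a submodule). This is only one possible cause, but it can be reproduced in isolation with a hypothetical minimal sketch:

```python
import torch.nn as nn

# Hypothetical reproduction (assumption, not the actual MOSS-RLHF code):
# assigning a module to one of its own attributes registers it as a
# submodule, creating a cycle in the module tree.
model = nn.Linear(4, 4)
model.self_ref = model  # model now appears among its own children

try:
    # Same call DeepSpeed makes in _configure_distributed_model:
    # _apply walks children() recursively and never terminates.
    model.bfloat16()
except RecursionError:
    print("RecursionError reproduced")
```

If this matches your situation, check whether the policy/critic wrapper classes store a reference to themselves (or to a shared parent) via an attribute assignment that PyTorch registers as a submodule; storing such references in a plain list or via `object.__setattr__` avoids the registration.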