-
Notifications
You must be signed in to change notification settings - Fork 266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ZBVZeroBubble error #774
Comments
if i replace 'loop' to 'v' style:
|
Can you include your full repro instructions? (at least: whole toml file, command-line, branch/commit of torchtitan and pytorch being used) Just sharing the pipeline-specific snippet from the config isn't enough for me to run and reproduce your issue. |
@wconstab I think this feature is missing a proper interface in torchtitan. Here's what @H-Huang told me: ZBVSchedule requires that the stages on each rank to be ordered differently (v-formation). In titan we do interleaved by default so something like: Rank 0: 0, 4, 8 For allocating the stages in a V shape, it requires something like: Rank 0: 0 7 8 Did you mean to do InterleavedZeroBubSchedule instead? Or were you trying out the ZBVschedule? ZBV is new and I still haven't added a test for ZBV in titan and need to figure out a UX friendly way to do it. |
Compared to its importance, I hope there is a way to address #773, as ZeroBubble TGS is quite slow, which renders it ineffective. |
I found that the |
I think one way to address this is to
@H-Huang wdyt? If we need to add the arg for (3), then we could also do that first to unblock titan users and then work on the (1, 2) steps as a cleanup |
Sorry, I started a PR addressing this by adding the cmd arg, let me get it into a good state for review: #787 |
We use `stage_index_to_group_rank` in the stage to determine what send/recv ops and in the schedule for IR generation. However, we don't need to expose this as an argument in our schedule class, so this stack of PRs is to remove it. This PR creates a `stage_index_to_group_rank` utility function and removes the arg for the ZBVschedule. In a following PR I will add code to infer the `stage_index_to_group_rank` for the CSV schedule path and we will be able to remove this argument from our classes entirely. Related comment from wconstab pytorch/torchtitan#774 (comment) cc awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
We use `stage_index_to_group_rank` in the stage to determine what send/recv ops and in the schedule for IR generation. However, we don't need to expose this as an argument in our schedule class, so this stack of PRs is to remove it. This PR creates a `stage_index_to_group_rank` utility function and removes the arg for the ZBVschedule. In a following PR I will add code to infer the `stage_index_to_group_rank` for the CSV schedule path and we will be able to remove this argument from our classes entirely. Related comment from @wconstab pytorch/torchtitan#774 (comment) Pull Request resolved: #146193 Approved by: https://github.com/wconstab
world-size is 8
The text was updated successfully, but these errors were encountered: