-
Notifications
You must be signed in to change notification settings - Fork 332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable GPU execution of atm_advance_acoustic_step via OpenACC #1251
base: develop
Are you sure you want to change the base?
Enable GPU execution of atm_advance_acoustic_step via OpenACC #1251
Conversation
NOTE: This PR is paused. I am sorting out the merge conflicts and a run-time error. I will notify again when this PR is ready for review. |
58ba84a
to
0730fa2
Compare
0730fa2
to
09e60a5
Compare
Force-push 0730fa2 to 09e60a5 to consistently add new invariant fields at the end of sections in @mgduda and @abishekg7 this should be ready for review! |
|
||
!MGD this loop will not be very load balanced with if-test below | ||
|
||
!$acc parallel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add default(present)
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, addressed now by fixup d7109c1
end if | ||
|
||
!$OMP BARRIER | ||
|
||
!$acc parallel | ||
!$acc loop gang private(ts,rs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it help to specify gang worker
here instead of only gang
? I tried it out and it improves performance marginally, but also wondering if there's a reason we want to keep this as gang
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think worker
wasn't specified in case I needed that level in this big loop. I can add it easily
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well after "easily" turned out to be harder than I thought, this is now addressed by fixup 2030ecc
!$acc loop seq | ||
do i=1,nEdgesOnCell(iCell) | ||
iEdge = edgesOnCell(i,iCell) | ||
cell1 = cellsOnEdge(1,iEdge) | ||
cell2 = cellsOnEdge(2,iEdge) | ||
!DIR$ IVDEP | ||
do k=1,nVertLevels | ||
flux = edgesOnCell_sign(i,iCell)*dts*dvEdge(iEdge)*ru_p(k,iEdge) * invAreaCell(iCell) | ||
rs(k) = rs(k)-flux | ||
ts(k) = ts(k)-flux*0.5*(theta_m(k,cell2)+theta_m(k,cell1)) | ||
!$acc loop vector | ||
do k=1,nVertLevels | ||
flux = edgesOnCell_sign(i,iCell)*dts*dvEdge(iEdge)*ru_p(k,iEdge) * invAreaCell(iCell) | ||
rs(k) = rs(k)-flux | ||
ts(k) = ts(k)-flux*0.5*(theta_m(k,cell2)+theta_m(k,cell1)) | ||
end do | ||
end do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question for @mgduda and @gdicker1 : We could perhaps rewrite this loop with an acc reduction(+: flux.. etc)
instead of acc loop seq
, but would our priority right now be to reorder the code as little as possible? I also have similar loops in the PR I'm working on.
(I did quickly try using reduction for this loop, but it didn't really result in any performance improvement. But perhaps there might be something off in my code)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's value in trying to modify the code as little as possible during this phase of the porting work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with both points - ts
and rs
seems like good candidates for a reduction clause here and we also shouldn't re-write the loop (yet). I'll keep this in mind for our optimization phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that makes sense. I'll also follow a similar approach for now with my PRs.
@mgduda and @abishekg7 this is ready for review now if you want. I plan to squash this into one commit like #1237 later today if you'd rather review that. |
2030ecc
to
6ec497f
Compare
@mgduda and @abishekg7, force-push 2030ecc to 6ec497f squashed this to one commit. Let me know what you think! EDIT: caught a typo of mine, this second force-push fixed it. |
6ec497f
to
68253c3
Compare
<<<<<<< HEAD | ||
!$acc loop gang worker vector collapse(2) | ||
======= | ||
!$acc loop collapse(2) | ||
>>>>>>> d7109c12a (fixup! Add acc data movement to atm_advance_acoustic_step_work) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a merge conflict here fyi.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for spotting it. I'll get that sorted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<<<<<<< HEAD | ||
!$acc loop gang worker private(ts,rs) | ||
======= | ||
!$acc loop gang private(ts,rs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enables the GPU execution of the atm_advance_acoustic_step_work subroutine by adding OpenACC directives. In order to discount the time spent to transfer data between CPU and GPU within this routine, the new timer 'atm_advance_acoustic_step [ACC_data_xfer]' has been added to the log file. Changes include: - Preparing the routine for porting. Modifying whitespace to make regions clear, changing implicit loop assignments to be explicit, and fusing some loops. - Adding OpenACC parallel and loop directives to the do-loops. - Managing the invariant fields needed for this routine in mpas_atm_dynamics_{init,finalize} so they are available across timesteps. - Managing the other fields needed in the routine with OpenACC directives and using default(present) to ensure data isn't missed. default(present) clauses cause a run-time error if data isn't present.
68253c3
to
6e40864
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested with both limited area and J-W baroclinic cases, and get bit identical results with the develop branch. Looks good.
This PR makes small code modifications and adds OpenACC directives so the
atm_advance_acoustic_step_work
routine can execute on GPU(s).Timing information for the OpenACC data transfers in this routine is captured in the log file by a new timer:
atm_advance_acoustic_step [ACC_data_xfer]
.Invariant fields used in this routine are also copied to the device within
mpas_atm_dynamics_init
and are deleted inmpas_atm_dynamics_finalize
.