Enable GPU execution of atm_advance_acoustic_step via OpenACC #1251

Open: 1 commit to merge into base branch develop

@gdicker1 (Collaborator) commented Dec 6, 2024

This PR makes small code modifications and adds OpenACC directives so the atm_advance_acoustic_step_work routine can execute on GPU(s).

Timing information for the OpenACC data transfers in this routine is captured in the log file by a new timer: atm_advance_acoustic_step [ACC_data_xfer].

Invariant fields used in this routine are also copied to the device within mpas_atm_dynamics_init and are deleted in mpas_atm_dynamics_finalize.
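
A minimal sketch of the device-residency pattern described above, not the code in this PR. The module name, routine names, and the use of dvEdge as a plain module allocatable are illustrative assumptions. It shows an unstructured enter data at init, a compute region that relies on the field already being present, and the matching exit data at finalize. Without an OpenACC compiler the directives are ordinary comments and the code runs serially.

```fortran
module invariant_fields_sketch
   implicit none
   ! Hypothetical stand-in for an invariant mesh field
   real, allocatable :: dvEdge(:)
contains
   subroutine dynamics_init_sketch(nEdges)
      integer, intent(in) :: nEdges
      allocate(dvEdge(nEdges))
      dvEdge(:) = 1000.0
      ! Copy the invariant field to the device once; it stays there across time steps
      !$acc enter data copyin(dvEdge)
   end subroutine

   subroutine scale_by_dvEdge(nEdges, ru_p)
      integer, intent(in) :: nEdges
      real, intent(inout) :: ru_p(nEdges)
      integer :: iEdge
      ! dvEdge is expected to be on the device already; only ru_p is transferred here
      !$acc parallel loop default(present) copy(ru_p)
      do iEdge = 1, nEdges
         ru_p(iEdge) = ru_p(iEdge) * dvEdge(iEdge)
      end do
   end subroutine

   subroutine dynamics_finalize_sketch()
      ! Release the device copy before freeing the host array
      !$acc exit data delete(dvEdge)
      deallocate(dvEdge)
   end subroutine
end module
```

Calling dynamics_init_sketch once, scale_by_dvEdge any number of times, and dynamics_finalize_sketch once mirrors the init / time-step / finalize ordering the description refers to.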

@mgduda added the Atmosphere and OpenACC (Work related to OpenACC acceleration of code) labels on Dec 13, 2024
@mgduda requested review from mgduda and abishekg7 on December 13, 2024 20:35
@gdicker1 (Collaborator, Author) commented:

NOTE: This PR is paused. I am sorting out the merge conflicts and a run-time error. I will notify again when this PR is ready for review.

@gdicker1 force-pushed the atmosphere/acc_advance_acoustic_step branch from 58ba84a to 0730fa2 on January 7, 2025 20:52
@gdicker1 (Collaborator, Author) commented Jan 7, 2025

The force-push from 58ba84a to 0730fa2 brings this branch more in line with develop. This PR should now be ready for review, @mgduda!

@gdicker1 force-pushed the atmosphere/acc_advance_acoustic_step branch from 0730fa2 to 09e60a5 on January 10, 2025 19:35
@gdicker1 (Collaborator, Author) commented:

The force-push from 0730fa2 to 09e60a5 consistently adds the new invariant fields at the end of their sections in mpas_atm_dynamics_{init,finalize}.

@mgduda and @abishekg7 this should be ready for review!


!MGD this loop will not be very load balanced with if-test below

!$acc parallel

Contributor:

Should we add default(present) here?

Collaborator (Author):

Yes, addressed now by fixup d7109c1
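
For context, a small sketch of what default(present) changes, under the assumption of a hypothetical routine zero_tendency that is not part of this PR. With default(present) on the parallel construct, arrays referenced in the region without an explicit data clause are treated as already present on the device, so a missing array is reported at run time instead of being copied implicitly. Here the enclosing data region is what makes tend present.

```fortran
module default_present_sketch
   implicit none
contains
   ! Hypothetical routine: zero a (level, cell) field on the device
   subroutine zero_tendency(nVertLevels, nCells, tend)
      integer, intent(in) :: nVertLevels, nCells
      real, intent(inout) :: tend(nVertLevels, nCells)
      integer :: iCell, k

      ! The data region places tend on the device for the duration of the block
      !$acc data copy(tend)

      ! default(present): no implicit copies; tend must already be on the device
      !$acc parallel default(present)
      !$acc loop gang worker vector collapse(2)
      do iCell = 1, nCells
         do k = 1, nVertLevels
            tend(k,iCell) = 0.0
         end do
      end do
      !$acc end parallel

      !$acc end data
   end subroutine
end module
```

Removing the data region in this sketch would turn the parallel construct into a run-time "not present" error under an OpenACC compiler, which is the behaviour that makes missed data movement easy to spot.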

end if

!$OMP BARRIER

!$acc parallel
!$acc loop gang private(ts,rs)

Collaborator:

Would it help to specify gang worker here instead of only gang? I tried it out and it improves performance marginally, but I'm also wondering if there's a reason we want to keep this as gang only.

Contributor:

I was wondering the same.

Collaborator (Author):

I think worker wasn't specified in case I needed that level of parallelism inside this big loop. I can add it easily.

Collaborator (Author):

Well, "easily" turned out to be harder than I thought, but this is now addressed by fixup 2030ecc.
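
To illustrate the gang worker arrangement discussed in this thread, here is a reduced sketch rather than the PR's loop. The routine column_update and its arguments are hypothetical. The outer cell loop is shared across gangs and workers, each iteration gets its own private scratch column rs, and the inner level loops use the vector level.

```fortran
module gang_worker_sketch
   implicit none
contains
   ! Hypothetical routine: per-cell scratch column, then an update of rho
   subroutine column_update(nVertLevels, nCells, src, rho)
      integer, intent(in) :: nVertLevels, nCells
      real, intent(in) :: src(nVertLevels, nCells)
      real, intent(inout) :: rho(nVertLevels, nCells)
      real :: rs(nVertLevels)   ! scratch column, made private below
      integer :: iCell, k

      !$acc parallel copyin(src) copy(rho)
      ! Cells are distributed over gangs and workers; each gets its own copy of rs
      !$acc loop gang worker private(rs)
      do iCell = 1, nCells
         !$acc loop vector
         do k = 1, nVertLevels
            rs(k) = 2.0 * src(k,iCell)
         end do
         !$acc loop vector
         do k = 1, nVertLevels
            rho(k,iCell) = rho(k,iCell) + rs(k)
         end do
      end do
      !$acc end parallel
   end subroutine
end module
```

Using worker on the outer loop means it is no longer available for any loop nested inside, which matches the concern raised above about reserving that level.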

Comment on lines +2585 to 2597
+!$acc loop seq
 do i=1,nEdgesOnCell(iCell)
    iEdge = edgesOnCell(i,iCell)
    cell1 = cellsOnEdge(1,iEdge)
    cell2 = cellsOnEdge(2,iEdge)
-!DIR$ IVDEP
-   do k=1,nVertLevels
-      flux = edgesOnCell_sign(i,iCell)*dts*dvEdge(iEdge)*ru_p(k,iEdge) * invAreaCell(iCell)
-      rs(k) = rs(k)-flux
-      ts(k) = ts(k)-flux*0.5*(theta_m(k,cell2)+theta_m(k,cell1))
+   !$acc loop vector
+   do k=1,nVertLevels
+      flux = edgesOnCell_sign(i,iCell)*dts*dvEdge(iEdge)*ru_p(k,iEdge) * invAreaCell(iCell)
+      rs(k) = rs(k)-flux
+      ts(k) = ts(k)-flux*0.5*(theta_m(k,cell2)+theta_m(k,cell1))
    end do
 end do

Collaborator:

Question for @mgduda and @gdicker1: We could perhaps rewrite this loop with an acc reduction(+: flux.. etc) instead of acc loop seq, but is our priority right now to reorder the code as little as possible? I also have similar loops in the PR I'm working on.

(I did quickly try using a reduction for this loop, but it didn't result in any real performance improvement. There might be something off in my code, though.)

Contributor:

I think there's value in trying to modify the code as little as possible during this phase of the porting work.

Collaborator (Author):

I agree with both points: ts and rs seem like good candidates for a reduction clause here, and we also shouldn't rewrite the loop (yet). I'll keep this in mind for our optimization phase.

Collaborator:

Yeah, that makes sense. I'll also follow a similar approach for now with my PRs.
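
For reference, one possible shape of the deferred reduction variant, as a sketch only. It is not what this PR does and not necessarily what was tried above. The routine accumulate_flux, the scalar rs_k, and the simplified flux array are assumptions made for illustration. The level loop moves outside the edge loop so that each level accumulates into a scalar, which a reduction clause can then handle.

```fortran
module reduction_sketch
   implicit none
contains
   ! Hypothetical routine: sum edge fluxes into each (level, cell) entry of rs
   subroutine accumulate_flux(nVertLevels, nCells, nEdges, maxEdges, &
                              nEdgesOnCell, edgesOnCell, flux, rs)
      integer, intent(in) :: nVertLevels, nCells, nEdges, maxEdges
      integer, intent(in) :: nEdgesOnCell(nCells)
      integer, intent(in) :: edgesOnCell(maxEdges, nCells)
      real, intent(in) :: flux(nVertLevels, nEdges)
      real, intent(out) :: rs(nVertLevels, nCells)
      integer :: iCell, k, i
      real :: rs_k

      !$acc parallel copyin(nEdgesOnCell, edgesOnCell, flux) copyout(rs)
      !$acc loop gang
      do iCell = 1, nCells
         !$acc loop worker private(rs_k)
         do k = 1, nVertLevels
            rs_k = 0.0
            ! The edge loop is now the reduction loop instead of a sequential loop
            !$acc loop vector reduction(+:rs_k)
            do i = 1, nEdgesOnCell(iCell)
               rs_k = rs_k - flux(k, edgesOnCell(i,iCell))
            end do
            rs(k,iCell) = rs_k
         end do
      end do
      !$acc end parallel
   end subroutine
end module
```

A parallel reduction may sum the edge contributions in a different order than the sequential loop, so results are not guaranteed to stay bit-identical with the CPU code. That is a further reason to leave this for the optimization phase.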

@gdicker1 changed the title from "Enable GPU exection of atm_advance_acoustic_step via OpenACC" to "Enable GPU execution of atm_advance_acoustic_step via OpenACC" on Jan 17, 2025
@gdicker1 (Collaborator, Author) commented:

@mgduda and @abishekg7 this is ready for review now if you want. I plan to squash this into one commit like #1237 later today if you'd rather review that.

@gdicker1 force-pushed the atmosphere/acc_advance_acoustic_step branch from 2030ecc to 6ec497f on January 17, 2025 23:01
@gdicker1 (Collaborator, Author) commented Jan 17, 2025

@mgduda and @abishekg7, force-push 2030ecc to 6ec497f squashed this to one commit. Let me know what you think!

EDIT: I caught a typo of mine; this second force-push fixed it.

@gdicker1 force-pushed the atmosphere/acc_advance_acoustic_step branch from 6ec497f to 68253c3 on January 17, 2025 23:11
Comment on lines 2541 to 2545
<<<<<<< HEAD
!$acc loop gang worker vector collapse(2)
=======
!$acc loop collapse(2)
>>>>>>> d7109c12a (fixup! Add acc data movement to atm_advance_acoustic_step_work)

Collaborator:

There's a merge conflict here fyi.

Collaborator (Author):

Thanks for spotting it. I'll get that sorted.

Collaborator (Author):

Addressed by force-push 68253c3 to 6e40864.

Comment on lines 2578 to 2581
<<<<<<< HEAD
!$acc loop gang worker private(ts,rs)
=======
!$acc loop gang private(ts,rs)

Collaborator:

Also here

Collaborator (Author):

Thanks again. Addressed by force-push 68253c3 to 6e40864.

Enables the GPU execution of the atm_advance_acoustic_step_work subroutine by
adding OpenACC directives. In order to discount the time spent to transfer data
between CPU and GPU within this routine, the new timer
'atm_advance_acoustic_step [ACC_data_xfer]' has been added to the log file.

Changes include:
- Preparing the routine for porting: modifying whitespace to make regions clear,
  changing implicit loop assignments to be explicit, and fusing some loops.
- Adding OpenACC parallel and loop directives to the do-loops.
- Managing the invariant fields needed for this routine in
  mpas_atm_dynamics_{init,finalize} so they are available across timesteps.
- Managing the other fields needed in the routine with OpenACC directives and
  using default(present) to ensure data isn't missed. default(present) clauses
  cause a run-time error if data isn't present.
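
As an illustration of the first bullet above, a sketch of turning implicit (whole-array) assignments into one explicit, fused loop nest. The routine save_and_reset and its arrays are hypothetical and do not come from the PR. An acc loop directive has to be attached to a do loop, so array-syntax assignments give it nothing to attach to until they are written out.

```fortran
module explicit_loop_sketch
   implicit none
contains
   ! Hypothetical routine: save a field and reset an accumulator in one pass
   subroutine save_and_reset(nVertLevels, nCells, a, a_old, b)
      integer, intent(in) :: nVertLevels, nCells
      real, intent(in) :: a(nVertLevels, nCells)
      real, intent(out) :: a_old(nVertLevels, nCells)
      real, intent(out) :: b(nVertLevels, nCells)
      integer :: iCell, k

      ! CPU form, two implicit loops:
      !    a_old(:,:) = a(:,:)
      !    b(:,:) = 0.0
      ! Explicit and fused form, so a single loop directive covers both assignments:
      !$acc parallel loop gang worker vector collapse(2) copyin(a) copyout(a_old, b)
      do iCell = 1, nCells
         do k = 1, nVertLevels
            a_old(k,iCell) = a(k,iCell)
            b(k,iCell) = 0.0
         end do
      end do
   end subroutine
end module
```

The copyin and copyout clauses are only there to keep the sketch self-contained; in a routine that manages its data separately they would be replaced by default(present), as the last bullet describes.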

@gdicker1 force-pushed the atmosphere/acc_advance_acoustic_step branch from 68253c3 to 6e40864 on January 21, 2025 18:59

@abishekg7 (Collaborator) left a comment:

Tested with both the limited-area and J-W baroclinic cases and got bit-identical results compared with the develop branch. Looks good.

@mgduda self-requested a review on January 23, 2025 21:26