
Faster Bloch GPU #537

Open · wants to merge 5 commits into master
Conversation

rkierulf (Collaborator)

To keep my GPU programming skills sharp while applying to jobs, I've been working on a faster version of the Bloch GPU code I wrote this past summer. The first thing I noticed was that my earlier profiling wasn't 100% correct - in Julia, GPU kernel launches are asynchronous, but they can appear synchronous since kernels execute in the same order they are submitted. So I also needed to add KA.synchronize(backend) before each @signpost_event to see which parts were actually taking the longest. The lines taking the longest were these:

Δϕ .= (Bz[:,2:end] .+ Bz[:,1:end-1]) .* Δt .* T(-π .* γ)

pre.Mxy[:,2:end] .= M.xy .* exp.(-seq_block.tp_ADC' ./ p.T2) .* cis.(ϕ_ADC)

pre.φ .= T(-π .* γ) .* (pre.B[:,1:end-1] .* seq.Δt')

I think with the sliced indexing of the Bz, B, and M.xy matrices, it is too hard for the compiler to generate efficient kernels for these operations without allocating new memory, so writing simple custom kernels for them should be faster.
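For the first line above, such a kernel might look like the sketch below - a rough illustration, not the exact code in this PR, and it assumes Δt is a per-timestep vector:

using KernelAbstractions

# One thread per spin; each thread walks the time dimension,
# so no temporary sliced arrays need to be materialized.
@kernel function compute_Δϕ!(Δϕ, @Const(Bz), @Const(Δt), γ)
    i = @index(Global)
    T = eltype(Δϕ)
    for k in 1:size(Δϕ, 2)
        @inbounds Δϕ[i, k] = (Bz[i, k+1] + Bz[i, k]) * Δt[k] * T(-π * γ)
    end
end

# Hypothetical launch, one workitem per spin:
compute_Δϕ!(get_backend(Δϕ))(Δϕ, Bz, Δt, γ, ndrange=size(Δϕ, 1))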

The other major thing I noticed was that the previous approach of distributing work to one thread per (time point, spin index) pair is likely not optimal, since the implementation was memory bound. Each time we create a new matrix, as in the line below:

pre.ΔT1 .= exp.(-seq.Δt' ./ p.T1)

this requires (number of spins) × (number of time points) writes to global GPU memory, plus the same number of reads later. In theory, only (number of spins) + (number of time points) global memory reads are needed. Consider the following cumsum example:

using Metal, BenchmarkTools, KernelAbstractions

A = MtlArray(rand(Float32, (10000, 1000)))
B = similar(A)

# One thread per row of A; each thread performs a sequential
# scan along the second (time) dimension.
@kernel function naive_cumsum!(B, @Const(A))
    i = @index(Global)
    T = eltype(A)

    cur_val = zero(T)
    for k ∈ 1:size(A, 2)
        @inbounds cur_val += A[i, k]
        @inbounds B[i, k] = cur_val
    end
end

@benchmark Metal.@sync cumsum!(B, A; dims=2)

BenchmarkTools.Trial: 402 samples with 1 evaluation per sample.
 Range (min … max):  11.520 ms …  21.151 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.301 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.433 ms ± 688.393 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                ▁▃ ▄█▄▇▂▆▄         ▁  ▁     ▅▁                  
  ▃▃▆▄▁▁▁▁▃▁▁▃▄▃██▆████████▇▅▅▄▆▅▄▆█████▆▆▅▇███▅▃▃▁▃▃▃▁▁▃▃▁▃▁▃ ▄
  11.5 ms         Histogram: frequency by time         13.4 ms <

 Memory estimate: 7.33 KiB, allocs estimate: 309.

@benchmark Metal.@sync naive_cumsum!(MetalBackend())(B, A, ndrange=size(A,1))

BenchmarkTools.Trial: 1836 samples with 1 evaluation per sample.
 Range (min … max):  1.690 ms …   3.963 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.864 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.713 ms ± 547.521 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                    ▇█                         
  ▂▂▄▂▃▃▂▃▄▅▅▅▄▄▇▅▃▃▃▂▃▃▂▂▂▂▂▃▄▄▄▃▄▄██▆▄▄▃▂▃▂▃▃▁▂▂▂▂▃▃▂▂▂▂▃▅▃ ▂
  1.69 ms         Histogram: frequency by time        3.76 ms <

 Memory estimate: 5.16 KiB, allocs estimate: 216.

The "naive" version outperforms the default Metal accumulate function, which uses a tree-based scan, even though it launches far fewer threads: both kernels are memory bound, and the naive version has a more efficient memory access pattern for this example.

With this in mind, I've rewritten the run_spin_precession! and run_spin_excitation! functions with a focus on minimizing the total number of global memory reads, so that in each block only the phantom and sequence arrays are read from global memory, and temporary values calculated from them are kept in registers rather than GPU global memory. For the signal output, we do still need a matrix of size (number of spins) × (number of time points) in global memory, but this can be reduced by a factor of the block size by doing a block-level reduction in shared memory before writing to the output.
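As a rough sketch of that block-level reduction idea (illustrative only - the names reduce_signal!, sig_out, and Mxy are hypothetical, and it assumes the number of spins is a multiple of a power-of-two group size):

using KernelAbstractions

# Each thread loads one spin's transverse magnetization, then the
# workgroup tree-reduces in shared memory, so only one partial sum
# per block is written to global memory.
@kernel function reduce_signal!(sig_out, @Const(Mxy))
    i  = @index(Global)
    li = @index(Local)
    gi = @index(Group)
    N  = @groupsize()[1]

    shared = @localmem eltype(sig_out) (N,)
    @inbounds shared[li] = Mxy[i]
    @synchronize

    s = N ÷ 2
    while s > 0
        if li <= s
            @inbounds shared[li] += shared[li + s]
        end
        @synchronize
        s ÷= 2
    end

    # The first thread writes this block's partial sum; the host (or a
    # second tiny kernel) sums the per-block results afterwards.
    if li == 1
        @inbounds sig_out[gi] = shared[1]
    end
end

# Hypothetical launch with a group size of 256:
# reduce_signal!(backend, 256)(sig_out, Mxy, ndrange=length(Mxy))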

Testing the changes on my Mac M1, the benchmarks are about 3-5x faster, and they seem to make the most difference with a large phantom object - for example, if I remove the [1:10,000] part when setting up the slice selection benchmark so that it uses the full 3D brain phantom, the new version is over 10x faster. There is also a strange issue where two of the tests fail, but only in my test environment - if I copy the test body into its own script, it produces the expected results. I haven't been able to figure out the reason for this yet.

rkierulf requested a review from cncastillo as a code owner · January 23, 2025

cncastillo (Member)

Wow, amazing! Thanks @rkierulf! Currently the CI is a little messed up, so I cannot merge it straight away. I benchmarked the previous implementation with this code:

GPU_Benchmarks_ISMRM2025.zip

And got:

[benchmark plot: simulation time vs. problem size, showing a flat regime followed by linear growth]

The previous method didn't scale very well. Do you think the changes you have made can improve this? There are clearly two regimes, flat and linear growth; I'm not sure if they correspond to the memory- and compute-bound regimes. What are your thoughts on this?

rkierulf (Collaborator, Author) commented Jan 24, 2025

Interesting! My guess is that the transition from flat to linear growth begins when the simulation becomes large enough to fully saturate the GPU's resources. After that, due to the memory-bound nature of the problem, the growth in time should be proportional to the number of additional reads and writes to GPU global memory per added spin. Since the new implementation has far fewer GPU memory operations per spin, I would expect the slope to be less steep. And hopefully, since it is more efficient overall, the curve would shift somewhat to the right as well. I'd be curious to see the results of this script run again with the new changes.

Regarding the CI Failures:

  • oneAPI and non-Buildkite CI failures: these appear to be preexisting.
  • Metal failures: these are the same two tests I mentioned that were only failing in my test environment, except here they are failing on Julia 1.10. Very strange; I'll try to investigate some more.
  • CUDA and AMD failures: both are failing on 1.10 while calling KA.synchronize() in run_sim_time_iter!. Maybe I need to be more conservative and call KA.synchronize() immediately after the kernels finish; I'll see if this fixes the issue. EDIT: The issue doesn't appear to be synchronization-related - it's just that device errors are checked during synchronization. The tests need to be run with -g2 to see the full error stacktrace (https://discourse.julialang.org/t/cuda-kernel-error/31876/2); see the example command after this list.
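For reference, a -g2 run could look like the command below (the exact test invocation is an assumption; -g2 raises Julia's debug info level so the full device stacktrace is preserved):

julia -g2 --project -e 'using Pkg; Pkg.test("KomaMRICore")'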

@cncastillo
Copy link
Member

cncastillo commented Jan 25, 2025

Thanks for investigating. I will rerun the benchmarks using this PR when I have more time (moving to Stanford tomorrow, and then I have a conference in Washington DC).

I believe the problem with Julia 1.11 and oneAPI is: JuliaGPU/oneAPI.jl#473.

Recently they changed the internals of all the GPU packages to be based on KA.jl. oneAPI.jl v2.0 should include these changes, so bumping the compat to include 2 might fix that error, but I'm not sure.
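If that's the fix, the compat bump would presumably look something like this in the relevant Project.toml (the exact entry is an assumption):

[compat]
oneAPI = "1, 2"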

Don't mind the GitHub CI errors, @Stockless is currently fixing those.
