
Faster Bloch GPU #537

Open · wants to merge 5 commits into master
Conversation

rkierulf (Collaborator)

To keep my GPU programming skills sharp while applying to jobs, I've been working on a faster version of the Bloch GPU code I wrote this past summer. The first thing I noticed was that my earlier profiling wasn't 100% correct - in Julia, GPU kernel launches are asynchronous, but they can appear synchronous since kernels execute in the same order they are submitted. So I also needed to add KA.synchronize(backend) before each @signpost_event to see which parts were actually taking the longest. The lines taking the longest were these:

Δϕ .= (Bz[:,2:end] .+ Bz[:,1:end-1]) .* Δt .* T(-π .* γ)

pre.Mxy[:,2:end] .= M.xy .* exp.(-seq_block.tp_ADC' ./ p.T2) .* cis.(ϕ_ADC)

pre.φ .= T(-π .* γ) .* (pre.B[:,1:end-1] .* seq.Δt')

I think with the sliced indexing of the Bz, B, and M.xy matrices, it is too hard for the compiler to generate efficient kernels for these operations without allocating new memory, so writing simple custom kernels for them should be faster.
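For the first line above, such a kernel might look like the sketch below - a rough illustration, not the exact code in this PR, and it assumes Δt is a per-timestep vector:

using KernelAbstractions

# One thread per spin; each thread walks the time dimension,
# so no temporary sliced arrays need to be materialized.
@kernel function compute_Δϕ!(Δϕ, @Const(Bz), @Const(Δt), γ)
    i = @index(Global)
    T = eltype(Δϕ)
    for k in 1:size(Δϕ, 2)
        @inbounds Δϕ[i, k] = (Bz[i, k+1] + Bz[i, k]) * Δt[k] * T(-π * γ)
    end
end

# Hypothetical launch, one workitem per spin:
compute_Δϕ!(get_backend(Δϕ))(Δϕ, Bz, Δt, γ, ndrange=size(Δϕ, 1))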

The other major thing I noticed was that the previous approach of distributing work to one thread per (time point, spin index) pair is likely not optimal, since the implementation was memory bound. Each time we create a new matrix, as in the line below:

pre.ΔT1 .= exp.(-seq.Δt' ./ p.T1)

this requires (number of spins) × (number of time points) writes to global GPU memory, plus the same number of reads later. In theory, only (number of spins) + (number of time points) global memory reads are needed. Consider the following cumsum example:

using Metal, BenchmarkTools, KernelAbstractions

A = MtlArray(rand(Float32, (10000, 1000)))
B = similar(A)

# One thread per row of A; each thread performs a sequential
# scan along the second (time) dimension.
@kernel function naive_cumsum!(B, @Const(A))
    i = @index(Global)
    T = eltype(A)

    cur_val = zero(T)
    for k ∈ 1:size(A, 2)
        @inbounds cur_val += A[i, k]
        @inbounds B[i, k] = cur_val
    end
end

@benchmark Metal.@sync cumsum!(B, A; dims=2)

BenchmarkTools.Trial: 402 samples with 1 evaluation per sample.
 Range (min … max):  11.520 ms …  21.151 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.301 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.433 ms ± 688.393 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                ▁▃ ▄█▄▇▂▆▄         ▁  ▁     ▅▁                  
  ▃▃▆▄▁▁▁▁▃▁▁▃▄▃██▆████████▇▅▅▄▆▅▄▆█████▆▆▅▇███▅▃▃▁▃▃▃▁▁▃▃▁▃▁▃ ▄
  11.5 ms         Histogram: frequency by time         13.4 ms <

 Memory estimate: 7.33 KiB, allocs estimate: 309.

@benchmark Metal.@sync naive_cumsum!(MetalBackend())(B, A, ndrange=size(A,1))

BenchmarkTools.Trial: 1836 samples with 1 evaluation per sample.
 Range (min … max):  1.690 ms …   3.963 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.864 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.713 ms ± 547.521 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                    ▇█                         
  ▂▂▄▂▃▃▂▃▄▅▅▅▄▄▇▅▃▃▃▂▃▃▂▂▂▂▂▃▄▄▄▃▄▄██▆▄▄▃▂▃▂▃▃▁▂▂▂▂▃▃▂▂▂▂▃▅▃ ▂
  1.69 ms         Histogram: frequency by time        3.76 ms <

 Memory estimate: 5.16 KiB, allocs estimate: 216.

The "naive" version outperforms the default Metal accumulate function, which uses a tree-based scan, even though it launches far fewer threads: both kernels are memory bound, and the naive version has a more efficient memory access pattern for this example.

With this in mind, I've rewritten the run_spin_precession! and run_spin_excitation! functions with a focus on minimizing the total number of global memory reads, so that in each block only the phantom and sequence arrays are read from global memory, and temporary values calculated from them are kept in registers rather than GPU global memory. For the signal output, we do still need a matrix of size (number of spins) × (number of time points) in global memory, but this can be reduced by a factor of the block size by doing a block-level reduction in shared memory before writing to the output.
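As a rough sketch of that block-level reduction idea (illustrative only - the names reduce_signal!, sig_out, and Mxy are hypothetical, and it assumes the number of spins is a multiple of a power-of-two group size):

using KernelAbstractions

# Each thread loads one spin's transverse magnetization, then the
# workgroup tree-reduces in shared memory, so only one partial sum
# per block is written to global memory.
@kernel function reduce_signal!(sig_out, @Const(Mxy))
    i  = @index(Global)
    li = @index(Local)
    gi = @index(Group)
    N  = @groupsize()[1]

    shared = @localmem eltype(sig_out) (N,)
    @inbounds shared[li] = Mxy[i]
    @synchronize

    s = N ÷ 2
    while s > 0
        if li <= s
            @inbounds shared[li] += shared[li + s]
        end
        @synchronize
        s ÷= 2
    end

    # The first thread writes this block's partial sum; the host (or a
    # second tiny kernel) sums the per-block results afterwards.
    if li == 1
        @inbounds sig_out[gi] = shared[1]
    end
end

# Hypothetical launch with a group size of 256:
# reduce_signal!(backend, 256)(sig_out, Mxy, ndrange=length(Mxy))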

Testing the changes on my Mac M1, the benchmarks are about 3-5x faster, and they seem to make the most difference with a large phantom object - for example, if I remove the [1:10,000] part when setting up the slice selection benchmark so that it uses the full 3D brain phantom, the new version is over 10x faster. There is also a strange issue where two of the tests fail, but only in my test environment - if I copy the test body into its own script, it produces the expected results. I haven't been able to figure out the reason for this yet.

rkierulf requested a review from cncastillo as a code owner · January 23, 2025

cncastillo (Member)

Wow, amazing! Thanks @rkierulf! Currently the CI is a little messed up, so I cannot merge it straight away. I benchmarked the previous implementation with this code:

GPU_Benchmarks_ISMRM2025.zip

And got:

[benchmark plot: simulation time vs. problem size, showing a flat regime followed by linear growth]

The previous method didn't scale very well. Do you think the changes you have made can improve this? There are clearly two regimes, flat and linear growth; I'm not sure if they correspond to the memory- and compute-bound regimes. What are your thoughts on this?

rkierulf (Collaborator, Author) commented Jan 24, 2025

Interesting! My guess is that the transition from flat to linear growth begins when the simulation becomes large enough to fully saturate the GPU's resources. After that, due to the memory-bound nature of the problem, the growth in time should be proportional to the number of additional reads and writes to GPU global memory per added spin. Since the new implementation has far fewer GPU memory operations per spin, I would expect the slope to be less steep. And hopefully, since it is more efficient overall, the curve would shift somewhat to the right as well. I'd be curious to see the results of this script run again with the new changes.

Regarding the CI Failures:

  • oneAPI and non-Buildkite CI failures: these appear to be preexisting.
  • Metal failures: these are the same two tests I mentioned that were only failing in my test environment, except here they are failing on Julia 1.10. Very strange; I'll try to investigate some more.
  • CUDA and AMD failures: both are failing on 1.10 while calling KA.synchronize() in run_sim_time_iter!. Maybe I need to be more conservative and call KA.synchronize() immediately after the kernels finish; I'll see if this fixes the issue. EDIT: The issue doesn't appear to be synchronization-related - it's just that device errors are checked during synchronization. The tests need to be run with -g2 to see the full error stacktrace (https://discourse.julialang.org/t/cuda-kernel-error/31876/2); see the example command after this list.
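For reference, a -g2 run could look like the command below (the exact test invocation is an assumption; -g2 raises Julia's debug info level so the full device stacktrace is preserved):

julia -g2 --project -e 'using Pkg; Pkg.test("KomaMRICore")'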

@cncastillo
Copy link
Member

cncastillo commented Jan 25, 2025

Thanks for investigating. I will rerun the benchmarks using this PR when I have more time (moving to Stanford tomorrow, and then I have a conference in Washington DC).

I believe the problem with Julia 1.11 and oneAPI is: JuliaGPU/oneAPI.jl#473.

Recently they changed the internals of all the GPU packages to be based on KA.jl. oneAPI.jl v2.0 should include these changes, so bumping the compat to include 2 might fix that error, but I'm not sure.
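If that's the fix, the compat bump would presumably look something like this in the relevant Project.toml (the exact entry is an assumption):

[compat]
oneAPI = "1, 2"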

Don't mind the GitHub CI errors, @Stockless is currently fixing those.
