Don't keep box muller transform state between kernel launches #649
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #649 +/- ##
==========================================
+ Coverage 88.62% 88.97% +0.35%
==========================================
Files 111 108 -3
Lines 15062 14768 -294
==========================================
- Hits 13348 13140 -208
+ Misses 1714 1628 -86
e61dc4b to 1294128
…ase`` entries as well as initialisers for cleanup code
…IP backend
* Write struct to definitions
* Use new class for allocating memory and struct fields
* Reimplemented population RNG preamble and postamble in ``BackendSIMT::getPopulationRNG`` using new destructor mechanism to copy from and to internal struct
… generated in correct scope
1294128 to beb19d4
Wow ... what a detail and what an effect. It doesn't necessarily look very elegant but the savings seem worth it!
This builds on the new hybrid HIP/CUDA backend from #647 so review that first!
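To make the change described in the following paragraphs concrete, here is a minimal host-side C++ sketch of the copy-in/copy-out idea. It is not the code from this PR: `FullState` is a hypothetical stand-in for the full RNG state (generator words plus Box-Muller cache), `SlimState` plays the role of the slimmed-down struct kept in memory, and `updateKernel` is an ordinary function standing in for a single thread of a kernel. The generator step is Marsaglia's xorwow recurrence, included only so the state visibly advances.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mock of a full XORWOW state: 192 bits of generator state
// followed by 160 bits of Box-Muller cache. Field names are illustrative.
struct FullState {
    std::uint32_t d;
    std::uint32_t v[5];
    std::int32_t bmFlag;         // is a cached float normal available?
    std::int32_t bmFlagDouble;   // is a cached double normal available?
    float bmExtra;               // cached float normal
    double bmExtraDouble;        // cached double normal
};

// Slim state kept in memory between launches: generator words only.
struct SlimState {
    std::uint32_t d;
    std::uint32_t v[5];
};
static_assert(sizeof(SlimState) == 24, "192 bits of generator state");

// Copy the persistent fields into a local full state with an empty BM cache.
FullState loadState(const SlimState &s) {
    FullState f{};  // value-initialised, so the BM flags start at zero
    f.d = s.d;
    for (int i = 0; i < 5; i++) f.v[i] = s.v[i];
    return f;
}

// Copy the generator words back; the BM cache is deliberately discarded.
void storeState(const FullState &f, SlimState &s) {
    s.d = f.d;
    for (int i = 0; i < 5; i++) s.v[i] = f.v[i];
}

// One step of Marsaglia's xorwow generator.
std::uint32_t nextUInt(FullState &f) {
    const std::uint32_t t = f.v[0] ^ (f.v[0] >> 2);
    f.v[0] = f.v[1];
    f.v[1] = f.v[2];
    f.v[2] = f.v[3];
    f.v[3] = f.v[4];
    f.v[4] = (f.v[4] ^ (f.v[4] << 4)) ^ (t ^ (t << 1));
    f.d += 362437u;
    return f.v[4] + f.d;
}

// Stand-in for one thread of an update kernel: load, draw, store.
void updateKernel(SlimState &persistent, int draws) {
    assert(draws >= 0);
    FullState local = loadState(persistent);  // preamble
    for (int i = 0; i < draws; i++) {
        nextUInt(local);
    }
    storeState(local, persistent);            // postamble
}
```

Because only `SlimState` is read and written per launch, the per-thread RNG traffic in this sketch drops from `sizeof(FullState)` (48 bytes with padding on common 64-bit ABIs) to 24 bytes. The trade-off is that a cached second normal sample does not survive across launches.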
As described in #648, beyond its (arguably excessive) 192 bits of state, the curand XORWOW RNG used to provide randomness for neuron and custom connectivity updates also stores 160 bits of Box-Muller transform state in the `curandState` struct (BM draws two numbers and produces two normally distributed values, so this state is used to cache one of those results for subsequent calls to `curand_normal`).

In this PR, when using the CUDA and HIP backends, we create our own `XORWowStateInternal` struct in definitions.h without the BM state and store this in memory. At the start of the neuron and custom connectivity update kernels, we copy the fields from the `XORWowStateInternal` struct into a local `curandState` and, at the end, we copy them back.

Excitingly, because we are very memory bandwidth bound, this makes the neuron kernel on the cortical microcircuit model about 60% faster. On my A5000 (running for 1 second):