Building cuda.bindings in parallel might fail non-deterministically #271

Closed
leofang opened this issue Dec 6, 2024 · 4 comments · Fixed by #361
Labels: bug (Something isn't working), cuda.bindings (Everything related to the cuda.bindings module), P0 (High priority - Must do!)

Comments

leofang commented Dec 6, 2024

Our new CI (#267) immediately exposes a longstanding, widely known setuptools/distutils issue when PARALLEL_LEVEL is set, xref:

    D:\a\cuda-python\cuda-python\cuda_bindings\cuda\bindings\_bindings\loader.cpp : fatal error C1083: Cannot open compiler generated file: 'D:\a\cuda-python\cuda-python\cuda_bindings\build\temp.win-amd64-cpython-310\Release\cuda\bindings\_bindings\loader.obj': Permission denied

leofang commented Dec 6, 2024

tl;dr for the above threads: the root cause is the way we build and link the same .c/.cpp file into multiple shared extension modules, i.e.,

    # private
    ["cuda/bindings/_bindings/*.pyx", "cuda/bindings/_bindings/loader.cpp"],

This triggers a race condition in setuptools (a bug that is unlikely to be fixed).
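
To make the collision concrete, here is a rough sketch (not the project's actual setup.py). It assumes a simplified model in which each source file is compiled to an object file at the same relative path under the build temp directory, which matches the loader.obj path in the error above. The module names mod_a and mod_b and the helper functions are hypothetical; only loader.cpp and its directory come from this issue.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def object_path(source: str, build_temp: str = "build/temp") -> str:
    """Simplified model: the object file mirrors the source path under build_temp."""
    return str(PurePosixPath(build_temp) / PurePosixPath(source).with_suffix(".obj"))


def find_colliding_objects(ext_sources: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each object file to the extension modules that would write it,
    keeping only objects written by more than one module."""
    writers = defaultdict(list)
    for module, sources in ext_sources.items():
        for src in sources:
            writers[object_path(src)].append(module)
    return {obj: mods for obj, mods in writers.items() if len(mods) > 1}


# Hypothetical module names; in a real setup.py these would be the
# name and sources of each Extension object.
extensions = {
    "cuda.bindings._bindings.mod_a": [
        "cuda/bindings/_bindings/mod_a.cpp",
        "cuda/bindings/_bindings/loader.cpp",
    ],
    "cuda.bindings._bindings.mod_b": [
        "cuda/bindings/_bindings/mod_b.cpp",
        "cuda/bindings/_bindings/loader.cpp",
    ],
}

collisions = find_colliding_objects(extensions)
for obj, mods in collisions.items():
    print(f"{obj} is written by {len(mods)} modules: {', '.join(mods)}")
```

When extensions are built in parallel, two workers can end up writing the one object file reported here at the same time, which is consistent with the "Permission denied" error on Windows.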

leofang added the bug, P1 (Medium priority - Should do), and cuda.bindings labels on Dec 6, 2024
leofang added this to the cuda-python 12-next, 11-next milestone on Dec 6, 2024

leofang commented Dec 8, 2024

Bumping this to P0 as it now shows up in virtually every single (build) CI run. @vzhurba01 please prioritize fixing issues caught by the CI. Thanks!

leofang added the P0 (High priority - Must do!) label and removed the P1 (Medium priority - Should do) label on Dec 8, 2024

leofang commented Dec 28, 2024

> Bumping this to P0 as it now shows up in virtually every single (build) CI run.

I noticed the situation got worse after CI 2.0 was merged. Now when we rerun the failed jobs, not only are the failed Windows build jobs rerun, but so are all downstream (test/doc) jobs, because, as noted earlier, GHA offers no way to declare a dependency on specific matrix elements of a job, only on the whole job.

leofang commented Jan 9, 2025

Looks like we fixed it! The build workflow has been run many times today, and I haven't observed any flaky failures.
