Cluster shutdown hangs in batch mode on Linux Python >3.8 #87
Comments
Note that even with the merge of #89, these errors still occur.
One potential solution to this is to not use an implicit, `atexit`-based shutdown at all. That is, replace the

```python
initialize()

with Client() as c:
    # dask stuff
```

pattern with the explicit

```python
initialize()

with Client() as c:
    # dask stuff
    c.shutdown()
```

pattern, and then replace the explicit `c.shutdown()` with the

```python
def finalize(...):
    with Client() as c:
        c.shutdown()
```

so that the dask-mpi client script might look like:

```python
initialize()

with Client() as c:
    # dask stuff
    # maybe more stuff

finalize()
```

However, this definitely breaks backwards compatibility. It does fix the above hanging, though. Is this acceptable?
Possible API changes

If we have to break backwards compatibility, then I want to suggest some other changes that might work with the suggested changes above. With @joezuntz's changes to allow users to pass an existing MPI communicator object into Dask-MPI and to turn off the explicit "shutdown" of the Dask cluster, there is now a foreseeable mechanism for stopping/restarting a Dask-MPI cluster from within a client batch process. As I understand it, this requires 3 components: an `initialize()` function, an `is_client()` test, and a `finalize()` function.
The general outline of how to use these three functions together would be like so:

```python
from dask_mpi import initialize, is_client, finalize

initialize()  # Blocks scheduler and worker MPI ranks HERE!!!

if is_client():
    # Do your client dask operations here.
    # When the scheduler and worker MPI ranks unblock due to the finalize()
    # call below, this section will be skipped by those ranks.
    ...

finalize()  # Everything after this can act like a normal mpi4py script
```

...This whole thing is starting to look to me like a context manager, no?
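As a rough illustration of the `is_client()` piece only (a hypothetical sketch, not dask-mpi's API): batch mode conventionally runs the scheduler on MPI rank 0 and the client script on rank 1, so the test could reduce to a rank check.

```python
# Hypothetical sketch of is_client(), assuming the usual dask-mpi layout:
# rank 0 -> scheduler, rank 1 -> client script, ranks 2+ -> workers.
from mpi4py import MPI


def is_client(comm=None):
    comm = MPI.COMM_WORLD if comm is None else comm
    return comm.Get_rank() == 1
```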
I wonder if using a context manager would feel more natural here.

```python
def initialize(...):
    with Client() as c:
        ...
```
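For what it's worth, here is a minimal sketch of that context-manager idea, built on the existing `initialize()`; the wrapper name and the teardown details are illustrative assumptions, not a proposed API.

```python
from contextlib import contextmanager

from dask_mpi import initialize
from distributed import Client


@contextmanager
def mpi_dask_cluster(**kwargs):
    # initialize() blocks the scheduler and worker ranks, so only the
    # client rank ever reaches the body of the `with` block.
    initialize(**kwargs)
    client = Client()  # connects to the scheduler started by initialize()
    try:
        yield client
    finally:
        # Explicit teardown instead of relying on an atexit handler.
        client.shutdown()
        client.close()
```

Client code would then read `with mpi_dask_cluster() as c: ...`, with the shutdown happening explicitly before interpreter exit rather than in an `atexit` handler.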
@jacobtomlinson: Yeah. I was thinking of something like:

```python
with MPICluster(...) as cluster, Client(cluster) as c:
    if cluster.is_client():
        ...
```

where the … …but I don't like the nested …
This is kind of where I was going with the … However, @mrocklin suggested that …
Same issue here. I will try Python 3.8 for now.
Experiencing the same on Python 3.8. Can we think of some temporary workaround here? Maybe closing the MPI cluster manually after the `c.shutdown()`, or "request killing the workers / check the worker count / repeat if needed / close the scheduler"? I already implement the exit logic manually, so I don't mind adding a few more lines...
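A sketch of the kind of manual workaround described above (illustrative only; the `retire_workers()` call and its ordering are assumptions, not a tested fix):

```python
from dask_mpi import initialize
from distributed import Client

initialize()

with Client() as client:
    # ... normal dask work ...

    # Tear the cluster down explicitly while the event loop is still alive,
    # so nothing is left for the problematic atexit handler to do.
    client.retire_workers()  # ask the scheduler to close all workers
    client.shutdown()        # then stop the scheduler itself
```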
Yeah. If you have time for a PR, that would be great, @evgri243!
What happened:
When Dask-MPI is used in batch mode (i.e., using `initialize()`) on Linux with Python >3.8, it does not properly shut down the scheduler and worker processes when the client script completes; it hangs during shutdown. This means that the Python 3.9 and Python 3.10 tests of `dask_mpi/tests/test_core.py` and `dask_mpi/tests/test_no_exit.py` hang and never finish on CI.

Note that this only occurs on Linux. macOS executes without hanging.
What you expected to happen:
When the client script completes, the scheduler and worker processes should be shut down without error or hanging.
Minimal Complete Verifiable Example:
Manually executing the `dask_mpi/tests/core_basic.py` script, with Python 3.9+ on Linux, like so:

```
mpirun -l -np 4 python dask_mpi/tests/core_basic.py
```

results in:
Full Logs
HANGS HERE!!! Requires CTRL-C to exit.
Anything else we need to know?:
I believe this is due to changes in `asyncio` that occurred with the release of Python 3.9. In particular, it seems that the `asyncio.wait_for` function now blocks, when cancelling a task due to a timeout, until the task has finished cancellation (see the Python 3.9 release notes). This appears to be due to the `dask_mpi.initialize()` shutdown procedure depending upon an `asyncio` call taking place in an `atexit` handler. It seems that at the time the `atexit` handler is called, the `asyncio` loop has been closed, resulting in the `RuntimeError: cannot schedule new futures after interpreter shutdown` and the subsequent hanging.
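For reference, the same class of failure can be reproduced without Dask at all: on Python 3.9+, submitting work to a `concurrent.futures` executor from an `atexit` handler raises the same `RuntimeError`, because the futures machinery is already flagged as shut down by the time regular `atexit` handlers run. A standalone illustration (not dask-mpi code):

```python
import atexit
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)


def late_submit():
    # By the time regular atexit handlers run on Python 3.9+, the futures
    # machinery has been marked as shut down, so this raises:
    #   RuntimeError: cannot schedule new futures after interpreter shutdown
    executor.submit(print, "too late")


atexit.register(late_submit)
print("script body done; interpreter exiting now")
```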
Environment:

- Dask version: 2022.4.1
- Python version: 3.9.12