Related to #425
While testing udocker on Google Colab recently, I found an issue caused by the copy logic below:
udocker/udocker/engine/nvidia.py
Lines 66 to 80 in 638bc42
When copying NVIDIA-related libraries and executables, if the source file `srcname` is a symbolic link, the program reads its target `linkto` and then directly creates a symbolic link from `dstname` to `linkto`.

On the Google Colab platform, NVIDIA's library files are stored under `/usr/lib64-nvidia`. Using the `ls` command, we can observe the following situation.

Assuming `srcname='/usr/lib64-nvidia/libOpenCL.so.1.0'` and the corresponding `dstname='xxx/.udocker/containers/deb/ROOT//usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0'`, then:

- `linkto='libOpenCL.so.1.0.0'`
- `os.symlink` will create a symbolic link like this: `xxx/.udocker/containers/deb/ROOT//usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0 -> libOpenCL.so.1.0.0`
- `libOpenCL.so.1.0.0` is a bare filename, which resolves to `xxx/.udocker/containers/deb/ROOT/usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0` when accessed.
- `libOpenCL.so.1.0.0` is a regular file that will be copied by `shutil.copy2`, so within the container, accessing `libOpenCL.so.1.0` will correctly resolve to `libOpenCL.so.1.0.0`.

However, in Google Colab, the NVIDIA-related executables in the `/usr/bin` directory are also symbolic links, for example:

- `srcname='/usr/bin/nvidia-smi'`
- `dstname='xxx/.udocker/containers/deb/ROOT//usr/bin/nvidia-smi'`
- `linkto='/opt/bin/.nvidia/nvidia-smi'`
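
The two cases can be reproduced outside udocker. The following is only a minimal sketch: it uses a temporary directory as a stand-in for the container `ROOT`, and `resolve_in_root` and `demo` are hypothetical helpers written for this illustration, not part of udocker. The helper imitates how a process confined to the container would follow a link, so an absolute target restarts from the container root rather than the host root.

```python
import os
import tempfile


def resolve_in_root(root, path, max_hops=40):
    """Follow symlinks for a container-side absolute `path`, confined to `root`.

    Simplification: only a symlink in the final path component is followed.
    """
    for _ in range(max_hops):
        host_path = os.path.join(root, path.lstrip("/"))
        if not os.path.islink(host_path):
            return host_path
        target = os.readlink(host_path)
        # A relative target is resolved against the link's own directory;
        # an absolute target replaces the path and restarts from `root`.
        path = os.path.normpath(os.path.join(os.path.dirname(path), target))
    raise OSError("too many levels of symbolic links")


def demo():
    with tempfile.TemporaryDirectory() as root:
        lib_dir = os.path.join(root, "usr/lib/x86_64-linux-gnu")
        bin_dir = os.path.join(root, "usr/bin")
        os.makedirs(lib_dir)
        os.makedirs(bin_dir)

        # Library case: the regular file is copied, the link sits next to it.
        open(os.path.join(lib_dir, "libOpenCL.so.1.0.0"), "w").close()
        os.symlink("libOpenCL.so.1.0.0", os.path.join(lib_dir, "libOpenCL.so.1.0"))

        # Executable case: only the link is created, its target is never copied.
        os.symlink("/opt/bin/.nvidia/nvidia-smi", os.path.join(bin_dir, "nvidia-smi"))

        lib_ok = os.path.exists(
            resolve_in_root(root, "/usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0"))
        smi_ok = os.path.exists(resolve_in_root(root, "/usr/bin/nvidia-smi"))
        return lib_ok, smi_ok
```

Calling `demo()` returns one flag per case: whether the library link and the `nvidia-smi` link each resolve to an existing file inside the stand-in root.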
As a result, a symbolic link `/usr/bin/nvidia-smi -> /opt/bin/.nvidia/nvidia-smi` will be created in the container.

Since the program only creates the symbolic link and does not copy the actual file, `/opt/bin/.nvidia/nvidia-smi` does not exist in the container's file system (`xxx/.udocker/containers/deb/ROOT/opt/bin/.nvidia/nvidia-smi` does not exist). Therefore, `nvidia-smi` cannot be used in the shell to view GPU information.

Other files may also be affected by this logic and end up missing in the container, which is why I tried to fix this issue. After the fix, the container can be started successfully following the steps in #425, and `nvidia-smi` can be executed. Additionally, the PyTorch test in `/content/test.py` was successful.

Note 1: `urun` is an alias for `su somebottle -l -c`
Note 2: Because the NVIDIA libraries were already copied correctly, PyTorch could access the GPU even before this fix.
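
For readers who want a concrete picture of one way to avoid the dangling link, here is a rough sketch. It is not necessarily what this PR implements: `copy_entry` is a hypothetical helper, and the rule it applies (keep the link only when the target is a bare filename, otherwise copy the file the link resolves to) is an assumption chosen for illustration.

```python
import os
import shutil


def copy_entry(srcname, dstname):
    """Copy srcname to dstname without leaving a dangling link (hypothetical)."""
    os.makedirs(os.path.dirname(dstname), exist_ok=True)
    if os.path.lexists(dstname):
        os.remove(dstname)

    if os.path.islink(srcname):
        linkto = os.readlink(srcname)
        # A bare filename resolves next to the link, where the sibling
        # regular file is expected to be copied too, so the link is kept.
        if not os.path.isabs(linkto) and os.sep not in linkto:
            os.symlink(linkto, dstname)
            return "symlink"
        # Any other target may live outside the copied tree, so copy the
        # file the link finally resolves to instead of the link itself.
        shutil.copy2(os.path.realpath(srcname), dstname)
        return "copied-target"

    shutil.copy2(srcname, dstname)
    return "copied"
```

With this rule, a link such as `libOpenCL.so.1.0 -> libOpenCL.so.1.0.0` is still recreated as a link, while `nvidia-smi`, whose target is an absolute path, ends up as a regular file in the container.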
Looking forward to your reply! Thank you for your great work.