Fix nvidia mode setup logic #438

Open · wants to merge 4 commits into base: devel3
Conversation

@SomeBottle commented Jan 25, 2025

Related to #425

Recently I have been testing udocker on Google Colab and found an issue caused by the copy logic below:

if os.path.islink(srcname):
    linkto = os.readlink(srcname)
    os.symlink(linkto, dstname)
    Msg().out("Debug: is link", srcname, "to", dstname, l=Msg.DBG)
elif os.path.isfile(srcname):
    shutil.copy2(srcname, dstname)
    Msg().out("Debug: is file", srcname, "to", dstname, l=Msg.DBG)
    try:
        mask = stat.S_IMODE(os.stat(srcname).st_mode) | \
               stat.S_IWUSR | stat.S_IRUSR
        if os.access(srcname, os.X_OK):
            mask = mask | stat.S_IXUSR
        os.chmod(dstname, mask)
    except (IOError, OSError) as error:
        Msg().err("Error: change mask of nvidia file", error)

  • Tested with T4 GPU Runtime.

When copying nvidia-related libraries and executables, if the source file srcname is a symbolic link, the program will retrieve the target linkto of srcname and then directly create a symbolic link from dstname to linkto.

On the Google Colab platform, Nvidia's library files are stored under /usr/lib64-nvidia. Using the ls command, we can observe the following situation:

[screenshot: ls -l output of /usr/lib64-nvidia showing the library symlinks]

Assuming srcname='/usr/lib64-nvidia/libOpenCL.so.1.0' and the corresponding dstname='xxx/.udocker/containers/deb/ROOT//usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0', then:

  • linkto='libOpenCL.so.1.0.0'
  • os.symlink will create a symbolic link like this: xxx/.udocker/containers/deb/ROOT//usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0 -> libOpenCL.so.1.0.0

Because libOpenCL.so.1.0.0 is a relative target (just a filename), it is resolved against the directory containing the link, i.e. to xxx/.udocker/containers/deb/ROOT/usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0 when accessed.

libOpenCL.so.1.0.0 is a regular file that will be copied by shutil.copy2, so within the container, accessing libOpenCL.so.1.0 will correctly resolve to libOpenCL.so.1.0.0.
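
To make the relative-link case concrete, here is a small self-contained sketch (using hypothetical temporary directories, not the PR's code) of why re-creating a relative link keeps working once the regular file next to it has been copied:

import os
import shutil
import tempfile

# Hypothetical directories for illustration only: 'host' stands in for
# /usr/lib64-nvidia and 'root' for the container's
# .../ROOT/usr/lib/x86_64-linux-gnu directory.
host = tempfile.mkdtemp()
root = tempfile.mkdtemp()

open(os.path.join(host, "libOpenCL.so.1.0.0"), "w").close()               # regular file
os.symlink("libOpenCL.so.1.0.0", os.path.join(host, "libOpenCL.so.1.0"))  # relative link

srcname = os.path.join(host, "libOpenCL.so.1.0")
dstname = os.path.join(root, "libOpenCL.so.1.0")

# What the current copy logic does: copy the regular file, re-create the link.
shutil.copy2(os.path.join(host, "libOpenCL.so.1.0.0"),
             os.path.join(root, "libOpenCL.so.1.0.0"))
os.symlink(os.readlink(srcname), dstname)

# A relative target is resolved against the directory containing the link,
# so the re-created link still points at a file that exists inside 'root'.
print(os.path.realpath(dstname))   # .../libOpenCL.so.1.0.0 inside 'root'
print(os.path.exists(dstname))     # True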


However, in Google Colab, the nvidia-related executables in the /usr/bin directory are also symbolic links, for example:

/usr/bin/nvidia-smi -> /opt/bin/.nvidia/nvidia-smi  
  • srcname='/usr/bin/nvidia-smi'
  • dstname='xxx/.udocker/containers/deb/ROOT//usr/bin/nvidia-smi'
  • linkto='/opt/bin/.nvidia/nvidia-smi'

As a result, a symbolic link /usr/bin/nvidia-smi -> /opt/bin/.nvidia/nvidia-smi will be created in the container.

Since the program only creates the symbolic link and does not copy the actual file, /opt/bin/.nvidia/nvidia-smi does not exist in the container's file system (xxx/.udocker/containers/deb/ROOT/opt/bin/.nvidia/nvidia-smi does not exist). Therefore, in the shell, we cannot use nvidia-smi to view GPU information.

[screenshot: nvidia-smi failing inside the container]
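
One possible way to handle this dangling-link case (a sketch of the general idea with a hypothetical helper, not necessarily the exact code in this PR) is to keep relative links but dereference absolute links, since their targets are never copied into the container:

import os
import shutil

def copy_nvidia_entry(srcname, dstname):
    # Hypothetical helper, for illustration only.
    if os.path.islink(srcname):
        linkto = os.readlink(srcname)
        if not os.path.isabs(linkto):
            # e.g. libOpenCL.so.1.0 -> libOpenCL.so.1.0.0: the relative link
            # resolves inside the container once the regular file is copied too.
            os.symlink(linkto, dstname)
        else:
            # e.g. /usr/bin/nvidia-smi -> /opt/bin/.nvidia/nvidia-smi: the target
            # would not exist inside the container, so copy the real file instead.
            shutil.copy2(os.path.realpath(srcname), dstname)
    elif os.path.isfile(srcname):
        shutil.copy2(srcname, dstname)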

Other files may also be affected by this logic and may be missing in the container. That is why I tried to fix this issue. After the fix, the container can be successfully started following the steps in #425, and nvidia-smi can be executed. Additionally, a PyTorch test was successful:

[screenshot: the PyTorch test output after the fix]

/content/test.py:

import torch
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"GPU 可用,设备: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("GPU 不可用,设备: CPU")
tensor = torch.tensor([1.0, 2.0, 3.0])
tensor = tensor.to(device)
print(f"张量: {tensor}")
print(f"张量所在设备: {tensor.device}")
# compute
result = tensor * 2
print(result)

Note 1: urun is an alias for su somebottle -l -c

Note 2: Since the NVIDIA libraries themselves were copied correctly, PyTorch could already access the GPU before this fix.


Looking forward to your reply! Thank you for your great work.

@SomeBottle (Author)

I just added some unit tests. Looking forward to your response.

@SomeBottle changed the title from "Fix file copy logic in nvidia mode setup" to "Fix nvidia mode setup logic" on Jan 28, 2025
@SomeBottle (Author) commented Jan 28, 2025

In the latest commit, I tried to let udocker add the NVIDIA library path to LD_LIBRARY_PATH before running a container (if nvidia mode is set).

  • In a udocker container, the NVIDIA library path could be /usr/lib/x86_64-linux-gnu or /usr/lib64.

Although /usr/lib/x86_64-linux-gnu and /usr/lib64 are common default locations where the dynamic linker searches for libraries, some programs still detect the environment by inspecting LD_LIBRARY_PATH.
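
As a rough illustration of the idea (a hypothetical helper and directory list, not the PR's exact code), the environment passed to the container could be adjusted like this:

import os

# In-container NVIDIA library directories mentioned above (assumption: these
# are the only two locations that need to be prepended).
NVIDIA_LIB_DIRS = ["/usr/lib/x86_64-linux-gnu", "/usr/lib64"]

def with_nvidia_ld_library_path(env):
    # Return a copy of `env` with the NVIDIA library dirs prepended to
    # LD_LIBRARY_PATH, keeping whatever the image already set.
    env = dict(env)
    current = env.get("LD_LIBRARY_PATH", "")
    parts = NVIDIA_LIB_DIRS + ([current] if current else [])
    env["LD_LIBRARY_PATH"] = ":".join(parts)
    return env

# Example:
print(with_nvidia_ld_library_path({"LD_LIBRARY_PATH": "/opt/conda/lib"}))
# {'LD_LIBRARY_PATH': '/usr/lib/x86_64-linux-gnu:/usr/lib64:/opt/conda/lib'}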

For example, when using the image mentioned in #425 (pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime), the container created from it will throw a WARNING message upon startup:

[screenshot: WARNING message printed at container startup]

By checking the entrypoint scripts, I found that the warning mainly arises because the library path mentioned above is absent from LD_LIBRARY_PATH (or because the ldconfig -p cache is not updated, but it is not common to run ldconfig when starting a container).

After the fix, it won't print the warning message anymore:

[screenshot: container startup without the WARNING message]


In my test, PyTorch can still access the GPU normally without this commit (it probably falls back to the default search locations). But I am not sure whether there are applications that mainly depend on LD_LIBRARY_PATH to detect the libraries. In any case, this at least helps avoid some warning messages, and it does not disrupt the search for shared libraries.

I apologize if I’ve been a bit verbose, but I truly want to solve this issue. Thank you for taking the time to review this, and I appreciate your effort.
