Fix nvidia mode setup logic #438

Open · wants to merge 4 commits into base: devel3
Conversation

@SomeBottle commented Jan 25, 2025

Related to #425

Recently I have been testing udocker on Google Colab and found an issue caused by the copy logic below:

if os.path.islink(srcname):
    linkto = os.readlink(srcname)
    os.symlink(linkto, dstname)
    Msg().out("Debug: is link", srcname, "to", dstname, l=Msg.DBG)
elif os.path.isfile(srcname):
    shutil.copy2(srcname, dstname)
    Msg().out("Debug: is file", srcname, "to", dstname, l=Msg.DBG)
    try:
        mask = stat.S_IMODE(os.stat(srcname).st_mode) | \
               stat.S_IWUSR | stat.S_IRUSR
        if os.access(srcname, os.X_OK):
            mask = mask | stat.S_IXUSR
        os.chmod(dstname, mask)
    except (IOError, OSError) as error:
        Msg().err("Error: change mask of nvidia file", error)

  • Tested with T4 GPU Runtime.

When copying nvidia-related libraries and executables, if the source file srcname is a symbolic link, the program will retrieve the target linkto of srcname and then directly create a symbolic link from dstname to linkto.

On the Google Colab platform, Nvidia's library files are stored under /usr/lib64-nvidia. Using the ls command, we can observe the following situation:

[screenshot: ls -l output of /usr/lib64-nvidia showing the library symlinks]

Assuming srcname='/usr/lib64-nvidia/libOpenCL.so.1.0' and the corresponding dstname='xxx/.udocker/containers/deb/ROOT//usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0', then:

  • linkto='libOpenCL.so.1.0.0'
  • os.symlink will create a symbolic link like this: xxx/.udocker/containers/deb/ROOT//usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0 -> libOpenCL.so.1.0.0

Because libOpenCL.so.1.0.0 is a relative target (just a filename), it is resolved against the directory containing the link, i.e. to xxx/.udocker/containers/deb/ROOT/usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0 when accessed.

libOpenCL.so.1.0.0 is a regular file that will be copied by shutil.copy2, so within the container, accessing libOpenCL.so.1.0 will correctly resolve to libOpenCL.so.1.0.0.
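
To make the relative-link case concrete, here is a small self-contained sketch (using hypothetical temporary directories, not the PR's code) of why re-creating a relative link keeps working once the regular file next to it has been copied:

import os
import shutil
import tempfile

# Hypothetical directories for illustration only: 'host' stands in for
# /usr/lib64-nvidia and 'root' for the container's
# .../ROOT/usr/lib/x86_64-linux-gnu directory.
host = tempfile.mkdtemp()
root = tempfile.mkdtemp()

open(os.path.join(host, "libOpenCL.so.1.0.0"), "w").close()               # regular file
os.symlink("libOpenCL.so.1.0.0", os.path.join(host, "libOpenCL.so.1.0"))  # relative link

srcname = os.path.join(host, "libOpenCL.so.1.0")
dstname = os.path.join(root, "libOpenCL.so.1.0")

# What the current copy logic does: copy the regular file, re-create the link.
shutil.copy2(os.path.join(host, "libOpenCL.so.1.0.0"),
             os.path.join(root, "libOpenCL.so.1.0.0"))
os.symlink(os.readlink(srcname), dstname)

# A relative target is resolved against the directory containing the link,
# so the re-created link still points at a file that exists inside 'root'.
print(os.path.realpath(dstname))   # .../libOpenCL.so.1.0.0 inside 'root'
print(os.path.exists(dstname))     # True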


However, in Google Colab, the nvidia-related executables in the /usr/bin directory are also symbolic links, for example:

/usr/bin/nvidia-smi -> /opt/bin/.nvidia/nvidia-smi  
  • srcname='/usr/bin/nvidia-smi'
  • dstname='xxx/.udocker/containers/deb/ROOT//usr/bin/nvidia-smi'
  • linkto='/opt/bin/.nvidia/nvidia-smi'

As a result, a symbolic link /usr/bin/nvidia-smi -> /opt/bin/.nvidia/nvidia-smi will be created in the container.

Since the program only creates the symbolic link and does not copy the actual file, /opt/bin/.nvidia/nvidia-smi does not exist in the container's file system (xxx/.udocker/containers/deb/ROOT/opt/bin/.nvidia/nvidia-smi does not exist). Therefore, in the shell, we cannot use nvidia-smi to view GPU information.

[screenshot: nvidia-smi failing inside the container]
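
One possible way to handle this dangling-link case (a sketch of the general idea with a hypothetical helper, not necessarily the exact code in this PR) is to keep relative links but dereference absolute links, since their targets are never copied into the container:

import os
import shutil

def copy_nvidia_entry(srcname, dstname):
    # Hypothetical helper, for illustration only.
    if os.path.islink(srcname):
        linkto = os.readlink(srcname)
        if not os.path.isabs(linkto):
            # e.g. libOpenCL.so.1.0 -> libOpenCL.so.1.0.0: the relative link
            # resolves inside the container once the regular file is copied too.
            os.symlink(linkto, dstname)
        else:
            # e.g. /usr/bin/nvidia-smi -> /opt/bin/.nvidia/nvidia-smi: the target
            # would not exist inside the container, so copy the real file instead.
            shutil.copy2(os.path.realpath(srcname), dstname)
    elif os.path.isfile(srcname):
        shutil.copy2(srcname, dstname)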

Other files may also be affected by this logic and may be missing in the container. That is why I tried to fix this issue. After the fix, the container can be successfully started following the steps in #425, and nvidia-smi can be executed. Additionally, a PyTorch test was successful:

[screenshot: the PyTorch test output after the fix]

/content/test.py:

import torch
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"GPU 可用,设备: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("GPU 不可用,设备: CPU")
tensor = torch.tensor([1.0, 2.0, 3.0])
tensor = tensor.to(device)
print(f"张量: {tensor}")
print(f"张量所在设备: {tensor.device}")
# compute
result = tensor * 2
print(result)

Note 1: urun is an alias for su somebottle -l -c

Note 2: Since the NVIDIA libraries themselves were copied correctly, PyTorch could already access the GPU before this fix.


Looking forward to your reply! Thank you for your great work.

@SomeBottle (Author)

I just added some unit tests. Looking forward to your response.

@SomeBottle changed the title from "Fix file copy logic in nvidia mode setup" to "Fix nvidia mode setup logic" on Jan 28, 2025
@SomeBottle (Author) commented Jan 28, 2025

In the latest commit, I tried to let udocker add the NVIDIA library path to LD_LIBRARY_PATH before running a container (if nvidia mode is set).

  • In a udocker container, the NVIDIA library path could be /usr/lib/x86_64-linux-gnu or /usr/lib64.

Although /usr/lib/x86_64-linux-gnu and /usr/lib64 are common default locations where the dynamic linker searches for libraries, some programs still detect the environment by inspecting LD_LIBRARY_PATH.
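
As a rough illustration of the idea (a hypothetical helper and directory list, not the PR's exact code), the environment passed to the container could be adjusted like this:

import os

# In-container NVIDIA library directories mentioned above (assumption: these
# are the only two locations that need to be prepended).
NVIDIA_LIB_DIRS = ["/usr/lib/x86_64-linux-gnu", "/usr/lib64"]

def with_nvidia_ld_library_path(env):
    # Return a copy of `env` with the NVIDIA library dirs prepended to
    # LD_LIBRARY_PATH, keeping whatever the image already set.
    env = dict(env)
    current = env.get("LD_LIBRARY_PATH", "")
    parts = NVIDIA_LIB_DIRS + ([current] if current else [])
    env["LD_LIBRARY_PATH"] = ":".join(parts)
    return env

# Example:
print(with_nvidia_ld_library_path({"LD_LIBRARY_PATH": "/opt/conda/lib"}))
# {'LD_LIBRARY_PATH': '/usr/lib/x86_64-linux-gnu:/usr/lib64:/opt/conda/lib'}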

For example, when using the image mentioned in #425 (pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime), the container created from it will throw a WARNING message upon startup:

[screenshot: WARNING message printed at container startup]

By checking the entrypoint scripts, I found that the warning mainly arises because the library path mentioned above is absent from LD_LIBRARY_PATH (or because the ldconfig -p cache is not updated, but it is not common to run ldconfig when starting a container).

After the fix, it won't print the warning message anymore:

[screenshot: container startup without the WARNING message]


In my test, PyTorch can still access the GPU normally without this commit (it probably falls back to the default search locations). But I am not sure whether there are applications that mainly depend on LD_LIBRARY_PATH to detect the libraries. In any case, this at least helps avoid some warning messages, and it does not disrupt the search for shared libraries.

I apologize if I’ve been a bit verbose, but I truly want to solve this issue. Thank you for taking the time to review this, and I appreciate your effort.
