Add a GPU Device ID label to metrics #427

Open
wants to merge 1 commit into
base: main


@VariableExp0rt VariableExp0rt commented Jan 21, 2025

This PR adds a device label (the GPU's numeric identifier) to the metrics, similar to the one provided by the dcgm-exporter component, so that GPUs are no longer identified by UUID alone. The primary motivation is to reduce cardinality and to keep device numbers static even when machines are recycled (for example, when using cluster-autoscaler within Kubernetes).

Because comments in the code question the reliability of the device count returned by the DCGM API, the device number is taken from the loop over cuda_gpu_count. As a result, gpu_labels is moved out of the first loop over dcgm_gpu_count.
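For readers unfamiliar with the approach, here is a rough Python sketch of the idea described above. It is not the code in this PR: the function build_gpu_labels, the label keys, and the sample UUIDs are hypothetical, and nothing here calls DCGM or CUDA. Only the names cuda_gpu_count and gpu_labels are borrowed from the description.

```python
from typing import Dict, List


def build_gpu_labels(cuda_gpu_count: int, uuids: List[str]) -> List[Dict[str, str]]:
    """Hypothetical helper: build one label set per GPU.

    The device number is simply the index of the loop over cuda_gpu_count,
    so it does not depend on any count reported by another API.
    """
    gpu_labels = []
    for device in range(cuda_gpu_count):
        gpu_labels.append({"device": str(device), "uuid": uuids[device]})
    return gpu_labels


# Two made-up nodes: the second replaces the first after recycling,
# so its GPUs have different UUIDs but the same device numbers.
old_node = build_gpu_labels(2, ["GPU-aaaa", "GPU-bbbb"])
new_node = build_gpu_labels(2, ["GPU-cccc", "GPU-dddd"])

device_values = {labels["device"] for labels in old_node + new_node}
uuid_values = {labels["uuid"] for labels in old_node + new_node}
print(len(device_values), len(uuid_values))  # 2 4
```

In this toy case the device label takes two distinct values across both nodes while the UUID label takes four, which is the kind of stability the description is aiming for.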

Signed-off-by: liam.baker <liam.baker@sage.com>