Add a GPU Device ID label to metrics #427
Open
This PR adds support for the `device` (numeric identifier) metrics label, similar to what is provided by the `dcgm-exporter` component, rather than only the GPU UUIDs. The primary motivation is to reduce label cardinality and keep device numbers static even when machines are recycled (such as when using cluster-autoscaler within Kubernetes).

Since there were comments in the code about the reliability of the device count returned by the DCGM API, this uses the device number from the loop over `cuda_gpu_count`. Because of that, it moves `gpu_labels` out of the first loop over `dcgm_gpu_count`.
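
As a rough illustration of the idea (not the code in this PR), the sketch below builds one label set per GPU and takes the `device` value from the position in the CUDA enumeration rather than from a DCGM-reported count. The function name `build_gpu_labels`, the `uuid` key, and the example UUID strings are all hypothetical, and Python is used only for illustration.

```python
def build_gpu_labels(cuda_gpu_uuids):
    """Return one label dict per GPU (hypothetical helper, not from the PR).

    The "device" label is the index within the CUDA enumeration, so it does
    not depend on the device count reported by the DCGM API.
    """
    gpu_labels = []
    for device_index, uuid in enumerate(cuda_gpu_uuids):
        # The index is small and repeats across machines (0, 1, 2, ...),
        # whereas a UUID is unique to each physical GPU.
        gpu_labels.append({"device": str(device_index), "uuid": uuid})
    return gpu_labels


if __name__ == "__main__":
    # Placeholder UUIDs; a real exporter would query these from the driver.
    for labels in build_gpu_labels(["GPU-aaa", "GPU-bbb"]):
        print(labels)
```

Because the index restarts at 0 on every machine, a recycled node reuses the same `device` values, whereas every new GPU UUID adds a new label value.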