updated user-mode data collection and support for flux resource manager #133

Merged
merged 64 commits into from
Dec 12, 2024


koomie
Collaborator

@koomie koomie commented Dec 9, 2024

This PR includes two main additions:

  1. An updated user-mode data collection process that uses a "push" model with VictoriaMetrics as the back end, replacing the previous "pull" model with Prometheus. The existing data collector families are used as is, but a local polling loop now queries them and caches the results. At periodic intervals (5 minutes by default), the cached results are pushed to a VictoriaMetrics server running on the master compute node. The results can be queried the same way through the Prometheus-compatible endpoint that VictoriaMetrics provides, so no change is required for user-mode Grafana. With the cached push model, GPU telemetry metrics can be sampled at higher rates (e.g. at intervals down to roughly 10-50 milliseconds).

  2. Support for the Flux resource manager (in addition to SLURM). This affects the collector_rms data collector and enables job identification (including job steps) on systems running Flux. Note that Flux job IDs are not ordinal integers, so there is some corresponding impact on the Grafana dashboard configuration.
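
For readers unfamiliar with the cached push model in item 1, here is a minimal sketch of the idea, not the code in this PR. The class name `CachedPusher`, the `collect`/`push` callables, and the metric name used in the usage note are hypothetical. In a real deployment the `push` callable might POST the payload to a VictoriaMetrics import endpoint; that wiring is an assumption and is left out so the sketch runs without a server.

```python
import time
from typing import Callable, List, Tuple

# (metric name, value, timestamp in milliseconds)
Sample = Tuple[str, float, int]


class CachedPusher:
    """Poll a collector at a fast cadence, cache samples, push in batches."""

    def __init__(
        self,
        collect: Callable[[], List[Tuple[str, float]]],  # hypothetical collector hook
        push: Callable[[str], None],                      # hypothetical transport hook
        push_interval_s: float = 300.0,                   # 5 minutes, as in the description
    ) -> None:
        self.collect = collect
        self.push = push
        self.push_interval_s = push_interval_s
        self.cache: List[Sample] = []
        self.last_push = time.monotonic()

    def poll_once(self, now_ms: int) -> None:
        # Query the collector and cache each sample with its own timestamp.
        for name, value in self.collect():
            self.cache.append((name, value, now_ms))

    def flush(self) -> None:
        # Send everything cached so far as one text payload, then reset.
        if not self.cache:
            return
        payload = "\n".join(f"{n} {v} {ts}" for n, v, ts in self.cache) + "\n"
        self.push(payload)
        self.cache = []
        self.last_push = time.monotonic()

    def run(self, sample_interval_s: float, iterations: int) -> None:
        # Fast sampling loop; pushes happen only every push_interval_s.
        for _ in range(iterations):
            self.poll_once(int(time.time() * 1000))
            if time.monotonic() - self.last_push >= self.push_interval_s:
                self.flush()
            time.sleep(sample_interval_s)
        self.flush()  # push whatever remains at shutdown
```

Because each sample carries its own timestamp, the sampling interval can be much shorter than the push interval without losing resolution. For example, `CachedPusher(lambda: [("gpu_power_watts", 100.0)], print).run(0.05, 10)` samples every 50 ms and emits a single batch at the end.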

Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
@koomie koomie added the enhancement New feature or request label Dec 9, 2024
@koomie koomie added this to the 1.1 milestone Dec 9, 2024
koomie added 25 commits December 9, 2024 12:12
Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
check during metrics push

Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
logging output to a file with hostname prepended

Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
victoria metrics back-end. Current enablement hard-coded via
victoriaMode=True setting in main().

Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
… port

Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
command-line setting

Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
…tepFile

Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
annotation file

Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
execution

Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
victoriametrics during a job; threading added to support flask
endpoint that can be used to terminate the data collector and push
final data (previous file-based termination removed).

Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
"shutdown" endpoint; restrict max number of go processes for
victoriametrics server

Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
getMetrics() method to take timestamp in millisecs directly

Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
(deep)copied and shipped to a separate thread to push the data. This
minimizes blocking of main polling loop.
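
The copy-and-hand-off pattern this commit message describes can be sketched as follows. This is an illustration under assumptions, not the PR's code: the function name `push_in_background` and the shape of the cache (a list of dicts) are hypothetical.

```python
import copy
import threading
from typing import Callable, List


def push_in_background(
    cache: List[dict], push: Callable[[List[dict]], None]
) -> threading.Thread:
    """Snapshot the cache and push the snapshot from a separate thread."""
    # Deep copy so the worker owns its data and the polling loop can keep
    # mutating the live cache without racing against the push.
    snapshot = copy.deepcopy(cache)
    # Empty the live cache in place; new samples accumulate for the next push.
    cache.clear()
    worker = threading.Thread(target=push, args=(snapshot,), daemon=True)
    worker.start()
    return worker
```

The polling loop pays only for the copy; the potentially slow network push runs on the worker thread. A caller that needs the final push to complete, for instance at shutdown, can call `join()` on the returned thread.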

Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
victoriametrics; remove unused remotewrite configuration for
prometheus

Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
victoriametrics

Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
names for victoriametrics settings

Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
checking for victoriametrics path; tweak shutdown timeout for exporter
when using victoriametrics

Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
victoriametrics examples

Signed-off-by: Karl W Schulz <karl.schulz@amd.com>
koomie added 11 commits December 9, 2024 12:13
Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
time in the Job Step panel; additional queries and transformations
added to sort by the job step time (since we cannot assume the job
step is an ordinal number, thank you flux). Enabled missing legend in
the Average GPU Power panel.
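
The dashboard change above is done with Grafana queries and transformations; the underlying idea of ordering job steps by time rather than by identifier can be sketched in Python. The function name, the step identifiers, and the timestamps below are hypothetical and chosen only to show that identifier order and time order can disagree.

```python
from typing import Dict, List


def order_steps_by_time(first_seen_ms: Dict[str, int]) -> List[str]:
    """Return job step identifiers ordered by when each step was first seen."""
    # Sort on the timestamp, not on the identifier, since the identifier
    # is not assumed to be an ordinal number.
    return sorted(first_seen_ms, key=lambda step: first_seen_ms[step])


# Hypothetical step identifiers with first-seen timestamps in milliseconds.
steps = {"f9x": 1000, "f2a": 2000}
by_time = order_steps_by_time(steps)
by_name = sorted(steps)
```

Here `by_time` places `f9x` first because it started earlier, whereas sorting on the identifier alone would place `f2a` first.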

Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
using prometheus)

Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
koomie added 15 commits December 9, 2024 17:14
example, account for binary name in ubuntu)

Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
push-based using victoria

Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
with victoria

Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
@koomie koomie merged commit cf49e2f into main Dec 12, 2024
9 checks passed
2 participants