Commit

Support recipes and scheduler in Hyperpod CLI (#41)
* add recipes feature for distributed training

* improve unit test coverage for recipes feature

* add support recipes along with command line args

* add recipes

* Crescendo helm chart for role and rolebinding (#17)

* update the helm chart to create team level roles and bindings

* revert unrelated changes

* Rename quotaAllocationTarget to computeQuotaTarget

* remove kueue related resources from helm chart

* Remove parameters of kueue from chart

* flip the team role creation to false

* Revise readme to add instructions to create the role and binding

* add changelog for distributed training

* change to public submodules

* QuotaAllocation support for Hyperpod CLI (#12)

* QuotaAllocation support for Hyperpod CLI

---------

Co-authored-by: Amazon GitHub Automation <54958958+amazon-auto@users.noreply.github.com>
Co-authored-by: Song Jiang <jiangsongbz@gmail.com>
Co-authored-by: Baiyang Li <baiyanl@amazon.com>
Co-authored-by: baiyli <105086653+baiyli@users.noreply.github.com>

* Remove custom_launcher folder

* sync with mainline

---------

Co-authored-by: cansun <80425164+can-sun@users.noreply.github.com>
Co-authored-by: Amazon GitHub Automation <54958958+amazon-auto@users.noreply.github.com>
Co-authored-by: Song Jiang <jiangsongbz@gmail.com>
Co-authored-by: Baiyang Li <baiyanl@amazon.com>
Co-authored-by: baiyli <105086653+baiyli@users.noreply.github.com>
Co-authored-by: Can Sun <sucan@amazon.com>
7 people authored Dec 4, 2024
1 parent 87f5660 commit bb25aed
Showing 69 changed files with 3,475 additions and 2,392 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -17,3 +17,7 @@ __pycache__/

/doc/_apidoc/
/build

# Ignore all contents of result and results directories
/result/
/results/
7 changes: 3 additions & 4 deletions .gitmodules
@@ -1,4 +1,3 @@
[submodule "src/hyperpod_cli/custom_launcher/launcher/nemo/nemo_framework_launcher"]
path = src/hyperpod_cli/custom_launcher/launcher/nemo/nemo_framework_launcher
url = https://github.com/NVIDIA/NeMo-Framework-Launcher.git
branch = 3d41c31
[submodule "src/hyperpod_cli/sagemaker_hyperpod_recipes"]
path = src/hyperpod_cli/sagemaker_hyperpod_recipes
url = https://github.com/aws/sagemaker-hyperpod-recipes.git
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,11 @@
# Changelog

## v2.0.0 (2024-12-04)

### Features

- feature: The HyperPod CLI now supports [HyperPod recipes](https://github.com/aws/sagemaker-hyperpod-recipes.git). HyperPod recipes enable customers to get started training and fine-tuning popular publicly available foundation models like Llama 3.1 405B in minutes. Learn more [here](https://github.com/aws/sagemaker-hyperpod-recipes.git).

## v1.0.0 (2024-09-09)

### Features
46 changes: 33 additions & 13 deletions README.md
@@ -24,7 +24,7 @@ This documentation serves as a reference for the available HyperPod CLI commands

## Overview

The SageMaker HyperPod CLI is a tool that helps submit training jobs to the Amazon SageMaker HyperPod clusters orchestrated by Amazon EKS. It provides a set of commands for managing the full lifecycle of training jobs, including submitting, describing, listing, and canceling jobs, as well as accessing logs and executing commands within the job's containers. The CLI is designed to abstract away the complexity of working directly with Kubernetes for these core actions of managing jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS.
The SageMaker HyperPod CLI is a tool that helps submit training jobs to the Amazon SageMaker HyperPod clusters orchestrated by Amazon EKS. It provides a set of commands for managing the full lifecycle of training jobs, including submitting, describing, listing, patching and canceling jobs, as well as accessing logs and executing commands within the job's containers. The CLI is designed to abstract away the complexity of working directly with Kubernetes for these core actions of managing jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS.

## Prerequisites

@@ -76,6 +76,10 @@ SageMaker HyperPod CLI currently supports start training job with:
```
hyperpod get-clusters
```
- List your HyperPod clusters along with their capacity and the quota allocation info for a team.
```
hyperpod get-clusters -n hyperpod-ns-<team-name>
```
- Connect to one HyperPod cluster and specify a namespace you have access to.
```
hyperpod connect-cluster --cluster-name <cluster-name>
@@ -104,11 +108,12 @@ The HyperPod CLI provides the following commands:
This command lists the available SageMaker HyperPod clusters and their capacity information.
```
hyperpod get-clusters [--region <region>] [--clusters <cluster1,cluster2>] [--orchestrator <eks>] [--output <json|table>]
hyperpod get-clusters [--region <region>] [--clusters <cluster1,cluster2>] [--namespace <namespace>] [--orchestrator <eks>] [--output <json|table>]
```
* `region` (string) - Optional. The region that the SageMaker HyperPod and EKS clusters are located. If not specified, it will be set to the region from the current AWS account credentials.
* `clusters` (list[string]) - Optional. A list of SageMaker HyperPod cluster names that users want to check the capacity for. This is useful for users who know some of their most commonly used clusters and want to check the capacity status of the clusters in the AWS account.
* `namespace` (string) - Optional. The namespace whose quota you want to check. Only SageMaker-managed namespaces are supported.
* `orchestrator` (enum) - Optional. The orchestrator type for the cluster. Currently, `'eks'` is the only available option.
* `output` (enum) - Optional. The output format. Available values are `table` and `json`. The default value is `json`.
@@ -122,19 +127,19 @@ hyperpod connect-cluster --cluster-name <cluster-name> [--region <region>] [--na
* `cluster-name` (string) - Required. The SageMaker HyperPod cluster name to configure with.
* `region` (string) - Optional. The region that the SageMaker HyperPod and EKS clusters are located. If not specified, it will be set to the region from the current AWS account credentials.
* `namespace` (string) - Optional. The namespace that you want to connect to. If not specified, this command uses the [Kubernetes namespace](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/) of the Amazon EKS cluster associated with the SageMaker HyperPod cluster in your AWS account.
* `namespace` (string) - Optional. The namespace that you want to connect to. If not specified, HyperPod CLI commands will auto-discover the accessible namespace.
### Submitting a Job
This command submits a new training job to the connected SageMaker HyperPod cluster.
```
hyperpod start-job --job-name <job-name> [--namespace <namespace>] [--job-kind <kubeflow/PyTorchJob>] [--image <image>] [--command <command>] [--entry-script <script>] [--script-args <arg1 arg2>] [--environment <key=value>] [--pull-policy <Always|IfNotPresent|Never>] [--instance-type <instance-type>] [--node-count <count>] [--tasks-per-node <count>] [--label-selector <key=value>] [--deep-health-check-passed-nodes-only] [--scheduler-type <Kueue>] [--queue-name <queue-name>] [--priority <priority>] [--auto-resume] [--max-retry <count>] [--restart-policy <Always|OnFailure|Never|ExitCode>] [--volumes <volume1,volume2>] [--persistent-volume-claims <claim1:/mount/path,claim2:/mount/path>] [--results-dir <dir>] [--service-account-name <account>]
hyperpod start-job --job-name <job-name> [--namespace <namespace>] [--job-kind <kubeflow/PyTorchJob>] [--image <image>] [--command <command>] [--entry-script <script>] [--script-args <arg1 arg2>] [--environment <key=value>] [--pull-policy <Always|IfNotPresent|Never>] [--instance-type <instance-type>] [--node-count <count>] [--tasks-per-node <count>] [--label-selector <key=value>] [--deep-health-check-passed-nodes-only] [--scheduler-type <Kueue|SageMaker|None>] [--queue-name <queue-name>] [--priority <priority>] [--auto-resume] [--max-retry <count>] [--restart-policy <Always|OnFailure|Never|ExitCode>] [--volumes <volume1,volume2>] [--persistent-volume-claims <claim1:/mount/path,claim2:/mount/path>] [--results-dir <dir>] [--service-account-name <account>]
```
* `job-name` (string) - Required. The name of the job.
* `job-kind` (string) - Optional. The training job kind. The job type currently supported is `kubeflow/PyTorchJob`.
* `namespace` (string) - Optional. The namespace to use. If not specified, this command uses the [Kubernetes namespace](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/) of the Amazon EKS cluster associated with the SageMaker HyperPod cluster in your AWS account.
* `namespace` (string) - Optional. The namespace to use. If not specified, this command uses the namespace that was set when connecting to the cluster. If no namespace was set at that point, a namespace managed by SageMaker is auto-discovered.
* `image` (string) - Required. The image used when creating the training job.
* `pull-policy` (enum) - Optional. The policy to pull the container image. Valid values are `Always`, `IfNotPresent`, and `Never`, as available from the PyTorchJob. The default is `Always`.
* `command` (string) - Optional. The command to run the entrypoint script. Currently, only `torchrun` is supported.
@@ -146,7 +151,7 @@ hyperpod start-job --job-name <job-name> [--namespace <namespace>] [--job-kind <
* `tasks-per-node` (int) - Optional. The number of devices to use per instance.
* `label-selector` (dict[string, list[string]]) - Optional. A dictionary of labels and their values that will override the predefined node selection rules based on the SageMaker HyperPod `node-health-status` label and values. If users provide this field, the CLI will launch the job with this customized label selection.
* `deep-health-check-passed-nodes-only` (bool) - Optional. If set to `true`, the job will be launched only on nodes that have the `deep-health-check-status` label with the value `passed`.
* `scheduler-type` (enum) - Optional. The scheduler type to use. Currently, only `Kueue` is supported.
* `scheduler-type` (enum) - Optional. The scheduler type to use: `SageMaker`, `Kueue`, or `None`. The default value is `SageMaker`.
* `queue-name` (string) - Optional. The name of the queue to submit the job to, which is created by the cluster admin users in your AWS account.
* `priority` (string) - Optional. The priority for the job, which needs to be created by the cluster admin users and match the name in the cluster.
* `auto-resume` (bool) - Optional. The flag to enable HyperPod resilience job auto resume. If set to `true`, the job will automatically resume after pod or node failure. To enable `auto-resume`, you also should set `restart-policy` to `OnFailure`.
@@ -167,7 +172,7 @@ hyperpod get-job --job-name <job-name> [--namespace <namespace>] [--verbose]
```
* `job-name` (string) - Required. The name of the job.
* `namespace` (string) - Optional. The namespace to describe the job in. If not provided, the CLI will try to describe the job in the namespace set by the user while connecting to the cluster. If provided, and the user has access to the namespace, the CLI will describe the job from the specified namespace.
* `namespace` (string) - Optional. The namespace to use. If not specified, this command uses the namespace that was set when connecting to the cluster. If no namespace was set at that point, a namespace managed by SageMaker is auto-discovered.
* `verbose` (flag) - Optional. If set to `True`, the command enables verbose mode and prints out more detailed output with additional fields.
### Listing Jobs
@@ -178,7 +183,7 @@ This command lists all the training jobs in the connected SageMaker HyperPod clu
hyperpod list-jobs [--namespace <namespace>] [--all-namespaces] [--selector <key=value>]
```
* `namespace` (string) - Optional. The namespace to list the jobs in. If not provided, this command lists the jobs in the namespace specified during connecting to the cluster. If the namespace is provided and if the user has access to the namespace, this command lists the jobs from the specified namespace.
* `namespace` (string) - Optional. The namespace to use. If not specified, this command uses the namespace that was set when connecting to the cluster. If no namespace was set at that point, a namespace managed by SageMaker is auto-discovered.
* `all-namespaces` (flag) - Optional. If set, this command lists jobs from all namespaces the data scientist users have access to. The namespace in the current AWS account credentials will be ignored, even if specified with the `--namespace` option.
* `selector` (string) - Optional. A label selector to filter the listed jobs. The selector supports the '=', '==', and '!=' operators (e.g., `-l key1=value1,key2=value2`).
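The selector syntax described above (comma-separated terms using `=`, `==`, or `!=`) can be illustrated with a small parser. This is only a sketch of the syntax, not the CLI's own code: `parse_selector` is a hypothetical name, and the CLI presumably hands the selector string to Kubernetes rather than parsing it this way.

```python
from typing import List, Tuple


def parse_selector(selector: str) -> List[Tuple[str, str, str]]:
    """Split 'k1=v1,k2!=v2' into (key, operator, value) tuples.

    Hypothetical helper for illustration only; not part of the HyperPod CLI.
    """
    requirements = []
    for term in selector.split(","):
        term = term.strip()
        if not term:
            continue
        # Check the two-character operators before the bare '='.
        for op in ("!=", "==", "="):
            key, found, value = term.partition(op)
            if found:
                requirements.append((key.strip(), op, value.strip()))
                break
        else:
            raise ValueError(f"unsupported selector term: {term!r}")
    return requirements
```

Checking `!=` and `==` before `=` matters: otherwise `key!=value` would be split at the bare `=` and yield the key `key!`.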
@@ -191,7 +196,7 @@ hyperpod cancel-job --job-name <job-name> [--namespace <namespace>]
```
* `job-name` (string) - Required. The name of the job to cancel.
* `namespace` (string) - Optional. The namespace to cancel the job in. If not provided, the CLI will try to cancel the job in the namespace set by the user while connecting to the cluster. If provided, and the user has access to the namespace, the CLI will cancel the job from the specified namespace.
* `namespace` (string) - Optional. The namespace to use. If not specified, this command uses the namespace that was set when connecting to the cluster. If no namespace was set at that point, a namespace managed by SageMaker is auto-discovered.
### Listing Pods
@@ -202,7 +207,7 @@ hyperpod list-pods --job-name <job-name> [--namespace <namespace>]
```
* `job-name` (string) - Required. The name of the job to list pods for.
* `namespace` (string) - Optional. The namespace to list the pods in. If not provided, the CLI will list the pods in the namespace set by the user while connecting to the cluster. If provided, and the user has access to the namespace, the CLI will list the pods from the specified namespace.
* `namespace` (string) - Optional. The namespace to use. If not specified, this command uses the namespace that was set when connecting to the cluster. If no namespace was set at that point, a namespace managed by SageMaker is auto-discovered.
### Accessing Logs
@@ -214,7 +219,7 @@ hyperpod get-log --job-name <job-name> --pod <pod-name> [--namespace <namespace>
* `job-name` (string) - Required. The name of the job to get the log for.
* `pod` (string) - Required. The name of the pod to get the log from.
* `namespace` (string) - Optional. The namespace to get the log from. If not provided, the CLI will get the log from the pod in the namespace set by the user while connecting to the cluster. If provided, and the user has access to the namespace, the CLI will get the log from the pod in the specified namespace.
* `namespace` (string) - Optional. The namespace to use. If not specified, this command uses the namespace that was set when connecting to the cluster. If no namespace was set at that point, a namespace managed by SageMaker is auto-discovered.
### Executing Commands
@@ -226,6 +231,21 @@ hyperpod exec --job-name <job-name> [-p <pod-name>] [--all-pods] -- <command>
* `job-name` (string) - Required. The name of the job to execute the command within the container of a pod associated with a training job.
* `bash-command` (string) - Required. The bash command(s) to run.
* `namespace` (string) - Optional. The namespace to execute the command in. If not provided, the CLI will try to execute the command in the pod in the namespace set by the user while connecting to the cluster. If provided, and the user has access to the namespace, the CLI will execute the command in the pod from the specified namespace.
* `namespace` (string) - Optional. The namespace to use. If not specified, this command uses the namespace that was set when connecting to the cluster. If no namespace was set at that point, a namespace managed by SageMaker is auto-discovered.
* `pod` (string) - Optional. The name of the pod to execute the command in. You must provide either `--pod` or `--all-pods`.
* `all-pods` (flag) - Optional. If set, the command will be executed in all pods associated with the job.
* `all-pods` (flag) - Optional. If set, the command will be executed in all pods associated with the job.
### Patch Jobs
This command patches a job with a given operation. Currently, only `suspend` and `unsuspend` are supported.
```
hyperpod patch-job suspend --job-name <job-name> [--namespace <namespace>]
```
```
hyperpod patch-job unsuspend --job-name <job-name> [--namespace <namespace>]
```
* `job-name` (string) - Required. The name of the job to be patched.
* `namespace` (string) - Optional. The namespace to use. If not specified, this command uses the namespace that was set when connecting to the cluster. If no namespace was set at that point, a namespace managed by SageMaker is auto-discovered.
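The namespace fallback order that the updated `namespace` descriptions share can be sketched as follows. This is a minimal illustration under assumptions, not the CLI's implementation: `resolve_namespace` and its `discover` callback are hypothetical names, and how the CLI actually discovers a SageMaker-managed namespace is not shown in this commit excerpt.

```python
from typing import Callable, Optional


def resolve_namespace(
    cli_namespace: Optional[str],
    connected_namespace: Optional[str],
    discover: Callable[[], Optional[str]],
) -> str:
    """Pick the namespace a command should use (hypothetical helper).

    Order, as described in the README text:
      1. the value passed with --namespace,
      2. the namespace set when connecting to the cluster,
      3. an auto-discovered SageMaker-managed namespace.
    """
    if cli_namespace:
        return cli_namespace
    if connected_namespace:
        return connected_namespace
    discovered = discover()  # assumption: returns None when nothing is accessible
    if discovered is None:
        raise ValueError("no accessible SageMaker-managed namespace found")
    return discovered
```

For example, `resolve_namespace(None, "kubeflow", lambda: "hyperpod-ns-team-a")` returns `"kubeflow"`, because the connection-time namespace takes precedence over discovery.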
8 changes: 7 additions & 1 deletion examples/basic-job-example-config.yaml
@@ -64,8 +64,10 @@ cluster:
# Mapping to '--namespace' argument in 'start-job' command.
namespace: kubeflow
# custom_labels: Optional. Used to specify the name of the queue, which is created by the cluster admin users.
# The priority class label is mapped to '--priority' argument in 'start-job' command if your scheduler type is 'SageMaker'.
# custom_labels:
# kueue.x-k8s.io/queue-name: low-priority-queue2
# kueue.x-k8s.io/priority-class: sample-priority
custom_labels: null
# priority_class_name: Optional. The priority for the job, which is created by the cluster admin users.
# Mapping to '--priority' argument in 'start-job' command.
@@ -96,7 +98,11 @@ cluster:
# To use SageMaker Hyperpod AutoResume functionality, please set it to OnFailure.
# Mapping to '--restart-policy' argument in 'start-job' command.
restartPolicy: OnFailure

# scheduler_type: Optional. Selects which type of scheduler to use. The default value is 'SageMaker', which
# schedules the job only on queues created via SageMaker. Another valid value is 'Kueue'; with this option,
# the queue name and namespace have to be filled out manually.
# scheduler_type: Kueue
scheduler_type: SageMaker
# base_results_dir: Optional. Location to store the results, checkpoints and logs.
# Mapping to '--results-dir' argument in 'start-job' command.
base_results_dir: ./result
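The `scheduler_type` rules stated in this commit (allowed values `SageMaker`, `Kueue`, and `None`, a default of `SageMaker`, and a manually supplied queue name and namespace for `Kueue`) can be sketched as a small validation function. `check_scheduler_config` is a hypothetical name used only for illustration; the CLI's real validation may differ.

```python
from typing import Optional

# Values listed in the README for --scheduler-type.
VALID_SCHEDULER_TYPES = ("SageMaker", "Kueue", "None")


def check_scheduler_config(
    scheduler_type: str = "SageMaker",
    queue_name: Optional[str] = None,
    namespace: Optional[str] = None,
) -> str:
    """Validate a scheduler choice (hypothetical helper, not CLI code)."""
    if scheduler_type not in VALID_SCHEDULER_TYPES:
        raise ValueError(
            f"scheduler_type must be one of {VALID_SCHEDULER_TYPES}, got {scheduler_type!r}"
        )
    # Per the example config comment, 'Kueue' needs both values filled in by hand.
    if scheduler_type == "Kueue" and not (queue_name and namespace):
        raise ValueError("'Kueue' requires both a queue name and a namespace")
    return scheduler_type
```

Calling `check_scheduler_config()` with no arguments returns the default `"SageMaker"`, while `check_scheduler_config("Kueue")` raises because neither a queue name nor a namespace is given.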
10 changes: 5 additions & 5 deletions helm_chart/HyperPodHelmChart/Chart.yaml
@@ -42,11 +42,7 @@ dependencies:
- name: neuron-device-plugin
version: "0.1.0"
repository: "file://charts/neuron-device-plugin"
condition: neuron-device-plugin.devicePlugin.enabled
- name: kueue
version: "0.1.0"
repository: "file://charts/kueue"
condition: kueue.enabled
condition: neuron-device-plugin.devicePlugin.enabled
- name: storage
version: "0.1.0"
repository: "file://charts/storage"
@@ -75,3 +71,7 @@ dependencies:
version: "0.1.0"
repository: "file://charts/namespaced-role-and-bindings"
condition: namespaced-role-and-bindings.enabled
- name: team-role-and-bindings
version: "0.1.0"
repository: "file://charts/team-role-and-bindings"
condition: team-role-and-bindings.enabled
7 changes: 0 additions & 7 deletions helm_chart/HyperPodHelmChart/charts/kueue/Chart.yaml

This file was deleted.

This file was deleted.

17 changes: 0 additions & 17 deletions helm_chart/HyperPodHelmChart/charts/kueue/templates/queue.yaml

This file was deleted.

21 changes: 0 additions & 21 deletions helm_chart/HyperPodHelmChart/charts/kueue/values.yaml

This file was deleted.

@@ -0,0 +1,5 @@
apiVersion: v2
name: team-role-and-bindings
version: 0.1.0
appVersion: 1.0
description: This chart installs the namespaced and cluster roles and bindings for team members