Recipe development #38

Closed · wants to merge 2 commits

xiaoxshe (Contributor)

PR Approval Steps

What's this PR about?

  1. This PR adds the recipes feature to the HyperPod CLI.
  2. Usage
The CX (customer experience) is as follows:
hyperpod start-job --recipe <recipe_name>
Where:
<recipe_name> is the name of the recipe (e.g. fine-tuning/llama/hf_llama3_405b_seq131072_gpu_qlora)
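For example, with the recipe name above:

hyperpod start-job --recipe fine-tuning/llama/hf_llama3_405b_seq131072_gpu_qlora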

You can override any parameters in the recipe (see Recipes Parameters or the Recipe Cluster Config File):

hyperpod start-job --recipe <recipe_name> --override-parameters \
trainer.max_steps=2000 \
cluster.instance_type="ml.g5.8xlarge"
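
Note that the tested invocation under Testing below prefixes override keys with recipes.; a single-line example in that form (the max_steps value here is illustrative):

hyperpod start-job --recipe training/mistral/hf_mistral_gpu --override-parameters recipes.trainer.max_steps=2000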

A custom recipe can be passed by its absolute path:

hyperpod start-job --recipe <absolute path to recipe.yaml>

To modify recipes directly, edit the recipe YAML files under the launcher/recipes folder, then use the command above to launch. A minimal sketch of a custom recipe follows.
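
For illustration, a minimal end-to-end sketch: write a custom recipe and launch it by absolute path. Field names are inferred from the resolved Mistral config under Testing below; all values here are hypothetical.

# Hypothetical sketch: minimal custom recipe, launched by absolute path
cat > /tmp/custom_recipe.yaml <<'EOF'
run:
  name: my-custom-run
  results_dir: /tmp/results
  time_limit: 6-00:00:00
  model_type: hf
trainer:
  devices: 8
  num_nodes: 1
  accelerator: gpu
  precision: bf16
  max_steps: 50
EOF
hyperpod start-job --recipe /tmp/custom_recipe.yaml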

Testing

Help menu

hyperpod start-job --help

  --recipe TEXT                   Optional. Recipe which accelerates
                                  distributed training jobs. Current supported
                                  recipes are as follows:

                                  fine-tuning/llama/hf_llama3_8b_seq8192_gpu
                                  fine-tuning/llama/hf_llama3_8b_seq8192_gpu_lora
                                  fine-tuning/llama/hf_llama3_70b_seq8192_gpu_lora
                                  fine-tuning/llama/hf_llama3_405b_seq8192_gpu_qlora
                                  fine-tuning/llama/hf_llama3_405b_seq131072_gpu_qlora
                                  training/llama/hf_llama3_7B_config_trainium
                                  training/llama/hf_llama3_8b_seq8192_gpu
                                  training/llama/llama2_7b_nemo
                                  training/llama/megatron_llama_7B_config
                                  training/mistral/hf_mistral_gpu
                                  training/mixtral/hf_mixtral_gpu

Validation

If the recipe name doesn't match or can't be found, the CLI throws a validation error:

hyperpod start-job --recipe fine-tuning/llama/hf_llama3_8b_seq8192_gpu
2024-10-21 09:04:28 - hyperpod_cli.validators.job_validator - ERROR - Recipe file 'fine-tuning/llama/hf_llama3_8b_seq8192_gpu.yaml' not found in ./src/hyperpod_cli/private_sagemaker_training_launcher/recipes_collection/recipes
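
The lookup implied by this error can be reproduced by hand; a minimal sketch (recipes directory taken from the log above):

# Hypothetical sketch: the existence check implied by the error message
RECIPES_DIR=./src/hyperpod_cli/private_sagemaker_training_launcher/recipes_collection/recipes
test -f "$RECIPES_DIR/fine-tuning/llama/hf_llama3_8b_seq8192_gpu.yaml" \
  || echo "Recipe file not found in $RECIPES_DIR"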

Job submission

  1. For a recipe:

Command

hyperpod start-job --recipe training/mistral/hf_mistral_gpu

Result (launcher_cmd.log):

./src/hyperpod_cli/private_sagemaker_training_launcher/main.py \
  recipes=training/mistral/hf_mistral_gpu \
  cluster=k8s \
  cluster_type=k8s \
  base_results_dir=/Users/xiaoxshe/Documents/GitHub/sagemaker-hyperpod-cli/results

Mistral recipe YAML:
https://paste.amazon.com/show/xiaoxshe/1729547671

Override parameters

hyperpod start-job --recipe training/mistral/hf_mistral_gpu --override-parameters recipes.trainer.num_nodes=1

Result (launcher command, followed by the resolved recipe config; note that trainer.num_nodes reflects the override):

./src/hyperpod_cli/private_sagemaker_training_launcher/main.py \
  recipes=training/mistral/hf_mistral_gpu \
  cluster=k8s \
  cluster_type=k8s \
  base_results_dir=/Users/xiaoxshe/Documents/GitHub/sagemaker-hyperpod-cli/results \
  recipes.trainer.num_nodes=1

run:
  name: mistral
  results_dir: /Users/xiaoxshe/Documents/GitHub/sagemaker-hyperpod-cli/results/mistral
  time_limit: 6-00:00:00
  model_type: hf
trainer:
  devices: 8
  num_nodes: 1
  accelerator: gpu
  precision: bf16
  max_steps: 50
  log_every_n_steps: 1
  val_check_interval: -1
  limit_val_batches: 0

  2. For a custom script:
hyperpod start-job --config-file examples/basic-job-example-config.yaml

Result

image:
  trainingImage: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-bc09cfd
  pullPolicy: IfNotPresent
trainingConfig:
  jobName: hyperpod-cli-test
  namespace: kubeflow
  scriptPath: /opt/pytorch-mnist/mnist.py
  scriptArgs: ''
  customScript: true
  annotations: null
  customLabels: null
  priority_class_name: null
  device: cpu
  numEFADevices: 0
  numNeuronDevices: null
  ntasksPerNode: 1
  nodes: 2
  restartPolicy: OnFailure
  wandbKey: nil
  serviceAccountName: null
  compile: 0
  persistentVolumeClaims: null

Command

hyperpod start-job --debug --job-name hyperpod-cli-test-3 \
  --job-kind kubeflow/PyTorchJob \
  --image docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-bc09cfd \
  --entry-script /opt/pytorch-mnist/mnist.py \
  --pull-policy IfNotPresent \
  --instance-type ml.g5.2xlarge \
  --node-count 1 --tasks-per-node 1

Result

image:
  trainingImage: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-bc09cfd
  pullPolicy: IfNotPresent
trainingConfig:
  jobName: hyperpod-cli-test-3
  namespace: default
  scriptPath: /opt/pytorch-mnist/mnist.py
  scriptArgs: ''
  customScript: true
  annotations: null
  customLabels: null
  priority_class_name: null
  device: gpu
  numEFADevices: 0
  numNeuronDevices: null
  ntasksPerNode: 1

Alignment

  1. Change all usage of the HyperPod launcher to a git submodule (see the git commands after this list).
  2. pip install replicates all recipe files locally.
  3. Both the custom-script path and the recipe path use main.py to submit jobs.
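
Because the launcher becomes a git submodule, a fresh clone needs the submodule initialized before recipes resolve. Standard git commands; the submodule path is inferred from the launcher path in the commands above:

# Assumption: the launcher submodule lives at the path seen in the launcher commands above
git submodule update --init --recursive
ls src/hyperpod_cli/private_sagemaker_training_launcher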

Pending action item

  1. Replace the private launcher repo with the HyperPod recipes repo on launch day.

For Reviewer

  1. Go through the For Requester section to double-check each item.
  2. Request changes or approve the PR:
    1. If the PR is ready to be merged, click Review changes and select Approve.
    2. If changes are required, select Request changes and provide feedback. Be constructive and clear in your feedback.
  3. Merging the PR:
    1. Check the merge method: decide on the appropriate method based on your repository's guidelines (e.g., Squash and merge, Rebase and merge, or Merge).
    2. Merge the PR: click the Merge pull request button, then confirm by clicking Confirm merge.
