Recipe development #38

Closed · wants to merge 2 commits

xiaoxshe (Contributor)

PR Approval Steps

What's this PR about?

  1. This PR adds the recipes feature to the HyperPod CLI.
  2. Usage
The CX (customer experience) is as follows:
hyperpod start-job --recipe <recipe_name>
Where:
<recipe_name> is the name of the recipe (e.g. fine-tuning/llama/hf_llama3_405b_seq131072_gpu_qlora)
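For example, with the recipe name above:

hyperpod start-job --recipe fine-tuning/llama/hf_llama3_405b_seq131072_gpu_qlora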

You can override any parameters in the recipe (see Recipes Parameters or the Recipe Cluster Config File):

hyperpod start-job --recipe <recipe_name> --override-parameters \
trainer.max_steps=2000 \
cluster.instance_type="ml.g5.8xlarge"
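
Note that the tested invocation under Testing below prefixes override keys with recipes.; a single-line example in that form (the max_steps value here is illustrative):

hyperpod start-job --recipe training/mistral/hf_mistral_gpu --override-parameters recipes.trainer.max_steps=2000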

A custom recipe can be passed by its absolute path:

hyperpod start-job --recipe <absolute path to recipe.yaml>

To modify recipes directly, edit the recipe YAML files under the launcher/recipes folder, then use the command above to launch. A minimal sketch of a custom recipe follows.
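
For illustration, a minimal end-to-end sketch: write a custom recipe and launch it by absolute path. Field names are inferred from the resolved Mistral config under Testing below; all values here are hypothetical.

# Hypothetical sketch: minimal custom recipe, launched by absolute path
cat > /tmp/custom_recipe.yaml <<'EOF'
run:
  name: my-custom-run
  results_dir: /tmp/results
  time_limit: 6-00:00:00
  model_type: hf
trainer:
  devices: 8
  num_nodes: 1
  accelerator: gpu
  precision: bf16
  max_steps: 50
EOF
hyperpod start-job --recipe /tmp/custom_recipe.yaml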

Testing

Help menu

hyperpod start-job --help

  --recipe TEXT                   Optional. Recipe which accelerates
                                  distributed training jobs. Current supported
                                  recipes are as follows:

                                  fine-tuning/llama/hf_llama3_8b_seq8192_gpu
                                  fine-tuning/llama/hf_llama3_8b_seq8192_gpu_lora
                                  fine-tuning/llama/hf_llama3_70b_seq8192_gpu_lora
                                  fine-tuning/llama/hf_llama3_405b_seq8192_gpu_qlora
                                  fine-tuning/llama/hf_llama3_405b_seq131072_gpu_qlora
                                  training/llama/hf_llama3_7B_config_trainium
                                  training/llama/hf_llama3_8b_seq8192_gpu
                                  training/llama/llama2_7b_nemo
                                  training/llama/megatron_llama_7B_config
                                  training/mistral/hf_mistral_gpu
                                  training/mixtral/hf_mixtral_gpu

Validation

If the recipe name doesn't match or can't be found, the CLI throws a validation error:

hyperpod start-job --recipe fine-tuning/llama/hf_llama3_8b_seq8192_gpu
2024-10-21 09:04:28 - hyperpod_cli.validators.job_validator - ERROR - Recipe file 'fine-tuning/llama/hf_llama3_8b_seq8192_gpu.yaml' not found in ./src/hyperpod_cli/private_sagemaker_training_launcher/recipes_collection/recipes
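
The lookup implied by this error can be reproduced by hand; a minimal sketch (recipes directory taken from the log above):

# Hypothetical sketch: the existence check implied by the error message
RECIPES_DIR=./src/hyperpod_cli/private_sagemaker_training_launcher/recipes_collection/recipes
test -f "$RECIPES_DIR/fine-tuning/llama/hf_llama3_8b_seq8192_gpu.yaml" \
  || echo "Recipe file not found in $RECIPES_DIR"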

Job submission

  1. For a recipe:

Command

hyperpod start-job --recipe training/mistral/hf_mistral_gpu

Result (launcher_cmd.log):

./src/hyperpod_cli/private_sagemaker_training_launcher/main.py \
  recipes=training/mistral/hf_mistral_gpu \
  cluster=k8s \
  cluster_type=k8s \
  base_results_dir=/Users/xiaoxshe/Documents/GitHub/sagemaker-hyperpod-cli/results

Mistral recipe YAML:
https://paste.amazon.com/show/xiaoxshe/1729547671

Override parameters

hyperpod start-job --recipe training/mistral/hf_mistral_gpu --override-parameters recipes.trainer.num_nodes=1

Result (launcher command, followed by the resolved recipe config; note that trainer.num_nodes reflects the override):

./src/hyperpod_cli/private_sagemaker_training_launcher/main.py \
  recipes=training/mistral/hf_mistral_gpu \
  cluster=k8s \
  cluster_type=k8s \
  base_results_dir=/Users/xiaoxshe/Documents/GitHub/sagemaker-hyperpod-cli/results \
  recipes.trainer.num_nodes=1

run:
  name: mistral
  results_dir: /Users/xiaoxshe/Documents/GitHub/sagemaker-hyperpod-cli/results/mistral
  time_limit: 6-00:00:00
  model_type: hf
trainer:
  devices: 8
  num_nodes: 1
  accelerator: gpu
  precision: bf16
  max_steps: 50
  log_every_n_steps: 1
  val_check_interval: -1
  limit_val_batches: 0

  2. For a custom script:
hyperpod start-job --config-file examples/basic-job-example-config.yaml

Result

image:
  trainingImage: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-bc09cfd
  pullPolicy: IfNotPresent
trainingConfig:
  jobName: hyperpod-cli-test
  namespace: kubeflow
  scriptPath: /opt/pytorch-mnist/mnist.py
  scriptArgs: ''
  customScript: true
  annotations: null
  customLabels: null
  priority_class_name: null
  device: cpu
  numEFADevices: 0
  numNeuronDevices: null
  ntasksPerNode: 1
  nodes: 2
  restartPolicy: OnFailure
  wandbKey: nil
  serviceAccountName: null
  compile: 0
  persistentVolumeClaims: null

Command

hyperpod start-job --debug --job-name hyperpod-cli-test-3 \
  --job-kind kubeflow/PyTorchJob \
  --image docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-bc09cfd \
  --entry-script /opt/pytorch-mnist/mnist.py \
  --pull-policy IfNotPresent \
  --instance-type ml.g5.2xlarge \
  --node-count 1 --tasks-per-node 1

Result

image:
  trainingImage: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-bc09cfd
  pullPolicy: IfNotPresent
trainingConfig:
  jobName: hyperpod-cli-test-3
  namespace: default
  scriptPath: /opt/pytorch-mnist/mnist.py
  scriptArgs: ''
  customScript: true
  annotations: null
  customLabels: null
  priority_class_name: null
  device: gpu
  numEFADevices: 0
  numNeuronDevices: null
  ntasksPerNode: 1

Alignment

  1. Change all usage of the HyperPod launcher to a git submodule (see the git commands after this list).
  2. pip install replicates all recipe files locally.
  3. Both the custom-script path and the recipe path use main.py to submit jobs.
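
Because the launcher becomes a git submodule, a fresh clone needs the submodule initialized before recipes resolve. Standard git commands; the submodule path is inferred from the launcher path in the commands above:

# Assumption: the launcher submodule lives at the path seen in the launcher commands above
git submodule update --init --recursive
ls src/hyperpod_cli/private_sagemaker_training_launcher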

Pending action item

  1. Replace the private launcher repo with the HyperPod recipes repo on launch day.

For Reviewer

  1. Go through the For Requester section to double-check each item.
  2. Request changes or approve the PR:
    1. If the PR is ready to be merged, click Review changes and select Approve.
    2. If changes are required, select Request changes and provide feedback. Be constructive and clear in your feedback.
  3. Merging the PR:
    1. Check the merge method: decide on the appropriate method based on your repository's guidelines (e.g., Squash and merge, Rebase and merge, or Merge).
    2. Merge the PR: click the Merge pull request button, then confirm by clicking Confirm merge.
