cluster update fails in 3.10.0, 3.9.3 #6339
Comments
might be related to #6329 |
A bit of an update: I rolled back the very same cluster config to PC 3.9.3 and ran the very same 'pcluster update' successfully. The issue is clearly specific to PC 3.10.? |
Hi snemir2, We have been investigating the issue with the failed update of your AWS ParallelCluster. Our initial findings from the
The failure to restart
This error might stem from:
The sequence of events suggests that the To assist us in further diagnosing the problem and providing a resolution, we kindly request the following:
These details will help us better understand the configuration and environment. We will continue our investigation and keep you informed of any progress. Best regards, |
Hi @hehe7318 - This is the failed api call from cloudtrail.
|
Wow, I got this issue as well. Simply adding a new instanceType to a Slurm cluster queue triggers the creation of a new HeadNode in CloudFormation (ParallelCluster 3.10.0), and the error stems from CloudFormation wanting to create a new head node while the ENI is still attached to the old head node. Why this triggers a new HeadNode resource, I don't know, but based on my own work with custom resources, the CloudFormation custom resource is returning a new ID when an update is triggered, which tells CloudFormation that the HeadNode needs to be replaced. |
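To illustrate the mechanism described in the comment above: for custom resources, CloudFormation compares the `PhysicalResourceId` returned on an Update with the one it already holds, and treats a different value as a replacement. The sketch below is a local simulation only. The `head-node-...` IDs and both function names are hypothetical, and nothing here is confirmed to be how ParallelCluster builds its responses.

```python
import uuid


def build_response(event, stable=True):
    """Build a custom resource response dict for a CloudFormation event.

    stable=True reuses the existing PhysicalResourceId on Update/Delete;
    stable=False mints a fresh ID on every call (the behavior that would
    make CloudFormation replace the resource).
    """
    if event["RequestType"] == "Create" or not stable:
        # Hypothetical ID scheme, used only for this illustration.
        physical_id = f"head-node-{uuid.uuid4().hex[:8]}"
    else:
        physical_id = event["PhysicalResourceId"]
    return {
        "Status": "SUCCESS",
        "PhysicalResourceId": physical_id,
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": {},
    }


def is_replacement(event, response):
    """Return True if CloudFormation would treat this Update as a replacement."""
    return (
        event["RequestType"] == "Update"
        and response["PhysicalResourceId"] != event["PhysicalResourceId"]
    )


update_event = {
    "RequestType": "Update",
    "PhysicalResourceId": "head-node-original",
    "StackId": "example-stack-id",
    "RequestId": "example-request-id",
    "LogicalResourceId": "HeadNode",
}

print(is_replacement(update_event, build_response(update_event, stable=True)))
print(is_replacement(update_event, build_response(update_event, stable=False)))
```

If a handler behaves like the `stable=False` branch, every update would look like a replacement to CloudFormation, which would be consistent with a new head node being created while the old one still holds the ENI.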
Apologies for the late reply. Can you share your original and updated cluster configuration YAML file? That could help us to reproduce the error. I tried to add a new |
Hello! I'm seeing this issue as well when trying to update our custom AMI. Have there been any updates here? I'm happy to share our before and after configuration if that would be helpful. This is on cluster version 3.9.3.
After Update:
|
Still have this issue... |
Hi @samcofer, Apologies for the late reply! You were trying to update To update AMIs on compute nodes, use Thank you, |
Hi @snemir2, Could you share cluster configuration files before/after the update? Thank you, |
actually, config files do not change (the update logic is within the OnHeadnodeUpdated script, thus forcing the update) |
In the screenshot you provided, the head node was being updated. The head node shouldn't be updated if the configuration file didn't change. Could you double check whether the configuration file didn't change (e.g. can you provide us the output of Thank you! |
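One way to double check that the configuration really is unchanged is to diff the two versions of the file locally. This is a minimal sketch using Python's standard `difflib`. The function name, the `before.yaml` / `after.yaml` labels, and the sample YAML are placeholders, not taken from this issue.

```python
import difflib


def config_diff(before_text, after_text):
    """Return unified-diff lines between two cluster configuration texts.

    An empty list means the two texts are identical.
    """
    return list(
        difflib.unified_diff(
            before_text.splitlines(),
            after_text.splitlines(),
            fromfile="before.yaml",  # placeholder label
            tofile="after.yaml",     # placeholder label
            lineterm="",
        )
    )


# Hypothetical configuration fragments, for illustration only.
before = "Scheduling:\n  Scheduler: slurm\n  SlurmQueues:\n    - Name: queue1\n"
after = before + "    - Name: queue2\n"

for line in config_diff(before, after):
    print(line)
```

In practice the two strings would be read from the configuration file used for the original cluster and the one passed to the update.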
If you have an active AWS support contract, please open a case with AWS Premium Support team using the below documentation to report the issue:
https://docs.aws.amazon.com/awssupport/latest/user/case-management.html
Before submitting a new issue, please search through open GitHub Issues and check out the troubleshooting documentation.
Please make sure to add the following data in order to facilitate the root cause detection.
Required Info:
pcluster describe-cluster command.
Bug description and how to reproduce:
A clear and concise description of what the bug is and the steps to reproduce the behavior.
The cluster repeatedly fails to update and, from CloudFormation's point of view, goes to "rollback complete". (The custom routines do not even appear to get called.)
If you are reporting issues about scaling or job failure:
We cannot work on issues without proper logs. We STRONGLY recommend following this guide and attach the complete cluster log archive with the ticket.
For issues with Slurm scheduler, please attach the following logs:
/var/log/parallelcluster/clustermgtd, /var/log/parallelcluster/clusterstatusmgtd (if version >= 3.2.0), /var/log/parallelcluster/slurm_resume.log, /var/log/parallelcluster/slurm_suspend.log, /var/log/parallelcluster/slurm_fleet_status_manager.log (if version >= 3.2.0) and /var/log/slurmctld.log.
/var/log/parallelcluster/computemgtd.log and /var/log/slurmd.log.
If you are reporting issues about cluster creation failure or node failure:
If the cluster fails creation, please re-execute the create-cluster action using the --rollback-on-failure false option.
We cannot work on issues without proper logs. We STRONGLY recommend following this guide and attach the complete cluster log archive with the ticket.
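As a rough sketch of the re-execution step above, the command line can be assembled like this. The cluster name and configuration path are placeholders, and the helper function is hypothetical. The snippet only builds and prints the command, it does not run it.

```python
import shlex


def create_cluster_command(cluster_name, config_path):
    """Build the argv for re-running create-cluster with rollback disabled."""
    return [
        "pcluster",
        "create-cluster",
        "--cluster-name",
        cluster_name,
        "--cluster-configuration",
        config_path,
        "--rollback-on-failure",
        "false",
    ]


# Placeholder values, for illustration only.
command = create_cluster_command("my-cluster", "cluster-config.yaml")
print(shlex.join(command))
```

With rollback disabled, the failed resources stay in place, so the logs listed below remain available for collection.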
Please be sure to attach the following logs:
/var/log/cloud-init.log, /var/log/cfn-init.log and /var/log/chef-client.log (attached)
/var/log/cloud-init-output.log. NA
logs.tgz <-headnode logs
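For anyone collecting the same files by hand, here is a minimal sketch of bundling the head node logs named in the template into a single archive with Python's standard `tarfile`. The function name and the output file name are assumptions. Paths that do not exist on the node are skipped rather than treated as errors, and reading files under /var/log may require elevated permissions.

```python
import os
import tarfile

# Log paths as listed in the issue template above.
HEAD_NODE_LOGS = [
    "/var/log/cloud-init.log",
    "/var/log/cfn-init.log",
    "/var/log/chef-client.log",
    "/var/log/parallelcluster/clustermgtd",
    "/var/log/parallelcluster/clusterstatusmgtd",
    "/var/log/parallelcluster/slurm_resume.log",
    "/var/log/parallelcluster/slurm_suspend.log",
    "/var/log/parallelcluster/slurm_fleet_status_manager.log",
    "/var/log/slurmctld.log",
]


def archive_logs(paths, archive_path):
    """Write every existing file in `paths` into a gzip-compressed tar archive.

    Returns the list of paths that were actually added.
    """
    added = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in paths:
            if os.path.isfile(path):
                tar.add(path)
                added.append(path)
    return added


# Example call on a head node (not executed here):
# archive_logs(HEAD_NODE_LOGS, "logs.tgz")
```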
Additional context:
Any other context about the problem. E.g.:
~/.parallelcluster/pcluster-cli.log