cluster update fails in 3.10.0, 3.9.3 #6339

snemir2 · 2024-07-09T21:42:06Z

If you have an active AWS support contract, please open a case with AWS Premium Support team using the below documentation to report the issue:
https://docs.aws.amazon.com/awssupport/latest/user/case-management.html

Before submitting a new issue, please search through open GitHub Issues and check out the troubleshooting documentation.

Please make sure to add the following data in order to facilitate the root cause detection.

Required Info:

AWS ParallelCluster version [e.g. 3.1.1]: 3.10.0
Full cluster configuration without any credentials or personal data.
Cluster name: A2AiClustertesting
Output of pcluster describe-cluster command.

 pcluster describe-cluster -n A2AiClustertesting -r us-east-2
{
  "creationTime": "2024-07-09T18:49:23.141Z",
  "headNode": {
    "launchTime": "2024-07-09T18:53:32.000Z",
    "instanceId": "i-0976556062851f6ca",
    "instanceType": "r6i.xlarge",
    "state": "running",
    "privateIpAddress": "10.2.46.69"
  },
  "version": "3.10.0",
  "clusterConfiguration": {
    "url": "https://parallelcluster-97a8b56da16cbe1e-v1-do-not-delete.s3.us-east-2.amazonaws.com/parallelcluster/3.10.0/clusters/a2aiclustertesting-3vtysbdo94ne39ev/configs/cluster-config.yaml?versionId=rhu_16ixsYqMF2ctGyWRVVtNGS.jwCQI&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIAZQUXECJHGFFWWCDI%2F20240709%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20240709T212526Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEG0aCXVzLWVhc3QtMiJHMEUCIFlopqMIbJ6IffFlaCwfCvGgUh0RIeHnnlInHBRnLECNAiEAnl07d3CZ39jLquycKjIjGyDcuvvzOR%2FiJ7vAVG%2FBRDAq%2FgIINxAAGgw2NTQyMjU3MDc1OTgiDNQTAke7TLCxmdDjFirbAlrYusjjQ3oD4XjjeNPyzpzxeX6as8JfiomXPRwmzsHOJl7ttg11miKNZ1h4h%2Fgt2MN%2FVucaVJoc%2BnWfHiXHQ8PTfWqjisZ698iw2QrMzLYzatufZSuwpfumz93eH1E8UCtNctjCvUdIqsr6vwTXFKoPqXhKm5KgZ5pfgqK381VNQmFP1xxPqnflpyL0pRnIRBC76XWdaD1zNAZluzp0Zxce75MiXjPT1NPqqu%2Fcux3VSTHgvPbuJfF2yri5pfRpp7n7KiLHgBus8OAfM%2FEwFMLvtnNPP61Hk%2BU0YvWZvuuXF6lLisxqxw4wZYNB0zR7zF3GecXDvuW4ZS%2Bapdme8hzOCk4xh4XG271G1p6Ch%2FG%2BvIWF4roQGgJBu3mOWrOEERzihvgeEDCZUsyIhJnUrJjSPuYsfAlf7aDDvEgru4sW2tKjCsShGpph%2F3cQa2hw4Y1k0DaUSCHPVZ5GMMXVtrQGOqcBeG9WoHb5rdd%2FG0uUI3pfUDWLVFC%2FyswoY22gb0Rkk8GIb3bBSm9SrYZOmxXw5lz%2FOP8X446KfcLMzInE2WeSZ9cijK5RT%2FAywuCQm4yXCglbra%2B0OG0r%2BWc%2BX0MkFDrtepKJRKeMH7pesvzqm8MWkqWpUUC59r9u%2BKa58HZQ6jjx0Icl00MUxa17OqYzQ0vUZqADUxggW9QddoJcdcpKLSlKUkEfjSY%3D&X-Amz-Signature=63e0e16188b1f9e6c696e5b980610f33e775c3fd801df5c1b4d618362eba722a"
  },
  "tags": [
    {
      "value": "branch/release-v4.0",
      "key": "A2AI:a2ai-cloud-version"
    },
    {
      "value": "mig8KP4B19EMB",
      "key": "map-migrated"
    },
    {
      "value": "3.10.0",
      "key": "parallelcluster:version"
    },
    {
      "value": "A2AiClustertesting",
      "key": "parallelcluster:cluster-name"
    },
    {
      "value": "sergey",
      "key": "A2AI:creator"
    },
    {
      "value": "dev",
      "key": "A2AI:a2ai-cloud-env"
    }
  ],
  "cloudFormationStackStatus": "UPDATE_ROLLBACK_COMPLETE",
  "clusterName": "A2AiClustertesting",
  "computeFleetStatus": "RUNNING",
  "cloudformationStackArn": "arn:aws:cloudformation:us-east-2:654225707598:stack/A2AiClustertesting/f531bdf0-3e23-11ef-997c-06835d7b2d0f",
  "lastUpdatedTime": "2024-07-09T19:32:18.912Z",
  "region": "us-east-2",
  "clusterStatus": "UPDATE_FAILED",
  "scheduler": {
    "type": "slurm"
  }
}

[Optional] Arn of the cluster CloudFormation main stack:

Bug description and how to reproduce:
A clear and concise description of what the bug is and the steps to reproduce the behavior.

The cluster repeatedly fails to update and, from CloudFormation's point of view, ends up in "rollback complete". (The custom routines do not even appear to get called.)

If you are reporting issues about scaling or job failure:
We cannot work on issues without proper logs. We STRONGLY recommend following this guide and attach the complete cluster log archive with the ticket.

For issues with Slurm scheduler, please attach the following logs:

From Head node: /var/log/parallelcluster/clustermgtd, /var/log/parallelcluster/clusterstatusmgtd (if version >= 3.2.0), /var/log/parallelcluster/slurm_resume.log, /var/log/parallelcluster/slurm_suspend.log, /var/log/parallelcluster/slurm_fleet_status_manager.log (if version >= 3.2.0) and /var/log/slurmctld.log.
From Compute node: /var/log/parallelcluster/computemgtd.log and /var/log/slurmd.log.

If you are reporting issues about cluster creation failure or node failure:

If the cluster fails creation, please re-execute create-cluster action using --rollback-on-failure false option.

We cannot work on issues without proper logs. We STRONGLY recommend following this guide and attach the complete cluster log archive with the ticket.

Please be sure to attach the following logs:

From Head node: /var/log/cloud-init.log, /var/log/cfn-init.log and /var/log/chef-client.log
(attached)
From Compute node: /var/log/cloud-init-output.log.
NA
logs.tgz (head node logs)

Additional context:
Any other context about the problem. E.g.:

CLI logs: ~/.parallelcluster/pcluster-cli.log
Custom bootstrap scripts, if any
Screenshots, if useful.

snemir2 · 2024-07-09T23:58:16Z

Might be related to #6329.

snemir2 · 2024-07-10T15:29:32Z

A bit of an update: I rolled the very same cluster config back to PC 3.9.3, and the very same 'pcluster update' succeeded. The issue is clearly specific to PC 3.10.?

hehe7318 · 2024-07-31T19:06:32Z

Hi snemir2,

Can you share your original and updated pcluster configuration yaml file? That could help us to reproduce the error.

Regarding "might be related to #6329": it does not seem to be related.

Best regards,
Xuanqi He

hehe7318 · 2024-07-31T19:49:34Z

Hi snemir2,

We have been investigating the issue with the failed update of your AWS ParallelCluster. Our initial findings from the cfn-init.log file suggest that a critical point of failure might be related to the portkey.service. The log shows:

              + systemctl restart portkey
              Warning: The unit file, source configuration file, or drop-ins of portkey.service changed on disk. Run 'systemctl daemon-reload' to reload units.
              Job for portkey.service failed because the control process exited with error code.
              See "systemctl status portkey.service" and "journalctl -xeu portkey.service" for details.
              + echo 'skipping portkey restart'
              skipping portkey restart

The failure to restart portkey.service could have impacted subsequent operations, such as accessing the S3 bucket, as seen in the error message below:

Error executing action `run` on resource 'bash[configure_portkey]'
Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'
---- Begin output of "bash"  ----
STDOUT: 
STDERR: + mkdir -p /etc/portkey
+ aws s3 ls s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/a2ai/branch/release-v4.0/portkey/
---- End output of "bash"  ----
Ran "bash"  returned 1
Failed to execute OnNodeUpdated script 1 s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v4.0/download_and_run_cookbook.sh, return code: 1.
CloudFormation signaled successfully with status FAILURE

This error might stem from:

S3 Bucket Path or Permissions: The path might be incorrect, or the IAM role might not have the necessary permissions to access the S3 bucket.
Network Configuration Issues: These might involve the ENI (Elastic Network Interface), which could prevent S3 bucket access.

The sequence of events suggests that the portkey.service restart failure may have been the root cause, leading to issues with accessing the S3 bucket. We recommend running systemctl daemon-reload to reload the unit files and then retrying the update process.
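As an illustration of the kind of triage described above, here is a minimal Python sketch that scans log lines for failure markers. The marker list is an assumption drawn only from the phrases in the excerpts quoted in this comment; it is not an official ParallelCluster tool, and the sample S3 path is a hypothetical placeholder.

```python
import re

# Hypothetical marker list, based only on phrases seen in the excerpts above.
FAILURE_MARKERS = [
    r"failed because the control process exited",
    r"Error executing action",
    r"ShellCommandFailed",
    r"Failed to execute \w+ script",
    r"signaled successfully with status FAILURE",
]
_PATTERN = re.compile("|".join(FAILURE_MARKERS))


def find_failures(lines):
    """Return (line_number, text) pairs for lines that match a failure marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if _PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


# Sample lines modeled on the excerpts; the bucket name is a placeholder.
sample = [
    "+ systemctl restart portkey",
    "Job for portkey.service failed because the control process exited with error code.",
    "+ echo 'skipping portkey restart'",
    "Error executing action `run` on resource 'bash[configure_portkey]'",
    "Failed to execute OnNodeUpdated script 1 s3://example-bucket/script.sh, return code: 1.",
    "CloudFormation signaled successfully with status FAILURE",
]

for number, text in find_failures(sample):
    print(f"{number}: {text}")
```

Listing the hits in order makes it easier to see which failure came first, which matters when deciding whether a given error is a cause or a consequence.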

To assist us in further diagnosing the problem and providing a resolution, we kindly request the following:

Chef Client Log: Please provide the chef-client.log file from the affected instances for more detailed insights.
ParallelCluster Configuration Files:
- The original parallelcluster configuration YAML file before the update.
- The updated parallelcluster configuration YAML file used during the failed update.

These details will help us better understand the configuration and environment. We will continue our investigation and keep you informed of any progress.

Best regards,
Xuanqi He

snemir2 · 2024-08-05T11:17:55Z

Hi @hehe7318,
Thank you for looking into the issue. I am almost sure that the error you are referring to is a consequence, not the source, of the problem (I took portkey out of the solution and got the same failure). As you can see from CloudFormation/CloudTrail, the update (for some strange reason) makes a "RunInstances" API call, fails to re-provision the ENI on the head node, and then fails. That causes the networking problems you are seeing on the instance.
FYI: I am also seeing this problem on 3.9.3.

This is the failed API call from CloudTrail:

{
    "eventVersion": "1.09",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "AROAZQUXECJHPLP3U6GDM:sergey@a2-ai.com",
        "arn": "arn:aws:sts::654225707598:assumed-role/AWSReservedSSO_AWSAdministratorAccess_6cb29b3b61ef9620/sergey@a2-ai.com",
        "accountId": "654225707598",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "AROAZQUXECJHPLP3U6GDM",
                "arn": "arn:aws:iam::654225707598:role/aws-reserved/sso.amazonaws.com/us-east-2/AWSReservedSSO_AWSAdministratorAccess_6cb29b3b61ef9620",
                "accountId": "654225707598",
                "userName": "AWSReservedSSO_AWSAdministratorAccess_6cb29b3b61ef9620"
            },
            "attributes": {
                "creationDate": "2024-08-05T11:03:25Z",
                "mfaAuthenticated": "false"
            }
        },
        "invokedBy": "cloudformation.amazonaws.com"
    },
    "eventTime": "2024-08-05T11:04:28Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "RunInstances",
    "awsRegion": "us-east-2",
    "sourceIPAddress": "cloudformation.amazonaws.com",
    "userAgent": "cloudformation.amazonaws.com",
    "errorCode": "Client.InvalidNetworkInterface.InUse",
    "errorMessage": "Interface: [eni-07c94d1a9cd037f2f] in use.",
    "requestParameters": {
        "instancesSet": {
            "items": [
                {
                    "minCount": 1,
                    "maxCount": 1
                }
            ]
        },
        "blockDeviceMapping": {},
        "monitoring": {
            "enabled": false
        },
        "disableApiTermination": false,
        "disableApiStop": false,
        "clientToken": "14c7a9db-2c7d-217d-48d5-262ade2651c6",
        "ebsOptimized": false,
        "tagSpecificationSet": {
            "items": [
                {
                    "resourceType": "instance",
                    "tags": [
                        {
                            "key": "parallelcluster:version",
                            "value": "3.9.3"
                        },
                        {
                            "key": "aws:cloudformation:stack-name",
                            "value": "A2AiClustertesting"
                        },
                        {
                            "key": "aws:cloudformation:stack-id",
                            "value": "arn:aws:cloudformation:us-east-2:654225707598:stack/A2AiClustertesting/4a6dc1c0-50e3-11ef-9498-021a39b766eb"
                        },
                        {
                            "key": "A2AI:creator",
                            "value": "sergey"
                        },
                        {
                            "key": "parallelcluster:networking",
                            "value": "EFA=NONE"
                        },
                        {
                            "key": "parallelcluster:filesystem",
                            "value": "efs=1, multiebs=1, raid=0, fsx=0"
                        },
                        {
                            "key": "Name",
                            "value": "HeadNode"
                        },
                        {
                            "key": "map-migrated",
                            "value": "mig8KP4B19EMB"
                        },
                        {
                            "key": "A2AI:a2ai-cloud-version",
                            "value": "branch/release-v4.0"
                        },
                        {
                            "key": "parallelcluster:cluster-name",
                            "value": "A2AiClustertesting"
                        },
                        {
                            "key": "aws:cloudformation:logical-id",
                            "value": "HeadNode"
                        },
                        {
                            "key": "parallelcluster:node-type",
                            "value": "HeadNode"
                        },
                        {
                            "key": "parallelcluster:attributes",
                            "value": "ubuntu2204, slurm, 3.9.3, x86_64"
                        },
                        {
                            "key": "A2AI:a2ai-cloud-env",
                            "value": "dev"
                        }
                    ]
                }
            ]
        },
        "launchTemplate": {
            "launchTemplateId": "lt-093388f1a2c21ab1a",
            "version": "2"
        }
    },
    "responseElements": null,
    "requestID": "9ea4e084-bdb9-4d8c-a268-f388506ee1ea",
    "eventID": "3a98e509-4f97-4b03-a55f-704a1303439d",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "654225707598",
    "eventCategory": "Management"
}
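For readers who want to pull the key facts out of a record like the one above, the following is a minimal Python sketch. It assumes the CloudTrail event is available as a JSON string; `EVENT_JSON` is a trimmed copy of the record above, and `summarize_failure` is a hypothetical helper name, not part of any AWS SDK.

```python
import json
import re

# Trimmed copy of the CloudTrail record above; only the fields used here are kept.
EVENT_JSON = """
{
    "eventTime": "2024-08-05T11:04:28Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "RunInstances",
    "errorCode": "Client.InvalidNetworkInterface.InUse",
    "errorMessage": "Interface: [eni-07c94d1a9cd037f2f] in use.",
    "userIdentity": {"invokedBy": "cloudformation.amazonaws.com"}
}
"""


def summarize_failure(event):
    """Return a small dict describing a failed API call, or None if no error is recorded."""
    if "errorCode" not in event:
        return None
    # Collect any network interface IDs mentioned in the error message.
    eni_ids = re.findall(r"eni-[0-9a-f]+", event.get("errorMessage", ""))
    return {
        "api_call": event.get("eventName"),
        "invoked_by": event.get("userIdentity", {}).get("invokedBy"),
        "error_code": event["errorCode"],
        "eni_ids": eni_ids,
    }


summary = summarize_failure(json.loads(EVENT_JSON))
print(summary)
```

The summary shows that the failing call is a `RunInstances` issued by CloudFormation and that it names a network interface that is already in use, which is consistent with the head node being replaced rather than updated in place.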

francisreyes-tfs · 2024-08-05T19:54:51Z

Wow, I got this issue as well. Simply adding a new InstanceType to a Slurm cluster queue triggers the creation of a new HeadNode in CloudFormation (ParallelCluster 3.10.0), and the error stems from CloudFormation wanting to create a new head node while the ENI is still attached to the old one. Why this triggers a new HeadNode resource, I don't know, but based on my own work with custom resources, the CloudFormation custom resource is returning a new ID when an update is triggered, which tells CloudFormation that the HeadNode needs to be replaced.

hanwen-pcluste · 2024-08-14T19:22:53Z

Apologies for the late reply.

Can you share your original and updated cluster configuration YAML file? That could help us to reproduce the error. I tried to add a new InstanceType and successfully updated my cluster.

samcofer · 2024-09-10T21:56:46Z

Hello! I'm seeing this issue as well when trying to update our CustomAmi. Have there been any updates here? I'm happy to share our before and after configuration if that would be helpful.

This is on cluster version 3.9.3.

Before Update:

HeadNode: 
  CustomActions: 
    OnNodeConfigured: 
      Script: "s3://OBSCURED/install-pwb-config.sh"
  Iam: 
    S3Access: 
      - BucketName: OBSCURED
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      - Policy: arn:aws:iam::637485797898:policy/elbaccess-c5f9897
  InstanceType: t3.xlarge
  Networking: 
    SubnetId: subnet-0a937bc9f0c04ad8b
    AdditionalSecurityGroups: 
      - sg-09e341de4a0f2773e
  LocalStorage:
    RootVolume:
      Size: 120 
  SharedStorageType: Efs
  Ssh:
    KeyName: OBSCURED
Image: 
  Os: ubuntu2004
  CustomAmi: ami-043eba58b4b8131c6
Region: eu-west-1
Scheduling: 
  Scheduler: slurm
  SlurmSettings:
    EnableMemoryBasedScheduling: true
    Database:
      Uri: slurm-OBSCURED.eu-west-1.rds.amazonaws.com:3306
      UserName: slurm_db_admin
      PasswordSecretArn: arn:aws:secretsmanager:eu-west-1:637485797898:secret:OBSCURED
      DatabaseName: slurm    
  SlurmQueues: 

    - Name: interactive
      ComputeResources:
        - Name: rstudio 
          InstanceType: t3.xlarge
          MaxCount: 20 
          MinCount: 1
          Efa:
            Enabled: FALSE
      CustomSlurmSettings:
        OverSubscribe: FORCE:2
      CustomActions:
        OnNodeConfigured:
          Script: "s3://OBSCURED/config-compute.sh"
      Iam:
        S3Access:
          - BucketName: OBSCURED
      Networking:
        PlacementGroup:
          Enabled: FALSE
        SubnetIds:
          - subnet-0a937bc9f0c04ad8b

    - Name: all 
      ComputeResources:
        - Name: rstudio 
          InstanceType: t3.xlarge
          MaxCount: 10
          MinCount: 0 
          Efa:
            Enabled: FALSE
      CustomActions:
        OnNodeConfigured:
          Script: "s3://OBSCURED/config-compute.sh"
      Iam:
        S3Access:
          - BucketName: OBSCURED
      Networking:
        PlacementGroup:
          Enabled: FALSE
        SubnetIds:
          - subnet-0a937bc9f0c04ad8b

    - Name: gpu 
      ComputeResources:
        - Name: large
          InstanceType: p3.2xlarge
          MaxCount: 1
          MinCount: 0
          Efa:
            Enabled: FALSE
      CustomActions:
        OnNodeConfigured:
          Script: "s3://OBSCURED/config-compute.sh"
      Iam:
        S3Access:
          - BucketName: OBSCURED
      Networking:
        PlacementGroup:
          Enabled: FALSE
        SubnetIds: 
          - subnet-0a937bc9f0c04ad8b


LoginNodes:
  Pools:
    - Name: login
      Count: 2 
      InstanceType: t3.xlarge
      Networking:
        AdditionalSecurityGroups: 
          - sg-09ca531e5331195f1
        SubnetIds: 
          - subnet-0a937bc9f0c04ad8b
      Ssh:
        KeyName: OBSCURED

DevSettings:
  Timeouts:
    HeadNodeBootstrapTimeout: 7200  # timeout in seconds
    ComputeNodeBootstrapTimeout: 7200  # timeout in seconds

SharedStorage:
  - MountDir: /home
    Name: home
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200
      DeploymentType: SCRATCH_2
  - MountDir: /opt/rstudio
    Name: rstudio
    StorageType: Efs
  - MountDir: /opt/apps
    Name: appstack
    StorageType: Efs
    EfsSettings:
      FileSystemId: fs-OBSCURED

DirectoryService:
  DomainName: OBSCURED
  DomainAddr: ldap://OBSCURED
  PasswordSecretArn: arn:aws:secretsmanager:eu-west-1:637485797898:secret:OBSCURED
  DomainReadOnlyUser: cn=Administrator,cn=Users,dc=pwb,dc=posit,dc=co
  GenerateSshKeysForUsers: true
  AdditionalSssdConfigs: 
    override_homedir : /home/%u
    ldap_id_use_start_tls : false
    ldap_tls_reqcert : never
    ldap_auth_disable_tls_never_use_in_production : true

Tags:
  - Key: rs:environment
    Value: development
  - Key: rs:owner
    Value: OBSCURED 
  - Key: rs:project
    Value: solutions
  - Key: rs:subsystem
    Value: ukhsa

After Update:

HeadNode: 
  CustomActions: 
    OnNodeConfigured: 
      Script: "s3://OBSCURED/install-pwb-config.sh"
  Iam: 
    S3Access: 
      - BucketName: OBSCURED
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      - Policy: arn:aws:iam::637485797898:policy/elbaccess-c5f9897
  InstanceType: t3.xlarge
  Networking: 
    SubnetId: subnet-0a937bc9f0c04ad8b
    AdditionalSecurityGroups: 
      - sg-09e341de4a0f2773e
  LocalStorage:
    RootVolume:
      Size: 120 
  SharedStorageType: Efs
  Ssh:
    KeyName: OBSCURED
Image: 
  Os: ubuntu2004
  CustomAmi: ami-092b5633346d89b54
Region: eu-west-1
Scheduling: 
  Scheduler: slurm
  SlurmSettings:
    EnableMemoryBasedScheduling: true
    Database:
      Uri: slurm-OBSCURED.eu-west-1.rds.amazonaws.com:3306
      UserName: slurm_db_admin
      PasswordSecretArn: arn:aws:secretsmanager:eu-west-1:637485797898:secret:OBSCURED
      DatabaseName: slurm    
  SlurmQueues: 

    - Name: interactive
      ComputeResources:
        - Name: rstudio 
          InstanceType: t3.xlarge
          MaxCount: 20 
          MinCount: 1
          Efa:
            Enabled: FALSE
      CustomSlurmSettings:
      CustomActions:
        OnNodeConfigured:
          Script: "s3://OBSCURED/config-compute.sh"
      Iam:
        S3Access:
          - BucketName: OBSCURED
      Networking:
        PlacementGroup:
          Enabled: FALSE
        SubnetIds:
          - subnet-0a937bc9f0c04ad8b

    - Name: all 
      ComputeResources:
        - Name: rstudio 
          InstanceType: t3.xlarge
          MaxCount: 10
          MinCount: 0 
          Efa:
            Enabled: FALSE
      CustomActions:
        OnNodeConfigured:
          Script: "s3://OBSCURED/config-compute.sh"
      Iam:
        S3Access:
          - BucketName: OBSCURED
      Networking:
        PlacementGroup:
          Enabled: FALSE
        SubnetIds:
          - subnet-0a937bc9f0c04ad8b

    - Name: gpu 
      ComputeResources:
        - Name: large
          InstanceType: p3.2xlarge
          MaxCount: 1
          MinCount: 0
          Efa:
            Enabled: FALSE
      CustomActions:
        OnNodeConfigured:
          Script: "s3://OBSCURED/config-compute.sh"
      Iam:
        S3Access:
          - BucketName: OBSCURED
      Networking:
        PlacementGroup:
          Enabled: FALSE
        SubnetIds: 
          - subnet-0a937bc9f0c04ad8b


LoginNodes:
  Pools:
    - Name: login
      Count: 2 
      InstanceType: t3.xlarge
      Networking:
        AdditionalSecurityGroups: 
          - sg-09ca531e5331195f1
        SubnetIds: 
          - subnet-0a937bc9f0c04ad8b
      Ssh:
        KeyName: OBSCURED

DevSettings:
  Timeouts:
    HeadNodeBootstrapTimeout: 7200  # timeout in seconds
    ComputeNodeBootstrapTimeout: 7200  # timeout in seconds

SharedStorage:
  - MountDir: /home
    Name: home
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200
      DeploymentType: SCRATCH_2
  - MountDir: /opt/rstudio
    Name: rstudio
    StorageType: Efs
  - MountDir: /opt/apps
    Name: appstack
    StorageType: Efs
    EfsSettings:
      FileSystemId: fs-OBSCURED

DirectoryService:
  DomainName: OBSCURED
  DomainAddr: ldap://OBSCURED
  PasswordSecretArn: arn:aws:secretsmanager:eu-west-1:637485797898:secret:OBSCURED
  DomainReadOnlyUser: cn=Administrator,cn=Users,dc=pwb,dc=posit,dc=co
  GenerateSshKeysForUsers: true
  AdditionalSssdConfigs: 
    override_homedir : /home/%u
    ldap_id_use_start_tls : false
    ldap_tls_reqcert : never
    ldap_auth_disable_tls_never_use_in_production : true

Tags:
  - Key: rs:environment
    Value: development
  - Key: rs:owner
    Value: OBSCURED 
  - Key: rs:project
    Value: solutions
  - Key: rs:subsystem
    Value: ukhsa
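To make the difference between two configurations like these easier to spot, here is a minimal Python sketch that compares two nested configurations leaf by leaf. It assumes the YAML has already been loaded into plain dicts and lists; the `before` and `after` values are a hand-copied subset of the two configurations above, and `flatten`, `diff_configs`, and `MISSING` are hypothetical names used only for illustration.

```python
MISSING = "<absent>"  # sentinel for a path that exists on only one side


def flatten(node, prefix=""):
    """Flatten nested dicts/lists into {"A/B[0]/C": value} pairs."""
    flat = {}
    if isinstance(node, dict):
        for key, value in node.items():
            flat.update(flatten(value, f"{prefix}/{key}" if prefix else key))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            flat.update(flatten(value, f"{prefix}[{index}]"))
    else:
        flat[prefix] = node
    return flat


def diff_configs(before, after):
    """Return {path: (old, new)} for every leaf that was added, removed, or changed."""
    old, new = flatten(before), flatten(after)
    return {
        path: (old.get(path, MISSING), new.get(path, MISSING))
        for path in sorted(old.keys() | new.keys())
        if old.get(path, MISSING) != new.get(path, MISSING)
    }


# Hand-copied subset of the two configurations above.
before = {
    "Image": {"Os": "ubuntu2004", "CustomAmi": "ami-043eba58b4b8131c6"},
    "Scheduling": {"SlurmQueues": [
        {"Name": "interactive", "CustomSlurmSettings": {"OverSubscribe": "FORCE:2"}},
    ]},
}
after = {
    "Image": {"Os": "ubuntu2004", "CustomAmi": "ami-092b5633346d89b54"},
    "Scheduling": {"SlurmQueues": [
        {"Name": "interactive", "CustomSlurmSettings": None},
    ]},
}

for path, (old_value, new_value) in diff_configs(before, after).items():
    print(f"{path}: {old_value!r} -> {new_value!r}")
```

On this subset the sketch reports the changed `Image/CustomAmi` and also the `CustomSlurmSettings` entry of the `interactive` queue, where `OverSubscribe` is present in the first configuration as posted and absent in the second. Whether that second difference was intended is not stated in the thread.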

snemir2 · 2025-01-17T19:38:27Z

I still have this issue.
My changes are exclusively in the OnNodeUpdated script (which never got called; the update failed before that).

hanwen-cluster · 2025-01-21T21:16:04Z

Hi @samcofer,

Apologies for the late reply!

You were trying to update CustomAmi under the root Image section, which is not supported. Forcing the update causes head node replacement, which triggers the network interface in-use failure.

To update AMIs on compute nodes, use the Image section in Queues.
To update the AMI on the head node, please create a new cluster.

Thank you,

hanwen-cluster · 2025-01-21T21:19:34Z

Hi @snemir2,

Could you share cluster configuration files before/after the update?

Thank you,
Hanwen

snemir2 · 2025-01-28T16:52:25Z

Actually, the config files do not change (the update logic is within the head node's OnNodeUpdated script, which is why the update is forced).
During the update, it fails before the script is ever called, so the script is known not to be the issue.

hanwen-cluster · 2025-01-28T19:37:04Z

In the screenshot you provided, the head node was being updated. The head node shouldn't be updated if the configuration file didn't change. Could you double-check that the configuration file really didn't change (e.g., by providing us with the output of pcluster update-cluster)?

Thank you!

snemir2 added the 3.x label Jul 9, 2024

snemir2 mentioned this issue Jul 10, 2024

Potential race condition on cluster deletion in 3.10.0 #6329

Closed

snemir2 changed the title from "cluster update fails in 3.10.0" to "cluster update fails in 3.10.0, 3.9.3" Aug 5, 2024

elduds mentioned this issue Aug 6, 2024

Feature or Documentation Request - Continuous Deployment (eg. Blue/Green ) #6382

Open

This comment was marked as off-topic.

snemir2 mentioned this issue Jan 17, 2025

Service: AmazonEC2; Status Code: 400; Error Code: InvalidNetworkInterface.InUse #4063

Closed
