FSX Timeout when changing permission after mounting #6235

Open
samagids opened this issue May 6, 2024 · 2 comments

samagids commented May 6, 2024

During the build, the cluster head node mounts the /scratch FSx volume but hangs while changing the permissions from the original mode to 0777 and the ownership to root. It times out after 600 seconds, causing a cluster create failure. I increased HeadNodeBootstrapTimeout, but that did not help. It looks like there is a hard-coded FSx mount timeout of 600 seconds. Our volume, as reported by df (size, used, available, use%, mount point): 6.8T 2.9T 3.9T 44% /scratch

samagids added the 3.x label May 6, 2024
aws deleted a comment from samagids May 24, 2024
enrico-usai (Contributor) commented

Hi @samagids, sorry for the delay. I hope you were able to fix the issue in the meantime.

I looked at the attached logs, but I cannot see any error related to mounting the FSx volume.
The error in log-events-viewer-results is a bootstrap failure on the head node:

1714755511740,"2024-05-03 16:58:31,740 [ERROR] Command chef (cinc-client --local-mode --config /etc/chef/client.rb --log_level info --logfile /var/log/chef-client.log --force-formatter --no-color --chef-zero-port 8889 --json-attributes /etc/chef/dna.json --override-runlist aws-parallelcluster-entrypoints::config) failed"
1714755511740,"2024-05-03 16:58:31,740 [DEBUG] Command chef output: "
1714755511740,"2024-05-03 16:58:31,740 [ERROR] Error encountered during build of chefConfig: Command chef failed"
1714755511740,"Traceback (most recent call last):
  File ""/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py"", line 579, in run_config
    CloudFormationCarpenter(config, self._auth_config, self.strict_mode).build(worklog)
  File ""/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py"", line 278, in build
    self._config.commands)
  File ""/usr/lib/python3.7/site-packages/cfnbootstrap/command_tool.py"", line 127, in apply
    raise ToolError(u""Command %s failed"" % name)"
1714755511740,cfnbootstrap.construction_errors.ToolError: Command chef failed
1714755511866,"2024-05-03 16:58:31,866 [ERROR] -----------------------BUILD FAILED!------------------------"

In this step, cinc-client is executing recipes to configure the head node, and it is failing. This may be caused by the FSx mount issue you mentioned, but to confirm it we need to look at /var/log/chef-client.log on the head node.
See troubleshooting guide.
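
As a rough aid for reading an export like the one above, here is a minimal Python sketch that pulls the [ERROR] rows out of a log-events CSV. It assumes the two-column layout seen in the excerpt (epoch milliseconds, then a quoted message with doubled quotes). The function name `error_events` and the `SAMPLE` text are illustrative only and are not part of ParallelCluster.

```python
import csv
import io
from datetime import datetime, timezone


def error_events(csv_text):
    """Return (utc_datetime, message) pairs for rows whose message contains [ERROR].

    Assumes each row is `<epoch milliseconds>,"<message>"`, as in the excerpt above.
    """
    events = []
    # csv.reader handles quoted fields that span several lines (e.g. tracebacks).
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2 or not row[0].isdigit():
            continue  # skip blank or malformed rows
        when = datetime.fromtimestamp(int(row[0]) / 1000, tz=timezone.utc)
        if "[ERROR]" in row[1]:
            events.append((when, row[1]))
    return events


# Sample rows copied (and shortened) from the excerpt above.
SAMPLE = '''1714755511740,"2024-05-03 16:58:31,740 [DEBUG] Command chef output: "
1714755511740,"2024-05-03 16:58:31,740 [ERROR] Error encountered during build of chefConfig: Command chef failed"
1714755511740,"Traceback (most recent call last):
    raise ToolError(u""Command %s failed"" % name)"
1714755511866,"2024-05-03 16:58:31,866 [ERROR] -----------------------BUILD FAILED!------------------------"
'''

for when, message in error_events(SAMPLE):
    print(when.isoformat(), message)
```

This only surfaces the failure reported at the cfn-init level. The recipe that actually failed would still have to be found in /var/log/chef-client.log on the head node.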

I also see that you have OnNodeStart and OnNodeConfigured scripts, as well as Active Directory integration.
Can you try creating the cluster without the custom bootstrap scripts or without the AD integration? It would help to verify whether the mount timeout still occurs without them.


I removed the attached files from the GitHub issue because the logs contained details about your subnets, VPCs, policies, AD settings, etc.

If you have an active AWS support contract, please open a case with AWS Premium Support team using the below documentation to report the issue:
https://docs.aws.amazon.com/awssupport/latest/user/case-management.html and attach the log files there.

Enrico

enrico-usai (Contributor) commented May 24, 2024

I saw you're passing an existing FileSystemId for FSx.

An important thing to check is the security groups, to ensure the nodes are able to mount the file system. As stated in the documentation:

Make sure that traffic is allowed between the cluster and file system by doing one of the following:

Configure the security groups of the file system to allow the traffic to and from the CIDR or prefix list of cluster subnets.

Note
AWS ParallelCluster validates that ports are open and that the CIDR or prefix list is configured. AWS ParallelCluster doesn't validate the content of CIDR block or prefix list.

Set custom security groups for cluster nodes by using SlurmQueues / Networking / SecurityGroups and HeadNode / Networking / SecurityGroups. The custom security groups must be configured to allow traffic between the cluster and the file system.

Note
If all cluster nodes use custom security groups, AWS ParallelCluster only validates that the ports are open. AWS ParallelCluster doesn't validate that the source and destination are properly configured.

BTW, I'd suggest using AdditionalSecurityGroups rather than SecurityGroups. The former adds your security groups to the ones created by ParallelCluster. The latter replaces all of ParallelCluster's security groups, so use it carefully and make sure all the required communication is enabled.
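
To make the security group check concrete, here is a minimal Python sketch that tests whether a set of ingress rules lets a cluster subnet reach the file system. It rests on assumptions not stated in this thread: that the file system is FSx for Lustre (which uses TCP 988 and 1018-1023), and that the rules are dicts shaped like the `IpPermissions` entries returned by EC2 `describe_security_groups`. The function names and the sample CIDRs are hypothetical, and rules that reference other security groups or prefix lists are ignored.

```python
import ipaddress

# Assumption: FSx for Lustre, which uses TCP 988 and TCP 1018-1023.
LUSTRE_PORT_RANGES = [(988, 988), (1018, 1023)]


def rule_covers(rule, port_range, subnet_cidr):
    """True if one ingress rule allows the whole TCP port range from the whole subnet."""
    protocol = rule.get("IpProtocol")
    if protocol not in ("tcp", "-1"):
        return False
    if protocol != "-1":  # "-1" means all protocols and all ports
        low, high = port_range
        if rule.get("FromPort", 0) > low or rule.get("ToPort", 0) < high:
            return False
    subnet = ipaddress.ip_network(subnet_cidr)
    # Only IPv4 CIDR sources are checked in this sketch.
    return any(
        subnet.subnet_of(ipaddress.ip_network(ip_range["CidrIp"]))
        for ip_range in rule.get("IpRanges", [])
    )


def missing_port_ranges(ingress_rules, subnet_cidr):
    """Return the Lustre port ranges that no single rule opens to the subnet."""
    return [
        port_range
        for port_range in LUSTRE_PORT_RANGES
        if not any(rule_covers(rule, port_range, subnet_cidr) for rule in ingress_rules)
    ]


# Hypothetical rules: only port 988 is open to the cluster VPC range.
example_rules = [
    {"IpProtocol": "tcp", "FromPort": 988, "ToPort": 988,
     "IpRanges": [{"CidrIp": "10.0.0.0/16"}]},
]
print(missing_port_ranges(example_rules, "10.0.1.0/24"))
```

An empty result means every assumed port range is open to the subnet. Anything else lists the ranges that still need a rule.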


Anyway, from the chef-init logs we should be able to identify where the bootstrap is blocked.
