I'm creating this issue directly from the text of an email sent on 2025-01-10 from @ZacharyWills titled "Update on SCHISM":
We (Jason, Mykel, etc.) have successfully compiled the NWMv3.0 and SCHISM models on the cloud sandbox using the standard suite of Intel compilers:

- intel-oneapi-compilers/2023.1.0-gcc-11.2.1-3a7dxu3
- intel-oneapi-mpi/2021.9.0-intel-2021.9.0-egjrbfg
- netcdf-c/4.9.2-intel-2021.9.0-vznmeik
- netcdf-fortran/4.6.1-intel-2021.9.0-meeveoj
- parallelio/2.6.2-intel-2021.9.0-csz55zr

Given the MPI parallelization and the high-resolution domain configurations of these models, running them on NOAA RDHPCS supercomputers (the Hera cluster) has required node configurations providing roughly 3-10 GB of RAM per CPU. On the cloud sandbox, we tested a small SCHISM coastal model domain (700,000 elements) and were only able to execute it successfully on the x2idn.32xlarge node type (16 GB/CPU), while the hpc6a.48xlarge configuration (4 GB/CPU) consistently threw Fortran allocation errors directly from the code base itself. Scaling to large meshes (the CONUS domain) for either NWMv3 or SCHISM has consistently produced "BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES; KILLED BY SIGNAL: 9 (Killed)" errors during the model initialization phase, as the models attempted to load their mesh arrays. We also attempted to maximize all "ulimit" settings in the launcher shell script, but this did not change the behavior. Overall, there appears to be an issue with the system environment settings for executing these particular coastal models that warrants further discussion on the cloudflow end.
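For illustration only (this is not from the email): a back-of-the-envelope sketch of whether an instance type's per-vCPU RAM covers the per-rank requirement described above. The `fits` helper is hypothetical; the per-vCPU figures are the GB/CPU numbers reported in the email.

```python
# Back-of-the-envelope check: does an instance type give each MPI rank enough RAM?
# Per-vCPU figures are the ones quoted in the email above.
RAM_PER_VCPU_GB = {
    "hpc6a.48xlarge": 4.0,   # allocation errors observed
    "x2idn.32xlarge": 16.0,  # ran successfully
}

def fits(instance_type: str, gb_per_rank: float, ranks_per_vcpu: int = 1) -> bool:
    """True if the per-vCPU RAM budget covers the per-rank requirement."""
    return RAM_PER_VCPU_GB[instance_type] >= gb_per_rank * ranks_per_vcpu

# Hera experience suggests roughly 3-10 GB per rank for these setups:
for itype in RAM_PER_VCPU_GB:
    print(itype, fits(itype, gb_per_rank=10.0))
# hpc6a.48xlarge False  -> Fortran allocation errors / SIGKILL plausible
# x2idn.32xlarge True
```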
Jason has a block of code on the us-east-2b head node at `/save/ec2-user/OWP/CoastalSandbox` that demonstrates this.

Out of an abundance of caution, I created a branch from `main` of this repo that contains the entirety of Jason's tree (including some temp files that I could probably have filtered better). That branch contains a `test.out` file that describes the issue we're experiencing: memory is not allocatable even though ulimits appear to have been set.

Per Jason, `/save/ec2-user/OWP/Cloud-Sandbox/cloudflow/workflows/nwmv3_hindcastrun.sh` is the shell script executing the model run.
The memory allocation errors for the OWP coastal models (and any model suite in general) are directly related to the Cloud-Sandbox workflow and to extracting the AWS cluster host ID information. The host IDs are required for any `mpi` command to properly link to and utilize the respective environments spun up on the AWS instances requested for a given user. A few example lines of code can demonstrate this general method in the Cloud-Sandbox workflow:

```python
from cloudflow.cluster.Cluster import Cluster

# `cluster` is the Cluster instance the workflow created for this job
HOSTS = cluster.getHostsCSV()
```
The user then feeds the `HOSTS` variable into a shell launcher script for their given model:

```bash
export MPIOPTS="-launcher ssh -hosts $HOSTS -np $NPROCS -ppn $PPN"
ulimit -c unlimited
ulimit -s unlimited
mpiexec $MPIOPTS $EXEC
```
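For context, here is a minimal sketch of how a workflow might hand `HOSTS` off to such a launcher. The `run_launcher` function and its parameters are hypothetical glue, not actual cloudflow workflow code; only `getHostsCSV()` comes from the snippet above.

```python
# Hypothetical glue: export HOSTS/NPROCS/PPN and invoke the model's shell launcher.
import os
import subprocess

from cloudflow.cluster.Cluster import Cluster  # as in the snippet above

def run_launcher(cluster: Cluster, launcher: str, nprocs: int, ppn: int) -> None:
    """Run a launcher script with the cluster's host IDs in its environment."""
    env = os.environ.copy()
    env["HOSTS"] = cluster.getHostsCSV()  # comma-separated cluster host IDs
    env["NPROCS"] = str(nprocs)
    env["PPN"] = str(ppn)
    # The launcher builds MPIOPTS from these variables and calls mpiexec.
    subprocess.run(["bash", launcher], env=env, check=True)
```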
These steps are the critical components for properly linking the AWS instances' environment settings to the model executable specified by the user on the Cloud-Sandbox head node. I am going to submit a PR this week to simplify and streamline this entire process and make new model integration into the Cloud-Sandbox easier for new users. Looking forward to everyone's contributions to that upcoming PR! We can now go ahead and close this issue for this repository.