
SCHISM memory allocation issues #99

Open
mykelalvis opened this issue Jan 22, 2025 · 1 comment


mykelalvis commented Jan 22, 2025

I'm creating this issue directly from the text of an email sent on 2025-01-10 by @ZacharyWills, titled "Update on SCHISM":

We (Jason, Mykel, etc.) have successfully compiled the NWMv3.0 and SCHISM models using the standard suite of Intel compilers on the cloud sandbox (intel-oneapi-compilers/2023.1.0-gcc-11.2.1-3a7dxu3; intel-oneapi-mpi/2021.9.0-intel-2021.9.0-egjrbfg; netcdf-c/4.9.2-intel-2021.9.0-vznmeik; netcdf-fortran/4.6.1-intel-2021.9.0-meeveoj; parallelio/2.6.2-intel-2021.9.0-csz55zr). Given the MPI parallelization and high-resolution domain configurations of these models, running them on the NOAA RDHPCS supercomputers (Hera cluster) has required node configurations with roughly 3-10 GB of RAM per CPU.

On the cloud sandbox, we tested a small SCHISM coastal model domain (700,000 elements). It ran successfully only on the x2idn.32xlarge node type (16 GB/CPU), while the hpc6a.48xlarge configuration (4 GB/CPU) consistently threw Fortran allocation errors directly from the code base. Moreover, every attempt to scale to large meshes (the CONUS domain) for the NWMv3 or SCHISM models has failed with "BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES; KILLED BY SIGNAL: 9 (Killed)" during the model initialization phase, as the models attempt to load the mesh arrays. We also tried maximizing all "ulimit" settings in the launcher shell script, but this did not change the behavior. Overall, there appears to be an issue with the environment settings for executing these particular coastal models that warrants further discussion on the cloudflow end.
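For reference, here is a hedged sketch of the kind of limit-raising we did in the launcher shell script; the exact contents of nwmv3_hindcastrun.sh are not reproduced here, so the specific flags shown are an assumption:

# Hypothetical excerpt from a launcher script; the real nwmv3_hindcastrun.sh may differ.
ulimit -c unlimited   # core dump size
ulimit -s unlimited   # stack size
ulimit -v unlimited   # virtual memory / address space
ulimit -a             # log the effective limits for the job

Raising these limits does not add physical memory, which is consistent with the allocations still failing on the 4 GB/CPU nodes.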

Jason has a block of code on the us-east-2b head node at /save/ec2-user/OWP/CoastalSandbox that demonstrates this.

Out of an abundance of caution, I created a branch from main of this repo that contains the entirety of Jason's tree (including some temp files that I could probably have filtered better).

That branch contains a test.out file documenting the issue we're experiencing: memory is not allocatable even though the ulimits appear to have been set.

Per Jason:
/save/ec2-user/OWP/Cloud-Sandbox/cloudflow/workflows/nwmv3_hindcastrun.sh is the shell script executing the model run.

@jduckerOWP

The memory allocation errors for the OWP coastal models (and for any model suite in general) are directly related to the Cloud-Sandbox workflow and how the AWS cluster host IDs are extracted. The host IDs are required for any mpi command to properly link to and use the environments spun up on the AWS instances requested for a given user. A few example lines of code below demonstrate the general method in the Cloud-Sandbox workflow:
# In the cloudflow Python workflow: get the comma-separated list of cluster host IDs.
from cloudflow.cluster.Cluster import Cluster
HOSTS = cluster.getHostsCSV()  # 'cluster' is the workflow's Cluster instance

(The user then feeds the HOSTS value into the shell launcher script for their given model:)

# In the shell launcher script: pass the hosts to mpiexec and raise the limits.
export MPIOPTS="-launcher ssh -hosts $HOSTS -np $NPROCS -ppn $PPN"
ulimit -c unlimited
ulimit -s unlimited
mpiexec $MPIOPTS $EXEC
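To make that handoff concrete, here is a hedged sketch of a minimal launcher script built only from the lines above; the assumption that the cloudflow workflow exports HOSTS (along with NPROCS, PPN, and EXEC) into the script's environment is mine, and the default values are purely illustrative:

#!/bin/bash
# Hypothetical minimal launcher; the real nwmv3_hindcastrun.sh may differ.
# HOSTS is assumed to be exported by the cloudflow workflow as the
# comma-separated host IDs returned by cluster.getHostsCSV().
: "${HOSTS:?HOSTS must be set to the comma-separated cluster host IDs}"
NPROCS=${NPROCS:-96}      # total MPI ranks (illustrative default)
PPN=${PPN:-48}            # ranks per node (illustrative default)
EXEC=${EXEC:-./pschism}   # model executable (illustrative placeholder)

ulimit -c unlimited
ulimit -s unlimited

export MPIOPTS="-launcher ssh -hosts $HOSTS -np $NPROCS -ppn $PPN"
mpiexec $MPIOPTS "$EXEC"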

These steps are the critical pieces for properly linking the AWS instances' environment settings to the model executable that users specify on the Cloud-Sandbox head node. I am going to submit a PR this week to simplify and streamline this entire process and make integrating new models into the Cloud-Sandbox easier for new users. Looking forward to everyone's contributions to that upcoming PR! We can now close this issue for this repository.
