
SCHISM memory allocation issues #99

Open
mykelalvis opened this issue Jan 22, 2025 · 1 comment


mykelalvis commented Jan 22, 2025

I'm creating this issue directly from the text of an email sent on 2025-01-10 by @ZacharyWills, titled "Update on SCHISM":

We (Jason, Mykel, etc.) have successfully compiled the NWMv3.0 and SCHISM models using the standard suite of Intel compilers on the cloud sandbox (intel-oneapi-compilers/2023.1.0-gcc-11.2.1-3a7dxu3; intel-oneapi-mpi/2021.9.0-intel-2021.9.0-egjrbfg; netcdf-c/4.9.2-intel-2021.9.0-vznmeik; netcdf-fortran/4.6.1-intel-2021.9.0-meeveoj; parallelio/2.6.2-intel-2021.9.0-csz55zr). Given the MPI parallelization and high-resolution domain configurations of these models, running them on the NOAA RDHPCS supercomputers (Hera cluster) has required node configurations with roughly 3-10 GB of RAM per CPU.

On the cloud sandbox, we tested a small SCHISM coastal model domain (700,000 elements). It ran successfully only on the x2idn.32xlarge node type (16 GB/CPU), while the hpc6a.48xlarge configuration (4 GB/CPU) consistently threw Fortran allocation errors directly from the code base. Moreover, every attempt to scale to large meshes (the CONUS domain) for the NWMv3 or SCHISM models has failed with "BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES; KILLED BY SIGNAL: 9 (Killed)" during the model initialization phase, as the models attempt to load the mesh arrays. We also tried maximizing all "ulimit" settings in the launcher shell script, but this did not change the behavior. Overall, there appears to be an issue with the environment settings for executing these particular coastal models that warrants further discussion on the cloudflow end.
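For reference, here is a hedged sketch of the kind of limit-raising we did in the launcher shell script; the exact contents of nwmv3_hindcastrun.sh are not reproduced here, so the specific flags shown are an assumption:

# Hypothetical excerpt from a launcher script; the real nwmv3_hindcastrun.sh may differ.
ulimit -c unlimited   # core dump size
ulimit -s unlimited   # stack size
ulimit -v unlimited   # virtual memory / address space
ulimit -a             # log the effective limits for the job

Raising these limits does not add physical memory, which is consistent with the allocations still failing on the 4 GB/CPU nodes.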

Jason has a block of code on the us-east-2b head node at /save/ec2-user/OWP/CoastalSandbox that demonstrates this.

Out of an abundance of caution, I created a branch from main of this repo that contains the entirety of Jason's tree (including some temp files that I could probably have filtered better).

That branch contains a test.out file documenting the issue we're experiencing: memory is not allocatable even though the ulimits appear to have been set.

Per Jason:
/save/ec2-user/OWP/Cloud-Sandbox/cloudflow/workflows/nwmv3_hindcastrun.sh is the shell script executing the model run.

@jduckerOWP

The memory allocation errors for the OWP coastal models (and for any model suite in general) are directly related to the Cloud-Sandbox workflow and how the AWS cluster host IDs are extracted. The host IDs are required for any mpi command to properly link to and use the environments spun up on the AWS instances requested for a given user. A few example lines of code below demonstrate the general method in the Cloud-Sandbox workflow:
# In the cloudflow Python workflow: get the comma-separated list of cluster host IDs.
from cloudflow.cluster.Cluster import Cluster
HOSTS = cluster.getHostsCSV()  # 'cluster' is the workflow's Cluster instance

(The user then feeds the HOSTS value into the shell launcher script for their given model:)

# In the shell launcher script: pass the hosts to mpiexec and raise the limits.
export MPIOPTS="-launcher ssh -hosts $HOSTS -np $NPROCS -ppn $PPN"
ulimit -c unlimited
ulimit -s unlimited
mpiexec $MPIOPTS $EXEC
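To make that handoff concrete, here is a hedged sketch of a minimal launcher script built only from the lines above; the assumption that the cloudflow workflow exports HOSTS (along with NPROCS, PPN, and EXEC) into the script's environment is mine, and the default values are purely illustrative:

#!/bin/bash
# Hypothetical minimal launcher; the real nwmv3_hindcastrun.sh may differ.
# HOSTS is assumed to be exported by the cloudflow workflow as the
# comma-separated host IDs returned by cluster.getHostsCSV().
: "${HOSTS:?HOSTS must be set to the comma-separated cluster host IDs}"
NPROCS=${NPROCS:-96}      # total MPI ranks (illustrative default)
PPN=${PPN:-48}            # ranks per node (illustrative default)
EXEC=${EXEC:-./pschism}   # model executable (illustrative placeholder)

ulimit -c unlimited
ulimit -s unlimited

export MPIOPTS="-launcher ssh -hosts $HOSTS -np $NPROCS -ppn $PPN"
mpiexec $MPIOPTS "$EXEC"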

These steps are the critical pieces for properly linking the AWS instances' environment settings to the model executable that users specify on the Cloud-Sandbox head node. I am going to submit a PR this week to simplify and streamline this entire process and make integrating new models into the Cloud-Sandbox easier for new users. Looking forward to everyone's contributions to that upcoming PR! We can now close this issue for this repository.
