Bad memory accesses in springCleaning() #3762

Open
trquinn opened this issue Aug 25, 2023 · 3 comments

@trquinn
Collaborator

trquinn commented Aug 25, 2023

Running ChaNGa under valgrind reports that CkArray::springCleaning() is accessing freed memory.
This is with ChaNGa version 3.5 (commit v3.5-11-gc7ba57c0) and Charm++ version v7.1.0-devel-321-g606459e74.
It was built on an AMD/InfiniBand machine as mpi-linux-x86_64-smp with gcc v11.2.0 and mvapich2 2.3.6.

Soon after writing an output (using CkIO), valgrind reports errors like:

==3563697== Invalid read of size 8
==3563697==    at 0x7DC2DB: CkArray::staticSpringCleaning(void*) (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x88A663: CcdRaiseCondition (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x88AD15: CcdCallBacks (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x88EE0B: CsdScheduleForever (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x88F234: CsdScheduler (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x8DDD46: ConverseRunPE(int) (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x8DDD8A: call_startfn(void*) (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x50A91CE: start_thread (in /usr/lib64/libpthread-2.28.so)
==3563697==    by 0x6CDBE72: clone (in /usr/lib64/libc-2.28.so)
==3563697==  Address 0x848f768 is 1,016 bytes inside a block of size 1,024 free'd
==3563697==    at 0x4C4AB30: free (in /apps/spack/anvil/apps/valgrind/3.15.0-gcc-11.2.0-u7tvx2t/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==3563697==    by 0x7DF84B: CkIndex_CkArray::_call_ckDestroy_void(void*, void*) (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x7D4ABF: CkDeliverMessageFree (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x7D8194: _processHandler(void*, CkCoreState*) (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x88EDCB: CsdScheduleForever (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x88F234: CsdScheduler (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x8DDD46: ConverseRunPE(int) (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x8DDD8A: call_startfn(void*) (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x50A91CE: start_thread (in /usr/lib64/libpthread-2.28.so)
==3563697==    by 0x6CDBE72: clone (in /usr/lib64/libc-2.28.so)
==3563697==  Block was alloc'd at
==3563697==    at 0x4C495ED: malloc (in /apps/spack/anvil/apps/valgrind/3.15.0-gcc-11.2.0-u7tvx2t/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==3563697==    by 0x7D5BCD: CkCreateLocalGroup (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x7D858E: _processHandler(void*, CkCoreState*) (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x88EDCB: CsdScheduleForever (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x88F234: CsdScheduler (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x8DDD46: ConverseRunPE(int) (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x8DDD8A: call_startfn(void*) (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x50A91CE: start_thread (in /usr/lib64/libpthread-2.28.so)
==3563697==    by 0x6CDBE72: clone (in /usr/lib64/libc-2.28.so)
==3563697== 
@trquinn trquinn added the Bug Something isn't working label Aug 25, 2023
@trquinn
Collaborator Author

trquinn commented Aug 25, 2023

I discovered this while investigating #3678, so the two may be related.

@lvkale
Contributor

lvkale commented Sep 22, 2023

Mathew, I am adding you to this issue simply because you are familiar with CkIO. But the issue (probably) has to do with the "spring cleaning" garbage collection scheme for broadcasts being applied to deleted chare arrays when it should not be.
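
To make the suspected failure mode concrete, here is a minimal sketch. It assumes the periodic cleanup is registered with a raw pointer to the array manager, and that the registration has to be cancelled when the manager is destroyed. `PeriodicCallbacks` and `ArrayManager` are hypothetical stand-ins written for this illustration; they are not Charm++ or Converse classes, and the actual fix in CkArray may look quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <utility>

// Hypothetical stand-in for a periodic-callback registry (NOT the Converse API).
class PeriodicCallbacks {
public:
    using Handle = int;

    // Register a callback and return a handle that can cancel it later.
    Handle add(std::function<void()> fn) {
        Handle h = next_++;
        callbacks_[h] = std::move(fn);
        return h;
    }

    void cancel(Handle h) { callbacks_.erase(h); }

    // Run every callback that is still registered; return how many ran.
    std::size_t raise() {
        auto snapshot = callbacks_;  // copy, so callbacks may cancel during the loop
        std::size_t ran = 0;
        for (const auto& entry : snapshot) {
            if (callbacks_.count(entry.first) != 0) {
                entry.second();
                ++ran;
            }
        }
        return ran;
    }

private:
    Handle next_ = 0;
    std::map<Handle, std::function<void()>> callbacks_;
};

// Hypothetical stand-in for an array manager with a periodic cleanup pass.
class ArrayManager {
public:
    explicit ArrayManager(PeriodicCallbacks& sched) : sched_(sched) {
        // The callback captures a raw `this`, much like a void* argument would.
        handle_ = sched_.add([this] { springCleaning(); });
    }

    ~ArrayManager() {
        // Assumed fix: without this cancel, the registry keeps a dangling
        // pointer and the next raise() would touch freed memory.
        sched_.cancel(handle_);
    }

    ArrayManager(const ArrayManager&) = delete;
    ArrayManager& operator=(const ArrayManager&) = delete;

    void springCleaning() { ++cleanings_; }
    int cleanings() const { return cleanings_; }

private:
    PeriodicCallbacks& sched_;
    PeriodicCallbacks::Handle handle_ = 0;
    int cleanings_ = 0;
};

// Returns the number of callbacks that run after the manager is destroyed.
std::size_t callbacks_after_destroy() {
    PeriodicCallbacks sched;
    {
        ArrayManager mgr(sched);
        sched.raise();                 // manager alive: cleanup runs once
        assert(mgr.cleanings() == 1);
    }                                  // destructor cancels the registration
    return sched.raise();              // nothing is left to run
}
```

If the `cancel` call in the destructor is removed, the second `raise()` invokes a callback on a destroyed object, which is the same shape as the valgrind report above: a read in the cleanup callback from a block that was freed during ckDestroy.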

@trquinn
Collaborator Author

trquinn commented Apr 8, 2024

This crash can be reproduced on Stampede3 by compiling ChaNGa, changing to the "teststep" directory, and running:
../ChaNGa.smp ++ppn 12 -n 1000 -oi 10 +setcpuaffinity +commap 0,1 +pemap 2-46:2,3-47:2 -binout 6 test_pg.param
The program will run for about 800 seconds before crashing.
