
Long GC Suspend and hard faults #111201

Open
arekpalinski opened this issue Jan 8, 2025 · 8 comments

Comments

@arekpalinski

Description

We (RavenDB) have a customer that is occasionally experiencing very long GC pauses. It's not a new issue or a regression; it has been happening for a long time, already on .NET 6. Some time ago RavenDB was updated to use .NET 8 (in order to be on an LTS version). More specifically, version 8.0.400 is deployed on the affected environment.

The pauses affect the application's requests to RavenDB. Monitoring also shows connectivity issues between the nodes (the setup is a 3-node RavenDB cluster).

So far, using PerfView, we have narrowed it down to GC Suspend events taking a very long time.

Reproduction Steps

This happens only occasionally, and only in the production environment. RavenDB is configured as a 3-node (3-machine) cluster, and the issue occurs randomly on any node. The machines have 256 GB of memory and run Windows Server 2019 Datacenter (OS build number: 17763.5696.amd64fre.rs5_release.180914-1434). They are hosted in Azure.

Expected behavior

GC pauses don't take this long.

Actual behavior

We'd like to share our analysis of a recent occurrence of the issue. We have three PerfView traces that were collected with the following command:

PerfView.exe /nogui /accepteula /BufferSizeMB:4096 /CircularMB:4096 /CollectMultiple:3 /StopOnGCSuspendOverMSec:5000 /KernelEvents=ThreadTime+Memory+VirtualAlloc /NoNGenRundown /Merge:true /Zip:true collect /Process:"Raven.Server" Over5SecGCSuspend

The longest GC suspend took 18,724.797 mSec (time range: 244,459.607 - 263,347.329):

Image

The analysis below is about this GC suspend event (although the other two PerfView outputs are very similar).

Events

  • The affected GC thread, the one performing the suspension, is 2908

Image

Thread Time (with Ready Threads) Stacks

Analysis of coreclr!ThreadSuspend::SuspendEE

  • There are a lot of READIED BY TID(0) Idle (0) CPU Wait < 1ms IdleCPUs events, which mostly point to ntoskrnl!??KiProcessExpiredTimerList (READIED_BY):
    Image

  • Below we can also find two RavenDB threads - 5108 and 6524:
    Image

Looking deeper, we can see that both threads are performing queries:

Image

Looking at the stacks of thread 5108, we can see that it's reading documents from disk (we use memory-mapped files), causing a page fault:

Image

This is about the following code:

https://github.com/ravendb/ravendb/blob/1c79a8d9131b248cfe129f7ad516495f31942584/src/Voron/Impl/LowLevelTransaction.cs#L648-L655

https://github.com/ravendb/ravendb/blob/1c79a8d9131b248cfe129f7ad516495f31942584/src/Voron/Impl/Paging/AbstractPager.cs#L332
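For illustration, here is a minimal sketch (assuming a hypothetical helper, not RavenDB's actual pager code) of how a read through a memory-mapped view from managed code can hard-fault on the dereference itself:

using System.IO.MemoryMappedFiles;

class MappedReadSketch
{
    // Hypothetical helper: read one byte at a given offset inside an
    // already-created memory-mapped view.
    static unsafe byte ReadByteAt(MemoryMappedViewAccessor accessor, long offset)
    {
        byte* basePtr = null;
        accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref basePtr);
        try
        {
            // A plain pointer dereference executed from managed code.
            // If the OS has trimmed this page, the dereference hard-faults and
            // the thread blocks on disk IO while still inside managed code.
            return *(basePtr + offset);
        }
        finally
        {
            accessor.SafeMemoryMappedViewHandle.ReleasePointer();
        }
    }
}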

Events (ReadyThread)

ReadyThread events for AwakenedThreadId 2908 (the GC thread doing the suspension) show that it was awakened by the aforementioned Raven threads, 5108 and 6524, but also by the Idle (0) process, so I assume it's the System process.

Image

CPU Stacks

From the CPU Stacks we know that before the long GC suspend, we had started FlushFileBuffers() on our Raven.voron file, where documents are saved (and which threads 5108 and 6524 are reading from). We call it explicitly, periodically, here:

https://github.com/ravendb/ravendb/blob/1c79a8d9131b248cfe129f7ad516495f31942584/src/Voron/Platform/Win32/WindowsMemoryMapPager.cs#L344

It was also still running during the GC suspend. Under the covers we can see an MmTrimSection() call, which, as we understand it, will evict some pages; subsequent access to the trimmed pages then results in a page fault, requiring the system to re-read the data from the file (which is what we see in threads 5108 and 6524).

Image
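For reference, the explicit sync boils down to the following Win32 calls; this is a simplified sketch of the P/Invoke pattern (the actual declarations are in the WindowsMemoryMapPager.cs code linked above):

using System;
using System.ComponentModel;
using System.Runtime.InteropServices;

static class FlushSketch
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool FlushViewOfFile(IntPtr lpBaseAddress, UIntPtr dwNumberOfBytesToFlush);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool FlushFileBuffers(IntPtr hFile);

    // Flush the dirty pages of a mapped view, then the file itself.
    // During these native calls the thread is in a GC-safe (preemptive) state
    // because of the P/Invoke transition, so the flush itself does not block
    // suspension; the hard faults taken later by managed threads touching
    // trimmed pages can (see the discussion below).
    public static void Sync(IntPtr viewBaseAddress, UIntPtr bytesToFlush, IntPtr fileHandle)
    {
        if (!FlushViewOfFile(viewBaseAddress, bytesToFlush))
            throw new Win32Exception();
        if (!FlushFileBuffers(fileHandle))
            throw new Win32Exception();
    }
}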

Events (Suspend EE and Hard Faults)

Going back to the Events view and adding Hard Fault events:

Image

So from our analysis it looks like the GC suspend was caused by hard faults taking 18-19 seconds (we are not sure why) after a recent FlushFileBuffers().

Questions

  1. Can you please validate our analysis?
  2. Can you please confirm our assumption about MmTrimSection()?
  3. Is it an expected / known issue that a managed thread taking a hard page fault can block GC suspension?
  4. Is there any kind of OS / kernel lock affecting a thread doing a read that could result in 18-19 second hard faults?

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

No response

@dotnet-issue-labeler bot added the needs-area-label label Jan 8, 2025
@dotnet-policy-service bot added the untriaged label Jan 8, 2025
@huoyaoyuan added area-GC-coreclr and removed the needs-area-label label Jan 8, 2025
Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@Maoni0 added area-VM-coreclr and removed the area-GC-coreclr and untriaged labels Jan 8, 2025
Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

@Maoni0
Member

Maoni0 commented Jan 8, 2025

looping in @kouvel for suspension issue. thanks for the detailed investigation!

@markples
Contributor

markples commented Jan 8, 2025

just a few observations - I don't know details about the flush/trim part, but yes, if a normal memory operation faults, then that thread won't have been put into a "GC ready" state (as opposed to the flush itself, which the DllImport mechanism can do). (glossing over details about fully interruptible code vs not)

Presumably the file size (seeing the file offset of 815,000,000,000) is part of it.

@arekpalinski
Author

Yes, the file in question is pretty large - 862 GB. Here are the details of a sample sync operation on that file (FlushViewOfFile + FlushFileBuffers). Note that it's from a different time, so it's not about the GC suspension in question:

Image

According to our own metrics we had 49.5 MB of written data that was not yet synced. It's literally the measurement of the following piece of code:
https://github.com/ravendb/ravendb/blob/1c79a8d9131b248cfe129f7ad516495f31942584/src/Voron/Platform/Win32/WindowsMemoryMapPager.cs#L327-L349

@arekpalinski
Author

We also have other PerfView traces, collected about ~15 minutes earlier, where we had a GC suspension taking 6.8 seconds. During that time we can also see very long hard page faults (meanwhile there are some, even with similar offsets, taking below 1 ms; I know it doesn't mean anything, but it's interesting):

Image

Dropping the HardFault events and showing more FlushInit events, which happened much earlier (since the hard faults lasted about 20 seconds, I assume earlier flushes must have caused them), shows us:

Image

@arekpalinski
Author

Something I have trouble understanding is that we see a lot of FlushInit events but almost no FlushBuffers events, even though the Any Stacks view shown for any FlushInit event points to our FlushFileBuffers.

Can it be that, as shown in one of the screenshots from yesterday, most of our FlushFileBuffers time is actually spent in MmTrimSection?

Image

@kouvel
Member

kouvel commented Jan 15, 2025

It seems very likely that the long-duration hard faults are causing the long GC pauses. When a page fault occurs directly from managed code (such as from LowLevelTransaction.GetPageInternal above), the thread would not be in a safe place for suspension. P/invokes are ok because the thread is taken into a safe place for suspension during the p/invoke and remains safe to suspend until it tries to return to managed code. I figure using memory-mapped files is analogous in many ways to doing synchronous IO, and accessing those mapped regions directly from managed code could lead to suspension issues depending on disk latencies.

I'm not sure as to why the hard faults are taking so long to resolve. Typically a hard fault entails reading from disk. There appears to be some kind of lock being taken in the stack trace (NtfsAcquirePagingShared), but I'm not sure if lock contention is an issue, or if there are just many hard faults overlapping and maybe there is some serialization of them underneath.

As for the trimming from the flushes, it seems like it could trim some sections of memory from the working set of the process, but I'm not sure if/why it would evict those sections from the cache. If the memory is cached but not in the working set, it would trigger a page fault but I would imagine it would be a soft fault that wouldn't take so long to resolve.

It's plausible that the flushes could be interacting by increasing disk latency. Have you looked at disk latencies around the long pauses?

Something I have trouble understanding is that we see a lot of FlushInit events but almost no FlushBuffers events, even though the Any Stacks view shown for any FlushInit event points to our FlushFileBuffers.

Maybe there should also be ReadInit/Read and WriteInit/Write events depending on what's happening. Events can be dropped heuristically, such as if raising them entails some overhead like page faults, disk IO, etc. It may help to pass the -InMemoryCircularBuffer switch to PerfView when collecting the profile, along with a large-enough value for -CircularMB to get enough data, that usually allows for more events to be recorded instead of being dropped. An in-memory circular buffer may be good to use anyway since there's disk IO already involved in the issue (ThreadTime events can be very verbose and affect disk IO).
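For example, an illustrative adaptation of the collection command from above (the exact switch names and buffer sizes should be verified against your PerfView version and the memory available on the machine):

PerfView.exe /nogui /accepteula /BufferSizeMB:4096 /CircularMB:8192 /InMemoryCircularBuffer /CollectMultiple:3 /StopOnGCSuspendOverMSec:5000 /KernelEvents=ThreadTime+Memory+VirtualAlloc /NoNGenRundown /Merge:true /Zip:true collect /Process:"Raven.Server" Over5SecGCSuspend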
