
Long GC Suspend and hard faults #111201

Open
arekpalinski opened this issue Jan 8, 2025 · 8 comments

Comments

@arekpalinski

Description

We (RavenDB) have a customer that is occasionally experiencing very long GC pauses. It's not a new issue or a regression; it has been happening for a long time, already on .NET 6. Some time ago RavenDB was updated to use .NET 8 (in order to be on an LTS version). More specifically, version 8.0.400 is deployed on the affected environment.

The pauses affect the application's requests to RavenDB. Monitoring also shows connectivity issues between the nodes (the setup is a 3-node RavenDB cluster).

So far, using PerfView, we have narrowed it down to GC Suspend events taking a very long time.

Reproduction Steps

This happens only occasionally, and only in the production environment. RavenDB is configured as a 3-node (3-machine) cluster, and the issue occurs randomly on any node. The machines have 256 GB of memory and run Windows Server 2019 Datacenter (OS build number: 17763.5696.amd64fre.rs5_release.180914-1434). They are hosted in Azure.

Expected behavior

GC pauses don't take this long.

Actual behavior

We'd like to share our analysis of a recent occurrence of the issue. We have three PerfView traces that were collected with the following command:

PerfView.exe /nogui /accepteula /BufferSizeMB:4096 /CircularMB:4096 /CollectMultiple:3 /StopOnGCSuspendOverMSec:5000 /KernelEvents=ThreadTime+Memory+VirtualAlloc /NoNGenRundown /Merge:true /Zip:true collect /Process:"Raven.Server" Over5SecGCSuspend

The longest GC suspend took 18,724.797 mSec (time range: 244,459.607 - 263,347.329):

Image

The analysis below is about this GC suspend event (although the other two PerfView outputs are very similar).

Events

  • The affected GC thread, the one performing the suspension, is 2908

Image

Thread Time (with Ready Threads) Stacks

Analysis of coreclr!ThreadSuspend::SuspendEE

  • There are a lot of READIED BY TID(0) Idle (0) CPU Wait < 1ms IdleCPUs events, which mostly point to ntoskrnl!??KiProcessExpiredTimerList (READIED_BY):
    Image

  • Below we can also find two RavenDB threads - 5108 and 6524:
    Image

Looking deeper, we can see that both threads are performing queries:

Image

Looking at the stacks of thread 5108, we can see that it's reading documents from disk (we use memory-mapped files), causing a page fault:

Image

This is about the following code:

https://github.com/ravendb/ravendb/blob/1c79a8d9131b248cfe129f7ad516495f31942584/src/Voron/Impl/LowLevelTransaction.cs#L648-L655

https://github.com/ravendb/ravendb/blob/1c79a8d9131b248cfe129f7ad516495f31942584/src/Voron/Impl/Paging/AbstractPager.cs#L332
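For illustration, here is a minimal sketch (assuming a hypothetical helper, not RavenDB's actual pager code) of how a read through a memory-mapped view from managed code can hard-fault on the dereference itself:

using System.IO.MemoryMappedFiles;

class MappedReadSketch
{
    // Hypothetical helper: read one byte at a given offset inside an
    // already-created memory-mapped view.
    static unsafe byte ReadByteAt(MemoryMappedViewAccessor accessor, long offset)
    {
        byte* basePtr = null;
        accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref basePtr);
        try
        {
            // A plain pointer dereference executed from managed code.
            // If the OS has trimmed this page, the dereference hard-faults and
            // the thread blocks on disk IO while still inside managed code.
            return *(basePtr + offset);
        }
        finally
        {
            accessor.SafeMemoryMappedViewHandle.ReleasePointer();
        }
    }
}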

Events (ReadyThread)

ReadyThread events for AwakenedThreadId 2908 (the GC thread doing the suspension) show that it was awakened by the aforementioned Raven threads, 5108 and 6524, but also by the Idle (0) process, so I assume it's the System process.

Image

CPU Stacks

From the CPU Stacks we know that before the long GC suspend, we had started FlushFileBuffers() on our Raven.voron file, where documents are saved (and which threads 5108 and 6524 are reading from). We call it explicitly, periodically, here:

https://github.com/ravendb/ravendb/blob/1c79a8d9131b248cfe129f7ad516495f31942584/src/Voron/Platform/Win32/WindowsMemoryMapPager.cs#L344

It was also still running during the GC suspend. Under the covers we can see an MmTrimSection() call, which, as we understand it, will evict some pages; subsequent access to the trimmed pages then results in a page fault, requiring the system to re-read the data from the file (which is what we see in threads 5108 and 6524).

Image
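For reference, the explicit sync boils down to the following Win32 calls; this is a simplified sketch of the P/Invoke pattern (the actual declarations are in the WindowsMemoryMapPager.cs code linked above):

using System;
using System.ComponentModel;
using System.Runtime.InteropServices;

static class FlushSketch
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool FlushViewOfFile(IntPtr lpBaseAddress, UIntPtr dwNumberOfBytesToFlush);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool FlushFileBuffers(IntPtr hFile);

    // Flush the dirty pages of a mapped view, then the file itself.
    // During these native calls the thread is in a GC-safe (preemptive) state
    // because of the P/Invoke transition, so the flush itself does not block
    // suspension; the hard faults taken later by managed threads touching
    // trimmed pages can (see the discussion below).
    public static void Sync(IntPtr viewBaseAddress, UIntPtr bytesToFlush, IntPtr fileHandle)
    {
        if (!FlushViewOfFile(viewBaseAddress, bytesToFlush))
            throw new Win32Exception();
        if (!FlushFileBuffers(fileHandle))
            throw new Win32Exception();
    }
}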

Events (Suspend EE and Hard Faults)

Going back to the Events view and adding Hard Fault events:

Image

So from our analysis it looks like the GC suspend was caused by hard faults taking 18-19 seconds (we are not sure why) after a recent FlushFileBuffers().

Questions

  1. Can you please validate our analysis?
  2. Can you please confirm our assumption about MmTrimSection()?
  3. Is it an expected / known issue that a managed thread taking a hard page fault can block GC suspension?
  4. Is there any kind of OS / kernel lock affecting a thread doing a read that could result in 18-19 second hard faults?

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

No response

@dotnet-issue-labeler bot added the needs-area-label label Jan 8, 2025
@dotnet-policy-service bot added the untriaged label Jan 8, 2025
@huoyaoyuan added area-GC-coreclr and removed the needs-area-label label Jan 8, 2025
Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@Maoni0 added area-VM-coreclr and removed the area-GC-coreclr and untriaged labels Jan 8, 2025
Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

@Maoni0
Member

Maoni0 commented Jan 8, 2025

looping in @kouvel for suspension issue. thanks for the detailed investigation!

@markples
Contributor

markples commented Jan 8, 2025

just a few observations - I don't know details about the flush/trim part, but yes, if a normal memory operation faults, then that thread won't have been put into a "GC ready" state (as opposed to the flush itself, which the DllImport mechanism can do). (glossing over details about fully interruptible code vs not)

Presumably the file size (seeing the file offset of 815,000,000,000) is part of it.

@arekpalinski
Author

Yes, the file in question is pretty large - 862 GB. Here are the details of a sample sync operation on that file (FlushViewOfFile + FlushFileBuffers). Note that it's from a different time, so it's not about the GC suspension in question:

Image

According to our own metrics we had 49.5 MB of written data that was not yet synced. It's literally the measurement of the following piece of code:
https://github.com/ravendb/ravendb/blob/1c79a8d9131b248cfe129f7ad516495f31942584/src/Voron/Platform/Win32/WindowsMemoryMapPager.cs#L327-L349

@arekpalinski
Author

We also have other PerfView traces, collected about ~15 minutes earlier, where we had a GC suspension taking 6.8 seconds. During that time we can also see very long hard page faults (meanwhile there are some, even with similar offsets, taking below 1 ms; I know it doesn't mean anything, but it's interesting):

Image

Dropping the HardFault events and showing more FlushInit events, which happened much earlier (since the hard faults lasted about 20 seconds, I assume earlier flushes must have caused them), shows us:

Image

@arekpalinski
Author

Something I have trouble understanding is that we see a lot of FlushInit events but almost no FlushBuffers events, even though the Any Stacks view shown for any FlushInit event points to our FlushFileBuffers.

Can it be that, as shown in one of the screenshots from yesterday, most of our FlushFileBuffers time is actually spent in MmTrimSection?

Image

@kouvel
Member

kouvel commented Jan 15, 2025

It seems very likely that the long-duration hard faults are causing the long GC pauses. When a page fault occurs directly from managed code (such as from LowLevelTransaction.GetPageInternal above), the thread would not be in a safe place for suspension. P/invokes are ok because the thread is taken into a safe place for suspension during the p/invoke and remains safe to suspend until it tries to return to managed code. I figure using memory-mapped files is analogous in many ways to doing synchronous IO, and accessing those mapped regions directly from managed code could lead to suspension issues depending on disk latencies.

I'm not sure as to why the hard faults are taking so long to resolve. Typically a hard fault entails reading from disk. There appears to be some kind of lock being taken in the stack trace (NtfsAcquirePagingShared), but I'm not sure if lock contention is an issue, or if there are just many hard faults overlapping and maybe there is some serialization of them underneath.

As for the trimming from the flushes, it seems like it could trim some sections of memory from the working set of the process, but I'm not sure if/why it would evict those sections from the cache. If the memory is cached but not in the working set, it would trigger a page fault but I would imagine it would be a soft fault that wouldn't take so long to resolve.

It's plausible that the flushes could be interacting by increasing disk latency. Have you looked at disk latencies around the long pauses?

Something I have trouble understanding is that we see a lot of FlushInit events but almost no FlushBuffers events, even though the Any Stacks view shown for any FlushInit event points to our FlushFileBuffers.

Maybe there should also be ReadInit/Read and WriteInit/Write events depending on what's happening. Events can be dropped heuristically, such as if raising them entails some overhead like page faults, disk IO, etc. It may help to pass the -InMemoryCircularBuffer switch to PerfView when collecting the profile, along with a large-enough value for -CircularMB to get enough data, that usually allows for more events to be recorded instead of being dropped. An in-memory circular buffer may be good to use anyway since there's disk IO already involved in the issue (ThreadTime events can be very verbose and affect disk IO).
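For example, an illustrative adaptation of the collection command from above (the exact switch names and buffer sizes should be verified against your PerfView version and the memory available on the machine):

PerfView.exe /nogui /accepteula /BufferSizeMB:4096 /CircularMB:8192 /InMemoryCircularBuffer /CollectMultiple:3 /StopOnGCSuspendOverMSec:5000 /KernelEvents=ThreadTime+Memory+VirtualAlloc /NoNGenRundown /Merge:true /Zip:true collect /Process:"Raven.Server" Over5SecGCSuspend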
