Can a thread be paused for 10 seconds? #1851

pquentin · 2020-12-31T10:08:54Z

Yes.

I'm currently reading "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems" by Martin Kleppmann. A section from Chapter 8 is particularly relevant to the issues we're seeing in #200, and I want to be able to reference it in the future.

There are various reasons why this could happen:

- Many programming language runtimes (such as the Java Virtual Machine) have a garbage collector (GC) that occasionally needs to stop all running threads. These “stop-the-world” GC pauses have sometimes been known to last for several minutes [64]! Even so-called “concurrent” garbage collectors like the HotSpot JVM’s CMS cannot fully run in parallel with the application code—even they need to stop the world from time to time [65]. Although the pauses can often be reduced by changing allocation patterns or tuning GC settings [66], we must assume the worst if we want to offer robust guarantees.
- In virtualized environments, a virtual machine can be suspended (pausing the execution of all processes and saving the contents of memory to disk) and resumed (restoring the contents of memory and continuing execution). This pause can occur at any time in a process’s execution and can last for an arbitrary length of time. This feature is sometimes used for live migration of virtual machines from one host to another without a reboot, in which case the length of the pause depends on the rate at which processes are writing to memory [67].
- On end-user devices such as laptops, execution may also be suspended and resumed arbitrarily, e.g., when the user closes the lid of their laptop.
- When the operating system context-switches to another thread, or when the hypervisor switches to a different virtual machine (when running in a virtual machine), the currently running thread can be paused at any arbitrary point in the code. In the case of a virtual machine, the CPU time spent in other virtual machines is known as steal time. If the machine is under heavy load—i.e., if there is a long queue of threads waiting to run—it may take some time before the paused thread gets to run again.
- If the application performs synchronous disk access, a thread may be paused waiting for a slow disk I/O operation to complete [68]. In many languages, disk access can happen surprisingly, even if the code doesn’t explicitly mention file access—for example, the Java classloader lazily loads class files when they are first used, which could happen at any time in the program execution. I/O pauses and GC pauses may even conspire to combine their delays [69]. If the disk is actually a network filesystem or network block device (such as Amazon’s EBS), the I/O latency is further subject to the variability of network delays [29].
- If the operating system is configured to allow swapping to disk (paging), a simple memory access may result in a page fault that requires a page from disk to be loaded into memory. The thread is paused while this slow I/O operation takes place. If memory pressure is high, this may in turn require a different page to be swapped out to disk. In extreme circumstances, the operating system may spend most of its time swapping pages in and out of memory and getting little actual work done (this is known as thrashing). To avoid this problem, paging is often disabled on server machines (if you would rather kill a process to free up memory than risk thrashing).
- A Unix process can be paused by sending it the SIGSTOP signal, for example by pressing Ctrl-Z in a shell. This signal immediately stops the process from getting any more CPU cycles until it is resumed with SIGCONT, at which point it continues running where it left off. Even if your environment does not normally use SIGSTOP, it might be sent accidentally by an operations engineer.

All of these occurrences can preempt the running thread at any point and resume it at some later time, without the thread even noticing. The problem is similar to making multi-threaded code on a single machine thread-safe: you can’t assume anything about timing, because arbitrary context switches and parallelism may occur. [...] Eventually, the paused node may continue running, without even noticing that it was asleep until it checks its clock sometime later.

[29] Steve Newman: “A Systematic Look at EC2 I/O,” blog.scalyr.com, October 16, 2012.
[64] Todd Lipcon: “Avoiding Full GCs in Apache HBase with MemStore-Local Allocation Buffers: Part 1,” blog.cloudera.com, February 24, 2011.
[65] Martin Thompson: “Java Garbage Collection Distilled,” mechanicalsympathy.blogspot.co.uk, July 16, 2013.
[66] Alexey Ragozin: “How to Tame Java GC Pauses? Surviving 16GiB Heap and Greater,” java.dzone.com, June 28, 2011.
[67] Christopher Clark, Keir Fraser, Steven Hand, et al.: “Live Migration of Virtual Machines,” at 2nd USENIX Symposium on Networked Systems Design & Implementation (NSDI), May 2005.
[68] Mike Shaver: “fsyncers and Curveballs,” shaver.off.net, May 25, 2008.
[69] Zhenyun Zhuang and Cuong Tran: “Eliminating Large JVM GC Pauses Caused by Background IO Traffic,” engineering.linkedin.com, February 10, 2016.

The book then goes on to explain that these issues can be eliminated: reliable real-time execution is possible, but it would be slow and expensive. We prefer our systems to be fast and cheap, even with random pauses.
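To get a feel for how large these pauses are on a given machine, here is a minimal sketch that requests short sleeps and records how much longer than requested the thread actually stayed asleep. The function name `measure_overshoot` and its parameters are made up for illustration; they are not from the book or from Trio. Note that whether `time.monotonic()` keeps counting while the whole system is suspended depends on the platform, so a closed laptop lid may not show up here, while scheduling delays, GC pauses, and SIGSTOP should.

```python
import time


def measure_overshoot(interval: float = 0.01, rounds: int = 100) -> float:
    """Return the worst observed overshoot, in seconds.

    Each round asks for a sleep of `interval` seconds and compares that
    with the time that really elapsed on the monotonic clock.
    """
    worst = 0.0
    for _ in range(rounds):
        start = time.monotonic()
        time.sleep(interval)
        elapsed = time.monotonic() - start
        # Anything beyond the requested interval is time the thread
        # spent paused for reasons outside its control.
        worst = max(worst, elapsed - interval)
    return worst


if __name__ == "__main__":
    print(f"worst overshoot: {measure_overshoot() * 1000:.1f} ms")
```

Running this on an idle machine and then again under heavy load, or while stopping the process with Ctrl-Z and resuming it with `fg`, should make the difference visible. The code only learns about the pause after the fact, when it reads the clock again.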

pquentin closed this as completed Dec 31, 2020

pquentin changed the title ~~Is it crazy to assume that a thread might be paused for 10 seconds?~~ Can a thread be paused for 10 seconds? Jan 1, 2021

pquentin added a commit to pquentin/rally that referenced this issue Mar 30, 2022

Be more lenient in timings test

230f8f6

Timing tests are by nature flaky, especially in CI, see python-trio/trio#1851 for example.

pquentin added a commit to pquentin/rally that referenced this issue Mar 30, 2022

Be more lenient in timings test

7d7f8d4

Timing tests are by nature flaky, especially in CI, see python-trio/trio#1851 for example.

pquentin added a commit to pquentin/rally that referenced this issue Feb 9, 2023

Mock time to ensure reliable results on macOS CI

c3ba45a

Timing tests are notoriously unreliable (see python-trio/trio#1851) and were failing during my tests using macOS on GitHub Actions.

pquentin mentioned this issue Feb 9, 2023

Mock time to ensure reliable results on macOS CI elastic/rally#1668

Merged

pquentin added a commit to elastic/rally that referenced this issue Feb 15, 2023

Mock time to ensure reliable results on macOS CI (#1668)

8343a17

Timing tests are notoriously unreliable (see python-trio/trio#1851) and were failing during my tests using macOS on GitHub Actions.
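The commit messages above mention mocking time to make tests reliable. The following is only a rough sketch of that general idea, not the code from those commits: `Stopwatch` and `FakeClock` are hypothetical names, and the design assumes the code under test accepts its clock as a parameter.

```python
import time
from typing import Callable


class Stopwatch:
    """Measures elapsed time using an injectable clock function."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._start = clock()

    def elapsed(self) -> float:
        return self._clock() - self._start


class FakeClock:
    """Test double: time only moves when advance() is called."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def test_elapsed_is_deterministic() -> None:
    clock = FakeClock()
    watch = Stopwatch(clock)
    # No real sleeping: a paused CI runner cannot change this result.
    clock.advance(2.5)
    assert watch.elapsed() == 2.5
```

Because the test never touches the real clock, a 10-second pause of the test thread has no effect on the measured duration. The trade-off is that such a test no longer checks real timing behaviour, only the logic built on top of the clock.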
