Codegen: on demand distribution to forked processes #14273

ysbaddaden · 2024-01-30T08:19:20Z

This patch optimizes the Codegen (bc+obj) pass of the compiler by taking better advantage of multiple CPU cores.

Instead of pre-slicing the list of compilation units then having each forked process handle its personal list, with some finishing before the others, this patch adds bidirectional pipes to push the compilation units as the forked processes make progress, so that each forked process will continue to compile for as long as there is something to compile.

The forker processes are overqueued with one compilation unit to avoid latencies (they compile the next unit while the main process pushes the next).

This only improves performance when there are multiple compilation units. It won't have any effect when --single-module (implied by --release) is set, and will have little impact when a module is extra large (e.g. crystal specs) unless maybe when there are multiple of them and they were set to the same forked process.

Example: while compiling Crystal the Codegen (bc+obj) pass goes from ~18s to ~11s on an empty cache, and ~5s to ~3s on a full cache recompilation.

Related to #14227 but using forks+pipes instead of threads+channel.

Instead of pre-slicing the list of compilation units then having each forked process handle its personal list, with some finishing before the others, this patch adds bidirectional pipes to push the compilation units as the forked processes make progress, so that each forked process will continue to compile for as long as there is something to compile. The forker processes are overqueued with one compilation unit to avoid latencies (they compile the next unit while the main process pushes the next). This only improves performance when there are multiple compilation units. It won't have any effect when `--single-module` (implied by `--release`) is set, and will have little impact when a module is extra large (e.g. crystal specs) unless maybe when there are multiple of them and they were set to the same forked process.

The method exists but raises NotImplementedError which causes Process.new(pid) to fail because it doesn't exist with a `PidT`). This patch fixes the issue by checking LibC.fork directly. Weirdly, both were used before these changes and compilation did succeed.

src/compiler/crystal/compiler.cr

ysbaddaden · 2024-03-05T10:20:05Z

Let's note that we default to 8 processes, which used to be fine when we had 4 cores with hyperthreading (8 threads), but might now be much in phase with the current CPU trends where 16, 20 or 32 threads are becoming the norm, though we might not want to fork 160 processes on an Ampere Altra 😁

Maybe we could improve the default value to be System.cpu_count.clamp(8..32) so we have enough processes but not too much (I'm looking at you Ampere Altra).

Maybe we could accept a CRYSTAL_COMPILER_THREADS environment variable too?

ysbaddaden · 2024-03-05T10:21:30Z

And... now I'm wondering about what happens when we have fewer compilation units than configured threads 🤔

Maybe I should limit the number of forked processes, just in case 😅

straight-shoota · 2024-03-05T12:20:40Z

I'd like to merge this as is and fine tune the number of threads in a follow-up.

ysbaddaden · 2024-03-05T15:29:31Z

I just pushed one last commit to fix edge cases with too little (must be >= 1) or too many processes (pointless).

The comment was to leave some indication about how to go further.

straight-shoota · 2024-03-05T16:41:29Z

Yeah, that clamp makes sense for sanity.

ysbaddaden added 2 commits January 30, 2024 09:11

ysbaddaden force-pushed the fix/codegen-distribute-work-to-forked-processes branch from 3288c59 to 2f1af88 Compare January 30, 2024 09:37

straight-shoota reviewed Jan 30, 2024

View reviewed changes

src/compiler/crystal/compiler.cr Show resolved Hide resolved

src/compiler/crystal/compiler.cr Outdated Show resolved Hide resolved

straight-shoota added performance kind:refactor topic:compiler:codegen labels Jan 30, 2024

Fix: expect N responses from forked processes

434edae

straight-shoota approved these changes Feb 13, 2024

View reviewed changes

straight-shoota added this to the 1.12.0 milestone Mar 4, 2024

Fix: don't fork more processes than compilation units

73fb8fd

straight-shoota approved these changes Mar 5, 2024

View reviewed changes

straight-shoota merged commit 0606158 into crystal-lang:master Mar 6, 2024
57 checks passed

ysbaddaden deleted the fix/codegen-distribute-work-to-forked-processes branch March 7, 2024 11:00

BrewTestBot mentioned this pull request Apr 9, 2024

crystal 1.12.0 Homebrew/homebrew-core#168465

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Codegen: on demand distribution to forked processes #14273

Codegen: on demand distribution to forked processes #14273

ysbaddaden commented Jan 30, 2024 •

edited

Loading

ysbaddaden commented Mar 5, 2024

ysbaddaden commented Mar 5, 2024

straight-shoota commented Mar 5, 2024

ysbaddaden commented Mar 5, 2024

straight-shoota commented Mar 5, 2024

Codegen: on demand distribution to forked processes #14273

Codegen: on demand distribution to forked processes #14273

Conversation

ysbaddaden commented Jan 30, 2024 • edited Loading

ysbaddaden commented Mar 5, 2024

ysbaddaden commented Mar 5, 2024

straight-shoota commented Mar 5, 2024

ysbaddaden commented Mar 5, 2024

straight-shoota commented Mar 5, 2024

ysbaddaden commented Jan 30, 2024 •

edited

Loading