Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Codegen: on demand distribution to forked processes #14273

Conversation

ysbaddaden
Copy link
Contributor

@ysbaddaden ysbaddaden commented Jan 30, 2024

This patch optimizes the Codegen (bc+obj) pass of the compiler by taking better advantage of multiple CPU cores.

Instead of pre-slicing the list of compilation units then having each forked process handle its personal list, with some finishing before the others, this patch adds bidirectional pipes to push the compilation units as the forked processes make progress, so that each forked process will continue to compile for as long as there is something to compile.

The forker processes are overqueued with one compilation unit to avoid latencies (they compile the next unit while the main process pushes the next).

This only improves performance when there are multiple compilation units. It won't have any effect when --single-module (implied by --release) is set, and will have little impact when a module is extra large (e.g. crystal specs) unless maybe when there are multiple of them and they were set to the same forked process.

Example: while compiling Crystal the Codegen (bc+obj) pass goes from ~18s to ~11s on an empty cache, and ~5s to ~3s on a full cache recompilation.

Related to #14227 but using forks+pipes instead of threads+channel.

Instead of pre-slicing the list of compilation units then having each
forked process handle its personal list, with some finishing before the
others, this patch adds bidirectional pipes to push the compilation
units as the forked processes make progress, so that each forked process
will continue to compile for as long as there is something to compile.

The forker processes are overqueued with one compilation unit to avoid
latencies (they compile the next unit while the main process pushes the
next).

This only improves performance when there are multiple compilation
units. It won't have any effect when `--single-module` (implied by
`--release`) is set, and will have little impact when a module is extra
large (e.g. crystal specs) unless maybe when there are multiple of them
and they were set to the same forked process.
The method exists but raises NotImplementedError which causes
Process.new(pid) to fail because it doesn't exist with a `PidT`). This
patch fixes the issue by checking LibC.fork directly.

Weirdly, both were used before these changes and compilation did succeed.
@ysbaddaden ysbaddaden force-pushed the fix/codegen-distribute-work-to-forked-processes branch from 3288c59 to 2f1af88 Compare January 30, 2024 09:37
@straight-shoota straight-shoota added this to the 1.12.0 milestone Mar 4, 2024
@ysbaddaden
Copy link
Contributor Author

Let's note that we default to 8 processes, which used to be fine when we had 4 cores with hyperthreading (8 threads), but might now be much in phase with the current CPU trends where 16, 20 or 32 threads are becoming the norm, though we might not want to fork 160 processes on an Ampere Altra 😁

Maybe we could improve the default value to be System.cpu_count.clamp(8..32) so we have enough processes but not too much (I'm looking at you Ampere Altra).

Maybe we could accept a CRYSTAL_COMPILER_THREADS environment variable too?

@ysbaddaden
Copy link
Contributor Author

And... now I'm wondering about what happens when we have fewer compilation units than configured threads 🤔

Maybe I should limit the number of forked processes, just in case 😅

@straight-shoota
Copy link
Member

I'd like to merge this as is and fine tune the number of threads in a follow-up.

@ysbaddaden
Copy link
Contributor Author

I just pushed one last commit to fix edge cases with too little (must be >= 1) or too many processes (pointless).

The comment was to leave some indication about how to go further.

@straight-shoota
Copy link
Member

Yeah, that clamp makes sense for sanity.

@straight-shoota straight-shoota merged commit 0606158 into crystal-lang:master Mar 6, 2024
57 checks passed
@ysbaddaden ysbaddaden deleted the fix/codegen-distribute-work-to-forked-processes branch March 7, 2024 11:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants