Rage is 38% slower at encrypting than Go implementation #57
Comments
Note: I plan on adding |
Yeah, most of the performance difference is that The other place to look at for optimisation is my implementation of STREAM. Currently encryption of each chunk involves an allocation because I am not using the |
Yeah, STREAM seems to be the slowest part here. Multicore optimizations should also speed things up massively. See the gist for a tiny & very performant example of Rust threads. |
I ran a brief test on my laptop, of the form:
Switching to |
Oh heh, looks like my benchmarks were being limited by the speed of
Still no difference using |
All right, squeaky wheel gets the grease. The I rewrote the buffering logic and added a new AVX2 backend which can compute two ChaCha20 blocks in parallel. I've got it down to ~1.4cpb now: Will double check I didn't break anything and cut a new release soon, then bump the |
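To put the ~1.4 cpb figure in perspective, cycles per byte can be converted into a rough throughput ceiling. This is only a back-of-the-envelope sketch: the comment above does not say which CPU or clock speed the measurement was taken at, so the 3.7 GHz clock used in the usage note is an assumption, and the function names are made up for illustration.

```rust
/// Approximate throughput in bytes per second for a given cycles-per-byte
/// figure. `clock_hz` is an assumed sustained clock speed, not a measured one.
fn throughput_bytes_per_sec(cycles_per_byte: f64, clock_hz: f64) -> f64 {
    clock_hz / cycles_per_byte
}

/// Approximate time in seconds to process `bytes` at that rate.
/// Covers only the cipher work, not I/O, Poly1305, or buffering.
fn seconds_for(bytes: u64, cycles_per_byte: f64, clock_hz: f64) -> f64 {
    bytes as f64 / throughput_bytes_per_sec(cycles_per_byte, clock_hz)
}
```

With an assumed 3.7 GHz clock, `throughput_bytes_per_sec(1.4, 3.7e9)` comes out at roughly 2.64 GB/s, so the ChaCha20 part of a 2 GiB input would take on the order of 0.8 s on one core under those assumptions.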
Here's the benchmarked improvement on the full AEAD construction: ...so encryption is ~60% faster, and decryption is unchanged. Note that there's still some low hanging fruit, like a SIMD implementation of Poly1305, and pipelining the execution of ChaCha20 and Poly1305 so they can execute in parallel. |
I adapted the Before vs after
Before
|
I've opened #58 with the benchmark and the dependency update. |
More measurements of the improvement on my desktop (i7-8700K CPU @ 3.70GHz). Before
Before
And current master without vs with AVX2:
|
And
So the current status is that |
It would be great to understand what exactly is slowing us down at this point. I'm not sure what the best way to profile traces in Rust is. |
Is there any reason not to compile with AVX2? I think almost every x86 CPU supports it nowadays? |
The three main things are:
Re: the third item the |
Nope! Looking at the December 2019 Steam hardware survey, 77.05% of the surveyed Windows machines (which made up 96.86% of the survey, so I'm not looking at the macOS or Linux figures) support AVX2. Given that gamers tend towards newer hardware, this is most likely an upper bound on support (by how much, IDK). See also this Rust discussion thread. |
@tarcieri I thought a nonce-misuse-resistant construction cannot be 1-pass? Specifically, SIV. Am I wrong? |
@paulmillr that’s true (for encryption; decryption in a SIV mode can still be 1-pass), but we’re talking about ChaCha20Poly1305 here... |
If anyone would like to try wiring it up, Benchmarks showed its AVX2 backend was about 40% faster than the |
Ooh, thanks! I'll try that today 😃 |
Also note that the |
What about parallelism / multicore usage? Anything we could do here? |
STREAM is "embarrassingly parallel" so pick any parallelization strategy you want |
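A rough sketch of why this holds: in the age format each 64 KiB chunk gets its own nonce (an 11-byte big-endian chunk counter plus a 1-byte "final chunk" flag), so any chunk can be sealed without looking at its neighbours. The code below is not rage's implementation. `toy_seal` is a placeholder XOR that stands in for ChaCha20-Poly1305 and provides no security, and the function names and worker-splitting strategy are assumptions chosen for illustration.

```rust
use std::thread;

const CHUNK_SIZE: usize = 64 * 1024; // STREAM chunk size used by age

/// Per-chunk nonce: 11-byte big-endian counter followed by a 1-byte last-chunk flag.
fn stream_nonce(counter: u64, last: bool) -> [u8; 12] {
    let mut nonce = [0u8; 12];
    nonce[3..11].copy_from_slice(&counter.to_be_bytes());
    nonce[11] = last as u8;
    nonce
}

/// PLACEHOLDER for the real AEAD: a trivial XOR, NOT encryption.
fn toy_seal(nonce: &[u8; 12], chunk: &[u8]) -> Vec<u8> {
    chunk.iter().enumerate().map(|(i, b)| b ^ nonce[i % 12]).collect()
}

/// Reference: seal chunks one after another.
fn seal_sequential(plaintext: &[u8]) -> Vec<Vec<u8>> {
    let n = plaintext.chunks(CHUNK_SIZE).count();
    plaintext
        .chunks(CHUNK_SIZE)
        .enumerate()
        .map(|(i, c)| toy_seal(&stream_nonce(i as u64, i + 1 == n), c))
        .collect()
}

/// Seal chunks on `workers` scoped threads. Each chunk only needs its own
/// index, so no state is shared between workers.
/// (Empty input is not special-cased here the way a real STREAM would.)
fn seal_parallel(plaintext: &[u8], workers: usize) -> Vec<Vec<u8>> {
    let chunks: Vec<&[u8]> = plaintext.chunks(CHUNK_SIZE).collect();
    let n = chunks.len();
    let mut out: Vec<Vec<u8>> = vec![Vec::new(); n];
    let workers = workers.max(1);
    let per_worker = ((n + workers - 1) / workers).max(1);
    thread::scope(|s| {
        for (w, (ins, outs)) in chunks
            .chunks(per_worker)
            .zip(out.chunks_mut(per_worker))
            .enumerate()
        {
            s.spawn(move || {
                for (i, (chunk, slot)) in ins.iter().zip(outs.iter_mut()).enumerate() {
                    let idx = w * per_worker + i;
                    *slot = toy_seal(&stream_nonce(idx as u64, idx + 1 == n), chunk);
                }
            });
        }
    });
    out
}
```

`seal_parallel(&data, 4)` produces the same chunks as `seal_sequential(&data)`. Whether the extra threads pay off in practice depends on where the real bottleneck is, as discussed later in this thread.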
Current master of each (measured on my laptop - Thinkpad P1 with Xeon E-2176M):
|
I've used Reading the 2 GiB input from The largest time sink is clearly the |
I've managed to speed up Same
Current master of
Current master of
Flame graph (highlighted sections are the |
And now that I've managed to update all the dependencies (#187, #186), and we have Configuration:
🚄💨 |
What are the use cases for armor, though? Not that many, I guess? |
It's true that you're unlikely to be armoring 2 GiB of data, but it's not outside the intended use case. Armoring was specifically added to the spec to handle CRLF platforms (because the binary spec is canonical LF and would be broken by platforms that translate LF to CRLF). Also, let me take my wins where I can 😅 |
It's awesome in any case. Is it possible to compile one binary that uses AVX2 when available and falls back to a non-vectorized implementation otherwise? |
FWIW the I'm also looking at implementing some end-to-end SIMD buffering in All that said, one of the big goals of the next release of the RustCrypto crates is runtime detection so target feature customization is no longer required, although that might come at a small performance hit until we can work through all of the impacts that has on e.g. inlining and other optimizations. |
I just tried switching from
So I'm not sure how |
Just verified it's working by running the benchmarks in the
|
RustCrypto/stream-ciphers#261 helps to close the gap significantly:
|
Awesome stuff!! |
I dug in further, and I think RustCrypto/stream-ciphers#262 would close almost all of the remaining gap between |
RustCrypto/stream-ciphers#267 does indeed close the remaining gap (for
|
Is it possible to combine rage and rage-avx2? That is, runtime AVX2 detection without the performance loss we've seen before. Maybe there's also some small issue that causes the perf hit. |
Runtime AVX2 detection without performance loss is impossible, because Currently we check for AVX2 support inside both That being said, the runtime detection gap was seemingly smaller when using |
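For readers unfamiliar with the mechanism, here is a minimal sketch of runtime AVX2 detection with a portable fallback, where the check happens once per message rather than once per block. It uses the standard library's `is_x86_feature_detected!` macro and `#[target_feature]` attribute. The function names and the trivial "add one to every byte" workload are hypothetical stand-ins, not the RustCrypto backends.

```rust
/// A backend processes one block in place.
type Backend = fn(&mut [u8]);

/// Portable fallback (stand-in workload: add 1 to every byte).
fn add_one_portable(block: &mut [u8]) {
    for b in block.iter_mut() {
        *b = b.wrapping_add(1);
    }
}

/// Same workload, compiled with AVX2 enabled so the compiler may vectorize it.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn add_one_avx2_inner(block: &mut [u8]) {
    for b in block.iter_mut() {
        *b = b.wrapping_add(1);
    }
}

/// Safe wrapper; must only be selected after AVX2 has been detected.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn add_one_avx2(block: &mut [u8]) {
    // SAFETY: `select_backend` returns this function only when the CPU reports AVX2.
    unsafe { add_one_avx2_inner(block) }
}

/// Perform the CPU feature check and pick a backend.
fn select_backend() -> Backend {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            return add_one_avx2;
        }
    }
    add_one_portable
}

/// Check once per message, then reuse the chosen backend for every block.
fn process_message(data: &mut [u8]) {
    let backend = select_backend();
    for block in data.chunks_mut(64) {
        backend(block);
    }
}
```

The call through a function pointer is itself a barrier to inlining, which is one way a runtime-detected build can end up slower than one compiled with the target feature enabled everywhere.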
Huh, I just re-ran the benchmarks on my machine (on current
(The So I think we're actually probably fine on my last point above (at least, switching away from |
Can we get a new version of rage out? |
@str4d ping: it would be useful to have a new release. |
@paulmillr 0.7.0 is now out with the above changes. |
Weird: this person says rage is 5x slower FiloSottile/age#109 (comment) |
Speaking of latest gen desktop CPUs, core count does not matter: still slow on Windows/Ryzen system, but on Linux/Intel it can at least do about 700 MB/s encryption and decryption (either on tmpfs or on SSD, not much difference). You absolutely need to do several blocks in parallel threads to make it faster (single thread caps to about 1 GB/s with current cryptolibs), and perhaps check your I/O path to remove any extra buffer copying. Ideally you read encrypted data by mmap, or if not possible, by Several gigabytes per second should be very possible, but then you cannot afford any extra copies. |
@Tronic I'm well aware that we will eventually need to add threading support to boost performance further. However, the last time I tried that (#57 (comment)) I saw only a 20%-ish throughput improvement while using 4x more CPU. So there are clearly other bottlenecks that need addressing first before we add multithreading support. In any case, this particular issue is about catching up to the performance of the Go age implementation, which is also single-threaded. Let's move multithreading discussions to #271. |
Re-ran the benchmarks on my old and new machines:
Intel Core i7-8700K
Baselines are higher (probably because I have Firefox open), but otherwise we see approximately the same ratios as before.
AMD Ryzen 9 5950X
Compared to the i7-8700K:
Yay, I have a new target to optimise for! |
What's leaving us behind at this point? |
Per my earlier comment (#57 (comment)), I'm almost certain it's our lack of one-pass encryption: the Go AEAD impl uses dedicated custom assembly for ChaCha20Poly1305, whereas the RustCrypto AEAD impl is compositional, so the ChaCha20 assembly is separate from the Poly1305 assembly. |
The prospective v0.5 PR for the That should make it possible to interleave encryption+authentication / authentication+decryption passes at the granularity of blocks that the backend SIMD implementations operate over |
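The difference between a two-pass and an interleaved (one-pass) construction can be sketched as follows. Everything here is a toy: `toy_keystream_byte` and `toy_mac_update` are made-up placeholders with no cryptographic value, standing in for ChaCha20 and Poly1305 respectively, and the 64-byte block size is an assumption for illustration. The point is only the loop structure: the one-pass version authenticates each block right after encrypting it, while the data is still hot in cache.

```rust
const BLOCK: usize = 64;

/// PLACEHOLDER keystream (stands in for ChaCha20; not secure).
fn toy_keystream_byte(key: u8, pos: usize) -> u8 {
    key.wrapping_add(pos as u8).rotate_left(3)
}

/// PLACEHOLDER incremental MAC over ciphertext (stands in for Poly1305; not secure).
fn toy_mac_update(state: u64, ciphertext: &[u8]) -> u64 {
    ciphertext
        .iter()
        .fold(state, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u64))
}

/// Two passes: encrypt the whole buffer, then walk it again to authenticate.
fn encrypt_two_pass(key: u8, buf: &mut [u8]) -> u64 {
    for (i, b) in buf.iter_mut().enumerate() {
        *b ^= toy_keystream_byte(key, i);
    }
    toy_mac_update(0, buf)
}

/// One pass: encrypt a block and immediately feed it to the MAC.
fn encrypt_one_pass(key: u8, buf: &mut [u8]) -> u64 {
    let mut tag = 0u64;
    for (n, block) in buf.chunks_mut(BLOCK).enumerate() {
        for (i, b) in block.iter_mut().enumerate() {
            *b ^= toy_keystream_byte(key, n * BLOCK + i);
        }
        tag = toy_mac_update(tag, block);
    }
    tag
}
```

Both functions produce the same ciphertext and tag for the same input; only the memory access pattern differs.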
Re-ran the benchmarks on my Ryzen 9 5950X against #303, and that PR with RustCrypto/AEADs#415 applied.
Pre-release
Pre-release plus
|
Command | Time (s) | Relative |
---|---|---|
age | 0.80 | 1 |
rage | 1.55 | 1.94 |
rage-avx2 | 1.47 | 1.84 |
age -a | 2.83 | 1 |
rage -a | 3.03 | 1.07 |
rage-avx2 -a | 3.09 | 1.09 |
The new traits in universal-hash 0.5 are enabling a significant speed-up, I suspect due to us only checking the backend at runtime on the level of an entire message rather than on every block. But we still trail behind without one-pass encryption, and trying to implement that makes things significantly slower (and will likely do so until we figure out a way to lift the runtime checks to the AEAD level).
EDIT: Current performance
I just tried encrypting a random 2 GB file: 5.37 s with Rust vs 1.07 s with Go.
Go is not great, and that includes performance, so we could probably do better with Rage!