Quick Code Review: Auto-vectorization #2
Comments
Amazing, that's really helpful to know. Thanks for pointing it out. Do you plan on continuing to work on this? I was planning on moving on, but now I'm kind of curious to implement quantization to scale to bigger models. Would be happy to collaborate if you were interested.
Nice. This bumped me up from 0.92 t/s to 1.02 t/s on llama2 7B.
Yeah, that's what I hope to do before moving on. Would love to collaborate. I started playing with some prototypes on this branch (it's not very reader friendly yet).
Nice! I wonder why the speed-up is so small compared to the 15M model. Maybe the CPU waits on mmap page swaps?
Nice, I will try to catch up on your code. Some of the HF people recommended trying to do GPTQ inference (quant-full mat-vec). Which version are you doing?
My code is mostly experimenting with ways to do a clean matmul interface. I did a naive row-wise i8 quantization of the weights, with matmuls that accumulate into f32. But that's really just the first thing that popped to mind.
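A rough illustrative sketch (hypothetical code, not the actual code on that branch) of what row-wise i8 quantization with f32 accumulation can look like:

```rust
// Each weight row stores one f32 scale plus i8 values; the dot product
// dequantizes on the fly and accumulates in f32.
struct QuantRow {
    scale: f32,      // max_abs / 127 for this row
    values: Vec<i8>, // quantized weights
}

fn quantize_row(row: &[f32]) -> QuantRow {
    let max_abs = row.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    QuantRow {
        scale,
        values: row.iter().map(|&v| (v / scale).round() as i8).collect(),
    }
}

fn dot_quantized(row: &QuantRow, x: &[f32]) -> f32 {
    row.values
        .iter()
        .zip(x.iter())
        .map(|(&q, &xv)| q as f32 * row.scale * xv)
        .sum()
}
```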
hi! I saw that you are also a maintainer of Triton and worked on the AoT compiler. I'm playing around with trying to set this project up to use Triton just to learn it. Do you have any tips for getting this to work? I tried exporting PTX, which worked reasonably well at first, but I think I am running into issues with calling into it from Rust. Curious if you had pointers to recommended ways to do it? My hacky code: https://github.com/srush/llama2.rs/pull/35/files#diff-7c199e27f9cec983de845ad01b4fd4e558534ee33fd49d8134cbab879361af67R158
What are the issues that you're running into? Is it slowness or something Triton-related? I left some comments in the PR.
Thanks, once I got it running it was fast, but then when I tried to further optimize the Triton code, the Rust version went out of sync with the Python version. I'm trying to make a minimal example.
Hi Sasha! Nice to see your take on llama2.rs!
I did a port of Andrej's llama2.c here.
I had a chance to go over the code and compare it to my version. The only thing I want to mention is that you can make the compiler auto-vectorize some computations (most notably the matmul).
I did a quick benchmark of our implementations. Yours runs at ~52 t/s, mine runs at ~75 t/s (on a 2-CPU/4 GB Codespaces VM, running the stories15M model). My guess is that most of the difference is because the compiler can't auto-vectorize your matmul implementation.
A good way to help the compiler auto-vectorize is to use iterators as much as possible. The key idea is to replace an index-based loop like the following with an iterator:
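For reference, here is a hypothetical sketch of the kind of index-based matmul loop this refers to (the function name and signature are illustrative, not the exact code from the repo); the indexed accesses force bounds checks inside the hot loop, which tends to block SIMD:

```rust
// w is row-major: xout.len() rows of n columns; computes xout = w @ x.
fn matmul(xout: &mut [f32], x: &[f32], w: &[f32], n: usize) {
    for i in 0..xout.len() {
        let mut val = 0.0f32;
        for j in 0..n {
            val += w[i * n + j] * x[j]; // indexed access -> bounds checks
        }
        xout[i] = val;
    }
}
```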
So doing the following closes most of the gap and puts your implementation at ~74 t/s:
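A sketch of the iterator-based version (same assumed signature as above); zipping slices and chunking the weight matrix by rows lets the compiler elide the bounds checks and vectorize the inner loop:

```rust
fn matmul(xout: &mut [f32], x: &[f32], w: &[f32], n: usize) {
    xout.iter_mut()
        .zip(w.chunks_exact(n))
        .for_each(|(out, row)| {
            *out = row.iter().zip(x.iter()).map(|(&wi, &xi)| wi * xi).sum();
        });
}
```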
You can force the compiler to try to auto-vectorize with AVX2 by passing the compiler flags:
RUSTFLAGS="-C target-feature=+avx2,+fma"
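For example (assuming a standard cargo setup), the flags can be applied to a release build like this:

```
RUSTFLAGS="-C target-feature=+avx2,+fma" cargo build --release
```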