I profiled the logits processing code, and the bottleneck is the transfer of the allowed token ids list (which can have many elements) to GPU. My suggestion is to use a compressed version of the list that can be efficiently uncompressed/used to mask logits on GPU, for instance bitmaps.
We should first evaluate the potential speed-ups in Python. If the bottleneck becomes the bitmap construction, we could move it to Rust; if it is the operations on GPU, we can implement a CUDA kernel to mask the logits.
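To make the idea concrete, here is a minimal pure-Python sketch of the bitmap approach. The helper names (`pack_allowed_ids`, `mask_logits`) are illustrative, not vLLM or outlines APIs, and the masking loop stands in for what would be a vectorized unpack on GPU:

```python
def pack_allowed_ids(allowed_ids, vocab_size):
    """Compress a list of allowed token ids into a bytearray bitmap.

    The bitmap uses 1 bit per vocabulary entry (vocab_size / 8 bytes),
    versus 4-8 bytes per element for a raw id list, so far fewer bytes
    would need to cross the CPU->GPU boundary.
    """
    bitmap = bytearray((vocab_size + 7) // 8)
    for tid in allowed_ids:
        bitmap[tid >> 3] |= 1 << (tid & 7)
    return bitmap


def mask_logits(logits, bitmap):
    """Set the logits of disallowed tokens to -inf using the bitmap.

    On GPU this would be bitwise ops on a packed tensor or a small CUDA
    kernel; here it is spelled out as a plain Python loop.
    """
    neg_inf = float("-inf")
    return [
        x if bitmap[i >> 3] & (1 << (i & 7)) else neg_inf
        for i, x in enumerate(logits)
    ]


vocab_size = 16
allowed = [0, 3, 7, 12]
bitmap = pack_allowed_ids(allowed, vocab_size)  # 2 bytes instead of 4 ints
masked = mask_logits([0.5] * vocab_size, bitmap)
print([i for i, x in enumerate(masked) if x != float("-inf")])  # -> [0, 3, 7, 12]
```

For a realistic vocabulary (~128K tokens) the bitmap is a fixed 16 KB regardless of how many token ids are allowed, which is the property that should make the host-to-device transfer cheap and predictable.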
Is there a corresponding issue in the vLLM repository? Also, as mentioned here, tracking performance would really help with reasoning through these kinds of issues, wouldn’t it?
The integration of `outlines-core` in vLLM currently performs a lot worse than it could. We can improve the performance in two ways: […] `outlines`, and use the latest version of `outlines-core`, which shows better compilation performance.

If those don't bring the speed on par with `xgrammar`, we need to understand exactly why that is.