Is your feature request related to a problem? Please describe the problem.
In experimenting with the performance characteristics of YARP, it became clear that there is a balancing act between how many requests are handled at once and the resulting RPS, latency, CPU usage, and working set. By default, Kestrel + ASP.NET Core does not throttle the number of requests handled concurrently: if the server is hit with a high load, it keeps dispatching tasks to the thread pool for incoming requests. That may produce higher RPS on average, but at the cost of higher latency for each request, higher CPU usage, and potentially unbounded working-set growth in memory.
Describe the solution you'd like
We should have a rate-limiting component that caps the number of active requests processed by ASP.NET. Requests beyond that cap should be queued and then handled in order, unless the request/connection times out. Ideally this throttling happens between Kestrel and the early stages of ASP.NET so that minimal work is done on each new request, i.e. before allocating an HttpContext or parsing the stream for the request line, headers, etc.
There should be a way to specify a fixed value for this cap on simultaneous requests, a max queue size after which new requests are rejected, and a max queue duration for each waiting request.
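As a rough illustration of those three knobs, here is a language-agnostic sketch in Python, with asyncio standing in for the server's request dispatch. `ConcurrencyLimiter`, `handle`, and the 503 return values are all hypothetical names for this sketch, not an existing YARP or ASP.NET API:

```python
import asyncio


class ConcurrencyLimiter:
    """Caps in-flight requests; excess requests queue up to a bound,
    and each queued request waits at most queue_timeout seconds."""

    def __init__(self, max_concurrent, max_queue, queue_timeout):
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self._max_queue = max_queue
        self._queue_timeout = queue_timeout
        self._waiting = 0  # requests currently parked in the queue

    async def handle(self, request_handler):
        if self._semaphore.locked():  # no free slot: this request must queue
            if self._waiting >= self._max_queue:
                return 503  # queue full: reject immediately
            self._waiting += 1
            try:
                await asyncio.wait_for(
                    self._semaphore.acquire(), self._queue_timeout)
            except asyncio.TimeoutError:
                return 503  # waited too long in the queue
            finally:
                self._waiting -= 1
        else:
            await self._semaphore.acquire()  # free slot: start right away
        try:
            return await request_handler()
        finally:
            self._semaphore.release()
```

In a real implementation the reject paths would close or fail the connection before any per-request allocation, which is the whole point of putting the limiter ahead of HttpContext creation.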
In addition, we should have an algorithmic solution that dynamically adjusts max_simultaneous_requests to balance it against the CPU usage and working set of the application. I suspect the latter will be most useful in practice in containerized applications: the memory target can be set below the container's OOM-kill threshold so that the process manages its own resources.
I am not a mathematician, but I suspect some form of PID controller could drive the value of max_simultaneous_requests to keep CPU usage and working set within bounds.
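A minimal sketch of what such a controller might look like, assuming the working set is sampled periodically. `PidLimitController`, the relative-error formulation, and the gains are all illustrative choices for this sketch, not a tuned or proposed design:

```python
class PidLimitController:
    """PID loop that nudges a concurrency cap so the measured working
    set tracks a memory budget. Gains here are placeholders, not tuned."""

    def __init__(self, memory_budget_bytes, kp=0.5, ki=0.05, kd=0.0,
                 min_limit=1, max_limit=10_000):
        self.setpoint = memory_budget_bytes
        self.kp, self.ki, self.kd = kp, ki, kd
        self.min_limit, self.max_limit = min_limit, max_limit
        self._integral = 0.0
        self._prev_error = 0.0

    def update(self, measured_working_set, current_limit, dt=1.0):
        # Positive error = headroom below the budget -> allow more concurrency;
        # negative error = over budget -> shrink the cap.
        error = (self.setpoint - measured_working_set) / self.setpoint
        self._integral += error * dt
        derivative = (error - self._prev_error) / dt
        self._prev_error = error
        adjustment = (self.kp * error
                      + self.ki * self._integral
                      + self.kd * derivative)
        new_limit = current_limit * (1.0 + adjustment)
        return int(min(self.max_limit, max(self.min_limit, new_limit)))
```

With only the proportional term active, a process at half its memory budget would be allowed more concurrency, and one 50% over budget would be throttled; the integral and derivative terms would then deal with steady-state offset and oscillation, which is where actual tuning effort would go.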
Additional context
Using request latency as the control variable is probably not practical: while you could measure the time ASP.NET Core takes to handle a request, the observed latency would also include the queue duration, so any limit based on it would be of limited use.