Testing the new NVCF feature enable-gateway-timeout #2576
base: main
Signed-off-by: Sachin Pisal <spisal@nvidia.com>
CUDA Quantum Docs Bot: A preview of the documentation can be found here.
I am planning to test this using long-running examples with a larger number of shots. That will keep the workers busy, so incoming jobs will wait in the queue. This should trigger the queue timeout, which returns a 504 to the client. Please let me know if you have any other thoughts on testing this.
{"nvcf-feature-enable-gateway-timeout", "true"},
// The max timeout for the polling response is 20 minutes
// https://docs.nvidia.com/cloud-functions/user-guide/latest/cloud-function/api.html#http-polling
{"NVCF-POLL-SECONDS", "1200"}};
What exactly is this doing? I think the prior behavior was that the user would get some sort of "heartbeat" polling message approximately every 5 seconds. Is that going away? Do they have to wait 1200 seconds for that heartbeat now? Does it still run correctly if their job takes ~1 hour?
Testing the new NVCF feature enable-gateway-timeout.
Currently, when a client invokes an API, the job is placed in a queue on the server side, and the queue has a timeout. If a worker does not or cannot pick up the job within the queue timeout, an HTTP 202 (request accepted) response is sent back to the client, even though no worker has picked up the job from the queue.
The new feature sends a correct HTTP response back to the client to indicate what has happened to their request. If the job is not picked up by a worker within the queue timeout, an HTTP 504 (gateway timeout) response is sent back to the client, indicating that no worker picked up the job from the queue within the set queue timeout. In this case, the client can resend the request, or we can add a retry mechanism.
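The retry option mentioned above could look roughly like the following. This is only a sketch, not what this PR implements: `NextStep`, `nextStepFor`, and the attempt counters are hypothetical names, and only the meanings of 202 and 504 come from the description here.

```cpp
// Hypothetical client-side decision helper (not part of this PR).
// Only the 202 and 504 semantics are taken from the PR description.
enum class NextStep {
  Poll,      // 202: a worker picked up the job, start polling for the result
  Resubmit,  // 504: queue timeout expired, send the same request again
  GiveUp,    // 504 again, but the retry budget is used up
  Unhandled  // any other status is outside the scope of this sketch
};

// attempt is the zero-based count of resubmissions already made.
constexpr NextStep nextStepFor(int httpStatus, int attempt, int maxAttempts) {
  return httpStatus == 202   ? NextStep::Poll
         : httpStatus != 504 ? NextStep::Unhandled
         : attempt < maxAttempts ? NextStep::Resubmit
                                 : NextStep::GiveUp;
}
```

A caller would increment `attempt` after each 504 and stop once `GiveUp` comes back; a backoff between resubmissions is probably worth adding, but is left out here.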
With the new NVCF feature, once a worker picks up the job within the queue timeout, an HTTP 202 response is sent back to the client. The client then needs to poll the server for the result. The poll interval can be set in the request header with the key NVCF-POLL-SECONDS, with a value between 1 minute (default) and 20 minutes (maximum). For long-running jobs, a long polling value is recommended.
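As an illustration of the header settings described above, here is a minimal sketch that keeps a requested poll interval inside the 1 to 20 minute range before building the headers. The helper names `clampPollSeconds` and `makeNvcfHeaders` are hypothetical; the two header keys are the ones used in this PR's diff.

```cpp
#include <map>
#include <string>

// Bounds taken from the description: 1 minute (default) to 20 minutes (maximum).
constexpr int kMinPollSeconds = 60;
constexpr int kMaxPollSeconds = 1200;

// Hypothetical helper: force the requested value into the allowed range.
constexpr int clampPollSeconds(int requested) {
  return requested < kMinPollSeconds   ? kMinPollSeconds
         : requested > kMaxPollSeconds ? kMaxPollSeconds
                                       : requested;
}

// Hypothetical helper: build the two NVCF-related request headers.
inline std::map<std::string, std::string>
makeNvcfHeaders(int requestedPollSeconds) {
  return {{"nvcf-feature-enable-gateway-timeout", "true"},
          {"NVCF-POLL-SECONDS",
           std::to_string(clampPollSeconds(requestedPollSeconds))}};
}
```

Passing `1200` reproduces the values hard-coded in the diff; a smaller value would trade shorter poll round-trips for more requests on long-running jobs.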
Please refer to the NVCF document here.