Testing the new NVCF feature enable-gateway-timeout #2576

Open
wants to merge 2 commits into
base: main
sacpis
Collaborator

@sacpis sacpis commented Feb 2, 2025

Testing the new NVCF feature enable-gateway-timeout.

Currently, when a client invokes an API, the job is placed in a queue on the server side, and the queue has a timeout. If a worker does not or cannot pick up the job within the queue timeout, an HTTP 202 (request accepted) response is sent back to the client, even though no worker has picked up the job from the queue.

The new feature returns a correct HTTP response to the client to indicate what has happened to their request. If the job is not picked up by a worker within the queue timeout, an HTTP 504 (gateway timeout) response is sent back to the client, indicating that no worker picked up the job from the queue within the set queue timeout. In this case, the client can resend the request, or we can add a retry mechanism.
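To make the status handling described above concrete, here is a minimal sketch of how a client might react to each response when the feature is enabled. The names `NextStep`, `nextStepFor`, and `submitWithRetry` are hypothetical and not part of NVCF or this PR; `send` stands in for the real HTTP call, and treating 200 as "result returned directly" is an assumption.

```cpp
#include <cassert>
#include <functional>

// Hypothetical names for illustration; not part of NVCF or this PR.
enum class NextStep { PollForResult, Resubmit, Done, Fail };

// Map the HTTP status of an invocation request to a client action,
// assuming the gateway-timeout feature is enabled.
NextStep nextStepFor(int httpStatus) {
  switch (httpStatus) {
  case 200:
    return NextStep::Done; // assumption: result returned directly
  case 202:
    return NextStep::PollForResult; // a worker picked up the job
  case 504:
    return NextStep::Resubmit; // job timed out while waiting in the queue
  default:
    return NextStep::Fail;
  }
}

// Resubmit while the server answers 504, for at most maxAttempts tries.
// `send` stands in for the real HTTP call and returns the status code.
int submitWithRetry(const std::function<int()> &send, int maxAttempts) {
  assert(maxAttempts > 0);
  int status = 504;
  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    status = send();
    if (nextStepFor(status) != NextStep::Resubmit)
      break;
  }
  return status;
}
```

A bounded attempt count keeps a persistently saturated queue from turning into an endless resubmission loop; a real client would likely also wait between attempts.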

With the new NVCF feature, once a worker picks up the job within the queue timeout, an HTTP 202 response is sent back to the client. The client then needs to poll the server for the result. The poll interval can be set in the request header with the key NVCF-POLL-SECONDS, to a value between 1 minute (default) and 20 minutes (maximum). For long-running jobs, a long polling value is recommended.
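As a rough illustration of the header settings, the sketch below builds the two headers and clamps the poll interval to the range quoted above. The header keys are spelled as in the diff in this PR. The function names and the choice to clamp (rather than reject) out-of-range values are assumptions, and `estimatedPolls` assumes each poll request is held open for the full interval before the server answers 202 again.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>

// Range quoted in the PR description: 1 minute (default) to 20 minutes.
constexpr int kMinPollSeconds = 60;
constexpr int kMaxPollSeconds = 1200;

// Build the request headers, clamping the poll interval into the range
// above. Function name is hypothetical.
std::map<std::string, std::string> makeNvcfHeaders(int pollSeconds) {
  const int clamped =
      std::min(std::max(pollSeconds, kMinPollSeconds), kMaxPollSeconds);
  return {{"nvcf-feature-enable-gateway-timeout", "true"},
          {"NVCF-POLL-SECONDS", std::to_string(clamped)}};
}

// Rough number of poll round trips for a job, assuming each poll is held
// open for the full interval before the server answers 202 again.
int estimatedPolls(int jobSeconds, int pollSeconds) {
  assert(jobSeconds >= 0 && pollSeconds > 0);
  return (jobSeconds + pollSeconds - 1) / pollSeconds; // ceiling division
}
```

Under that assumption, a job of about one hour needs roughly 3 round trips at the 1200-second maximum versus 60 at the 60-second default, which is the motivation for a long polling value on long-running jobs.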

Please refer to the NVCF document here.

Signed-off-by: Sachin Pisal <spisal@nvidia.com>

github-actions bot commented Feb 2, 2025

CUDA Quantum Docs Bot: A preview of the documentation can be found here.

github-actions bot pushed a commit that referenced this pull request Feb 2, 2025
@sacpis
Collaborator Author

sacpis commented Feb 2, 2025

I am planning to test this using long-running examples with a larger number of shots. That will keep the workers busy, so incoming jobs will wait in the queue. This will trigger the queue timeout, which will return 504 to the client.
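The expected outcome of such a test can be reasoned about with a toy model. The sketch below is not NVCF behavior: it assumes a single worker, all jobs submitted at time 0 and served in order, and a job reported as 504 once its wait in the queue exceeds the queue timeout. The function name is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Toy model, not NVCF behavior: one worker, all jobs submitted at time 0
// and served in submission order. A job whose wait in the queue exceeds
// queueTimeout is reported as 504; otherwise it is picked up (202).
std::vector<int> simulateStatuses(const std::vector<int> &jobSeconds,
                                  int queueTimeout) {
  std::vector<int> statuses;
  int workerFreeAt = 0; // time at which the worker can take the next job
  for (int duration : jobSeconds) {
    assert(duration >= 0);
    if (workerFreeAt > queueTimeout) {
      statuses.push_back(504); // never picked up, so no worker time is used
    } else {
      statuses.push_back(202);
      workerFreeAt += duration;
    }
  }
  return statuses;
}
```

For example, three 100-second jobs with a 150-second queue timeout yield 202, 202, 504 in this model: the third job would wait 200 seconds for the worker.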

Please let me know if you have any other thoughts for testing this.

@sacpis sacpis changed the title [WIP] Testing the new NVCF feature enable-gateway-timeout Testing the new NVCF feature enable-gateway-timeout Feb 4, 2025
@sacpis sacpis marked this pull request as ready for review February 4, 2025 19:33

github-actions bot commented Feb 4, 2025

CUDA Quantum Docs Bot: A preview of the documentation can be found here.

github-actions bot pushed a commit that referenced this pull request Feb 4, 2025
{"nvcf-feature-enable-gateway-timeout", "true"},
// The max timeout for the polling response is 20 minutes
// https://docs.nvidia.com/cloud-functions/user-guide/latest/cloud-function/api.html#http-polling
{"NVCF-POLL-SECONDS", "1200"}};
Collaborator


What exactly is this doing? I think the prior behavior was that the user would get some sort of "heartbeat" polling message approximately every 5 seconds. Is that going away? Do they have to wait 1200 seconds for that heartbeat now? Does it still run correctly if their job takes ~1 hour?

2 participants