Testing the new NVCF feature enable-gateway-timeout #2576

sacpis · 2025-02-02T19:49:13Z

Testing the new NVCF feature enable-gateway-timeout.

Currently, when a client invokes an API, the job is placed in a queue on the server side, and the queue has a timeout. If a worker does not (or cannot) pick up the job within the queue timeout, an HTTP 202 (request accepted) response is sent back to the client, even though no worker has picked up the job from the queue.

The new feature returns a correct HTTP response to the client that indicates what happened to the request. If the job is not picked up by a worker within the queue timeout, an HTTP 504 (gateway timeout) response is sent back to the client, indicating that no worker picked up the job from the queue within the configured queue timeout. In this case, the client can resend the request, or we can add a retry mechanism.
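As a rough illustration of the retry idea mentioned above, here is a minimal sketch, not the implementation in this PR. The names `isQueueTimeout` and `submitWithRetry` are hypothetical, and the `submit` callback stands in for whatever code actually posts the job and returns the HTTP status code.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper: true when the status is the 504 (gateway timeout)
// that signals no worker picked up the job within the queue timeout.
bool isQueueTimeout(int httpStatus) { return httpStatus == 504; }

// Hypothetical helper: calls `submit` until it returns something other than
// 504 or until `maxAttempts` submissions have been made. Returns the last
// status seen, so the caller can still distinguish 202 from a final 504.
int submitWithRetry(const std::function<int()> &submit, int maxAttempts) {
  assert(maxAttempts > 0 && "at least one attempt is required");
  int status = 0;
  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    status = submit();
    if (!isQueueTimeout(status))
      break; // accepted (or failed for another reason); stop retrying
  }
  return status;
}
```

A real client would probably also wait between attempts, since an immediate resubmission is likely to land in the same busy queue.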

With the new NVCF feature, once a worker picks up the job within the queue timeout, an HTTP 202 response is sent back to the client. The client then needs to poll the server for the result. The poll interval can be set in the request header with the key NVCF-POLL-SECONDS, to a value between 1 minute (default) and 20 minutes (maximum). For long-running jobs, a longer polling value is recommended.
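To make the header settings concrete, the following is a small sketch of how a client might build the two request headers. The helper `makeNvcfHeaders` and the constants are hypothetical; the header keys match those used in this PR's change to `BaseRestRemoteClient.h`. The lower bound of 1 second is an assumption made only to reject non-positive values.

```cpp
#include <algorithm>
#include <map>
#include <optional>
#include <string>

// Values taken from the description above: 1 minute default, 20 minutes max.
constexpr int kDefaultPollSeconds = 60;
constexpr int kMaxPollSeconds = 1200;

// Hypothetical helper: builds the NVCF request headers, falling back to the
// default poll interval and capping the value at the stated maximum.
std::map<std::string, std::string>
makeNvcfHeaders(std::optional<int> pollSeconds) {
  int seconds = pollSeconds.value_or(kDefaultPollSeconds);
  // Assumption: anything below 1 second is raised to 1.
  seconds = std::clamp(seconds, 1, kMaxPollSeconds);
  return {{"nvcf-feature-enable-gateway-timeout", "true"},
          {"NVCF-POLL-SECONDS", std::to_string(seconds)}};
}
```

Calling `makeNvcfHeaders(std::nullopt)` yields the 60-second default, while a request for more than 1200 seconds is capped at the 20-minute maximum.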

Please refer to the NVCF document here.

Signed-off-by: Sachin Pisal <spisal@nvidia.com>

github-actions · 2025-02-02T21:17:46Z

CUDA Quantum Docs Bot: A preview of the documentation can be found here.

sacpis · 2025-02-02T21:40:49Z

I am planning to test this using long-running examples with a larger number of shots. That will keep the workers busy, so incoming jobs will wait in the queue. This should trigger the queue timeout, which will return 504 to the client.

Please let me know if you have any other thoughts for testing this.

github-actions · 2025-02-04T21:05:55Z

CUDA Quantum Docs Bot: A preview of the documentation can be found here.

bmhowe23 · 2025-02-04T21:30:14Z

runtime/common/BaseRestRemoteClient.h

+        {"nvcf-feature-enable-gateway-timeout", "true"},
+        // The max timeout for the polling response is 20 minutes
+        // https://docs.nvidia.com/cloud-functions/user-guide/latest/cloud-function/api.html#http-polling
+        {"NVCF-POLL-SECONDS", "1200"}};


What exactly is this doing? I think the prior behavior was that the user would get some sort of "heartbeat" polling message approximately every 5 seconds. Is that going away? Do they have to wait 1200 seconds for that heartbeat now? Does it still run correctly if their job takes ~1 hour?

Testing the new NVCF feature enable-gateway-timeout

80345d0

Signed-off-by: Sachin Pisal <spisal@nvidia.com>

sacpis requested review from bettinaheim and bmhowe23 February 2, 2025 19:49

github-actions bot pushed a commit that referenced this pull request Feb 2, 2025

Docs preview for PR #2576.

325f9db

sacpis changed the title ~~[WIP] Testing the new NVCF feature enable-gateway-timeout~~ Testing the new NVCF feature enable-gateway-timeout Feb 4, 2025

Merge branch 'main' into add_enable_gateway_timeout_header

90ec4be

sacpis marked this pull request as ready for review February 4, 2025 19:33

github-actions bot pushed a commit that referenced this pull request Feb 4, 2025

Docs preview for PR #2576.

e0dfda2

bmhowe23 reviewed Feb 4, 2025
