Batching does not improve performance with dali #178
Comments
Hello @hly0025, can you tell us how the time measurement in your application is done, and do you suspect why it could yield such different perf results?
Hello @banasraf

Time Measurements

Thanks for your reply. Here is how the measurement was done. I timed, individually, the inference calls for pre-processing and object detection:
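(The original helper is not reproduced above; below is only a minimal sketch of this kind of timing wrapper, with a hypothetical tritonclient client and placeholder model and file names.)

```python
import csv
import statistics
import time

def timed_infer(client, model_name, inputs, csv_path):
    """Time one Triton inference call and append the latency (ms) to a CSV file."""
    start = time.perf_counter()
    result = client.infer(model_name, inputs)  # tritonclient call; names are placeholders
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    with open(csv_path, "a", newline="") as f:
        csv.writer(f).writerow([model_name, elapsed_ms])
    return result

def median_latency(csv_path):
    """Read the recorded latencies back and compute the median."""
    with open(csv_path) as f:
        latencies = [float(row[1]) for row in csv.reader(f)]
    return statistics.median(latencies)
```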
This would write to the file each time we called this function. I then took all the results in the CSV file and got the average median time for each model. This is how I would get a median avg of 9.017 ms for inferencing with batch size 1 for object detection.

Additional Information

I realize that, in the above, I only gave numbers for object detection. For the pre-processor:

Batch 1 pre-processor avg request latency is 2.374 ms, and batch 1 pre-processor inferencing in our application code is a median avg of 21.94 ms.

Batch 8 pre-processor avg request latency is 10.749 ms, and batch 8 pre-processor inferencing in our application code is a median avg of 238.028 ms.

We are using Triton.

Summary

For me, I think the biggest mystery is why pre-processing is so much slower at batch size 8.
@hly0025, thank you for the thorough analysis. I got somewhat confused by all the numbers you provided, so I've put them in a table. Could you please verify whether all these numbers and descriptions are correct according to your data?

Numbers
I'm especially confused about the

Analysis

First of all, let's note that the
Secondly, we should also note that the Application measurements are latency measurements. While perf_analyzer also provides throughput, it is not measured by the Application with the code snippet you've provided. So the remaining question is why there is no perf improvement when using batch size 8.

Would this analysis be reasonable with regards to your environment and requirements? Please let me know if you have any questions. Also, in case something in my analysis looks incorrect, it would be great if you could clarify the two measurements I mentioned above.
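To make the latency-vs-throughput distinction concrete: perf_analyzer at concurrency 2 keeps two requests in flight at all times, while a sequential client-side timing loop only ever has one. A rough sketch of the concurrency-2 pattern, assuming the tritonclient HTTP API and hypothetical model/input names:

```python
import numpy as np
import tritonclient.http as httpclient

# Hypothetical model/input names; the real ones come from the ensemble config.
client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=2)

def make_batch(batch_size):
    data = np.zeros((batch_size, 3, 640, 640), dtype=np.float32)  # placeholder input
    inp = httpclient.InferInput("INPUT_0", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    return [inp]

# Two requests in flight at once, roughly what perf_analyzer does at concurrency 2;
# a sequential timing loop only ever has one.
pending = [client.async_infer("ensemble_model", make_batch(8)) for _ in range(2)]
results = [p.get_result() for p in pending]
```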
Thanks for your reply and thorough remarks. I appreciate your patience in combing through my explanation to try to understand the issue. On my side, I believe putting the numbers into a table is a good idea. Here is, upon your request, my review of the numbers. Your analysis is very close, but an important distinction to make is that on the application side, I decided to time the infer time for object detection and pre-processing separately. For now, I summarize the data again, and I hope this better conveys the information:

Timing Numbers

Perf-analyzer Results

Here are the results obtained using perf_analyzer with the Triton configs I shared above:
Application Results

Here are the results obtained using the timing code described above:
Summary

The real confusion for me is that the application does not perform as I would expect at batch size 8. I hope separating out the results obtained via perf-analyzer from those obtained from the application helps make things clearer. In my mind, the pre-processing is taking some time and I am not sure why.

To briefly recap: the application is guaranteed to send a batch size of 8 from the client side to Triton. Always. Given the configs, my understanding is that Triton should process this batch of 8 from the client side as one batch and send it back. However, at least where pre-processing is concerned, it seems to be having an issue.

Thanks kindly again for your thorough remarks and response.

PS - I can time the ensemble (doing it all at once) if you like. However, my hope is that by digging into pre-processing and object detection separately, that will help with the diagnostics, so to speak.
@hly0025, thank you for clarifying the numbers. To be frank, I rather trust the perf_analyzer measurements, and they actually look promising.

Could we take some time to verify whether the Application measurements are reliable? I mean, perf_analyzer by default runs multiple iterations until the time measurements are stable enough. Could you tell how many inference iterations you ran when taking these measurements? Also, it is natural that the first few iterations will be slower because of the memory allocations that happen underneath. Are you conducting a warmup before running the performance test?

Could you also provide a little more statistics? You've measured the median; is it possible to measure the average and standard deviation as well? The more data you provide, the higher the chance we have of finding the root cause of the discrepancy between the results.
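As an illustration, a warmup-then-measure loop of roughly the following shape would cover both points; the `run_inference` callable and the iteration counts are placeholders, not your actual code:

```python
import statistics
import time

WARMUP_ITERS = 20     # discarded; the first calls pay for allocations and initialization
MEASURE_ITERS = 200   # assumption: enough iterations for stable numbers

def benchmark(run_inference):
    for _ in range(WARMUP_ITERS):
        run_inference()
    latencies_ms = []
    for _ in range(MEASURE_ITERS):
        start = time.perf_counter()
        run_inference()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "stdev_ms": statistics.stdev(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
    }
```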
Thank you for your reply, that makes sense.

Brief Recap

On the application side (client side), we set the batch size to 8. This ensures that we are sending data in batch sizes of 8 to the Triton server.
For statistics, please see the IQR, min, max, median, and the 25th and 75th percentiles.
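(The table itself is not reproduced here. For reference, these statistics can be computed from the recorded per-request latencies along the following lines with NumPy; the function name is just a placeholder.)

```python
import numpy as np

def summarize(latencies_ms):
    """Summary statistics for a list of per-request latencies in milliseconds."""
    arr = np.asarray(latencies_ms, dtype=float)
    p25, p50, p75 = np.percentile(arr, [25, 50, 75])
    return {
        "min": float(arr.min()),
        "max": float(arr.max()),
        "p25": float(p25),
        "median": float(p50),
        "p75": float(p75),
        "iqr": float(p75 - p25),
    }
```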
Summary

I can provide the standard deviation and average if desired, but I hope the min, max, median, and IQR help address the core of what you need to assess the application side of things more clearly. If you will pardon the colloquial English expression, this is admittedly somewhat apples (application timing) to oranges (perf-analyzer). Nevertheless, I believe the pre-processing is still slower than I would anticipate based on the perf-analyzer results. Thank you for your remarks and questions.
Gentle inquiry, @szalpal: is there a status update on this, or any thoughts? Thanks kindly in advance!
I also use DALI for my model's preprocessing. No matter how the parameters are adjusted, DALI throughput does not improve and can only reach a maximum of 750. However, if I use nvJPEGDecMultipleInstances for decoding, the decoding throughput can reach 2100. I am using the COCO/val2017 dataset and running it on an A10.

dali pipe
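(The pipeline definition itself is not reproduced above; a minimal sketch of a DALI decoding pipeline of this general shape, with an assumed external-source input name and output size, serialized for the Triton DALI backend, would look roughly like this.)

```python
from nvidia.dali import fn, pipeline_def, types

@pipeline_def(batch_size=32, num_threads=4, device_id=0)
def decode_pipeline():
    # Encoded JPEG bytes delivered by the Triton DALI backend; the input name is an assumption.
    encoded = fn.external_source(device="cpu", name="DALI_INPUT_0")
    # Hybrid CPU/GPU JPEG decoding (nvJPEG underneath).
    images = fn.decoders.image(encoded, device="mixed", output_type=types.RGB)
    return fn.resize(images, resize_x=640, resize_y=640)

if __name__ == "__main__":
    # Serialize the pipeline so the DALI backend can load it as a Triton model.
    decode_pipeline().serialize(filename="model.dali")
```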
dali config.pbtxt
perf_analyzer parameters and result
nvJPEGDecMultipleInstances parameters and result
In the snippet you've provided (perf_analyzer parameters), I see you're benchmarking
I'm sorry for my delayed response. I checked my command; previously I used bls_async_pre1 to run the benchmark, which runs the DALI pipeline through the Python backend. The next result uses the DALI backend. Using the DALI backend is faster than using the Python backend, but GPU performance never reaches its maximum.
Issue
Batching does not improve performance with dali.
Description
In summary, inference slows down as we increase the batch size in our application.
We have an application that sends data to Triton for inferencing. As mentioned above, batching does not seem to improve performance with DALI. We are using an ensemble model that uses DALI for preprocessing and then does object detection with YOLO.
Specifically, a batch size of 8 is significantly slower than a batch size of 1. We have only seen this with the DALI portion of the pipeline, which is much slower than the object-detection portion of the application.
Using perf-analyzer with batch sizes 1 and 8 at a concurrency of 2 showed improved inferences/sec, as one might expect. However, this has not been observed in the application. Manual timing of the application has shown that DALI takes up the majority of the inference time (object detection seems to be fine).
It is worth mentioning that we are testing by sending batches from the application as well as using dynamic batching in Triton, as can be seen in the configs below.
Perf Analyzer/Application Infer Timing
We ran our application and timed the median inference latency in milliseconds for preprocessing and object detection at batch sizes 1 and 8. We also ran perf-analyzer at batch sizes 1 and 8 against Triton using the configs provided below.
Batch 1: object-detection avg request latency is 4.623 ms, and timing object-detection inference in our application code gives a median avg of 9.017 ms. For batch 1, perf-analyzer at concurrency 2 reports a throughput of 91.9781 infer/sec.

Batch 8: object-detection avg request latency is 10.749 ms, and timing object-detection inference in our application code gives a median avg of 86.335 ms. For batch 8, perf-analyzer at concurrency 2 reports a throughput of 170.247 infer/sec.
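Treating the avg request latency as the perf-analyzer measurement and the median avg as the application measurement, the implied per-image latencies work out as follows:

```python
# Per-image latency implied by the object-detection numbers above.
perf_analyzer_ms = {1: 4.623, 8: 10.749}   # avg request latency
application_ms = {1: 9.017, 8: 86.335}     # median latency measured in the application

for name, latencies in (("perf_analyzer", perf_analyzer_ms), ("application", application_ms)):
    for batch, ms in latencies.items():
        print(f"{name}, batch {batch}: {ms / batch:.2f} ms per image")

# perf_analyzer: 4.62 -> 1.34 ms per image (batching clearly helps)
# application:   9.02 -> 10.79 ms per image (no improvement)
```

By this reading, perf-analyzer's per-image latency improves by roughly 3.4x at batch size 8, while the application's per-image latency does not improve at all, which is exactly the discrepancy described above.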
Additional information from perf-analyzer has been attached as a csv.
Config Information
Here is the configuration information:
ensemble- config.pbtxt
object detection - config.pbtxt
pre-processing - config.pbtxt
dali.py
Questions
Perf-Analyzer CSV Output
ensemble-concur2-ceiling8-batch8.csv
ensemble-concur2-ceiling8-batch1.csv