Bug: Batches are not evaluated in parallel #2258
Comments
Any updates on this issue? |
Can you maybe share your setup and experience with training using Coqui tools? I am trying to improve my setup, but I have no clue which or how many GPUs I should go for. :( |
Can you give more details on which information you need, @SuperKogito? One of the problems in my setup was caused by Python's multiprocessing. I could not figure out the specific problem, but using MPI instead of a multiprocessing pool helped. Moreover, I switched to Horovod for multi-GPU usage, but it is not supported in official STT. To sum it up, from my perspective you will not benefit from running on more than 2 A100 devices, since this seems to be of no interest to the devs at the moment. |
Hello @NanoNabla and thank you for your response, |
I think with 4 GB of GPU memory you should be able to use batch sizes bigger than 1. Using bigger batch sizes will result in shorter training time. Nevertheless, in my opinion this is not related to this issue. Maybe open a separate issue. |
@NanoNabla could you share your version with Horovod and MPI if it works better? |
Before we can solve this issue, we need to understand where it comes from. The first thing we need is to make sure that the issue lies within the STT code base and not the NVIDIA TF image. To do that, someone with more than 2 GPUs (ideally 4 or more) should try to simulate a fake load on the NVIDIA TensorFlow base image ( If it doesn't, it means the issue is within the NVIDIA TF image, not STT; otherwise it's an STT-specific issue and we should try to hunt it down.
We take interest in all issues. That doesn't mean we have to react to every single one of them, especially when we don't know what is causing the issue in the first place. Please run the requested test and let us know the results; we shall see how to deal with it. EDIT: @NanoNabla, you said:
This seems to indicate that it's an issue with STT. Could you share your MNIST load so that others can confirm your results? (@FivomFive maybe?) (give us some logs) If it turns out it's really only an issue with STT, we need to find exactly what is causing this behavior, but I have no idea how we could even begin to diagnose this issue. |
Hi guys, I'm experiencing the same issue. I am unable to use more than 3 GPUs at the same time. My system currently consists of 8 GPUs (NVIDIA A100). I tried using different Coqui Docker images and the outcome is the same. The next thing I tried was updating the NVIDIA TensorFlow base image to the latest version in Dockerfile.train. The custom build had the same outcome! @wasertech I can provide logs or anything else that is needed in order to be helpful to you guys. Kind regards, |
Hi guys, nice to hear that this issue is becoming active now and that other people are also running into the same issue.
To be honest, I did not expect any further activity here, since there were no responses in nearly 5 months. My project using STT will end soon, so I will only have limited time to spend on further STT support.
It was just some random MNIST code for TensorFlow 1 that I found on GitHub, if I remember correctly, to check whether kernels are executed in parallel. For better benchmarking results, maybe we should use the old TensorFlow Benchmarks, since they still work with TensorFlow 1. I can provide output logs and/or Nsight profiles for this benchmark or any other small TensorFlow ones if you wish. Moreover, as I already mentioned, with the Horovod framework kernels are executed in parallel, but then we ran into other problems on our system. Maybe I could provide some code, if you are interested in it. |
No offense, but not everybody has 3+ GPU systems, let alone 8xA100's :) |
If we look here, we can see that they used a server with 8GPU nVidia A100 for their checkpoints. Why are we unable to do so? |
Yes but release 1.0.0 is based on DeepSpeech v0.9.3.
@arbianqx You can try with DeepSpeech and let us know if you encounter a similar behavior. It might help us understand where it comes from. |
I see. I used the latest deepspeech docker image, which was published back in 2021. I was able to run on 6GPU (nVidia GeForce 1660) but unable to do so on A100 GPUs. |
Why? What’s the issue? Logs please. |
Why not try to debug it with Coqui STT v1.0.0 and these settings? If this works, then the problem was introduced after that point. Or it is related to flags... |
I'm trying to debug it with these params on 1.0.0. First of all I'm not sure how much shared memory I should specify on the docker run command:
Running with shared memory of 16, 32, 64, and 128g (I tried all these numbers), I managed to use 8 GPUs only with a batch size of 16, but not 64 as was shown in the GitHub gist. |
Just let nvidia tell you how much to fit. Run it without
As for batch sizes, it means nothing; it's probably due to something else, like the data. But it would mean that DS works with a high GPU count, meaning the error was introduced after 1.0.0. |
Okay tried running to test transfer learning with Coqui 1.0.0 using 8GPU A100, with batch_size 24 `I STARTING Optimization
|
I hoped to get some support even if the hardware is not available to the devs. Moreover, it seems I'm not the only one with access to such a system, which is, by the way, not unusual in HPC environments ;)
I was able to train with 8 GPUs, but it is not efficient. So the question is: have they checked whether running on 8 GPUs is faster than running on just 2 or 3 GPUs? On our system, we see many people running parallel applications with a huge amount of allocated resources without checking whether their application is really using them efficiently.
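One way to answer the question above (is 8 GPUs really faster than 2 or 3?) is to time one epoch per GPU count and compare the speedup against the ideal linear one. The following is only a minimal sketch: the function name `scaling_report` and the timings in the usage example are made up for illustration and are not measurements from this thread.

```python
def scaling_report(epoch_seconds):
    """Compute speedup and parallel efficiency per GPU count.

    epoch_seconds maps GPU count -> measured seconds per epoch.
    The smallest GPU count present is used as the baseline.
    """
    base_gpus = min(epoch_seconds)
    base_time = epoch_seconds[base_gpus]
    report = {}
    for gpus in sorted(epoch_seconds):
        speedup = base_time / epoch_seconds[gpus]  # how much faster than baseline
        ideal = gpus / base_gpus                   # perfect linear scaling
        report[gpus] = (speedup, speedup / ideal)  # (speedup, efficiency)
    return report


if __name__ == "__main__":
    # Hypothetical timings, NOT taken from any run discussed here.
    timings = {1: 100.0, 2: 50.0, 4: 40.0}
    for gpus, (speedup, efficiency) in scaling_report(timings).items():
        print(f"{gpus} GPU(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

An efficiency that drops sharply when going beyond 2 or 3 devices would match the behavior reported in this issue.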
Is this Docker container based on Dockerfile.build.tmpl? This Docker image is using |
@reuben can correct me, but as much as I remember, the evaluation code had limitations that made it unable to run on parallel GPU / CPUs, from the beginning. Fixing that is non trivial |
@arbianqx Please format your logs correctly. The error Thanks @lissyx ! |
Hi, it was a transfer learning issue: I forgot to add alphabet.txt correctly. On newer Coqui Docker images, it is unable to do so. |
@arbianqx, so that would mean the problem was introduced after v1.0.0, or it depends on the parameters. Right? Is it using the power of all GPUs well? |
As far as I can see, it is related to the version. After v1.0.0 it stopped utilising multiple GPUs. |
Describe the bug
I tried running STT on a system with 8 NVIDIA A100 GPUs. I found that running on up to 4 devices does not scale well. Each batch seems to be serialized across the GPUs. In the first steps everything looks OK, but then parallel execution gets worse over time.
I attached traces generated by NVIDIA Nsight Systems.
I checked my environment using a simple MNIST example, where this problem does not occur.
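To put a number on "serialized across the GPUs", one could export the kernel start and end times per device from a trace and measure how much of the busy time has at least two devices active at once. This is only a rough sketch: `parallel_fraction` and the interval format are hypothetical, not an Nsight Systems API, and it assumes that intervals on the same device do not overlap each other.

```python
def parallel_fraction(busy):
    """Fraction of busy wall time during which two or more devices are active.

    busy maps a device name -> list of (start, end) kernel intervals.
    Returns 0.0 for fully serialized execution, 1.0 for fully overlapped.
    """
    events = []
    for intervals in busy.values():
        for start, end in intervals:
            events.append((start, 1))   # a device becomes busy
            events.append((end, -1))    # a device becomes idle
    # At equal timestamps, process "idle" before "busy" so that
    # back-to-back intervals are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))

    active = 0
    any_busy = 0.0
    multi_busy = 0.0
    prev = None
    for timestamp, delta in events:
        if prev is not None and active > 0:
            any_busy += timestamp - prev
            if active > 1:
                multi_busy += timestamp - prev
        active += delta
        prev = timestamp
    return multi_busy / any_busy if any_busy else 0.0


if __name__ == "__main__":
    # Made-up intervals: the two devices take turns, so nothing overlaps.
    serialized = {"gpu0": [(0, 1), (2, 3)], "gpu1": [(1, 2), (3, 4)]}
    print(parallel_fraction(serialized))
```

A value that starts high in the first steps and then falls would correspond to the degradation described above.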
Environment (please complete the following information):
I used NVIDIA's Docker container
nvcr.io/nvidia/tensorflow:22.02-tf1-py3
as it is used in Dockerfile.train