I ran a job that reads out of an s3 store and puts the data in elasticsearch. The bucket has 57 files in it, and each file is anywhere from 1 to 5 gigabytes. When the s3_reader goes to make slices, it first lists all the objects under the provided path. The max keys on this request is set to 1000 (the default for the listObjects method), meaning a single request returns metadata for up to 1000 objects.
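For reference, the listing step looks roughly like this. This is just a sketch using AWS SDK v3's ListObjectsV2Command; the region, bucket, and prefix are placeholders, not the reader's actual call.

```ts
import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3';

async function listBucket() {
    // Placeholder region/bucket/prefix, for illustration only.
    const client = new S3Client({ region: 'us-east-1' });

    // One request returns metadata for up to MaxKeys objects; MaxKeys is not
    // set here, so the service default of 1000 applies.
    const response = await client.send(new ListObjectsV2Command({
        Bucket: 'example-bucket',
        Prefix: 'incoming/',
    }));

    // 57 objects in my case, so everything comes back in a single page.
    return response.Contents ?? [];
}
```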
The s3-slicer (file-assets/packages/file-asset-apis/src/s3/s3-slicer.ts, lines 26 to 35 in f6f215c) pushes a promise for each key in the list onto an actions array; each promise segments its file based on size and creates slice records for it. In my case it pushes 57 promises into the array and ends up creating 1,949,527 slice objects. It then runs the createSlice() function on each of those, which adds metadata to each record and pushes the record into the slicer queue. I added 1GB of memory to the execution controller pod, and it OOM'ed at approximately 336,891 slice records in the queue.
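To make the memory behavior concrete, here is a rough approximation of that eager pattern. This is not the actual s3-slicer code; the names and interfaces are illustrative only.

```ts
// Illustrative shape of a slice record; the real records also carry metadata
// added by createSlice().
interface SliceRecord {
    key: string;
    offset: number;
    length: number;
}

// Split one object into byte-range slices.
function segmentObject(key: string, size: number, sliceSize: number): SliceRecord[] {
    const slices: SliceRecord[] = [];
    for (let offset = 0; offset < size; offset += sliceSize) {
        slices.push({ key, offset, length: Math.min(sliceSize, size - offset) });
    }
    return slices;
}

// Awaiting every key up front means every slice record for every file exists
// in memory before any of them is dispatched. With 57 multi-gigabyte objects
// this is where the ~1.9 million slice objects come from.
async function buildAllSlices(
    objects: Array<{ key: string; size: number }>,
    sliceSize: number
): Promise<SliceRecord[]> {
    const actions = objects.map(async (obj) => segmentObject(obj.key, obj.size, sliceSize));
    const perObject = await Promise.all(actions);
    return perObject.flat();
}
```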
Potential solutions:
We could add a configuration setting to the s3_reader that lets us manually set how many files we create slices for at a time. This would set maxKeys to limit how many objects a single list request returns and paginate through the remaining pages (sketched below). The problem is that this doesn't really resolve the issue, and the user would have to be aware of the workaround. Also, a single massive file, say 100GB, would still OOM.
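A rough sketch of what that could look like, assuming AWS SDK v3's ListObjectsV2 pagination; the filesPerBatch parameter stands in for the hypothetical new setting, not an existing s3_reader option.

```ts
import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3';

// Walk the bucket one page at a time, capping each page at the configured
// number of files so slices are only created for that batch before moving on.
async function* listObjectPages(
    client: S3Client,
    bucket: string,
    prefix: string,
    filesPerBatch: number // hypothetical s3_reader setting
) {
    let continuationToken: string | undefined;
    do {
        const page = await client.send(new ListObjectsV2Command({
            Bucket: bucket,
            Prefix: prefix,
            MaxKeys: filesPerBatch,
            ContinuationToken: continuationToken,
        }));
        yield page.Contents ?? [];
        continuationToken = page.NextContinuationToken;
    } while (continuationToken);
}
```

Even with this, each batch still gets fully segmented, so one very large object would still expand into enough slice records to hit the same limit.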
Add logic around the s3_slicer that would only allow it to submit a maximum number of slices at a time (a rough sketch follows). There are potentially a handful of issues with this.
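One way to picture that limit is a bounded queue that blocks the slicer whenever a maximum number of slices is already outstanding. This is only a sketch; maxQueuedSlices and the queue interface are assumptions, not the real execution controller API.

```ts
// Bounded in-memory slice queue: the slicer awaits push() and is held back
// once the queue reaches maxQueuedSlices; each shift() by a worker releases
// one waiting producer.
class BoundedSliceQueue<T> {
    private queue: T[] = [];
    private waiters: Array<() => void> = [];

    constructor(private readonly maxQueuedSlices: number) {}

    async push(slice: T): Promise<void> {
        while (this.queue.length >= this.maxQueuedSlices) {
            await new Promise<void>((resolve) => this.waiters.push(resolve));
        }
        this.queue.push(slice);
    }

    shift(): T | undefined {
        const slice = this.queue.shift();
        this.waiters.shift()?.();
        return slice;
    }
}
```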
Here is my connector config. Keep in mind this was with a patched version of file-asset-apis that allows the aws sdk client to perform anonymous requests to public s3 stores (#1158); at the time of this issue we didn't have that feature.