[bug] s3_reader slicer OOMs when reading out of large datasets #1159

Open
sotojn opened this issue Jan 29, 2025 · 1 comment
Comments

@sotojn
Contributor

sotojn commented Jan 29, 2025

I ran a job that reads out of an s3 store and puts the data in elasticsearch. The bucket has 57 files in it, and each file averages anywhere from 1 to 5 gigabytes. When the s3_reader goes to make slices, it first lists all the objects under the provided path. The max keys on this request is set to 1000 (the default for the listObjects method), meaning a single request returns metadata for up to 1000 objects.
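For reference, here is roughly what that listing pattern looks like against the AWS SDK v3 directly. This is a minimal sketch, not the actual file-asset-apis code; listAllObjects is a hypothetical helper name:

import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3';

// Walk every object under a prefix, one page (up to 1000 keys by default) at a time.
async function listAllObjects(client: S3Client, bucket: string, prefix: string): Promise<string[]> {
    const keys: string[] = [];
    let continuationToken: string | undefined;

    do {
        const page = await client.send(new ListObjectsV2Command({
            Bucket: bucket,
            Prefix: prefix,
            // MaxKeys defaults to 1000 when omitted
            ContinuationToken: continuationToken
        }));
        for (const obj of page.Contents ?? []) {
            if (obj.Key) keys.push(obj.Key);
        }
        continuationToken = page.NextContinuationToken;
    } while (continuationToken);

    return keys;
}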

My job file:

{
    "name": "grab-taxi-data",
    "lifecycle": "persistent",
    "workers": 1,
    "log_level": "trace",
    "memory_execution_controller": 1073741824,
    "assets": [
        "elasticsearch",
        "file"
    ],
    "operations": [
        {
            "_op": "s3_reader",
            "path": "datasets-documentation/nyc-taxi",
            "size": 50000,
            "format": "tsv"
        },
        {
            "_op": "elasticsearch_bulk",
            "size": 50000,
            "index": "nyc-taxi-data"
        }
    ]
}

The s3-slicer:

private async getObjects(): Promise<FileSlice[]> {
    const data = await s3RequestWithRetry({
        client: this.client,
        func: listS3Objects,
        params: {
            Bucket: this.bucket,
            Prefix: this.prefix,
            ContinuationToken: this._nextToken
        }
    });

It will push a promise for each key in the list onto an actions array; each promise segments its file based on size and creates slice records for it. In my case it pushes 57 promises into the array and ends up creating 1,949,527 slice objects. It then runs the createSlice() function on each one of those, which adds metadata to each record and pushes the record into the slicer queue. I added 1GB of memory to the execution controller pod, and it OOMed with approximately 336,891 slice records in the queue.
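To make the math concrete, here is a simplified, hypothetical sketch of the segmentation step (segmentObject and FileSliceSketch are stand-ins for the real file-asset-apis logic, which chunks each object into byte ranges of `size` bytes):

interface FileSliceSketch {
    path: string;
    offset: number;
    length: number;
}

// Chunk one object into byte-range slices of `size` bytes. A 3 GB object with
// size=50000 yields roughly 60,000 slices, so 57 objects of 1-5 GB each easily
// add up to the ~1.9M slice records described above.
function segmentObject(key: string, objectSize: number, size: number): FileSliceSketch[] {
    const slices: FileSliceSketch[] = [];
    for (let offset = 0; offset < objectSize; offset += size) {
        slices.push({
            path: key,
            offset,
            length: Math.min(size, objectSize - offset)
        });
    }
    return slices;
}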

Potential solutions:

  1. We could add a configuration setting to the s3_reader that lets us manually set how many files we create slices for at a time. This would modify maxKeys to limit what a single request can return, and paginate the rest of the pages. The issue is that it doesn't really resolve the problem, and the user would have to be aware of this workaround. Also, a single massive file (say 100 GB) would still OOM.
  2. Add logic around the s3_slicer that would only allow it to submit a maximum number of slices at a time (see the sketch after this list). There are potentially a handful of issues with this.
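For illustration, a rough sketch of what option 2 could look like (hypothetical names, not existing teraslice APIs): generate slices lazily and only push more into the queue once it drains below a cap.

// Hypothetical cap on how many slices may sit in the slicer queue at once.
const MAX_QUEUED_SLICES = 10000;

async function enqueueWithBackpressure<T>(
    slices: AsyncIterable<T>,
    queueLength: () => number,
    enqueue: (slice: T) => void
): Promise<void> {
    for await (const slice of slices) {
        // Wait for the queue to drain below the cap before adding more work.
        while (queueLength() >= MAX_QUEUED_SLICES) {
            await new Promise((resolve) => setTimeout(resolve, 100));
        }
        enqueue(slice);
    }
}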
@sotojn
Contributor Author

sotojn commented Jan 29, 2025

Here is my connector config. Keep in mind this was with a patched version of file-asset-apis that allows the AWS SDK client to perform anonymous requests to public s3 stores (#1158). At the time of this issue we don't have that feature yet.

  s3:
    default:
      endpoint: "https://s3.eu-west-3.amazonaws.com"
      accessKeyId: ""
      secretAccessKey: ""
      forcePathStyle: true
      sslEnabled: true
      region: "eu-west-3"
