[bug] s3_reader slicer OOMs when reading out of large datasets #1159

Open
sotojn opened this issue Jan 29, 2025 · 1 comment
Comments

@sotojn
Contributor

sotojn commented Jan 29, 2025

I ran a job that reads out of an s3 store and puts the data in elasticsearch. The bucket has 57 files in it, and each file averages anywhere from 1 to 5 gigabytes. When the s3_reader goes to make slices, it first lists all the objects under the provided path. The max keys on this request is set to 1000 (the default for the listObjects method), meaning a single request returns metadata for up to 1000 objects.
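For reference, here is roughly what that listing pattern looks like against the AWS SDK v3 directly. This is a minimal sketch, not the actual file-asset-apis code; listAllObjects is a hypothetical helper name:

import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3';

// Walk every object under a prefix, one page (up to 1000 keys by default) at a time.
async function listAllObjects(client: S3Client, bucket: string, prefix: string): Promise<string[]> {
    const keys: string[] = [];
    let continuationToken: string | undefined;

    do {
        const page = await client.send(new ListObjectsV2Command({
            Bucket: bucket,
            Prefix: prefix,
            // MaxKeys defaults to 1000 when omitted
            ContinuationToken: continuationToken
        }));
        for (const obj of page.Contents ?? []) {
            if (obj.Key) keys.push(obj.Key);
        }
        continuationToken = page.NextContinuationToken;
    } while (continuationToken);

    return keys;
}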

My job file:

{
    "name": "grab-taxi-data",
    "lifecycle": "persistent",
    "workers": 1,
    "log_level": "trace",
    "memory_execution_controller": 1073741824,
    "assets": [
        "elasticsearch",
        "file"
    ],
    "operations": [
        {
            "_op": "s3_reader",
            "path": "datasets-documentation/nyc-taxi",
            "size": 50000,
            "format": "tsv"
        },
        {
            "_op": "elasticsearch_bulk",
            "size": 50000,
            "index": "nyc-taxi-data"
        }
    ]
}

The s3-slicer:

private async getObjects(): Promise<FileSlice[]> {
    const data = await s3RequestWithRetry({
        client: this.client,
        func: listS3Objects,
        params: {
            Bucket: this.bucket,
            Prefix: this.prefix,
            ContinuationToken: this._nextToken
        }
    });

It will push a promise for each key in the list onto an actions array; each promise segments its file based on size and creates slice records for it. In my case it pushes 57 promises into the array and ends up creating 1,949,527 slice objects. It then runs the createSlice() function on each one of those, which adds metadata to each record and pushes the record into the slicer queue. I added 1GB of memory to the execution controller pod, and it OOMed with approximately 336,891 slice records in the queue.
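To make the math concrete, here is a simplified, hypothetical sketch of the segmentation step (segmentObject and FileSliceSketch are stand-ins for the real file-asset-apis logic, which chunks each object into byte ranges of `size` bytes):

interface FileSliceSketch {
    path: string;
    offset: number;
    length: number;
}

// Chunk one object into byte-range slices of `size` bytes. A 3 GB object with
// size=50000 yields roughly 60,000 slices, so 57 objects of 1-5 GB each easily
// add up to the ~1.9M slice records described above.
function segmentObject(key: string, objectSize: number, size: number): FileSliceSketch[] {
    const slices: FileSliceSketch[] = [];
    for (let offset = 0; offset < objectSize; offset += size) {
        slices.push({
            path: key,
            offset,
            length: Math.min(size, objectSize - offset)
        });
    }
    return slices;
}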

Potential solutions:

  1. We could add a configuration setting to the s3_reader that lets us manually set how many files we create slices for at a time. This would modify maxKeys to limit what a single request can return, and paginate the rest of the pages. The issue is that it doesn't really resolve the problem, and the user would have to be aware of this workaround. Also, a single massive file (say 100 GB) would still OOM.
  2. Add logic around the s3_slicer that would only allow it to submit a maximum number of slices at a time (see the sketch after this list). There are potentially a handful of issues with this.
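For illustration, a rough sketch of what option 2 could look like (hypothetical names, not existing teraslice APIs): generate slices lazily and only push more into the queue once it drains below a cap.

// Hypothetical cap on how many slices may sit in the slicer queue at once.
const MAX_QUEUED_SLICES = 10000;

async function enqueueWithBackpressure<T>(
    slices: AsyncIterable<T>,
    queueLength: () => number,
    enqueue: (slice: T) => void
): Promise<void> {
    for await (const slice of slices) {
        // Wait for the queue to drain below the cap before adding more work.
        while (queueLength() >= MAX_QUEUED_SLICES) {
            await new Promise((resolve) => setTimeout(resolve, 100));
        }
        enqueue(slice);
    }
}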
@sotojn
Contributor Author

sotojn commented Jan 29, 2025

Here is my connector config. Keep in mind this was with a patched version of file-asset-apis that allows the AWS SDK client to perform anonymous requests to public s3 stores (#1158). At the time of this issue we don't have that feature yet.

  s3:
    default:
      endpoint: "https://s3.eu-west-3.amazonaws.com"
      accessKeyId: ""
      secretAccessKey: ""
      forcePathStyle: true
      sslEnabled: true
      region: "eu-west-3"
