Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add eTagValidation for uploadFileMultipart #1130

Merged
merged 18 commits into from
Nov 30, 2023
Merged

add eTagValidation for uploadFileMultipart #1130

merged 18 commits into from
Nov 30, 2023

Conversation

BusyByte
Copy link
Contributor

@BusyByte BusyByte commented Nov 29, 2023

Computes and validates the computed eTag matches the eTag from AWS S3.
Made the validator an option with a default of None so we don't break existing consumers of the function and also allows to opt in rather than forced to get the new behavior.
We have tested this and verified with S3 and have used it successfully.

related to: #1108

@BusyByte BusyByte marked this pull request as draft November 29, 2023 20:56
dmytro.obodowsky and others added 2 commits November 30, 2023 09:18
Refactored `S3.scala` to optimize `checksumPart` and `formatChecksum` functions, thereby improving code readability and performance. Added `S3Benchmark.scala` and `S3OpsStub.scala` for benchmarking S3 multi-part uploads.
This way it can be verified and eventually incorporated in the return from uploadFileMultipart in future versions.
@BusyByte BusyByte marked this pull request as ready for review November 30, 2023 15:56
@BusyByte
Copy link
Contributor Author

@semenodm I don't really have any more items I want to address. Let me know if I need to make any additional changes or if you have questions on anything.

dmytro.obodowsky added 2 commits November 30, 2023 10:13
Modified uploadPart and checksumPart methods in S3.scala. A chunk of bytes is now converted to an array only once for both uploading to S3 and computing MD5 checksum. This change enhances the efficiency of the code by minimizing the number of times the bytes data is processed, which can yield performance improvements especially for larger files.
@semenodm
Copy link
Member

@BusyByte i found the optimization for the digest, take a look at latest commits.
also I was thinking about weather we need to do unconditional digest calculation if no digest calculator was provided.

@BusyByte
Copy link
Contributor Author

@semenodm I'm fine with the changes you made. Yeah, depending on the performance impact we could change it to not digest if there's no validator defined.

@semenodm
Copy link
Member

doing this

Changed the 'digest' field in the PartProcessingOutcome case class to be optional due to the multipartETagValidation changes. Modified the getOverallChecksum and upload parts methods accordingly, to handle Option[PartDigest] safely and efficiently. This is done to prevent unnecessary digest calculations during multipart upload when checksum validation is not necessary
@BusyByte
Copy link
Contributor Author

@semenodm the changes in b70615a LGTM

Changed the 'digest' field in the PartProcessingOutcome case class to be optional due to the multipartETagValidation changes. Modified the getOverallChecksum and upload parts methods accordingly, to handle Option[PartDigest] safely and efficiently. This is done to prevent unnecessary digest calculations during multipart upload when checksum validation is not necessary
@semenodm
Copy link
Member

@BusyByte thank you for the contribution, this was nice one.

@BusyByte
Copy link
Contributor Author

BusyByte commented Nov 30, 2023

@semenodm just some notes for a future major release I think it'd be good to return from uploadFileMultipart

  • Whether md5 validation happened and successful
  • How many total bytes were uploaded to S3
  • What the max partNumber/partId was.

We send events and record metrics with this information and exposing it on the return from uploadFileMultipart would enable these to be captured.
This however would be a breaking change so I am not proposing it until the next major version.

@semenodm semenodm merged commit 8e50179 into laserdisc-io:main Nov 30, 2023
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants