Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sort output of S3_File.list and Enso_File.list #11929

Open
wants to merge 24 commits into
base: develop
Choose a base branch
from

Conversation

GregoryTravis
Copy link
Contributor

@GregoryTravis GregoryTravis commented Dec 19, 2024

Also: Comparable for Vector.
Closes #11899

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • The documentation has been updated, if necessary.
  • Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
  • All code follows the
    Scala,
    Java,
    TypeScript,
    and
    Rust
    style guides. In case you are using a language not listed above, follow the Rust style guide.
  • Unit tests have been written where possible.
  • If meaningful changes were made to logic or tests affecting Enso Cloud integration in the libraries,
    or the Snowflake database integration, a run of the Extra Tests has been scheduled.
    • If applicable, it is suggested to paste a link to a successful run of the Extra Tests.

@GregoryTravis GregoryTravis added the CI: No changelog needed Do not require a changelog entry for this PR. label Dec 19, 2024
@GregoryTravis GregoryTravis marked this pull request as ready for review December 19, 2024 21:25
@GregoryTravis GregoryTravis marked this pull request as draft December 19, 2024 21:25
@@ -187,7 +187,7 @@ type S3_File
S3_File.Value (S3_Path.Value bucket key) self.credentials
files = pair.second . map key->
S3_File.Value (S3_Path.Value bucket key) self.credentials
sub_folders + files
sub_folders + files . sort on=.s3_path
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this just sort files but not sub_folders?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also we could just sort on .path and then we wouldn't need to implement the custom comparator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better to not use a comparator? I was thinking of adding a comparator for S3_File as well -- if a user might have a vector of them, then they might want to sort it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it sorted the files and sub-folders properly:

s3://enso-data-samples/Store Data.xlsx
s3://enso-data-samples/data_2500_rows.csv
s3://enso-data-samples/examples/
s3://enso-data-samples/locations.json
s3://enso-data-samples/products.csv
s3://enso-data-samples/sample.xml
s3://enso-data-samples/spreadsheet.xls
s3://enso-data-samples/spreadsheet.xlsx
s3://enso-data-samples/tableau/
s3://enso-data-samples/transactions.csv

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd still add parentheses, the precedence of + vs . is not something that everyone knows immediately, so I think making it clear with parentheses will make it better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better to not use a comparator? I was thinking of adding a comparator for S3_File as well -- if a user might have a vector of them, then they might want to sort it.

Hmm, I guess we could do that yeah. But still I'd rely on comparing the path as Text.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 1569 to 1588
type Vector_Comparator
compare x y =
min_length = x.length.min y.length
when_prefixes_equal =
## At this point, if the vectors are the same length, then they are
identical; otherwise, the shorter one is lesser.
if x.length == y.length then Ordering.Equal else
Ordering.compare x.length y.length
go i =
if i >= min_length then when_prefixes_equal else
x_elem = x.at i
y_elem = y.at i
Ordering.compare x_elem y_elem . and_then <|
@Tail_Call go (i + 1)
k = go 0
k

hash x = Default_Comparator.hash_builtin x

Comparable.from (that : Vector) = Comparable.new that Vector_Comparator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure this just re-implements Vector_Lexicographic_Order.compare?

I think we did not register it into Comparable on purpose - comparing vectors is not unambiguously defined operation - there's multiple orderings that may seem right depending on context. So we did not want to provide any default one to avoid user confusion and require to choose one explicitly (we implement only lexicographic one but for example an (incomplate) ordering could be the point-wise order.

Of course we can decide that it is beneficial to include a default < for vectors. But I think the original decision was not wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The downside to making the comparable is that if a user accidentally nests vectors, and sorts them, they won't know they made a mistake. But it will be easy to see that the data is nested. Without a comparator, they'll get an error.

The upside is that they'll be able to sort vectors containing more things.

@JaroslavTulach You asked if modern languages allow this; Haskell does:

 main = do
   msp $ [1, 2] < [3, 4]
=> True

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the Vector_Comparator since it's redundant and it's a separate question.

Comment on lines 47 to 48
r . should_equal (r.sort on=.path)
r . should_equal (r.sort on=.enso_path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enso_path is private, I don't think the second check can work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

Comment on lines 264 to 268
group_builder.specify "list should sort its output" <|
r = root.list
r.should_be_a Vector
r . should_equal (r.sort on=.s3_path)
r . should_equal (r.sort on=(x-> x.s3_path.to_text))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, .s3_path is private and implementation detail that should be a black box in the tests, use path that returns Text.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to .path

@GregoryTravis
Copy link
Contributor Author

I removed the changes to vector, and added Comparable for Enso_File and S3_File, under the assumption that it's always good to have more comparators. If not, I can take those out too.

@GregoryTravis GregoryTravis marked this pull request as ready for review December 20, 2024 19:29
@@ -616,3 +616,11 @@ translate_file_errors related_file result =
s3_path = S3_Path.Value error.bucket error.key
s3_file = S3_File.Value s3_path related_file.credentials
Error.throw (File_Error.Not_Found s3_file)

type S3_File_Comparator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type should probably be marked as PRIVATE as we don't want it to show up in CB.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@GregoryTravis GregoryTravis requested a review from radeusgd January 6, 2025 18:21
Copy link
Member

@jdunkerley jdunkerley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mark the functions as private.

hash x = S3_Path_Comparator.hash_builtin x
hash_builtin x = @Builtin_Method "Default_Comparator.hash_builtin"

Comparable.from that:S3_Path = Comparable.new that S3_Path_Comparator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Comparable.from that:S3_Path = Comparable.new that S3_Path_Comparator
## PRIVATE
Comparable.from that:S3_Path = Comparable.new that S3_Path_Comparator

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

;-)

It would be easier to change the IDE to not display from methods in the component browser than adding this ## PRIVATE at every conversion method occurrence.

Comment on lines 162 to 165
compare x y = Ordering.compare x.bucket y.bucket . and_then (Ordering.compare x.key y.key)

hash x = S3_Path_Comparator.hash_builtin x
hash_builtin x = @Builtin_Method "Default_Comparator.hash_builtin"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the linter's benefit please tag ## PRIVATE.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw. why does the S3_Path_Comparator contain a method hash_builtin? It does not seem right.

I'm not sure we have a rule for this, but every builtin should have at most one (or perhaps, exactly one) registration.

Instead, I think we should just delegate to default comparator there.

Suggested change
compare x y = Ordering.compare x.bucket y.bucket . and_then (Ordering.compare x.key y.key)
hash x = S3_Path_Comparator.hash_builtin x
hash_builtin x = @Builtin_Method "Default_Comparator.hash_builtin"
compare x y = Ordering.compare x.bucket y.bucket . and_then (Ordering.compare x.key y.key)
hash x = Default_Comparator.hash_builtin x

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry the suggestion I gave you above won't work because Default_Comparator is private.

Still registering a duplicated builtin to get around private check is not right. Moreover there are movements (#6282) to allow builtins only from the Base library, so this would not work then.

I guess the correct solution then may be to delegate to the default comparator by computing the hash of the element:

Suggested change
compare x y = Ordering.compare x.bucket y.bucket . and_then (Ordering.compare x.key y.key)
hash x = S3_Path_Comparator.hash_builtin x
hash_builtin x = @Builtin_Method "Default_Comparator.hash_builtin"
compare x y = Ordering.compare x.bucket y.bucket . and_then (Ordering.compare x.key y.key)
## PRIVATE
hash x =
key = x.key
Comparable.from key . comparator . hash key

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved just the hash builtin to Ordering.enso:

## PRIVATE
default_comparator_hash_builtin x = @Builtin_Method "Default_Comparator.hash_builtin"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it would be more efficient to call the builtin directly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved just the hash builtin to Ordering.enso:

## PRIVATE
default_comparator_hash_builtin x = @Builtin_Method "Default_Comparator.hash_builtin"

That still does not prepare us for #6282. Once that issue is addressed, builtin methods cannot be exposed from Base directly. You would need to create a method:

default_comparator_hash x = default_comparator_hash_builtin x

that would be wrapping the builtin, then you could export default_comparator_hash as part of public API.

Still, I'm not quite sure if this is the correct approach.
I think the correct approach to creating a custom hash function is to rely on Comparable.from on the elements of the custom type and then combining the resulting hashes (if there's >1 element). We currently perhaps lack a generic function to combine hashes - that's perhaps something that could be added to the public API of Base. We could just compute the hash of the array of hashes to combine, I think that should work okay.

Seems like it would be more efficient to call the builtin directly.

On a hot path the overhead should be totally negligible. But I guess that would be a question to the engine team. Regardless, the guidelines from the engine team were to not export builtin methods as public API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For implementing a generic hash function, I would prefer to use the existing implementations in HashCodeNode rather than create another, different generic one in Enso.

What I am currently doing:

S3_File.enso:

type S3_File_Comparator
    ...
    hash x = default_comparator_hash x

Ordering.enso:

default_comparator_hash x = Default_Comparator.hash x

This results in a stack overflow.

@JaroslavTulach should I try to use HashCodeNode for a generic hash implementation?

Comment on lines 622 to 627
compare x y = Ordering.compare x.s3_path y.s3_path

hash x = S3_File_Comparator.hash_builtin x
hash_builtin x = @Builtin_Method "Default_Comparator.hash_builtin"

Comparable.from that:S3_File = Comparable.new that S3_File_Comparator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we tag these as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 592 to 597
compare x y = Ordering.compare x.enso_path y.enso_path

hash x = Enso_File_Comparator.hash_builtin x
hash_builtin x = @Builtin_Method "Default_Comparator.hash_builtin"

Comparable.from that:Enso_File = Comparable.new that Enso_File_Comparator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 81 to 87
type Enso_Path_Comparator
compare x y = Vector_Lexicographic_Order.compare x.path_segments y.path_segments

hash x = Enso_Path_Comparator.hash_builtin x
hash_builtin x = @Builtin_Method "Default_Comparator.hash_builtin"

Comparable.from that:Enso_Path = Comparable.new that Enso_Path_Comparator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

compare x y = Ordering.compare x.s3_path y.s3_path

## PRIVATE
hash x = default_comparator_hash x
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not hash x = Ordering.hash x.s3_path to be consistent with compare?

@GregoryTravis
Copy link
Contributor Author

I've removed the comparators.

@@ -451,7 +449,7 @@ type Enso_File
file = Enso_File.Value (self.enso_path.resolve asset.name)
Asset_Cache.update file asset
file
results.sort
results.sort on=(f-> f.enso_path.path_segments) by=Vector_Lexicographic_Order.compare
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is OK, although I'd still prefer to just do on=.path for all backends.

The catch is - if we ever make list with recursive=True allow to cross data links, Enso_File.list could still return S3_File instances if there's a data link pointing towards an S3 bucket. So a file-specific approach to sorting would break in such scenario whereas a .path based one will always be universal.

Although I'm not sure how likely the above scenario is, so for now this is also OK.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@GregoryTravis GregoryTravis added the CI: Ready to merge This PR is eligible for automatic merge label Jan 8, 2025
Copy link
Member

@JaroslavTulach JaroslavTulach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

results.sort on=.path looks simple.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI: No changelog needed Do not require a changelog entry for this PR. CI: Ready to merge This PR is eligible for automatic merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sort output of list for S3_File and Enso_File
5 participants