Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

colwise!(dist, res, A, B) does not consider eltype(res) #268

Open
alyst opened this issue Jan 6, 2025 · 8 comments
Open

colwise!(dist, res, A, B) does not consider eltype(res) #268

alyst opened this issue Jan 6, 2025 · 8 comments

Comments

@alyst
Copy link
Member

alyst commented Jan 6, 2025

Here's a test case (Julia 1.11.2, Distances.jl 0.10.12):

X = reshape(Int16[0, 0], 2, 1)
Y = reshape(Int16[1000, 1000], 2, 1)

julia> colwise!(SqEuclidean(), Matrix{Float64}(undef, 1, 1), X, Y)
1×1 Matrix{Float64}:
 -31616.0

julia> pairwise!(SqEuclidean(), Matrix{Float64}(undef, 1, 1), X, Y, dims=2)
1×1 Matrix{Float64}:
 33920.0
@dkarrasch
Copy link
Member

I think it's rather the other way around: it's pure luck (and due to a specific implementation) that pairwise! works here:

julia> SqEuclidean()(vec(X), vec(Y))
-31616

But we may transfer the same trick from pairwise! to colwise!.

@alyst
Copy link
Member Author

alyst commented Jan 7, 2025

Maybe the core method for the distance API should be dist(T, x, y) with dist(x, y) = dist(result_type(x, y), x, y)

@dkarrasch
Copy link
Member

dkarrasch commented Jan 7, 2025

But what would that help? For your Int16 example, you'd get result_type(x, y) = Int16, and there you get your overflow. To prevent it, you would need to know the target type in the middle of the computation, which is as good as knowing it upfront, where you can change the eltype of your input to begin with.

@alyst
Copy link
Member Author

alyst commented Jan 7, 2025

That will help with colwise!()-like methods, where the eltype of the result is known in advance (but currently it is ignored): colwise!() will call dist(T, x, y) directly.
In other situations the current behaviour (overflow) will be preserved.

@dkarrasch
Copy link
Member

To prevent the overflow, you would need to promote your input (as you can see from your OP), which you can do as the user/package author. I don't think we want to silently promote eltypes here.

@alyst
Copy link
Member Author

alyst commented Jan 7, 2025

These matrices might be large, so promoting Int16 to Float64 just to tell colwise!() how to calculate the distance (which it is currently doing element-wise) is a bit of an overkill.
Also at the moment pairwise!() and colwise!() behaviors are different.
I agree that probably there's no "best" automatic behavior of how to convert, but ideally Distances.jl can provide an API that allows controlling it, so that the users don't have to implement workarounds.

@dkarrasch
Copy link
Member

Also at the moment pairwise!() and colwise!() behaviors are different.

That's why I said the fact that pairwise! works is rather luck. It's due to this performance optimization, which doesn't make sense for colwise!:

Distances.jl/src/metrics.jl

Lines 616 to 617 in 35c6d0d

function _pairwise!(dist::Union{SqEuclidean,Euclidean}, r::AbstractMatrix,
a::AbstractMatrix, b::AbstractMatrix)

So, what you're asking for is not about controlling the target type, but a preprocessing of the data, which you can currently do like:

X = reshape(Int16[0, 0], 2, 1)
Y = reshape(Int16[1000, 1000], 2, 1)

using Distances
map((x, y) -> sqeuclidean(float(x), float(y)), eachcol(X), eachcol(Y))
map!((x, y) -> sqeuclidean(float(x), float(y)), Matrix{Float64}(undef, 1, 1), eachcol(X), eachcol(Y))

Note that the pairwise! computation is also incorrect, for the same reason. It doesn't preprocess the data, so calculates with Int16, for which Int16(1000)^2 is already incorrect.

@alyst
Copy link
Member Author

alyst commented Jan 8, 2025

Thank you for the detailed explanations!

Implementing the workaround in the user script is probably the optimal strategy now, but look at it from the perspective of the other packages, in particular Clustering.jl.

  • The input data may be of different eltypes.
    I think we agree that implementing the logic to properly handle all eltype and metric permutations in Clustering.jl is not a good idea.
  • Also, the input data could be very large, and the users might be already struggling with the memory usage.
    So just internally converting the data into an appropriate eltype is not an option.
    Conversion could be done in the user script. But right now -- if the user is not doing a conversion -- the resulting distances could be incorrect, and the user will have no visibility to that.

It was my assumption that Distances.jl -- in addition to defining the metric types -- also implements the efficient distances calculation for the large datasets, so it could be used as the plug-in solutions to these problems.

AFAIU, the immediate solution is to check whether the distances are negative if SemiMetric is used and warn the user.
That could be done on the Clustering.jl side, but it could also be done in Distances.jl since the same problem may manifest itself in other contexts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants