Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify method in Tokenizer trait #38

Open
wingyplus opened this issue Aug 11, 2021 · 5 comments · Fixed by #39
Open

Clarify method in Tokenizer trait #38

wingyplus opened this issue Aug 11, 2021 · 5 comments · Fixed by #39

Comments

@wingyplus
Copy link
Contributor

wingyplus commented Aug 11, 2021

From Tokenizer trait declaration:

pub trait Tokenizer {
    fn segment(&self, text: &str, safe: Option<bool>, parallel: Option<bool>) -> Vec<String>;

    fn segment_to_string(
        &self,
        text: &str,
        safe: Option<bool>,
        parallel: Option<bool>,
    ) -> Vec<String>;
}

It provides 2 methods that look do the same thing. And in Newmm, it do the same things for both methods (Ref here). I purpose to clarify which one we should use and clean up trait protocol.

@Gorlph
Copy link
Contributor

Gorlph commented Aug 11, 2021

It should be addressed in #39

@wingyplus
Copy link
Contributor Author

wingyplus commented Aug 13, 2021

I'm still unclear which one we should use between segment_to_string and segment since it just differences only unwrap. :(

@bact
Copy link
Member

bact commented Aug 14, 2021

Does it confused because both of them return string?

@wingyplus
Copy link
Contributor Author

Yes it is.

@Gorlph
Copy link
Contributor

Gorlph commented Aug 14, 2021

I suggest renaming them.

segment and segment_to_string, at first, were different - segment used to return an array of utf-8 bytes while segment_to_string returns an array of Rust String for at-the-time overhead problem and testing, respectively.

After the overhead problem had been solved, there was no longer a need to return bytes anymore, so I lazily changed segment to return an array of String.

With new signature change as of #39, I intended to make one return array of String, while the other explicitly return Result - to let developers of bindings handle it.

In the future, IMO, it will be better to remove the method which returns Vec altogether.

The current version abuses the use of panic! very liberally where proper idiomatic error handling should be applied - such as Runtime error (look at newmm_custom and count panic!). I would like to keep panic! at initialization, though.

@bact bact reopened this Aug 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants