Improve Proglearn oblique tree speed #419
Comments
Please assign me to this.
I took a look at the sklearn Cython tree code. I think a potential course of action is basically to copy over the sklearn tree code. This would allow easy, compatible integration with sklearn, and it uses their optimized code to the best extent. Some thoughts about modifying their Cython code: lots of stuff, like sparse data and feature importance, can be ignored to start. I think an MVP could simply sum together a fixed number of sampled features (equivalent to a dot product with weights of all 1). Verify this works, then we can iterate by adding weights/a dot product instead of a sum, controlling how many features to sample with a parameter in the constructor, etc.
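The MVP above can be sketched in a few lines of NumPy — a plain sum over sampled columns first, generalizing to a weighted dot product later. This is a hypothetical illustration (the function name and signature are not from proglearn or sklearn):

```python
import numpy as np

def oblique_projection(X, n_sampled, rng, weights=None):
    """Project each row of X onto a random combination of sampled features.

    MVP: weights of all 1, i.e. a plain sum of the sampled columns.
    Passing explicit weights generalizes this to a dot product.
    """
    idx = rng.choice(X.shape[1], size=n_sampled, replace=False)
    if weights is None:
        weights = np.ones(n_sampled)  # MVP case: unweighted sum
    return X[:, idx] @ weights

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 10))
proj = oblique_projection(X, n_sampled=3, rng=rng)
print(proj.shape)  # one projected value per sample: (5,)
```

The projected 1-D values can then be fed to an ordinary axis-aligned threshold search, which is what makes reuse of the existing splitter machinery attractive.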
@rflperry I hold some objections to the copying idea, as copying the classifier code basically means copying over the entire sklearn.tree folder. I found it very hard to locate the code we really need. That being said, @parthgvora, if you need, I think I can provide some help with that.
I personally am not averse to copying over the entire folder. It seems in some ways easier than trying to write Cython functions fully from scratch for parts of the forest. Of course, that Cython code has a lot of functionality we don't need to worry about modifying for now, but that can just be deleted and then added back or modified later (e.g. the sparse X case, feature importance, etc.). But that's just my take; up to @parthgvora.
To be honest, I kind of think Hao is right here. Having tried this, there's a ton of stuff in the sklearn.tree folder we'd need to sort out to do what we need to do. Also, dealing with the GIL is a real pain in Cython. So I'm kind of averse to copying the folder directly. That said, we should keep their code in mind when writing our own Cython to make integration easier. We can also take some inspiration from the way they wrote their code, and do smaller copy-pastes as necessary.

So, I think we should Cythonize piece by piece, starting with the splitter. We're probably going to need our own splitter for oblique splits, and it's the slowest part anyway. So I'm going to try to write this out in Cython as best I can, using their code for guidance at times. Then, if we want, we can keep adding sklearn functionality with small copy-pastes. Also, implementing the splitter first should speed up the code enough to let someone implement and test MORF in Python while we keep working on Cythonizing.
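For reference, the splitter's hot loop is a threshold search over one (projected) feature. A minimal pure-Python sketch of that loop, using an incremental class-count update in the style of sklearn's splitter (function name and structure are illustrative, not actual proglearn/sklearn code):

```python
import numpy as np

def best_threshold_gini(values, y):
    """Exhaustive threshold search on a 1-D feature, minimizing
    weighted Gini impurity. This per-node loop is the part a
    Cython splitter would replace.

    Returns (best_score, best_threshold).
    """
    order = np.argsort(values)
    v, y = values[order], y[order]
    classes = np.unique(y)
    n = len(y)
    best_score, best_threshold = np.inf, None
    # Incrementally move samples from the right partition to the left,
    # updating class counts instead of recomputing them from scratch.
    left = np.zeros(len(classes))
    right = np.array([(y == c).sum() for c in classes], dtype=float)
    for i in range(n - 1):
        c = np.searchsorted(classes, y[i])
        left[c] += 1
        right[c] -= 1
        if v[i] == v[i + 1]:
            continue  # no valid split between equal feature values
        nl, nr = i + 1, n - i - 1
        gini_l = 1.0 - ((left / nl) ** 2).sum()
        gini_r = 1.0 - ((right / nr) ** 2).sum()
        score = (nl * gini_l + nr * gini_r) / n
        if score < best_score:
            best_score = score
            best_threshold = (v[i] + v[i + 1]) / 2.0
    return best_score, best_threshold

score, t = best_threshold_gini(np.array([0.0, 1.0, 2.0, 3.0]),
                               np.array([0, 0, 1, 1]))
print(score, t)  # perfectly separable data -> 0.0 1.5
```

The argsort, the tight per-sample loop, and the impurity updates are exactly the kind of code that benefits from Cython with typed arrays and `nogil`.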
Cythonize bottlenecks in the proglearn oblique tree splitter to speed up training times. The main bottleneck is finding the next best split.
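A quick way to confirm where the time goes before Cythonizing is to profile the fit. A hypothetical sketch with the standard-library profiler, using a pure-Python stand-in for the split search (not proglearn's actual splitter):

```python
import cProfile
import io
import pstats

import numpy as np

def gini(labels):
    """Gini impurity of a 1-D label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / len(labels)
    return 1.0 - (p ** 2).sum()

def slow_best_split(values, y):
    """Naive O(n^2) threshold search; stand-in for the splitter
    hot loop we expect to dominate training time."""
    best_score, best_t = np.inf, None
    for t in values:
        mask = values <= t
        left, right = y[mask], y[~mask]
        if len(left) == 0 or len(right) == 0:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_score, best_t = score, float(t)
    return best_t

rng = np.random.default_rng(0)
values = rng.standard_normal(1000)
y = (values > 0).astype(int)

prof = cProfile.Profile()
prof.enable()
threshold = slow_best_split(values, y)
prof.disable()

buf = io.StringIO()
pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())  # top entries by cumulative time
```

If the profile shows the split search dominating, that's the piece worth Cythonizing first, as discussed in the comments.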