8000 BallTree.data is a memory view · Issue #11728 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

BallTree.data is a memory view #11728

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
amueller opened this issue Aug 1, 2018 · 3 comments · Fixed by #11764
Closed

BallTree.data is a memory view #11728

amueller opened this issue Aug 1, 2018 · 3 comments · Fixed by #11764
Labels
Easy Well-defined and straightforward way to resolve help wanted

Comments

@amueller
Copy link
Member
amueller commented Aug 1, 2018

Via stackoverflow. BallTree.data is documented as numpy array but is instead a memory view. I haven't looked at the code, but I think we should store it as a numpy array and if that results in any efficiency issues, we can make it private instead.

@amueller amueller added Easy Well-defined and straightforward way to resolve help wanted labels Aug 1, 2018
@BlaneG
Copy link
BlaneG commented Aug 2, 2018

I haven't worked with Cython before. Are you just suggesting to cast the memory view as a numpy array when data is initialized in ball_tree?:
cdef DTYPE_t* data = np.asarray(&tree.data[0, 0])

Or is there a cleaner way to do this?

@jeremiedbb
Copy link
Member

This is not as simple as that. BallTree heritates from the BinaryTree class, and data is a public attribute of BinaryTree, stored as a memoryview.

In BinaryTree all array attributes are stored as memory views and I think we shouldn't store them a np.ndarray. Memoryviews are more modern and it's advised to use them instead of numpy arrays now.

Is it not possible to just change the doc and document data as a memoryview ?
If not, I think making it private is the best solution. We could maybe add a get_data function which returns a numpy array if it's really needed.

@rth
Copy link
Member
rth commented Aug 3, 2018

Yes, also in BinaryTree data is actually the memoryview corresponding to the data_arr attribute. data is exposed in python using the readonly property while data_arr is not (more information about static cython attributes can be found in the docs).

So we could, refactor BinaryTree, BallTree etc to use the data variable name for what is now data_arr and make it public, while not making public what is currently data.

In the end though, data is just a copy of the training data that the user has already. I guess it was exposed to be consistent with the scipy's KDTree implementation, but I wonder if making sure it's 100% consistent would really justify the effort. Maybe just documenting that it's a memoryview, so that the few users who want to use it, can call np.asarray(estimator.data) to get an array.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Easy Well-defined and straightforward way to resolve help wanted
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants
0