10000 Node radius in KDTree seems wrongly computed · Issue #11313 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Node radius in KDTree seems wrongly computed #11313

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Zaharid opened this issue Jun 18, 2018 · 1 comment
Closed

Node radius in KDTree seems wrongly computed #11313

Zaharid opened this issue Jun 18, 2018 · 1 comment

Comments

@Zaharid
Copy link
Zaharid commented Jun 18, 2018

Description

I am trying to understand the implementation of the KDTree structure here, and I came across this line

rad += pow(0.5 * abs(upper_bounds[j] - lower_bounds[j]),

I do not understand how the code corresponds to the description in the comments below

    # The radius will hold the size of the circumscribed hypersphere measured
    # with the specified metric: in querying, this is used as a measure of the
    # size of each node when deciding which nodes to split.

As far as I can tell, the value of j is set to nfeatures-1 and the code is repeated once for each datapoint instead once for each dimension. Moreover the boundaries are being added while they are still changing.

Steps/Code to Reproduce

Consider

In [1]: from sklearn.neighbors import KDTree

In [2]: data = [[0,0], [1,0], [2,0], [3,0], [4,0], [5,0], [6,0]]

In [3]: t = KDTree(data,leaf_size=1)

In [4]: t.node_data[0]
Out[4]: {'idx_end': 7, 'idx_start': 0, 'is_leaf': 0, 'radius': 0.0}

In [5]: t.node_data[1]
Out[5]: {'idx_end': 3, 'idx_start': 0, 'is_leaf': 0, 'radius': 0.0}

In [6]: t.node_data[2]
Out[6]: {'idx_end': 7, 'idx_start': 3, 'is_leaf': 0, 'radius': 0.0}

In [7]: t.node_data[3]
Out[7]: {'idx_end': 1, 'idx_start': 0, 'is_leaf': 1, 'radius': 0.0}

In [8]: t.node_data[4]
Out[8]: {'idx_end': 3, 'idx_start': 1, 'is_leaf': 1, 'radius': 0.0}

In [9]: t.node_data[5]
Out[9]: {'idx_end': 5, 'idx_start': 3, 'is_leaf': 1, 'radius': 0.0}

In [10]: t.node_data[6]
Out[10]: {'idx_end': 7, 'idx_start': 5, 'is_leaf': 1, 'radius': 0.0}

Note that all radii are zero.

Expected Results

I would expect the radii of these nodes to be different, not all zero. I think they are zero in the example because they only look at the last dimension.

Actual Results

I believe this might lead to degraded performance when quering, trough the use of radii in this function:

cdef int _query_dual_depthfirst(self, ITYPE_t i_node1,

but I have not tested it, and still do not understand the algorithm well enough.

Versions

I am looking at the current master version on github. The example above works with version 0.19.1.

@Zaharid Zaharid changed the title Node radius in KDTree seems suspicious Node radius in KDTree see,s wrongly computed Jul 3, 2018
@Zaharid Zaharid changed the title Node radius in KDTree see,s wrongly computed Node radius in KDTree seems wrongly computed Jul 3, 2018
@jakevdp
Copy link
Member
jakevdp commented Jul 16, 2018

Yeah, I think that block should be in a for j in range(n_features) loop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants
0