sklearn.neighbors.KDTree complexity for building is not O(n(k+log(n))) #7687

Closed
MarDiehl opened this issue Oct 17, 2016 · 18 comments · Fixed by #19473

@MarDiehl
MarDiehl commented Oct 17, 2016

Description

Building a kd-tree can be done in O(n(k+log(n))) time and should (to my knowledge) not depend on the details of the data. However, the KDTree implementation in scikit-learn shows really poor scaling behavior for my data. I cannot reproduce this behavior with data generated by sklearn.datasets.samples_generator.make_blobs.

Steps/Code to Reproduce

Download the NumPy data file (search.npy) from https://webshare.mpie.de/index.php?6b4495f7e7 and run the following code with Python 3:

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets.samples_generator import make_blobs
from scipy.spatial     import KDTree as KDTree_Scipy
from sklearn.neighbors import KDTree as KDTree_Sklearn
from time import perf_counter as pc

search_raw_artificial, dev_null = make_blobs(n_samples=6830160,n_features=5,centers=1000,cluster_std=0.4,random_state=0)
search_raw_real                 = np.load('search.npy')

N = [240000,2400000,4800000,6000000]

for search_raw in [search_raw_artificial,search_raw_real]:
  for i,n in enumerate(N):

    search = search_raw[0:n,:] 
    print('data shape',search.shape)
    print('delta',np.amax(search,0)-np.amin(search,0))
    t0 = pc()
    neigh = NearestNeighbors(algorithm='ball_tree').fit(search)
    print('sklearn.neighbors (ball_tree) build finished in {}s'.format(pc()-t0))
    t0 = pc()
    neigh = NearestNeighbors(algorithm='kd_tree').fit(search)
    print('  sklearn.neighbors (kd_tree) build finished in {}s'.format(pc()-t0))
    t0 = pc()
    neigh = KDTree_Sklearn(search)
    print('    sklearn.neighbors KD tree build finished in {}s'.format(pc()-t0))
    t0 = pc()
    neigh = KDTree_Scipy(search)
    print('    scipy.spatial     KD tree build finished in {}s'.format(pc()-t0))
    print('')

Expected Results

The build-time scaling of scikit-learn's KDTree should be similar to that of scipy.spatial's KDTree.

Actual Results

Artificial data: scikit-learn outperforms scipy.spatial

data shape (240000, 5)
delta [ 22.7311549 22.61482157 22.57353059 22.65385101 22.77163478]
sklearn.neighbors (ball_tree) build finished in 0.1524970519822091s
sklearn.neighbors (kd_tree) build finished in 0.17296032601734623s
sklearn.neighbors KD tree build finished in 0.172917598974891s
scipy.spatial KD tree build finished in 2.265735782973934s

data shape (2400000, 5)
delta [ 23.38025743 23.22174801 22.88042798 22.8831237 23.31696732]
sklearn.neighbors (ball_tree) build finished in 3.462802237016149s
sklearn.neighbors (kd_tree) build finished in 3.7110973289818503s
sklearn.neighbors KD tree build finished in 3.5682168990024365s
scipy.spatial KD tree build finished in 19.92274082399672s

data shape (4800000, 5)
delta [ 23.38025743 23.26302877 23.22210673 22.97866792 23.31696732]
sklearn.neighbors (ball_tree) build finished in 8.922708058031276s
sklearn.neighbors (kd_tree) build finished in 9.238389031030238s
sklearn.neighbors KD tree build finished in 8.879073369025718s
scipy.spatial KD tree build finished in 38.43681587401079s

data shape (6000000, 5)
delta [ 23.42236957 23.26302877 23.22210673 23.20207953 23.31696732]
sklearn.neighbors (ball_tree) build finished in 12.75000820402056s
sklearn.neighbors (kd_tree) build finished in 13.30022174998885s
sklearn.neighbors KD tree build finished in 12.794657755992375s
scipy.spatial KD tree build finished in 48.33784791099606s

My data: scipy.spatial outperforms scikit-learn as the data size increases

data shape (240000, 5)
delta [ 2.14487407 2.14472508 2.14499087 8.86612151 0.15491879]
sklearn.neighbors (ball_tree) build finished in 0.39374090504134074s
sklearn.neighbors (kd_tree) build finished in 0.21525143302278593s
sklearn.neighbors KD tree build finished in 0.21449304796988145s
scipy.spatial KD tree build finished in 2.244567967019975s

data shape (2400000, 5)
delta [ 2.14502773 2.14502543 2.14502904 8.86612151 1.59685522]
sklearn.neighbors (ball_tree) build finished in 4.199425678991247s
sklearn.neighbors (kd_tree) build finished in 4.40237572795013s
sklearn.neighbors KD tree build finished in 4.295626600971445s
scipy.spatial KD tree build finished in 26.322200270951726s

data shape (4800000, 5)
delta [ 2.14502773 2.14502864 2.14502904 8.86612151 3.19371044]
sklearn.neighbors (ball_tree) build finished in 110.31694995303405s
sklearn.neighbors (kd_tree) build finished in 112.8703724470106s
sklearn.neighbors KD tree build finished in 114.07325625402154s
scipy.spatial KD tree build finished in 51.79352715797722s

data shape (6000000, 5)
delta [ 2.14502838 2.14502902 2.14502914 8.86612151 3.99213804]
sklearn.neighbors (ball_tree) build finished in 2458.668528069975s
sklearn.neighbors (kd_tree) build finished in 2451.2438263060176s
sklearn.neighbors KD tree build finished in 2801.8054143560003s
scipy.spatial KD tree build finished in 62.066240190993994s

Comments

cKDTree from scipy.spatial behaves even better.
I cannot use cKDTree/KDTree from scipy.spatial because calculating a sparse distance matrix (with the sparse_distance_matrix function) is extremely slow compared to neighbors.radius_neighbors_graph/neighbors.kneighbors_graph, and I need a sparse distance matrix for DBSCAN on large datasets (n_samples > 10 million) with low dimensionality (n_features = 5 or 6).

Versions

Linux-4.7.6-1-ARCH-x86_64-with-arch
Python 3.5.2 (default, Jun 28 2016, 08:46:01) [GCC 6.1.1 20160602]
NumPy 1.11.2
SciPy 0.18.1
Scikit-Learn 0.18

@amueller
Member

I'm trying to download the data but your server is sloooow and has an invalid SSL certificate ;) Maybe use figshare or dropbox or drive the next time?

@amueller
Member

I can reproduce with master:

data shape (4800000, 5)
delta [ 2.14502773  2.14502864  2.14502904  8.86612151  3.19371044]
sklearn.neighbors (ball_tree) build finished in 159.69595906400355s
  sklearn.neighbors (kd_tree) build finished in 150.84360234299675s
    sklearn.neighbors KD tree build finished in 153.49737433198607s
    scipy.spatial     KD tree build finished in 45.566852455958724s

@amueller
Member

Also ping @jakevdp

@amueller
Member

I think the algorithm is not very efficient for your particular data. Actually, just running it on the last dimension or the last two dimensions, you can see the issue.

I'm trying to understand what's happening in partition_node_indices but I don't really get it.
It looks like it has complexity n ** 2 if the data is sorted?

@jakevdp
Member
jakevdp commented Oct 17, 2016

From what I recall, the main difference between scipy and sklearn here is that scipy splits the tree using a midpoint rule. This leads to very fast builds (because all you need is to compute (max - min)/2 to find the split point) but for certain datasets can lead to very poor performance and very large trees (worst case, at every level you're splitting only one point from the rest).

In sklearn, we use a median rule, which is more expensive at build time but leads to balanced trees every time. I made that call because we chose to pre-allocate all arrays to allow numpy to handle all memory allocation, and so we need a 50/50 split at every node.

In general, since queries are done N times and the build is done once (and median leads to faster queries when the query sample is similarly distributed to the training sample), I've not found the choice to be a problem. Still, I'm not sure why the distribution of data would affect this, unless you're somehow hitting the worst case of the median computation...
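
For illustration, the two split rules can be sketched in a few lines of NumPy (this is not the actual Cython code in either library, just the idea): the midpoint rule only needs the minimum and maximum of the node, while the median rule needs a partial sort to find the 50/50 split.

import numpy as np

def midpoint_split(values):
    # scipy-style midpoint rule: split at the middle of the range;
    # cheap to compute, but the two halves can be arbitrarily unbalanced
    split = 0.5 * (values.min() + values.max())
    return values[values <= split], values[values > split]

def median_split(values):
    # sklearn-style median rule: partial sort around index n // 2;
    # more work per node, but guarantees a balanced 50/50 split
    mid = len(values) // 2
    part = np.partition(values, mid)
    return part[:mid], part[mid:]

rng = np.random.default_rng(0)
values = rng.exponential(size=1000)
print([len(v) for v in midpoint_split(values)])  # heavily unbalanced for skewed data
print([len(v) for v in median_split(values)])    # always ~50/50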

@jakevdp
Member
jakevdp commented Oct 17, 2016

But I've not looked at any of this code in a couple years, so there may be details I'm forgetting

@jakevdp
Member
jakevdp commented Oct 17, 2016

@MarDiehl a couple quick diagnostics: what is the range (i.e. max - min) of each of your dimensions? Second, if you first randomly shuffle the data, does the build time change?
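
Both diagnostics can be run with a few lines (a sketch; it assumes search.npy is the file linked above):

import numpy as np
from time import perf_counter as pc
from sklearn.neighbors import KDTree

search = np.load('search.npy')
print('per-dimension range:', np.ptp(search, axis=0))  # max - min of each dimension

# build time in the original order, then after shuffling the rows in place
for label in ('original order', 'shuffled'):
    t0 = pc()
    KDTree(search)
    print('{}: built in {:.1f}s'.format(label, pc() - t0))
    np.random.shuffle(search)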

@MarDiehl
Author

Thanks for the very quick reply and taking care of the issue.
For faster download, the file is now available at https://www.dropbox.com/s/eth3utu5oi32j8l/search.npy?dl=0
Shuffling helps and gives good scaling, i.e. after np.random.shuffle(search_raw_real) I get:

data shape (240000, 5)
delta [ 2.14497909 2.14495737 2.14499935 8.86612151 4.54031222]
sklearn.neighbors (ball_tree) build finished in 0.16637464799987356s
sklearn.neighbors (kd_tree) build finished in 0.17206305199988492s
sklearn.neighbors KD tree build finished in 0.184408041000097s
scipy.spatial KD tree build finished in 2.320559198999945s

data shape (2400000, 5)
delta [ 2.14502838 2.14502903 2.14502893 8.86612151 4.54031222]
sklearn.neighbors (ball_tree) build finished in 3.2228471139997055s
sklearn.neighbors (kd_tree) build finished in 3.524644171000091s
sklearn.neighbors KD tree build finished in 3.2397920609996618s
scipy.spatial KD tree build finished in 26.382782556000166s

data shape (4800000, 5)
delta [ 2.14502852 2.14502903 2.14502904 8.86612151 4.54031222]
sklearn.neighbors (ball_tree) build finished in 12.170209839000108s
sklearn.neighbors (kd_tree) build finished in 12.363510834999943s
sklearn.neighbors KD tree build finished in 12.047136137000052s
scipy.spatial KD tree build finished in 47.75648402300021s

data shape (6000000, 5)
delta [ 2.14502852 2.14502903 2.14502914 8.86612151 4.54031222]
sklearn.neighbors (ball_tree) build finished in 11.137991230999887s
sklearn.neighbors (kd_tree) build finished in 11.372971363000033s
sklearn.neighbors KD tree build finished in 11.437613521000003s
scipy.spatial KD tree build finished in 56.40389510099976s

Since it was missing from the original post, here are a few words on my data structure. First of all, each sample is unique. Using pandas to check:
import pandas as pd
df = pd.DataFrame(search_raw_real)
print(df.shape)
print(df.drop_duplicates().shape)

The data has a very special structure, best described as a checkerboard: coordinates on a regular grid (dimensions 3 and 4, 0-based indexing) with 24 vectors (dimensions 0, 1, 2) placed on every tile. On one tile, all 24 vectors differ (otherwise the data points would not be unique), but neighbouring tiles often hold the same or similar vectors. The data is ordered, i.e. point 0 is the first vector on tile (0,0), point 1 the second vector on (0,0), point 24 the first vector on tile (1,0), etc. This can also be seen from the delta output of my test script, where the range of the last dimension grows with the subset size.
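
For illustration only, data with roughly this layout (24 vectors per grid point, stored tile by tile; all constants below are made up, and the similarity between neighbouring tiles is not modelled) could be generated like this:

import numpy as np

a, n_x, n_y, per_tile = 0.01, 500, 500, 24   # made-up grid parameters
rng = np.random.default_rng(0)

# grid coordinates (dimensions 3 and 4), repeated for each of the 24 vectors per tile
xx, yy = np.meshgrid(a * np.arange(n_x), a * np.arange(n_y), indexing='ij')
grid = np.repeat(np.column_stack([xx.ravel(), yy.ravel()]), per_tile, axis=0)

# 24 vectors per tile (dimensions 0, 1, 2); here simply random, unlike the real data
vectors = rng.uniform(-1.07, 1.07, size=(grid.shape[0], 3))

data = np.hstack([vectors, grid])            # ordered tile by tile
print(data.shape)                            # (n_x * n_y * per_tile, 5)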

@jakevdp
Member
jakevdp commented Oct 18, 2016

If you have data on a regular grid, there are much more efficient ways to do neighbors searches. Sounds like this is a corner case in which the data configuration happens to cause near worst-case performance of the tree building.

@MarDiehl
Author
MarDiehl commented Oct 18, 2016

@jakevdp only 2 of the dimensions are regular (the grid dimensions are a * (n_x, n_y), where a is a constant 0.01 < a < 0.5 and n_x, n_y are the numbers of points, typically 40 < n_x, n_y < 10000). The other 3 dimensions are in the range [-1.07, 1.07]; 24 of them exist on each point of the regular grid and they are not regular. What I finally need (for DBSCAN) is a sparse distance matrix.
Shuffling the data and using the KDTree seems to be the most attractive option for me so far, or could you recommend another way to get the matrix?
Many thanks!

@jakevdp
Member
jakevdp commented Oct 18, 2016

DBSCAN should compute the distance matrix automatically from the input, but if you need to compute it manually you can use kneighbors_graph or related routines.
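
For example, a sparse radius-neighbors distance matrix can be precomputed and passed to DBSCAN with metric='precomputed' (a sketch; eps and min_samples are placeholder values):

import numpy as np
from sklearn.neighbors import radius_neighbors_graph
from sklearn.cluster import DBSCAN

X = np.load('search.npy')
eps = 0.1                                    # placeholder radius

# sparse CSR matrix holding only the pairwise distances within eps
D = radius_neighbors_graph(X, radius=eps, mode='distance', include_self=False)

# DBSCAN accepts the precomputed sparse distance matrix directly
labels = DBSCAN(eps=eps, min_samples=5, metric='precomputed').fit_predict(D)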

@amueller
Member

I wonder whether we should shuffle the data in the tree to avoid degenerate cases in the sorting.

@jakevdp
Member
jakevdp commented Oct 18, 2016

My suspicion is that this is an extremely infrequent corner-case, and adding computational and memory overhead in every case would be a bit overkill.

@amueller
Member

I think the case is "sorted data", which I imagine can happen. Maybe checking if we can make the sorting more robust would be good.

@jakevdp
Member
jakevdp commented Oct 18, 2016

I suspect the key is that it's gridded data, sorted along one of the dimensions. The combination of that structure and the presence of duplicates could hit the worst-case for a basic binary partition algorithm... there are probably variants out there that would perform better. Anyone take an algorithms course recently? 😎

@sturlamolden
Contributor

SciPy can use a sliding midpoint or a median rule to split kd-trees.

The slowness on gridded data has been noticed for SciPy as well when building the kd-tree with the median rule. It is due to the use of quickselect instead of introselect. Sklearn suffers from the same problem.

Another thing I have noticed is that the size of the data set matters as well. For large data sets (e.g. several million points) building with the median rule can be very slow, even for well-behaved data. With large data sets it is always a good idea to use the sliding midpoint rule instead.
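
To illustrate why the pivot strategy matters (a toy quickselect in pure Python, not the code used in either library): with a naive pivot choice, the expected O(n) selection degrades towards O(n**2) on sorted input.

import numpy as np
from time import perf_counter as pc

def naive_quickselect(a, k):
    # quickselect with the first element as pivot (deliberately naive);
    # on sorted input every partition is maximally unbalanced
    a = list(a)
    while True:
        pivot = a[0]
        less = [x for x in a if x < pivot]
        equal = [x for x in a if x == pivot]
        greater = [x for x in a if x > pivot]
        if k < len(less):
            a = less
        elif k < len(less) + len(equal):
            return pivot
        else:
            k -= len(less) + len(equal)
            a = greater

rng = np.random.default_rng(0)
data = rng.random(5000)
for label, arr in (('random order', data), ('sorted', np.sort(data))):
    t0 = pc()
    naive_quickselect(arr, len(arr) // 2)
    print('{}: {:.2f}s'.format(label, pc() - t0))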

@amueller
Member

@sturlamolden what's your recommendation?

@sturlamolden
Contributor

For large data sets (typically >1E6 data points), use cKDTree with balanced_tree=False. This will build the kd-tree using the sliding midpoint rule, and tends to be a lot faster on large data sets. The sliding midpoint rule requires no partial sorting to find the pivot points, which is why it helps on larger data sets.

Dealing with presorted data is harder, as we must know the problem in advance. One option would be to use introselect instead of quickselect. The required C code is in NumPy and can be adapted. This is not perfect: although introselect is always O(N), it is a slow O(N) for presorted data. Another option would be to build in some sort of timeout, and switch strategy to sliding midpoint if building the kd-tree takes too long (e.g. if it exceeds one second).
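
A quick way to try this on the data from this issue (a sketch):

import numpy as np
from scipy.spatial import cKDTree
from time import perf_counter as pc

X = np.load('search.npy')
for balanced in (True, False):
    t0 = pc()
    cKDTree(X, balanced_tree=balanced)        # False selects the sliding midpoint rule
    print('balanced_tree={}: built in {:.1f}s'.format(balanced, pc() - t0))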
