sklearn.neighbors.KDTree complexity for building is not O(n(k+log(n))) #7687
Comments
I'm trying to download the data but your server is sloooow and has an invalid SSL certificate ;) Maybe use figshare or dropbox or drive next time?
I can reproduce with master:
Also ping @jakevdp
I think the algorithm is not very efficient for your particular data. Actually, just running it on the last dimension or the last two dimensions, you can see the issue. I'm trying to understand what's happening in
From what I recall, the main difference between scipy and sklearn here is that scipy splits the tree using a midpoint rule. This leads to very fast builds (because all you need is to compute (max - min)/2 to find the split point) but for certain datasets can lead to very poor performance and very large trees (worst case, at every level you're splitting only one point from the rest).

In sklearn, we use a median rule, which is more expensive at build time but leads to balanced trees every time. I made that call because we choose to pre-allocate all arrays to allow numpy to handle all memory allocation, and so we need a 50/50 split at every node.

In general, since queries are done N times and the build is done once (and median leads to faster queries when the query sample is similarly distributed to the training sample), I've not found the choice to be a problem. Still, I'm not sure why the distribution of data would affect this, unless you're somehow hitting the worst case of the median computation...
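To make the difference concrete, here is a minimal sketch of the two split rules (illustration only, not the actual tree-building code in either library):

import numpy as np

def midpoint_split(points, dim):
    # Midpoint rule (scipy-style): split at the middle of the coordinate range.
    # Very cheap to compute, but can leave almost all points on one side.
    split = 0.5 * (points[:, dim].min() + points[:, dim].max())
    mask = points[:, dim] <= split
    return points[mask], points[~mask]

def median_split(points, dim):
    # Median rule (sklearn-style): put half the points on each side, which keeps
    # the tree balanced but requires a partial sort (quickselect) at every node.
    half = len(points) // 2
    order = np.argpartition(points[:, dim], half)
    return points[order[:half]], points[order[half:]]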
But I've not looked at any of this code in a couple years, so there may be details I'm forgetting.
@MarDiehl a couple quick diagnostics: what is the range (i.e. max - min) of each of your dimensions? Second, if you first randomly shuffle the data, does the build time change?
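A quick way to run those two checks (a hedged sketch; the search.npy file name is taken from the issue, everything else is illustrative):

import time
import numpy as np
from sklearn.neighbors import KDTree

data = np.load('search.npy')
print('range per dimension:', data.max(axis=0) - data.min(axis=0))

rng = np.random.RandomState(0)
shuffled = data[rng.permutation(len(data))]   # random row order

for name, X in [('original', data), ('shuffled', shuffled)]:
    t0 = time.time()
    KDTree(X)
    print(name, 'build took', time.time() - t0, 's')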
Thanks for the very quick reply and taking care of the issue.

[Timing output for data shapes (240000, 5), (2400000, 5), (4800000, 5) and (6000000, 5) was attached here.]

Since it was missing in the original post, a few words on my data structure. First of all, each sample is unique. Using pandas to check:

The data has a very special structure, best described as a checkerboard (coordinates on a regular grid, dimensions 3 and 4 for 0-based indexing) with 24 vectors (dimensions 0, 1, 2) placed on every tile. On one tile, all 24 vectors differ (otherwise the data points would not be unique), but neighbouring tiles often hold the same or similar vectors. The data is ordered, i.e. point 0 is the first vector on (0,0), point 1 the second vector on (0,0), point 24 is the first vector on (1,0), etc. This can also be seen from the
If you have data on a regular grid, there are much more efficient ways to do neighbors searches. Sounds like this is a corner case in which the data configuration happens to cause near worst-case performance of the tree building.
@jakevdp only 2 of the dimensions are regular (those dimensions are a * (n_x, n_y), where a is a constant with 0.01 < a < 0.5 and n_x, n_y is the number of points, typically 40 < n_x/n_y < 10000). The other 3 dimensions are in the range [-1.07, 1.07]; 24 of them exist on each point of the regular grid and they are not regular. What I finally need (for DBSCAN) is a sparse distance matrix.
DBSCAN should compute the distance matrix automatically from the input, but if you need to compute it manually you can use kneighbors_graph or related routines.
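For example (a sketch with hypothetical eps/min_samples values; radius_neighbors_graph with mode='distance' returns a sparse matrix that DBSCAN can consume with metric='precomputed'):

import numpy as np
from sklearn.neighbors import radius_neighbors_graph
from sklearn.cluster import DBSCAN

X = np.load('search.npy')
eps = 0.3                                   # hypothetical radius

# Sparse matrix of distances to all neighbors within eps of each sample.
D = radius_neighbors_graph(X, radius=eps, mode='distance', include_self=False)

labels = DBSCAN(eps=eps, min_samples=5, metric='precomputed').fit_predict(D)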
I wonder whether we should shuffle the data in the tree to avoid degenerate cases in the sorting.
My suspicion is that this is an extremely infrequent corner-case, and adding computational and memory overhead in every case would be a bit overkill.
I think the case is "sorted data", which I imagine can happen. Maybe it would be good to check whether we can make the sorting more robust.
I suspect the key is that it's gridded data, sorted along one of the dimensions. The combination of that structure and the presence of duplicates could hit the worst case for a basic binary partition algorithm... there are probably variants out there that would perform better. Anyone take an algorithms course recently? 😎
SciPy can use a sliding midpoint or a median rule to split kd-trees. The slowness on gridded data has been noticed for SciPy as well, when building a kd-tree with the median rule. It is due to the use of quickselect instead of introselect. Sklearn suffers from the same problem. Another thing I have noticed is that the size of the data set matters as well. For large data sets (e.g. several million points) building with the median rule can be very slow, even for well-behaved data. With large data sets it is always a good idea to use the sliding midpoint rule instead.
@sturlamolden what's your recommendation?
For large data sets (typically >1E6 data points), use cKDTree with balanced_tree=False. This will build the kd-tree using the sliding midpoint rule, and tends to be a lot faster on large data sets. The sliding midpoint rule requires no partial sorting to find the pivot points, which is why it helps on larger data sets. Dealing with presorted data is harder, as we must know the problem in advance. One option would be to use introselect instead of quickselect. The required C code is in NumPy and can be adapted. This is not perfect: although introselect is always O(N), it is slow O(N) for presorted data. Another option would be to build in some sort of timeout, and switch strategy to sliding midpoint if building the kd-tree takes too long (e.g. if it exceeds one second).
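A usage sketch of that recommendation (the file name is from the issue; the query at the end is just an example):

import numpy as np
from scipy.spatial import cKDTree

data = np.load('search.npy')

# balanced_tree=False switches the build from the median rule to the
# sliding midpoint rule, avoiding the partial sort at every node.
tree = cKDTree(data, balanced_tree=False)

# Example query: distance and index of the nearest other point for the first 10 samples.
dist, idx = tree.query(data[:10], k=2)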
Description
Building a kd-tree can be done in O(n(k+log(n))) time and should (to my knowledge) not depend on the details of the data. However, the KDTree implementation in scikit-learn shows really poor scaling behavior for my data. I cannot reproduce this behavior with data generated by sklearn.datasets.samples_generator.make_blobs.
Steps/Code to Reproduce
Download the numpy data (search.npy) from https://webshare.mpie.de/index.php?6b4495f7e7 and run the following code on Python 3:
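(The script itself did not survive in this copy of the issue. The sketch below is one way to produce timing output in the reported format; the use of NearestNeighbors for the separate 'ball_tree'/'kd_tree' lines versus a direct KDTree build is an assumption.)

import time
import numpy as np
from sklearn.neighbors import NearestNeighbors, KDTree
from scipy.spatial import KDTree as ScipyKDTree

data = np.load('search.npy')

for n in (240000, 2400000, 4800000, 6000000):
    X = data[:n]
    print('data shape', X.shape)
    print('delta', X.max(axis=0) - X.min(axis=0))   # per-dimension range

    t0 = time.time()
    NearestNeighbors(algorithm='ball_tree').fit(X)
    print('sklearn.neighbors (ball_tree) build finished in {}s'.format(time.time() - t0))

    t0 = time.time()
    NearestNeighbors(algorithm='kd_tree').fit(X)
    print('sklearn.neighbors (kd_tree) build finished in {}s'.format(time.time() - t0))

    t0 = time.time()
    KDTree(X)
    print('sklearn.neighbors KD tree build finished in {}s'.format(time.time() - t0))

    t0 = time.time()
    ScipyKDTree(X)
    print('scipy.spatial KD tree build finished in {}s'.format(time.time() - t0))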
Expected Results
Time complexity scaling of the scikit-learn KDTree should be similar to the scaling of the scipy.spatial KDTree.
Actual Results
Artificial data: scikit-learn outperforms scipy.spatial
data shape (240000, 5)
delta [ 22.7311549 22.61482157 22.57353059 22.65385101 22.77163478]
sklearn.neighbors (ball_tree) build finished in 0.1524970519822091s
sklearn.neighbors (kd_tree) build finished in 0.17296032601734623s
sklearn.neighbors KD tree build finished in 0.172917598974891s
scipy.spatial KD tree build finished in 2.265735782973934s
data shape (2400000, 5)
delta [ 23.38025743 23.22174801 22.88042798 22.8831237 23.31696732]
sklearn.neighbors (ball_tree) build finished in 3.462802237016149s
sklearn.neighbors (kd_tree) build finished in 3.7110973289818503s
sklearn.neighbors KD tree build finished in 3.5682168990024365s
scipy.spatial KD tree build finished in 19.92274082399672s
data shape (4800000, 5)
delta [ 23.38025743 23.26302877 23.22210673 22.97866792 23.31696732]
sklearn.neighbors (ball_tree) build finished in 8.922708058031276s
sklearn.neighbors (kd_tree) build finished in 9.238389031030238s
sklearn.neighbors KD tree build finished in 8.879073369025718s
scipy.spatial KD tree build finished in 38.43681587401079s
data shape (6000000, 5)
delta [ 23.42236957 23.26302877 23.22210673 23.20207953 23.31696732]
sklearn.neighbors (ball_tree) build finished in 12.75000820402056s
sklearn.neighbors (kd_tree) build finished in 13.30022174998885s
sklearn.neighbors KD tree build finished in 12.794657755992375s
scipy.spatial KD tree build finished in 48.33784791099606s
My data: scipy.spatial outperforms scikit-learn as data size increases
data shape (240000, 5)
delta [ 2.14487407 2.14472508 2.14499087 8.86612151 0.15491879]
sklearn.neighbors (ball_tree) build finished in 0.39374090504134074s
sklearn.neighbors (kd_tree) build finished in 0.21525143302278593s
sklearn.neighbors KD tree build finished in 0.21449304796988145s
scipy.spatial KD tree build finished in 2.244567967019975s
data shape (2400000, 5)
delta [ 2.14502773 2.14502543 2.14502904 8.86612151 1.59685522]
sklearn.neighbors (ball_tree) build finished in 4.199425678991247s
sklearn.neighbors (kd_tree) build finished in 4.40237572795013s
sklearn.neighbors KD tree build finished in 4.295626600971445s
scipy.spatial KD tree build finished in 26.322200270951726s
data shape (4800000, 5)
delta [ 2.14502773 2.14502864 2.14502904 8.86612151 3.19371044]
sklearn.neighbors (ball_tree) build finished in 110.31694995303405s
sklearn.neighbors (kd_tree) build finished in 112.8703724470106s
sklearn.neighbors KD tree build finished in 114.07325625402154s
scipy.spatial KD tree build finished in 51.79352715797722s
data shape (6000000, 5)
delta [ 2.14502838 2.14502902 2.14502914 8.86612151 3.99213804]
sklearn.neighbors (ball_tree) build finished in 2458.668528069975s
sklearn.neighbors (kd_tree) build finished in 2451.2438263060176s
sklearn.neighbors KD tree build finished in 2801.8054143560003s
scipy.spatial KD tree build finished in 62.066240190993994s
Comments
cKDTree from scipy.spatial behaves even better
I cannot use cKDTree/KDTree from scipy.spatial because calculating a sparse distance matrix (the sparse_distance_matrix function) is extremely slow compared to neighbors.radius_neighbors_graph/neighbors.kneighbors_graph, and I need a sparse distance matrix for DBSCAN on large datasets (n_samples > 10 million) with low dimensionality (n_features = 5 or 6).
Versions
Linux-4.7.6-1-ARCH-x86_64-with-arch
Python 3.5.2 (default, Jun 28 2016, 08:46:01) [GCC 6.1.1 20160602]
NumPy 1.11.2
SciPy 0.18.1
Scikit-Learn 0.18