Replace morphology.local_maxima with faster flood-fill based Cython version #3022
Conversation
Thanks @lagru! Fantastic to get identical results to local_maxima with a 10-20x speed advantage! To answer your decision points, from my perspective:
A major point in this review is that this function and these algorithms would make complete sense in n dimensions, rather than restricting to 2D, and also with varying levels of connectivity (4- or 8-connected in 2D, 6-, 18-, or 26-connected in 3D, though I prefer the terms 1-connected, 2-connected, 3-connected for the same). This would be easier for algo B than A, I think. I've been meaning to write a guide about working in nD vs 2D, but in the meantime, I suggest looking at the morphology.watershed code for an example. The key is to think in indices and neighbours instead of rows/columns and -1/+1 on each. One potential limiting factor is that, if I remember correctly, we have flood fill in 2D and 3D but not nD. But just 2D/3D is a big improvement over 2D-only.
I would generally agree. However, I fear that this would incur noticeable speed penalties for the 2D case.
That seems well documented but a little bit overwhelming to grasp at first glance. I think I get that you actually iterate over each pixel of the flattened array. The neighbors are found by using offsets from the current index. I don't yet understand how these offsets are computed without using too much memory. E.g., for a connectivity of 2 in the 2D case, 8 offsets are required for each pixel (except at the edges). So I would need memory at least 8 times the size of the provided image. This gets even worse for higher dimensions...
Awesome work. Many thanks! I reply to A, B and D.
A: I agree with your intuition :)
Amazing. That's really cool. What I do not understand is: why do you get the same results as
Another question I am asking myself: where does this enormous speed gain come from? Is this algorithmic or implementation?
(Assuming you mean
I think the implementation in Cython definitely plays a role and
OK cool, thanks a lot for your comment. Because my point is that if the reconstruction is really suboptimal, it might be worth investigating an alternative implementation irrespective of the detection of local maxima, which you have solved already. Simply because the reconstruction can be used for many things. By the way, I think it is better to have at least the option to also have maxima that touch the border. Actually, there is no mathematically sound reason for excluding or including them in the first place (I mean, it is just a choice), but for some morphological operators it is important to have all maxima to start with.
@ThomasWalter I should have added that excluding maxima at borders simplifies the implementation of algorithm A and, more importantly, allows the first loop to exclude more pixels. Especially for "unnatural" images with large plateaus that may get excluded early, this can reduce the evaluation burden for the second, queue-based part!
@ThomasWalter I'm going from memory/intuition here rather than a full overview of the code, but if I remember correctly, the morphological reconstruction will run a dilation on the full image multiple times, enough to cover any plateaus. On the other hand, this algorithm only needs one pass over the image + some flood filling, so it can potentially be much faster. Regarding the borders, we can reduce borderless to border-ful by padding with the minimum value. (Though this may not be the fastest/most efficient approach).
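The padding trick mentioned above can be sketched with NumPy (an illustration, not code from this PR): padding with the global minimum turns every original border pixel into an interior pixel, so an algorithm that excludes maxima on the (new) border still detects maxima touching the original border.

```python
import numpy as np

# Toy image whose maxima (5 and 3) touch the border.
image = np.array([[5, 1, 1],
                  [1, 2, 1],
                  [1, 1, 3]])

# Pad with the global minimum: former border pixels now have neighbors
# that cannot be greater, so border maxima survive a border-excluding
# detection. Crop the result back with [1:-1, 1:-1] afterwards.
padded = np.pad(image, pad_width=1, mode="constant",
                constant_values=image.min())
```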
No, you compute the offsets "just in time", when you are observing each pixel/voxel. So you compute an offset array (a 4-, 8-, 6-, 18-, or 26-element array of signed ints) and when you check a pixel you compute its neighbors by adding the pixel coordinate to that array. You do not broadcast all pixel coordinates against the offset array! That would be crazy. =P (In my defense, I wrote this code a very long time ago. =D)
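The just-in-time offset idea can be sketched in plain Python/NumPy (a simplified, hypothetical analogue of the Cython machinery; `raveled_offsets` is a made-up name, not a scikit-image function):

```python
import numpy as np
from itertools import product

def raveled_offsets(shape, connectivity=1):
    """Raveled-index deltas to all neighbors for a C-ordered array.

    A step is included if it moves by at most 1 along each axis and
    uses at most `connectivity` axes at once (4-/8-connected in 2D).
    """
    ndim = len(shape)
    # Stride (in elements) of each axis for a C-contiguous layout.
    strides = [int(np.prod(shape[d + 1:])) for d in range(ndim)]
    offsets = []
    for delta in product((-1, 0, 1), repeat=ndim):
        n_nonzero = sum(d != 0 for d in delta)
        if 0 < n_nonzero <= connectivity:
            offsets.append(sum(d * s for d, s in zip(delta, strides)))
    return np.array(offsets)

# For an interior pixel at raveled index i, the neighbor values are just
# image.ravel()[i + raveled_offsets(image.shape, connectivity)] --
# one small offset array total, not one per pixel.
```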
Alright, I got an ND-version of algorithm A to work and it seems like it's even slightly faster than the 2D versions for the 2D case. :D I updated the benchmark notebook as well so you can see for yourself.
skimage/morphology/extrema.py
# Set edge values of flag-array to 3
without_borders = (slice(1, -1) for _ in range(flags.ndim))
flags = np.pad(flags[tuple(without_borders)], pad_width=1,
               mode="constant", constant_values=3)
There must be a better / faster way to do this, right? But I can't think of a way to get a view or slice of only the "edge" values (the inverse of without_borders).
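One possible alternative (an untested sketch, not what the PR does; `set_border_flags` is a hypothetical helper): assign the edge value directly to the first and last hyperplane along each axis. This works for any number of dimensions and avoids allocating a padded copy.

```python
import numpy as np

def set_border_flags(flags, value=3):
    """Set the outermost slice along every axis to `value`, in place."""
    for ax in range(flags.ndim):
        index = [slice(None)] * flags.ndim
        index[ax] = 0           # first hyperplane along this axis
        flags[tuple(index)] = value
        index[ax] = -1          # last hyperplane along this axis
        flags[tuple(index)] = value
    return flags

flags = np.zeros((3, 4), dtype=int)
set_border_flags(flags)
```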
skimage/morphology/_extrema_cy.pyx
# TODO prevent integer overflow!
self._buffer_size *= 2
new_buffer_ptr = <QueueItem*>realloc(
I expect that there is a constant or variable that stores the maximal possible size of Py_ssize_t, but I can't find it.
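For reference, `PY_SSIZE_T_MAX` is defined in CPython's `Python.h`, and at the Python level the same value is exposed as `sys.maxsize`. A hedged Python sketch of overflow-safe doubling (the function name is made up for illustration; the real code would do this in Cython before calling `realloc`):

```python
import sys

def grown_size(current, max_size=sys.maxsize):
    """Double a buffer size, saturating at max_size instead of overflowing."""
    if current >= max_size // 2:
        # Doubling would exceed the Py_ssize_t range.
        if current >= max_size:
            raise MemoryError("buffer cannot grow any further")
        return max_size
    return current * 2
```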
Some additional question:
skimage/morphology/_extrema_cy.pyx
cdef:
    struct Queue:
Any chance we could re-use a data structure already present somewhere else in order to limit the complexity of the code?
Oh I see that you've mentioned the Queue of the watershed. A common codebase would be great if it makes sense.
I agree, a common codebase would be ideal (see this comment as well #2876 (comment)). However, as already mentioned, there are some problematic differences between the two queue implementations and their requirements, which I will try to summarize below:
Heap (general_heap.pxi) is a priority queue used by watershed, Queue is used in my code:
- Multiple iterations: Queue allows multiple iterations via q_restore and preserves a consumed (popped) item internally until cleared explicitly. Heap may overwrite consumed data.
- Order: Queue uses insertion order while Heap sorts (using the array of pointers) by the return value of a function smaller.
- As a consequence, the internal data representation is different as well: Queue uses only one array to store its items while Heap uses an additional array to store pointers to its items.
@emmanuelle You seem to be the original contributor of Heap so please point out any misunderstandings on my part.
I don't see much overlap here concerning the internal machinery of these "basic" containers. But it may be possible to implement a data container that fulfills the requirements of both Heap and Queue. But would that be wise? I think my code would be slower due to overhead introduced by the sorting machinery and watershed would have to suffer the larger memory footprint due to internal preservation of popped items.
Having said that, if you still think it might be worth it I'll try to make a quick draft of a queue that can be used as a replacement for both. That might be more useful to reach a more informed decision.
No no, a FIFO Queue and a Heap are two different data structures, even if the latter is used to implement Priority Queues. I don't think there is much code to be shared between them and it makes sense to keep them separate.
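The distinction is easy to see with Python's standard containers (illustration only; the containers in this PR are C-level structures): a FIFO queue yields items in insertion order, a heap in priority order.

```python
from collections import deque
import heapq

items = [3, 1, 2]

# FIFO queue: items come out in insertion order.
fifo = deque(items)
fifo_order = [fifo.popleft() for _ in range(3)]

# Heap / priority queue: items come out in sorted (priority) order.
heap = list(items)
heapq.heapify(heap)
heap_order = [heapq.heappop(heap) for _ in range(3)]
```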
I don't have anything to add at this point, but want to express my excitement at this conversation happening. Thanks!
Okay, the new algorithm seems to pass the unit tests of the old
To F: I would prefer to replace the old slower one. Mainly to avoid ambiguity between 3 maxima finding functions and because I can't think of a more fitting name than
To G: I like the current scope: find local maxima depending on connectivity and definition at the edges. Convenient filter options could be provided with another function, e.g.
To H: Honestly, I would prefer to worry about that in another PR after the API and scope are clear.
@lagru my 2c:
I want to keep
These two new functions work very similarly to skimage.morphology.local_maxima but are generally faster.
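As a rough illustration of what such a function computes, here is a naive O(n · neighbors) NumPy sketch with 8-connectivity. It is not the PR's flood-fill algorithm (in particular it does not handle plateaus the same way), and `local_maxima_naive` is a made-up name:

```python
import numpy as np
from itertools import product

def local_maxima_naive(image):
    """Boolean mask of pixels that are >= all of their 8 neighbors."""
    # Pad with a value below every pixel so borders compare cleanly
    # (beware integer dtypes already at their minimum).
    low = image.min() - 1
    padded = np.pad(image, 1, mode="constant", constant_values=low)
    out = np.ones(image.shape, dtype=bool)
    rows, cols = image.shape
    for dr, dc in product((-1, 0, 1), repeat=2):
        if dr == 0 and dc == 0:
            continue
        # View of `padded` shifted by (dr, dc), aligned with `image`.
        shifted = padded[1 + dr:1 + dr + rows, 1 + dc:1 + dc + cols]
        out &= image >= shifted
    return out

image = np.array([[1, 2, 1],
                  [2, 5, 2],
                  [1, 2, 1]])
mask = local_maxima_naive(image)
```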
I have difficulties finding the offending commit. It seems like declaring the flag values as module variables in 08da763 caused a noticeable slowdown which I don't understand. I'd have thought that the compiler would optimize that change away. These two commits serve only the purpose of making the code more readable; are they worth the performance hit? The current implementation however is still magnitudes faster than the old version and still in the same neighborhood as
@lagru I'm running benchmarks on every commit in that range using the machinery in #3137. Generally readability counts a lot in this library, and anyway I think there are probably Cython optimisations to make sure that those variable assignments don't actually cost anything. Once I figure out exactly what broke, I'll report back to see whether there are any possible fixes.
@lagru which benchmark specifically are you concerned about? For the random image (1000, 1000) it's actually a bit faster after "MAINT: Merge _compute_neighbors and _offsets_to_raveled_neighbors". Here are the results:
This is on my branch
ping @lagru
@jni Sorry that I took so long, and thank you for investigating it yourself!
When rerunning the notebook with the benchmarks I get a qualitative difference between the new
BUT I can't even reproduce the original results of the notebook any longer. My only explanation is that something in the background or an update to a related library changed how the benchmarks perform. Considering that weird behavior, I'd leave it as it is. If I solve this mystery I can always come back and apply it.
Ok, I agree, this is not worth worrying about at this moment. @scikit-image/core anyone care to review/merge? @ThomasWalter maybe you would like to review also?
@jni : please go ahead, I did not really follow. (actually I did not know that I can review, as I do not belong to the core) |
@ThomasWalter well now you know. =) @soupault @stefanv @emmanuelle an extra ping for good measure. ;) This is a big PR but the code is very clean after several revisions, so it should not be too tricky to review. Have a look! Note: I intend to backport to 0.14 so let's keep the Py2 compatibility stuff and take it out in a later PR.
Don't use NOT_MAXIMUM for boolean flag `is_maximum`. This might break if the value is changed and no longer 0.
Failed due to mismatching display formats on different platforms. This should alleviate the issue.
@lagru This is an exceptional piece of work; I wish all PRs were of such quality!
@meeseeksdev backport to v0.14.x |
There seems to be a conflict, please backport manually
This is a follow-up to #3016 that proposes an alternative flood-fill based algorithm in Cython for maxima detection in images (and ND arrays). You can find a demo and benchmarks here.

Description

This PR is meant to be a place for discussion first and is not necessarily meant to be merged. The current state is work in progress and I'm sure there is room for improvement. However, I felt that I have reached a point where it would be useful to start a discussion and review process. There are some open points that will / could have a large impact on the function's performance and implementation, which I list below or as review comments. Furthermore, I felt a little out of my depth concerning some C/Cython related stuff, especially while implementing the re-sizable buffer. So if you see any bad practices, please point them out! I'm eager to improve. For a description of the algorithms themselves, please have a look at the docstrings and code comments. Including it here wouldn't make much sense as these are still WIP.

Oh, and just because it isn't said enough: Thank you guys for your work and time. scikit-image is a great tool and library!
Checklist
I'll try to add talking points and arguments you bring up here.
General decision points
--> I'll just make this toggle-able. For now I'll set the default to exclude maxima on borders.
(B) Should the function return the indices of the maxima (like feature.peak_local_max) or a "boolean" array with the same shape as the image (like morphology.local_maxima)?
--> Return array of 1's and 0's. Optionally add a parameter to change to indices.
(C) Which variant of the algorithm should be used? Benchmarks here...
--> It seems that algorithm A (prefiltering the array in one direction) has a better performance for all test cases. So I'll prefer that one for now.
(D) Is this a candidate for SciPy's ndimage module or for scikit-image? If the latter, which module?
--> I'll post and ask on the SciPy mailing list when I'm confident with the progress.
- (E) Do you actually want this as an addition to the respective library? 😉
Final tasks
./doc/examples (new features only) --> Already exists here