[go: up one dir, main page]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add coverage_simplify exposing topological simplification of coverages #1969

Draft
wants to merge 21 commits into
base: main
Choose a base branch
from

Conversation

martinfleis
Copy link
Contributor

xref #1570

GEOS supports this simplification for collections of polygons only. Here, I have currently opted for returning other geometries untouched but the correct behaviour is up to a discussion. When checking for correct geom types, nesting og GeometryCollections is not supported and a GeometryCollection of a GeometryCollections will not be simplified no matter the contents of the inner collection.

If we did not do this check, when points or line strings are passed, shapely returns (empty) GEOS Exception. When passing GeometryCollection with points or line strings, it kills the kernel, so I suppose that the checks I have included now are quite a feasible way forward.

I am mirroring the API of PostGIS, not that of GEOS. So the keyword is simplify_boundary rather than preserve_boundary which seems more intuitive.

@coveralls
Copy link
coveralls commented Jan 22, 2024

Pull Request Test Coverage Report for Build 10995499008

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 8 of 8 (100.0%) changed or added relevant lines in 1 file are covered.
  • 59 unchanged lines in 11 files lost coverage.
  • Overall coverage decreased (-0.2%) to 87.638%

Files with Coverage Reduction New Missed Lines %
shapely/geometry/multipoint.py 1 96.77%
shapely/geometry/polygon.py 1 96.77%
shapely/geometry/multilinestring.py 1 96.88%
shapely/geometry/point.py 2 95.08%
shapely/_ragged_array.py 3 97.84%
shapely/io.py 3 79.1%
shapely/set_operations.py 3 96.7%
shapely/_geometry.py 4 96.33%
shapely/geos.py 6 0.0%
shapely/constructive.py 15 90.38%
Totals Coverage Status
Change from base Build 10631218520: -0.2%
Covered Lines: 2616
Relevant Lines: 2985

💛 - Coveralls

Copy link
Collaborator
@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @martinfleis !

It would be good to report test cases that crash the kernel to GEOS; those should be handled upstream.

I assume you are going to expand the test cases?

shapely/constructive.py Outdated Show resolved Hide resolved
shapely/constructive.py Outdated Show resolved Hide resolved
shapely/constructive.py Outdated Show resolved Hide resolved
src/ufuncs.c Outdated Show resolved Hide resolved
@martinfleis
Copy link
Contributor Author

I'll fish those cases out and report to GEOS, good point.

I actually wrote more tests but apparently did not push the commit. Will do once I'll get to laptop.

@theroggy
Copy link
Contributor

I suppose that the input coverage needs to be entirely topologically sound? Eg. that if two polygons touch not in a common point but a point touches a vector of another polygon between two points, this will not be treated as a common boundary?

If this is the case, I think it will be useful to explicitly mention this in the doc to avoid misunderstandings...

@jorisvandenbossche
Copy link
Member

We also need to decide how to treat the input geometry argument. AFAIK in the current state of the PR, you just pass each input geometry to the GEOS function, and so this assumes that the user already collected their set of polygons into a collection.

Alternatively, we can also do that for the user. Similarly as ST_CoverageSimplify is a window function (thus taking a set of geometries to simplify together), and similarly as for union_all we also collect the input geometries in a collection first before passing to GEOSUnaryUnion.

@martinfleis
Copy link
Contributor Author

That's a good point. I initially wanted to do that on the GeoPandas side (collect, simplify, explode) but it may make sense to do the first two steps (collect and simplify) already in shapely and return the collection. I think this covers majority of use cases and if you want to simplify per geometry, you can always loop.

I'll leave this decision to you as you have a better feeling for what the optimal behaviour of shapely is.

(Another example of this is polygonize.)

@martinfleis
Copy link
Contributor Author

I have changed the behaviour to follow the logic of union_all or ST_CoverageSimplify to make the function operate on the whole array and return a collection. The axis keyword is not yet tested but please have a look if this makes more sense API-wise.

returned collected but unchanged.

If the geometry is polygonal but does not form a valid coverage due to overlaps,
it will be simplified but it may result in invalid topology.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martinfleis Lovely docstring, thank you!

Axis along which the operation is performed. The default (None)
performs the operation over all axes, returning a scalar value.
Axis may be negative, in which case it counts from the last to the
first axis.
Copy link
Contributor
@sgillies sgillies Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

axis is new to me. It comes from Pandas, right? I think boundaries between Pandas/GeoPandas and Shapely are a good thing and that we should resist blurring the projects if we can. What does axis really mean to Shapely, given that inputs can only be a scalar geometry or a 1-D array of geometries? Can't we infer axis from the type of input?

That aside, the doc for axis may be technically correct but isn't useful to a developer who isn't super fluent with numpy or Pandas. Can we make it more approachable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comes directly from shapely.union_all and other predicate _all functions which have the same notion of using axis to define what should be treated as a single group. So it is not new.

What does axis really mean to Shapely, given that inputs can only be a scalar geometry or a 1-D array of geometries?

That is not correct as far as I'm aware. While some operations will not work on a N-D array (STRtree notably), most of shapely operations are perfectly capable of handling N-D arrays of geometries and there's a clear use case for that in the vector data cube space for dealing with time-dependent geometries.

If you think this should change, we should do that across all functions that use it, so I'd keep it for a separate discussion and PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, since the low-level operations were written like ufuncs, I shouldn't be surprised that they would work on N-D arrays.

Even, so why would we want to ravel in one case, and rollaxis in the other? What does this mean to a GIS programmer? If lib.coverage_simplify() can accept N-D arrays, can the caller not ravel or rollaxis or neither (perhaps?) as they need before calling the method?

In matters of software design, I tend to be anti-convenience. Instead of accepting an axis argument and then modifying input based on the value, callers should just modify their inputs in their own applications. Then we have less code, fewer special cases and branches to test. I would make an exception for situations where GEOS C API methods must be called, because how is a user going to do that? But here the user already has numpy installed and can easily reshape their inputs.

Copy link
Contributor
@sgillies sgillies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need some clarification on what axis really stands for in this method and I'd like a discussion about whether this is a real feature or just a convenience.

Axis along which the operation is performed. The default (None)
performs the operation over all axes, returning a scalar value.
Axis may be negative, in which case it counts from the last to the
first axis.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, since the low-level operations were written like ufuncs, I shouldn't be surprised that they would work on N-D arrays.

Even, so why would we want to ravel in one case, and rollaxis in the other? What does this mean to a GIS programmer? If lib.coverage_simplify() can accept N-D arrays, can the caller not ravel or rollaxis or neither (perhaps?) as they need before calling the method?

In matters of software design, I tend to be anti-convenience. Instead of accepting an axis argument and then modifying input based on the value, callers should just modify their inputs in their own applications. Then we have less code, fewer special cases and branches to test. I would make an exception for situations where GEOS C API methods must be called, because how is a user going to do that? But here the user already has numpy installed and can easily reshape their inputs.

@martinfleis
Copy link
Contributor Author

I mean, I am not attached to the axis keyword. I even said above

The axis keyword is not yet tested but please have a look if this makes more sense API-wise.

The reason I included it in this way is that we need to collect individual geometries in some way before passing them to GEOS, otherwise we expect user to do that. The latter was originally implemented here but as pointed out by @jorisvandenbossche above, it may not be what user expects from this function.

When collecting geometries, shapely has a precedent of using the axis keyword. I simply followed that for the sake of consistency. Coverage simplification then works as aggregation, since it returns a GeometryCollection, following practically the same pattern as union_all and other functions.

I fully understand your concern but I am not sure if this function is different enough to deviate from this pattern. But in the end, that is not my call to make.

I can remove the axis and follow the route currently implemented for axis=None, I don't mind that, if you think that it does not belong here. But then you should maybe have a discussion if it belongs elsewhere as well.

@jorisvandenbossche
Copy link
Member
jorisvandenbossche commented Aug 31, 2024

axis is new to me. It comes from Pandas, right? I think boundaries between Pandas/GeoPandas and Shapely are a good thing and that we should resist blurring the projects if we can.

To be clear, the axis keyword does not come from pandas (I don't think we actually even use it in geopandas in any of the shapely methods we call), but it comes from numpy.
See https://numpy.org/doc/stable/glossary.html#term-axis and https://numpy.org/doc/stable/glossary.html#term-along-an-axis for operations that can be performed along an axis. Such operations have an axis keyword, and so for example all reduction functions or generalized ufuncs in numpy have that keyword.

Given that the core GEOS functions we now wrap are implemented as numpy ufuncs, I think providing the axis keyword in reducing ufuncs is very natural (for people familiar with numpy, and we can of course do a better job explaining this keyword in our docs for people less familiar).

Even, so why would we want to ravel in one case, and rollaxis in the other? What does this mean to a GIS programmer? If lib.coverage_simplify() can accept N-D arrays, can the caller not ravel or rollaxis or neither (perhaps?) as they need before calling the method?

The ravel or rollaxis call is an implementation detail that currently ensures the function behaves as expected for a numpy reduction: the default of axis=None means reducing over all values (producing a single value), specifying an axis means reducing along that axis (producing an array with the specified axis removed from the result' shape).

Can you the user do this themselves? Potentially yes, depending on what signature we would provide instead.
But would it be expected that you need to do this yourself? I would say no, for someone who expects an interface consistent with numpy (and I would personally also not call this just "convenience", using rollaxis is quite advanced, and my guess is the many numpy users will know the axis keyword, but will never have used rollaxis directly)

What does this mean to a GIS programmer?

To be fair, I don't think this specific capability is that useful for typical GIS applications. I think it is quite normal that a user would want to simplify different groups of geometries together, but it is probably not that likely that those different groups are then exactly the same size such that you actually can put them in a multidimensional numpy array and make use of reducing over an axis.
(hypothetical example: you have 5 groups of 10 polygons where you want to simplify per group; in that case you could put those geometries in a (5, 10) array, and call coverage_simplify with axis=1 to get the 5 simplified geometry collections (one per group). But so if those groups are not exactly the same size, you cannot make use of this feature and you will have to manually loop through the groups anyway)

But while this specific case doesn't seem that useful, there are several other aspects about the numpy ufunc capabilities that are very useful, such as the broadcasting behaviour (allowing you to do pairwise operations). And when we essentially provide standard numpy functions with those features, I would personally go for being consistent in following numpy's design entirely, and not only provide that subset of features that we think will be useful.


That said, we still need to make a choice about what kind of function we want to provide, and there are different options which all have some pros and cons (and not all of those options have an axis keyword when following numpy):

  • A plain simple element-wise ufunc (i.e. shape in = shape out): this is what Martin did in the first version of this PR. This means that it is up to the user to collect the polygons in geometry collections, and there would be no axis keyword.
  • A reduction-like function: this is the current state of the PR, where we collect the polygons in a collection under the hood. This returns a single scalar result by default (but can be changed with the axis keyword to reduce only over a specific axis)
  • A generalized ufunc with 1 dimension that gets reduced (this could actually be fully implemented in ufuncs.c in that case). This would be similar to polygonize that has a ufunc signature of (d)->(), which means that it reduces over the last dimension.
  • A final option could also be a function that does both the collection and explosion for the user, i.e a function that has "shape in = shape out" but is not elementwise (in the meaning of processing every element independently). It would simplify all geometries in the input array together, but preserve the shape of the input array (and essentially hide the fact that the core GEOS function is implemented in terms of GeometryCollections and not Polygons)

The first option is the closest to the pure GEOS function, the last option is closest to how PostGIS exposes this as ST_CoverageSimplify (and I think the most user friendly or "convenient").

Note that there is prior art in "hiding" the fact that certain GEOS functions work on a GeometryCollection (instead of on an array of geometries, but that might be because the collection is easier in the C API). The unary_union function that shapely has exposed for a long time, has always collected the input geometries in a collection first, instead of requiring the user to do that (and same for eg polygonize_full)

@martinfleis
Copy link
Contributor Author

I have just tested that doing the option four, meaning collect-simplify-explode, is possible as the order of geometries in the collection seems to be preserved throughout the simplification procedure. That is how I would like this to behave in GeoPandas but within shapely, it feels inconsistent as other reductions like polygonize return a collection of polygons, not an array of individual geometries. But truth is, that a user does not need to know the implementation details in GEOS and we can use the API that more mirrors normal simplify than polygonize or union, even though that is how it practically works.

So I'm happy to update the PR to follow the first option. However, with a N-D array (e.g. time x geom used in vector data cubes), we'd need to slice and loop over 1D arrays, compared to the current state of the PR.

@jorisvandenbossche
Copy link
Member

within shapely, it feels inconsistent as other reductions like polygonize return a collection of polygons, not an array of individual geometries.

Yes, but we could also argue that coverage_simplify is not a reduction. Just like simplify is also not a reduction. Taking option 4 would make simplify and coverage_simplify more similar in usage, with the only difference that simplify works purely element-by-element where coverage_simplify considers the full input as context to simplify each element.

(for polygonize there is also no 1-to-1 mapping between input and output, so that's a bit of a different situation and would neither fit in a reduction or window function if it returned the exploded array, so I think here making it a reduction makes more sense)

@martinfleis
Copy link
Contributor Author

Alright. I'll try to update the PR over the weekend.

@jorisvandenbossche
Copy link
Member

(maybe wait a bit to give Sean time to respond as well)

@jorisvandenbossche
Copy link
Member

@sgillies do you have time to take another look at the discussion above?

@sgillies
Copy link
Contributor
sgillies commented Sep 5, 2024

@jorisvandenbossche @martinfleis thank you for the discussion. I've finally got my head around the current implementation. It makes sense to me. I'm also attracted to the 4th option.

Do either of you know if GEOS coverage simplification always conserves polygons? If pieces of a coverage are ever simplified away to nothing, then we could not implement the fourth option, right?

@martinfleis
Copy link
Contributor Author
martinfleis commented Sep 6, 2024

The VW algorithm reduces a polygon to a triangle at most, so there's always a polygon as an output. I will verify that again but so far I believe that we can implement the fourth option.

@jorisvandenbossche
Copy link
Member

That's indeed what the doc comment in the C API guarantees (https://github.com/libgeos/geos/blob/2b5721bc96d64c3d768bd0e91dba6c3bbca354a9/capi/geos_c.h.in#L3960):

Geometries never disappear, but they may be simplified down to just
a triangle. Also, some invalid geoms (such as Polygons which have too
few non-repeated points) will be returned unchanged.

@sgillies
Copy link
Contributor
sgillies commented Sep 6, 2024

Let's do the fourth option, then, and agree that this isn't a reduction operation?

BTW, I'm happy with the 5 instances in set_operations.py.

@jorisvandenbossche jorisvandenbossche added this to the 2.1 milestone Sep 13, 2024
@martinfleis martinfleis marked this pull request as draft September 23, 2024 13:38
@martinfleis
Copy link
Contributor Author

I have changed the behaviour to follow the fourth model but I don't think it is quite right implementation-wise. Right now, it works in principle but does not work like a ufunc, so casting it on a GeoSeries does not preserve GeoSeries output etc... And I think that this logic should happen in C but I think I've hit the limits of my C and wasn't able to make it work.

So marking this as a draft but I'll probably need to pass this on to someone more capable in C. It got more complicated than anticipated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants