ENH: add coverage_simplify exposing topological simplification of coverages #1969

martinfleis · 2024-01-22T10:41:53Z

GEOS supports this simplification for collections of polygons only. Here, I have currently opted for returning other geometries untouched but the correct behaviour is up to a discussion. When checking for correct geom types, nesting og GeometryCollections is not supported and a GeometryCollection of a GeometryCollections will not be simplified no matter the contents of the inner collection.

If we did not do this check, when points or line strings are passed, shapely returns (empty) GEOS Exception. When passing GeometryCollection with points or line strings, it kills the kernel, so I suppose that the checks I have included now are quite a feasible way forward.

I am mirroring the API of PostGIS, not that of GEOS. So the keyword is simplify_boundary rather than preserve_boundary which seems more intuitive.

coveralls · 2024-01-22T10:58:15Z

Pull Request Test Coverage Report for Build 10995499008

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

8 of 8 (100.0%) changed or added relevant lines in 1 file are covered.
59 unchanged lines in 11 files lost coverage.
Overall coverage decreased (-0.2%) to 87.638%

Files with Coverage Reduction	New Missed Lines	%
shapely/geometry/multipoint.py	1	96.77%
shapely/geometry/polygon.py	1	96.77%
shapely/geometry/multilinestring.py	1	96.88%
shapely/geometry/point.py	2	95.08%
shapely/_ragged_array.py	3	97.84%
shapely/io.py	3	79.1%
shapely/set_operations.py	3	96.7%
shapely/_geometry.py	4	96.33%
shapely/geos.py	6	0.0%
shapely/constructive.py	15	90.38%

Totals
Change from base Build 10631218520:	-0.2%
Covered Lines:	2616
Relevant Lines:	2985

💛 - Coveralls

brendan-ward

Thanks for working on this @martinfleis !

It would be good to report test cases that crash the kernel to GEOS; those should be handled upstream.

I assume you are going to expand the test cases?

shapely/constructive.py

src/ufuncs.c

martinfleis · 2024-01-22T18:06:23Z

I'll fish those cases out and report to GEOS, good point.

I actually wrote more tests but apparently did not push the commit. Will do once I'll get to laptop.

Co-authored-by: Brendan Ward <bcward@astutespruce.com>

theroggy · 2024-01-24T10:55:06Z

I suppose that the input coverage needs to be entirely topologically sound? Eg. that if two polygons touch not in a common point but a point touches a vector of another polygon between two points, this will not be treated as a common boundary?

If this is the case, I think it will be useful to explicitly mention this in the doc to avoid misunderstandings...

jorisvandenbossche · 2024-01-26T14:39:50Z

We also need to decide how to treat the input geometry argument. AFAIK in the current state of the PR, you just pass each input geometry to the GEOS function, and so this assumes that the user already collected their set of polygons into a collection.

Alternatively, we can also do that for the user. Similarly as ST_CoverageSimplify is a window function (thus taking a set of geometries to simplify together), and similarly as for union_all we also collect the input geometries in a collection first before passing to GEOSUnaryUnion.

martinfleis · 2024-01-26T18:53:17Z

That's a good point. I initially wanted to do that on the GeoPandas side (collect, simplify, explode) but it may make sense to do the first two steps (collect and simplify) already in shapely and return the collection. I think this covers majority of use cases and if you want to simplify per geometry, you can always loop.

I'll leave this decision to you as you have a better feeling for what the optimal behaviour of shapely is.

(Another example of this is polygonize.)

martinfleis · 2024-03-15T15:18:51Z

I have changed the behaviour to follow the logic of union_all or ST_CoverageSimplify to make the function operate on the whole array and return a collection. The axis keyword is not yet tested but please have a look if this makes more sense API-wise.

sgillies · 2024-08-30T02:13:48Z

shapely/constructive.py

+    returned collected but unchanged.
+
+    If the geometry is polygonal but does not form a valid coverage due to overlaps,
+    it will be simplified but it may result in invalid topology.


@martinfleis Lovely docstring, thank you!

sgillies · 2024-08-30T02:17:57Z

shapely/constructive.py

+        Axis along which the operation is performed. The default (None)
+        performs the operation over all axes, returning a scalar value.
+        Axis may be negative, in which case it counts from the last to the
+        first axis.


axis is new to me. It comes from Pandas, right? I think boundaries between Pandas/GeoPandas and Shapely are a good thing and that we should resist blurring the projects if we can. What does axis really mean to Shapely, given that inputs can only be a scalar geometry or a 1-D array of geometries? Can't we infer axis from the type of input?

That aside, the doc for axis may be technically correct but isn't useful to a developer who isn't super fluent with numpy or Pandas. Can we make it more approachable?

This comes directly from shapely.union_all and other predicate _all functions which have the same notion of using axis to define what should be treated as a single group. So it is not new.

What does axis really mean to Shapely, given that inputs can only be a scalar geometry or a 1-D array of geometries?

That is not correct as far as I'm aware. While some operations will not work on a N-D array (STRtree notably), most of shapely operations are perfectly capable of handling N-D arrays of geometries and there's a clear use case for that in the vector data cube space for dealing with time-dependent geometries.

If you think this should change, we should do that across all functions that use it, so I'd keep it for a separate discussion and PR.

Yeah, since the low-level operations were written like ufuncs, I shouldn't be surprised that they would work on N-D arrays.

Even, so why would we want to ravel in one case, and rollaxis in the other? What does this mean to a GIS programmer? If lib.coverage_simplify() can accept N-D arrays, can the caller not ravel or rollaxis or neither (perhaps?) as they need before calling the method?

In matters of software design, I tend to be anti-convenience. Instead of accepting an axis argument and then modifying input based on the value, callers should just modify their inputs in their own applications. Then we have less code, fewer special cases and branches to test. I would make an exception for situations where GEOS C API methods must be called, because how is a user going to do that? But here the user already has numpy installed and can easily reshape their inputs.

sgillies

I need some clarification on what axis really stands for in this method and I'd like a discussion about whether this is a real feature or just a convenience.

sgillies · 2024-08-30T15:53:00Z

shapely/constructive.py

+        Axis along which the operation is performed. The default (None)
+        performs the operation over all axes, returning a scalar value.
+        Axis may be negative, in which case it counts from the last to the
+        first axis.


Yeah, since the low-level operations were written like ufuncs, I shouldn't be surprised that they would work on N-D arrays.

Even, so why would we want to ravel in one case, and rollaxis in the other? What does this mean to a GIS programmer? If lib.coverage_simplify() can accept N-D arrays, can the caller not ravel or rollaxis or neither (perhaps?) as they need before calling the method?

In matters of software design, I tend to be anti-convenience. Instead of accepting an axis argument and then modifying input based on the value, callers should just modify their inputs in their own applications. Then we have less code, fewer special cases and branches to test. I would make an exception for situations where GEOS C API methods must be called, because how is a user going to do that? But here the user already has numpy installed and can easily reshape their inputs.

martinfleis · 2024-08-30T16:21:52Z

I mean, I am not attached to the axis keyword. I even said above

The axis keyword is not yet tested but please have a look if this makes more sense API-wise.

The reason I included it in this way is that we need to collect individual geometries in some way before passing them to GEOS, otherwise we expect user to do that. The latter was originally implemented here but as pointed out by @jorisvandenbossche above, it may not be what user expects from this function.

When collecting geometries, shapely has a precedent of using the axis keyword. I simply followed that for the sake of consistency. Coverage simplification then works as aggregation, since it returns a GeometryCollection, following practically the same pattern as union_all and other functions.

I fully understand your concern but I am not sure if this function is different enough to deviate from this pattern. But in the end, that is not my call to make.

I can remove the axis and follow the route currently implemented for axis=None, I don't mind that, if you think that it does not belong here. But then you should maybe have a discussion if it belongs elsewhere as well.

jorisvandenbossche · 2024-08-31T10:47:06Z

axis is new to me. It comes from Pandas, right? I think boundaries between Pandas/GeoPandas and Shapely are a good thing and that we should resist blurring the projects if we can.

To be clear, the axis keyword does not come from pandas (I don't think we actually even use it in geopandas in any of the shapely methods we call), but it comes from numpy.
See https://numpy.org/doc/stable/glossary.html#term-axis and https://numpy.org/doc/stable/glossary.html#term-along-an-axis for operations that can be performed along an axis. Such operations have an axis keyword, and so for example all reduction functions or generalized ufuncs in numpy have that keyword.

Given that the core GEOS functions we now wrap are implemented as numpy ufuncs, I think providing the axis keyword in reducing ufuncs is very natural (for people familiar with numpy, and we can of course do a better job explaining this keyword in our docs for people less familiar).

Even, so why would we want to ravel in one case, and rollaxis in the other? What does this mean to a GIS programmer? If lib.coverage_simplify() can accept N-D arrays, can the caller not ravel or rollaxis or neither (perhaps?) as they need before calling the method?

The ravel or rollaxis call is an implementation detail that currently ensures the function behaves as expected for a numpy reduction: the default of axis=None means reducing over all values (producing a single value), specifying an axis means reducing along that axis (producing an array with the specified axis removed from the result' shape).

Can you the user do this themselves? Potentially yes, depending on what signature we would provide instead.
But would it be expected that you need to do this yourself? I would say no, for someone who expects an interface consistent with numpy (and I would personally also not call this just "convenience", using rollaxis is quite advanced, and my guess is the many numpy users will know the axis keyword, but will never have used rollaxis directly)

What does this mean to a GIS programmer?

To be fair, I don't think this specific capability is that useful for typical GIS applications. I think it is quite normal that a user would want to simplify different groups of geometries together, but it is probably not that likely that those different groups are then exactly the same size such that you actually can put them in a multidimensional numpy array and make use of reducing over an axis.
(hypothetical example: you have 5 groups of 10 polygons where you want to simplify per group; in that case you could put those geometries in a (5, 10) array, and call coverage_simplify with axis=1 to get the 5 simplified geometry collections (one per group). But so if those groups are not exactly the same size, you cannot make use of this feature and you will have to manually loop through the groups anyway)

But while this specific case doesn't seem that useful, there are several other aspects about the numpy ufunc capabilities that are very useful, such as the broadcasting behaviour (allowing you to do pairwise operations). And when we essentially provide standard numpy functions with those features, I would personally go for being consistent in following numpy's design entirely, and not only provide that subset of features that we think will be useful.

That said, we still need to make a choice about what kind of function we want to provide, and there are different options which all have some pros and cons (and not all of those options have an axis keyword when following numpy):

A plain simple element-wise ufunc (i.e. shape in = shape out): this is what Martin did in the first version of this PR. This means that it is up to the user to collect the polygons in geometry collections, and there would be no axis keyword.
A reduction-like function: this is the current state of the PR, where we collect the polygons in a collection under the hood. This returns a single scalar result by default (but can be changed with the axis keyword to reduce only over a specific axis)
A generalized ufunc with 1 dimension that gets reduced (this could actually be fully implemented in ufuncs.c in that case). This would be similar to polygonize that has a ufunc signature of (d)->(), which means that it reduces over the last dimension.
A final option could also be a function that does both the collection and explosion for the user, i.e a function that has "shape in = shape out" but is not elementwise (in the meaning of processing every element independently). It would simplify all geometries in the input array together, but preserve the shape of the input array (and essentially hide the fact that the core GEOS function is implemented in terms of GeometryCollections and not Polygons)

The first option is the closest to the pure GEOS function, the last option is closest to how PostGIS exposes this as ST_CoverageSimplify (and I think the most user friendly or "convenient").

Note that there is prior art in "hiding" the fact that certain GEOS functions work on a GeometryCollection (instead of on an array of geometries, but that might be because the collection is easier in the C API). The unary_union function that shapely has exposed for a long time, has always collected the input geometries in a collection first, instead of requiring the user to do that (and same for eg polygonize_full)

martinfleis · 2024-08-31T11:18:37Z

I have just tested that doing the option four, meaning collect-simplify-explode, is possible as the order of geometries in the collection seems to be preserved throughout the simplification procedure. That is how I would like this to behave in GeoPandas but within shapely, it feels inconsistent as other reductions like polygonize return a collection of polygons, not an array of individual geometries. But truth is, that a user does not need to know the implementation details in GEOS and we can use the API that more mirrors normal simplify than polygonize or union, even though that is how it practically works.

So I'm happy to update the PR to follow the first option. However, with a N-D array (e.g. time x geom used in vector data cubes), we'd need to slice and loop over 1D arrays, compared to the current state of the PR.

jorisvandenbossche · 2024-08-31T11:54:19Z

within shapely, it feels inconsistent as other reductions like polygonize return a collection of polygons, not an array of individual geometries.

Yes, but we could also argue that coverage_simplify is not a reduction. Just like simplify is also not a reduction. Taking option 4 would make simplify and coverage_simplify more similar in usage, with the only difference that simplify works purely element-by-element where coverage_simplify considers the full input as context to simplify each element.

(for polygonize there is also no 1-to-1 mapping between input and output, so that's a bit of a different situation and would neither fit in a reduction or window function if it returned the exploded array, so I think here making it a reduction makes more sense)

martinfleis · 2024-08-31T12:04:28Z

Alright. I'll try to update the PR over the weekend.

jorisvandenbossche · 2024-08-31T12:34:12Z

(maybe wait a bit to give Sean time to respond as well)

jorisvandenbossche · 2024-09-05T20:11:55Z

@sgillies do you have time to take another look at the discussion above?

sgillies · 2024-09-05T23:25:28Z

@jorisvandenbossche @martinfleis thank you for the discussion. I've finally got my head around the current implementation. It makes sense to me. I'm also attracted to the 4th option.

Do either of you know if GEOS coverage simplification always conserves polygons? If pieces of a coverage are ever simplified away to nothing, then we could not implement the fourth option, right?

martinfleis · 2024-09-06T04:44:41Z

The VW algorithm reduces a polygon to a triangle at most, so there's always a polygon as an output. I will verify that again but so far I believe that we can implement the fourth option.

jorisvandenbossche · 2024-09-06T06:49:06Z

That's indeed what the doc comment in the C API guarantees (https://github.com/libgeos/geos/blob/2b5721bc96d64c3d768bd0e91dba6c3bbca354a9/capi/geos_c.h.in#L3960):

Geometries never disappear, but they may be simplified down to just
a triangle. Also, some invalid geoms (such as Polygons which have too
few non-repeated points) will be returned unchanged.

sgillies · 2024-09-06T22:21:22Z

Let's do the fourth option, then, and agree that this isn't a reduction operation?

BTW, I'm happy with the 5 instances in set_operations.py.

martinfleis · 2024-09-23T13:42:34Z

I have changed the behaviour to follow the fourth model but I don't think it is quite right implementation-wise. Right now, it works in principle but does not work like a ufunc, so casting it on a GeoSeries does not preserve GeoSeries output etc... And I think that this logic should happen in C but I think I've hit the limits of my C and wasn't able to make it work.

So marking this as a draft but I'll probably need to pass this on to someone more capable in C. It got more complicated than anticipated.

martinfleis added 7 commits January 21, 2024 21:41

basic implementation

e80209b

sanitize geometry types

1557704

add to __all__

15ddc56

geos version checks

05dc7d8

basic test

952b5fb

docstring

d672248

define within if clause

5eb3690

martinfleis mentioned this pull request Jan 22, 2024

ENH: Topological simplification geopandas/geopandas#1387

Open

brendan-ward reviewed Jan 22, 2024

View reviewed changes

shapely/constructive.py Outdated Show resolved Hide resolved

shapely/constructive.py Outdated Show resolved Hide resolved

shapely/constructive.py Outdated Show resolved Hide resolved

src/ufuncs.c Outdated Show resolved Hide resolved

martinfleis and others added 4 commits January 22, 2024 22:18

Apply suggestions from code review

fdda364

Co-authored-by: Brendan Ward <bcward@astutespruce.com>

tests

7560895

switch the order of the condtion

5417730

skip all

8de9d6e

martinfleis mentioned this pull request Jan 23, 2024

Unsupported geometry types in coverageSimplify result in segfaults libgeos/geos#1031

Closed

clarify the coverage rules

228feb5

jorisvandenbossche mentioned this pull request Jan 26, 2024

Functions to implement from GEOS #1570

Open

37 tasks

martinfleis added 2 commits March 15, 2024 15:19

Merge remote-tracking branch 'upstream/main' into vw

f6a5de2

collect all geoms prior simplification

c8edee4

martinfleis added 4 commits July 18, 2024 16:38

Merge remote-tracking branch 'upstream/main' into vw

beb89cc

Merge remote-tracking branch 'upstream/main' into vw

813dd8a

lint

8f733de

numpy fix

1164f79

sgillies reviewed Aug 30, 2024

View reviewed changes

martinfleis added 2 commits August 30, 2024 10:43

Merge remote-tracking branch 'upstream/main' into vw

2e63e62

Merge remote-tracking branch 'upstream/main' into vw

8260baa

sgillies requested changes Aug 30, 2024

View reviewed changes

jorisvandenbossche added this to the 2.1 milestone Sep 13, 2024

This was referenced Sep 13, 2024

2.1.0 release #2144

Open

ENH: add coverage validation function #2156

Draft

poc of api

fecad53

martinfleis marked this pull request as draft September 23, 2024 13:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: add coverage_simplify exposing topological simplification of coverages #1969

ENH: add coverage_simplify exposing topological simplification of coverages #1969

ENH: add coverage_simplify exposing topological simplification of coverages #1969

Are you sure you want to change the base?

ENH: add coverage_simplify exposing topological simplification of coverages #1969

Conversation

Pull Request Test Coverage Report for Build 10995499008

Warning: This coverage report may be inaccurate.

Details

💛 - Coveralls

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment