-
-
Notifications
You must be signed in to change notification settings - Fork 972
WIP - Cythonize geometry series operations #467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
geopandas/_geoseries.pyx
Outdated
elif op == 'covered_by': | ||
func = GEOSPreparedCoveredBy_r | ||
# elif op == 'equals': | ||
# func = GEOSEquals_r |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This causes the compiler to complain, I suspect because it was not intended to work on prepared geometries.
Codecov Report
@@ Coverage Diff @@
## master #467 +/- ##
==========================================
- Coverage 88.97% 88.92% -0.06%
==========================================
Files 29 29
Lines 3002 2988 -14
==========================================
- Hits 2671 2657 -14
Misses 331 331
Continue to review full report at Codecov.
|
geopandas/_geoseries.pyx
Outdated
|
||
|
||
def _series_op(this, other, op, **kwargs): | ||
if kwargs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of lines 80, 81; would this if statement also need to check if the op
argument was set to equals
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not just that, but many operators that are not currently supported. Currently this is handled by the NotImplementedError
try-except block below. We would want to extend this approach to all relevant operations before considering merging though (if we want to consider that at all).
Also to be clear, I'd love for anyone to help on this. I'm mostly just cobbling together a combination of @eriknw 's cython setup and @jorisvandenbossche 's notebook. There are several other functions to parallelize that, I suspect, would be doable by copy-paste-and-modify coding techniques. |
A lookup is still being done - @jorisvandenbossche 's >x200 is certainly cheating! One way to get towards the bigger speed-up could indeed be a hash table if id(object)->geos int; this will help in the case that the object is referenced more than once. I tried with a python dict for the table, but that is slower than doing the lookup every time (because the system checks whether the key exists, etc.,), so would need to implement a pure-C hash-table instead.
(suspect may break on new pypy version of numpy). |
as suggested by martindurant
Neat, I've pushed the use of I think that there are some more critical things to consider on this PR before then though, notably how to extend this speedup to a greater fraction of operations, in particular |
I see - rewriting without shapely would definitely be possible, but take some work. What is the appropriate output when one or other of the comparisand elements is None? In cython, |
I've added the For reference, my benchmark is the following: import geopandas as gpd
import random, numpy as np
import shapely
import time
point = shapely.geometry.Point(random.random(), random.random())
triangles = np.array([shapely.geometry.Polygon([(random.random(),
random.random())
for _ in range(3)])
for _ in range(100000)], dtype=object)
gdf = gpd.GeoDataFrame({'geometry': triangles, 'x': 1})
start = time.time()
for i in range(10):
gdf.geometry.contains(point)
end = time.time()
print((end - start) / 10) The contains call takes around 700ms in master and 15ms in this branch. |
In order to build, I need the following:
... are you building with |
Yes, I'm just using |
I got an email saying I had permission to this PR, but I don't seem to be able to push. Here is a suggested patch:
The small changes above seem to improve timing by ~20% or so, but tests fail for me locally. They were failing with the version above mine that passed, so I am assuming this is my system only. |
Why the bytestring changes? |
The doc suggests encoding python strings when within a cdef - but the intent there may have been only for when passing the string on the C functions. I did a timing just now, and it seems that unicode string comp and bytes string comp may nearly the same time; so maybe it's cleaner not the bother. In any case, this only happens once per function call, whatever the number of elements.
(x and x2, y and y2 are equal but not the same object) |
Do you have latest benchmarks, and do you know if my suggestions helped for you? I like pulling out the |
My only benchmark is what's above. I haven't yet tried your changes, but will soon. Currently I'm trying to get tests to pass. |
Now supporting non-prepared geometry operations and releasing the GIL during GEOS function calls This branchIn [1]: import geopandas as gpd
...: import random, numpy as np
...: import shapely
...: import time
...:
...: point = shapely.geometry.Point(random.random(), random.random())
...:
...: triangles = np.array([shapely.geometry.Polygon([(random.random(),
...: random.random())
...: for _ in range(3)])
...: for _ in range(1000000)], dtype=object)
...:
...: gdf = gpd.GeoDataFrame({'geometry': triangles, 'x': 1})
...:
In [2]: %load_ext ptime
In [3]: %ptime -n 4 gdf.geometry.contains(point)
Total serial time: 0.48 s
Total parallel time: 0.25 s
For a 1.89X speedup across 4 threads Master
|
Oh, hrm, for some reason it looks like my previous comment didn't make it through. Pushed. There are still a couple of failures with contains and within. We seem to have different behavior when the polygons are exact. |
s = pd.Series(shapes, index=list('abcdefghij'), name='foo') | ||
g = GeoSeries(s) | ||
|
||
assert [a.equals(b) for a, b in zip(s, g)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this always True if the list is non-empty, even if the elements are False?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I think a np.all
around it is missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Resolved in a recent commit
|
@jorisvandenbossche your comment above: #467 (comment)
should work; this ultimately calls BlockManager.insert which should not be changing the tenor of the block itself (nor coercing). you need to make sure that Then you need to modify (and this is prob hacky at this point and NOT very pluggable):
|
For the moment at least it looks like we're not passing through cleanly. In [1]: import geopandas as gpd
In [2]: gdf = gpd.GeoDataFrame({'x': [1]})
In [3]: gdf
Out[3]:
x
0 1
In [4]: from shapely.geometry import Polygon
In [5]: gs = gpd.GeoSeries([Polygon([(0, 0), (0, 1), (1, 1)])])
In [6]: gdf['y'] = gs
I am densified (external_values, 1 elements)
In [7]: gdf
Out[7]:
x y
0 1 POLYGON ((0 0, 0 1, 1 1, 0 0))
In [8]: gdf._data
Out[8]:
BlockManager
Items: Index(['x', 'y'], dtype='object')
Axis 1: RangeIndex(start=0, stop=1, step=1)
IntBlock: slice(0, 1, 1), 1 x 1, dtype: int64
ObjectBlock: slice(1, 2, 1), 1 x 1, dtype: object
In [9]: gs._data._block
Out[9]: GeometryBlock: 1 dtype: object |
@jreback If you look at the current implementation of We could change this behaviour in pandas (either by preserving the block class if a block is passed instead of array-like, or by having some kind of registry of blocks so pandas can create the correct one). For this reason, we do this "unraveling the blockmanager, adapt blocks/axes, reconstruct blockmanager" logic to overcome that limitation (which works for the GeoDataFrame constructor and other code in geopandas, but not for eg concat) |
geopandas/vectorized.pyx
Outdated
@@ -921,14 +921,6 @@ class GeometryArray(object): | |||
self.data = geoms | |||
self.parent = None | |||
|
|||
@property | |||
def x(self): | |||
return get_coordinate_point(self.data, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is OK to keep this, we just need to deal with the case when it are not all Points. I recently merged a PR of @jdmcbr to add this to GeoSeries (not yet in this branch), see https://github.com/geopandas/geopandas/pull/383/files. There we raise a ValueError if not all geoms are points.
(which of course does not mean GeometryArray should have this, as we can also put this logic in GeoSeries, and use there the get_coordinate_point
)
geopandas/vectorized.pyx
Outdated
@@ -1075,6 +1099,23 @@ class GeometryArray(object): | |||
return buffer(self.data, distance, resolution, cap_style, join_style, | |||
mitre_limit) | |||
|
|||
def types(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current GeoSeries, this is called geom_type
geopandas/vectorized.pyx
Outdated
x = vec_type(self.data) | ||
|
||
types = GEOMETRY_TYPES[:] | ||
x[x == 255] = len(types) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is 255 returned by GEOSGeomTypeId
to indicate missing? (not included in the types)
If so, I think you need to put this to -1 (this is used as missing value indicator in codes):
In [36]: pd.Categorical.from_codes([0,1,2], ['a', 'b'])
...
ValueError: codes need to be between -1 and len(categories)-1
In [37]: pd.Categorical.from_codes([0,1,-1], ['a', 'b'])
Out[37]:
[a, b, NaN]
Categories (2, object): [a, b]
I merged pandas-dev/pandas#17143, so the (non-truncated) repr should now work when using pandas master |
The old sjoin algorithm depends on |
FYI, I am just working on cleaning this up and making a separate branch for this work. |
OK. I'll hold off for a bit. Thank you for doing the organization here. |
This follows shapely behavior
This cythonizes a class of geometry operations within geoseries. It builds off of @jorisvandenbossche notebook in #430 by extending the set of operations and setting up a proper build environment.
Some things to note