8000 WIP - Cythonize geometry series operations by mrocklin · Pull Request #467 · geopandas/geopandas · GitHub
[go: up one dir, main page]

Skip to content

WIP - Cythonize geometry series operations #467

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 91 commits into from

Conversation

mrocklin
Copy link
Member
@mrocklin mrocklin commented Jul 17, 2017

This cythonizes a class of geometry operations within geoseries. It builds off of @jorisvandenbossche notebook in #430 by extending the set of operations and setting up a proper build environment.

Some things to note

  1. There is a Cython build setup provided by @eriknw
  2. Cython code is only used if available, falling back on the previous Python solution
  3. @wmay has another attempt at Vectorize all GEOS functions shapely/shapely#501
  4. I get around a 50-100x speedup on a simple comparison
  5. There are some inconsistencies between this and what shapely does that I haven't yet tracked won
  6. There are still plenty of operations to do. This just expands on @jorisvandenbossche 's work

elif op == 'covered_by':
func = GEOSPreparedCoveredBy_r
# elif op == 'equals':
# func = GEOSEquals_r
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This causes the compiler to complain, I suspect because it was not intended to work on prepared geometries.

@codecov
Copy link
codecov bot commented Jul 17, 2017

Codecov Report

Merging #467 into master will decrease coverage by 0.05%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #467      +/-   ##
==========================================
- Coverage   88.97%   88.92%   -0.06%     
==========================================
  Files          29       29              
  Lines        3002     2988      -14     
==========================================
- Hits         2671     2657      -14     
  Misses        331      331
Impacted Files Coverage Δ
geopandas/base.py 90.9% <100%> (+0.22%) ⬆️
geopandas/geoseries.py 90.54% <100%> (-0.37%) ⬇️
geopandas/tests/test_geom_methods.py 98.55% <100%> (-0.07%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cfb3120...e605935. Read the comment docs.



def _series_op(this, other, op, **kwargs):
if kwargs:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of lines 80, 81; would this if statement also need to check if the op argument was set to equals?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not just that, but many operators that are not currently supported. Currently this is handled by the NotImplementedError try-except block below. We would want to extend this approach to all relevant operations before considering merging though (if we want to consider that at all).

@mrocklin
Copy link
Member Author

Also to be clear, I'd love for anyone to help on this. I'm mostly just cobbling together a combination of @eriknw 's cython setup and @jorisvandenbossche 's notebook. There are several other functions to parallelize that, I suspect, would be doable by copy-paste-and-modify coding techniques.

@martindurant
Copy link

__geom__ is a ctypes pointer to the same value as _geom - using it within a cdef is much faster:

%%cython
cdef cycle(p):
    cdef int i
    cdef long j
    for i in range(10000):
        j = p._geom

def run(p):
    cycle(p)

%timeit run(p)
1.45 ms ± 54.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%cython
cdef cycle2(p):
    cdef int i
    cdef long j
    for i in range(10000):
        j = p.__geom__

def run2(p):
    cycle2(p)

%timeit run2(p)

254 µs ± 42.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

A lookup is still being done - @jorisvandenbossche 's >x200 is certainly cheating!

One way to get towards the bigger speed-up could indeed be a hash table if id(object)->geos int; this will help in the case that the object is referenced more than once. I tried with a python dict for the table, but that is slower than doing the lookup every time (because the system checks whether the key exists, etc.,), so would need to implement a pure-C hash-table instead.
For reference, python id() is fast, but the following is the fastest way to get the id from an object array

%timeit np.frombuffer(a.tostring(), dtype='uint64')

4.37 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

(suspect may break on new pypy version of numpy).

as suggested by martindurant
@mrocklin
Copy link
Member Author

Neat, I've pushed the use of __geom__. I think that @jorisvandenbossche 's 200x difference would be possible if we didn't create shapely objects, but rather just kept around the pointers to GEOS C objects. This would require a more specialized geoseries object, but seems doable to me.

I think that there are some more critical things to consider on this PR before then though, notably how to extend this speedup to a greater fraction of operations, in particular GEOSEquals calls, and how to address Nones within the series (currently tests fail).

@martindurant
Copy link

I see - rewriting without shapely would definitely be possible, but take some work.

What is the appropriate output when one or other of the comparisand elements is None? In cython, is None should be fast and not change the execution time much.

@mrocklin
Copy link
Member Author

I've added the None check. As you suggest this does not hugely impact the time.

For reference, my benchmark is the following:

import geopandas as gpd
import random, numpy as np
import shapely
import time

point = shapely.geometry.Point(random.random(), random.random())

triangles = np.array([shapely.geometry.Polygon([(random.random(),
                                                 random.random())
                                                for _ in range(3)])
                      for _ in range(100000)], dtype=object)
gdf = gpd.GeoDataFrame({'geometry': triangles, 'x': 1})


start = time.time()
for i in range(10):
    gdf.geometry.contains(point)
end = time.time()
print((end - start) / 10)

The contains call takes around 700ms in master and 15ms in this branch.

@martindurant
Copy link

In order to build, I need the following:

--- a/setup.py
+++ b/setup.py
@@ -29,7 +29,7 @@ from distutils.core import Extension
 from distutils.command.build_ext import build_ext
 from distutils.errors import (CCompilerError, DistutilsExecError,
                               DistutilsPlatformError)
-
+import numpy
@@ -72,7 +72,8 @@ suffix = '.pyx' if use_cython else '.c'
 ext_modules = []
 for modname in ['_geoseries']:
     ext_modules.append(Extension('geopandas.' + modname,
-                                 ['geopandas/' + modname + suffix]))
+                                 ['geopandas/' + modname + suffix],
+                                 include_dirs=[numpy.get_include()]))

... are you building with setup.py build_ext --with-cython without these lines?

@mrocklin
Copy link
Member Author

Yes, I'm just using make inplace as defined in the Makefile.

@martindurant
Copy link

I got an email saying I had permission to this PR, but I don't seem to be able to push. Here is a suggested patch:

diff --git a/geopandas/_geoseries.pxd b/geopandas/_geoseries.pxd
index 09cab52..fba297b 100644
--- a/geopandas/_geoseries.pxd
+++ b/geopandas/_geoseries.pxd
@@ -1,3 +1,4 @@
-cdef _cy_series_op_fast(this, other, op)
+cimport numpy as np
+cdef _cy_series_op_fast(np.ndarray[object, ndim=1] array, object geometry, str op)
 cdef _cy_series_op_slow(this, other, op, kwargs)

diff --git a/geopandas/_geoseries.pyx b/geopandas/_geoseries.pyx
index 9ba4922..63228e1 100644
--- a/geopandas/_geoseries.pyx
+++ b/geopandas/_geoseries.pyx
@@ -49,9 +49,10 @@ cdef _cy_series_op_slow(this, other, op, kwargs):

 @cython.boundscheck(False)
 @cython.wraparound(False)
-cdef _cy_series_op_fast(array, geometry, op):
+cdef _cy_series_op_fast(np.ndarray[object, ndim=1] array, object geometry, str op):

     cdef Py_ssize_t idx
+    cdef bytes op2 = op.encode()
     cdef unsigned int n = array.size
     cdef np.ndarray[np.uint8_t, ndim=1, cast=True] result = np.empty(n, dtype=np.uint8)

@@ -60,23 +61,23 @@ cdef _cy_series_op_fast(array, geometry, op):
     cdef GEOSGeometry *geom2
     cdef uintptr_t geos_geom

-    if op == 'contains':
+    if op2 == b'contains':
         func = GEOSPreparedContains_r
-    elif op == 'intersects':
+    elif op2 == b'intersects':
         func = GEOSPreparedIntersects_r
-    elif op == 'touches':
+    elif op2 == b'touches':
         func = GEOSPreparedTouches_r
-    elif op == 'crosses':
+    elif op2 == b'crosses':
         func = GEOSPreparedCrosses_r
-    elif op == 'within':
+    elif op2 == b'within':
         func = GEOSPreparedWithin_r
-    elif op == 'contains_properly':
+    elif op2 == b'contains_properly':
         func = GEOSPreparedContainsProperly_r
-    elif op == 'overlaps':
+    elif op2 == b'overlaps':
         func = GEOSPreparedOverlaps_r
-    elif op == 'covers':
+    elif op2 == b'covers':
         func = GEOSPreparedCovers_r
-    elif op == 'covered_by':
+    elif op2 == b'covered_by':
         func = GEOSPreparedCoveredBy_r
     # elif op == 'equals':
     #     func = GEOSEquals_r
diff --git a/setup.py b/setup.py
index 101ef63..c7f52c0 100644
--- a/setup.py
+++ b/setup.py
@@ -29,7 +29,7 @@ from distutils.core import Extension
 from distutils.command.build_ext import build_ext
 from distutils.errors import (CCompilerError, DistutilsExecError,
                               DistutilsPlatformError)
-
+import numpy
 import versioneer

 LONG_DESCRIPTION = """GeoPandas is a project to add support for geographic data to
@@ -72,7 +72,8 @@ suffix = '.pyx' if use_cython else '.c'
 ext_modules = []
 for modname in ['_geoseries']:
     ext_modules.append(Extension('geopandas.' + modname,
-                                 ['geopandas/' + modname + suffix]))
+                                 ['geopandas/' + modname + suffix],
+                                 include_dirs=[numpy.get_include()]))
 if use_cython:
     # Set global Cython options
     # http://docs.cython.org/en/latest/src/reference/compilation.html#compiler-directives

The small changes above seem to improve timing by ~20% or so, but tests fail for me locally. They were failing with the version above mine that passed, so I am assuming this is my system only.

@mrocklin
Copy link
Member Author

Why the bytestring changes?

@martindurant
Copy link
martindurant commented Jul 18, 2017

Why the bytestring changes?

The doc suggests encoding python strings when within a cdef - but the intent there may have been only for when passing the string on the C functions.

I did a timing just now, and it seems that unicode string comp and bytes string comp may nearly the same time; so maybe it's cleaner not the bother. In any case, this only happens once per function call, whatever the number of elements.

In [28]: %%cython
    ...: def comp(str x, str y):
    ...:     cdef int i
    ...:     for i in range(10000):
    ...:         x == y
    ...:

In [29]: %timeit comp(x, x2)
7.06 µs ± 129 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [31]: %%cython
    ...: def comp(bytes x, bytes y):
    ...:     cdef int i
    ...:     for i in range(10000):
    ...:         x == y
    ...:

In [32]: %timeit comp(y, y2)
5.58 µs ± 286 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

(x and x2, y and y2 are equal but not the same object)

@martindurant
Copy link

Do you have latest benchmarks, and do you know if my suggestions helped for you?

I like pulling out the __geom__ extraction from the loop. Since it's the same for both code-paths, it could stand separately, with a view to using raw geos ids later. The same is true for creating the result arrays - it would be reasonable to (optionally) provide an array to assign into.
Also, I would check that we can handle the op outside the function, to minimize what gets hidden inside cdefs. I wouldn't expect any of these to have an impact on benchmarks.

@mrocklin
Copy link
Member Author

My only benchmark is what's above. I haven't yet tried your changes, but will soon. Currently I'm trying to get tests to pass.

@mrocklin
Copy link
Member Author

Now supporting non-prepared geometry operations and releasing the GIL during GEOS function calls

This branch

In [1]: import geopandas as gpd
   ...: import random, numpy as np
   ...: import shapely
   ...: import time
   ...: 
   ...: point = shapely.geometry.Point(random.random(), random.random())
   ...: 
   ...: triangles = np.array([shapely.geometry.Polygon([(random.random(),
   ...:                                                  random.random())
   ...:                                                 for _ in range(3)])
   ...:                       for _ in range(1000000)], dtype=object)
   ...:                       
   ...: gdf = gpd.GeoDataFrame({'geometry': triangles, 'x': 1})
   ...: 

In [2]: %load_ext ptime

In [3]: %ptime -n 4 gdf.geometry.contains(point)
Total serial time:   0.48 s
Total parallel time: 0.25 s
For a 1.89X speedup across 4 threads

Master

In [3]: %ptime -n 4 gdf.geometry.contains(point)
Total serial time:   37.13 s
Total parallel time: 46.74 s
For a 0.79X speedup across 4 threads

@mrocklin
Copy link
Member Author

Oh, hrm, for some reason it looks like my previous comment didn't make it through. Pushed.

There are still a couple of failures with contains and within. We seem to have different behavior when the polygons are exact.

s = pd.Series(shapes, index=list('abcdefghij'), name='foo')
g = GeoSeries(s)

assert [a.equals(b) for a, b in zip(s, g)]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this always True if the list is non-empty, even if the elements are False?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think a np.all around it is missing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Resolved in a recent commit

@mrocklin
Copy link
Member Author
mrocklin commented Aug 3, 2017
==== 46 failed, 223 passed, 6 skipped, 3 xfailed, 12 warnings in 92.38 seconds ==== 

@jreback
Copy link
jreback commented Aug 3, 2017

@jorisvandenbossche your comment above: #467 (comment)

df = ....
df['foo'] = GeoSeries(...)

should work; this ultimately calls BlockManager.insert which should not be changing the tenor of the block itself (nor coercing).

you need to make sure that DataFrame._sanitize_column passes it thru.

Then you need to modify (and this is prob hacky at this point and NOT very pluggable):

  • make_block
  • is_extension_type

@mrocklin
Copy link
Member Author
mrocklin commented Aug 4, 2017

For the moment at least it looks like we're not passing through cleanly.

In [1]: import geopandas as gpd

In [2]: gdf = gpd.GeoDataFrame({'x': [1]})

In [3]: gdf
Out[3]: 
   x
0  1

In [4]: from shapely.geometry import Polygon

In [5]: gs = gpd.GeoSeries([Polygon([(0, 0), (0, 1), (1, 1)])])

In [6]: gdf['y'] = gs
I am densified (external_values, 1 elements)

In [7]: gdf
Out[7]: 
   x                               y
0  1  POLYGON ((0 0, 0 1, 1 1, 0 0))

In [8]: gdf._data
Out[8]: 
BlockManager
Items: Index(['x', 'y'], dtype='object')
Axis 1: RangeIndex(start=0, stop=1, step=1)
IntBlock: slice(0, 1, 1), 1 x 1, dtype: int64
ObjectBlock: slice(1, 2, 1), 1 x 1, dtype: object

In [9]: gs._data._block
Out[9]: GeometryBlock: 1 dtype: object

@jdmcbr jdmcbr mentioned this pull request Aug 4, 2017
@jorisvandenbossche
Copy link
Member

@jreback If you look at the current implementation of BlockManager.insert, it is written to accept an array-like, not a block. The passed values are converted to a block by pandas, and thus it will not preserve a block type pandas is not aware of (https://github.com/pandas-dev/pandas/blob/929c66fd74da221078a67ea7fd3dbcbe21d642e0/pandas/core/internals.py#L3895-L3920)

We could change this behaviour in pandas (either by preserving the block class if a block is passed instead of array-like, or by having some kind of registry of blocks so pandas can create the correct one).
Also for example concat does not preserve block types, which is something I will have to try to fix in pandas to get it working for geopandas.

For this reason, we do this "unraveling the blockmanager, adapt blocks/axes, reconstruct blockmanager" logic to overcome that limitation (which works for the GeoDataFrame constructor and other code in geopandas, but not for eg concat)

@@ -921,14 +921,6 @@ class GeometryArray(object):
self.data = geoms
self.parent = None

@property
def x(self):
return get_coordinate_point(self.data, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is OK to keep this, we just need to deal with the case when it are not all Points. I recently merged a PR of @jdmcbr to add this to GeoSeries (not yet in this branch), see https://github.com/geopandas/geopandas/pull/383/files. There we raise a ValueError if not all geoms are points.

(which of course does not mean GeometryArray should have this, as we can also put this logic in GeoSeries, and use there the get_coordinate_point)

@@ -1075,6 +1099,23 @@ class GeometryArray(object):
return buffer(self.data, distance, resolution, cap_style, join_style,
mitre_limit)

def types(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current GeoSeries, this is called geom_type

x = vec_type(self.data)

types = GEOMETRY_TYPES[:]
x[x == 255] = len(types)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 255 returned by GEOSGeomTypeId to indicate missing? (not included in the types)

If so, I think you need to put this to -1 (this is used as missing value indicator in codes):

In [36]: pd.Categorical.from_codes([0,1,2], ['a', 'b'])
...
ValueError: codes need to be between -1 and len(categories)-1

In [37]: pd.Categorical.from_codes([0,1,-1], ['a', 'b'])
Out[37]: 
[a, b, NaN]
Categories (2, object): [a, b]

@jorisvandenbossche
Copy link
Member

I merged pandas-dev/pandas#17143, so the (non-truncated) repr should now work when using pandas master

@mrocklin
Copy link
Member Author
mrocklin commented Aug 4, 2017

The old sjoin algorithm depends on pd.merge working well, which it currently doesn't. For now I've decided to keep the new sjoin algorithm in this PR under a new_sjoin function. I've expanded this to support the how= keyword. Doing this while using only supported pandas operations was an interesting challenge, but it seems to work out.

@jorisvandenbossche
Copy link
Member

FYI, I am just working on cleaning this up and making a separate branch for this work.

@mrocklin
Copy link
Member Author
mrocklin commented Aug 4, 2017

OK. I'll hold off for a bit. Thank you for doing the organization here.

This follows shapely behavior
@jorisvandenbossche
Copy link
Member
jorisvandenbossche commented Aug 5, 2017

Closing, this is merged in #472, and follow-up issue in #473. For the sjoin functionality, a new PR can be opened.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants
0