WIP - Cythonize geometry series operations #467

mrocklin · 2017-07-17T14:48:59Z

This cythonizes a class of geometry operations within geoseries. It builds off of @jorisvandenbossche's notebook in #430 by extending the set of operations and setting up a proper build environment.

Some things to note

There is a Cython build setup provided by @eriknw
Cython code is only used if available, falling back on the previous Python solution
@wmay has another attempt at "Vectorize all GEOS functions" (shapely/shapely#501)
I get around a 50-100x speedup on a simple comparison
There are some inconsistencies between this and what shapely does that I haven't yet tracked down
There are still plenty of operations to do. This just expands on @jorisvandenbossche's work
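
The "use Cython if available, otherwise fall back" item in the list above can be sketched roughly as follows. This is only an illustration, not the PR's code: `pick_series_op` and `_python_series_op` are hypothetical names, and the compiled module is assumed to expose a `_series_op` function.

```python
import importlib


def _python_series_op(this, other, op):
    # Pure-Python fallback: call the named method on each element in turn.
    return [getattr(geom, op)(other) for geom in this]


def pick_series_op(module_name):
    """Return (implementation, is_compiled) for an optional compiled module.

    Assumption: the compiled module, when present, exposes `_series_op`.
    """
    try:
        compiled = importlib.import_module(module_name)
        return compiled._series_op, True
    except (ImportError, AttributeError):
        # Extension not built (or missing the function): use the slow path.
        return _python_series_op, False
```

Resolving the implementation once at import time keeps the per-call overhead out of the hot path.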

mrocklin · 2017-07-17T15:00:37Z

geopandas/_geoseries.pyx

+    elif op == 'covered_by':
+        func = GEOSPreparedCoveredBy_r
+    # elif op == 'equals':
+    #     func = GEOSEquals_r


This causes the compiler to complain, I suspect because it was not intended to work on prepared geometries.

codecov · 2017-07-17T15:27:44Z

Codecov Report

Merging #467 into master will decrease coverage by 0.05%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #467      +/-   ##
==========================================
- Coverage   88.97%   88.92%   -0.06%     
==========================================
  Files          29       29              
  Lines        3002     2988      -14     
==========================================
- Hits         2671     2657      -14     
  Misses        331      331

Impacted Files	Coverage Δ
geopandas/base.py	`90.9% <100%> (+0.22%)`	⬆️
geopandas/geoseries.py	`90.54% <100%> (-0.37%)`	⬇️
geopandas/tests/test_geom_methods.py	`98.55% <100%> (-0.07%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update cfb3120...e605935.

kuanb · 2017-07-17T17:33:28Z

geopandas/_geoseries.pyx

+
+
+def _series_op(this, other, op, **kwargs):
+    if kwargs:


Because of lines 80, 81; would this if statement also need to check if the op argument was set to equals?

Not just that, but many operators that are not currently supported. Currently this is handled by the NotImplementedError try-except block below. We would want to extend this approach to all relevant operations before considering merging though (if we want to consider that at all).

mrocklin · 2017-07-17T21:51:00Z

Also to be clear, I'd love for anyone to help on this. I'm mostly just cobbling together a combination of @eriknw's cython setup and @jorisvandenbossche's notebook. There are several other functions to parallelize that, I suspect, would be doable by copy-paste-and-modify coding techniques.

martindurant · 2017-07-18T13:39:09Z

__geom__ is a ctypes pointer to the same value as _geom - using it within a cdef is much faster:

%%cython
cdef cycle(p):
    cdef int i
    cdef long j
    for i in range(10000):
        j = p._geom

def run(p):
    cycle(p)

%timeit run(p)
1.45 ms ± 54.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%cython
cdef cycle2(p):
    cdef int i
    cdef long j
    for i in range(10000):
        j = p.__geom__

def run2(p):
    cycle2(p)

%timeit run2(p)

254 µs ± 42.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

A lookup is still being done - @jorisvandenbossche's >200x is certainly cheating!

One way to get towards the bigger speed-up could indeed be a hash table of id(object) -> GEOS int; this will help in the case that the object is referenced more than once. I tried with a Python dict for the table, but that is slower than doing the lookup every time (because the system checks whether the key exists, etc.), so we would need to implement a pure-C hash table instead.
For reference, python id() is fast, but the following is the fastest way to get the id from an object array

%timeit np.frombuffer(a.tostring(), dtype='uint64')

4.37 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

(I suspect this may break on the new PyPy version of numpy.)
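
A small sketch of the frombuffer trick above, for reference. It relies on CPython implementation details (an object array stores one pointer per slot, and id() returns the object's address), so treat it as an assumption-laden illustration rather than portable code. `object_ids` is a hypothetical helper name; `tobytes` is the current name for the older `tostring` method.

```python
import numpy as np


def object_ids(arr):
    """Return id() of each element of a 1-D object array as unsigned ints.

    CPython-specific: reinterprets the array's stored pointers as integers.
    """
    arr = np.ascontiguousarray(arr, dtype=object)
    # np.uintp is an unsigned integer the size of a pointer on this platform.
    return np.frombuffer(arr.tobytes(), dtype=np.uintp)


a = np.array([object(), object(), "x"], dtype=object)
ids = object_ids(a)
```

On CPython each entry of `ids` should match `id()` of the corresponding element, which is what would make it usable as a key into an id -> GEOS pointer table.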

mrocklin · 2017-07-18T13:54:39Z

Neat, I've pushed the use of __geom__. I think that @jorisvandenbossche's 200x difference would be possible if we didn't create shapely objects, but rather just kept around the pointers to GEOS C objects. This would require a more specialized geoseries object, but seems doable to me.

I think that there are some more critical things to consider on this PR before then though, notably how to extend this speedup to a greater fraction of operations, in particular GEOSEquals calls, and how to address Nones within the series (currently tests fail).

martindurant · 2017-07-18T14:11:36Z

I see - rewriting without shapely would definitely be possible, but take some work.

What is the appropriate output when one or the other of the elements being compared is None? In Cython, an "is None" check should be fast and not change the execution time much.

mrocklin · 2017-07-18T14:24:51Z

I've added the None check. As you suggest this does not hugely impact the time.
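
As a rough pure-Python picture of a None check inside the element loop: the thread does not say what value a missing element should produce, so the `missing=False` default below is an assumption, and `predicate_with_missing` is a hypothetical name.

```python
def predicate_with_missing(values, other, op, missing=False):
    """Apply a binary predicate element-wise, skipping None entries.

    Assumption: a None element yields `missing` instead of calling the
    predicate; the appropriate value is left open in the discussion.
    """
    out = []
    for geom in values:
        if geom is None:
            # Never call into the predicate with a missing geometry.
            out.append(missing)
        else:
            out.append(bool(getattr(geom, op)(other)))
    return out
```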

For reference, my benchmark is the following:

import geopandas as gpd
import random, numpy as np
import shapely
import time

point = shapely.geometry.Point(random.random(), random.random())

triangles = np.array([shapely.geometry.Polygon([(random.random(),
                                                 random.random())
                                                for _ in range(3)])
                      for _ in range(100000)], dtype=object)
gdf = gpd.GeoDataFrame({'geometry': triangles, 'x': 1})


start = time.time()
for i in range(10):
    gdf.geometry.contains(point)
end = time.time()
print((end - start) / 10)

The contains call takes around 700ms in master and 15ms in this branch.

martindurant · 2017-07-18T14:36:09Z

In order to build, I need the following:

--- a/setup.py
+++ b/setup.py
@@ -29,7 +29,7 @@ from distutils.core import Extension
 from distutils.command.build_ext import build_ext
 from distutils.errors import (CCompilerError, DistutilsExecError,
                               DistutilsPlatformError)
-
+import numpy
@@ -72,7 +72,8 @@ suffix = '.pyx' if use_cython else '.c'
 ext_modules = []
 for modname in ['_geoseries']:
     ext_modules.append(Extension('geopandas.' + modname,
-                                 ['geopandas/' + modname + suffix]))
+                                 ['geopandas/' + modname + suffix],
+                                 include_dirs=[numpy.get_include()]))

... are you building with setup.py build_ext --with-cython without these lines?

mrocklin · 2017-07-18T14:37:07Z

Yes, I'm just using make inplace as defined in the Makefile.

martindurant · 2017-07-18T15:36:01Z

I got an email saying I had permission to this PR, but I don't seem to be able to push. Here is a suggested patch:

diff --git a/geopandas/_geoseries.pxd b/geopandas/_geoseries.pxd
index 09cab52..fba297b 100644
--- a/geopandas/_geoseries.pxd
+++ b/geopandas/_geoseries.pxd
@@ -1,3 +1,4 @@
-cdef _cy_series_op_fast(this, other, op)
+cimport numpy as np
+cdef _cy_series_op_fast(np.ndarray[object, ndim=1] array, object geometry, str op)
 cdef _cy_series_op_slow(this, other, op, kwargs)

diff --git a/geopandas/_geoseries.pyx b/geopandas/_geoseries.pyx
index 9ba4922..63228e1 100644
--- a/geopandas/_geoseries.pyx
+++ b/geopandas/_geoseries.pyx
@@ -49,9 +49,10 @@ cdef _cy_series_op_slow(this, other, op, kwargs):

 @cython.boundscheck(False)
 @cython.wraparound(False)
-cdef _cy_series_op_fast(array, geometry, op):
+cdef _cy_series_op_fast(np.ndarray[object, ndim=1] array, object geometry, str op):

     cdef Py_ssize_t idx
+    cdef bytes op2 = op.encode()
     cdef unsigned int n = array.size
     cdef np.ndarray[np.uint8_t, ndim=1, cast=True] result = np.empty(n, dtype=np.uint8)

@@ -60,23 +61,23 @@ cdef _cy_series_op_fast(array, geometry, op):
     cdef GEOSGeometry *geom2
     cdef uintptr_t geos_geom

-    if op == 'contains':
+    if op2 == b'contains':
         func = GEOSPreparedContains_r
-    elif op == 'intersects':
+    elif op2 == b'intersects':
         func = GEOSPreparedIntersects_r
-    elif op == 'touches':
+    elif op2 == b'touches':
         func = GEOSPreparedTouches_r
-    elif op == 'crosses':
+    elif op2 == b'crosses':
         func = GEOSPreparedCrosses_r
-    elif op == 'within':
+    elif op2 == b'within':
         func = GEOSPreparedWithin_r
-    elif op == 'contains_properly':
+    elif op2 == b'contains_properly':
         func = GEOSPreparedContainsProperly_r
-    elif op == 'overlaps':
+    elif op2 == b'overlaps':
         func = GEOSPreparedOverlaps_r
-    elif op == 'covers':
+    elif op2 == b'covers':
         func = GEOSPreparedCovers_r
-    elif op == 'covered_by':
+    elif op2 == b'covered_by':
         func = GEOSPreparedCoveredBy_r
     # elif op == 'equals':
     #     func = GEOSEquals_r
diff --git a/setup.py b/setup.py
index 101ef63..c7f52c0 100644
--- a/setup.py
+++ b/setup.py
@@ -29,7 +29,7 @@ from distutils.core import Extension
 from distutils.command.build_ext import build_ext
 from distutils.errors import (CCompilerError, DistutilsExecError,
                               DistutilsPlatformError)
-
+import numpy
 import versioneer

 LONG_DESCRIPTION = """GeoPandas is a project to add support for geographic data to
@@ -72,7 +72,8 @@ suffix = '.pyx' if use_cython else '.c'
 ext_modules = []
 for modname in ['_geoseries']:
     ext_modules.append(Extension('geopandas.' + modname,
-                                 ['geopandas/' + modname + suffix]))
+                                 ['geopandas/' + modname + suffix],
+                                 include_dirs=[numpy.get_include()]))
 if use_cython:
     # Set global Cython options
     # http://docs.cython.org/en/latest/src/reference/compilation.html#compiler-directives

The small changes above seem to improve timing by ~20% or so, but tests fail for me locally. They were also failing for me on the version before mine, which passed, so I am assuming this is my system only.

mrocklin · 2017-07-18T15:39:34Z

Why the bytestring changes?

martindurant · 2017-07-18T16:46:27Z

> Why the bytestring changes?

The doc suggests encoding Python strings when within a cdef - but the intent there may have been only for when passing the string on to C functions.

I did a timing just now, and it seems that unicode string comparison and bytes string comparison take nearly the same time, so maybe it's cleaner not to bother. In any case, this only happens once per function call, whatever the number of elements.

In [28]: %%cython
    ...: def comp(str x, str y):
    ...:     cdef int i
    ...:     for i in range(10000):
    ...:         x == y
    ...:

In [29]: %timeit comp(x, x2)
7.06 µs ± 129 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [31]: %%cython
    ...: def comp(bytes x, bytes y):
    ...:     cdef int i
    ...:     for i in range(10000):
    ...:         x == y
    ...:

In [32]: %timeit comp(y, y2)
5.58 µs ± 286 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

(x and x2, y and y2 are equal but not the same object)

martindurant · 2017-07-19T16:29:32Z

Do you have latest benchmarks, and do you know if my suggestions helped for you?

I like pulling out the __geom__ extraction from the loop. Since it's the same for both code-paths, it could stand separately, with a view to using raw geos ids later. The same is true for creating the result arrays - it would be reasonable to (optionally) provide an array to assign into.
Also, I would check that we can handle the op outside the function, to minimize what gets hidden inside cdefs. I wouldn't expect any of these to have an impact on benchmarks.
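
The "optionally provide an array to assign into" idea could be sketched like this, in the spirit of NumPy's `out=` convention. `apply_predicate` is a hypothetical helper, not a function from this PR.

```python
import numpy as np


def apply_predicate(values, predicate, out=None):
    """Evaluate `predicate` on each value, writing into `out` if given."""
    n = len(values)
    if out is None:
        # No buffer supplied: allocate a fresh result array.
        out = np.empty(n, dtype=bool)
    elif out.shape != (n,):
        raise ValueError("out must have shape (%d,)" % n)
    for i in range(n):
        out[i] = predicate(values[i])
    return out
```

Keeping allocation optional lets a caller reuse one buffer across repeated calls, and it separates result creation from the inner loop as suggested above.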

mrocklin · 2017-07-19T16:37:43Z

My only benchmark is what's above. I haven't yet tried your changes, but will soon. Currently I'm trying to get tests to pass.

mrocklin · 2017-07-19T16:51:17Z

Now supporting non-prepared geometry operations and releasing the GIL during GEOS function calls

This branch

In [1]: import geopandas as gpd
   ...: import random, numpy as np
   ...: import shapely
   ...: import time
   ...: 
   ...: point = shapely.geometry.Point(random.random(), random.random())
   ...: 
   ...: triangles = np.array([shapely.geometry.Polygon([(random.random(),
   ...:                                                  random.random())
   ...:                                                 for _ in range(3)])
   ...:                       for _ in range(1000000)], dtype=object)
   ...:                       
   ...: gdf = gpd.GeoDataFrame({'geometry': triangles, 'x': 1})
   ...: 

In [2]: %load_ext ptime

In [3]: %ptime -n 4 gdf.geometry.contains(point)
Total serial time:   0.48 s
Total parallel time: 0.25 s
For a 1.89X speedup across 4 threads

Master

In [3]: %ptime -n 4 gdf.geometry.contains(point)
Total serial time:   37.13 s
Total parallel time: 46.74 s
For a 0.79X speedup across 4 threads
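
Judging from its output, %ptime compares running a statement several times serially against running it once in each of several threads. A rough stand-in using only the standard library is sketched below; `serial_vs_threaded` is a hypothetical helper, and a speedup only shows up when the callable releases the GIL while it works.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def serial_vs_threaded(func, n_threads=4):
    """Return (serial_seconds, parallel_seconds) for n_threads calls of func."""
    start = time.perf_counter()
    for _ in range(n_threads):
        func()
    serial = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(func) for _ in range(n_threads)]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    parallel = time.perf_counter() - start
    return serial, parallel
```

For the benchmark above, `func` would be something like `lambda: gdf.geometry.contains(point)`, and `serial / parallel` gives the speedup figure.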

mrocklin · 2017-07-19T16:51:59Z

Oh, hrm, for some reason it looks like my previous comment didn't make it through. Pushed.

There are still a couple of failures with contains and within. We seem to have different behavior when the polygons are exact.

martindurant · 2017-08-03T21:06:10Z

geopandas/tests/test_geoseries.py

+    s = pd.Series(shapes, index=list('abcdefghij'), name='foo')
+    g = GeoSeries(s)
+
+    assert [a.equals(b) for a, b in zip(s, g)]


Isn't this always True if the list is non-empty, even if the elements are False?

yes, I think a np.all around it is missing?

Thanks. Resolved in a recent commit
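
A tiny demonstration of the pitfall discussed above: asserting on a list tests only whether the list is non-empty, so the elements need to be reduced with all() (or np.all) first.

```python
results = [False, False, False]

# A non-empty list is truthy regardless of its contents, so this passes
# even though every comparison failed.
assert results

# Reducing with all() checks the elements themselves.
every_element_true = all(results)
```

Here `every_element_true` is False, so `assert all(results)` would fail as intended.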

mrocklin · 2017-08-03T22:23:08Z

==== 46 failed, 223 passed, 6 skipped, 3 xfailed, 12 warnings in 92.38 seconds ====

jreback · 2017-08-03T22:38:25Z

@jorisvandenbossche your comment above: #467 (comment)

df = ....
df['foo'] = GeoSeries(...)

should work; this ultimately calls BlockManager.insert which should not be changing the tenor of the block itself (nor coercing).

you need to make sure that DataFrame._sanitize_column passes it thru.

Then you need to modify (and this is prob hacky at this point and NOT very pluggable):

make_block
is_extension_type

mrocklin · 2017-08-04T01:08:22Z

For the moment at least it looks like we're not passing through cleanly.

In [1]: import geopandas as gpd

In [2]: gdf = gpd.GeoDataFrame({'x': [1]})

In [3]: gdf
Out[3]: 
   x
0  1

In [4]: from shapely.geometry import Polygon

In [5]: gs = gpd.GeoSeries([Polygon([(0, 0), (0, 1), (1, 1)])])

In [6]: gdf['y'] = gs
I am densified (external_values, 1 elements)

In [7]: gdf
Out[7]: 
   x                               y
0  1  POLYGON ((0 0, 0 1, 1 1, 0 0))

In [8]: gdf._data
Out[8]: 
BlockManager
Items: Index(['x', 'y'], dtype='object')
Axis 1: RangeIndex(start=0, stop=1, step=1)
IntBlock: slice(0, 1, 1), 1 x 1, dtype: int64
ObjectBlock: slice(1, 2, 1), 1 x 1, dtype: object

In [9]: gs._data._block
Out[9]: GeometryBlock: 1 dtype: object

jorisvandenbossche · 2017-08-04T13:23:39Z

@jreback If you look at the current implementation of BlockManager.insert, it is written to accept an array-like, not a block. The passed values are converted to a block by pandas, and thus it will not preserve a block type pandas is not aware of (https://github.com/pandas-dev/pandas/blob/929c66fd74da221078a67ea7fd3dbcbe21d642e0/pandas/core/internals.py#L3895-L3920)

We could change this behaviour in pandas (either by preserving the block class if a block is passed instead of array-like, or by having some kind of registry of blocks so pandas can create the correct one).
Also for example concat does not preserve block types, which is something I will have to try to fix in pandas to get it working for geopandas.

For this reason, we do this "unraveling the blockmanager, adapt blocks/axes, reconstruct blockmanager" logic to overcome that limitation (which works for the GeoDataFrame constructor and other code in geopandas, but not for e.g. concat)

jorisvandenbossche · 2017-08-04T13:28:35Z

geopandas/vectorized.pyx

@@ -921,14 +921,6 @@ class GeometryArray(object):
        self.data = geoms
        self.parent = None

-    @property
-    def x(self):
-        return get_coordinate_point(self.data, 0)


I think it is OK to keep this, we just need to deal with the case when they are not all Points. I recently merged a PR of @jdmcbr to add this to GeoSeries (not yet in this branch), see https://github.com/geopandas/geopandas/pull/383/files. There we raise a ValueError if not all geoms are points.

(which of course does not mean GeometryArray should have this, as we can also put this logic in GeoSeries, and use there the get_coordinate_point)
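
The "raise a ValueError if not all geoms are points" behaviour could be sketched as below. This is a duck-typed illustration, not the code from the linked PR: `point_x` is a hypothetical name, the error message is illustrative, and elements are only assumed to have shapely-style `geom_type` and `x` attributes.

```python
def point_x(geoms):
    """Return the x coordinate of every geometry, requiring all Points."""
    xs = []
    for geom in geoms:
        if geom.geom_type != "Point":
            # Mixed geometry types have no single x coordinate.
            raise ValueError("x is only defined when all geometries are Points")
        xs.append(geom.x)
    return xs
```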

jorisvandenbossche · 2017-08-04T13:36:11Z

geopandas/vectorized.pyx

@@ -1075,6 +1099,23 @@ class GeometryArray(object):
        return buffer(self.data, distance, resolution, cap_style, join_style,
                      mitre_limit)

+    def types(self):


In the current GeoSeries, this is called geom_type

jorisvandenbossche · 2017-08-04T13:39:12Z

geopandas/vectorized.pyx

+        x = vec_type(self.data)
+
+        types = GEOMETRY_TYPES[:]
+        x[x == 255] = len(types)


Is 255 returned by GEOSGeomTypeId to indicate missing? (not included in the types)

If so, I think you need to set this to -1 (this is used as the missing value indicator in codes):

In [36]: pd.Categorical.from_codes([0,1,2], ['a', 'b'])
...
ValueError: codes need to be between -1 and len(categories)-1

In [37]: pd.Categorical.from_codes([0,1,-1], ['a', 'b'])
Out[37]:
[a, b, NaN]
Categories (2, object): [a, b]
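
A sketch of mapping the sentinel to -1 before building the categorical. It assumes that 255 marks a missing geometry (the open question above), that the type ids arrive as an unsigned 8-bit array, and that the category list follows the GEOS type-id order; `type_ids_to_categorical` is a hypothetical name. Note the cast to a signed dtype first, since -1 cannot be stored in a uint8 array.

```python
import numpy as np
import pandas as pd

# Assumed to follow the GEOS geometry type-id order (0 = Point, ...).
GEOMETRY_TYPES = ["Point", "LineString", "LinearRing", "Polygon",
                  "MultiPoint", "MultiLineString", "MultiPolygon",
                  "GeometryCollection"]
MISSING_ID = 255  # assumption: sentinel returned for a missing geometry


def type_ids_to_categorical(type_ids):
    # astype() returns a signed copy, so the input array is left untouched.
    codes = np.asarray(type_ids).astype(np.int16)
    codes[codes == MISSING_ID] = -1  # -1 is pandas' code for "missing"
    return pd.Categorical.from_codes(codes, categories=GEOMETRY_TYPES)


cat = type_ids_to_categorical(np.array([0, 3, 255], dtype=np.uint8))
```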

jorisvandenbossche · 2017-08-04T15:36:14Z

I merged pandas-dev/pandas#17143, so the (non-truncated) repr should now work when using pandas master

mrocklin · 2017-08-04T22:42:11Z

The old sjoin algorithm depends on pd.merge working well, which it currently doesn't. For now I've decided to keep the new sjoin algorithm in this PR under a new_sjoin function. I've expanded this to support the how= keyword. Doing this while using only supported pandas operations was an interesting challenge, but it seems to work out.

jorisvandenbossche · 2017-08-04T22:43:48Z

FYI, I am just working on cleaning this up and making a separate branch for this work.

mrocklin · 2017-08-04T22:46:24Z

OK. I'll hold off for a bit. Thank you for doing the organization here.

jorisvandenbossche · 2017-08-05T00:16:31Z

Closing, this is merged in #472, and follow-up issue in #473. For the sjoin functionality, a new PR can be opened.

eriknw and others added 5 commits July 15, 2017 17:42

WIP: Allow cythonizing or compiling geoseries functions if possible.

bbda67c

add naive series_op function and import to base

bce0634

add joris' contains solution

4dc4ad6

extend cython series op to more operations

66bbfaa

revert changes to series_op_slow

b9d9060

mrocklin force-pushed the cythonize branch from 3573547 to b9d9060 July 17, 2017 14:52

mrocklin commented Jul 17, 2017

make install in .travis.yml

bf4e9d3

use __geom__ rather than _geom

e64d5c1

as suggested by martindurant

handle Nones in cython code

7bde9d7

add python-dev to apt-get in travis.yml

9e2bf0f

mrocklin added 2 commits July 19, 2017 12:04

support equals and other unprepared geometry operations

81f8485

release gil when calling GEOS operations

ba80ba8

avoid calling func on nulls

9db10b9

martindurant reviewed Aug 3, 2017

add all calls around lists

31d33a1

jorisvandenbossche mentioned this pull request Aug 3, 2017

REF: repr - allow block to override values that get formatted pandas-dev/pandas#17143

Merged

mrocklin added 2 commits August 3, 2017 15:19

Move new sjoin function to new function

c181f6f

Don't coerce to GeoSeries in apply

f503c95

mrocklin added 4 commits August 3, 2017 16:08

Remove x, y, rpredicate methods

9d275bc

add types to GeometryArray

4917966

add install command to Makefile

4efe28f

Construct block manager even if no geometry columns

e0932d4

jdmcbr mentioned this pull request Aug 4, 2017

RLS: geopandas 0.3 release #470

Closed

jorisvandenbossche reviewed Aug 4, 2017

mrocklin added 2 commits August 4, 2017 08:18

types -> geom_type

4165d25

connect geom_type to base

b34bf65

support how= parameter in new sjoin

aedb1d0

geom_type returns a string

e605935

This follows shapely behavior

This was referenced Aug 5, 2017

Refactor: cythonize geometry series operations #472

Merged

Follow-up - Refactor cythonize geometry series operations #473

Closed

jorisvandenbossche closed this Aug 5, 2017

jorisvandenbossche added this to the 1.0 milestone Aug 5, 2017

jorisvandenbossche added the geopandas-cython label May 14, 2019
