8000 [MRG+2] Rewrite setup.py files to handle cython dependencies by lesteve · Pull Request #7719 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG+2] Rewrite setup.py files to handle cython dependencies #7719

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Nov 2, 2016

Conversation

lesteve
Copy link
Member
@lesteve lesteve commented Oct 21, 2016

Fixes #7094 and maybe related issues if there are other. Closes #7708.

What does this implement/fix? Explain your changes.

Basically build_tools/cythonize.py doesn't take into account cython files dependencies. For example if you change _criterion.pxd the dependents .pyx don't get recompiled, potentially yielding to confusing segmentation faults.

The thing is that cython takes into acount correctly if you write your setup.py carefull (mostly by using Cython.Build.cythonize and switching sources between .c and .pyx as appropriate). Compared to #7708 which is a work-around this PR is trying to tackle the root problem.

Upsides:

  • make in in is almost instant if you do not change cython files. If you do change cython files, only affected cython files are generated. As as developer this cuts down the time of compiling scikit-learn by a whole lot.
  • dependency taken correctly into account. Try touching sklearn/tree/_criterion.pxd, and you'll see that all the sklearn/tree cython files gets recompiled.

Downsides:

This is WIP because I haven't disabled caching on Travis yet. Just wanted to show the changes in setup.py files and get feed-back for now.

ping @raghavrv, @jnothman, @amueller.

  • pandas changes its sdist to cythonize all the files, maybe we need to do something like this depending how we release. I need to double-check this. Edit: sdist seems to work fine, pandas approach is a bit different where they have custom classes they pass to cmdclass in setup.
  • try with older versions of cython. pandas has a > 0.19.1 requirement, not sure why. Edit: Cython >= 0.23 is mentioned in the doc/developers/advanced_installation.rst and I checked it does compile fine with 0.23 and the tests run fine on my machine.

< 8000 svg aria-label="Loading..." style="box-sizing: content-box; color: var(--color-icon-primary);" width="32" height="32" viewBox="0 0 16 16" fill="none" role="img" data-view-component="true" class="anim-rotate">

@raghavrv
Copy link
Member

+10000 Thanks heaps for working on this...

@lesteve lesteve force-pushed the rewrite-setup-cythonize branch from 2f13e88 to ed6b671 Compare October 21, 2016 11:35
Copy link
Member Author
@lesteve lesteve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few inline comments to help the review

is_cpp_filename = filename.endswith('.cpp')

# files in src are .c and .cpp files that are not cython-generated
if 'src/' in filename and (is_c_filename or is_cpp_filename):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the most brittle part of the PR I think. I don't know whether there is a better way of doing it, suggestions welcome.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess one possible improvement would be to add an additional argument in add_cython_extension like cython_generated_sources so that we can discriminate more easily betwen the .c files coming from cython and the ones in the repo.

sources = [get_cython_source(filename) for filename in sources]
config.add_extension(name, sources, **kwargs)

config.ext_modules = cythonize(config.ext_modules)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm I need to double-check this, maybe it should be inside an if is_dev_version

raise ValueError('Please install cython in order '
'to build a scikit-learn development version')

sources = [get_cython_source(filename) for filename in sources]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the part that switches between using .c{,pp} files when building a release and the .pyx when building a dev version.

By using Cython.Build.cythonize and switching between .c and .pyx files
as appropriate cython dependencies are correctly taken into account.
@lesteve lesteve force-pushed the rewrite-setup-cythonize branch from ed6b671 to 1eade7d Compare October 21, 2016 11:49
@raghavrv
Copy link
Member

Just to note that sometimes there are inter module dependencies that needs to be recythonized too... For instance any change in the tree's cython sources needs a recythonizing of the gradient boosting... Does this PR handle that?

@lesteve
Copy link
Member Author
lesteve commented Oct 21, 2016

Just to note that sometimes there are inter module dependencies that needs to be recythonized too... For instance any change in the tree's cython sources needs a recythonizing of the gradient boosting... Does this PR handle that?

It seems like it does:

make in
touch sklearn/tree/_criterion.pxd
make in

Excerpt of the output:

Compiling sklearn/ensemble/_gradient_boosting.pyx because it depends on ./sklearn/tree/_criterion.pxd.
Compiling sklearn/tree/_tree.pyx because it depends on ./sklearn/tree/_criterion.pxd.
Compiling sklearn/tree/_splitter.pyx because it depends on ./sklearn/tree/_criterion.pxd.
Compiling sklearn/tree/_criterion.pyx because it depends on sklearn/tree/_criterion.pxd.
Compiling sklearn/tree/_utils.pyx because it depends on ./sklearn/tree/_criterion.pxd.
[1/5] Cythonizing sklearn/tree/_criterion.pyx
[2/5] Cythonizing sklearn/tree/_splitter.pyx
[3/5] Cythonizing sklearn/ensemble/_gradient_boosting.pyx
[4/5] Cythonizing sklearn/tree/_tree.pyx
[5/5] Cythonizing sklearn/tree/_utils.pyx

@raghavrv
Copy link
Member
raghavrv commented Oct 21, 2016

Amazing *_*

@TomDLT
Copy link
Member
TomDLT commented Oct 21, 2016

This is nice !

Copy link
Member Author
@lesteve lesteve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few of my comments that got folded in outdated diffs.

is_cpp_filename = filename.endswith('.cpp')

# files in src are .c and .cpp files that are not cython-generated
if ('src' + os.sep) in filename and (is_c_filename or is_cpp_filename):
Copy link
Member Author
@lesteve lesteve Oct 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the most brittle part of the PR I think. I don't know whether 8000 there is a better way of doing it, suggestions welcome.

I guess one possible improvement would be to add an additional argument in add_cython_extension like cython_generated_sources so that we can discriminate more easily betwen the .c files coming from cython and the ones in the repo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine I think. Optionally you could just hardcode a list with all the non cython source paths...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about checking if there exists a pyx file with the same name in the same folder?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

8000
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a simple way of doing it indeed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we could add .pyx extension when it's a cython file, and add a function like:

import os.path

def no_cythonize(extensions, **_ignore):
    for extension in extensions:
        sources = []
        for sfile in extension.sources:
            path, ext = os.path.splitext(sfile)
            if ext in ('.pyx', '.py'):
                if extension.language == 'c++':
                    ext = '.cpp'
                else:
                    ext = '.c'
                sfile = path + ext
            sources.append(sfile)
        extension.sources[:] = sources
    return extensions

(copy-paste from cython doc)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make more sense this way around indeed. The .pyx would be mentioned in the setup.py and you would do the switching to .c or .cpp only for releases. I have to say I didn't fully understand the no_cythonize function when I read the doc but it makes more sense now (in particular the naming is confusing because it's not really related to Cython.Build.cythonize).

is_dev_version = not os.path.exists(os.path.join(top_path, 'PKG-INFO'))

if is_dev_version:
sources = [get_cython_source(filename) for filename in sources]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the part that switches between using .c{,pp} files when building a release and the .pyx when building a dev version.

Copy link
Member
@amueller amueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good if it works as expected. Also, I feel like this shouldn't live in scikit-learn alone. I feel like at least numpy, pandas and scikit-image need the same. Shouldn't we share a solution?

is_cpp_filename = filename.endswith('.cpp')

# files in src are .c and .cpp files that are not cython-generated
if ('src' + os.sep) in filename and (is_c_filename or is_cpp_filename):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about checking if there exists a pyx file with the same name in the same folder?

raise ValueError('Please install cython in order '
'to build a scikit-learn development version')

config.ext_modules = cythonize(config.ext_modules)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that takes care of the caching?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From https://cython.readthedocs.io/en/latest/src/reference/compilation.html#compiling-with-distutils:

The cythonize command also allows for multi-threaded compilation and dependency resolution. Recompilation will be skipped if the target file is up to date with its main source file and dependencies.

@lesteve
Copy link
Member Author
lesteve commented Oct 24, 2016

This looks pretty good if it works as expected.

It does work as expected in the few tests I run like touching sklearn/tree/_criterion.pxd and making sure that all the dependents are recompiled (see #7719 (comment)). If you have more complicated scenarios that could go wrong I'd be happy to try them out.

Also, I feel like this shouldn't live in scikit-learn alone. I feel like at least numpy, pandas and scikit-image need the same. Shouldn't we share a solution?

Maybe we could have this in scikit-learn for some time so that it gets stress-tested a bit before consolidating across multiple projects?

@amueller
Copy link
Member

Maybe we could have this in scikit-learn for some time so that it gets stress-tested a bit before consolidating across multiple projects?

Yes, but we could also try reaching out to others and see what their approaches and requirements are?

@jnothman
Copy link
Member

Why WIP?

@raghavrv
Copy link
Member

Remove build caching from Travis

+:100:

@lesteve
Copy link
Member Author
lesteve commented Oct 25, 2016

As I said in the issue description:

This is WIP because I haven't disabled caching on Travis yet. Just wanted to show the changes in setup.py files and get feed-back for now.

No one complained about the Travis cache being useless as a side effect of this PR so I just pushed a commit that removes the build products from theTravis cache and changed the status to MRG.

Cython dependencies are taken care of by Cython.Build.cythonize and
based on file timestamps, so .C and .so files will always be rebuild
from scratch on each build in Travis.
@lesteve lesteve force-pushed the rewrite-setup-cythonize branch from 23ee373 to 7c731a6 Compare October 25, 2016 14:29
More natural this way. Tweak the extensions to generate from .c and .cpp
files for a release.
@lesteve lesteve force-pushed the rewrite-setup-cythonize branch from 043f6d6 to 69c5ae1 Compare October 25, 2016 16:09
@lesteve lesteve changed the title [WIP] Rewrite setup.py files to handle cython dependencies [MRG] Rewrite setup.py files to handle cython dependencies Oct 25, 2016
@GaelVaroquaux
Copy link
Member
Also, I feel like this shouldn't live in scikit-learn alone. I feel like at
least numpy, pandas and scikit-image need the same. Shouldn't we share a
solution?

Maybe we could have this in scikit-learn for some time so that it gets
stress-tested a bit before consolidating across multiple projects?

+1. And then contribute it to Cython, IMHO.

Copy link
Member
@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@jnothman jnothman changed the title [MRG] Rewrite setup.py files to handle cython dependencies [MRG+1] Rewrite setup.py files to handle cython dependencies Nov 1, 2016
Copy link
Member
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not tested that this works. Otherwise, it looks good to me...

@@ -1,21 +1,18 @@
import numpy



Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove added newline

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly flake8_diff.sh is not able to spot these. Quickly looking at it this is because we do the diff without context (--unified=0) in order to avoid noise coming from lines that the PR hasn't touched.

try:
from Cython.Build import cythonize
except ImportError:
raise ValueError('Please install cython in order '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we keep having issues related to cython versioning, I think we should be checking for or specifying version.

extension.sources = sources


def tweak_extensions(top_path, config):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call this something less opaque: conditional_cythonize or cythonize_if_development?

@@ -33,3 +35,38 @@ def atlas_not_found(blas_info_):
cblas_libs = blas_info.pop('libraries', [])

return cblas_libs, blas_info


def tweak_extensions_to_build_from_c_and_cpp_files(extensions):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just rename_cython_extensions?

@lesteve
Copy link
Member Author
lesteve commented Nov 2, 2016

@jnothman I have tackled your comments.

else:
message = ('Please install cython with a version >= 0.23'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should really pull this min version out as a constant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, done.

@jnothman
Copy link
Member
jnothman commented Nov 2, 2016

Otherwise LGTM.

@jnothman jnothman changed the title [MRG+1] Rewrite setup.py files to handle cython dependencies [MRG+2] Rewrite setup.py files to handle cython dependencies Nov 2, 2016
@jnothman
Copy link
Member
jnothman commented Nov 2, 2016

You can merge after CI approves. thanks!

@jnothman jnothman merged commit 6a2d8d5 into scikit-learn:master Nov 2, 2016
@raghavrv
Copy link
Member
raghavrv commented Nov 2, 2016

Thank a lot for this @lesteve!

@lesteve lesteve deleted the rewrite-setup-cythonize branch November 2, 2016 12:08
@jnothman
Copy link
Member
jnothman commented Nov 2, 2016

Is there a benefit to having this in 0.18.1?

On 2 November 2016 at 23:00, Raghav RV notifications@github.com wrote:

Thank a lot for this @lesteve https://github.com/lesteve!


You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub
#7719 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz69hzUbKLEXZENHwRSUVnqSZEiufFks5q6HtfgaJpZM4KdIXJ
.

@raghavrv
Copy link
Member
raghavrv commented Nov 2, 2016

Having this included in the backports will help avoid travis cache based error in the backport PR #7672 :D

@jnothman jnothman added this to the 0.18.1 milestone Nov 2, 2016
@lesteve
Copy link
Member Author
lesteve commented Nov 2, 2016

Thank a lot for this @lesteve!

@raghavrv your work-around PR motivated me to find a better solution, so you kind of initiated this ;-) !

amueller pushed a commit to amueller/scikit-learn that referenced this pull request Nov 9, 2016
…cikit-learn#7719)

* Rewriting of cythonization in setup.py

By using Cython.Build.cythonize and switching between .c and .pyx files
as appropriate cython dependencies are correctly taken into account.

* Use cythonize once on the root config rather than in each subpackage

* Fix for Windows

* Remove caching from Travis

Cython dependencies are taken care of by Cython.Build.cythonize and
based on file timestamps, so .C and .so files will always be rebuild
from scratch on each build in Travis.

* Specify .pyx in setup.files for cython generated extensions

More natural this way. Tweak the extensions to generate from .c and .cpp
files for a release.

* COSMIT Remove commented out code

* Check cython version is greater than 0.23

* COSMIT better names for functions

* flake8 fix (imported module not at top of file)

* Install cython 0.23 for Python 2.6

now that cython >= 0.23 requirement is enforced in setup.py

* Use module constant for minimum required cython version

* Fix Travis install.sh

No easy way to put comments inside multi-line command
@TomDLT
Copy link
Member
TomDLT commented Nov 24, 2016

Really enjoying this change, compiling is very much faster !

sergeyf pushed a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
…cikit-learn#7719)

* Rewriting of cythonization in setup.py

By using Cython.Build.cythonize and switching between .c and .pyx files
as appropriate cython dependencies are correctly taken into account.

* Use cythonize once on the root config rather than in each subpackage

* Fix for Windows

* Remove caching from Travis

Cython dependencies are taken care of by Cython.Build.cythonize and
based on file timestamps, so .C and .so files will always be rebuild
from scratch on each build in Travis.

* Specify .pyx in setup.files for cython generated extensions

More natural this way. Tweak the extensions to generate from .c and .cpp
files for a release.

* COSMIT Remove commented out code

* Check cython version is greater than 0.23

* COSMIT better names for functions

* flake8 fix (imported module not at top of file)

* Install cython 0.23 for Python 2.6

now that cython >= 0.23 requirement is enforced in setup.py

* Use module constant for minimum required cython version

* Fix Travis install.sh

No easy way to put comments inside multi-line command
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
…cikit-learn#7719)

* Rewriting of cythonization in setup.py

By using Cython.Build.cythonize and switching between .c and .pyx files
as appropriate cython dependencies are correctly taken into account.

* Use cythonize once on the root config rather than in each subpackage

* Fix for Windows

* Remove caching from Travis

Cython dependencies are taken care of by Cython.Build.cythonize and
based on file timestamps, so .C and .so files will always be rebuild
from scratch on each build in Travis.

* Specify .pyx in setup.files for cython generated extensions

More natural this way. Tweak the extensions to generate from .c and .cpp
files for a release.

* COSMIT Remove commented out code

* Check cython version is greater than 0.23

* COSMIT better names for functions

* flake8 fix (imported module not at top of file)

* Install cython 0.23 for Python 2.6

now that cython >= 0.23 requirement is enforced in setup.py

* Use module constant for minimum required cython version

* Fix Travis install.sh

No easy way to put comments inside multi-line command
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
…cikit-learn#7719)

* Rewriting of cythonization in setup.py

By using Cython.Build.cythonize and switching between .c and .pyx files
as appropriate cython dependencies are correctly taken into account.

* Use cythonize once on the root config rather than in each subpackage

* Fix for Windows

* Remove caching from Travis

Cython dependencies are taken care of by Cython.Build.cythonize and
based on file timestamps, so .C and .so files will
BA14
 always be rebuild
from scratch on each build in Travis.

* Specify .pyx in setup.files for cython generated extensions

More natural this way. Tweak the extensions to generate from .c and .cpp
files for a release.

* COSMIT Remove commented out code

* Check cython version is greater than 0.23

* COSMIT better names for functions

* flake8 fix (imported module not at top of file)

* Install cython 0.23 for Python 2.6

now that cython >= 0.23 requirement is enforced in setup.py

* Use module constant for minimum required cython version

* Fix Travis install.sh

No easy way to put comments inside multi-line command
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
…cikit-learn#7719)

* Rewriting of cythonization in setup.py

By using Cython.Build.cythonize and switching between .c and .pyx files
as appropriate cython dependencies are correctly taken into account.

* Use cythonize once on the root config rather than in each subpackage

* Fix for Windows

* Remove caching from Travis

Cython dependencies are taken care of by Cython.Build.cythonize and
based on file timestamps, so .C and .so files will always be rebuild
from scratch on each build in Travis.

* Specify .pyx in setup.files for cython generated extensions

More natural this way. Tweak the extensions to generate from .c and .cpp
files for a release.

* COSMIT Remove commented out code

* Check cython version is greater than 0.23

* COSMIT better names for functions

* flake8 fix (imported module not at top of file)

* Install cython 0.23 for Python 2.6

now that cython >= 0.23 requirement is enforced in setup.py

* Use module constant for minimum required cython version

* Fix Travis install.sh

No easy way to put comments inside multi-line command
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tests failing on master due to tree recythonization issues
7 participants
0