10000 update documentation to reflect fit, transform, and predict parameter… by shanglun · Pull Request #7156 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

update documentation to reflect fit, transform, and predict parameter… #7156

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from

Conversation

shanglun
Copy link
@shanglun shanglun commented Aug 7, 2016

Reference Issue

This PR is a response to Scikit-learn issue #7142 which is a documentation addition for "rolling your own estimator"

What does this implement/fix? Explain your changes.

This PR adds two additional paragraphs detailing conventions for additional parameters to pass to the fit, predict, and transform functions of the custom estimator.

Any other comments?

If you'd like me to change the wording or add/remove sections, please feel free to reach out.

… passing rules


This change is Reviewable

should be restricted to variables that have the shape ``n_samples``.
This allows for all relevant parameters to be sliced cleanly during
cross-validation. All other parameters should be passed to
``__init__`` or ``set_params``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its more clear to say "All parameters should be passed into init or set after construction with set_params

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like parameters that affect the behavior of fit should always go into __init__... this also serves to maintain compatibility with things like GridSearchCV

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "init" without the surrounding underscores clearer, do you think? Elsewhere in the documentation, the init function is referred to using the underscore notation. I do agree with the wording of set_params. I will make the change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, sorry I didn't put the underscores with init in a code block and it bolded. I didn't mean to suggest a change to remove underscores, that's good as it is

Copy link
Author
@shanglun shanglun Aug 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parameters that affect behavior of fit does should always go into init. The idea with the fit parameters is to accommodate variables like sample_weights, which need to sliced up along with the X and Y during cross-validation. That should be clear in the documentation - I'll make the change.
Edit: Nice catch on the init function there, @nelson-liu

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, make sure to use underscores and not bold init ;) (__init__)

cross-validation. All other parameters should be passed to
``__init__`` or ``set_params``.
should be restricted to variables that need to be sliced during
cross-validation. Consequently, these variables would be either arrays with shape=[N] or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think "Consequently" fits here, you can just omit it.

@@ -952,6 +952,22 @@ The easiest and recommended way to accomplish this is to
All logic behind estimator parameters,
like translating string arguments into functions, should be done in ``fit``.

Parameters can be passed to the ``fit`` method. Generally, these parameters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say "why it is possible to pass parameter to the fit method, these should be ``` and also explain why (pipelining, grid-search). Also, the shape can be anything as long as the first dimension in n_samples.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! The change has been made. Please review when you have a chance.

should be restricted to variables that need to be sliced during
cross-validation. These variables would be either arrays with shape=[N] or
array-like with shape=[N,] where N is the number of samples. An example of
cross-validation. These variables would be either arrays with shape=[N] where N=n_samples
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we only require shape[0] == n_samples. Why not just These would be arrays withshape[0] == n_samples``?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are many ways to express the same idea here. I like the
method you're suggesting as it's very clear and concise. I'll make this
change today.

On Sep 6, 2016 3:03 PM, "Andreas Mueller" notifications@github.com wrote:

In doc/developers/contributing.rst
#7156 (comment)
:

should be restricted to variables that need to be sliced during
-cross-validation. These variables would be either arrays with shape=[N] or
-array-like with shape=[N,] where N is the number of samples. An example of
+cross-validation. These variables would be either arrays with shape=[N] where N=n_samples

I think we only require shape[0] == n_samples. Why not just These would
be arrays withshape[0] == n_samples``?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/scikit-learn/scikit-learn/pull/7156/files/43fdd093b0bf14a22d7371334503949d61dfb491..01ccec3d297b421d86876d15f35df65aef11b511#r77698216,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AKNQ_CnJUt4oMYWJLxQR-DtpLDO9IfjDks5qnbkTgaJpZM4JeZ4l
.

@jnothman
Copy link
Member
jnothman commented Sep 8, 2016

I'm sure this all needs a rewrite to some extent, but this is covered, at least with respect to fit above in the "Fitting" section. Perhaps it should point back? Or some other amalgamation?

@shanglun
Copy link
Author
shanglun commented Sep 8, 2016

@jnothman Definitely makes sense to point back to related part of the doc. I'll look at it and figure out the best wording. I believe the ticket was created because the shape[0]==n_samples requirement was not explicitly mentioned in the rolling your own estimator section and created some confusion.

Unrelated, but the AppVeyor and Travis CI builds are failing. Checked my PR a few times and don't think they should cause the errors that the output are showing. Went into doc and tried make html and it works currently. Any ideas what I can do to make the builds pass?

@maniteja123
Copy link
Contributor

Hi, it seems that the test failure is unrelated to the changes in this PR. I suppose the master was failing and it is now fixed in 62ad5fe.

@shanglun
Copy link
Author

I pushed another change in the documentation directory. It appears the travis-ci tests are still failing. Is there something I can change to help this CI process go through correctly?

@nelson-liu
Copy link
Contributor

looking at it again, seems like the flake8 diff script got a hold of the wrong ancestor, triggering it to diff every file touched for around 19k commits..

@amueller
Copy link
Member

please don't merge the master branch. that really messes up history. You can rebase your branch on top of master if you like, but it's usually not necessary.

While it is possible to pass parameters into the ``fit`` method, these
should be restricted to variables that need to be sliced during
cross-validation. These variables need to be arrays with shape[0]==n_samples (see
:ref:`fitting` above). An example of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

67ED

weird linebreak.

Copy link
Member
@amueller amueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good apart from my comment.

``__init__`` or set after construction using ``set_params``.

As a general rule, ``transform`` and ``predict`` should not take additional
parameters. In cases where additional parameters would be useful, e.g.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

< A3DB /form>

I don't think this is a good explanation of what we want. I would say "In case parameters need to be changed after the call to __init__, e.g. when setting a threshold value for feature selection, this can be done by setting the attribute directly on the estimator or calling set_parms.

@dalmia
Copy link
Contributor
dalmia commented Nov 8, 2016

This seems inactive now. Should I create a fresh pull request incorporating all the changes discussed so far in the thread?

@shanglun
Copy link
Author
shanglun commented Nov 8, 2016

Didn't realize there was another suggestion to implement. I'll issue a PR
today

On Nov 8, 2016 1:06 PM, "Aman Dalmia" notifications@github.com wrote:

This seems inactive now. Should I create a fresh pull request
incorporating all the changes discussed so far in the thread?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7156 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AKNQ_BDC6_0P4w2vkwF5EfNJRManF4yhks5q8LozgaJpZM4JeZ4l
.

@dalmia
Copy link
Contributor
dalmia commented Nov 8, 2016

Hi @shanglun, I don't suppose you need to create a new PR. Just commit the required changes. That should work.

@shanglun
Copy link
Author
shanglun commented Nov 8, 2016

Right, my English :p

On Nov 8, 2016 1:15 PM, "Aman Dalmia" notifications@github.com wrote:

Hi @shanglun https://github.com/shanglun, I don't suppose you need to
create a new PR. Just commit the required changes. That should work.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7156 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AKNQ_DYkrdNcTdt3wxpFBWQCRnE-PZsEks5q8LwlgaJpZM4JeZ4l
.

@shanglun
Copy link
Author
shanglun commented Nov 9, 2016

Hmm, I'm getting this error when trying to build the documentation

Exception occurred: File "C:\Users\shang\Anaconda3\lib\zipfile.py", line 1433, in write st = os.stat(filename) FileNotFoundError: [WinError 2] The system cannot find the file specified: 'auto_examples\\tree\\unveil_tree_structure.ipynb' The full traceback has been saved in C:\Users\shang\AppData\Local\Temp\sphinx-err-ueft6pnv.log, if you want to report the issue to the developers. Please also report this if it was a user error, so that a better error message can be provided next time. A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks!
Any ideas on how to fix this? (besides getting a Linux machine, heh)

@dalmia
Copy link
Contributor
dalmia commented Nov 9, 2016

Seems like an issue with the file path you specified. This might help you.

@shanglun
Copy link
Author
shanglun commented Nov 9, 2016

Yeah, figured that. Looks like the backslashes are in fact escaped, and
previous makes have succeeded. Git status doesn't think anything is missing.

On Nov 9, 2016 12:47 AM, "Aman Dalmia" notifications@github.com wrote:

Seems like an issue with the file path you specified. This
http://stackoverflow.com/questions/19767179/python-file-not-found-error
might help you.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7156 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AKNQ_Fh-PNHENOJyxtq_S__mchbOlKlUks5q8V5hgaJpZM4JeZ4l
.

@shanglun
Copy link
Author

I'm having trouble building the documentation, looks like there is a plot_unveil_tree_structure.py but no unveil_tree_structure.ipynb

I thought maybe this is just an issue with my branch, but when I clone a fresh copy of the master branch and try to build the docs, I still get errors.

Are others able to build documentation without problems?

@GaelVaroquaux
Copy link
Member
GaelVaroquaux commented Nov 14, 2016 via email

@shanglun
Copy link
Author

That build the documentation page in question. Thank you for the help. For future reference, what specifically caused the error? Was it a permissions issue?

As a general rule, ``transform`` and ``predict`` should not take additional
parameters. In cases where additional parameters would be useful, e.g. when
setting a threshold value for feature selection, the custom parameters
should be passed to ``__init__``. In case parameters need to be changed after the call to __init__, e.g. when setting a threshold value for feature selection, this can be done by setting the attribute directly on the estimator or calling set_parms.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please format so that each line is <80 characters.

@@ -974,6 +976,21 @@ The easiest and recommended way to accomplish this is to
**not do any parameter validation in** ``__init__``.
All logic behind estimator parameters,
like translating string arguments into functions, should be done in ``fit``.
While it is possible to pass parameters into the ``fit`` method, these
should be restricted to variables that need to be sliced during
cross-validation. These variables need to be arrays with shape[0]==n_samples (see
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please double backticks around code.

While it is possible to pass parameters into the ``fit`` method, these
should be restricted to variables that need to be sliced during
cross-validation. These variables need to be arrays with shape[0]==n_samples (see
:ref:`fitting` above). An example of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seem to fit on a single line.

@@ -798,6 +798,8 @@ The reason for postponing the validation is that the same validation
would have to be performed in ``set_params``,
which is used in algorithms like ``GridSearchCV``.

.. _fitting:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this used anywhere?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is linked to in a previous section - line 982.

@cmarmo
Copy link
Contributor
cmarmo commented Jul 9, 2020

@shanglun , @amueller, the documentatition about "Rolling your own estimation" has been moved to its own section (and file). Are this PR and the correspondent issue (#7142) still relevant? Thanks!

@shanglun
Copy link
Author
shanglun commented Jul 9, 2020 via email

@cmarmo
Copy link
Contributor
cmarmo commented Jul 10, 2020

Thanks @shanglun (Sean), this was your first pull request to scikit-learn, though and this is not a nice way to end it... :(
Please, feel free to pick a new issue and we will try to do better. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants
0