8000 DOC Adds more docstring standards by thomasjpfan · Pull Request #14744 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Conversation

@thomasjpfan
Copy link
Member
@thomasjpfan thomasjpfan commented Aug 23, 2019

Reference Issues/PRs

Resolves #14664

What does this implement/fix? Explain your changes.

  1. Details what to do when default=None.
  2. Details how to use types with list and ndarrays.

CC: @glemaitre @NicolasHug

Copy link
Member
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cosmetic

4. 1D or 2D data can be a subset of
``{array-like, ndarray, sparse matrix, dataframe}``. Note that ``array-like``
can also be a ``list``, while ``ndarray`` is explicitly only a ``numpy.ndarray``.
5. When specifying the data type of an list``, use ``of`` as a delimiter:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
5. When specifying the data type of an list``, use ``of`` as a delimiter:
5. When specifying the data type of a list, use ``of`` as a delimiter:

Copy link
Member
@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments otherwise LGTM.

of the mentioned shapes. The default value is
`np.ones(shape=(n_samples,))`.

list_param : list of int of shape (n_classes,)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for builtin types it would have been nice to the typing system (i.e. List[int]. Basically we are re-inventing typing semantics here.

But I guess it's will be a bit more work to adopt and people not familiar with it might find it confusing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using python typing syntax in the docstring itself is kind of nice.

We can have stuff like this:

sample_weight : ndarray[shape=(n_features,)], default=None

I am a bit use to parsing code to know if this is "human readable".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sample_weight : ndarray[shape=(n_features,)], default=None

The problem is that as far as I know it's still no standard way of representing ndarray shape with types numpy/numpy#7370 (comment)

Also readability can indeed be challenging..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If https://www.python.org/dev/peps/pep-0593/ happens, maybe we can have:

sample_weight : Annotated[ndarray, Shape('n_features',), DType(np.int32)]

(I have mixed feelings about this one)

can also be a ``list``, while ``ndarray`` is explicitly only a ``numpy.ndarray``.
5. When specifying the data type of an list``, use ``of`` as a delimiter:
``list of int``.
6. When specifying the dtype of an ndarray, use ``dtype=int`` after
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
6. When specifying the dtype of an ndarray, use ``dtype=int`` after
6. When specifying the dtype of an ndarray, use e.g. ``dtype=np.int32` after

as int is not a native numpy dtype and will be cast to np.int_ which is a long integer. So that makes it a somewhat confusing example.

5. When specifying the data type of an list``, use ``of`` as a delimiter:
``list of int``.
6. When specifying the dtype of an ndarray, use ``dtype=int`` after
defining the shape: ``ndarray of shape (n_samples,), dtype=int``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
defining the shape: ``ndarray of shape (n_samples,), dtype=int``.
defining the shape: ``ndarray of shape (n_samples,), dtype=np.int32``.

Copy link
Member
@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we didn't want to change what we had decided too much.


list_param : list of int

typed_ndarray : ndarray of shape (n_samples,), dtype=np.int32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when did we decide to put the dtype at the end, instead of np.int32 ndarray of shape (n_samples,)? This also doesn't play well with the other possible input types, if present.

Copy link
Member Author
@thomasjpfan thomasjpfan Aug 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have not decided where to put the dtype of an array. This PR is trying to figure that out

Copy link
Member Author
@thomasjpfan thomasjpfan Aug 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in this case are we going with:

Option 1

list_param : int list

tuple_of_list : (int, int) list

mixed_tuple_of_list : {(int,), (int,int)} list

Option 2

list_param : list of int

tuple_of_list : list of (int, int)

mixed_tuple_of_list : list of {(int,), (int,int)} 

If the type goes before the element, then it should be option 1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I prefer option 1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@glemaitre WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm +1 for option 2. It's really not clear to me that int list is a list of int as it's contrary to typing conventions I am aware of , I would probably read that as int or list.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I am starting prefer option 2. For the complicated case of mixed_tuple_of_list:

mixed_tuple_of_list: list of (int,) or list of (int, int)

For array-like:

d : array-like of int

For the various combinations of ndarrays:

b : ndarray of shape (n_features,)

c : ndarray of int

I am on the fence about the following:

a : ndarray of shape (n_features,) dtype int

I am hoping this is read as "ndarray of shape (n_features,) and of dtype int."

2. Use parenthesis for defining shapes: ``array-like of shape (n_samples,)``
or ``array-like of shape (n_samples, n_features)``
2. Use parenthesis for defining shapes: ``array-like, shape=(n_samples,)``
or ``array-like, shape=(n_samples, n_features)``
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I missed the point where we decided to go back to using this syntax, where did that happen?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was suppose to be reverted, we are sticking with the of shape syntax

Copy link
Member
@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I can update the other PR fixing the docstrings once this one is merged.

Copy link
Member
@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do feel that type annotations are the final solution to this (if only in ensuring that documentation matches code behavior), and I wouldn't put too much effort in making sure that all docstrings follow this new conventions. Though people are welcome to try.

In any case, it is an improvement and I think we can merge this and iterate on it later if needed.

@
9E88
rth
Copy link
Member
rth commented Sep 13, 2019

Merging, all comments were addressed as far as I can tell.

@rth rth changed the title [MRG] DOC Adds more docstring standards DOC Adds more docstring standards Sep 13, 2019
@rth rth merged commit d18cc29 into scikit-learn:master Sep 13, 2019
@thomasjpfan
Copy link
Member Author

I do feel that type annotations are the final solution to this

If the type annotation system was sufficient in describe our types, I would push heavily for this. I may be missing something regarding type annotations... Do you see a python type annotation way to say "ndarray of shape (n_samples, n_features), dtype=np.float32"?

@rth
Copy link
Member
rth commented Sep 13, 2019

If the type annotation system was sufficient in describe our types, I would push heavily for this.

Yeah, how to represent shapes with types is an open question (and if it even makes sense to do so),

For the rest you can very well do something like Array2[np.float64] or Array[Ndim2, np.float64]. But it's doesn't make much sense to do so, until it is normalized on the numpy side.

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Sep 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add more dtype details to docstring standard doc

4 participants

0