From 43d2351541c59e627ff3c2dad37a56a4d4325bf6 Mon Sep 17 00:00:00 2001
From: Nicolas Hug
Date: Thu, 27 Dec 2018 10:37:43 -0500
Subject: [PATCH 1/6] Added tips for reading the code base

---
 doc/developers/tips.rst | 53 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/doc/developers/tips.rst b/doc/developers/tips.rst
index 64efec199f95f..af5615adef17a 100644
--- a/doc/developers/tips.rst
+++ b/doc/developers/tips.rst
@@ -256,3 +256,56 @@ give you clues as to the source of your memory error.
 For more information on valgrind and the array of options it has, see the
 tutorials and documentation on the `valgrind web site
 `_.
+
+Reading the existing code base
+==============================
+
+Reading and digesting an existing code base is always a difficult exercise that
+takes time and experience to master. Even though we try to write simple code in
+scikit-learn, understanding the code can seem overwhelming at first, given the
+sheer size of the project. Here is a list of tips that may help make this task
+easier and faster (in no particular order).
+
+- Get acquainted with the :ref:`api_overview`: understand what :term:`fit`,
+  :term:`predict`, :term:`transform`, etc. are used for.
+- Before diving into reading the code of a function / class, go through the
+  docstrings first and try to get an idea of what each parameter / attribute
+  is doing. It may also help to stop a minute and think *how would I do this
+  myself if I had to?*.
+- The trickiest thing is often to identify which portions of the code are
+  relevent, and which are not. In scikit-learn **a lot** of input checking
+  is performed, especially at the beginning of the :term:`fit` methods.
+  Sometimes, only a very small portion of the code is doing the actual job. For
+  example looking at the ``fit()`` method of
+  :class:`sklearn.linear_model.LinearRegression`, what you're looking for
+  might just be the call the ``scipy.linalg.lstsq``, but it is burried into
+  multiple lines of input checking and the handling of different kinds of
+  parameters.
+- Sometimes, reading the tests for a given function will give you an idea of
+  what is its intended purpose. You can use ``git grep`` (see below) to find
+  all the tests written for a function.
+- You'll often see code looking like this:
+  ``out = Parallel(...)(delayed(some_function)(param) for param in
+  some_iterable)``. This runs ``some_function`` in parallel using `Joblib
+  `_. ``out`` is then an iterable containing
+  the values returned by ``some_function`` for each call.
+- We use `Cython `_ to write fast code. Cython code is
+  located in ``.pyx`` and ``.pxd`` files. Cython code has a more C-like
+  flavor: we use pointers, perform manual memory allocation, use OUT
+  variables (variables whose value is changed after a function call, which
+  is frowned upon in pure Python but extremely common in C), etc. Having
+  some minimal experience in C / C++ is pretty much mandatory here.
+- Master your tools.
+
+  - With such a big project, being efficient with your favorite editor or
+    IDE goes a long way towards digesting the code base. Being able to quickly
+    jump (or *peek*) to a function/class/attribute definition helps a lot.
+    So does being able to quickly see where a given name is used in a file.
+  - `git `_ also has some built-in killer
+    features. It is often useful to understand how a file changed over time,
+    using e.g. ``git blame`` (`manual
+    `_). This can also be done directly
+    on GitHub. ``git grep`` (`examples
+    `_) is also extremely
+    useful to see every occurence of a pattern (e.g. a function call or a
+    variable) in the code base.

From d1c6e7b1cbccdff837de4dd5a617beb2c01db304 Mon Sep 17 00:00:00 2001
From: Nicolas Hug
Date: Thu, 27 Dec 2018 10:39:46 -0500
Subject: [PATCH 2/6] Put it in contributing.rst

---
 doc/developers/contributing.rst | 53 +++++++++++++++++++++++++++++++++
 doc/developers/tips.rst         | 53 ---------------------------------
 2 files changed, 53 insertions(+), 53 deletions(-)

diff --git a/doc/developers/contributing.rst b/doc/developers/contributing.rst
index a1a6068d53623..ab1bce69dfa30 100644
--- a/doc/developers/contributing.rst
+++ b/doc/developers/contributing.rst
@@ -1395,3 +1395,56 @@ that implement common linear model patterns.
 
 The :mod:`sklearn.utils.multiclass` module contains useful functions
 for working with multiclass and multilabel problems.
+
+Reading the existing code base
+==============================
+
+Reading and digesting an existing code base is always a difficult exercise that
+takes time and experience to master. Even though we try to write simple code in
+scikit-learn, understanding the code can seem overwhelming at first, given the
+sheer size of the project. Here is a list of tips that may help make this task
+easier and faster (in no particular order).
+
+- Get acquainted with the :ref:`api_overview`: understand what :term:`fit`,
+  :term:`predict`, :term:`transform`, etc. are used for.
+- Before diving into reading the code of a function / class, go through the
+  docstrings first and try to get an idea of what each parameter / attribute
+  is doing. It may also help to stop a minute and think *how would I do this
+  myself if I had to?*.
+- The trickiest thing is often to identify which portions of the code are
+  relevent, and which are not. In scikit-learn **a lot** of input checking
+  is performed, especially at the beginning of the :term:`fit` methods.
+  Sometimes, only a very small portion of the code is doing the actual job. For
+  example looking at the ``fit()`` method of
+  :class:`sklearn.linear_model.LinearRegression`, what you're looking for
+  might just be the call the ``scipy.linalg.lstsq``, but it is burried into
+  multiple lines of input checking and the handling of different kinds of
+  parameters.
+- Sometimes, reading the tests for a given function will give you an idea of
+  what is its intended purpose. You can use ``git grep`` (see below) to find
+  all the tests written for a function.
+- You'll often see code looking like this:
+  ``out = Parallel(...)(delayed(some_function)(param) for param in
+  some_iterable)``. This runs ``some_function`` in parallel using `Joblib
+  `_. ``out`` is then an iterable containing
+  the values returned by ``some_function`` for each call.
+- We use `Cython `_ to write fast code. Cython code is
+  located in ``.pyx`` and ``.pxd`` files. Cython code has a more C-like
+  flavor: we use pointers, perform manual memory allocation, use OUT
+  variables (variables whose value is changed after a function call, which
+  is frowned upon in pure Python but extremely common in C), etc. Having
+  some minimal experience in C / C++ is pretty much mandatory here.
+- Master your tools.
+
+  - With such a big project, being efficient with your favorite editor or
+    IDE goes a long way towards digesting the code base. Being able to quickly
+    jump (or *peek*) to a function/class/attribute definition helps a lot.
+    So does being able to quickly see where a given name is used in a file.
+  - `git `_ also has some built-in killer
+    features. It is often useful to understand how a file changed over time,
+    using e.g. ``git blame`` (`manual
+    `_). This can also be done directly
+    on GitHub. ``git grep`` (`examples
+    `_) is also extremely
+    useful to see every occurence of a pattern (e.g. a function call or a
+    variable) in the code base.
diff --git a/doc/developers/tips.rst b/doc/developers/tips.rst
index af5615adef17a..64efec199f95f 100644
--- a/doc/developers/tips.rst
+++ b/doc/developers/tips.rst
@@ -256,56 +256,3 @@ give you clues as to the source of your memory error.
 For more information on valgrind and the array of options it has, see the
 tutorials and documentation on the `valgrind web site
 `_.
-
-Reading the existing code base
-==============================
-
-Reading and digesting an existing code base is always a difficult exercise that
-takes time and experience to master. Even though we try to write simple code in
-scikit-learn, understanding the code can seem overwhelming at first, given the
-sheer size of the project. Here is a list of tips that may help make this task
-easier and faster (in no particular order).
-
-- Get acquainted with the :ref:`api_overview`: understand what :term:`fit`,
-  :term:`predict`, :term:`transform`, etc. are used for.
-- Before diving into reading the code of a function / class, go through the
-  docstrings first and try to get an idea of what each parameter / attribute
-  is doing. It may also help to stop a minute and think *how would I do this
-  myself if I had to?*.
-- The trickiest thing is often to identify which portions of the code are
-  relevent, and which are not. In scikit-learn **a lot** of input checking
-  is performed, especially at the beginning of the :term:`fit` methods.
-  Sometimes, only a very small portion of the code is doing the actual job. For
-  example looking at the ``fit()`` method of
-  :class:`sklearn.linear_model.LinearRegression`, what you're looking for
-  might just be the call the ``scipy.linalg.lstsq``, but it is burried into
-  multiple lines of input checking and the handling of different kinds of
-  parameters.
-- Sometimes, reading the tests for a given function will give you an idea of
-  what is its intended purpose. You can use ``git grep`` (see below) to find
-  all the tests written for a function.
-- You'll often see code looking like this:
-  ``out = Parallel(...)(delayed(some_function)(param) for param in
-  some_iterable)``. This runs ``some_function`` in parallel using `Joblib
-  `_. ``out`` is then an iterable containing
-  the values returned by ``some_function`` for each call.
-- We use `Cython `_ to write fast code. Cython code is
-  located in ``.pyx`` and ``.pxd`` files. Cython code has a more C-like
-  flavor: we use pointers, perform manual memory allocation, use OUT
-  variables (variables whose value is changed after a function call, which
-  is frowned upon in pure Python but extremely common in C), etc. Having
-  some minimal experience in C / C++ is pretty much mandatory here.
-- Master your tools.
-
-  - With such a big project, being efficient with your favorite editor or
-    IDE goes a long way towards digesting the code base. Being able to quickly
-    jump (or *peek*) to a function/class/attribute definition helps a lot.
-    So does being able to quickly see where a given name is used in a file.
-  - `git `_ also has some built-in killer
-    features. It is often useful to understand how a file changed over time,
-    using e.g. ``git blame`` (`manual
-    `_). This can also be done directly
-    on GitHub. ``git grep`` (`examples
-    `_) is also extremely
-    useful to see every occurence of a pattern (e.g. a function call or a
-    variable) in the code base.

From c0e934d4500015f3e0c8d5ed7baa0a998787b0c4 Mon Sep 17 00:00:00 2001
From: Nicolas Hug
Date: Thu, 27 Dec 2018 10:42:23 -0500
Subject: [PATCH 3/6] typos

---
 doc/developers/contributing.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/developers/contributing.rst b/doc/developers/contributing.rst
index ab1bce69dfa30..7ec960f5ab3ab 100644
--- a/doc/developers/contributing.rst
+++ b/doc/developers/contributing.rst
@@ -1412,12 +1412,12 @@ easier and faster (in no particular order).
   is doing. It may also help to stop a minute and think *how would I do this
   myself if I had to?*.
 - The trickiest thing is often to identify which portions of the code are
-  relevent, and which are not. In scikit-learn **a lot** of input checking
+  relevant, and which are not. In scikit-learn **a lot** of input checking
   is performed, especially at the beginning of the :term:`fit` methods.
   Sometimes, only a very small portion of the code is doing the actual job. For
   example looking at the ``fit()`` method of
   :class:`sklearn.linear_model.LinearRegression`, what you're looking for
-  might just be the call the ``scipy.linalg.lstsq``, but it is burried into
+  might just be the call the ``scipy.linalg.lstsq``, but it is buried into
   multiple lines of input checking and the handling of different kinds of
   parameters.
 - Sometimes, reading the tests for a given function will give you an idea of
@@ -1446,5 +1446,5 @@ easier and faster (in no particular order).
     `_). This can also be done directly
     on GitHub. ``git grep`` (`examples
     `_) is also extremely
-    useful to see every occurence of a pattern (e.g. a function call or a
+    useful to see every occurrence of a pattern (e.g. a function call or a
     variable) in the code base.

From d4a04fd3fcb9f847ca3b03e4c0a1179542c31efa Mon Sep 17 00:00:00 2001
From: Nicolas Hug
Date: Fri, 28 Dec 2018 12:17:24 -0500
Subject: [PATCH 4/6] Added bullet point about inheritance

---
 doc/developers/contributing.rst | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/doc/developers/contributing.rst b/doc/developers/contributing.rst
index 7ec960f5ab3ab..d4ab9c66deb0f 100644
--- a/doc/developers/contributing.rst
+++ b/doc/developers/contributing.rst
@@ -1420,6 +1420,12 @@ easier and faster (in no particular order).
   might just be the call the ``scipy.linalg.lstsq``, but it is buried into
   multiple lines of input checking and the handling of different kinds of
   parameters.
+- Due to the use of `Inheritance
+  `_,
+  some methods may be implemented in parent classes. All estimators inherit
+  at least from ``BaseEstimator``, and from a ``Mixin`` class that enables
+  default behaviour depending on the nature of the estimator (classifier,
+  regressor, transformer, etc.).
 - Sometimes, reading the tests for a given function will give you an idea of
   what is its intended purpose. You can use ``git grep`` (see below) to find
   all the tests written for a function.

From 6e554a9c164f6c0dc7ba4d48f1fb4aea41d5a484 Mon Sep 17 00:00:00 2001
From: Nicolas Hug
Date: Mon, 31 Dec 2018 11:20:19 -0500
Subject: [PATCH 5/6] Addressed comments

---
 CONTRIBUTING.md                 |  1 +
 doc/developers/contributing.rst | 25 +++++++++++++------------
 2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 32ca91f49f6aa..bca3508478ba5 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -34,6 +34,7 @@ Quick links to:
 * [Submitting a bug report or feature request](http://scikit-learn.org/dev/developers/contributing.html#submitting-a-bug-report-or-a-feature-request)
 * [Contributing code](http://scikit-learn.org/dev/developers/contributing.html#contributing-code)
 * [Coding guidelines](http://scikit-learn.org/dev/developers/contributing.html#coding-guidelines)
+* [Tips to read current code](http://scikit-learn.org/dev/developers/contributing.html#reading-code)
 
 Code of Conduct
 ---------------
diff --git a/doc/developers/contributing.rst b/doc/developers/contributing.rst
index d4ab9c66deb0f..2b2e37d8d89ad 100644
--- a/doc/developers/contributing.rst
+++ b/doc/developers/contributing.rst
@@ -1396,26 +1396,28 @@ that implement common linear model patterns.
 The :mod:`sklearn.utils.multiclass` module contains useful functions
 for working with multiclass and multilabel problems.
 
+.. _reading-code:
+
 Reading the existing code base
 ==============================
 
-Reading and digesting an existing code base is always a difficult exercise that
-takes time and experience to master. Even though we try to write simple code in
-scikit-learn, understanding the code can seem overwhelming at first, given the
-sheer size of the project. Here is a list of tips that may help make this task
-easier and faster (in no particular order).
+Reading and digesting an existing code base is always a difficult exercise
+that takes time and experience to master. Even though we try to write simple
+code in general, understanding the code can seem overwhelming at first,
+given the sheer size of the project. Here is a list of tips that may help
+make this task easier and faster (in no particular order).
 
 - Get acquainted with the :ref:`api_overview`: understand what :term:`fit`,
   :term:`predict`, :term:`transform`, etc. are used for.
 - Before diving into reading the code of a function / class, go through the
   docstrings first and try to get an idea of what each parameter / attribute
   is doing. It may also help to stop a minute and think *how would I do this
-  myself if I had to?*.
+  myself if I had to?*
 - The trickiest thing is often to identify which portions of the code are
   relevant, and which are not. In scikit-learn **a lot** of input checking
   is performed, especially at the beginning of the :term:`fit` methods.
-  Sometimes, only a very small portion of the code is doing the actual job. For
-  example looking at the ``fit()`` method of
+  Sometimes, only a very small portion of the code is doing the actual job.
+  For example looking at the ``fit()`` method of
   :class:`sklearn.linear_model.LinearRegression`, what you're looking for
   might just be the call the ``scipy.linalg.lstsq``, but it is buried into
   multiple lines of input checking and the handling of different kinds of
@@ -1428,7 +1430,8 @@ easier and faster (in no particular order).
   regressor, transformer, etc.).
 - Sometimes, reading the tests for a given function will give you an idea of
   what is its intended purpose. You can use ``git grep`` (see below) to find
-  all the tests written for a function.
+  all the tests written for a function. Most tests for a specific
+  function/class are placed under the ``tests/`` folder of the module
 - You'll often see code looking like this:
   ``out = Parallel(...)(delayed(some_function)(param) for param in
   some_iterable)``. This runs ``some_function`` in parallel using `Joblib
@@ -1436,9 +1439,7 @@ easier and faster (in no particular order).
   the values returned by ``some_function`` for each call.
 - We use `Cython `_ to write fast code. Cython code is
   located in ``.pyx`` and ``.pxd`` files. Cython code has a more C-like
-  flavor: we use pointers, perform manual memory allocation, use OUT
-  variables (variables whose value is changed after a function call, which
-  is frowned upon in pure Python but extremely common in C), etc. Having
+  flavor: we use pointers, perform manual memory allocation, etc. Having
   some minimal experience in C / C++ is pretty much mandatory here.
 - Master your tools.
 

From 3a97f5696e98179e188e1b5af1697f08000c81a1 Mon Sep 17 00:00:00 2001
From: Nicolas Hug
Date: Mon, 31 Dec 2018 12:12:00 -0500
Subject: [PATCH 6/6] Addressed comments from Adrin

---
 doc/developers/contributing.rst | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/doc/developers/contributing.rst b/doc/developers/contributing.rst
index 2b2e37d8d89ad..2ddfea49b6924 100644
--- a/doc/developers/contributing.rst
+++ b/doc/developers/contributing.rst
@@ -1425,11 +1425,12 @@ make this task easier and faster (in no particular order).
 - Due to the use of `Inheritance
   `_,
   some methods may be implemented in parent classes. All estimators inherit
-  at least from ``BaseEstimator``, and from a ``Mixin`` class that enables
-  default behaviour depending on the nature of the estimator (classifier,
-  regressor, transformer, etc.).
+  at least from :class:`BaseEstimator `, and
+  from a ``Mixin`` class (e.g. :class:`ClassifierMixin
+  `) that enables default behaviour depending
+  on the nature of the estimator (classifier, regressor, transformer, etc.).
 - Sometimes, reading the tests for a given function will give you an idea of
-  what is its intended purpose. You can use ``git grep`` (see below) to find
+  what its intended purpose is. You can use ``git grep`` (see below) to find
   all the tests written for a function. Most tests for a specific
   function/class are placed under the ``tests/`` folder of the module
 - You'll often see code looking like this: