diff --git a/doc/modules/neural_networks_supervised.rst b/doc/modules/neural_networks_supervised.rst
index 388f32e7c6925..61b9c83509979 100644
--- a/doc/modules/neural_networks_supervised.rst
+++ b/doc/modules/neural_networks_supervised.rst
@@ -29,44 +29,51 @@ between the input and the output layer, there can be one or more non-linear
 layers, called hidden layers. Figure 1 shows a one hidden layer MLP with
 scalar output.
 
-.. figure:: ../images/multilayerperceptron_network.png
-   :align: center
-   :scale: 60%
+.. container:: dropdown
+
+   .. topic:: Figure 1: One hidden layer MLP
+
+      .. figure:: ../images/multilayerperceptron_network.png
+         :align: center
+         :scale: 60%
+
+         **Figure 1 : One hidden layer MLP.**
+
+      The leftmost layer, known as the input layer, consists of a set of
+      neurons :math:`\{x_i | x_1, x_2, ..., x_m\}` representing the input
+      features. Each neuron in the hidden layer transforms the values from the
+      previous layer with a weighted linear summation
+      :math:`w_1x_1 + w_2x_2 + ... + w_mx_m`, followed by a non-linear
+      activation function :math:`g(\cdot):R \rightarrow R` - like the
+      hyperbolic tan function. The output layer receives the values from the
+      last hidden layer and transforms them into output values.
+
+.. container:: dropdown
 
-   **Figure 1 : One hidden layer MLP.**
+   .. topic:: Details about `coefs_` and `intercepts_`
 
-The leftmost layer, known as the input layer, consists of a set of neurons
-:math:`\{x_i | x_1, x_2, ..., x_m\}` representing the input features. Each
-neuron in the hidden layer transforms the values from the previous layer with
-a weighted linear summation :math:`w_1x_1 + w_2x_2 + ... + w_mx_m`, followed
-by a non-linear activation function :math:`g(\cdot):R \rightarrow R` - like
-the hyperbolic tan function. The output layer receives the values from the
-last hidden layer and transforms them into output values.
+      The module contains the public attributes ``coefs_`` and ``intercepts_``.
+      ``coefs_`` is a list of weight matrices, where the weight matrix at index
+      :math:`i` represents the weights between layer :math:`i` and layer
+      :math:`i+1`. ``intercepts_`` is a list of bias vectors, where the vector
+      at index :math:`i` represents the bias values added to layer :math:`i+1`.
 
-The module contains the public attributes ``coefs_`` and ``intercepts_``.
-``coefs_`` is a list of weight matrices, where weight matrix at index
-:math:`i` represents the weights between layer :math:`i` and layer
-:math:`i+1`. ``intercepts_`` is a list of bias vectors, where the vector
-at index :math:`i` represents the bias values added to layer :math:`i+1`.
+.. topic:: Advantages of Multi-layer Perceptron
 
-The advantages of Multi-layer Perceptron are:
+   The advantages of Multi-layer Perceptron are:
 
-    + Capability to learn non-linear models.
+   - Capability to learn non-linear models.
+   - Capability to learn models in real-time (on-line learning) using
+     ``partial_fit``.
 
-    + Capability to learn models in real-time (on-line learning)
-      using ``partial_fit``.
+.. topic:: Disadvantages of Multi-layer Perceptron (MLP)
 
-The disadvantages of Multi-layer Perceptron (MLP) include:
+   The disadvantages of Multi-layer Perceptron (MLP) include:
 
-    + MLP with hidden layers have a non-convex loss function where there exists
-      more than one local minimum. Therefore different random weight
-      initializations can lead to different validation accuracy.
+   - MLP with hidden layers has a non-convex loss function with more than one
+     local minimum, so different random weight initializations can lead to
+     different validation accuracy.
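A quick way to sanity-check the indexing convention described in the new
`coefs_` / `intercepts_` dropdown is to fit a tiny network and inspect the
shapes. A minimal sketch, assuming an arbitrary ``hidden_layer_sizes=(5, 2)``
and toy two-feature data:

    >>> from sklearn.neural_network import MLPClassifier
    >>> X, y = [[0., 0.], [1., 1.]], [0, 1]
    >>> clf = MLPClassifier(hidden_layer_sizes=(5, 2), solver='lbfgs',
    ...                     random_state=1, max_iter=1000)
    >>> clf = clf.fit(X, y)
    >>> # coefs_[i] maps layer i to layer i+1: 2 inputs -> 5 -> 2 -> 1 output
    >>> [coef.shape for coef in clf.coefs_]
    [(2, 5), (5, 2), (2, 1)]
    >>> # intercepts_[i] holds the biases added to layer i+1
    >>> [bias.shape for bias in clf.intercepts_]
    [(5,), (2,), (1,)]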
@@ -308,38 +315,25 @@ when the improvement in loss is below a certain, small number.
 
 .. _mlp_tips:
 
-Tips on Practical Use
-=====================
-
-  * Multi-layer Perceptron is sensitive to feature scaling, so it
-    is highly recommended to scale your data. For example, scale each
-    attribute on the input vector X to [0, 1] or [-1, +1], or standardize
-    it to have mean 0 and variance 1. Note that you must apply the *same*
-    scaling to the test set for meaningful results.
-    You can use :class:`~sklearn.preprocessing.StandardScaler` for standardization.
-
-      >>> from sklearn.preprocessing import StandardScaler # doctest: +SKIP
-      >>> scaler = StandardScaler() # doctest: +SKIP
-      >>> # Don't cheat - fit only on training data
-      >>> scaler.fit(X_train) # doctest: +SKIP
-      >>> X_train = scaler.transform(X_train) # doctest: +SKIP
-      >>> # apply same transformation to test data
-      >>> X_test = scaler.transform(X_test) # doctest: +SKIP
-
-    An alternative and recommended approach is to use
-    :class:`~sklearn.preprocessing.StandardScaler` in a
-    :class:`~sklearn.pipeline.Pipeline`
-
-  * Finding a reasonable regularization parameter :math:`\alpha` is best done
-    using :class:`~sklearn.model_selection.GridSearchCV`, usually in the range
-    ``10.0 ** -np.arange(1, 7)``.
-
-  * Empirically, we observed that `L-BFGS` converges faster and
-    with better solutions on small datasets. For relatively large
-    datasets, however, `Adam` is very robust. It usually converges
-    quickly and gives pretty good performance. `SGD` with momentum or
-    nesterov's momentum, on the other hand, can perform better than
-    those two algorithms if learning rate is correctly tuned.
+.. container:: dropdown
+
+   .. topic:: Tips on Practical Use
+
+      - Multi-layer Perceptron is sensitive to feature scaling, so it is
+        highly recommended to scale your data. For example, scale each
+        attribute on the input vector X to [0, 1] or [-1, +1], or standardize
+        it to have mean 0 and variance 1. Note that you must apply the *same*
+        scaling to the test set for meaningful results. You can use
+        :class:`~sklearn.preprocessing.StandardScaler` for standardization.
+
+          >>> from sklearn.preprocessing import StandardScaler # doctest: +SKIP
+          >>> scaler = StandardScaler() # doctest: +SKIP
+          >>> # Don't cheat - fit only on training data
+          >>> scaler.fit(X_train) # doctest: +SKIP
+          >>> X_train = scaler.transform(X_train) # doctest: +SKIP
+          >>> # apply the same transformation to the test data
+          >>> X_test = scaler.transform(X_test) # doctest: +SKIP
+
+      - An alternative and recommended approach is to use
+        :class:`~sklearn.preprocessing.StandardScaler` in a
+        :class:`~sklearn.pipeline.Pipeline`, as shown in the sketch after
+        this list.
+
+      - Finding a reasonable regularization parameter :math:`\alpha` is best
+        done using :class:`~sklearn.model_selection.GridSearchCV`, usually in
+        the range ``10.0 ** -np.arange(1, 7)``.
+
+      - Empirically, we observed that `L-BFGS` converges faster and with
+        better solutions on small datasets. For relatively large datasets,
+        however, `Adam` is very robust: it usually converges quickly and gives
+        pretty good performance. `SGD` with momentum or Nesterov's momentum,
+        on the other hand, can perform better than those two algorithms if the
+        learning rate is correctly tuned.
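The scaling and :math:`\alpha`-search tips above combine naturally: putting
:class:`~sklearn.preprocessing.StandardScaler` and the estimator into one
:class:`~sklearn.pipeline.Pipeline` lets
:class:`~sklearn.model_selection.GridSearchCV` refit the scaler on each
training split, avoiding leakage into the validation folds. A minimal sketch,
where ``X_train`` and ``y_train`` are placeholders (hence the skips):

    >>> import numpy as np # doctest: +SKIP
    >>> from sklearn.pipeline import make_pipeline # doctest: +SKIP
    >>> from sklearn.preprocessing import StandardScaler # doctest: +SKIP
    >>> from sklearn.neural_network import MLPClassifier # doctest: +SKIP
    >>> from sklearn.model_selection import GridSearchCV # doctest: +SKIP
    >>> pipe = make_pipeline(StandardScaler(), MLPClassifier(random_state=1)) # doctest: +SKIP
    >>> # make_pipeline names each step after its lowercased class name
    >>> params = {"mlpclassifier__alpha": 10.0 ** -np.arange(1, 7)} # doctest: +SKIP
    >>> search = GridSearchCV(pipe, params, cv=5) # doctest: +SKIP
    >>> search.fit(X_train, y_train) # doctest: +SKIP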
 
 More control with warm_start
 ============================
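As a quick illustration of the ``warm_start`` pattern this section covers:
with ``warm_start=True``, successive calls to ``fit`` continue from the
previously learned coefficients instead of re-initializing them, which allows
stepping the optimizer manually. A minimal sketch with placeholder toy
``X``/``y``; each one-iteration call is expected to emit a
``ConvergenceWarning``:

    >>> from sklearn.neural_network import MLPClassifier
    >>> X, y = [[0., 0.], [1., 1.]], [0, 1]
    >>> clf = MLPClassifier(hidden_layer_sizes=(15,), random_state=1,
    ...                     max_iter=1, warm_start=True)
    >>> for _ in range(10):
    ...     _ = clf.fit(X, y)  # one more iteration from the previous weights;
    ...                        # inspect clf.loss_ here for custom monitoring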