Merge pull request #3001 from jnothman/doc_sample_gens · scikit-learn/scikit-learn@da7f009 · GitHub

Commit da7f009

Merge pull request #3001 from jnothman/doc_sample_gens

[MRG] improve documentation on sample generators

2 parents 69ff0b2 + 787e4a0

File tree

6 files changed: +273 -29 lines changed


doc/datasets/index.rst

Lines changed: 93 additions & 13 deletions
@@ -108,33 +108,113 @@ Sample generators
 In addition, scikit-learn includes various random sample generators that
 can be used to build artificial datasets of controlled size and complexity.
 
+Generators for classification and clustering
+--------------------------------------------
+
+These generators produce a matrix of features and corresponding discrete
+targets.
+
+Single label
+~~~~~~~~~~~~
+
+Both :func:`make_blobs` and :func:`make_classification` create multiclass
+datasets by allocating each class one or more normally-distributed clusters of
+points. :func:`make_blobs` provides greater control regarding the centers and
+standard deviations of each cluster, and is used to demonstrate clustering.
+:func:`make_classification` specialises in introducing noise by way of:
+correlated, redundant and uninformative features; multiple Gaussian clusters
+per class; and linear transformations of the feature space.
+
+:func:`make_gaussian_quantiles` divides a single Gaussian cluster into
+near-equal-size classes separated by concentric hyperspheres.
+:func:`make_hastie_10_2` generates a similar binary, 10-dimensional problem.
+
 .. image:: ../auto_examples/datasets/images/plot_random_dataset_001.png
    :target: ../auto_examples/datasets/plot_random_dataset.html
    :scale: 50
    :align: center
 
+:func:`make_circles` and :func:`make_moons` generate 2d binary classification
+datasets that are challenging to certain algorithms (e.g. centroid-based
+clustering or linear classification), including optional Gaussian noise.
+They are useful for visualisation. :func:`make_circles` produces Gaussian
+data with a spherical decision boundary for binary classification.
+
+Multilabel
+~~~~~~~~~~
+
+:func:`make_multilabel_classification` generates random samples with multiple
+labels, reflecting a bag of words drawn from a mixture of topics. The number of
+topics for each document is drawn from a Poisson distribution, and the topics
+themselves are drawn from a fixed random distribution. Similarly, the number of
+words is drawn from Poisson, with words drawn from a multinomial, where each
+topic defines a probability distribution over words. Simplifications with
+respect to true bag-of-words mixtures include:
+
+* Per-topic word distributions are independently drawn, where in reality all
+  would be affected by a sparse base distribution, and would be correlated.
+* For a document generated from multiple topics, all topics are weighted
+  equally in generating its bag of words.
+* Documents without labels draw words at random, rather than from a base
+  distribution.
+
+.. image:: ../auto_examples/datasets/images/plot_random_multilabel_dataset_001.png
+   :target: ../auto_examples/datasets/plot_random_multilabel_dataset.html
+   :scale: 50
+   :align: center
+
+Biclustering
+~~~~~~~~~~~~
+
+.. autosummary::
+
+   :toctree: ../modules/generated/
+   :template: function.rst
+
+   make_biclusters
+   make_checkerboard
+
+
+Generators for regression
+-------------------------
+
+:func:`make_regression` produces regression targets as an optionally-sparse
+random linear combination of random features, with noise. Its informative
+features may be uncorrelated, or low rank (few features account for most of the
+variance).
+
+Other regression generators generate functions deterministically from
+randomized features. :func:`make_sparse_uncorrelated` produces a target as a
+linear combination of four features with fixed coefficients.
+Others encode explicitly non-linear relations:
+:func:`make_friedman1` is related by polynomial and sine transforms;
+:func:`make_friedman2` includes feature multiplication and reciprocation; and
+:func:`make_friedman3` is similar with an arctan transformation on the target.
+
+Generators for manifold learning
+--------------------------------
+
+.. autosummary::
+
+   :toctree: ../modules/generated/
+   :template: function.rst
+
+   make_s_curve
+   make_swiss_roll
+
+Generators for decomposition
+----------------------------
+
 .. autosummary::
 
    :toctree: ../modules/generated/
    :template: function.rst
 
-   make_classification
-   make_multilabel_classification
-   make_regression
-   make_blobs
-   make_friedman1
-   make_friedman2
-   make_friedman3
-   make_hastie_10_2
    make_low_rank_matrix
    make_sparse_coded_signal
-   make_sparse_uncorrelated
    make_spd_matrix
-   make_swiss_roll
-   make_s_curve
    make_sparse_spd_matrix
-   make_biclusters
-   make_checkerboard
+
 
 .. _libsvm_loader:
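
Not part of the commit, but as a quick illustration of the single-label
generators documented in the hunk above, a minimal sketch (parameter values
are arbitrary; the functions and keyword arguments are scikit-learn's
public API):

    from sklearn.datasets import (make_blobs, make_classification,
                                  make_gaussian_quantiles)

    # Three well-separated, normally-distributed clusters; make_blobs
    # exposes the centers and per-cluster spread directly.
    X, y = make_blobs(n_samples=100, n_features=2, centers=3,
                      cluster_std=1.5, random_state=0)

    # A noisier problem: two informative features, one redundant feature
    # (a linear combination of the informative ones), and two Gaussian
    # clusters per class.
    X, y = make_classification(n_samples=100, n_features=3, n_informative=2,
                               n_redundant=1, n_clusters_per_class=2,
                               random_state=0)

    # A single Gaussian divided into three near-equal classes by
    # concentric hyperspheres.
    X, y = make_gaussian_quantiles(n_samples=100, n_features=2, n_classes=3,
                                   random_state=0)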
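
Similarly for the multilabel generator: a hedged sketch using
``return_indicator=True``, as in the example added by this commit (the
remaining values are illustrative):

    from sklearn.datasets import make_multilabel_classification

    # Each row of X holds word counts for one document; Y is a binary
    # indicator matrix marking which of the n_classes topics generated it.
    X, Y = make_multilabel_classification(n_samples=5, n_features=10,
                                          n_classes=3, n_labels=2,
                                          return_indicator=True,
                                          random_state=0)
    print(X.shape)  # (5, 10)
    print(Y.shape)  # (5, 3)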
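
And for the regression generators, a comparable sketch (again, values are
illustrative only):

    from sklearn.datasets import make_friedman1, make_regression

    # y is a random linear combination of 5 informative features out of 10,
    # plus Gaussian noise of the given standard deviation.
    X, y = make_regression(n_samples=100, n_features=10, n_informative=5,
                           noise=1.0, random_state=0)

    # Friedman #1: a fixed non-linear (polynomial and sine) function of the
    # first five features; any further features are uninformative.
    X, y = make_friedman1(n_samples=100, n_features=10, noise=1.0,
                          random_state=0)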

doc/sphinxext/gen_rst.py

Lines changed: 1 addition & 1 deletion
@@ -870,7 +870,7 @@ def generate_file_rst(fname, target_dir, src_dir, root_dir, plot_gallery):
         my_stdout = my_stdout.replace(
             my_globals['__doc__'],
             '')
-        my_stdout = my_stdout.strip()
+        my_stdout = my_stdout.strip().expandtabs()
         if my_stdout:
             stdout = '**Script output**::\n\n  %s\n\n' % (
                 '\n  '.join(my_stdout.split('\n')))
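
The reason for the added ``.expandtabs()`` (a sketch, not from the
repository): the captured script output is re-indented line by line to form
an RST literal block, and raw tab characters would render with inconsistent
width once that indentation is prepended. Expanding tabs to spaces first
keeps columns aligned:

    # str.expandtabs() replaces tabs using 8-column tab stops by default,
    # so tab-separated script output stays aligned after re-indentation.
    my_stdout = "Class\tP(C)\nred\t0.33"
    print(my_stdout.strip().expandtabs())
    # Class   P(C)
    # red     0.33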

examples/datasets/plot_random_dataset.py

Lines changed: 26 additions & 14 deletions
@@ -4,46 +4,58 @@
 ==============================================
 
 Plot several randomly generated 2D classification datasets.
-This example illustrates the `datasets.make_classification`
-function.
+This example illustrates the :func:`datasets.make_classification`
+:func:`datasets.make_blobs` and :func:`datasets.make_gaussian_quantiles`
+functions.
 
-Three binary and two multi-class classification datasets
-are generated, with different numbers of informative
-features and clusters per class.
-"""
+For ``make_classification``, three binary and two multi-class classification
+datasets are generated, with different numbers of informative features and
+clusters per class. """
 
 print(__doc__)
 
 import matplotlib.pyplot as plt
 
 from sklearn.datasets import make_classification
+from sklearn.datasets import make_blobs
+from sklearn.datasets import make_gaussian_quantiles
 
-plt.figure(figsize=(8, 6))
+plt.figure(figsize=(8, 8))
 plt.subplots_adjust(bottom=.05, top=.9, left=.05, right=.95)
 
-plt.subplot(221)
-plt.title("One informative feature, one cluster", fontsize='small')
+plt.subplot(321)
+plt.title("One informative feature, one cluster per class", fontsize='small')
 X1, Y1 = make_classification(n_features=2, n_redundant=0, n_informative=1,
                              n_clusters_per_class=1)
 plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)
 
-plt.subplot(222)
-plt.title("Two informative features, one cluster", fontsize='small')
+plt.subplot(322)
+plt.title("Two informative features, one cluster per class", fontsize='small')
 X1, Y1 = make_classification(n_features=2, n_redundant=0, n_informative=2,
                              n_clusters_per_class=1)
 plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)
 
-plt.subplot(223)
-plt.title("Two informative features, two clusters", fontsize='small')
+plt.subplot(323)
+plt.title("Two informative features, two clusters per class", fontsize='small')
 X2, Y2 = make_classification(n_features=2, n_redundant=0, n_informative=2)
 plt.scatter(X2[:, 0], X2[:, 1], marker='o', c=Y2)
 
 
-plt.subplot(224)
+plt.subplot(324)
 plt.title("Multi-class, two informative features, one cluster",
           fontsize='small')
 X1, Y1 = make_classification(n_features=2, n_redundant=0, n_informative=2,
                              n_clusters_per_class=1, n_classes=3)
 plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)
 
+plt.subplot(325)
+plt.title("Three blobs", fontsize='small')
+X1, Y1 = make_blobs(n_features=2, centers=3)
+plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)
+
+plt.subplot(326)
+plt.title("Gaussian divided into three quantiles", fontsize='small')
+X1, Y1 = make_gaussian_quantiles(n_features=2, n_classes=3)
+plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)
+
 plt.show()

examples/datasets/plot_random_multilabel_dataset.py

Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
+"""
+==============================================
+Plot randomly generated multilabel dataset
+==============================================
+
+This illustrates the `datasets.make_multilabel_classification` dataset
+generator. Each sample consists of counts of two features (up to 50 in
+total), which are differently distributed in each of two classes.
+
+Points are labeled as follows, where Y means the class is present:
+
+=====  =====  =====  ======
+  1      2      3    Color
+=====  =====  =====  ======
+  Y      N      N    Red
+  N      Y      N    Blue
+  N      N      Y    Yellow
+  Y      Y      N    Purple
+  Y      N      Y    Orange
+  N      Y      Y    Green
+  Y      Y      Y    Brown
+=====  =====  =====  ======
+
+A star marks the expected sample for each class; its size reflects the
+probability of selecting that class label.
+
+The left and right examples highlight the ``n_labels`` parameter:
+more of the samples in the right plot have 2 or 3 labels.
+
+Note that this two-dimensional example is very degenerate:
+generally the number of features would be much greater than the
+"document length", while here we have much larger documents than vocabulary.
+Similarly, with ``n_classes > n_features``, it is much less likely that a
+feature distinguishes a particular class.
+"""
+
+from __future__ import print_function
+import numpy as np
+import matplotlib.pyplot as plt
+
+from sklearn.datasets import make_multilabel_classification as make_ml_clf
+
+print(__doc__)
+
+COLORS = np.array(['!',
+                   '#FF3333',  # red
+                   '#0198E1',  # blue
+                   '#BF5FFF',  # purple
+                   '#FCD116',  # yellow
+                   '#FF7216',  # orange
+                   '#4DBD33',  # green
+                   '#87421F'   # brown
+                   ])
+
+# Use same random seed for multiple calls to make_multilabel_classification to
+# ensure same distributions
+RANDOM_SEED = np.random.randint(2 ** 10)
+
+
+def plot_2d(ax, n_labels=1, n_classes=3, length=50):
+    X, Y, p_c, p_w_c = make_ml_clf(n_samples=150, n_features=2,
+                                   n_classes=n_classes, n_labels=n_labels,
+                                   length=length, allow_unlabeled=False,
+                                   return_indicator=True,
+                                   return_distributions=True,
+                                   random_state=RANDOM_SEED)
+
+    ax.scatter(X[:, 0], X[:, 1], color=COLORS.take((Y * [1, 2, 4]
+                                                    ).sum(axis=1)),
+               marker='.')
+    ax.scatter(p_w_c[0] * length, p_w_c[1] * length,
+               marker='*', linewidth=.5, edgecolor='black',
+               s=20 + 1500 * p_c ** 2,
+               color=COLORS.take([1, 2, 4]))
+    ax.set_xlabel('Feature 0 count')
+    return p_c, p_w_c
+
+
+_, (ax1, ax2) = plt.subplots(1, 2, sharex='row', sharey='row', figsize=(8, 4))
+plt.subplots_adjust(bottom=.15)
+
+p_c, p_w_c = plot_2d(ax1, n_labels=1)
+ax1.set_title('n_labels=1, length=50')
+ax1.set_ylabel('Feature 1 count')
+
+plot_2d(ax2, n_labels=3)
+ax2.set_title('n_labels=3, length=50')
+ax2.set_xlim(left=0, auto=True)
+ax2.set_ylim(bottom=0, auto=True)
+
+plt.show()
+
+print('The data was generated from (random_state=%d):' % RANDOM_SEED)
+print('Class', 'P(C)', 'P(w0|C)', 'P(w1|C)', sep='\t')
+for k, p, p_w in zip(['red', 'blue', 'yellow'], p_c, p_w_c.T):
+    print('%s\t%0.2f\t%0.2f\t%0.2f' % (k, p, p_w[0], p_w[1]))

0 commit comments