Merge pull request #3001 from jnothman/doc_sample_gens · scikit-learn/scikit-learn@da7f009 · GitHub

Commit da7f009

Merge pull request #3001 from jnothman/doc_sample_gens

[MRG] improve documentation on sample generators

2 parents 69ff0b2 + 787e4a0

File tree

6 files changed: +273 -29 lines changed


doc/datasets/index.rst

Lines changed: 93 additions & 13 deletions
@@ -108,33 +108,113 @@ Sample generators
 In addition, scikit-learn includes various random sample generators that
 can be used to build artificial datasets of controlled size and complexity.
 
+Generators for classification and clustering
+--------------------------------------------
+
+These generators produce a matrix of features and corresponding discrete
+targets.
+
+Single label
+~~~~~~~~~~~~
+
+Both :func:`make_blobs` and :func:`make_classification` create multiclass
+datasets by allocating each class one or more normally-distributed clusters of
+points. :func:`make_blobs` provides greater control regarding the centers and
+standard deviations of each cluster, and is used to demonstrate clustering.
+:func:`make_classification` specialises in introducing noise by way of:
+correlated, redundant and uninformative features; multiple Gaussian clusters
+per class; and linear transformations of the feature space.
+
+:func:`make_gaussian_quantiles` divides a single Gaussian cluster into
+near-equal-size classes separated by concentric hyperspheres.
+:func:`make_hastie_10_2` generates a similar binary, 10-dimensional problem.
+
 .. image:: ../auto_examples/datasets/images/plot_random_dataset_001.png
    :target: ../auto_examples/datasets/plot_random_dataset.html
    :scale: 50
    :align: center
 
+:func:`make_circles` and :func:`make_moons` generate 2d binary classification
+datasets that are challenging to certain algorithms (e.g. centroid-based
+clustering or linear classification), including optional Gaussian noise.
+They are useful for visualisation. :func:`make_circles` produces Gaussian
+data with a spherical decision boundary for binary classification.
+
+Multilabel
+~~~~~~~~~~
+
+:func:`make_multilabel_classification` generates random samples with multiple
+labels, reflecting a bag of words drawn from a mixture of topics. The number of
+topics for each document is drawn from a Poisson distribution, and the topics
+themselves are drawn from a fixed random distribution. Similarly, the number of
+words is drawn from Poisson, with words drawn from a multinomial, where each
+topic defines a probability distribution over words. Simplifications with
+respect to true bag-of-words mixtures include:
+
+* Per-topic word distributions are independently drawn, where in reality all
+  would be affected by a sparse base distribution, and would be correlated.
+* For a document generated from multiple topics, all topics are weighted
+  equally in generating its bag of words.
+* Documents without labels draw words at random, rather than from a base
+  distribution.
+
+.. image:: ../auto_examples/datasets/images/plot_random_multilabel_dataset_001.png
+   :target: ../auto_examples/datasets/plot_random_multilabel_dataset.html
+   :scale: 50
+   :align: center
+
+Biclustering
+~~~~~~~~~~~~
+
+.. autosummary::
+
+   :toctree: ../modules/generated/
+   :template: function.rst
+
+   make_biclusters
+   make_checkerboard
+
+
+Generators for regression
+-------------------------
+
+:func:`make_regression` produces regression targets as an optionally-sparse
+random linear combination of random features, with noise. Its informative
+features may be uncorrelated, or low rank (few features account for most of the
+variance).
+
+Other regression generators generate functions deterministically from
+randomized features. :func:`make_sparse_uncorrelated` produces a target as a
+linear combination of four features with fixed coefficients.
+Others encode explicitly non-linear relations:
+:func:`make_friedman1` is related by polynomial and sine transforms;
+:func:`make_friedman2` includes feature multiplication and reciprocation; and
+:func:`make_friedman3` is similar with an arctan transformation on the target.
+
+Generators for manifold learning
+--------------------------------
+
+.. autosummary::
+
+   :toctree: ../modules/generated/
+   :template: function.rst
+
+   make_s_curve
+   make_swiss_roll
+
+Generators for decomposition
+----------------------------
+
 .. autosummary::
 
    :toctree: ../modules/generated/
    :template: function.rst
 
-   make_classification
-   make_multilabel_classification
-   make_regression
-   make_blobs
-   make_friedman1
-   make_friedman2
-   make_friedman3
-   make_hastie_10_2
    make_low_rank_matrix
    make_sparse_coded_signal
-   make_sparse_uncorrelated
    make_spd_matrix
-   make_swiss_roll
-   make_s_curve
    make_sparse_spd_matrix
-   make_biclusters
-   make_checkerboard
+
 
 .. _libsvm_loader:
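
Not part of the commit, but as a quick illustration of the single-label
generators documented in the hunk above, a minimal sketch (parameter values
are arbitrary; the functions and keyword arguments are scikit-learn's
public API):

    from sklearn.datasets import (make_blobs, make_classification,
                                  make_gaussian_quantiles)

    # Three well-separated, normally-distributed clusters; make_blobs
    # exposes the centers and per-cluster spread directly.
    X, y = make_blobs(n_samples=100, n_features=2, centers=3,
                      cluster_std=1.5, random_state=0)

    # A noisier problem: two informative features, one redundant feature
    # (a linear combination of the informative ones), and two Gaussian
    # clusters per class.
    X, y = make_classification(n_samples=100, n_features=3, n_informative=2,
                               n_redundant=1, n_clusters_per_class=2,
                               random_state=0)

    # A single Gaussian divided into three near-equal classes by
    # concentric hyperspheres.
    X, y = make_gaussian_quantiles(n_samples=100, n_features=2, n_classes=3,
                                   random_state=0)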
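
Similarly for the multilabel generator: a hedged sketch using
``return_indicator=True``, as in the example added by this commit (the
remaining values are illustrative):

    from sklearn.datasets import make_multilabel_classification

    # Each row of X holds word counts for one document; Y is a binary
    # indicator matrix marking which of the n_classes topics generated it.
    X, Y = make_multilabel_classification(n_samples=5, n_features=10,
                                          n_classes=3, n_labels=2,
                                          return_indicator=True,
                                          random_state=0)
    print(X.shape)  # (5, 10)
    print(Y.shape)  # (5, 3)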
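
And for the regression generators, a comparable sketch (again, values are
illustrative only):

    from sklearn.datasets import make_friedman1, make_regression

    # y is a random linear combination of 5 informative features out of 10,
    # plus Gaussian noise of the given standard deviation.
    X, y = make_regression(n_samples=100, n_features=10, n_informative=5,
                           noise=1.0, random_state=0)

    # Friedman #1: a fixed non-linear (polynomial and sine) function of the
    # first five features; any further features are uninformative.
    X, y = make_friedman1(n_samples=100, n_features=10, noise=1.0,
                          random_state=0)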

doc/sphinxext/gen_rst.py

Lines changed: 1 addition & 1 deletion
@@ -870,7 +870,7 @@ def generate_file_rst(fname, target_dir, src_dir, root_dir, plot_gallery):
         my_stdout = my_stdout.replace(
             my_globals['__doc__'],
             '')
-        my_stdout = my_stdout.strip()
+        my_stdout = my_stdout.strip().expandtabs()
         if my_stdout:
             stdout = '**Script output**::\n\n  %s\n\n' % (
                 '\n  '.join(my_stdout.split('\n')))
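
The reason for the added ``.expandtabs()`` (a sketch, not from the
repository): the captured script output is re-indented line by line to form
an RST literal block, and raw tab characters would render with inconsistent
width once that indentation is prepended. Expanding tabs to spaces first
keeps columns aligned:

    # str.expandtabs() replaces tabs using 8-column tab stops by default,
    # so tab-separated script output stays aligned after re-indentation.
    my_stdout = "Class\tP(C)\nred\t0.33"
    print(my_stdout.strip().expandtabs())
    # Class   P(C)
    # red     0.33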

examples/datasets/plot_random_dataset.py

Lines changed: 26 additions & 14 deletions
@@ -4,46 +4,58 @@
 ==============================================
 
 Plot several randomly generated 2D classification datasets.
-This example illustrates the `datasets.make_classification`
-function.
+This example illustrates the :func:`datasets.make_classification`
+:func:`datasets.make_blobs` and :func:`datasets.make_gaussian_quantiles`
+functions.
 
-Three binary and two multi-class classification datasets
-are generated, with different numbers of informative
-features and clusters per class.
-"""
+For ``make_classification``, three binary and two multi-class classification
+datasets are generated, with different numbers of informative features and
+clusters per class. """
 
 print(__doc__)
 
 import matplotlib.pyplot as plt
 
 from sklearn.datasets import make_classification
+from sklearn.datasets import make_blobs
+from sklearn.datasets import make_gaussian_quantiles
 
-plt.figure(figsize=(8, 6))
+plt.figure(figsize=(8, 8))
 plt.subplots_adjust(bottom=.05, top=.9, left=.05, right=.95)
 
-plt.subplot(221)
-plt.title("One informative feature, one cluster", fontsize='small')
+plt.subplot(321)
+plt.title("One informative feature, one cluster per class", fontsize='small')
 X1, Y1 = make_classification(n_features=2, n_redundant=0, n_informative=1,
                              n_clusters_per_class=1)
 plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)
 
-plt.subplot(222)
-plt.title("Two informative features, one cluster", fontsize='small')
+plt.subplot(322)
+plt.title("Two informative features, one cluster per class", fontsize='small')
 X1, Y1 = make_classification(n_features=2, n_redundant=0, n_informative=2,
                              n_clusters_per_class=1)
 plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)
 
-plt.subplot(223)
-plt.title("Two informative features, two clusters", fontsize='small')
+plt.subplot(323)
+plt.title("Two informative features, two clusters per class", fontsize='small')
 X2, Y2 = make_classification(n_features=2, n_redundant=0, n_informative=2)
 plt.scatter(X2[:, 0], X2[:, 1], marker='o', c=Y2)
 
 
-plt.subplot(224)
+plt.subplot(324)
 plt.title("Multi-class, two informative features, one cluster",
           fontsize='small')
 X1, Y1 = make_classification(n_features=2, n_redundant=0, n_informative=2,
                              n_clusters_per_class=1, n_classes=3)
 plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)
 
+plt.subplot(325)
+plt.title("Three blobs", fontsize='small')
+X1, Y1 = make_blobs(n_features=2, centers=3)
+plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)
+
+plt.subplot(326)
+plt.title("Gaussian divided into three quantiles", fontsize='small')
+X1, Y1 = make_gaussian_quantiles(n_features=2, n_classes=3)
+plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)
+
 plt.show()

examples/datasets/plot_random_multilabel_dataset.py

Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
+"""
+==============================================
+Plot randomly generated multilabel dataset
+==============================================
+
+This illustrates the `datasets.make_multilabel_classification` dataset
+generator. Each sample consists of counts of two features (up to 50 in
+total), which are differently distributed in each of two classes.
+
+Points are labeled as follows, where Y means the class is present:
+
+=====  =====  =====  ======
+  1      2      3    Color
+=====  =====  =====  ======
+  Y      N      N    Red
+  N      Y      N    Blue
+  N      N      Y    Yellow
+  Y      Y      N    Purple
+  Y      N      Y    Orange
+  N      Y      Y    Green
+  Y      Y      Y    Brown
+=====  =====  =====  ======
+
+A star marks the expected sample for each class; its size reflects the
+probability of selecting that class label.
+
+The left and right examples highlight the ``n_labels`` parameter:
+more of the samples in the right plot have 2 or 3 labels.
+
+Note that this two-dimensional example is very degenerate:
+generally the number of features would be much greater than the
+"document length", while here we have much larger documents than vocabulary.
+Similarly, with ``n_classes > n_features``, it is much less likely that a
+feature distinguishes a particular class.
+"""
+
+from __future__ import print_function
+import numpy as np
+import matplotlib.pyplot as plt
+
+from sklearn.datasets import make_multilabel_classification as make_ml_clf
+
+print(__doc__)
+
+COLORS = np.array(['!',
+                   '#FF3333',  # red
+                   '#0198E1',  # blue
+                   '#BF5FFF',  # purple
+                   '#FCD116',  # yellow
+                   '#FF7216',  # orange
+                   '#4DBD33',  # green
+                   '#87421F'   # brown
+                   ])
+
+# Use same random seed for multiple calls to make_multilabel_classification to
+# ensure same distributions
+RANDOM_SEED = np.random.randint(2 ** 10)
+
+
+def plot_2d(ax, n_labels=1, n_classes=3, length=50):
+    X, Y, p_c, p_w_c = make_ml_clf(n_samples=150, n_features=2,
+                                   n_classes=n_classes, n_labels=n_labels,
+                                   length=length, allow_unlabeled=False,
+                                   return_indicator=True,
+                                   return_distributions=True,
+                                   random_state=RANDOM_SEED)
+
+    ax.scatter(X[:, 0], X[:, 1], color=COLORS.take((Y * [1, 2, 4]
+                                                    ).sum(axis=1)),
+               marker='.')
+    ax.scatter(p_w_c[0] * length, p_w_c[1] * length,
+               marker='*', linewidth=.5, edgecolor='black',
+               s=20 + 1500 * p_c ** 2,
+               color=COLORS.take([1, 2, 4]))
+    ax.set_xlabel('Feature 0 count')
+    return p_c, p_w_c
+
+
+_, (ax1, ax2) = plt.subplots(1, 2, sharex='row', sharey='row', figsize=(8, 4))
+plt.subplots_adjust(bottom=.15)
+
+p_c, p_w_c = plot_2d(ax1, n_labels=1)
+ax1.set_title('n_labels=1, length=50')
+ax1.set_ylabel('Feature 1 count')
+
+plot_2d(ax2, n_labels=3)
+ax2.set_title('n_labels=3, length=50')
+ax2.set_xlim(left=0, auto=True)
+ax2.set_ylim(bottom=0, auto=True)
+
+plt.show()
+
+print('The data was generated from (random_state=%d):' % RANDOM_SEED)
+print('Class', 'P(C)', 'P(w0|C)', 'P(w1|C)', sep='\t')
+for k, p, p_w in zip(['red', 'blue', 'yellow'], p_c, p_w_c.T):
+    print('%s\t%0.2f\t%0.2f\t%0.2f' % (k, p, p_w[0], p_w[1]))

0 commit comments