[WIP] Make LabelEncoder more friendly to new labels by mjbommar · Pull Request #3243 · scikit-learn/scikit-learn · GitHub

[WIP] Make LabelEncoder more friendly to new labels #3243


Closed
59 commits
4d79789
Adding new_labels argument to LabelEncoder
mjbommar Jun 4, 2014
2bc5686
Adding tests for new_labels argument.
mjbommar Jun 4, 2014
8c1fafe
Changing classes_ update strategy
mjbommar Jun 4, 2014
1ffb24a
Adding nan behavior, renaming to
mjbommar Jun 4, 2014
99f65a9
Updating tests to include nan case and update name
mjbommar Jun 4, 2014
af8c6a9
Fixing docstring for test-doc pass
mjbommar Jun 4, 2014
8ffc839
Fixing docstring for test-doc pass (for real)
mjbommar Jun 4, 2014
e6fbc47
Updating doctests
mjbommar Jun 4, 2014
46118d9
Updating constructor documentation
mjbommar Jun 4, 2014
8d21ec1
Adding specific "label" option to new_labels
mjbommar Jun 5, 2014
343c726
Adding test for "label" option to ``new_labels``
mjbommar Jun 5, 2014
be97c14
Updating docstring for ``new_labels="label"``
mjbommar Jun 5, 2014
cdd7147
pep8
mjbommar Jun 5, 2014
170d00c
Autodoc fix
mjbommar Jun 5, 2014
2d87e88
Fixing rst docs
mjbommar Jun 8, 2014
bb8d9a6
Changing dtypes for new_labels
mjbommar Jun 8, 2014
ab788f7
Adding example for new_labels argument
mjbommar Jun 8, 2014
a597fc3
Adding new_labels handling to fit/fit_transform
mjbommar Jul 15, 2014
291d752
Improving difficulty of test cases with non-increasing unseen labels
mjbommar Jul 15, 2014
fe0141d
Moving ValueError check to fit
mjbommar Jul 15, 2014
e1b7ed5
Improving difficult for new_labels='update' test to include multiple …
mjbommar Jul 15, 2014
9fd7736
Fixing negative indexing, renamed z->out, failing approach for new_la…
mjbommar Jul 15, 2014
e3c14bb
PEP8
mjbommar Jul 15, 2014
fe79736
Removing nan option and corresponding test
mjbommar Jul 19, 2014
b83b37f
Handling repeated transform calls with new_class_mapping_, refactorin…
mjbommar Jul 19, 2014
0b8e63c
Update outlier_detection.rst
pvnguyen Jul 21, 2014
62f1f57
Added directory checking for documentation builds, and corrected for …
kastnerkyle Jul 23, 2014
8cd5d85
Merge pull request #3477 from kastnerkyle/docbuild_fix
kastnerkyle Jul 23, 2014
965b109
Merge pull request #3479 from MechCoder/improve_logcv_docs
larsmans Jul 23, 2014
d814353
MAINT More robust windows installation script
ogrisel Jul 23, 2014
0af2d8f
MAINT move skip for unstable 32bit to _check_transformer
ogrisel Jul 23, 2014
f3afd4e
FIX unstable test on 32 bit windows
ogrisel Jul 23, 2014
376ac51
Merge pull request #3465 from pvnguyen/patch-1
agramfort Jul 24, 2014
4b6978e
Adding new_labels argument to LabelEncoder
mjbommar Jun 4, 2014
d990207
Adding tests for new_labels argument.
mjbommar Jun 4, 2014
a69840b
Changing classes_ update strategy
mjbommar Jun 4, 2014
fce9fb5
Adding nan behavior, renaming to
mjbommar Jun 4, 2014
76921e5
Updating tests to include nan case and update name
mjbommar Jun 4, 2014
0e39a2a
Fixing docstring for test-doc pass
mjbommar Jun 4, 2014
1da2880
Fixing docstring for test-doc pass (for real)
mjbommar Jun 4, 2014
926b166
Updating doctests
mjbommar Jun 4, 2014
5ef9b85
Updating constructor documentation
mjbommar Jun 4, 2014
4dfb4cb
Adding specific "label" option to new_labels
mjbommar Jun 5, 2014
392e54b
Adding test for "label" option to ``new_labels``
mjbommar Jun 5, 2014
e053635
Updating docstring for ``new_labels="label"``
mjbommar Jun 5, 2014
122a98f
pep8
mjbommar Jun 5, 2014
de18372
Autodoc fix
mjbommar Jun 5, 2014
d735ca2
Fixing rst docs
mjbommar Jun 8, 2014
d276565
Changing dtypes for new_labels
mjbommar Jun 8, 2014
a01f8b0
Adding example for new_labels argument
mjbommar Jun 8, 2014
495347c
Adding new_labels handling to fit/fit_transform
mjbommar Jul 15, 2014
dee4ae0
Improving difficulty of test cases with non-increasing unseen labels
mjbommar Jul 15, 2014
c297017
Moving ValueError check to fit
mjbommar Jul 15, 2014
f29800b
Improving difficult for new_labels='update' test to include multiple …
mjbommar Jul 15, 2014
74b7589
Fixing negative indexing, renamed z->out, failing approach for new_la…
mjbommar Jul 15, 2014
3e1be5d
PEP8
mjbommar Jul 15, 2014
abf01cc
Removing nan option and corresponding test
mjbommar Jul 19, 2014
f26a902
Handling repeated transform calls with new_class_mapping_, refactorin…
mjbommar Jul 19, 2014
0725d4c
Rebase
mjbommar Jul 24, 2014
11 changes: 8 additions & 3 deletions appveyor.yml
@@ -13,18 +13,23 @@ environment:
PYTHON_VERSION: "2.7.8"
PYTHON_ARCH: "32"

- PYTHON: "C:\\Python27"
- PYTHON: "C:\\Python27_64"
PYTHON_VERSION: "2.7.8"
PYTHON_ARCH: "64"

- PYTHON: "C:\\Python34_32"
PYTHON_VERSION: "3.4.1"
PYTHON_ARCH: "32"

- PYTHON: "C:\\Python34"
- PYTHON: "C:\\Python34_64"
PYTHON_VERSION: "3.4.1"
PYTHON_ARCH: "64"

branches:
only:
- master
- 0.15.X

install:
# Install Python (from the official .msi of http://python.org) and pip when
# not already installed.
@@ -53,7 +58,7 @@ test_script:

# Skip joblib tests that require multiprocessing as they are prone to random
# slow down
- "python -c \"import nose; nose.main()\" -v -s sklearn"
- "python -c \"import nose; nose.main()\" -s sklearn"

artifacts:
# Archive the generated wheel package in the ci.appveyor.com build report.
21 changes: 16 additions & 5 deletions continuous_integration/appveyor/install.ps1
@@ -52,12 +52,17 @@ function InstallPython ($python_version, $architecture, $python_home) {
} else {
$platform_suffix = ".amd64"
}
$filepath = DownloadPython $python_version $platform_suffix
Write-Host "Installing" $filepath "to" $python_home
$msipath = DownloadPython $python_version $platform_suffix
Write-Host "Installing" $msipath "to" $python_home
$install_log = $python_home + ".log"
$args = "/qn /log $install_log /i $filepath TARGETDIR=$python_home"
Write-Host "msiexec.exe" $args
Start-Process -FilePath "msiexec.exe" -ArgumentList $args -Wait -Passthru
$install_args = "/qn /log $install_log /i $msipath TARGETDIR=$python_home"
$uninstall_args = "/qn /x $msipath"
RunCommand "msiexec.exe" $install_args
if (-not(Test-Path $python_home)) {
Write-Host "Python seems to be installed else-where, reinstalling."
RunCommand "msiexec.exe" $uninstall_args
RunCommand "msiexec.exe" $install_args
}
if (Test-Path $python_home) {
Write-Host "Python $python_version ($architecture) installation complete"
} else {
@@ -67,6 +72,11 @@ function InstallPython ($python_version, $architecture, $python_home) {
}
}

function RunCommand ($command, $command_args) {
Write-Host $command $command_args
Start-Process -FilePath $command -ArgumentList $command_args -Wait -Passthru
}


function InstallPip ($python_home) {
$pip_path = $python_home + "\Scripts\pip.exe"
@@ -82,6 +92,7 @@ function InstallPip ($python_home) {
}
}


function main () {
InstallPython $env:PYTHON_VERSION $env:PYTHON_ARCH $env:PYTHON
InstallPip $env:PYTHON
12 changes: 9 additions & 3 deletions doc/modules/outlier_detection.rst
@@ -53,8 +53,8 @@ coming from the same population than the initial
observations. Otherwise, if they lay outside the frontier, we can say
that they are abnormal with a given confidence in our assessment.

The One-Class SVM has been introduced in [1] for that purpose and
implemented in the :ref:`svm` module in the
The One-Class SVM has been introduced by Schölkopf et al. for that purpose
and implemented in the :ref:`svm` module in the
:class:`svm.OneClassSVM` object. It requires the choice of a
kernel and a scalar parameter to define a frontier. The RBF kernel is
usually chosen although there exists no exact formula or algorithm to
@@ -63,6 +63,12 @@ implementation. The :math:`\nu` parameter, also known as the margin of
the One-Class SVM, corresponds to the probability of finding a new,
but regular, observation outside the frontier.

.. topic:: References:

* `Estimating the support of a high-dimensional distribution
<http://dl.acm.org/citation.cfm?id=1119749>`_ Schölkopf,
Bernhard, et al. Neural computation 13.7 (2001): 1443-1471.

.. topic:: Examples:

* See :ref:`example_svm_plot_oneclass.py` for visualizing the
@@ -73,7 +79,7 @@ but regular, observation outside the frontier.
:target: ../auto_examples/svm/plot_oneclasse.html
:align: center
:scale: 75%


Outlier Detection
=================
18 changes: 16 additions & 2 deletions doc/modules/preprocessing.rst
@@ -397,7 +397,7 @@ follows::
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
LabelEncoder(new_label_class=-1, new_labels='raise')
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
@@ -410,14 +410,28 @@ hashable and comparable) to numerical labels::

>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
LabelEncoder(new_label_class=-1, new_labels='raise')
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1])
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']

By default, ``LabelEncoder`` will throw a ``ValueError`` in the event that
labels are passed in ``transform`` that were not seen in ``fit``. This
behavior can be handled with the ``new_labels`` parameter, which supports
``"raise"``, ``"nan"``, ``"update"``, and ``"label"`` strategies for
handling new labels. For example, the ``"label"`` strategy will assign
the unseen values a label of ``-1``.

>>> le = preprocessing.LabelEncoder(new_labels="label")
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder(new_label_class=-1, new_labels='label')
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris", "rome"])
array([ 2, 2, 1, -1])

Imputation of missing values
============================
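
Reviewer note: the documentation hunk above describes a strategy that assigns unseen values a fixed label of ``-1``. The snippet below is a minimal standalone sketch of that idea using only NumPy. It is not the PR's implementation; the function name `encode_with_sentinel` and its `sentinel` parameter are hypothetical. Since this pull request is marked Closed, the sketch deliberately avoids assuming that `LabelEncoder` accepts a `new_labels` argument.

```python
import numpy as np


def encode_with_sentinel(classes, y, sentinel=-1):
    """Encode y against sorted `classes`; unseen values get `sentinel`.

    Hypothetical helper, written only to illustrate the idea.
    """
    classes = np.asarray(classes)
    y = np.asarray(y)
    # Index at which each value would be inserted to keep `classes` sorted.
    idx = np.searchsorted(classes, y)
    # searchsorted may return len(classes); clip so the lookup below is safe.
    idx_safe = np.clip(idx, 0, len(classes) - 1)
    # A value was seen at fit time only if the class at that index equals it.
    seen = classes[idx_safe] == y
    return np.where(seen, idx_safe, sentinel)


classes = np.unique(["paris", "paris", "tokyo", "amsterdam"])
codes = encode_with_sentinel(classes, ["tokyo", "tokyo", "paris", "rome"])
print(codes)
```

Note that the prose in the hunk lists ``"nan"`` and ``"label"`` strategies, while the `label.py` hunks further down accept only ``"raise"``, ``"update"``, or an integer. The sketch follows the integer form.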
10 changes: 8 additions & 2 deletions doc/sphinxext/gen_rst.py
@@ -468,7 +468,11 @@ def generate_example_rst(app):
examples.
"""
root_dir = os.path.join(app.builder.srcdir, 'auto_examples')
example_dir = os.path.abspath(app.builder.srcdir + '/../' + 'examples')
example_dir = os.path.abspath(os.path.join(app.builder.srcdir, '..',
'examples'))
generated_dir = os.path.abspath(os.path.join(app.builder.srcdir,
'modules', 'generated'))

try:
plot_gallery = eval(app.builder.config.plot_gallery)
except TypeError:
@@ -477,10 +481,12 @@ def generate_example_rst(app):
os.makedirs(example_dir)
if not os.path.exists(root_dir):
os.makedirs(root_dir)
if not os.path.exists(generated_dir):
os.makedirs(generated_dir)

# we create an index.rst with all examples
fhindex = open(os.path.join(root_dir, 'index.rst'), 'w')
#Note: The sidebar button has been removed from the examples page for now
# Note: The sidebar button has been removed from the examples page for now
# due to how it messes up the layout. Will be fixed at a later point
fhindex.write("""\
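
Reviewer note: the `gen_rst.py` hunks above build paths with `os.path.join` and create each output directory if it is missing. A minimal sketch of the same pattern follows, assuming Python 3.2 or later, where `os.makedirs` accepts `exist_ok=True`. The helper name `ensure_dirs` and the temporary `srcdir` are hypothetical stand-ins; the diff itself uses the `os.path.exists` check shown above.

```python
import os
import tempfile


def ensure_dirs(*paths):
    """Create each directory (with parents) if missing; return absolute paths."""
    created = []
    for path in paths:
        abs_path = os.path.abspath(path)
        # exist_ok=True makes a repeated call a no-op instead of an error.
        os.makedirs(abs_path, exist_ok=True)
        created.append(abs_path)
    return created


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for app.builder.srcdir.
    srcdir = os.path.join(tmp, "doc")
    example_dir, generated_dir = ensure_dirs(
        os.path.join(srcdir, "..", "examples"),
        os.path.join(srcdir, "modules", "generated"),
    )
    dirs_exist = os.path.isdir(example_dir) and os.path.isdir(generated_dir)
    # Calling again on existing directories must not raise.
    ensure_dirs(example_dir, generated_dir)
```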

4 changes: 2 additions & 2 deletions sklearn/feature_selection/tests/test_feature_select.py
@@ -48,8 +48,8 @@ def test_f_oneway_ints():

# test that is gives the same result as with float
f, p = f_oneway(X.astype(np.float), y)
assert_array_almost_equal(f, fint, decimal=5)
assert_array_almost_equal(p, pint, decimal=5)
assert_array_almost_equal(f, fint, decimal=4)
assert_array_almost_equal(p, pint, decimal=4)


def test_f_classif():
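
Reviewer note: the hunk above relaxes `decimal=5` to `decimal=4`. As documented by NumPy, `assert_array_almost_equal` checks `abs(desired - actual) < 1.5 * 10**(-decimal)`, so the change widens the allowed absolute difference from 1.5e-5 to 1.5e-4. The sketch below illustrates this with made-up values; the helper `passes` is hypothetical.

```python
import numpy as np
from numpy.testing import assert_array_almost_equal


def passes(actual, desired, decimal):
    """Return True if the arrays agree to `decimal` places, else False."""
    try:
        assert_array_almost_equal(actual, desired, decimal=decimal)
        return True
    except AssertionError:
        return False


a = np.array([1.0])
b = np.array([1.0001])  # differs from `a` by about 1e-4

loose = passes(a, b, decimal=4)   # 1e-4 is below the 1.5e-4 threshold
strict = passes(a, b, decimal=5)  # 1e-4 is above the 1.5e-5 threshold
print(loose, strict)
```

The context line in the hunk calls `X.astype(np.float)`. `np.float` was an alias for the builtin `float` and was removed in NumPy 1.24, so the sketch avoids it.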
112 changes: 105 additions & 7 deletions sklearn/preprocessing/label.py
@@ -1,6 +1,6 @@
# Authors: Alexandre Gramfort <alexandre.gramfort@inria.fr>
# Mathieu Blondel <mathieu@mblondel.org>
# Olivier Grisel <olivier.grisel@ensta.org>
# Mathieu Blondel <mathieu@mblondel.org>
# Olivier Grisel <olivier.grisel@ensta.org>
# Andreas Mueller <amueller@ais.uni-bonn.de>
# Joel Nothman <joel.nothman@gmail.com>
# Hamzeh Alsalhi <ha258@cornell.edu>
@@ -10,7 +10,9 @@
import itertools
import array
import warnings
import operator

import operator
import numpy as np
import scipy.sparse as sp

@@ -53,19 +55,37 @@ def _check_numpy_unicode_bug(labels):
class LabelEncoder(BaseEstimator, TransformerMixin):
"""Encode labels with value between 0 and n_classes-1.

Parameters
----------

new_labels : string, optional (default: "raise")
Determines how to handle new labels, i.e., data
not seen in the training domain.

- If ``"raise"``, then raise ValueError.
- If ``"update"``, then re-map the new labels to
classes ``[N, ..., N+m-1]``, where ``m`` is the number of new labels.
- If an integer value is passed, then use re-label with this value.
N.B. that default values are in [0, 1, ...], so caution should be
taken if a non-negative value is passed to not accidentally
intersect.

Attributes
----------
`classes_` : array of shape (n_class,)
Holds the label for each class.

`new_label_mapping_` : dictionary
Stores the mapping for classes not seen during original ``fit``.

Examples
--------
`LabelEncoder` can be used to normalize labels.

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
LabelEncoder(new_label_class=-1, new_labels='raise')
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6]) #doctest: +ELLIPSIS
@@ -78,7 +98,7 @@ class LabelEncoder(BaseEstimator, TransformerMixin):

>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
LabelEncoder(new_label_class=-1, new_labels='raise')
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"]) #doctest: +ELLIPSIS
@@ -88,10 +108,34 @@ class LabelEncoder(BaseEstimator, TransformerMixin):

"""

def __init__(self, new_labels="raise"):
"""Constructor"""
self.new_labels = new_labels
self.new_label_mapping_ = {}

def _check_fitted(self):
if not hasattr(self, "classes_"):
raise ValueError("LabelEncoder was not fitted yet.")

def get_classes(self):
"""Get classes that have been observed by the encoder. Note that this
method returns classes seen both at original ``fit`` time (i.e.,
``self.classes_``) and classes seen after ``fit`` (i.e.,
``self.new_label_mapping_.keys()``) for applicable values of
``new_labels``.

Returns
-------
classes : array-like of shape [n_classes]
"""
# If we've seen updates, include them in the order they were added.
if len(self.new_label_mapping_) > 0:
sorted_new, _ = zip(*sorted(self.new_label_mapping_.iteritems(),
key=operator.itemgetter(1)))
return np.append(self.classes_, sorted_new)
else:
return self.classes_

def fit(self, y):
"""Fit label encoder

@@ -104,6 +148,14 @@ def fit(self, y):
-------
self : returns an instance of self.
"""
# Check new_labels parameter
if self.new_labels not in ["update", "raise"] and \
type(self.new_labels) not in [int]:
# Raise on invalid argument.
raise ValueError("Value of argument `new_labels`={0} "
"is unknown and not integer."
.format(self.new_labels))

y = column_or_1d(y, warn=True)
_check_numpy_unicode_bug(y)
self.classes_ = np.unique(y)
@@ -121,6 +173,14 @@ def fit_transform(self, y):
-------
y : array-like of shape [n_samples]
"""
# Check new_labels parameter
if self.new_labels not in ["update", "raise"] and \
type(self.new_labels) not in [int]:
# Raise on invalid argument.
raise ValueError("Value of argument `new_labels`={0} "
"is unknown and not integer."
.format(self.new_labels))

y = column_or_1d(y, warn=True)
_check_numpy_unicode_bug(y)
self.classes_, y = np.unique(y, return_inverse=True)
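
Reviewer note: the same `new_labels` check appears in both `fit` and `fit_transform` above. One possible way to share it is a single helper, sketched below. The name `check_new_labels` is hypothetical and not part of the PR. The sketch uses `isinstance` with `numbers.Integral` rather than `type(...) in [int]`, which is an assumption about the intended behaviour: it also accepts NumPy integer scalars, and it explicitly rejects `bool`.

```python
import numbers

import numpy as np


def check_new_labels(new_labels):
    """Validate a `new_labels` setting; return it in normalized form.

    Accepts the strings "raise" and "update", or any integer (including
    NumPy integer scalars). Anything else raises ValueError.
    """
    if isinstance(new_labels, str):
        if new_labels in ("raise", "update"):
            return new_labels
    elif (isinstance(new_labels, numbers.Integral)
          and not isinstance(new_labels, bool)):
        # Normalize NumPy integer scalars to a plain Python int.
        return int(new_labels)
    raise ValueError("Value of argument `new_labels`={0!r} "
                     "is unknown and not integer.".format(new_labels))


print(check_new_labels("update"))
print(check_new_labels(np.int64(-1)))
```

Calling the helper at the top of `fit` keeps the early failure that the review comment below asks for.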
@@ -142,9 +202,47 @@ def transform(self, y):

classes = np.unique(y)
_check_numpy_unicode_bug(classes)
if len(np.intersect1d(classes, self.classes_)) < len(classes):
diff = np.setdiff1d(classes, self.classes_)
raise ValueError("y contains new labels: %s" % str(diff))
if len(np.intersect1d(classes, self.get_classes())) < len(classes):
# Get the new classes
diff_fit = np.setdiff1d(classes, self.classes_)
diff_new = np.setdiff1d(classes, self.get_classes())

# Create copy of array and return
y = np.array(y)

# If we are mapping new labels, get "new" ID and change in copy.
if self.new_labels == "update":
# Update the new label mapping
next_label = len(self.get_classes())
self.new_label_mapping_.update(dict(zip(diff_new,
range(next_label,
next_label +
len(diff_new)))))

# Find entries with new labels
missing_mask = np.in1d(y, diff_fit)

# Populate return array properly by mask and return
out = np.searchsorted(self.classes_, y)
out[missing_mask] = [self.new_label_mapping_[value]
for value in y[missing_mask]]
return out
elif type(self.new_labels) in [int]:
# Find entries with new labels
missing_mask = np.in1d(y, diff_fit)

# Populate return array properly by mask and return
out = np.searchsorted(self.classes_, y)
out[missing_mask] = self.new_labels
return out
elif self.new_labels == "raise":
# Return ValueError, original behavior.
raise ValueError("y contains new labels: %s" % str(diff_fit))
else:
# Raise on invalid argument.
Member

Failing on fit is probably kinder to the user.

Contributor Author

Addressed.

raise ValueError("Value of argument `new_labels`={0} "
"is unknown.".format(self.new_labels))

return np.searchsorted(self.classes_, y)

def inverse_transform(self, y):
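
Reviewer note: the `transform` hunk above, under `new_labels="update"`, assigns each unseen label the next free integer code and remembers that assignment in `new_label_mapping_` so that repeated calls stay consistent. The class below is a minimal standalone sketch of that behaviour, not the PR's code. `UpdatingEncoder` is a hypothetical name, and the sketch uses `np.isin` and `dict` ordering by value in place of the `np.in1d` and Python 2 `iteritems` calls in the diff.

```python
import numpy as np


class UpdatingEncoder:
    """Hypothetical stand-in illustrating the 'update' strategy."""

    def fit(self, y):
        self.classes_ = np.unique(y)
        self.new_label_mapping_ = {}
        return self

    def get_classes(self):
        # Fitted classes first, then later additions ordered by their code.
        extra = sorted(self.new_label_mapping_,
                       key=self.new_label_mapping_.get)
        return self.classes_.tolist() + extra

    def transform(self, y):
        y = np.asarray(y)
        known = np.isin(y, self.classes_)
        # Give every label not seen so far the next free integer code.
        for value in np.unique(y[~known]).tolist():
            if value not in self.new_label_mapping_:
                self.new_label_mapping_[value] = (
                    len(self.classes_) + len(self.new_label_mapping_))
        # Codes for known labels; entries for unknown labels are overwritten.
        out = np.searchsorted(self.classes_, y)
        if (~known).any():
            out[~known] = [self.new_label_mapping_[v]
                           for v in y[~known].tolist()]
        return out


enc = UpdatingEncoder().fit([1, 2, 2, 6])
first = enc.transform([1, 7, 6, 3])   # 3 and 7 are new
second = enc.transform([3, 9])        # 3 keeps its code, 9 is new
print(first, second, enc.get_classes())
```

One consequence of this design is that the codes of late-arriving labels depend on the order of the `transform` calls, so two encoders fitted on the same data can disagree afterwards.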