[WIP] Basic version of MICE Imputation by sergeyf · Pull Request #8465 · scikit-learn/scikit-learn · GitHub

Closed
wants to merge 447 commits
Commits (447)
e48a48d
[MRG+1] DOC adding info about circleci build artifacts (#7855)
dalmia Nov 29, 2016
86cee5c
BUG: for several datasets, ``download_if_missing`` keyword was ignore…
rgommers Nov 29, 2016
2bee348
[MRG+1] DOC adding a warning on the relation between C and alpha (#7860)
dalmia Nov 29, 2016
4105ea7
Fix tests on numpy master (#7946)
lesteve Nov 29, 2016
187450a
[MRG+2] Fix K Means init center bug (#7872)
jkarno Nov 30, 2016
a8effcc
[MRG+1] Add new regression metric - Mean Squared Log Error (#7655)
Nov 30, 2016
44e7488
[MRG + 1] DOC refer to code elements in nested CV example description…
jnothman Nov 30, 2016
599b186
DOC: add bug fix for ``download_if_missing`` behavior to whatsnew. (#…
rgommers Dec 1, 2016
4b4255e
[MRG] Mention keras can run on top of TensorFlow (#7957)
nixtish Dec 1, 2016
7b55d0a
[MRG+2] Adding return_std options for models in linear_model/bayes.py…
sergeyf Dec 1, 2016
371b024
Added 1/2 factor to SSE alpha term (#7962)
FERRIA Dec 2, 2016
815aac5
Harmonized README, added link. (#7965)
habi Dec 2, 2016
596e0d0
added random_state=0 to many instances (#7968)
chenhe95 Dec 2, 2016
ae6b284
[MRG+1] Fix estimators to work if sample_weight parameter is pandas S…
kathyxchen Dec 3, 2016
76c65ee
[MRG+1] Fix confusion matrix example code (#7971)
rashchedrin Dec 3, 2016
0f3af24
Fix version comparison for the numpy 1.12 beta (#7902)
willduan Dec 5, 2016
bf8231f
MAINT remove superflous repo unshallowing in flake8_diff.sh
lesteve Dec 5, 2016
a7f25aa
Adding Columbia logo to sponsors listing (#7964)
amueller Dec 5, 2016
8148aca
DOC Fix typo in plot_unveil_tree_structure (#7988)
bradysalz Dec 6, 2016
7345a6f
[MRG+1] Added override of fit_transform to LabelBinarizer (#7670)
kgilliam125 Dec 6, 2016
7817683
docs(MLPClassifier): add multi-label support in fit docstring and rem…
alexandercbooth Dec 6, 2016
3a0df7f
[MRG + 1] ENH Do not materialise CV splits when unnecessary (#7941)
raghavrv Dec 7, 2016
0ac4bb4
CI report which doc files were likely affected (#7938)
jnothman Dec 7, 2016
49ecb97
[MRG + 1] FIX bug where passing numpy array for weights raises error …
vincentpham1991 Dec 8, 2016
9efc0fd
[MRG+1] BUG: adding check for ipython notebook (#7924)
dalmia Dec 8, 2016
940224a
fixed error in documentation (#8014)
vincentpham1991 Dec 8, 2016
c76d2e4
[MRG + 1] DOC comment on measures in classification_report (#7897)
jnothman Dec 8, 2016
30b9cfa
FIX raise AttributeError in SVC.coef_ for proper duck-typing (#8009)
amueller Dec 9, 2016
cbb5ae0
Revert "CI report which doc files were likely affected (#7938)"
amueller Dec 9, 2016
80a8f13
MAINT use sphinx 1.4 to build the doc
lesteve Dec 9, 2016
eb25bf3
[MRG+1] Housekeeping Deprecations for v0.19 (#7927)
amueller Dec 9, 2016
7e3edf9
CI full doc build only for examples; flag to force quick build (#7950)
jnothman Dec 10, 2016
f123812
CI report which doc files were likely affected (#8032)
jnothman Dec 10, 2016
1865071
DOC fix copy-paste error (#8037)
ohld Dec 11, 2016
172853d
TST Ensure that attributes ending _ are not set in __init__ (#7464)
lesteve Dec 12, 2016
27fa08e
[MRG + 1] Fix failure on numpy master (#8011)
aashil Dec 12, 2016
6ff493e
[MRG+1] Add multiplicative-update solver in NMF, with all beta-diverg…
TomDLT Dec 12, 2016
c2f2bbf
FIX .format arguments were in the wrong order
lesteve Dec 13, 2016
4e124de
left-over deprecation of 1d X (#8045)
amueller Dec 13, 2016
06396ef
[MRG + 1] CI some improvements to the flake8 CI (#8036)
jnothman Dec 13, 2016
b825e84
[MRG] Set min_impurity_split in gradient boosting models (#8007)
sebp Dec 13, 2016
8d7cd88
Use 1.0 not 1 in error message regarding float value
jnothman Dec 12, 2016
5b9010a
DOC add CI details and commands to contributor guide (#8024)
alexandercbooth Dec 15, 2016
f6e93d5
DOC Update LOF.fit_predict() (#8059)
Don86 Dec 15, 2016
0a8c90e
TST fix test case which should ensure empty row (#8056)
jnothman Dec 15, 2016
e180ce6
[MRG+2] ENH add n_jobs to make_union through kwargs (#8031)
alexandercbooth Dec 15, 2016
edd17d2
DOC adding note regarding bessel correction in PCA (#7843)
dalmia Dec 15, 2016
2474f55
Fix plot_svm_margin example plots (#8051)
Dec 16, 2016
208d1fd
DOC fix broken link in carousel
lesteve Dec 19, 2016
2ee48be
[MRG + 1] Reformat the version info and cite us labels in the user-gu…
aashil Dec 19, 2016
7ef8687
[MRG + 1] Fix reference in fetch_kddcup99 (#8071)
b-carter Dec 19, 2016
f6d95d4
[MRG + 1] Issue#8062: JoblibException thrown when passing "fit_params…
xor Dec 19, 2016
0d94be1
[MRG + 1] Fix perplexity method by adding _unnormalized_transform met…
garyForeman Dec 20, 2016
1686565
[MRG+1] allow callable kernels in cross-validation (#8005)
amueller Dec 20, 2016
4fcfe90
DOC Fix doc for CountVectorizer class. (#8085)
aashil Dec 20, 2016
537d022
DOC clarify logisticregression n_jobs param (#8083)
rasbt Dec 20, 2016
e7e5958
CI fix bug in getting changed docs when no sklearn/ files modified
jnothman Dec 20, 2016
4aca8b1
DOC Document _changed.html in contrib docs
jnothman Dec 20, 2016
8b97271
DOC Restructure the version info in the docs to fit in two lines. (#8…
aashil Dec 20, 2016
8c18348
FIX check_array's accept_sparse param now takes true/false/str/list, …
jkarno Dec 20, 2016
8ad37df
DOC Fix output shape in doc for OrthogonalMatchingPursuit (#8091)
weijianzz Dec 20, 2016
1c7be1c
[MRG + 2] Allow f_regression to accept a sparse matrix with centering…
acadiansith Dec 20, 2016
621c308
DOC Improve benchmark on NMF (#5779)
TomDLT Dec 20, 2016
6fc51cc
CI limit diff to commit range in flake8_diff.sh (#8097)
jnothman Dec 21, 2016
28248a6
DOC: Fix the documentation of scoring LogisticCV (#8099)
GaelVaroquaux Dec 22, 2016
b6c2f80
[MRG+1] Corrected sign error in QuantileLossFunction (#6429)
AlexisMignon Dec 22, 2016
ec91436
[MRG+1] Return list instead of 3d array for MultiOutputClassifier.pre…
pjbull Dec 22, 2016
2d72037
[MRG + 1] Add changelog entry for MSLE implemented in #7655. (#8104)
Dec 23, 2016
f93a824
DOC fix link in what's new
jnothman Dec 23, 2016
9b2c315
DOC Note how ariddell/lda differs from sckit-learn's LDA (#5553)
ariddell Dec 27, 2016
e75dce9
COSMIT PEP257
jnothman Dec 27, 2016
223c8c6
[MRG + 1] MAINT Move heapify_up/heapify_down into PriorityHeap as cla…
nelson-liu Dec 27, 2016
f2e5c1d
DOC Fix help link on about page (#8119)
kluangkote Dec 27, 2016
050fd83
[MRG+2] FIX IsolationForest(max_features=0.8).predict(X) fails input …
IshankGulati Dec 27, 2016
a9e03a6
DOC Fix indentation errors and username links (#8121)
kluangkote Dec 27, 2016
3edad83
[MRG] MAINT Python 3.6 fixes (#8123)
ogrisel Dec 27, 2016
4a90032
[MRG+3] Fused types for MultiTaskElasticNet (#8061)
tguillemot Dec 28, 2016
92cfc05
DOC add sklearn-crfsuite to related projects (#7878)
kmike Dec 28, 2016
1efb1e3
[MRG+1] Catch cases for different class size in MLPClassifier with wa…
vincentpham1991 Dec 29, 2016
6b267c0
FIX Split data using _safe_split in _permutaion_test_score (#5697)
Dec 29, 2016
3c37ecb
DOC Fix typo in FAQ (#8132)
kluangkote Dec 29, 2016
bb21e03
[MRG] update copyright years for 2017 (#8138)
nelson-liu Jan 1, 2017
3d6c012
[MRG+1] Fix "cite us" link in sidebar (#8142)
naoyak Jan 2, 2017
406a629
[MRG+1] Add DBSCAN support for additional metric params (#8139)
naoyak Jan 2, 2017
2fa1b0e
[MRG+1] fowlkes_mallows_score: more unit tests (Fixes #8101) (#8140)
devanshdalal Jan 3, 2017
92b9892
DOC: updating GridSearchCV's n_jobs parameter (#8106)
accraze Jan 4, 2017
2edc335
[MRG+1] Deprecate ridge_alpha param on SparsePCA.transform() (#8137)
naoyak Jan 4, 2017
380d92d
FIX sphinx gallery rendering of plot_digits_pipe example
ogrisel Jan 4, 2017
167a2b1
[MRG+1] DOC: complete list of online learners (#8152)
GaelVaroquaux Jan 4, 2017
61560fd
[MRG+2] Avoid failure in first iteration of RANSAC regression (#7914)
mthorrell Jan 5, 2017
62fd734
[MRG] FIX Avoid default mutable argument in constructor of Agglomerat…
glemaitre Jan 5, 2017
544abb2
[MRG + 1] add partial_fit to multioutput module (#8054)
yupbank Jan 5, 2017
d31585a
[MRG + 1] Add fowlkess-mallows and other supervised cluster metrics t…
raghavrv Jan 6, 2017
28fbfc8
Fix Ridge floating point instability (#8154)
lesteve Jan 7, 2017
eedc223
DOC Fix link (#8171)
mrbeann Jan 7, 2017
47b03e3
[MRG + 1] Fix the cross_val_predict function for method='predict_prob…
dalmia Jan 7, 2017
0f6fd76
fixing typo in cs_mse_path_ deprecation (#8176)
perimosocordiae Jan 8, 2017
9c562a9
Clarify error message for min_samples_split. (#8167)
mikebenfield Jan 8, 2017
6fc3983
Upgrade html documentation to jQuery v3.1.1 (#8145)
naoyak Jan 9, 2017
a75a0d1
removed stray space in '__main__ ' (#8203)
BasilBeirouti Jan 15, 2017
904fcb2
DOC additional fixes to 20 newsgroups to prevent TypeError (#8204)
BasilBeirouti Jan 15, 2017
76d1494
DOC add missing parentheses in TfidfTrasnformer docstring
jnothman Jan 16, 2017
5aadcb4
TRAVIS fix flake8_diff.sh check_files (#8208)
lesteve Jan 16, 2017
c43f5a7
[MRG+1] Fixes #8198 - error in datasets.make_moons (#8199)
levy5674 Jan 17, 2017
1319f9b
[MRG + 2] [MAINT] Update to Sphinx-Gallery 0.1.7 (#7986)
Titan-C Jan 17, 2017
c14c717
[MRG+1] Add prominent mention of Laplacian Eigenmaps (#8155)
samsontmr Jan 18, 2017
0414302
MNT/BLD Use GitHub's merge refs to test PRs on CircleCI (#8211)
jakirkham Jan 18, 2017
6868707
FIX Ensure coef_ is an ndarray when fitting LassoLars (#8160)
perimosocordiae Jan 18, 2017
4506bcd
[MRG+3] FIX Memory leak in MAE; Use safe_realloc; Acquire GIL only wh…
raghavrv Jan 18, 2017
b982dde
Call sorted on lfw folder path contents (#7648)
campustrampus Jan 19, 2017
4b1287e
FIX Issue #8173 - pass n_neighbors in MI computation (#8181)
glemaitre Jan 19, 2017
b831a49
TST/FIX Add check for estimator: parameters not modified by `fit` (#7…
kiote Jan 20, 2017
4642af2
[MRG] #8218: in FAQ, link deep learning question to GPU question (#8220)
vincentpham1991 Jan 20, 2017
d3b73e0
CI remove obsolete comment
jnothman Jan 22, 2017
568c998
ENH warn in classification_report when target_names doesn't equal lab…
Jan 24, 2017
921abba
[MRG] Fix aesthetic example roc crossval (#8232)
glemaitre Jan 25, 2017
6bfe0a6
Test sphinx extensions doctests only on Circle. (#8228)
lesteve Jan 25, 2017
738ddcb
TST Change rstrip() to truncation in test function (#8237)
pganssle Jan 26, 2017
280591f
DOC Fixing a bug where entropy included labeled items (#8150)
mdezube Jan 28, 2017
778cdbb
Incorrect number of samples in One Hot Encoder example (#8255)
davidrobles Feb 1, 2017
1a253f1
[MRG] make the ransac example slightly more terse, improve range of p…
amueller Feb 1, 2017
31c4d18
Cosmetic changes to rigde path example (#8260)
rishikksh20 Feb 1, 2017
1d1b360
DOC structure for related projects (#8257)
jnothman Feb 1, 2017
1d71a59
docs: related_projects.rst: fixes xgboost link (#8270)
manu-chroma Feb 2, 2017
c828ef1
MAINT add Python 3.6 classifier in setup.py
lesteve Feb 2, 2017
57275ff
TST: added test that sample_weight can be a list (#8261)
dalmia Feb 3, 2017
5b9b101
[MRG] Remove DeprecationWarnings in examples due to using floats inst…
dalmia Feb 3, 2017
bc15dc6
[MRG] loss function plot y-label slightly confusing (#8283)
Akshay0724 Feb 6, 2017
1913443
DOC more explicit guidelines for WIP (#8299)
jnothman Feb 6, 2017
c7fe965
[MRG+1] Fix bench_rcv1_logreg_convergence.py by adding get_max_square…
Feb 7, 2017
0299764
[MRG+1] Refactor birch-documentation (#8298)
MechCoder Feb 7, 2017
69a4a59
[MRG] Diabetes example with GridSearchCV (#8268)
rishikksh20 Feb 7, 2017
d3f7b30
DOC add missing release date
jnothman Feb 7, 2017
b96c0d8
[MRG+1] Enable codecov for coverage report (#8311)
rishikksh20 Feb 8, 2017
049f4e3
Added Zopa testimonial (#8309)
vlasisva Feb 8, 2017
0e70e6a
DOC: Remove superfluous assignment in tutorial. issue #8285 (#8314)
seanpwilliams Feb 8, 2017
a85943c
[MRG+1] Remove the MLComp text categorization example (#8264)
rth Feb 8, 2017
5ecf187
FIX Add a missing space to an exception message in resample function …
chkoar Feb 9, 2017
4be5dbc
[MRG+1] Accept keyword parameters to hyperparameter search fit method…
Feb 9, 2017
42d58e4
[MRG+1] Add classes_ parameter to hyperparameter CV classes (#8295)
Feb 10, 2017
aa44d7c
Add sample_weight parameter to cohen_kappa_score (#8335)
vpoughon Feb 10, 2017
4fd2459
Remove redefinition of k_fold in model_selection.rst (#8330)
asishm Feb 11, 2017
133b305
spelling mistake (#8341)
anshbansal Feb 12, 2017
0ad838e
DOC Updated documentation for scoring parameter (#8346)
vivekk0903 Feb 13, 2017
a526c3c
[MRG+2] ENH: used SelectorMixin in BaseRandomizedLinearModel (#8263)
dalmia Feb 13, 2017
68099a2
[MRG+3] ENH Caching Pipeline by memoizing transformer (#7990)
glemaitre Feb 13, 2017
84c8c14
DOC: added explanation for LARS (#8310)
dalmia Feb 13, 2017
6266bba
DOC add example regarding feature scaling (#7912)
tylerlanigan Feb 13, 2017
215edc7
[MRG+1] Fix description of l1_ratio for MultiTaskElasticNet (#8343)
tguillemot Feb 13, 2017
aba9cdf
Fix tests on numpy master (#8355)
lesteve Feb 15, 2017
ae1965c
Change "observations" to "features" in description of LassoLarsCV (#8…
Feb 15, 2017
7be0c9e
TRAVIS revert flake8 version to 2.5.1
lesteve Feb 16, 2017
4e70bfa
DOC add missing bugfix to what's new
jnothman Feb 16, 2017
05ef8ab
FIX/MAINT: update my mail etc (#8375)
dengemann Feb 16, 2017
3a4d1d6
[MRG+1] Fix ug in BaseSearchCV.inverse_transform (#8348)
Akshay0724 Feb 17, 2017
3116a79
[MRG+1] add docs that C can receive array in RandomizedLogisticRegre…
pianomania Feb 18, 2017
fc39a57
fix typo (#8390)
Neurrone Feb 18, 2017
03336ce
DOC updated IRC url to working one (#8383)
i-am-xhy Feb 19, 2017
571f438
Explain the meaning of X_m in modules/tree doc. (#8398)
aashil Feb 19, 2017
11fdaf8
[MRG] Add the meaning of MRG and MRG+1 in the PR in docs. (#8406)
aashil Feb 20, 2017
9e8ff47
[MRG] Make tests runnable with pytest without error (#8246)
lesteve Feb 20, 2017
674284f
plot iso-f1 curves in plot_precision_recall (#8378)
SACHIN-13 Feb 20, 2017
bd2ea4c
Ignore py.test generated .cache folder
ogrisel Feb 20, 2017
e2103af
[MRG+1] FIX AdaBoost ZeroDivisionError in proba #7501 (#8371)
dokato Feb 20, 2017
645026a
[MRG+1] Fix pickling bug due to multiple inheritance & __getstate__ …
HolgerPeters Feb 20, 2017
4633d67
[MRG+1] Fix message formatting in exception (#8319)
MMeketon Feb 21, 2017
53609e4
DOC Modify plot_gpc_iris.py for matplotlib v2 (#8385)
rishikksh20 Feb 21, 2017
7a47f20
DOC svm kernel functions docs: rbf equation fixed (#8356) (#8420)
dokato Feb 21, 2017
b91ec72
[MRG+2] Fixed assumption fit attribute means object is estimator. (#8…
drkatnz Feb 21, 2017
93a5013
[MRG] FIX lasso/elasticnet example did not add noise to simulated dat…
NelleV Feb 22, 2017
15e8ec9
Travis add coverage to Python 3 build and oldest version build (#8435)
lesteve Feb 22, 2017
f10ac95
[MRG] Remove unnecessary backticks around parameter name in docstring…
tzano Feb 22, 2017
59bd153
[MRG+1] Refactoring plot_iris svm example. (#8279)
lemonlaug Feb 23, 2017
c22a73e
[MRG] Fix Parameters in tutorials (#8345)
anshbansal Feb 23, 2017
b7a5752
[MRG+1] Fixes incorrect output when input is precomputed sparse matri…
Akshay0724 Feb 23, 2017
341fc34
DOC fix MultiTaskElasticNet doc (#8442)
tzano Feb 23, 2017
79e645d
Travis: tweak test_script.sh (#8444)
lesteve Feb 23, 2017
cfe35c4
[MRG+1] Add note about the size of default random forest model #6276 …
Morikko Feb 23, 2017
36b5354
[MRG] Add MAE formula in the regression criteria docs. (#8402)
aashil Feb 24, 2017
dc0f201
DOC describe scikit-learn-contrib in related projects and contributin…
jnothman Feb 24, 2017
223e9a6
DOC Fix default value in RandomizedLasso (#8455)
Feb 26, 2017
fad531d
[MRG+1] FIX/DOC Improve documentation regarding non-determinitic tree…
glemaitre Feb 26, 2017
41ee20a
Correct default value of reg_covar in gaussian_mixture. (#8462)
tguillemot Feb 27, 2017
e987092
initial commit
sergeyf Feb 27, 2017
980961d
init bug fix
sergeyf Feb 27, 2017
513b4fa
fixing pep8 errors
sergeyf Feb 27, 2017
7ad467d
more pep8 fixes
sergeyf Feb 27, 2017
4ee5785
fixing build failures
sergeyf Feb 27, 2017
f5611b4
fixing error for _statistics in Imputer
sergeyf Feb 27, 2017
283b569
fixing failed test by skipping MICEImputer
sergeyf Feb 28, 2017
b6a4d9f
fixing circular import issue. Questionable style?
sergeyf Feb 28, 2017
ca85386
one flake left
sergeyf Feb 28, 2017
0a89f88
initial commit
sergeyf Feb 27, 2017
5bb3eab
init bug fix
sergeyf Feb 27, 2017
e70241e
fixing pep8 errors
sergeyf Feb 27, 2017
1f3e2fa
fixing build failures
sergeyf Feb 27, 2017
713c9f3
addressing a few comments, and removing updates to plot ols
sergeyf Feb 28, 2017
9ac7f01
initial commit
sergeyf Feb 27, 2017
99414e7
init bug fix
sergeyf Feb 27, 2017
83d8e26
fixing pep8 errors
sergeyf Feb 27, 2017
eb98371
fixing build failures
sergeyf Feb 27, 2017
869fb6a
fixing error for _statistics in Imputer
sergeyf Feb 27, 2017
3982e57
fixing failed test by skipping MICEImputer
sergeyf Feb 28, 2017
ff729ac
fixing circular import issue. Questionable style?
sergeyf Feb 28, 2017
ecaea48
one flake left
sergeyf Feb 28, 2017
948c2cb
init bug fix
sergeyf Feb 27, 2017
9387fad
fixing pep8 errors
sergeyf Feb 27, 2017
e128c48
fixing error for _statistics in Imputer
sergeyf Feb 27, 2017
6db8702
fixing failed test by skipping MICEImputer
sergeyf Feb 28, 2017
71b862e
one flake left
sergeyf Feb 28, 2017
023f93c
addressing a few comments, and removing updates to plot ols
sergeyf Feb 28, 2017
300c3b3
typo
sergeyf Feb 28, 2017
5cf5681
mu
sergeyf Feb 27, 2017
8d32148
mu
sergeyf Feb 28, 2017
981fc56
mu
sergeyf Feb 27, 2017
cb019de
init bug fix
sergeyf Feb 27, 2017
b653c68
fixing pep8 errors
sergeyf Feb 27, 2017
c7f4341
fixing build failures
sergeyf Feb 27, 2017
a6c66aa
mu
sergeyf Feb 28, 2017
c1e5fad
Save predictions in diabetes_y_pred (#8241)
davidrobles Feb 27, 2017
b29d15e
initial commit
sergeyf Feb 27, 2017
b2096cc
init bug fix
sergeyf Feb 27, 2017
0428aca
fixing pep8 errors
sergeyf Feb 27, 2017
305fd95
fixing build failures
sergeyf Feb 27, 2017
5ba7e5d
mu
sergeyf Feb 27, 2017
b5d4595
mu
sergeyf Feb 28, 2017
8cf3498
mu
sergeyf Feb 28, 2017
efe22d9
mu
sergeyf Feb 27, 2017
ede98d8
init bug fix
sergeyf Feb 27, 2017
850e011
fixing pep8 errors
sergeyf Feb 27, 2017
cd6c344
fixing build failures
sergeyf Feb 27, 2017
8279731
initial commit
sergeyf Feb 27, 2017
64edea3
init bug fix
sergeyf Feb 27, 2017
68546e4
fixing pep8 errors
sergeyf Feb 27, 2017
ae11e3c
mu
sergeyf Feb 27, 2017
ab616f9
mu
sergeyf Feb 28, 2017
7f585fd
mu
sergeyf Feb 28, 2017
9f8c65f
mu
sergeyf Feb 28, 2017
2cefa90
mu
sergeyf Feb 27, 2017
afcef3c
mu
sergeyf Feb 28, 2017
fd16ac4
mu
sergeyf Feb 28, 2017
7d2256f
mu
sergeyf Feb 27, 2017
4c75257
init bug fix
sergeyf Feb 27, 2017
b224348
fixing pep8 errors
sergeyf Feb 27, 2017
d6cdd5b
fixing build failures
sergeyf Feb 27, 2017
9bafe7e
mu
sergeyf Feb 28, 2017
8471e0f
mu
sergeyf Feb 28, 2017
b4fbcf3
mu
sergeyf Feb 28, 2017
040e140
mu
sergeyf Feb 28, 2017
c8cb82a
Merge branch 'mice' of https://github.com/sergeyf/scikit-learn into mice
sergeyf Feb 28, 2017
2 changes: 1 addition & 1 deletion examples/linear_model/plot_ols.py
@@ -67,4 +67,4 @@
plt.xticks(())
plt.yticks(())

plt.show()
plt.show()
67 changes: 47 additions & 20 deletions examples/missing_values.py
@@ -1,50 +1,67 @@
"""
======================================================
====================================================
Imputing missing values before building an estimator
======================================================
====================================================

This example shows that imputing the missing values can give better results
than discarding the samples containing any missing value.
Imputing does not always improve the predictions, so please check via cross-validation.
This example shows that imputing the missing values can give
better results than discarding the samples containing any missing value.
Imputing does not always improve the predictions,
so please check via cross-validation.
Sometimes dropping rows or using marker values is more effective.

Missing values can be replaced by the mean, the median or the most frequent
value using the ``strategy`` hyper-parameter.
The median is a more robust estimator for data with high magnitude variables
which could dominate results (otherwise known as a 'long tail').

Script output::
Another option is the MICE imputer. This uses round-robin linear regression,
treating every variable as an output in turn. The simple version implemented
assumes Gaussian output variables. If your output variables are obviously
non-Gaussian, consider transforming them to improve performance.

Score with the entire dataset = 0.56
Score without the samples containing missing values = 0.48
Score after imputation of the missing values = 0.55
Script output:

MSE with the entire dataset = 3354.15
MSE without the samples containing missing values = 2968.98
MSE after mean imputation of the missing values = 3507.77
MSE after MICE imputation of the missing values = 3340.39
Member: Do you have a nice usecase where this value would be more demonstrative of the advantage of MICE?

Contributor Author (sergeyf): Here the MSE is better than "MSE with the entire dataset", and better than "MSE after mean imputation of the missing values". Were you hoping for a more dramatic improvement?


In this case, imputing helps the classifier match the original score.

Note that MICE will not always be better than, e.g., simple mean imputation.
To see an example of this, swap in ``boston`` for ``diabetes``.

In this case, imputing helps the classifier get close to the original score.

"""
import numpy as np

from sklearn.datasets import load_diabetes
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import MICEImputer
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)

dataset = load_boston()
dataset_name = 'diabetes' # 'boston' for another example
if dataset_name == 'boston':
dataset = load_boston()
elif dataset_name == 'diabetes':
dataset = load_diabetes()
X_full, y_full = dataset.data, dataset.target
n_samples = X_full.shape[0]
n_features = X_full.shape[1]

# Estimate the score on the entire dataset, with no missing values
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_full, y_full).mean()
print("Score with the entire dataset = %.2f" % score)
score = cross_val_score(estimator, X_full, y_full,
scoring='neg_mean_squared_error').mean() * -1
print("MSE with the entire dataset = %.2f" % score)

# Add missing values in 75% of the lines
missing_rate = 0.75
n_missing_samples = np.floor(n_samples * missing_rate)
n_missing_samples = int(np.floor(n_samples * missing_rate))
missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples,
dtype=np.bool),
np.ones(n_missing_samples,
@@ -56,10 +73,11 @@
X_filtered = X_full[~missing_samples, :]
y_filtered = y_full[~missing_samples]
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_filtered, y_filtered).mean()
print("Score without the samples containing missing values = %.2f" % score)
score = cross_val_score(estimator, X_filtered, y_filtered,
scoring='neg_mean_squared_error').mean() * -1
print("MSE without the samples containing missing values = %.2f" % score)

# Estimate the score after imputation of the missing values
# Estimate the score after imputation (mean strategy) of the missing values
X_missing = X_full.copy()
X_missing[np.where(missing_samples)[0], missing_features] = 0
y_missing = y_full.copy()
@@ -68,5 +86,14 @@
axis=0)),
("forest", RandomForestRegressor(random_state=0,
n_estimators=100))])
score = cross_val_score(estimator, X_missing, y_missing).mean()
print("Score after imputation of the missing values = %.2f" % score)
score = cross_val_score(estimator, X_missing, y_missing,
scoring='neg_mean_squared_error').mean() * -1
print("MSE after mean imputation of the missing values = %.2f" % score)

# Estimate the score after imputation (MICE strategy) of the missing values
estimator = Pipeline([("imputer", MICEImputer(missing_values=0)),
("forest", RandomForestRegressor(random_state=0,
n_estimators=100))])
score = cross_val_score(estimator, X_missing, y_missing,
scoring='neg_mean_squared_error').mean() * -1
print("MSE after MICE imputation of the missing values = %.2f" % score)
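The example above describes MICE as round-robin regression: each feature with missing entries is treated as a regression target in turn, predicted from the other features, and the cycle is repeated until the imputations settle. A minimal sketch of that idea follows — an illustrative re-implementation with a plain `LinearRegression`, not the PR's `MICEImputer` API (which additionally models predictive uncertainty):

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def mice_round_robin(X, n_iter=5):
    """Impute np.nan entries by cycling a linear regression over features.

    Illustrative only: a deterministic round-robin pass, without the
    posterior sampling a full MICE implementation would add.
    """
    X = np.asarray(X, dtype=float).copy()
    missing = np.isnan(X)

    # Initialize with column-mean imputation.
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[missing[:, j], j] = col_means[j]

    # Round-robin: refit one feature at a time against all the others.
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue
            other = np.delete(np.arange(X.shape[1]), j)
            model = LinearRegression().fit(X[~rows][:, other], X[~rows, j])
            X[rows, j] = model.predict(X[rows][:, other])
    return X
```

Observed entries are never overwritten; only the originally missing cells are re-estimated on each pass.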
15 changes: 13 additions & 2 deletions sklearn/dummy.py
@@ -449,7 +449,7 @@ def fit(self, X, y, sample_weight=None):
self.constant_ = np.reshape(self.constant_, (1, -1))
return self

def predict(self, X):
def predict(self, X, return_std=False):
"""
Perform classification on test vectors X.

@@ -459,18 +459,29 @@ def predict(self, X):
Input vectors, where n_samples is the number of samples
and n_features is the number of features.

return_std : boolean, optional
Whether to return the standard deviation of posterior prediction.

Returns
-------
y : array, shape = [n_samples] or [n_samples, n_outputs]
Predicted target values for X.

y_std : array, shape = [n_samples] or [n_samples, n_outputs]
Standard deviation of predictive distribution of query points.
"""
check_is_fitted(self, "constant_")
X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
n_samples = X.shape[0]

y = np.ones((n_samples, 1)) * self.constant_
y_std = np.zeros((n_samples, 1))

if self.n_outputs_ == 1 and not self.output_2d_:
y = np.ravel(y)
y_std = np.ravel(y_std)

return y
if return_std:
return y, y_std
else:
return y
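The `dummy.py` hunk above gives `DummyRegressor.predict` a `return_std` flag, matching the `(mean, std)` interface added to the Bayesian linear models in an earlier commit; MICE relies on that interface to draw imputations from each predictive distribution rather than plugging in the mean. A toy sketch of the consuming side (the constant predictor here is a hypothetical stand-in for any estimator supporting `return_std=True`):

```python
import numpy as np

rng = np.random.RandomState(0)


def predict_constant(X, constant=3.5, return_std=False):
    """Toy stand-in for DummyRegressor.predict: constant mean,
    zero predictive spread."""
    y = np.full(X.shape[0], constant)
    if return_std:
        return y, np.zeros(X.shape[0])
    return y


X_query = np.zeros((4, 2))
y_mean, y_std = predict_constant(X_query, return_std=True)
# A MICE-style imputer samples from N(mean, std), so predictive
# uncertainty propagates into the imputed values:
draws = rng.normal(y_mean, y_std)
```

With zero std the draws collapse onto the mean, which is why a constant predictor is the degenerate (but interface-complete) case.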
4 changes: 3 additions & 1 deletion sklearn/preprocessing/__init__.py
@@ -29,6 +29,7 @@
from .label import MultiLabelBinarizer

from .imputation import Imputer
from .mice import MICEImputer


__all__ = [
@@ -38,6 +39,7 @@
'KernelCenterer',
'LabelBinarizer',
'LabelEncoder',
'MICEImputer',
'MultiLabelBinarizer',
'MinMaxScaler',
'MaxAbsScaler',
@@ -54,4 +56,4 @@
'maxabs_scale',
'minmax_scale',
'label_binarize',
]
]
9 changes: 6 additions & 3 deletions sklearn/preprocessing/imputation.py
@@ -166,6 +166,10 @@ def fit(self, X, y=None):
self.missing_values,
self.axis)

invalid_mask = np.isnan(self.statistics_)
valid_mask = np.logical_not(invalid_mask)
self._valid_statistics_indexes = np.where(valid_mask)[0]

return self

def _sparse_fit(self, X, strategy, missing_values, axis):
@@ -339,14 +343,13 @@ def transform(self, X):
invalid_mask = np.isnan(statistics)
valid_mask = np.logical_not(invalid_mask)
valid_statistics = statistics[valid_mask]
valid_statistics_indexes = np.where(valid_mask)[0]
missing = np.arange(X.shape[not self.axis])[invalid_mask]

if self.axis == 0 and invalid_mask.any():
if self.verbose:
warnings.warn("Deleting features without "
"observed values: %s" % missing)
X = X[:, valid_statistics_indexes]
X = X[:, self._valid_statistics_indexes]
elif self.axis == 1 and invalid_mask.any():
raise ValueError("Some rows only contain "
"missing values: %s" % missing)
@@ -374,4 +377,4 @@ def transform(self, X):

X[coordinates] = values

return X
return X
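The `imputation.py` hunks move the valid-column bookkeeping into `fit`: columns whose statistic is NaN (i.e. no observed values at all) are recorded once as `_valid_statistics_indexes`, and `transform` drops them instead of recomputing the mask. A standalone sketch of that mean-imputation logic (assumed behavior for illustration, not the exact scikit-learn code):

```python
import warnings

import numpy as np


def fit_mean_imputer(X):
    """Compute per-column means and remember which columns have
    any observed value (all-NaN columns yield a NaN statistic)."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", RuntimeWarning)  # all-NaN columns
        statistics = np.nanmean(X, axis=0)
    valid_idx = np.where(~np.isnan(statistics))[0]
    return statistics, valid_idx


def transform_mean_imputer(X, statistics, valid_idx):
    """Drop features with no fitted statistic, then fill NaNs
    with the per-column means recorded at fit time."""
    X = X[:, valid_idx].copy()
    stats = statistics[valid_idx]
    mask = np.isnan(X)
    X[mask] = np.take(stats, np.where(mask)[1])
    return X
```

Caching the valid indexes at `fit` time is what lets `transform` stay consistent when a later batch happens to contain values in a column that was entirely missing during fitting.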