[MRG+1] Issue #7779 Fixed with new function (datasets.load_wine) added. by tylerlanigan · Pull Request #7912 · scikit-learn/scikit-learn · GitHub

Merged
26 commits merged into scikit-learn:master on Feb 13, 2017

Conversation

tylerlanigan (Contributor)

Reference Issue

Fixes #7779

What does this implement/fix? Explain your changes.

Provides a new example called plot_scaling_importance.py

In order to implement the example, the wine dataset from UCI is used.

A load_wine function is added to sklearn.datasets that loads the wine dataset. The dataset is stored in the datasets/data folder. base.py and __init__.py are updated to reflect this.

The following has been tested and works:
from sklearn.datasets import load_wine

A test is added in datasets/tests/test_base.py to test the new load_wine() function. The new function has been tested in the style of load_iris().
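
For reference, here is a minimal sketch of how the new loader is meant to be used (the shape and class names below are the ones discussed later in this thread; treat the snippet as illustrative rather than as the PR's test code):

from sklearn.datasets import load_wine

# Load the bundled wine dataset: 178 samples, 13 features, 3 classes.
data = load_wine()
print(data.data.shape)    # expected: (178, 13)
print(data.target_names)  # expected: ['class_0' 'class_1' 'class_2']

# return_X_y=True returns a (data, target) tuple instead of a Bunch.
X, y = load_wine(return_X_y=True)
print(X.shape, y.shape)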

Any other comments?

@jnothman (Member)

Test failure:

======================================================================
FAIL: sklearn.datasets.tests.test_base.test_load_wine
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/scikit-learn/scikit-learn/testvenv/local/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/datasets/tests/test_base.py", line 215, in test_load_wine
    assert_equal(res.data.shape, (178, 13))
AssertionError: Tuples differ: (150, 4) != (178, 13)
First differing element 0:
150
178
- (150, 4)
+ (178, 13)
    """Fail immediately, with the given message."""
>>  raise self.failureException('Tuples differ: (150, 4) != (178, 13)\n\nFirst differing element 0:\n150\n178\n\n- (150, 4)\n+ (178, 13)')

@tylerlanigan (Contributor Author)

Ok that was dumb of me. I did all the testing outside of that folder and forgot to update the original load_iris to load_wine. Fixed.

In future pull requests I'll figure out my testing bug. Maybe I could ask you for a little guidance on how you set up your testing environment?

@tylerlanigan (Contributor Author)

@jnothman I've submitted a new commit called "fixed test_base.py" which I believe addresses the error that popped up before. I'm confused about the new error message that is now showing up in continuous integration. How do I go about fixing it?

@jnothman (Member)

The current error is from flake8: it's about style, not a test failure.


./examples/preprocessing/plot_scaling_importance.py:32:80: E501 line too long (97 > 79 characters)
which has been standardized using :class:`StandardScaler <sklearn.preprocessing.StandardScaler>`,
                                                                               ^
./examples/preprocessing/plot_scaling_importance.py:49:1: E402 module level import not at top of file
from sklearn.cross_validation import train_test_split
^
./examples/preprocessing/plot_scaling_importance.py:50:1: E402 module level import not at top of file
from sklearn import preprocessing
^
./examples/preprocessing/plot_scaling_importance.py:51:1: E402 module level import not at top of file
from sklearn.decomposition import PCA
^
./examples/preprocessing/plot_scaling_importance.py:52:1: E402 module level import not at top of file
from sklearn.preprocessing import StandardScaler
^
./examples/preprocessing/plot_scaling_importance.py:52:1: F401 'sklearn.preprocessing.StandardScaler' imported but unused
from sklearn.preprocessing import StandardScaler
^
./examples/preprocessing/plot_scaling_importance.py:53:1: E402 module level import not at top of file
from sklearn.naive_bayes import GaussianNB
^
./examples/preprocessing/plot_scaling_importance.py:54:1: E402 module level import not at top of file
from sklearn import metrics
^
./examples/preprocessing/plot_scaling_importance.py:55:1: E402 module level import not at top of file
import matplotlib.pyplot as plt
^
./examples/preprocessing/plot_scaling_importance.py:56:1: E402 module level import not at top of file
from sklearn.datasets import load_wine
^
./examples/preprocessing/plot_scaling_importance.py:67:80: E501 line too long (91 > 79 characters)
                                                    test_size=0.30, random_state=RAN_STATE)
                                                                               ^
./sklearn/datasets/base.py:321:43: W291 trailing whitespace
                 feature_names=['alcohol',
                                          ^
./sklearn/datasets/tests/test_base.py:213:1: E302 expected 2 blank lines, found 1
def test_load_wine():

@tylerlanigan (Contributor Author)

@jnothman Does this need to be corrected before it is merged? I was looking at another example when I made this one, and it generated a few PEP8 errors because the imports weren't at the top. I assumed this needed to be the case for the example to render correctly on the webpage.

@jnothman (Member)

I'm not sure, but that might be an old convention:
http://scikit-learn.org/stable/auto_examples/plot_compare_reduction.html
seems to work okay despite printing doc after imports


@@ -1,6 +1,13 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-

from sklearn.cross_validation import train_test_split
Member:

No, the docstring must come before imports

Contributor Author:

@jnothman Alright I'll change it back.

@tylerlanigan (Contributor Author)

@jnothman I've moved the import statements back and looked at the logs for the errors. It seems that the only flake8 errors are that the import statements need to be at the top. Is there anything else that is required of me before this can be merged?

./examples/preprocessing/plot_scaling_importance.py:41:1: E402 module level import not at top of file
from sklearn.cross_validation import train_test_split
^
./examples/preprocessing/plot_scaling_importance.py:42:1: E402 module level import not at top of file
from sklearn import preprocessing
^
./examples/preprocessing/plot_scaling_importance.py:43:1: E402 module level import not at top of file
from sklearn.decomposition import PCA
^
./examples/preprocessing/plot_scaling_importance.py:44:1: E402 module level import not at top of file
from sklearn.naive_bayes import GaussianNB
^
./examples/preprocessing/plot_scaling_importance.py:45:1: E402 module level import not at top of file
from sklearn import metrics
^
./examples/preprocessing/plot_scaling_importance.py:46:1: E402 module level import not at top of file
import matplotlib.pyplot as plt
^
./examples/preprocessing/plot_scaling_importance.py:47:1: E402 module level import not at top of file
from sklearn.datasets import load_wine
^

@jnothman (Member)

Well, two core devs need to review and approve it... so probably there is more work to do, but what that is will depend on someone finding time to review :)


@lesteve (Member) commented Nov 21, 2016

The flake8 linting is done on the PR diff, so that explains why it is complaining on your example and not on old ones.

To be honest I lean slightly towards changing setup.cfg to ignore E402 (imports not at the top of the file). This would make sense especially for examples where you tend to have some kind of interactive-like style and do your imports only when you need them.

@lesteve (Member) commented Nov 21, 2016

In your particular case you can move print(__doc__) after the imports if you want to get rid of the flake8 failures.
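
For illustration, a minimal sketch of that layout, with the docstring first, then the imports, then print(__doc__). The docstring text here is a placeholder, and the import list follows the flake8 output above (with sklearn.model_selection in place of the deprecated sklearn.cross_validation, as suggested below):

"""
Importance of feature scaling (placeholder docstring).

The module docstring has to stay at the very top of the example file.
"""
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_wine
from sklearn import metrics
import matplotlib.pyplot as plt

# Printing the docstring after the imports avoids the flake8 E402 warnings.
print(__doc__)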

@GaelVaroquaux (Member) commented Nov 21, 2016 via email

@lesteve (Member) left a comment:

First pass

=========================================================

Features scaling though standardization (or Z-score normalization)
can be an importance preprocessing step for many machine learning
Member:

important

to the covariance matrix.

In order to illustrate this in an example, PCA will be performed on a dataset
which has been standardized using StandardScalerand a copy which has remained
Member:

missing space: StandardScaler and

from __future__ import print_function
print(__doc__)

from sklearn.cross_validation import train_test_split
Member:

Use sklearn.model_selection instead since sklearn.cross_validation is deprecated in 0.18.
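
i.e. a one-line swap, roughly:

# sklearn.cross_validation is deprecated as of 0.18; use model_selection instead.
from sklearn.model_selection import train_test_split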

ax2.set_title('Standardized training dataset after PCA')

for ax in (ax1, ax2):

Member:

Minor comment: I would remove this blank line and add one before plt.tight_layout.

@@ -68,6 +69,7 @@
'fetch_rcv1',
'fetch_kddcup99',
'get_data_home',
'load_wine',
Member:

the load_ functions seem alphabetically ordered; please respect that (both here and in the imports)


Examples
--------
Let's say you are interested in the samples 10, 25, and 50, and want to
Member:

Hmmm, you seem to take the samples [10, 80, 140] below, is that intended?


The copy of UCI ML Wine Data Set dataset is
downloaded from:
https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data
Member:

Have you modified the class target labels as mentioned in the comments? If yes you may want to mention that.



# Contants
RAN_STATE = 42
Member:

No strong reason to save three letters -> RANDOM_STATE = 42

weight) because of their respective scales (meters vs. kilos) it can
be seen how not scaling the features would cause PCA to determine that
the direction of maximal variance more closely corresponds with the
‘weight’ axis. As a change in height of one meter can be considered much
Member:

weird quotes around weight, did you mean to use '

# License: BSD 3 clause


# Contants
Member:

Constants

To be honest this is the kind of comments I would remove, since it doesn't bring much added value

@tylerlanigan (Contributor Author)

@lesteve Sounds good. I can get to this in a couple of days. My computer is in the shop.

@jnothman (Member)

@amueller (Member)

maybe actually print the first principal component with and without scaling? Without scaling, it will be dominated by the feature with the largest scale (I would imagine) and you could explain that in the docstring?
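
For illustration, a rough sketch of what that could look like (variable names and the number of components are illustrative, not the example's actual code):

from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# First principal component on the raw data: typically dominated by
# whichever feature has the largest raw scale.
pca_raw = PCA(n_components=2).fit(X)
print("First PC (unscaled):", pca_raw.components_[0])

# Same thing after standardization: the weights should be spread far
# more evenly across the 13 features.
pca_std = PCA(n_components=2).fit(StandardScaler().fit_transform(X))
print("First PC (standardized):", pca_std.components_[0])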

@tylerlanigan (Contributor Author)

@amueller Yeah I can do that no problem.

@tylerlanigan (Contributor Author)

@lesteve I have also completed your requested changes

@tylerlanigan (Contributor Author)

@lesteve How do we proceed? I have made all of the changes that you requested, and all of the checks have passed.

@jnothman (Member) left a comment:

The loader also needs a unit test in datasets/tests/test_base.py
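
For reference, a minimal sketch of what such a test might look like, modelled on the existing load_iris test and on the shape asserted in the earlier traceback (the PR's actual test may differ):

from sklearn.datasets import load_wine
from sklearn.utils.testing import assert_equal, assert_true


def test_load_wine():
    res = load_wine()
    assert_equal(res.data.shape, (178, 13))
    assert_equal(res.target.size, 178)
    assert_equal(res.target_names.size, 3)
    assert_true(res.DESCR)

    # test return_X_y option
    X_y_tuple = load_wine(return_X_y=True)
    bunch = load_wine()
    assert_true(isinstance(X_y_tuple, tuple))
    assert_equal(X_y_tuple[0].shape, bunch.data.shape)
    assert_equal(X_y_tuple[1].shape, bunch.target.shape)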

features.

The results will then be used to train a naive Bayes classifier, and a clear
difference the prediction accuracies will be observed.
Member:

"difference the" -> "difference in"


================= ==============
Classes 3
Samples per class [59,71,48]
Member:

alignment

Contributor Author:

I'm unsure what you mean. Do you mean that the [59,71,48] needs to line up, center-wise, with the 3? Please clarify.

Member:

everything else is right aligned

If True, returns ``(data, target)`` instead of a Bunch object.
See below for more information about the `data` and `target` object.


Member:

remove blank line


(data, target) : tuple if ``return_X_y`` is True


Member:

remove blank line

https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

The file has been modified:
-to include class labels class_0, class_1 and class_2;
Member:

I don't think this will render correctly. Please check, at http://scikit-learn.org/circle?CIRCLE_BUILD_NO/modules/generated/sklearn.datasets.load_wine.html

Then again, I'm not sure this detail needs to be here as long as we're assured these changes are valid.

-to rename target variables from 1, 2, and 3 to 0, 1 and 2;
-to include to amount of datapoints and class labels.


Member:

one blank line

think of Principle Component Analysis (PCA) as being a prime example
of when normalization is important. In PCA we are interested in the
components that maximize the variance. If there exists components
(e.g human height) that vary less then other components (e.g human
Member:

then -> than


================= ==============
Classes 3
Samples per class [59,71,48]
Member:

everything else is right aligned

@tylerlanigan (Contributor Author)

@jnothman, Ok I've completed your requested changes. The unit test was done in a previous commit.

@tylerlanigan (Contributor Author)

@jnothman @lesteve Hey guys, I've completed your changes and pushed to the pull request. I think it's waiting on you to approve...?

@jnothman (Member) commented Dec 13, 2016 via email

def load_wine(return_X_y=False):
"""Load and return the wine dataset (classification).

.. versionadded:: 0.18
Member:

Reduce indent to 4 spaces

Contributor Author:

done.

more important than the change in weight of one kilogram, it is easily
seen that this determination is incorrect. In the case of PCA, scaling
features using normalization is preferred over using min-max scaling as
the primary components are computed using the correlation matrix as
Member:

I'm not sure I get this statement

Contributor Author:

This statement doesn't need to be here. I removed it for clarity.

regression) require features to be normalized, intuitively we can
think of Principle Component Analysis (PCA) as being a prime example
of when normalization is important. In PCA we are interested in the
components that maximize the variance. If there exists components
Member:

How about, more succinctly: "If one component (e.g. human height) varies less than another (e.g. weight) because of their respective scales (meters vs. kilos), PCA might determine that the direction of maximal variance more closely corresponds with the 'weight' axis, if those features are not scaled."

be seen how not scaling the features would cause PCA to determine that
the direction of maximal variance more closely corresponds with the
'weight' axis. As a change in height of one meter can be considered much
more important than the change in weight of one kilogram, it is easily
Member:

"it is easily seen that this determination is incorrect" -> "this is clearly incorrect".

Contributor Author:

Done.

opposed to the covariance matrix.

In order to illustrate this in an example, PCA will be performed on a
dataset which has been standardized using Standard Scaler and a copy
Member:

:class:`preprocessing.StandardScaler`

Member:

"comparing the use of the unscaled data against the same with StandardScaler applied"

Member:

Seeing as we have added a specialised dataset for this illustration, please briefly describe the dataset and the reason for its heterogeneous scale.

Contributor Author:

done.

downloaded and modified to fit standard format from:
https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

Examples
Member:

Needs a References section citing the original PARVUS

Member:

I'm also not sure whether we're obliged to cite UCI... Or indeed whether we're licenced to copy the UCI data. Nothing at https://archive.ics.uci.edu/ml/datasets/Wine suggests that we are.

@tylerlanigan (Contributor Author) Dec 14, 2016:

@jnothman The load_breast_cancer function is also from UCI...

The copy of UCI ML Breast Cancer Wisconsin (Diagnostic) dataset is
    downloaded from:
    https://goo.gl/U2Uwz2

That is the only citation I found for it. I also suspect that load_iris got its data from UCI as well, but it doesn't mention anything.

I can add a reference anyway, to set a better example.

"""
module_path = dirname(__file__)
with open(join(module_path, 'data', 'wine_data.csv')) as csv_file:
data_file = csv.reader(csv_file)
Member:

I'd appreciate if this was refactored wrt load_iris

Contributor Author:

@jnothman I don't see a difference in those four lines between load_iris and load_wine. Please clarify what you mean

Member:

that's why you should factor it out into a helper function

Contributor Author:

@jnothman I can do this, but should I also edit all the other load functions to use this new refactored one? I'm concerned that this is getting a little outside of the scope of the original issue.

Member:

Sigh. If there are really many of them that need refactoring, perhaps not. But I only really think it's this, iris and breast_cancer that share the same loading.
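
For context, a rough sketch of what such a shared helper could look like, assuming the bundled CSV's header row encodes the number of samples, the number of features and the class names, as the existing loaders do (this is not the PR's exact code):

import csv
from os.path import join

import numpy as np


def load_data(module_path, data_file_name):
    # Header row: n_samples, n_features, class_name_0, class_name_1, ...
    with open(join(module_path, 'data', data_file_name)) as csv_file:
        data_file = csv.reader(csv_file)
        temp = next(data_file)
        n_samples = int(temp[0])
        n_features = int(temp[1])
        target_names = np.array(temp[2:])

        data = np.empty((n_samples, n_features))
        target = np.empty((n_samples,), dtype=int)

        # Remaining rows: feature values followed by the integer class label.
        for i, row in enumerate(data_file):
            data[i] = np.asarray(row[:-1], dtype=np.float64)
            target[i] = np.asarray(row[-1], dtype=int)

    return data, target, target_names

Each of load_iris, load_wine and load_breast_cancer could then call this helper and only supply its own DESCR and feature_names.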

@@ -0,0 +1,85 @@
Wine Data Database
Member:

A lot of this is copied verbatim without licence.

Note that this does not quite conform to their citation policy either: https://archive.ics.uci.edu/ml/citation_policy.html

Contributor Author:

I added the citation and reworded the paragraph a little.

assert_true(res.DESCR)

# test return_X_y option
X_y_tuple = load_iris(return_X_y=True)
Member:

load_wine?

Contributor Author:

yup


# test return_X_y option
X_y_tuple = load_iris(return_X_y=True)
bunch = load_iris()
Member:

load_wine?

Contributor Author:

yup :S

regression) require features to be normalized, intuitively we can
think of Principle Component Analysis (PCA) as being a prime example
of when normalization is important. In PCA we are interested in the
components that maximize the variance. If there exists components
Member:

exists -> exist

Member:

How about, more succinctly: "If one component (e.g. human height) varies less than another (e.g. weight) because of their respective scales (meters vs. kilos), PCA might determine that the direction of maximal variance more closely corresponds with the 'weight' axis, if those features are not scaled."

"""
module_path = dirname(__file__)
with open(join(module_path, 'data', 'wine_data.csv')) as csv_file:
data_file = csv.reader(csv_file)
Member:

that's why you should factor it out into a helper function

print(__doc__)

# Code source: Tyler Lanigan <tylerlanigan@gmail.com>
# Sebastian Raschka <mail@sebastianraschka.com>
Member:

alignment

@jnothman (Member)

@amueller know anything about our licence to redistribute data from UCI??

@tylerlanigan (Contributor Author)

@jnothman Hey, just a heads up, I haven't pushed these changes yet. I'm just marking them as "done" as I go to keep track of them myself.

@amueller (Member)

@jnothman our license? My guess would be that iris is public domain. I'm not sure we did any research for the other datasets. UCI doesn't specify anything. If someone uploads their data, they "donate" it, but there is nothing on licensing terms. Given that it's maintained by the library, they should know better lol.

We could bug them to be better about specifying licenses, or we could redistribute until someone tells us not to?

@jnothman (Member)

Looks like you botched the merge.

@tylerlanigan (Contributor Author)

Yeah, I'm a little unsure how to fix this..

@jnothman (Member)

Firstly you can git reset --hard your branch back to the commit you were at before, so that you haven't lost anything. Then the goal is to fetch the updated upstream/master (i.e. master as at scikit-learn central) and, when in your branch, git merge upstream/master (or git rebase upstream/master).

@jnothman (Member)

294af08 was one of the most recent working HEADs of this branch.

@tylerlanigan (Contributor Author)

Is there anything else?

@jnothman (Member) commented Jan 26, 2017 via email

@MechCoder (Member) left a comment:

LGTM otherwise

@@ -242,6 +242,122 @@ def load_files(container_path, description=None, categories=None,
DESCR=description)


def load_data(module_path, data_file_name):
Member:

Any reason why you did not use the np.genfromtxt utility directly? I use it to load CSVs. I feel this entire block is just:

path = os.path.join(module_path, data_file_name)
X_y = np.genfromtxt(path, delimiter=",", skip_header=1)
data = X_y[:, :-1]
y = np.asarray(X_y[:, -1], dtype=np.int)

(or something similar)

Member:

whether or not rightly, we've used the header line to encode extra information

Member:

Don't know if that is a valid excuse. The target_names can be hard-coded.

@jnothman (Member) commented Feb 7, 2017 via email

@MechCoder (Member)

Me neither, but this part of the codebase can do with some code-refactoring. Please merge if you are happy.

@jnothman (Member)

Thanks @t-lanigan

@jnothman jnothman merged commit eb9fe80 into scikit-learn:master Feb 13, 2017
@tylerlanigan (Contributor Author)

@jnothman Thanks as well. When can I see the page on the website?

@lesteve (Member) commented Feb 13, 2017

@jnothman Thanks as well. When can I see the page on the website?

You want to look at scikit-learn.org/dev which is updated on each push into master and go to the examples page.

sergeyf pushed a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
@Przemo10 Przemo10 mentioned this pull request Mar 17, 2017
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
lemonlaug pushed a commit to lemonlaug/scikit-learn that referenced this pull request Jan 6, 2021
Development

Successfully merging this pull request may close these issues.

Examples should better illustrate the value of scaling
6 participants