[MRG+1] Fixing a bug where entropy included labeled items by mdezube · Pull Request #8150 · scikit-learn/scikit-learn

Merged (8 commits) on Jan 28, 2017

Conversation

@mdezube (Contributor) commented Jan 3, 2017:

This bug causes the loop never to converge, no matter how high the iteration count, since it includes already-labeled items in the set of items it will label in the next iteration.

This can readily be seen by adding a "print unlabeled_indices.shape" statement and noticing it stops shrinking after only 1/3 of the data is labeled.

Note I'm not sure how to rebuild the documentation and the .py and .ipynb files linked to here, is that done automatically?

@jnothman (Member) left a comment:

Otherwise, I think this is the right thing to do.

-    uncertainty_index = uncertainty_index = np.argsort(pred_entropies)[-5:]
+    uncertainty_index = np.argsort(pred_entropies)[::-1]
+    uncertainty_index = uncertainty_index[
+        np.in1d(uncertainty_index, unlabeled_indices)][:5]
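The replacement selection logic can be sketched in isolation. In this sketch, `pred_entropies` and `unlabeled_indices` are made-up stand-ins for the example's variables:

```python
import numpy as np

# Made-up stand-ins for the example's per-sample entropies and the
# indices of samples that have not yet been given a label.
rng = np.random.RandomState(0)
pred_entropies = rng.rand(20)
unlabeled_indices = np.arange(10, 20)

# sort every index from most to least uncertain ...
uncertainty_index = np.argsort(pred_entropies)[::-1]
# ... then keep only the still-unlabeled ones, capped at five
uncertainty_index = uncertainty_index[
    np.in1d(uncertainty_index, unlabeled_indices)][:5]
```

Unlike the original `[-5:]` slice, nothing already labeled can be selected again.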
Member:

Sorry, this comment didn't submit:
Is it possible this expression will return fewer than 5 results? Does this need to be handled below?

mdezube (author):

It wouldn't cause an error, but we might as well break the loop earlier. Also I fixed the charts to handle if the iteration count is increased.
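The filtered selection can indeed return fewer than five indices once the unlabeled pool runs low. A minimal illustration with made-up numbers:

```python
import numpy as np

pred_entropies = np.array([0.5, 0.1, 0.9, 0.3])
unlabeled_indices = np.array([2])  # only one sample left unlabeled

uncertainty_index = np.argsort(pred_entropies)[::-1]
uncertainty_index = uncertainty_index[
    np.in1d(uncertainty_index, unlabeled_indices)][:5]
# only one index survives the filter, so later code must not assume five
```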

@tguillemot (Contributor) commented Jan 5, 2017:

Travis is not happy because of PEP8 problems.
Can you have a look, @mdezube?

@mdezube (author) commented Jan 5, 2017:

I took a look and it's due to line length. But most of the file doesn't conform to the specified line length anyway according to PEP 8, so I didn't think it was worth fixing specifically the lines I changed. I tend to always favor consistency throughout a file in situations like these.

Up to you if you want me to fix it, take a look at the file and let me know what your preference is.

@tguillemot (Contributor):

@mdezube I understand your point of view.

Indeed, some files in sklearn don't conform to PEP8.
Since it takes a lot of time to correct them all, we added the Travis check to verify that new lines are PEP8 conformant. If all new PRs are PEP8 conformant, things will not get worse :).

A PR is most welcome if you want to correct some PEP8 problems in sklearn, but for now it will be easier to only fix the lines you modified :). Thx!

    f.text(.05, (1 - (i + 1) * .183),
           "model %d\n\nfit with\n%d labels" % ((i + 1), i * 5 + 10), size=10)

    # if running for more than 5 iterations visualiaze the gain only on the first 5.
Contributor:

If running for more than 5 iterations visualize the gain is only on the first 5

mdezube (author):

Done. I omitted the capitalization though since no other comments in the file are capitalized.

    sub.set_title('predict: %i\ntrue: %i' % (
        lp_model.transduction_[image_index], y[image_index]), size=10)
    sub.axis('off')

    # if running for more than 5 iterations visualiaze the gain only on the first 5.
Contributor:

If running for more than 5 iterations visualize the gain is only on the first 5

mdezube (author):

Done. I omitted the capitalization though since no other comments in the file are capitalized.

@tguillemot (Contributor):

Still some PEP8 problems.
Can you have a look, @mdezube?

@mdezube (author) commented Jan 17, 2017:

@tguillemot I trimmed the whitespace per the PEP8 warning. There's still a warning about my lines being too long, but I aimed for consistency throughout the file, since all of the existing lines follow a 100-character limit, not 80. If you'd rather they all be 80 I'm happy to do this in a follow-up PR, but I feel that should be a single change instead of being lumped in with this functionality fix.


    # keep track of indices that we get labels for
    delete_indices = np.array([])

    f.text(.05, (1 - (i + 1) * .183),
           "model %d\n\nfit with\n%d labels" % ((i + 1), i * 5 + 10), size=10)
    # if running for more than 5 iterations, visualize the gain on only the first 5
Contributor:

# for more than 5 iterations visualize the gain only on the first 5

mdezube (author):

Done.

    # if running for more than 5 iterations, visualize the gain on only the first 5
    if i < 5:
        f.text(.05, (1 - (i + 1) * .183),
               "model %d\n\nfit with\n%d labels" % ((i + 1), i * 5 + 10), size=10)
Contributor:

f.text(.05, (1 - (i + 1) * .183),
       "model %d\n\nfit with\n%d labels" % 
       (i + 1, i * 5 + 10), size=10)

mdezube (author):

Done.

    sub.set_title('predict: %i\ntrue: %i' % (
        lp_model.transduction_[image_index], y[image_index]), size=10)
    sub.axis('off')
    # if running for more than 5 iterations, visualize the gain on only the first 5
Contributor:

# for more than 5 iterations visualize the gain only on the first 5

mdezube (author):

Done.

@@ -69,29 +72,35 @@
     pred_entropies = stats.distributions.entropy(
         lp_model.label_distributions_.T)

-    # select five digit examples that the classifier is most uncertain about
-    uncertainty_index = uncertainty_index = np.argsort(pred_entropies)[-5:]
+    # select up to five digit examples that the classifier is most uncertain about
Contributor:

# select up to 5 digit examples that the classifier is most uncertain about

mdezube (author):

Done.
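For context, scipy's entropy function reduces each column of a 2-D input, which is why the example transposes `label_distributions_` so that one entropy value is produced per sample. A small sketch with made-up probabilities, calling `scipy.stats.entropy` directly (the example reaches the same function via the `stats.distributions` alias):

```python
import numpy as np
from scipy import stats

# rows = samples, columns = class probabilities (made-up values)
label_distributions = np.array([[0.9, 0.1],   # confident sample
                                [0.5, 0.5]])  # maximally uncertain sample

# transpose so entropy is computed per sample (scipy reduces along axis 0)
pred_entropies = stats.entropy(label_distributions.T)
# the uncertain sample's entropy is ln(2); the confident one's is lower
```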

@mdezube (author) commented Jan 21, 2017:

Comments addressed!

@jnothman (Member) left a comment:

Otherwise LGTM

    f.text(.05, (1 - (i + 1) * .183),
           "model %d\n\nfit with\n%d labels" % ((i + 1), i * 5 + 10), size=10)

    # for more than 5 iterations, visualize the gain only on the first 5
    if i < 5:
Member:

I'm not sure this is necessary. Perhaps if we want the number of iterations to look like a parameter the user might play with, we should pull it out into a named variable.

mdezube (author):

Might as well extract it into a variable :) See the latest changes.
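The resulting loop shape might look like the following sketch. The `max_iterations` name matches the variable introduced in this change, but everything else (the fitting step, the index bookkeeping) is elided or purely illustrative:

```python
import numpy as np

max_iterations = 5  # the named variable the review asked for; free to change
unlabeled_indices = np.arange(330)  # stand-in for the initially unlabeled set

for i in range(max_iterations):
    if len(unlabeled_indices) == 0:
        # nothing left to label: break early rather than iterate uselessly
        break
    # ... fit the model and compute per-sample entropies here (elided) ...
    picked = unlabeled_indices[:5]  # placeholder for the entropy-based choice
    unlabeled_indices = np.setdiff1d(unlabeled_indices, picked)
```

With the selection restricted to unlabeled items, the pool shrinks every iteration and the early break covers the case where it empties before `max_iterations` is reached.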

@jnothman jnothman changed the title [MRG] Fixing a bug where entropy included labeled items [MRG+1] Fixing a bug where entropy included labeled items Jan 22, 2017
Makes it more obvious that `max_iterations` can freely be changed
@jnothman (Member) left a comment:

Yes, that's clearer.

@tguillemot (Contributor):

LGTM.

Both the AppVeyor and Circle failures seem unrelated.

@mdezube (author) commented Jan 28, 2017:

Now that this is approved, is there a way for me to merge it in? Or do I just wait?

@jnothman (Member):

Yes, I'm okay to merge this.

@jnothman (Member):

Thanks

@jnothman jnothman merged commit 1b0ec1b into scikit-learn:master Jan 28, 2017
@mdezube (author) commented Jan 29, 2017:

Thanks Joel! Quick question: does this change automatically get picked up into the .ipynb file that includes this demo? And should I expect it to go out in v0.19?

@jnothman (Member) commented Jan 29, 2017 via email.

@tguillemot (Contributor):

Thanks @mdezube

@mdezube (author) commented Jan 30, 2017 via email.

sergeyf pushed a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
@Przemo10 Przemo10 mentioned this pull request Mar 17, 2017
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017