[MRG+1] Fixing a bug where entropy included labeled items #8150
Conversation
This bug leads the loop to never converge no matter how high the iteration count since it includes already labeled items in the bucket of items it'll label in the next iteration. This can readily be seen by adding a "print unlabeled_indices.shape" statement and noticing it stops shrinking after only 1/3 of the data is labeled.
Otherwise, I think this is the right thing to do.
uncertainty_index = uncertainty_index = np.argsort(pred_entropies)[-5:]
uncertainty_index = np.argsort(pred_entropies)[::-1]
uncertainty_index = uncertainty_index[
    np.in1d(uncertainty_index, unlabeled_indices)][:5]
Sorry, this comment didn't submit:
is it possible this expression will return fewer than 5 results? Does this need to be handled below?
It wouldn't cause an error, but we might as well break the loop earlier. Also I fixed the charts to handle if the iteration count is increased.
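To make the exchange above concrete, here is a minimal, self-contained sketch of the old selection versus the fixed one, using made-up entropies and indices rather than the example's real lp_model output. It also shows why having fewer than five unlabeled samples left is harmless: the [:5] slice simply returns whatever remains.

import numpy as np

# hypothetical prediction entropies for 10 samples; indices 2, 3, 5, 7, 8
# are the samples that have not been given a true label yet
pred_entropies = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6, 0.5, 0.05])
unlabeled_indices = np.array([2, 3, 5, 7, 8])

# old behaviour: the five highest entropies, regardless of label status
old_pick = np.argsort(pred_entropies)[-5:]
print(old_pick)             # includes index 1, which is already labeled

# fixed behaviour: sort by decreasing entropy, keep only still-unlabeled
# samples, then take at most five of them
uncertainty_index = np.argsort(pred_entropies)[::-1]
uncertainty_index = uncertainty_index[
    np.in1d(uncertainty_index, unlabeled_indices)][:5]
print(uncertainty_index)    # only unlabeled samples

# if fewer than five unlabeled samples remain, the slice just returns fewer
# items (no error), which is why breaking out of the loop early is optional
print(np.array([3, 5])[:5])  # -> [3 5]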
Travis is not happy because of PEP8 problems.
I took a look and it's due to line length. But most of the file doesn't conform to the specified line length anyway according to PEP8, so I didn't think it was worth fixing only the lines I changed. I tend to favor consistency throughout a file in situations like these. Up to you if you want me to fix it; take a look at the file and let me know what your preference is.
@mdezube I understand your point of view. Indeed, some files in sklearn don't conform to PEP8. A PR is most welcome if you want to correct PEP8 problems elsewhere in sklearn, but for now it will be easier to only fix the lines you modified :). Thx!
f.text(.05, (1 - (i + 1) * .183),
       "model %d\n\nfit with\n%d labels" % ((i + 1), i * 5 + 10), size=10)

# if running for more than 5 iterations visualiaze the gain only on the first 5.
If running for more than 5 iterations, visualize the gain only on the first 5
Done. I omitted the capitalization though since no other comments in the file are capitalized.
sub.set_title('predict: %i\ntrue: %i' % (
    lp_model.transduction_[image_index], y[image_index]), size=10)
sub.axis('off')
# if running for more than 5 iterations visualiaze the gain only on the first 5.
If running for more than 5 iterations, visualize the gain only on the first 5
Done. I omitted the capitalization though since no other comments in the file are capitalized.
Still some PEP8 problems.
@tguillemot I trimmed the whitespace per the PEP8 warning. There's still a warning about my lines being too long, but I aimed for consistency throughout the file, since the existing lines follow a 100-char limit, not 80. If you'd rather they all be 80, I'm happy to do that in a follow-up PR, but I feel it should be a single change instead of being lumped in with this functionality fix.
# keep track of indices that we get labels for
delete_indices = np.array([])
f.text(.05, (1 - (i + 1) * .183),
       "model %d\n\nfit with\n%d labels" % ((i + 1), i * 5 + 10), size=10)
# if running for more than 5 iterations, visualize the gain on only the first 5
# for more than 5 iterations visualize the gain only on the first 5
Done.
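Going back to the delete_indices line quoted a few comments above: a rough sketch of how such a tracking array keeps the unlabeled pool shrinking between iterations. The names mirror the diff, but the pool and the newly_labeled list are made up for illustration.

import numpy as np

unlabeled_indices = np.arange(10, 40)    # hypothetical pool of unlabeled samples
delete_indices = np.array([], dtype=int)

# suppose these are the samples we just asked a human to label this iteration
newly_labeled = [12, 17, 23, 31, 38]
for image_index in newly_labeled:
    # position of this sample inside unlabeled_indices
    delete_index, = np.where(unlabeled_indices == image_index)
    delete_indices = np.concatenate((delete_indices, delete_index))

# drop everything we just labeled so the pool keeps shrinking
unlabeled_indices = np.delete(unlabeled_indices, delete_indices)
print(unlabeled_indices.shape)  # (25,)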
# if running for more than 5 iterations, visualize the gain on only the first 5
if i < 5:
    f.text(.05, (1 - (i + 1) * .183),
           "model %d\n\nfit with\n%d labels" % ((i + 1), i * 5 + 10), size=10)
f.text(.05, (1 - (i + 1) * .183),
       "model %d\n\nfit with\n%d labels" %
       (i + 1, i * 5 + 10), size=10)
Done.
sub.set_title('predict: %i\ntrue: %i' % (
    lp_model.transduction_[image_index], y[image_index]), size=10)
sub.axis('off')
# if running for more than 5 iterations, visualize the gain on only the first 5
# for more than 5 iterations visualize the gain only on the first 5
Done.
@@ -69,29 +72,35 @@
pred_entropies = stats.distributions.entropy(
    lp_model.label_distributions_.T)

# select five digit examples that the classifier is most uncertain about
uncertainty_index = uncertainty_index = np.argsort(pred_entropies)[-5:]
# select up to five digit examples that the classifier is most uncertain about
# select up to 5 digit examples that the classifier is most uncertain about
Done.
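For context on the pred_entropies lines in that hunk, a small sketch of how per-sample prediction entropies can be computed. The probability rows are made up and stand in for lp_model.label_distributions_, and scipy.stats.entropy is called directly (the example itself reaches it via stats.distributions.entropy).

import numpy as np
from scipy import stats

# each row: predicted class-probability distribution for one sample
label_distributions = np.array([
    [0.98, 0.01, 0.01],   # confident prediction -> low entropy
    [0.40, 0.35, 0.25],   # uncertain prediction -> high entropy
    [0.70, 0.20, 0.10],
])

# entropy() reduces along axis 0, hence the transpose to get one value per sample
pred_entropies = stats.entropy(label_distributions.T)
print(pred_entropies)  # the largest value marks the most uncertain sample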
Comments addressed!
Otherwise LGTM
f.text(.05, (1 - (i + 1) * .183),
       "model %d\n\nfit with\n%d labels" % ((i + 1), i * 5 + 10), size=10)
# for more than 5 iterations, visualize the gain only on the first 5
if i < 5:
I'm not sure this is necessary. Perhaps if we want the number of iterations to look like a parameter the user might play with, we should pull it out into a named variable.
Might as well extract it into a variable :) See the latest changes.
Makes it more obvious that `max_iterations` can freely be changed
Yes, that's clearer.
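A sketch of the loop shape the extracted parameter gives the example. The model fitting and uncertainty selection are replaced by a trivial stand-in here; max_iterations and the early break follow the discussion above.

import numpy as np

max_iterations = 5                    # number of active-learning rounds to run
unlabeled_indices = np.arange(330)    # hypothetical pool of unlabeled samples

for i in range(max_iterations):
    if len(unlabeled_indices) == 0:
        print("No unlabeled items left to label.")
        break

    # ... fit the label-propagation model and pick the 5 most uncertain
    # unlabeled digits here; a trivial stand-in is used instead ...
    newly_labeled = unlabeled_indices[:5]

    # remove the newly labeled samples so the pool shrinks every round
    unlabeled_indices = np.setdiff1d(unlabeled_indices, newly_labeled)
    print("iteration %d: %d samples still unlabeled"
          % (i + 1, unlabeled_indices.shape[0]))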
LGTM. Both the appveyor and circle failures seem unrelated.
Now that this is approved, is there a way for me to merge it in? Or do I just wait?
Yes, I'm okay to merge this.
Thanks
Thanks Joel! QQ, does this change automatically get picked up into the ipynb file that includes this demo? And should I expect it to go out in v0.19?
Yes and yes
Thanks @mdezube
Happy to contribute!
Note I'm not sure how to rebuild the documentation and the .py and .ipynb files linked to here, is that done automatically?