Suggestions for Ch6 - Classification 2 #70
Comments
I'm just going to add code to suppress warnings rather than muck with package versions. We can remove that code later once altair upgrades.
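For reference, a minimal sketch of one way to do that suppression, assuming the warnings in question are `FutureWarning`s like the pandas `iteritems` deprecation quoted further down (the exact warning category is an assumption):

```python
import warnings

# Hedged sketch: silence FutureWarnings for now (e.g. the pandas iteritems
# deprecation surfaced through altair); remove once altair upgrades.
warnings.filterwarnings("ignore", category=FutureWarning)
```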
I thought about this for a bit, and tried to implement both ways. I'm going to stick with the global seed approach. Yes, it does have the drawback that other code may muck with the global seed, and you don't really know whether the functions you're using are actually using the global seed. BUT for the purposes of students learning about how randomness in computers works as part of a 100-level course, it's just so much simpler and less error prone to teach them about setting a global seed. Given that, the drawbacks seem acceptable here. That being said, I will add a notebox to tell students about the two caveats I mentioned above, and about `random_state`.
We shouldn't be setting `random_state` in each individual call. Since we need to set the seed once at the start of the analysis, we need to teach them about what seeding does. (They also see it a tonne in their worksheets anyway, so at least it isn't totally foreign to them.)
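To make the trade-off concrete, here is a hedged sketch of the two approaches being weighed; the dataframe and seed value are placeholders, not the chapter's actual code:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder stand-in for the chapter's cancer data.
cancer = pd.DataFrame({
    "Class": ["Malignant", "Benign"] * 50,
    "Smoothness": np.linspace(0, 1, 100),
})

# Approach 1 (the approach being kept): set NumPy's global seed once, up front.
# Everything downstream that draws on the global PRNG is then reproducible.
np.random.seed(1)
shuffled = cancer.sample(frac=1)

# Approach 2 (per-call control): pass random_state explicitly wherever
# randomness is involved, e.g. in the train/test split.
cancer_train, cancer_test = train_test_split(
    cancer, train_size=0.75, stratify=cancer["Class"], random_state=1
)
```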
@joelostblom I've been using
Yes, that's a holdover from tidymodels. I'm using the terminology "model object".
I like this idea, but it's more of a "nice to have". I'll open an issue and punt for later.
I'm just going to use
We explain how they're created in the paragraph above. I don't think we need to show them the code for this because they shouldn't ever be adding random columns in practice.
It's good enough for now; punted to later as an issue.
This is where we illustrate what cross-validation does manually (by manually generating 5 splits). We only show the first split. You'll notice we are splitting the training data.
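For context, a rough sketch of what that first manual split could look like; the dataframe below is a placeholder, not the chapter's code:

```python
import pandas as pd

# Placeholder stand-in for the chapter's training dataframe.
cancer_train = pd.DataFrame({
    "Class": ["Malignant", "Benign"] * 50,
    "Smoothness": range(100),
})

n_folds = 5
fold_size = cancer_train.shape[0] // n_folds

# First of the five splits: hold out the first chunk for validation and
# use the remaining rows as the sub-training set.
validation_set = cancer_train.iloc[:fold_size]
sub_training_set = cancer_train.iloc[fold_size:]
```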
It should be "in ____" because the previous section covers the conceptual stuff. I'll change it to "in scikit-learn", because thankfully it actually implements it, unlike tidymodels in R where we had to code it ourselves.
Note, crucially, we want the standard error here, not the standard deviation.
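For reference, a small sketch of the distinction using `cross_validate`; the model and data are stand-ins, not the chapter's exact code:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data and model, just to show the two spread measures.
X, y = load_breast_cancer(return_X_y=True)
scores = cross_validate(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)["test_score"]

std_dev = scores.std(ddof=1)              # spread of the individual fold accuracies
std_err = std_dev / np.sqrt(len(scores))  # uncertainty of the mean accuracy (what we want)
```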
It is an estimate -- I'm keeping that word |
I don't think there's any real value in introducing two separate methods for this in this class. I'm just going to introduce grid CV.
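A hedged sketch of the grid-search flavour of cross-validation in scikit-learn; the pipeline, parameter range, and data are illustrative assumptions, not the chapter's actual choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale, then tune K over a small illustrative grid with 5-fold CV.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": range(1, 16)}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)

best_model = grid.best_estimator_  # already refit on all the data; no need to refit manually
```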
Nope, and it isn't necessary; removing it.
I think this is a good thing to consider later (nice to have, not necessary right now). Will spin into its own issue.
I agree. I'm going to comment it out for now entirely and reintroduce it later once we have breathing room. Issue posted.
I will start going through Ch6 now, might not finish until tomorrow. A few comments on what you wrote above:
I am not familiar with the internals of how the PRNGs work, but I think both pandas and sklearn recommend the use of
Great!
I agree that the cross-validation accuracy is an estimate of the test accuracy. I read that sentence as saying we are getting an estimate of the cross-validation accuracy, which we are not, since we are getting the actual cross-validation accuracy. But I'm not a native speaker, so if it reads well to you then let's keep it.
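On the PRNG point above: if the recommendation being alluded to is NumPy's newer `Generator` API (that is an assumption here), per-call control could look roughly like this:

```python
import numpy as np
import pandas as pd

# Assumption: the recommended alternative to the global seed is a dedicated
# Generator created via np.random.default_rng, passed where randomness occurs.
rng = np.random.default_rng(1)

df = pd.DataFrame({"x": range(10)})
shuffled = df.sample(frac=1, random_state=rng)  # pandas accepts a Generator here
```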
- `sample`: we will always control randomness via the `random_state` parameter. Use `random_state` when doing the `train_test` split instead.
- Use `cancer["Class"].map({'M': 'Malignant', 'B': 'Benign'})` (which would be convenient but a new thing) or simply filtering: `cancer.loc[cancer['Class'] == 'M', 'Class'] = 'Malignant'` + the benign version (which is less elegant but builds on what they have learned already).
- `FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.`
- Show `stratify` and `shuffle`, not just mention what they do (include a short example for each in a sentence or so; see the sketch after this list).
- The `info` method always prints columns vertically, not just when there is a large number of them.
- `groupby` + `count` here: that creates a count for each column, so it doesn't generalize well. Instead we should use `.groupby` + `size`, OR just use `value_counts(normalize=True)`, which means we don't have to divide by the length after (see the sketch after this list). Btw, we should get the dataframe length via `df.shape[0]`, not using `len` (applies to the entire book). I don't think there is a `normalize` option to `size`, but I can't double check because I'm in the air.
- Capital `X` and lower case `y`? `X` should be created as `cancer_train.drop(columns='Class')`, which is more generalizable.
- `X_train`, and do the same for the test data. `X_tune`/`y_tune` in grid search should just be the same `_train` instead.
- Why use `concat` to add the predictions? Can't we just create a copy of the new columns the regular way in the original dataframe (if it is OK to modify at this point)? Or use `df.assign(pred=df['predictions'])` to just print it without overwriting the variable (if we have taught `assign` before here).
- Show `y_test` and the predictions created in `cancer_test_predictions`, so that students get a better grasp of what it is.
- Why show `confusion_matrix` when we are saying that `ConfusionMatrixDisplay` is the recommended way? (See the sketch after this list.)
- The `cv` param of `cross_validate` does not seem that helpful?
- `func=` in `agg`.
- Do we explain what `range` does somewhere, or do we just start using it?
- `param_grid_lots` is not a great name; something like `large_param_grid` sounds better. Same with the plot's name.
- `n_jobs=-1`?
- The `fit()` function of the GridSearchCV instance: use `GridSearchCV.best_estimator_` instead of refitting it (which is redundant and potentially time consuming).
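Tying a few of those points together, a hedged sketch of a stratified split, a class-proportion check with `value_counts(normalize=True)`, and a `ConfusionMatrixDisplay`-based confusion matrix; the variable names and data loading are assumptions, not the book's code:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for the chapter's cancer data (features X, labels y).
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# stratify keeps the class proportions the same in both splits;
# shuffle (True by default) randomizes row order before splitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, shuffle=True, random_state=1
)

# Class proportions without dividing by the length manually.
print(y_train.value_counts(normalize=True))

# Fit a classifier and show its confusion matrix on the test set.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
ConfusionMatrixDisplay.from_estimator(knn, X_test, y_test)
```

Here `ConfusionMatrixDisplay.from_estimator` stands in for the display-based approach; `from_predictions` would work similarly if the predictions are already computed.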