8000 ml-knn page with make_moons.csv data · plotly/plotly.r-docs@c52901a · GitHub
[go: up one dir, main page]

Skip to content

Commit c52901a

Browse files
author
Kalpit Desai
committed
ml-knn page with make_moons.csv data
1 parent a93d8f8 commit c52901a

File tree

2 files changed

+382
-0
lines changed

2 files changed

+382
-0
lines changed

r/2021-07-21-ml-knn.rmd

Lines changed: 281 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,281 @@
1+
## kNN Classification in R
2+
3+
Visualize Tidymodels's k-Nearest Neighbors (kNN) classification in R with Plotly.
4+
5+
6+
## Basic binary classification with kNN
7+
8+
This section gets us started with displaying basic binary classification using 2D data. We first show how to display training versus testing data using [various marker styles](https://plotly.com/r/marker-style/), then demonstrate how to evaluate our classifier's performance on the **test split** using a continuous color gradient to indicate the model's predicted score.
9+
10+
We will use [Tidymodels](https://www.tidymodels.org/) for training our model and for loading and splitting data. Tidymodels is a popular Machine Learning (ML) library that offers various tools for creating and training ML algorithms, feature engineering, data cleaning, and evaluating and testing models.
11+
12+
We will train a [k-Nearest Neighbors (kNN)](https://parsnip.tidymodels.org/reference/nearest_neighbor.html) classifier. First, the model records the label of each training sample. Then, whenever we give it a new sample, it will look at the `k` closest samples from the training set to find the most common label, and assign it to our new sample.
13+
14+
15+
### Display training and test splits
16+
17+
We get the synthetic data that form the shape of a moon. We then split it into a training and testing set. data that form the shape of a moon. We then split it into a training and testing set. Finally, we display the ground truth labels using [a scatter plot](https://plotly.com/r/line-and-scatter/).
18+
19+
In the graph, we display all the negative labels as squares, and positive labels as circles. We differentiate the training and test set by adding a dot to the center of test data.
20+
21+
In this example, we will use graph objects, Plotly's low-level API for building figures.
22+
23+
```{r}
24+
library(tidyverse)
25+
library(tidymodels)
26+
library(plotly)
27+
28+
make_moons <- read.csv(file = "make_moons.csv")
29+
make_moons$y <- as.character(make_moons$y)
30+
set.seed(123)
31+
make_moons_split <- initial_split(make_moons, prop = 3/4)
32+
make_moons_training <- make_moons_split %>%
33+
training()
34+
make_moons_test <- make_moons_split %>%
35+
testing()
36+
train_index <- as.integer(rownames(make_moons_training))
37+
test_index <- as.integer(rownames(make_moons_test))
38+
make_moons[train_index,'split'] = 'Train Split Label'
39+
make_moons[test_index,'split'] = 'Test Split Label'
40+
make_moons$y <- paste(make_moons$split,make_moons$y)
41+
42+
fig <- plot_ly(data = make_moons, x = ~X1, y = ~X2, type = 'scatter', mode = 'markers',alpha = 0.5, symbol = ~y, symbols = c('square','circle','square-dot','circle-dot'),
43+
marker = list(size = 12,
44+
color = 'lightyellow',
45+
line = list(color = 'black',width = 1)))
46+
47+
fig
48+
```
49+
50+
### Visualize predictions on test split
51+
52+
Now, we train the kNN model on the same training data displayed in the previous graph. Then, we predict the confidence score of the model for each of the data points in the test set. We will use shapes to denote the true labels, and the color will indicate the confidence of the model for assign that score.
53+
54+
Notice that `scatter` only requires one function call to plot both negative and positive labels, and can additionally set a continuous color scale based on the `yscore` output by our kNN model.
55+
56+
```{r}
57+
library(plotly)
58+
library(tidymodels)
59+
60+
db <- read.csv('data/make_moons.csv')
61+
db$y <- as.factor(db$y)
62+
db_split <- initial_split(db, prop = 3/4)
63+
train_data <- training(db_split)
64+
test_data <- testing(db_split)
65+
x_test <- test_data %>% select(X1, X2)
66+
y_test <- test_data %>% select(y)
67+
68+
69+
knn_dist <- nearest_neighbor(neighbors = 15, weight_func = 'rectangular') %>%
70+
set_engine('kknn') %>%
71+
set_mode('classification') %>%
72+
fit(y~., data = train_data)
73+
yscore <- knn_dist %>%
74+
predict(x_test, type = 'prob')
75+
colnames(yscore) <- c('yscore0','yscore1')
76+
yscore <- yscore$yscore1
77+
78+
pdb <- cbind(x_test, y_test)
79+
pdb <- cbind(pdb, yscore)
80+
81+
fig <- plot_ly(data = pdb,x = ~X1, y = ~X2, type = 'scatter', mode = 'markers',color = ~yscore, colors = 'RdBu', symbol = ~y, split = ~y, symbols = c('square-dot','circle-dot'),
82+
marker = list(size = 12, line = list(color = 'black', width = 1)))
83+
84+
fig
85+
```
86+
87+
## Probability Estimates with `Contour`
88+
89+
Just like the previous example, we will first train our kNN model on the training set.
90+
91+
Instead of predicting the confidence for the test set, we can predict the confidence map for the entire area that wraps around the dimensions of our dataset. To do this, we use [`meshgrid`](https://www.rdocumentation.org/packages/pracma/versions/1.9.9/topics/meshgrid) to create a grid, where the distance between each point is denoted by the `mesh_size` variable.
92+
93+
Then, for each of those points, we will use our model to give a confidence score, and plot it with a [contour plot](https://plotly.com/r/contour-plots/).
94+
95+
```{r}
96+
library(plotly)
97+
library(pracma)
98+
library(kknn)
99+
library(tidymodels)
100+
101+
make_moons <- read.csv(file = "data/make_moons.csv")
102+
make_moons_classification <- make_moons
103+
make_moons$y <- as.character(make_moons$y)
104+
set.seed(123)
105+
make_moons_split <- initial_split(make_moons, prop = 3/4)
106+
make_moons_training <- make_moons_split %>%
107+
training()
108+
make_moons_test <- make_moons_split %>%
109+
testing()
110+
train_index <- as.integer(rownames(make_moons_training))
111+
test_index <- as.integer(rownames(make_moons_test))
112+
113+
mesh_size = .02
114+
margin = 0.25
115+
x_min = min(make_moons$X1) - margin
116+
x_max = max(make_moons$X1) + margin
117+
y_min = min(make_moons$X2) - margin
118+
y_max = max(make_moons$X2) + margin
119+
xrange <- seq(x_min, x_max, mesh_size)
120+
yrange <- seq(y_min, y_max, mesh_size)
121+
xy <- meshgrid(x = xrange, y = yrange)
122+
xx <- xy$X
123+
yy <- xy$Y
124+
125+
make_moons_classification$y <- as.factor(make_moons_classification$y)
126+
127+
knn_dist <- nearest_neighbor(neighbors = 15, weight_func = 'rectangular') %>%
128+
set_engine('kknn') %>%
129+
set_mode('classification') %>%
130+
fit(y~., data = make_moons_classification)
131+
132+
dim_val <- dim(xx)
133+
xx1 <- matrix(xx, length(xx), 1)
134+
yy1 <- matrix(yy, length(yy), 1)
135+
final <- data.frame(xx1, yy1)
136+
colnames(final) <- c('X1','X2')
137+
pred <- knn_dist %>%
138+
predict(final, type = 'prob')
139+
140+
predicted <- pred$.pred_1
141+
Z <- matrix(predicted, dim_val[1], dim_val[2])
142+
143+
fig <- plot_ly(x = xrange, y= yrange, z = Z, colorscale='RdBu', type = "contour")
144+
fig
145+
```
146+
147+
Now, let's try to combine our `Contour` plot with the first scatter plot of our data points, so that we can visually compare the confidence of our model with the true labels.
148+
149+
```{r}
150+
library(plotly)
151+
library(pracma)
152+
library(kknn)
153+
library(tidymodels)
154+
155+
make_moons <- read.csv(file = "data/make_moons.csv")
156+
make_moons_classification <- make_moons
157+
make_moons$y <- as.character(make_moons$y)
158+
set.seed(123)
159+
make_moons_split <- initial_split(make_moons, prop = 3/4)
160+
make_moons_training <- make_moons_split %>%
161+
training()
162+
make_moons_test <- make_moons_split %>%
163+
testing()
164+
train_index <- as.integer(rownames(make_moons_training))
165+
test_index <- as.integer(rownames(make_moons_test))
166+
167+
mesh_size = .02
168+
margin = 0.25
169+
x_min = min(make_moons$X1) - margin
170+
x_max = max(make_moons$X1) + margin
171+
y_min = min(make_moons$X2) - margin
172+
y_max = max(make_moons$X2) + margin
173+
xrange <- seq(x_min, x_max, mesh_size)
174+
yrange <- seq(y_min, y_max, mesh_size)
175+
xy <- meshgrid(x = xrange, y = yrange)
176+
xx <- xy$X
177+
yy <- xy$Y
178+
179+
make_moons_classification$y <- as.factor(make_moons_classification$y)
180+
181+
knn_dist <- nearest_neighbor(neighbors = 15, weight_func = 'rectangular') %>%
182+
set_engine('kknn') %>%
183+
set_mode('classification') %>%
184+
fit(y~., data = make_moons_classification)
185+
make_moons[train_index,'split'] = 'Train Split Label'
186+
make_moons[test_index,'split'] = 'Test Split Label'
187+
make_moons$y <- paste(make_moons$split,make_moons$y)
188+
189+
dim_val <- dim(xx)
190+
xx1 <- matrix(xx, length(xx), 1)
191+
yy1 <- matrix(yy, length(yy), 1)
192+
final <- data.frame(xx1, yy1)
193+
colnames(final) <- c('X1','X2')
194+
pred <- knn_dist %>%
195+
predict(final, type = 'prob')
196+
predicted <- pred$.pred_1
197+
Z <- matrix(predicted, dim_val[1], dim_val[2])
198+
199+
200+
fig <- plot_ly(symbols = c('square','circle','square-dot','circle-dot'))%>%
201+
add_trace(x = xrange, y= yrange, z = Z, colorscale='RdBu', type = "contour", opacity = 0.5) %>%
202+
add_trace(data = make_moons, x = ~X1, y = ~X2, type = 'scatter', mode = 'markers', symbol = ~y ,
203+
marker = list(size = 12,
204+
color = 'lightyellow',
205+
line = list(color = 'black',width = 1)))
206+
fig
207+
```
208+
209+
## Multi-class prediction confidence with [`Heatmap`](https://plotly.com/r/heatmaps/)
210+
211+
It is also possible to visualize the prediction confidence of the model using [heatmaps](https://plotly.com/r/heatmaps/). In this example, you can see how to compute how confident the model is about its prediction at every point in the 2D grid. Here, we define the confidence as the difference between the highest score and the score of the other classes summed, at a certain point.
212+
213+
```{r}
214+
library(pracma)
215+
library(plotly)
216+
library(tidyverse)
217+
library(tidymodels)
218+
library(plyr)
219+
220+
data(iris) # We will use the iris data, which is included in R by default
221+
222+
mesh_size = .02
223+
margin = 1
224+
225+
db_split <- initial_split(iris, prop = 3/4)
226+
train_data <- training(db_split)
227+
test_data <- testing(db_split)
228+
229+
# Create a mesh grid on which we will run our model
230+
l_min = min(iris$Sepal.Length) - margin
231+
l_max = max(iris$Sepal.Length) + margin
232+
w_min = min(iris$Sepal.Width) - margin
233+
w_max = max(iris$Sepal.Width) + margin
234+
lrange = seq(l_min, l_max, mesh_size)
235+
wrange = seq(w_min, w_max, mesh_size)
236+
237+
mg = meshgrid(lrange, wrange)
238+
ll = mg$X
239+
ww = mg$Y
240+
241+
# Create classifier, run predictions on grid
242+
model = nearest_neighbor( neighbors = 15, weight_func = 'inv' ) %>%
243+
set_engine("kknn") %>%
244+
set_mode("classification") %>%
245+
fit(Species ~ Sepal.Length + Sepal.Width, data = train_data)
246+
247+
ll1 <- matrix(ll, length(ll), 1)
248+
ww1 <- matrix(ww, length(ww), 1)
249+
final <- data.frame(ll1, ww1)
250+
251+
colnames(final) = c("Sepal.Length", "Sepal.Width" )
252+
253+
pred <- model %>%
254+
predict(final, type = 'prob')
255+
256+
dim_val <- dim(ll)
257+
proba_setosa <- matrix(pred$.pred_setosa, dim_val[1], dim_val[2])
258+
proba_versicolor <- matrix(pred$.pred_versicolor, dim_val[1], dim_val[2])
259+
proba_virginica <- matrix(pred$.pred_virginica, dim_val[1], dim_val[2])
260+
261+
# Compute the classifier confidence
262+
Z <- array(c(proba_setosa, proba_versicolor, proba_virginica), dim = c(dim_val[1],dim_val[2],3))
263+
diff = aaply(Z, c(1,2), max) - (aaply(Z, c(1,2), sum) - aaply(Z,c(1,2), max))
264+
265+
# Overlay the heatmap of the confidence on the scatter plot of the examples
266+
fig <- plot_ly()
267+
fig <- fig %>% add_trace(data=test_data, x = ~Sepal.Length, y = ~Sepal.Width, symbol = ~Species, split = ~Species, symbols = c('square-dot','circle-dot','diamond'),
268+
type = 'scatter', mode = 'markers',
269+
marker = list(size = 12, line = list(width = 1.5), color = 'lightyellow'))%>% layout(title="Prediction Confidence on Test Split")
270+
fig <- fig %>% add_trace(x = lrange, y = wrange, z = diff, type = 'heatmap')
271+
272+
fig
273+
```
274+
275+
## Reference
276+
277+
Learn more about `Contour plots`, and `Heatmap` here:
278+
279+
* https://plot.ly/r/heatmaps/
280+
281+
* https://plot.ly/r/contour-plots/

r/data/make_moons.csv

Lines changed: 101 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,101 @@
1+
X1,X2,y
2+
0.031594993633213456,0.9869877579082642,0
3+
2.115097840005998,-0.04624396793368002,1
4+
0.8824897202429526,-0.07575605515192824,0
5+
-0.05514410395493309,-0.03733245922211456,1
6+
0.8295450281029343,-0.5393214930366995,1
7+
2.112857076342866,0.6620835334836002,1
8+
0.5696926997602635,0.33744135752913684,0
9+
0.9521745361967547,-0.7530747091382108,1
10+
-0.029221145736673143,0.3928733563262766,1
11+
1.7157955733793964,0.3606945394211044,1
12+
-0.2788301416383404,0.6142053862138132,1
13+
2.00344508668566,-0.3091496039347519,1
14+
1.5000961594814137,-0.7463637596324144,1
15+
1.2254083492697045,1.1979301686740846,0
16+
0.4673081934172612,0.5470319224500304,0
17+
-0.36653222317639494,1.119716334184673,0
18+
0.3012445942883551,1.2331569713300374,0
19+
0.30430746270370695,0.8237393538028818,0
20+
-0.23712491563950083,0.778760335469834,0
21+
1.1633983178077336,0.5529023837343671,0
22+
-0.5970241655451736,0.6572360991187756,0
23+
0.4635673501248081,-0.195154298389205,1
24+
0.6845111142379667,0.7528768522591496,0
25+
0.7051569889057123,-0.4589244400760628,1
26+
-0.6580500787381698,-0.12944211219823631,0
27+
-0.746629463206858,-0.3829632037287942,0
28+
0.6058522563968107,0.31252841966998635,0
29+
2.1813716830490253,-0.022917469276894753,1
30+
1.919806334070562,0.17247328896340242,1
31+
1.4834364020743789,-0.10517023149846338,1
32+
0.47404722604781996,-0.14575067196362812,1
33+
1.0561460467688328,-1.0357020680498332,1
34+
1.865007322962233,-0.011078737347192516,1
35+
0.41947210215169756,-0.3247810115072367,1
36+
0.06873257777127617,0.5664846702876994,1
37+
-0.17332432476326556,1.0021513076857296,0
38+
0.12412420648128769,1.0006289185760988,0
39+
0.18121142278818408,0.24717743090960537,1
40+
-0.2545155930677909,0.1931727159301121,1
41+
1.025806681391205,-0.6257456623273743,1
42+
1.5200214310317168,-0.04515581275373087,0
43+
0.6417403675063451,-0.39369468370639576,1
44+
-0.8761558871630455,0.046566202513465105,0
45+
-1.069649973668144,0.13666024731616316,0
46+
1.0265876544828723,0.37782457690227467,0
47+
0.9313132543149979,1.385178411971549,0
48+
0.6768054363547378,1.5718993127141032,0
49+
-0.36885733097325313,0.7288660141702618,0
50+
-1.0246549549067154,0.16190214636642664,0
51+
0.7745538478844521,0.15375803209412348,0
52+
1.4045116008147969,-0.00705700513831449,1
53+
-0.38147174216725455,-0.23488746653490064,1
54+
0.7215522446289362,0.447216575087707,0
55+
-0.5134668595395712,0.6786909545326081,0
56+
0.3211854615031241,0.2866866685419984,0
57+
0.19536280375270595,0.1608510651493314,1
58+
0.5282419616404991,0.983009931544712,0
59+
-0.22165389689435605,0.2516013870512852,0
60+
0.22334676355617628,1.3221718264805824,0
61+
-0.10704571812268296,0.5617832625636306,1
62+
0.6365168516040098,0.7544482523438969,0
63+
-0.3722784784934399,0.9929131660820307,0
64+
0.20718083265981707,-0.0976714273989191,1
65+
0.1273314190904385,-0.37965489671189484,1
66+
0.7143523082580422,-0.799941612146002,1
67+
1.094878136347055,-0.36841844821907443,1
68+
-0.14814362283282412,0.9615865686187115,0
69+
1.586187998577997,-0.6298451656162283,1
70+
0.7444455063252204,0.5766137068632283,0
71+
2.1801102777300736,-0.699777507780252,1
72+
0.2457559354815964,0.8496383011786972,0
73+
-0.9500358136002063,0.9036169893360139,0
74+
-0.8823075833032377,0.07249044196648242,0
75+
2.0229707875067526,0.12325148232449507,1
76+
2.1457732101750753,0.4629636173144291,1
77+
0.3553646839272794,-0.6784798863681882,1
78+
0.34665025884741507,1.115706761193445,0
79+
1.7392373006108919,0.45900352420145013,1
80+
0.6385646696350307,-0.44718443416511544,1
81+
1.2876687011345485,-0.4910366009324705,1
82+
-0.1377260693484619,1.2453261987125588,0
83+
-0.5617530312161083,1.0548605100608779,0
84+
1.2900374790980524,-0.206914047504865,1
85+
-0.8753936535485004,0.5054342299267167,0
86+
-0.9285824919481817,-0.4563199052393644,0
87+
0.02493632078646507,0.1074795833868682,1
88+
0.19725589918348185,-0.06801667981997078,1
89+
0.7334605649545698,0.2816192880885117,0
90+
1.6829443382714295,-0.20204230317276864,1
91+
0.5076412403964466,-0.11731979334752535,1
92+
1.6676021747551528,-0.42485665227173697,1
93+
-0.8217228210436642,0.6314106629162008,0
94+
0.3017090266430521,0.7860353394551198,0
95+
1.3767150470581093,-0.8091510741803736,1
96+
1.17037550831432,0.5984065269053864,0
97+
1.6994530943264532,0.5877196698165861,1
98+
0.2186232335250059,-0.6525211867631453,1
99+
0.9529142847496186,-0.4197656395767486,1
100+
-1.318500344394381,0.4231123496153354,0
101+
-1.2968176444425201,0.18414709023079687,0

0 commit comments

Comments
 (0)
0