kNN on Wisconsin Breast Cancer Data
Brandon Fitzgerald
April 8, 2017
The purpose of this presentation is to illustrate how the basic usage of k-Nearest Neighbors must be expanded upon to get a clearer picture of how the algorithm performs, and how many options for data preparation exist along the way. We will focus on two primary decisions that must be made in the process:
- The choice of normalization function
- The choice of \(k\), the number of nearest neighbors to use
I will assume that you understand the basics of the kNN algorithm: each observation in the testing set is classified by identifying the \(k\) nearest observations in the training set and assigning the test observation the majority label among these “nearest neighbors.”
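As a toy illustration of that voting rule (a minimal sketch with made-up points, not the breast cancer data; every object here is invented for the example), classifying a single test point by majority vote among its \(k\) nearest training points looks like this:
#Toy example: classify one test point by majority vote among its k nearest neighbors.
train_x = matrix(c(1, 1,
                   2, 1,
                   8, 9,
                   9, 8), ncol = 2, byrow = TRUE)
train_y = factor(c("B", "B", "M", "M"))
test_x = c(1.5, 1.2)
k = 3
#Euclidean distance from the test point to every training point.
dists = sqrt(rowSums((train_x - matrix(test_x, nrow(train_x), 2, byrow = TRUE))^2))
#Labels of the k closest training points, then the majority label.
nearest_labels = train_y[order(dists)[1:k]]
names(which.max(table(nearest_labels)))
## [1] "B"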
The Wisconsin Diagnostic Breast Cancer Dataset
We will be using data collected by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian from the University of Wisconsin in 1992. Characteristics of the cell nuclei from tumors, such as radius and texture, were measured; each tumor was known to be either benign or malignant.
The data was downloaded from the UCI Machine Learning Repository at
http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
The Basics
First, let’s set things up in R by loading the necessary package and importing the data. The class package will be used to run the k-nearest neighbors algorithm. We will also set a specific seed so that you can reproduce this in R yourself. After importing the data, let’s take a look at the basic structure of the dataset.
library(class)
set.seed(4865)
data_files_path = "C:/**Your Data Files Path**"
wbcd = read.csv(paste(data_files_path, "/wisc_bc_data.csv",
sep=""))
str(wbcd)
## 'data.frame': 569 obs. of 32 variables:
## $ id : int 87139402 8910251 905520 868871 9012568 906539 925291 87880 862989 89827 ...
## $ diagnosis : Factor w/ 2 levels "B","M": 1 1 1 1 1 1 1 2 1 1 ...
## $ radius_mean : num 12.3 10.6 11 11.3 15.2 ...
## $ texture_mean : num 12.4 18.9 16.8 13.4 13.2 ...
## $ perimeter_mean : num 78.8 69.3 70.9 73 97.7 ...
## $ area_mean : num 464 346 373 385 712 ...
## $ smoothness_mean : num 0.1028 0.0969 0.1077 0.1164 0.0796 ...
## $ compactness_mean : num 0.0698 0.1147 0.078 0.1136 0.0693 ...
## $ concavity_mean : num 0.0399 0.0639 0.0305 0.0464 0.0339 ...
## $ points_mean : num 0.037 0.0264 0.0248 0.048 0.0266 ...
## $ symmetry_mean : num 0.196 0.192 0.171 0.177 0.172 ...
## $ dimension_mean : num 0.0595 0.0649 0.0634 0.0607 0.0554 ...
## $ radius_se : num 0.236 0.451 0.197 0.338 0.178 ...
## $ texture_se : num 0.666 1.197 1.387 1.343 0.412 ...
## $ perimeter_se : num 1.67 3.43 1.34 1.85 1.34 ...
## $ area_se : num 17.4 27.1 13.5 26.3 17.7 ...
## $ smoothness_se : num 0.00805 0.00747 0.00516 0.01127 0.00501 ...
## $ compactness_se : num 0.0118 0.03581 0.00936 0.03498 0.01485 ...
## $ concavity_se : num 0.0168 0.0335 0.0106 0.0219 0.0155 ...
## $ points_se : num 0.01241 0.01365 0.00748 0.01965 0.00915 ...
## $ symmetry_se : num 0.0192 0.035 0.0172 0.0158 0.0165 ...
## $ dimension_se : num 0.00225 0.00332 0.0022 0.00344 0.00177 ...
## $ radius_worst : num 13.5 11.9 12.4 11.9 16.2 ...
## $ texture_worst : num 15.6 22.9 26.4 15.8 15.7 ...
## $ perimeter_worst : num 87 78.3 79.9 76.5 104.5 ...
## $ area_worst : num 549 425 471 434 819 ...
## $ smoothness_worst : num 0.139 0.121 0.137 0.137 0.113 ...
## $ compactness_worst: num 0.127 0.252 0.148 0.182 0.174 ...
## $ concavity_worst : num 0.1242 0.1916 0.1067 0.0867 0.1362 ...
## $ points_worst : num 0.0939 0.0793 0.0743 0.0861 0.0818 ...
## $ symmetry_worst : num 0.283 0.294 0.3 0.21 0.249 ...
## $ dimension_worst : num 0.0677 0.0759 0.0788 0.0678 0.0677 ...
The first variable, id, is simply a unique identifier for each observation. We will take it out for the purposes of our analysis. Now, let’s also get an idea of what we are going to be classifying.
wbcd = wbcd[-1]
table(wbcd$diagnosis)
##
## B M
## 357 212
round(prop.table(table(wbcd$diagnosis))*100, digits=1)
##
## B M
## 62.7 37.3
There are 357 cells that were identified as benign and 212 that were identified as malignant. As percentages, 62.7% were benign and 37.3% were malignant. In the context of the kNN algorithm we will be running, this means the worst we should expect to do, on average, is an error rate of about 37.3%. If each test observation is classified with the majority label of all training points, everything will be classified as benign, and the error rate will be the proportion of malignant cells in the testing sample. This will be, on average, 37.3% if our test set is chosen randomly.
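To make that baseline concrete, here is a quick check (a small sketch, not part of the original analysis) computing the no-information error rate directly, i.e. the error of always predicting the majority class, benign:
#Error rate of always predicting "B": the proportion of malignant cases (212/569).
mean(wbcd$diagnosis == "M")
## [1] 0.3725835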
So, let’s inspect the data in more detail now. Consider just a few features.
summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])
## radius_mean area_mean smoothness_mean
## Min. : 6.981 Min. : 143.5 Min. :0.05263
## 1st Qu.:11.700 1st Qu.: 420.3 1st Qu.:0.08637
## Median :13.370 Median : 551.1 Median :0.09587
## Mean :14.127 Mean : 654.9 Mean :0.09636
## 3rd Qu.:15.780 3rd Qu.: 782.7 3rd Qu.:0.10530
## Max. :28.110 Max. :2501.0 Max. :0.16340
The mean area is on a much larger scale than the mean radius, which itself is on a much larger scale than the mean smoothness. As such, the mean area will contribute much more to the distances between observations, and the mean smoothness will hardly contribute at all. So, normalization or scaling of the features is required.
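To see this concretely, here is a quick sketch (not part of the original write-up; feats is just a helper vector) of how much each of these three features contributes to the squared Euclidean distance between the first two observations on the raw scale. The area term dwarfs the others.
#Per-feature contribution to the squared distance between observations 1 and 2.
feats = c("radius_mean", "area_mean", "smoothness_mean")
(wbcd[1, feats] - wbcd[2, feats])^2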
Normalization Techniques
This process of normalization is often described as putting the features on the same scale. We will start out with the two most common normalizations.
Feature Normalization: \[\frac{X_i - min(X)}{max(X) - min(X)}\]
Z-Score Normalization: \[\frac{X_i - \bar{X}}{s}\]
Feature normalization is named as such because of its popularity in this context. It should also be noted that our z-score normalization uses the sample mean and standard deviation rather than population parameters, so each value is really a studentized score rather than a true z-score.
While “putting the features on the same scale” is pretty intuitive terminology, it can be helpful to think of the process of normalization as, in a sense, putting all the features on no scale at all. Both of the formulas above result in a unitless number, because the units in the numerator “cancel out” with the units in the denominator. This perspective will come up again later.
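The z-score formula also translates directly into a one-line R function; as a quick sketch (znorm() is just a name chosen here, and the built-in scale() function used later does the same thing column by column):
#Hand-rolled z-score normalization; equivalent to R's built-in scale().
znorm = function(x) {
return((x - mean(x)) / sd(x))
}
#Check on one feature: the hand-rolled version matches scale().
all.equal(znorm(wbcd$radius_mean), as.vector(scale(wbcd$radius_mean)))
## [1] TRUE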
With all of that, let’s now get to it. Starting with feature normalization, we can define it as a function and apply it to all features, separating them off into their own dataset to be used in the kNN algorithm. Note that this definitely does put everything on the same scale. All values are between 0 and 1.
fscale = function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
wbcd_features_fscaled = as.data.frame(lapply(wbcd[-1],fscale))
summary(wbcd_features_fscaled[c("radius_mean",
"area_mean",
"smoothness_mean")])
## radius_mean area_mean smoothness_mean
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2233 1st Qu.:0.1174 1st Qu.:0.3046
## Median :0.3024 Median :0.1729 Median :0.3904
## Mean :0.3382 Mean :0.2169 Mean :0.3948
## 3rd Qu.:0.4164 3rd Qu.:0.2711 3rd Qu.:0.4755
## Max. :1.0000 Max. :1.0000 Max. :1.0000
Now, let’s separate the observations into test and training sets (we’ll put off validation sets for the purposes of this presentation). We’ll use 100 observations as our test set and the remaining 469 observations as our training set. Because the observations have already been randomized, we can simply choose the first 469 observations as our training set for coding convenience.
Note the naming convention used so that it is clear what each variable represents: “features” refers to the independent variables that will be used to predict the dependent variable, the label. I will also tag any variable that depends on the normalization function with a suffix, as seen below with _f.
wbcd_features_train_f = wbcd_features_fscaled[1:469, ]
wbcd_features_test_f = wbcd_features_fscaled[470:569, ]
wbcd_label_train = wbcd[1:469, 1]
wbcd_label_test = wbcd[470:569, 1]
Now, using the knn() function from the class package, we can store the predicted labels in a variable. To compare these with the true test labels, we will use the table() function from Base R.
We could use any number of nearest neighbors from \(k=1\) to \(k=469\); we chose \(k=42\) for absolutely no practical reason.
wbcd_label_pred_f = knn(train = wbcd_features_train_f,
cl = wbcd_label_train,
test = wbcd_features_test_f,
k = 42)
wbcd_table_f = table(wbcd_label_test,
wbcd_label_pred_f)
wbcd_table_f
## wbcd_label_pred_f
## wbcd_label_test B M
## B 61 0
## M 4 35
wbcd_percent_error_f = 1-sum(diag(prop.table(wbcd_table_f)))
wbcd_percent_error_f
## [1] 0.04
From the above table we can see that 4 malignant tumors were misclassified as benign and no benign tumors were classified as malignant. Note that we can also sum the diagonal of the proportion table and subtract it from 1 to compute the total error rate programmatically. This will be quite useful later.
Now, we will repeat what we just did, but using z-score normalization. R has a built-in z-score normalization that we will be using: the scale() function.
wbcd_features_zscaled = as.data.frame(scale(wbcd[-1]))
summary(wbcd_features_zscaled[c("radius_mean",
"area_mean",
"smoothness_mean")])
## radius_mean area_mean smoothness_mean
## Min. :-2.0279 Min. :-1.4532 Min. :-3.10935
## 1st Qu.:-0.6888 1st Qu.:-0.6666 1st Qu.:-0.71034
## Median :-0.2149 Median :-0.2949 Median :-0.03486
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.4690 3rd Qu.: 0.3632 3rd Qu.: 0.63564
## Max. : 3.9678 Max. : 5.2459 Max. : 4.76672
Again, we separate into training and test sets, run knn(), and compare with the test set labels using table(). Note the effect of the normalization: the variables are centered at 0, with each value representing the number of standard deviations away from the center. This is very different from feature normalization.
wbcd_features_train_z = wbcd_features_zscaled[1:469, ]
wbcd_features_test_z = wbcd_features_zscaled[470:569, ]
wbcd_label_pred_z = knn(train = wbcd_features_train_z,
cl = wbcd_label_train,
test = wbcd_features_test_z,
k = 42)
wbcd_table_z = table(wbcd_label_test,
wbcd_label_pred_z)
wbcd_table_z
## wbcd_label_pred_z
## wbcd_label_test B M
## B 61 0
## M 4 35
wbcd_percent_error_z = 1-sum(diag(prop.table(wbcd_table_z)))
wbcd_percent_error_z
## [1] 0.04
We have an error rate of 4%, the same as the previous error rate when using feature normalization. While interesting, this does not leave us with any real answers as to how we might confidently predict malignancy in future cells. Does it not matter what normalization we use? Would a particular one be better if a different value of \(k\) were used?
Now The Fun Starts
We can start answering these questions by testing out all possible values of \(k\). Wrapping the above code inside a for loop, we will store the error rates for feature normalization and z-score normalization so that we can compare them on a broader scale. We will plot them together and use a lowess curve to smooth each of the scatter plots so we can compare them more easily.
#Testing out all possible levels of k for feature normalization
min.k = 1
max.k = length(wbcd_label_train)
k = min.k:max.k
wbcd_error_f = numeric(max.k-min.k+1)
for(i in k) {
wbcd_label_pred_f = knn(train = wbcd_features_train_f,
cl = wbcd_label_train,
test = wbcd_features_test_f,
k = i)
wbcd_table_f = table(wbcd_label_test,
wbcd_label_pred_f)
wbcd_error_f[i-min.k+1] = 1-sum(diag(prop.table(wbcd_table_f)))
}
#Testing out all possible levels of k for z-score normalization
wbcd_error_z = numeric(max.k-min.k+1)
for(i in k) {
wbcd_label_pred_z = knn(train = wbcd_features_train_z,
cl = wbcd_label_train,
test = wbcd_features_test_z,
k = i)
wbcd_table_z = table(wbcd_label_test,
wbcd_label_pred_z)
wbcd_error_z[i-min.k+1] = 1-sum(diag(prop.table(wbcd_table_z)))
}
#Plotting both error rates as functions of k
plot(k, wbcd_error_z,
type="p",
col="red",
main="False Positive + False Negative Rate",
ylab="Total Error Rate")
points(k, wbcd_error_f,
type="p",
col="blue")
lines(lowess(k, wbcd_error_z, f=0.6),
col="red")
lines(lowess(k, wbcd_error_f, f=0.6),
col="blue")
legend("topleft",
legend=c("Feature Normalization", "Z-score Normalization"),
fill=c("blue", "red"))
As expected, the error rates go up as \(k\) gets larger. They approach and eventually reach the roughly 37.3% stated earlier: \(k\) becomes so large that knn() classifies everything as benign, so the roughly 37.3% of the testing set observations that are malignant are all misclassified.
On the whole, it seems that feature normalization performs better than z-score normalization. We have a couple of paths we can take now. The standard approach would be to focus on small values of \(k\) and see if we can optimize there… but I want to do something a bit more unconventional and interesting.
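(For reference, and as a sketch only, the standard route would simply read off the value of \(k\) that minimizes each of the error vectors we just computed:)
#k with the lowest total error rate for each normalization.
#which.min() returns the first minimum, so ties are broken by the smallest k.
k[which.min(wbcd_error_f)]
k[which.min(wbcd_error_z)]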
More Normalization Functions
The whole point of normalization is that it puts values into comparable unitless numbers. There are many ways to do this, albeit not always standard ways. For example, we could look at each value of a feature as a percentage of the maximum value of that feature:
Percent-of-Max Normalization: \[\frac{X_i}{max(X)}\]
maxscale = function(x) {
return(x/max(x))
}
wbcd_features_maxscaled = as.data.frame(lapply(wbcd[-1],maxscale))
summary(wbcd_features_maxscaled[c("radius_mean",
"area_mean",
"smoothness_mean")])
## radius_mean area_mean smoothness_mean
## Min. :0.2483 Min. :0.05738 Min. :0.3221
## 1st Qu.:0.4162 1st Qu.:0.16805 1st Qu.:0.5286
## Median :0.4756 Median :0.22035 Median :0.5867
## Mean :0.5026 Mean :0.26185 Mean :0.5897
## 3rd Qu.:0.5614 3rd Qu.:0.31295 3rd Qu.:0.6444
## Max. :1.0000 Max. :1.00000 Max. :1.0000
Now, using this normalization function on the test and training features, we can run k-nearest neighbors again and calculate a vector of total error rates for each possible value of \(k\).
wbcd_features_train_max = wbcd_features_maxscaled[1:469, ]
wbcd_features_test_max = wbcd_features_maxscaled[470:569, ]
wbcd_error_max = numeric(max.k-min.k+1)
for(i in k) {
wbcd_label_pred_max = knn(train = wbcd_features_train_max,
cl = wbcd_label_train,
test = wbcd_features_test_max,
k = i)
wbcd_table_max = table(wbcd_label_test,
wbcd_label_pred_max)
wbcd_error_max[i-min.k+1] = 1-sum(diag(prop.table(wbcd_table_max)))
}
We could also try scaling each value as a percentage of the mean.
Percent-of-Mean Normalization: \[\frac{X_i}{\bar{X}}\]
muscale = function(x) {
return(x/mean(x))
}
wbcd_features_muscaled = as.data.frame(lapply(wbcd[-1],muscale))
summary(wbcd_features_muscaled[c("radius_mean",
"area_mean",
"smoothness_mean")])
## radius_mean area_mean smoothness_mean
## Min. :0.4941 Min. :0.2191 Min. :0.5462
## 1st Qu.:0.8282 1st Qu.:0.6418 1st Qu.:0.8963
## Median :0.9464 Median :0.8415 Median :0.9949
## Mean :1.0000 Mean :1.0000 Mean :1.0000
## 3rd Qu.:1.1170 3rd Qu.:1.1952 3rd Qu.:1.0928
## Max. :1.9898 Max. :3.8190 Max. :1.6957
And again, using this normalization function on the test and training features, we can run k-nearest neighbors and calculate a vector of total error rates for each possible value of \(k\).
wbcd_features_train_mu = wbcd_features_muscaled[1:469, ]
wbcd_features_test_mu = wbcd_features_muscaled[470:569, ]
wbcd_error_mu = numeric(max.k-min.k+1)
for(i in k) {
wbcd_label_pred_mu = knn(train = wbcd_features_train_mu,
cl = wbcd_label_train,
test = wbcd_features_test_mu,
k = i)
wbcd_table_mu = table(wbcd_label_test,
wbcd_label_pred_mu)
wbcd_error_mu[i-min.k+1] = 1-sum(diag(prop.table(wbcd_table_mu)))
}
Down the Rabbit Hole
Before we finally visualize all of this by graphically comparing these error rates to the performance of feature and z-score normalization, let’s take things to an extreme. To illustrate the primary purpose of a normalization function, namely that it makes the data unitless, consider the following normalization function.
Wacky Normalization: \[\frac{s^2 X_i}{max(X) \left( min(X) + s \right) \sqrt{\sum\limits_{i=1}^{n} {X_i}^2}}\]
It looks pretty weird, but if you follow through on the units of each part of the expression, everything does cancel, giving a unitless number in the end. It is a perfectly valid normalization function.
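Spelling that unit check out, with \(u\) denoting the units of \(X\): the numerator \(s^2 X_i\) carries units \(u^2 \cdot u = u^3\), and the denominator \(max(X)\left(min(X)+s\right)\sqrt{\sum_{i=1}^{n} X_i^2}\) carries units \(u \cdot u \cdot u = u^3\), so the ratio behaves as

\[\frac{u^3}{u^3} = 1,\]

a unitless number.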
wackyscale = function(x) {
return((x*var(x)) / (sqrt(sum(x^2))*max(x)*(min(x)+sd(x))))
}
wbcd_features_wackyscaled = as.data.frame(lapply(wbcd[-1],wackyscale))
summary(wbcd_features_wackyscaled[c("radius_mean",
"area_mean",
"smoothness_mean")])
## radius_mean area_mean smoothness_mean
## Min. :0.0008454 Min. :0.0008089 Min. :0.0004112
## 1st Qu.:0.0014168 1st Qu.:0.0023693 1st Qu.:0.0006749
## Median :0.0016190 Median :0.0031067 Median :0.0007491
## Mean :0.0017107 Mean :0.0036918 Mean :0.0007529
## 3rd Qu.:0.0019109 3rd Qu.:0.0044123 3rd Qu.:0.0008228
## Max. :0.0034040 Max. :0.0140987 Max. :0.0012768
For the last time, using this normalization function on the test and training features, we can run k-nearest neighbors and calculate a vector of total error rates for each possible value of \(k\).
wbcd_features_train_wacky = wbcd_features_wackyscaled[1:469, ]
wbcd_features_test_wacky = wbcd_features_wackyscaled[470:569, ]
wbcd_error_wacky = numeric(max.k-min.k+1)
for(i in k) {
wbcd_label_pred_wacky = knn(train = wbcd_features_train_wacky,
cl = wbcd_label_train,
test = wbcd_features_test_wacky,
k = i)
wbcd_table_wacky = table(wbcd_label_test,
wbcd_label_pred_wacky)
wbcd_error_wacky[i-min.k+1] = 1-sum(diag(prop.table(wbcd_table_wacky)))
}
Finally, plotting all of the error rates so that we can compare them, we get the following:
plot(k, wbcd_error_z,
type="n",
col="red",
main="False Positive + False Negative Rate",
ylab="Total Error Rate")
lines(lowess(k, wbcd_error_z, f=0.6),
col="red")
lines(lowess(k, wbcd_error_f, f=0.6),
col="blue")
lines(lowess(k, wbcd_error_max, f=0.6),
col="green")
lines(lowess(k, wbcd_error_mu, f=0.6),
col="orange")
lines(lowess(k, wbcd_error_wacky, f=0.6),
col="black")
legend("topleft",
legend=c("Feature Normalization",
"Z-score Normalization",
"Max Normalization",
"Mean Normalization",
"Wacky Normalization"),
fill=c("blue", "red", "green", "orange", "black"))
Every normalization function, even wackyscale(), behaves as expected, with increasing error as \(k\) gets larger. However, note the subtlety in their performance: for small values of \(k\), where the total error rate is small for all functions, it looks like feature, z-score, and percentage-of-max normalization perform the best.
Zooming in on the Solution
Let’s focus on small values of \(k\) to see if there are any subtle changes there that may highlight an optimum value of \(k\) to use for this data.
small_k = 1:50
all_values = c(wbcd_error_z[small_k],
wbcd_error_f[small_k],
wbcd_error_max[small_k])
plot(small_k, wbcd_error_z[small_k],
type="p",
col="red",
main="False Positive + False Negative Rate",
ylab="Total Error Rate",
xlab="k",
ylim=c(min(all_values), max(all_values)))
points(small_k, wbcd_error_f[small_k],
type="p",
col="blue")
points(small_k, wbcd_error_max[small_k],
type="p",
col="green")
lines(lowess(small_k, wbcd_error_z[small_k], f=0.6),
col="red")
lines(lowess(small_k, wbcd_error_f[small_k], f=0.6),
col="blue")
lines(lowess(small_k, wbcd_error_max[small_k], f=0.6),
col="green")
legend("topleft",
legend=c("Feature Normalization",
"Z-score Normalization",
"Max Normalization"),
fill=c("blue", "red", "green"))
The Final Process
While from our previous, broader perspective it seemed like feature normalization had the best performance, it looks like percentage-of-max normalization performs better at lower values of \(k\), minimizing the overall error rate. However, there is something we haven’t yet taken into account: this entire time we have been using the exact same testing and training sets for every normalization and every value of \(k\). To truly get an idea of how kNN will perform with the different normalization functions, we should randomize the testing and training sets for every single run of the algorithm. So, let us improve our code once again by using the sample() function from Base R.
#Re-randomizing the sample each time we run knn(), inside the for loop.
test_size = 100
#Testing out various levels of k for feature normalization
wbcd_error_f = numeric(max.k-min.k+1)
for(i in k) {
#Draw the test set from all 569 observations, not just the first 469
test_sample = sample(1:nrow(wbcd), size=test_size)
wbcd_features_train_f = wbcd_features_fscaled[-test_sample, ]
wbcd_features_test_f = wbcd_features_fscaled[test_sample, ]
wbcd_label_train = wbcd[-test_sample, 1]
wbcd_label_test = wbcd[test_sample, 1]
wbcd_label_pred_f = knn(train = wbcd_features_train_f,
cl = wbcd_label_train,
test = wbcd_features_test_f,
k = i)
wbcd_table_f = table(wbcd_label_test,
wbcd_label_pred_f)
wbcd_error_f[i-min.k+1] = 1-sum(diag(prop.table(wbcd_table_f)))
}
#Testing out various levels of k for z-score normalization
wbcd_error_z = numeric(max.k-min.k+1)
for(i in k) {
test_sample = sample(1:nrow(wbcd), size=test_size)
wbcd_features_train_z = wbcd_features_zscaled[-test_sample, ]
wbcd_features_test_z = wbcd_features_zscaled[test_sample, ]
wbcd_label_train = wbcd[-test_sample, 1]
wbcd_label_test = wbcd[test_sample, 1]
wbcd_label_pred_z = knn(train = wbcd_features_train_z,
cl = wbcd_label_train,
test = wbcd_features_test_z,
k = i)
wbcd_table_z = table(wbcd_label_test,
wbcd_label_pred_z)
wbcd_error_z[i-min.k+1] = 1-sum(diag(prop.table(wbcd_table_z)))
}
#Testing out various levels of k for percentage-of-max normalization
wbcd_error_max = numeric(max.k-min.k+1)
for(i in k) {
test_sample = sample(1:nrow(wbcd), size=test_size)
wbcd_features_train_max = wbcd_features_maxscaled[-test_sample, ]
wbcd_features_test_max = wbcd_features_maxscaled[test_sample, ]
wbcd_label_train = wbcd[-test_sample, 1]
wbcd_label_test = wbcd[test_sample, 1]
wbcd_label_pred_max = knn(train = wbcd_features_train_max,
cl = wbcd_label_train,
test = wbcd_features_test_max,
k = i)
wbcd_table_max = table(wbcd_label_test,
wbcd_label_pred_max)
wbcd_error_max[i-min.k+1] = 1-sum(diag(prop.table(wbcd_table_max)))
}
#Plotting all error rates as functions of k
small_k = 1:50
all_values = c(wbcd_error_z[small_k],
wbcd_error_f[small_k],
wbcd_error_max[small_k])
plot(small_k, wbcd_error_z[small_k],
type="p",
col="red",
main="False Positive + False Negative Rate",
ylab="Total Error Rate",
xlab="k",
ylim=c(min(all_values), max(all_values)))
points(small_k, wbcd_error_f[small_k],
type="p",
col="blue")
points(small_k, wbcd_error_max[small_k],
type="p",
col="green")
lines(lowess(small_k, wbcd_error_z[small_k], f=0.6),
col="red")
lines(lowess(small_k, wbcd_error_f[small_k], f=0.6),
col="blue")
lines(lowess(small_k, wbcd_error_max[small_k], f=0.6),
col="green")
legend("topleft",
legend=c("Feature Normalization",
"Z-score Normalization",
"Max Normalization"),
fill=c("blue", "red", "green"))
Note how randomizing the testing and training sets “spreads” the points out a bit, and how the lowess curves are no longer as distinct from one another. For example, z-score normalization no longer seems to perform that much worse than feature and percentage-of-max normalization. However, percentage-of-max normalization does still perform somewhat better than the others when \(k<10\). Additionally, the optimum value of the tuning parameter is \(k=1\).
From the analysis shown, a pretty good classifier for breast cancer tumors would be the 1-nearest neighbor algorithm using percentage-of-max normalization of the data.
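As a final sanity check (a sketch that goes one small step beyond the analysis above; n_repeats, errors, and pred are new helper names, and the choice of 50 repeats is arbitrary), this 1-nearest-neighbor, percentage-of-max configuration could be re-evaluated over many random splits and its error averaged, so the conclusion does not rest on a single split per value of \(k\):
#Average test error of 1-NN with percentage-of-max normalization
#over repeated random train/test splits.
n_repeats = 50
errors = numeric(n_repeats)
for(r in 1:n_repeats) {
test_sample = sample(1:nrow(wbcd), size=test_size)
pred = knn(train = wbcd_features_maxscaled[-test_sample, ],
cl = wbcd[-test_sample, 1],
test = wbcd_features_maxscaled[test_sample, ],
k = 1)
errors[r] = mean(pred != wbcd[test_sample, 1])
}
mean(errors)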