Saturday, April 8, 2017

K-Nearest Neighbors on Wisconsin Breast Cancer Data

The purpose of this presentation is to illustrate how the basic usage of k-Nearest Neighbors must be expanded upon to get a clearer picture of how the algorithm performs, and how many options there are for data preparation along the way. We will focus on two primary decisions that must be made in the process:

  1. The choice of normalization function
  2. The choice of \(k\), the number of nearest neighbors to use

I will be assuming that you understand the basics of the kNN algorithm: that each observation in the testing set is classified by identifying the \(k\) nearest observations in the training set and assigning the test observation the majority label among these “nearest neighbors.”
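
For intuition, here is a bare-bones sketch of that rule for a single test observation. The names test_row, train, and labels are hypothetical placeholders; the class package used below handles all of this (plus ties and efficiency) for us.

knn_one = function(test_row, train, labels, k) {
  #Euclidean distance from the test observation to every training observation
  dists = sqrt(rowSums(sweep(as.matrix(train), 2, as.numeric(test_row))^2))
  #Indices of the k closest training observations
  nearest = order(dists)[1:k]
  #Majority label among those k neighbors
  names(which.max(table(labels[nearest])))
}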

The Wisconsin Diagnostic Breast Cancer Dataset

We will be using data collected by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian from the University of Wisconsin in 1992. Characteristics of the cell nuclei from tumors such as radius and texture were measured. These cells were known to be either benign or malignant.

The data was downloaded from the UCI Machine Learning Repository at

http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

The Basics

First, let’s set things up in R by loading the necessary package and importing the data. The class package will be used to run the k-nearest neighbors algorithm, and we will set a specific seed so that you can reproduce this yourself. After importing the data, let’s take a look at the basic structure of the dataset.

library(class)
set.seed(4865)
data_files_path = "C:/**Your Data Files Path**"
wbcd = read.csv(paste(data_files_path, "/wisc_bc_data.csv",
                sep=""))
str(wbcd)
## 'data.frame':    569 obs. of  32 variables:
##  $ id               : int  87139402 8910251 905520 868871 9012568 906539 925291 87880 862989 89827 ...
##  $ diagnosis        : Factor w/ 2 levels "B","M": 1 1 1 1 1 1 1 2 1 1 ...
##  $ radius_mean      : num  12.3 10.6 11 11.3 15.2 ...
##  $ texture_mean     : num  12.4 18.9 16.8 13.4 13.2 ...
##  $ perimeter_mean   : num  78.8 69.3 70.9 73 97.7 ...
##  $ area_mean        : num  464 346 373 385 712 ...
##  $ smoothness_mean  : num  0.1028 0.0969 0.1077 0.1164 0.0796 ...
##  $ compactness_mean : num  0.0698 0.1147 0.078 0.1136 0.0693 ...
##  $ concavity_mean   : num  0.0399 0.0639 0.0305 0.0464 0.0339 ...
##  $ points_mean      : num  0.037 0.0264 0.0248 0.048 0.0266 ...
##  $ symmetry_mean    : num  0.196 0.192 0.171 0.177 0.172 ...
##  $ dimension_mean   : num  0.0595 0.0649 0.0634 0.0607 0.0554 ...
##  $ radius_se        : num  0.236 0.451 0.197 0.338 0.178 ...
##  $ texture_se       : num  0.666 1.197 1.387 1.343 0.412 ...
##  $ perimeter_se     : num  1.67 3.43 1.34 1.85 1.34 ...
##  $ area_se          : num  17.4 27.1 13.5 26.3 17.7 ...
##  $ smoothness_se    : num  0.00805 0.00747 0.00516 0.01127 0.00501 ...
##  $ compactness_se   : num  0.0118 0.03581 0.00936 0.03498 0.01485 ...
##  $ concavity_se     : num  0.0168 0.0335 0.0106 0.0219 0.0155 ...
##  $ points_se        : num  0.01241 0.01365 0.00748 0.01965 0.00915 ...
##  $ symmetry_se      : num  0.0192 0.035 0.0172 0.0158 0.0165 ...
##  $ dimension_se     : num  0.00225 0.00332 0.0022 0.00344 0.00177 ...
##  $ radius_worst     : num  13.5 11.9 12.4 11.9 16.2 ...
##  $ texture_worst    : num  15.6 22.9 26.4 15.8 15.7 ...
##  $ perimeter_worst  : num  87 78.3 79.9 76.5 104.5 ...
##  $ area_worst       : num  549 425 471 434 819 ...
##  $ smoothness_worst : num  0.139 0.121 0.137 0.137 0.113 ...
##  $ compactness_worst: num  0.127 0.252 0.148 0.182 0.174 ...
##  $ concavity_worst  : num  0.1242 0.1916 0.1067 0.0867 0.1362 ...
##  $ points_worst     : num  0.0939 0.0793 0.0743 0.0861 0.0818 ...
##  $ symmetry_worst   : num  0.283 0.294 0.3 0.21 0.249 ...
##  $ dimension_worst  : num  0.0677 0.0759 0.0788 0.0678 0.0677 ...

The first variable, id, is there simply as a unique identifier for each observation. We will take it out for the purposes of our analysis. Now, let’s also get an idea as to what we are going to be classifying.

wbcd = wbcd[-1]
table(wbcd$diagnosis)
## 
##   B   M 
## 357 212
round(prop.table(table(wbcd$diagnosis))*100, digits=1)
## 
##    B    M 
## 62.7 37.3

There are 357 cells that were identified as benign and 212 that were identified as malignant; as percentages, 62.7% were benign and 37.3% were malignant. In the context of the kNN algorithm that we will be running, this means that the worst error rate we should see is, on average, about 37.3%. If every test observation is classified as the majority label of the training set, everything will be classified as benign, and the error rate is simply the proportion of malignant cells in the testing sample. This will be, on average, 37.3% if our test set is chosen randomly.
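
As a quick sanity check, that baseline error rate can be computed directly from the class proportions (a small sketch):

#The error rate of always predicting the majority class ("B") is simply
#the proportion of malignant tumors.
baseline_error = min(prop.table(table(wbcd$diagnosis)))
baseline_error
## [1] 0.3725835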

So, let’s inspect the data in more detail now. Consider just a few features.

summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])
##   radius_mean       area_mean      smoothness_mean  
##  Min.   : 6.981   Min.   : 143.5   Min.   :0.05263  
##  1st Qu.:11.700   1st Qu.: 420.3   1st Qu.:0.08637  
##  Median :13.370   Median : 551.1   Median :0.09587  
##  Mean   :14.127   Mean   : 654.9   Mean   :0.09636  
##  3rd Qu.:15.780   3rd Qu.: 782.7   3rd Qu.:0.10530  
##  Max.   :28.110   Max.   :2501.0   Max.   :0.16340

The mean area is on a much larger scale than the mean radius, which is itself on a much larger scale than the mean smoothness. As such, the mean area will contribute far more to the distances between observations, and the mean smoothness will contribute hardly anything at all. Normalization or scaling of the features is therefore required.
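
To see this concretely, consider the squared differences that enter a Euclidean distance between the first two observations (a quick illustrative sketch; the area_mean term will be several orders of magnitude larger than the other two):

#Squared per-feature differences between observations 1 and 2; area_mean
#dominates the sum, so it dominates the distance.
(wbcd[1, c("radius_mean", "area_mean", "smoothness_mean")] -
   wbcd[2, c("radius_mean", "area_mean", "smoothness_mean")])^2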

Normalization Techniques

This process of normalization is often described as putting the features on the same scale. We will start out with the two most common normalizations.

Feature Normalization: \[\frac{X_i - min(X)}{max(X) - min(X)}\]

Z-Score Normalization: \[\frac{X_i - \bar{X}}{s}\]

Feature normalization is named as such because of its popularity in this context. It should also be noted that z-score normalization is technically the Student’s t-statistic, because we are using the sample mean and standard deviation.

While “putting the features on the same scale” is pretty intuitive terminology, it can be helpful to think of the process of normalization as, in a sense, putting all the features on no scale. Both of the formulas above result in a unitless number, because the units in the numerator “cancel out” with the units in the denominator. This perspective will come up again later.

With all of that, let’s now get to it. Starting with feature normalization, we can define it as a function and apply it to all features, separating them off into their own dataset to be used in the kNN algorithm. Note that this definitely does put everything on the same scale. All values are between 0 and 1.

fscale = function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
wbcd_features_fscaled = as.data.frame(lapply(wbcd[-1],fscale))
summary(wbcd_features_fscaled[c("radius_mean",
                                "area_mean",
                                "smoothness_mean")])
##   radius_mean       area_mean      smoothness_mean 
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2233   1st Qu.:0.1174   1st Qu.:0.3046  
##  Median :0.3024   Median :0.1729   Median :0.3904  
##  Mean   :0.3382   Mean   :0.2169   Mean   :0.3948  
##  3rd Qu.:0.4164   3rd Qu.:0.2711   3rd Qu.:0.4755  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000

Now, let’s separate the observations into training and test sets (we’ll put off validation sets for the purposes of this presentation). We’ll use 100 observations as our test set and the remaining 469 observations as our training set. Because the observations have already been randomized, we can simply take the first 469 observations as our training set for coding convenience.

Note the naming convention used so that it is clear what each variable represents: features refers to the independent variables that will be used to predict the dependent variable, the label. I will also suffix any variable that depends on the normalization function, as seen below with _f for feature normalization.

wbcd_features_train_f = wbcd_features_fscaled[1:469, ]
wbcd_features_test_f = wbcd_features_fscaled[470:569, ]
wbcd_label_train = wbcd[1:469, 1]
wbcd_label_test = wbcd[470:569, 1]

Now, using the knn() function from the class package, we can store the predicted labels in a variable. To compare this with the true test observation labels, we will use the table() function from Base R.

We could use any number of nearest neighbors from \(k=1\) to \(k=469\); we chose \(k=42\) for absolutely no practical reason.

wbcd_label_pred_f = knn(train = wbcd_features_train_f,
                        cl = wbcd_label_train,
                        test = wbcd_features_test_f,
                        k = 42)
wbcd_table_f = table(wbcd_label_test,
                     wbcd_label_pred_f)
wbcd_table_f
##                wbcd_label_pred_f
## wbcd_label_test  B  M
##               B 61  0
##               M  4 35
wbcd_percent_error_f = 1-sum(diag(prop.table(wbcd_table_f)))
wbcd_percent_error_f
## [1] 0.04

From the table above we can see that 4 malignant tumors were misclassified as benign and no benign tumors were misclassified as malignant. Note that we can also sum the diagonal of the proportion table and subtract it from 1 to get the total error rate programmatically. This will be quite useful later.

Now, we will repeat what we just did, but using z-score normalization (a.k.a. the Student’s t-statistic). R has a built-in z-score normalization that we will be using: the scale() function.

wbcd_features_zscaled = as.data.frame(scale(wbcd[-1]))
summary(wbcd_features_zscaled[c("radius_mean",
                                "area_mean",
                                "smoothness_mean")])
##   radius_mean        area_mean       smoothness_mean   
##  Min.   :-2.0279   Min.   :-1.4532   Min.   :-3.10935  
##  1st Qu.:-0.6888   1st Qu.:-0.6666   1st Qu.:-0.71034  
##  Median :-0.2149   Median :-0.2949   Median :-0.03486  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.4690   3rd Qu.: 0.3632   3rd Qu.: 0.63564  
##  Max.   : 3.9678   Max.   : 5.2459   Max.   : 4.76672

Again, we now separate into training and test sets, run knn(), and compare with the test set labels using table(). Note the effect of the normalization: the variables are centered at 0, with each value representing the number of standard deviations away from the center. This is very different from feature normalization.

wbcd_features_train_z = wbcd_features_zscaled[1:469, ]
wbcd_features_test_z = wbcd_features_zscaled[470:569, ]
wbcd_label_pred_z = knn(train = wbcd_features_train_z,
                        cl = wbcd_label_train,
                        test = wbcd_features_test_z,
                        k = 42)
wbcd_table_z = table(wbcd_label_test,
                     wbcd_label_pred_z)
wbcd_table_z
##                wbcd_label_pred_z
## wbcd_label_test  B  M
##               B 61  0
##               M  4 35
wbcd_percent_error_z = 1-sum(diag(prop.table(wbcd_table_z)))
wbcd_percent_error_z
## [1] 0.04

We have an error rate of 4%, the same as the previous error rate when using feature normalization. While interesting, this does not leave us with any real answers as to how we might confidently predict malignancy in future cells. Does it not matter what normalization we use? Would a particular one be better if a different value of \(k\) were used?

Now The Fun Starts

We can start answering these questions by testing out all possible values of \(k\). Wrapping the above code inside a for loop, we will store the error rates for feature normalization and z-score normalization so that we can compare them on a broader scale. We will plot them together and use a lowess curve to smooth out each scatter plot so we can compare them more easily.

#Testing out all possible levels of k for feature normalization
min.k = 1
max.k = length(wbcd_label_train)
k = min.k:max.k
wbcd_error_f = numeric(max.k-min.k+1)
for(i in k) {
  wbcd_label_pred_f = knn(train = wbcd_features_train_f,
                          cl = wbcd_label_train,
                          test = wbcd_features_test_f,
                          k = i)
  wbcd_table_f = table(wbcd_label_test,
                       wbcd_label_pred_f)
  wbcd_error_f[i-min.k+1] = 1-sum(diag(prop.table(wbcd_table_f)))
}
#Testing out all possible levels of k for z-score normalization
wbcd_error_z = numeric(max.k-min.k+1)
for(i in k) {
  wbcd_label_pred_z = knn(train = wbcd_features_train_z,
                          cl = wbcd_label_train,
                          test = wbcd_features_test_z,
                          k = i)
  wbcd_table_z = table(wbcd_label_test,
                       wbcd_label_pred_z)
  wbcd_error_z[i-min.k+1] = 1-sum(diag(prop.table(wbcd_table_z)))
}
#Plotting both error rates as functions of k
plot(k, wbcd_error_z,
     type="p",
     col="red",
     main="False Positive + False Negative Rate",
     ylab="Total Error Rate")
points(k, wbcd_error_f,
       type="p",
       col="blue")
lines(lowess(k, wbcd_error_z, f=0.6),
      col="red")
lines(lowess(k, wbcd_error_f, f=0.6),
      col="blue")
legend("topleft",
       legend=c("Feature Normalization", "Z-score Normalization"),
       fill=c("blue", "red"))

As expected, the error rates go up as \(k\) gets larger. They approach and reach the approximately 37.3% that was stated earlier: \(k\) is so large that knn() is classifying everything as benign, so the roughly 37.3% of the testing set observations that are malignant are all misclassified.
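
We can confirm this degenerate behavior directly (a quick sketch; since benign is the majority class in the training set, every test observation should come back labeled "B"):

#With k equal to the entire training set, every neighbor vote is the same,
#so knn() simply returns the overall majority label for every test point.
wbcd_label_pred_allk = knn(train = wbcd_features_train_f,
                           cl = wbcd_label_train,
                           test = wbcd_features_test_f,
                           k = length(wbcd_label_train))
table(wbcd_label_pred_allk)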

On the whole, it seems that feature normalization performs better than z-score normalization. We have a couple of paths we can take now. The standard thing to do would be to focus on small values of \(k\) and see if we can optimize there…but I want to do something a bit more unconventional and interesting.

More Normalization Functions

The whole point of normalization is that it turns values into comparable unitless numbers. There are many ways to do this, albeit not always standard ways. For example, we could look at each value of a feature as a percentage of the maximum value of that feature:

Percent-of-Max Normalization: \[\frac{X_i}{max(X)}\]

maxscale = function(x) {
  return(x/max(x))
}
wbcd_features_maxscaled = as.data.frame(lapply(wbcd[-1],maxscale))
summary(wbcd_features_maxscaled[c("radius_mean",
                                  "area_mean",
                                  "smoothness_mean")])
##   radius_mean       area_mean       smoothness_mean 
##  Min.   :0.2483   Min.   :0.05738   Min.   :0.3221  
##  1st Qu.:0.4162   1st Qu.:0.16805   1st Qu.:0.5286  
##  Median :0.4756   Median :0.22035   Median :0.5867  
##  Mean   :0.5026   Mean   :0.26185   Mean   :0.5897  
##  3rd Qu.:0.5614   3rd Qu.:0.31295   3rd Qu.:0.6444  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000

Now, using this normalization function on the test and training features, we can run k-nearest neighbors again and calculate a vector of total error rates for each possible value of \(k\).

wbcd_features_train_max = wbcd_features_maxscaled[1:469, ]
wbcd_features_test_max = wbcd_features_maxscaled[470:569, ]
wbcd_error_max = numeric(max.k-min.k+1)
for(i in k) {
  wbcd_label_pred_max = knn(train = wbcd_features_train_max,
                            cl = wbcd_label_train,
                            test = wbcd_features_test_max,
                            k = i)
  wbcd_table_max = table(wbcd_label_test,
                         wbcd_label_pred_max)
  wbcd_error_max[i-min.k+1] = 1-sum(diag(prop.table(wbcd_table_max)))
}

We could also try scaling each value as a percentage of the mean.

Percent-of-Mean Normalization: \[\frac{X_i}{\bar{X}}\]

muscale = function(x) {
  return(x/mean(x))
}
wbcd_features_muscaled = as.data.frame(lapply(wbcd[-1],muscale))
summary(wbcd_features_muscaled[c("radius_mean",
                                 "area_mean",
                                 "smoothness_mean")])
##   radius_mean       area_mean      smoothness_mean 
##  Min.   :0.4941   Min.   :0.2191   Min.   :0.5462  
##  1st Qu.:0.8282   1st Qu.:0.6418   1st Qu.:0.8963  
##  Median :0.9464   Median :0.8415   Median :0.9949  
##  Mean   :1.0000   Mean   :1.0000   Mean   :1.0000  
##  3rd Qu.:1.1170   3rd Qu.:1.1952   3rd Qu.:1.0928  
##  Max.   :1.9898   Max.   :3.8190   Max.   :1.6957

And again, using this normalization function on the test and training features, we can run k-nearest neighbors and calculate a vector of total error rates for each possible value of \(k\).

wbcd_features_train_mu = wbcd_features_muscaled[1:469, ]
wbcd_features_test_mu = wbcd_features_muscaled[470:569, ]
wbcd_error_mu = numeric(max.k-min.k+1)
for(i in k) {
  wbcd_label_pred_mu = knn(train = wbcd_features_train_mu,
                           cl = wbcd_label_train,
                           test = wbcd_features_test_mu,
                           k = i)
  wbcd_table_mu = table(wbcd_label_test,
                        wbcd_label_pred_mu)
  wbcd_error_mu[i-min.k+1] = 1-sum(diag(prop.table(wbcd_table_mu)))
}

Down the Rabbit Hole

Before we finally visualize this by comparing these error rates graphically to the performance of feature and z-score normalization, let’s take this to an extreme. To illustrate the primary purpose of a normalization function, namely that it makes the data unitless, consider the following normalization function.

Wacky Normalization: \[\frac{s^2 X_i}{max(X) \left( min(X) + s \right) \sqrt{\sum\limits_{i=1}^{n} {X_i}^2}}\]

It looks pretty weird, but if you follow through on the units of each part of the expression, everything does cancel, giving a unitless number in the end. It is a perfectly valid normalization function.

wackyscale = function(x) {
  return((x*var(x)) / (sqrt(sum(x^2))*max(x)*(min(x)+sd(x))))
}
wbcd_features_wackyscaled = as.data.frame(lapply(wbcd[-1],wackyscale))
summary(wbcd_features_wackyscaled[c("radius_mean",
                                    "area_mean",
                                    "smoothness_mean")])
##   radius_mean          area_mean         smoothness_mean    
##  Min.   :0.0008454   Min.   :0.0008089   Min.   :0.0004112  
##  1st Qu.:0.0014168   1st Qu.:0.0023693   1st Qu.:0.0006749  
##  Median :0.0016190   Median :0.0031067   Median :0.0007491  
##  Mean   :0.0017107   Mean   :0.0036918   Mean   :0.0007529  
##  3rd Qu.:0.0019109   3rd Qu.:0.0044123   3rd Qu.:0.0008228  
##  Max.   :0.0034040   Max.   :0.0140987   Max.   :0.0012768
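
As a quick check that the units really do cancel, note that rescaling a feature by any positive constant (say, recording radius in different units) leaves the wacky-normalized values unchanged (a small sketch):

#Multiplying a feature by a positive constant does not change its
#wacky-normalized values, confirming that the result is unitless.
all.equal(wackyscale(wbcd$radius_mean),
          wackyscale(wbcd$radius_mean * 1000))
## [1] TRUE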

For the last time, using this normalization function on the test and training features, we can run k-nearest neighbors and calculate a vector of total error rates for each possible value of \(k\).

wbcd_features_train_wacky = wbcd_features_wackyscaled[1:469, ]
wbcd_features_test_wacky = wbcd_features_wackyscaled[470:569, ]
wbcd_error_wacky = numeric(max.k-min.k+1)
for(i in k) {
  wbcd_label_pred_wacky = knn(train = wbcd_features_train_wacky,
                              cl = wbcd_label_train,
                              test = wbcd_features_test_wacky,
                              k = i)
  wbcd_table_wacky = table(wbcd_label_test,
                           wbcd_label_pred_wacky)
  wbcd_error_wacky[i-min.k+1] = 1-sum(diag(prop.table(wbcd_table_wacky)))
}

Finally, plotting all the error rates so that we can compare them, we get the following:

plot(k, wbcd_error_z,
     type="n",
     col="red",
     main="False Positive + False Negative Rate",
     ylab="Total Error Rate")
lines(lowess(k, wbcd_error_z, f=0.6),
      col="red")
lines(lowess(k, wbcd_error_f, f=0.6),
      col="blue")
lines(lowess(k, wbcd_error_max, f=0.6),
      col="green")
lines(lowess(k, wbcd_error_mu, f=0.6),
      col="orange")
lines(lowess(k, wbcd_error_wacky, f=0.6),
      col="black")
legend("topleft",
       legend=c("Feature Normalization",
                "Z-score Normalization",
                "Max Normalization",
                "Mean Normalization",
                "Wacky Normalization"),
       fill=c("blue", "red", "green", "orange", "black"))

Every normalization function, even wackyscale(), behaves as expected, with the error increasing as \(k\) gets larger. However, note the subtlety in their performance: for small values of \(k\), where the total error rate is small for all functions, it looks like feature, z-score, and percentage-of-max normalizations perform the best.

Zooming in on the Solution

Let’s focus on small values of \(k\) to see if there are any subtle changes there that may highlight an optimum value of \(k\) to use for this data.

small_k = 1:50
all_values = c(wbcd_error_z[small_k],
               wbcd_error_f[small_k],
               wbcd_error_max[small_k])
plot(small_k, wbcd_error_z[small_k],
     type="p",
     col="red",
     main="False Positive + False Negative Rate",
     ylab="Total Error Rate",
     xlab="k",
     ylim=c(min(all_values), max(all_values)))
points(small_k, wbcd_error_f[small_k],
       type="p",
       col="blue")
points(small_k, wbcd_error_max[small_k],
       type="p",
       col="green")
lines(lowess(small_k, wbcd_error_z[small_k], f=0.6),
      col="red")
lines(lowess(small_k, wbcd_error_f[small_k], f=0.6),
      col="blue")
lines(lowess(small_k, wbcd_error_max[small_k], f=0.6),
      col="green")
legend("topleft",
       legend=c("Feature Normalization",
                "Z-score Normalization",
                "Max Normalization"),
       fill=c("blue", "red", "green"))

The Final Process

While from our previous, broader perspective it seemed like feature normalization had the best performance, it looks like percentage-of-max normalization performs better at lower values of \(k\), minimizing the overall error rate. However, there is something we haven’t yet taken into account: this entire time we have been using the exact same testing and training sets for every normalization and every value of \(k\). To truly get an idea as to how kNN will perform with the different normalization functions, we should randomize the testing and training sets for every single run of the algorithm. So, let us improve our code once again by using the sample() function in Base R.

#Re-randomizing the sample each time we run knn(), inside the for loop.
test_size = 100
#Testing out various levels of k for feature normalization
wbcd_error_f = numeric(max.k-min.k+1)
for(i in k) {
  test_sample = sample(1:nrow(wbcd), size=test_size)  #draw test rows from all 569 observations
  wbcd_features_train_f = wbcd_features_fscaled[-test_sample, ]
  wbcd_features_test_f = wbcd_features_fscaled[test_sample, ]
  wbcd_label_train = wbcd[-test_sample, 1]
  wbcd_label_test = wbcd[test_sample, 1]
  wbcd_label_pred_f = knn(train = wbcd_features_train_f,
                          cl = wbcd_label_train,
                          test = wbcd_features_test_f,
                          k = i)
  wbcd_table_f = table(wbcd_label_test,
                       wbcd_label_pred_f)
  wbcd_error_f[i-min.k+1] = 1-sum(diag(prop.table(wbcd_table_f)))
}
#Testing out various levels of k for z-score normalization
wbcd_error_z = numeric(max.k-min.k+1)
for(i in k) {
  test_sample = sample(1:nrow(wbcd), size=test_size)  #draw test rows from all 569 observations
  wbcd_features_train_z = wbcd_features_zscaled[-test_sample, ]
  wbcd_features_test_z = wbcd_features_zscaled[test_sample, ]
  wbcd_label_train = wbcd[-test_sample, 1]
  wbcd_label_test = wbcd[test_sample, 1]
  wbcd_label_pred_z = knn(train = wbcd_features_train_z,
                          cl = wbcd_label_train,
                          test = wbcd_features_test_z,
                          k = i)
  wbcd_table_z = table(wbcd_label_test,
                       wbcd_label_pred_z)
  wbcd_error_z[i-min.k+1] = 1-sum(diag(prop.table(wbcd_table_z)))
}
#Testing out various levels of k for percentage-of-max normalization
wbcd_error_max = numeric(max.k-min.k+1)
for(i in k) {
  test_sample = sample(1:nrow(wbcd), size=test_size)  #draw test rows from all 569 observations
  wbcd_features_train_max = wbcd_features_maxscaled[-test_sample, ]
  wbcd_features_test_max = wbcd_features_maxscaled[test_sample, ]
  wbcd_label_train = wbcd[-test_sample, 1]
  wbcd_label_test = wbcd[test_sample, 1]
  wbcd_label_pred_max = knn(train = wbcd_features_train_max,
                            cl = wbcd_label_train,
                            test = wbcd_features_test_max,
                            k = i)
  wbcd_table_max = table(wbcd_label_test,
                         wbcd_label_pred_max)
  wbcd_error_max[i-min.k+1] = 1-sum(diag(prop.table(wbcd_table_max)))
}
#Plotting all error rates as functions of k
small_k = 1:50
all_values = c(wbcd_error_z[small_k],
              wbcd_error_f[small_k],
              wbcd_error_max[small_k])
plot(small_k, wbcd_error_z[small_k],
     type="p",
     col="red",
     main="False Positive + False Negative Rate",
     ylab="Total Error Rate",
     xlab="k",
     ylim=c(min(all_values), max(all_values)))
points(small_k, wbcd_error_f[small_k],
       type="p",
       col="blue")
points(small_k, wbcd_error_max[small_k],
       type="p",
       col="green")
lines(lowess(small_k, wbcd_error_z[small_k], f=0.6),
      col="red")
lines(lowess(small_k, wbcd_error_f[small_k], f=0.6),
      col="blue")
lines(lowess(small_k, wbcd_error_max[small_k], f=0.6),
      col="green")
legend("topleft",
       legend=c("Feature Normalization",
                "Z-score Normalization",
                "Max Normalization"),
       fill=c("blue", "red", "green"))

Note how randomizing the testing and training sets “spreads” the points out a bit, and how the lowess curves are no longer as distinct from one another. For example, it no longer seems that z-score normalization performs that much worse than feature and percentage-of-max normalization. However, percentage-of-max normalization still performs somewhat better than the others when \(k<10\). Additionally, the optimum value of the tuning parameter is \(k=1\).

From the analysis shown, a pretty good classifier for breast cancer tumors would be the 1-nearest neighbor algorithm using percentage-of-max normalization of the data.
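
To close the loop, here is a sketch of how such a classifier might be applied to a new tumor. The names new_obs, train_max, and classify_tumor are hypothetical; the key point is that a new observation must be normalized using the maxima from the existing (training) data, not its own values.

#Per-feature maxima from the existing data, used to normalize new observations
train_max = sapply(wbcd[-1], max)
classify_tumor = function(new_obs) {
  #new_obs: a 1-row data frame with the same 30 feature columns as wbcd[-1]
  new_scaled = sweep(new_obs, 2, train_max, "/")  #percentage-of-max normalization
  knn(train = wbcd_features_maxscaled,
      cl = wbcd$diagnosis,
      test = new_scaled,
      k = 1)
}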
