I am currently working on the iris data set in R and using the kNN algorithm for classification. I have used 120 observations for training and the remaining 30 for testing, but for training I have to specify the value of k, and I cannot figure out how to find it.
In kNN, finding the value of k is not easy. A small value of k means that noise will have a higher influence on the result, while a large value makes it computationally expensive. Data scientists usually choose an odd number if the number of classes is 2; another simple approach is to set k = sqrt(n), where n is the number of training samples.
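As a quick sketch of that rule of thumb (using n = 120 training samples, as in the question), you can compute a starting k and nudge it to the nearest odd number:

```r
# Rule-of-thumb starting point for k: sqrt(n), rounded to an odd number
# so that votes cannot tie in a 2-class problem.
n <- 120                      # number of training samples (from the question)
k <- round(sqrt(n))           # sqrt(120) is about 10.95, so k starts at 11
if (k %% 2 == 0) k <- k + 1   # bump even values to the next odd number
k
```

This only gives a starting point; cross-validation (as in the caret answer below in the thread) is still the proper way to pick k.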
Hope this helps!
# k-NN using caret:
library(caret)

# Split the data (stratifying on the outcome, Species):
data(iris)
set.seed(400)  # set the seed before partitioning so the split is reproducible
indxTrain <- createDataPartition(y = iris$Species, p = 0.75, list = FALSE)
training <- iris[indxTrain, ]
testing  <- iris[-indxTrain, ]

# Run k-NN with repeated cross-validation:
ctrl <- trainControl(method = "repeatedcv", repeats = 3)
knnFit <- train(Species ~ ., data = training, method = "knn",
                trControl = ctrl, preProcess = c("center", "scale"),
                tuneLength = 20)
knnFit

# Plotting yields number of neighbours vs. accuracy (based on repeated
# cross-validation), so you can read off the optimal k:
plot(knnFit)
This shows that k = 5 has the highest accuracy rate, so the value of k is 5.
Hi Vibhu, can you please explain the repeats and tuneLength parameters here?
You can apply the elbow method to find k. The sjPlot package in R makes this quick, and the syntax is simple: sjc.elbow(dataframe) will plot the graph.
The k value where there is a bend in the graph tells you how many clusters there are.
For example, for the iris data set:
df <- iris[-5]
sjc.elbow(df)
From the graph, we can say that the bend is at 2, so the number of clusters equals 2.
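If sjc.elbow is not available in your version of sjPlot, the same elbow curve can be computed by hand with base R's kmeans, plotting total within-cluster sum of squares against the number of clusters (this is a sketch of the elbow method, not the sjPlot implementation itself):

```r
df <- iris[-5]  # drop the Species column, keep the four numeric features

# Total within-cluster sum of squares for k = 1..10 clusters
set.seed(123)
wss <- sapply(1:10, function(k) {
  kmeans(scale(df), centers = k, nstart = 25)$tot.withinss
})

# The "elbow" (sharpest bend) in this curve suggests the number of clusters
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Note that this elbow method chooses the number of clusters for k-means; it is not the same k as the number of neighbours in kNN classification.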
Check out this article for a nice explanation of kNN and how to choose the right k value.
Hope this helps,
- ctrl <- trainControl(method = "repeatedcv", repeats = 3)
This is the trainControl function of the caret package. Here we choose repeated cross-validation; repeats = 3 means we repeat the whole procedure 3 times for consistency. The number of folds, which indicates how many parts we split the data into, is omitted here; the default is 10 folds. In my opinion, for roughly 100 samples 10-fold is too high (we only validate ~10 samples each time!). I would suggest using 3-fold or 5-fold CV:
trainControl(method = "repeatedcv", number = 3, repeats = 5, classProbs = TRUE)
- knnFit <- train(Species ~ ., data = training, method = "knn", trControl = ctrl, preProcess = c("center", "scale"), tuneLength = 20)
In the caret train function you can specify tuneLength, which controls how many default values of the tuning parameter(s) are tried; this is a caret feature. I think that for kNN it starts at k = 5 and continues in increments of 2: k = 5, 7, 9, 11, etc. When the cross-validation is performed, caret reports the best option among all the parameter values tested.
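If you want explicit control over which k values are tried instead of relying on tuneLength's defaults, caret also accepts a tuneGrid. A minimal sketch (for method = "knn" the grid must have a single column named k):

```r
library(caret)

# Instead of tuneLength, pass an explicit grid of k values to try.
grid <- expand.grid(k = seq(1, 21, by = 2))  # odd values 1, 3, ..., 21

set.seed(400)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
knnFit <- train(Species ~ ., data = iris, method = "knn",
                trControl = ctrl, preProcess = c("center", "scale"),
                tuneGrid = grid)
knnFit$bestTune  # the k that maximised cross-validated accuracy
```

With tuneGrid, only the listed values are evaluated, so you can restrict the search to odd k as suggested earlier in the thread.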
Hope this helps
It is inappropriate to say which k value suits best without looking at the data. If training samples of similar classes form clusters, then a k value from 1 to 10 will achieve good accuracy. If the data are randomly distributed, one cannot say which k value will give the best results; in such cases, you need to find it by performing an empirical analysis.
Rather than focusing only on finding a suitable k value, use techniques like SVD or PCA to transform the data first; then, believe me, kNN or SVM can classify the data very efficiently.
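A minimal sketch of that idea on iris, using base R's prcomp for PCA and then knn from the class package (the choice of 2 components and k = 5 is just for illustration):

```r
library(class)  # provides knn()

# PCA on the scaled features; keep the first two principal components
pca <- prcomp(iris[, 1:4], scale. = TRUE)
scores <- pca$x[, 1:2]

# Train/test split: 120 train, 30 test (as in the original question)
set.seed(42)
idx <- sample(nrow(iris), 120)

# kNN in the reduced 2-dimensional PCA space
pred <- knn(train = scores[idx, ], test = scores[-idx, ],
            cl = iris$Species[idx], k = 5)

mean(pred == iris$Species[-idx])  # classification accuracy on the test set
```

On iris, the first two principal components already separate the classes well, so accuracy stays high even after dropping half the dimensions.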
set.seed(123)
ctrl = trainControl(method = "cv", number = 3)
knnfit = train(default.payment.next.month ~ ., data = training_set,
               method = "knn", preProcess = c("center", "scale"),
               trcontrol = ctrl, tuneLength = 5)
Something is wrong; all the Kappa metric values are missing:
Min.   : NA    Min.   : NA
1st Qu.: NA    1st Qu.: NA
Median : NA    Median : NA
Mean   : NaN   Mean   : NaN
3rd Qu.: NA    3rd Qu.: NA
Max.   : NA    Max.   : NA
NA's   : 1     NA's   : 1
Can anyone help me with this?
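Two things in that call are worth checking (I can't verify without your data, so treat this as a guess). First, the argument is spelled trControl, not trcontrol, so your cross-validation settings are being silently ignored. Second, caret only computes Accuracy and Kappa for classification, which requires the outcome to be a factor; a numeric 0/1 column makes train do regression instead, which can leave those metrics as NA. A self-contained sketch with a synthetic stand-in for training_set (the column names mirror yours, the data are made up):

```r
library(caret)

# Synthetic stand-in for training_set, just to make the sketch runnable
set.seed(1)
training_set <- data.frame(
  x1 = rnorm(200),
  x2 = rnorm(200),
  default.payment.next.month = sample(0:1, 200, replace = TRUE)
)

# The outcome must be a factor for caret to treat this as classification
# and report Accuracy/Kappa (a numeric 0/1 column triggers regression).
training_set$default.payment.next.month <-
  factor(training_set$default.payment.next.month,
         labels = c("no_default", "default"))

set.seed(123)
ctrl <- trainControl(method = "cv", number = 3)
knnfit <- train(default.payment.next.month ~ ., data = training_set,
                method = "knn", preProcess = c("center", "scale"),
                trControl = ctrl,  # note the capital C: "trcontrol" is ignored
                tuneLength = 5)
```

With a factor outcome and the corrected trControl spelling, knnfit$results should show non-missing Accuracy and Kappa columns.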
In this case, isn’t it advisable or prudent to choose 4 instead?