K-means clustering



Which types of variables can be used in k-means clustering?



The question is rather vague. Could you elaborate on what you mean by types of variables, or describe your exact use case?


I performed k-means clustering on word frequencies from a term-document matrix (TDM). I tried several values of K and, based on the scree plot, settled on 4 clusters.

  1. Three of the four clusters overlap; only one is clearly separate, and the words in each cluster are not very distinct. How can I improve this clustering?
  2. Because the number of columns in a TDM is not fixed, the predict function for k-means does not work. How do I predict the cluster membership of a new data point?
  3. Are there better unsupervised learning methods for document classification?
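
On question 2, one possible approach (a sketch, not a definitive answer) is to freeze the vocabulary used for training, map each new document onto those same columns, and assign it to the nearest centroid. The Python example below uses only the standard library; the vocabulary, the centroid values, and the helper names `vectorize` and `nearest_centroid` are hypothetical and stand in for whatever the fitted model produced.

```python
import math
from collections import Counter

def vectorize(text, vocabulary):
    """Count occurrences of each vocabulary term; unseen words are ignored."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest in Euclidean distance."""
    distances = [math.dist(vector, c) for c in centroids]
    return distances.index(min(distances))

# Hypothetical: the columns of the training matrix, kept fixed after fitting.
vocabulary = ["price", "delivery", "refund", "quality"]

# Hypothetical centroids, one row per cluster, in the same column order.
centroids = [
    [3.0, 0.2, 0.1, 0.5],
    [0.1, 2.5, 0.3, 0.2],
    [0.2, 0.4, 2.8, 0.1],
]

new_doc = "delivery was late and the delivery driver was rude"
vec = vectorize(new_doc, vocabulary)        # same width as the training matrix
cluster = nearest_centroid(vec, centroids)  # index of the assigned cluster
```

Because the new document is always projected onto the training vocabulary, its vector has the same number of columns as the centroids, whatever words it contains.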



I performed k-means clustering on sales data.
I want to create 3 clusters based on bill date, but when I pass the bill date column to k-means it gives an error:

Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In storage.mode(x) <- "double" : NAs introduced by coercion



K-means works on numeric data because it computes distances, so try converting the date column to a numeric value. One option is the number of seconds elapsed since a reference date. Many languages, Java among them, use 1 January 1970 UTC as that reference, known as the Unix epoch. Note that after this conversion the date column will contain very large numbers, so scale the data before clustering.
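
As a rough illustration of the conversion and scaling, here is a small Python sketch using only the standard library. The sample dates, the `%Y-%m-%d` format, and the helper name `to_epoch_seconds` are assumptions for the example; your bill date column may be formatted differently. The same idea carries over to other languages.

```python
from datetime import datetime, timezone
from statistics import mean, pstdev

# Hypothetical bill dates; adjust the format string to match the real data.
bill_dates = ["2018-01-05", "2018-01-20", "2018-03-15", "2018-06-30"]

def to_epoch_seconds(date_string):
    """Parse a YYYY-MM-DD date as UTC and return seconds since the Unix epoch."""
    dt = datetime.strptime(date_string, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return dt.timestamp()

seconds = [to_epoch_seconds(d) for d in bill_dates]

# Standardise to zero mean and unit variance so the large epoch values
# do not dominate the distance calculation.
mu = mean(seconds)
sigma = pstdev(seconds)
scaled = [(s - mu) / sigma for s in seconds]
```

The `scaled` values can then be passed to k-means alongside any other scaled numeric columns.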

Hope this helps.



K-means clustering is a type of unsupervised learning, used when you have unlabeled data (i.e., data without defined categories or groups). The goal of the algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K clusters based on the features provided. Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are:

The centroids of the K clusters, which can be used to label new data

Labels for the training data (each data point is assigned to a single cluster)
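
The iterative procedure described above can be sketched in a few lines of Python. This is a minimal illustration of the standard assign-then-update loop (Lloyd's algorithm), assuming Euclidean distance and random initial centroids drawn from the data; the function name `kmeans`, its parameters, and the sample points are made up for the example and do not come from any particular library.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Minimal k-means: returns (centroids, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Assumption: initialise centroids by picking k distinct data points at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Hypothetical data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(points, k=2)
```

The two return values correspond to the two results listed above: `centroids` can be used to label new points by nearest distance, and `labels` holds the cluster index assigned to each training point.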

Rather than defining groups before looking at the data, clustering lets you find and analyze the groups that have formed organically. The “Choosing K” section below describes how the number of groups can be determined.

Each cluster centroid is a collection of feature values that defines the resulting group. Examining the centroid feature weights can help you qualitatively interpret what kind of group each cluster represents.

This introduction to the K-means clustering algorithm covers:

Common business cases where K-means is used

The steps involved in running the algorithm

A Python example using delivery fleet data