How to normalize skewed multivariate data (in R)?

r
machine_learning
data_science

#1

I have customer data with variables such as credit limit, DSO days, etc. The data contain outliers, missing values, and zeros.
I want to cluster the customers into low-risk, medium-risk, and high-risk groups using k-means, and then use KNN for prediction.

Descriptive statistics showed the data to be non-normal. I treated outliers using quantiles, imputed missing values with the minimum value, and then normalized the data using normalization/standardization techniques. I then applied k-means clustering, which did not give proper results (the clusters overlap).
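For reference, the preprocessing described above could look roughly like the sketch below. The data frame, its column names, and the 5%/95% capping thresholds are all assumptions made for illustration; the post does not say which quantiles or columns were actually used.

```r
# Hypothetical customer data; column names and values are illustrative only.
set.seed(1)
customers <- data.frame(
  credit_limit = c(rlnorm(48, meanlog = 10, sdlog = 1), NA, 5e6),
  dso_days     = c(rnorm(48, mean = 45, sd = 20), 400, NA)
)

# Cap values outside the given quantiles (5%/95% are assumed cut-offs).
cap_outliers <- function(x, lower = 0.05, upper = 0.95) {
  q <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])   # NA values stay NA at this step
}

# Replace missing values with the column minimum, as described in the post.
impute_min <- function(x) {
  x[is.na(x)] <- min(x, na.rm = TRUE)
  x
}

# Apply both steps to every column, then standardize.
prepared <- as.data.frame(
  lapply(customers, function(x) impute_min(cap_outliers(x)))
)
scaled <- scale(prepared)   # mean 0, standard deviation 1 per column

summary(prepared)
```

Note that imputing the minimum piles all missing rows onto one extreme value, which may itself distort the distances k-means relies on.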

Questions:

  1. Do we need to normalize the data for clustering?
    If so, how? Please suggest a procedure in R.
  2. What kind of data transformation is useful in this case? How should negative values, which cannot be avoided, be handled?

Bottom line: I am struggling to prepare the data for k-means.


#2

For clustering, scaling is important because the method is based on distance calculations and the variables may be on different scales. Clustering will have problems if you use raw values such as income and age: income may be very large while age is only a two-digit number, so income will dominate the distance and therefore the clustering. So scaling is important here.

Try ?scale in R; the help page explains it better than I can.
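As a minimal sketch of what that looks like in practice (the income and age values below are made up for illustration), `scale()` centres each column to mean 0 and rescales it to standard deviation 1, so no single variable dominates the Euclidean distance:

```r
# Toy data on very different scales (values are invented for illustration).
df <- data.frame(
  income = c(25000, 48000, 52000, 110000, 75000),
  age    = c(23, 35, 41, 58, 30)
)

scaled <- scale(df)   # centre each column, then divide by its standard deviation

colMeans(scaled)      # approximately 0 for every column
apply(scaled, 2, sd)  # 1 for every column

# Pairwise Euclidean distances before and after scaling:
dist(df)       # driven almost entirely by income
dist(scaled)   # both variables now contribute
```

The scaled matrix can be passed straight to `kmeans()`. The centring and scaling values are stored in the `"scaled:center"` and `"scaled:scale"` attributes, which is useful if new customers have to be put on the same scale later.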


#3


Hi, thanks for your answer.
After scaling all variables and applying k-means, the “clusplot” for the model looks like this. How can I interpret it? What do the components in the plot represent? And what does the other “kmeans plot” show?


#4

I can't say much from looking at the graphs. Just a question out of curiosity: how did you decide the number of clusters? If you chose it arbitrarily, you can try plotting the within-cluster sum of squares (WSS) against the number of clusters and selecting on that basis; examples of this plot are easy to find. After that, you have to characterize each cluster based on its features and check whether the cluster definitions make sense to you. One more thing: if you are doing PCA, you should select enough components to explain at least 80% of the variance.
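A rough sketch of those three checks is below. It uses the built-in `iris` measurements as a stand-in for the customer data, and the range of k, `nstart = 25`, and the final choice of 3 clusters are assumptions for illustration only:

```r
set.seed(42)
x <- scale(iris[, 1:4])   # stand-in data; replace with your scaled variables

# 1. WSS for k = 1..10; look for the "elbow" where the curve flattens.
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

# 2. PCA: smallest number of components explaining at least 80% of variance.
pca <- prcomp(x)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_comp <- which(cum_var >= 0.80)[1]
n_comp

# 3. Characterize clusters by their mean feature values (k = 3 is assumed here).
fit <- kmeans(x, centers = 3, nstart = 25)
aggregate(as.data.frame(x), by = list(cluster = fit$cluster), FUN = mean)
```

The table of cluster means is in scaled units, so values well above or below 0 show which features set a cluster apart.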


#5

I used the elbow method. The WSS plot suggested 4 clusters, and those are the cluster plots I got.