Variable Clustering in R

clustering

#1

Hi All,
I recently got an assignment on variable clustering and model building in R.
A brief description:
Given that the data is clean (free from missing values, outliers, etc.), I first have to create variable clusters (the number of clusters should be decided dynamically from the feature variables), and then build logistic regression models using one variable from each cluster.
Then, based on the evaluation metrics of each model, I have to pick the models that satisfy the hypothesis and run those models on the validation dataset.
I am new to variable clustering and could not find good resources on it.
I did come across the “ClustOfVar” package and the varclus function (Hmisc package), but could not work out exactly which functions to use.


#2

Not sure if I understood it correctly, but the easiest solution should be hierarchical clustering. Generate a dendrogram and pick the best number of clusters. K-Means is another possibility. You can find examples of both in this link: Cluster Analysis. For more details on clustering methods, I recommend this article: Getting your clustering right (Part I)


#3

@caiotaniguchi, sorry for not being descriptive enough.

I have to do “variable clustering” using hierarchical clustering.
After searching the internet for some time, I found that I should be able to do it with hclust().
However, the problem I find is that it won’t decide the number of clusters dynamically. We have to key in the required number of clusters manually.
Using hclust(), I was able to create n clusters. Now, how do I create a logistic regression model using one variable from each cluster at a time? I didn’t understand how to do this part.
Do let me know if the problem is still not clear.
Thanks
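
A minimal sketch of the workflow described in this post, using base R only. It assumes that variables are clustered on the distance 1 - |correlation| (one common choice, not the only one), and all data and names (dat, a, b, c, d, y) are made up for illustration. The number of clusters is still set by hand here.

```r
# Simulated data: two groups of correlated predictors and a binary outcome
# (all names are hypothetical).
set.seed(1)
n  <- 200
z1 <- rnorm(n)
z2 <- rnorm(n)
dat <- data.frame(
  a = z1 + rnorm(n, sd = 0.3),
  b = z1 + rnorm(n, sd = 0.3),
  c = z2 + rnorm(n, sd = 0.3),
  d = z2 + rnorm(n, sd = 0.3)
)
dat$y <- rbinom(n, 1, plogis(dat$a - dat$c))

predictors <- c("a", "b", "c", "d")

# Cluster the variables (not the rows): distance = 1 - |correlation|
var_dist <- as.dist(1 - abs(cor(dat[, predictors])))
hc       <- hclust(var_dist, method = "average")
groups   <- cutree(hc, k = 2)              # k is chosen manually here
clusters <- split(names(groups), groups)   # list: cluster -> variable names

# Every way of taking exactly one variable from each cluster
picks <- expand.grid(clusters, stringsAsFactors = FALSE)

# One logistic regression per row of `picks`
models <- lapply(seq_len(nrow(picks)), function(i) {
  vars <- unname(unlist(picks[i, ]))
  glm(reformulate(vars, response = "y"), data = dat, family = "binomial")
})
```

Each element of models is an ordinary glm fit, so summary(models[[1]]) or AIC(models[[1]]) can be used to compare the candidates.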


#4

Hi @Rishabh0709

If I understand your problem correctly: you now have observations in clusters, that is, you have built groups, so the central point of each cluster becomes your new reference (each element then carries an error relative to the centre of its cluster).

You have to do a classification, so if you pass the cluster number as an independent variable to your logistic regression, you could perhaps build your classifier. It is something like class = f(cluster number), where f() is the logistic regression.

Hope this helps.

Alain


#5

@Rishabh0709, there is something weird with this assignment. It makes no sense to create a model using a single variable from each cluster. For predictions, clusters are either used to segment the data and for each cluster a model is trained (not restricted to a single feature), or the cluster number is added as a new feature and a single model is trained.

Regarding the automation of the number of clusters, this is also strange. Normally, choosing the number of clusters is something that should be done only once unless the data distribution changes a lot over time, so it seems a waste of time. If it is absolutely necessary, you can use K-Means and compute the WSS as described in the Quick-R link from the previous post. The only thing that needs to be modified is to choose the number of clusters by picking the cluster number with the highest second derivative value. This replaces the manual selection by scree plot.
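
A rough sketch of the second-derivative idea above, on made-up data. It assumes K-Means on the rows of a numeric matrix and approximates the second derivative of the WSS curve with a second-order difference; the names x, k_max and best_k are only illustrative.

```r
set.seed(42)
# Toy data: three groups of 50 points around roughly equidistant centres
centers <- rbind(c(0, 0), c(10, 0), c(5, 9))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, mean = centers[i, 1]), rnorm(50, mean = centers[i, 2]))
}))

k_max <- 8
# Total within-cluster sum of squares for k = 1, ..., k_max
wss <- sapply(seq_len(k_max), function(k) {
  kmeans(x, centers = k, nstart = 20)$tot.withinss
})

# Discrete second derivative: wss[k - 1] - 2 * wss[k] + wss[k + 1],
# available for k = 2, ..., k_max - 1
second_diff <- diff(wss, differences = 2)

# Entry i of second_diff belongs to k = i + 1
best_k <- which.max(second_diff) + 1
best_k
```

This is only a heuristic: when the groups are unevenly spaced (for example, lined up along one direction), the largest second difference can sit at a smaller k than the true number of groups, so it is worth checking the WSS plot at least once.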


#6

@caiotaniguchi, thanks for your response, and yes, it is indeed weird :slight_smile:, but I have to do it anyway.
One more thing: using PROC VARCLUS in SAS, we can store the cluster information (variables in each cluster, R-squared with own and nearest cluster, etc.) in a dataset.
How can I do the same in R?
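
One possible way to collect this kind of information in a data frame, sketched in base R. This is not a port of PROC VARCLUS: it assumes hierarchical clustering on 1 - |correlation| and uses the first principal component of each cluster as the cluster summary. The data and the column names (r2_own, r2_nearest, ratio) are made up for illustration.

```r
set.seed(2)
n  <- 150
z1 <- rnorm(n)
z2 <- rnorm(n)
X <- data.frame(
  a = z1 + rnorm(n, sd = 0.4),
  b = z1 + rnorm(n, sd = 0.4),
  c = z2 + rnorm(n, sd = 0.4),
  d = z2 + rnorm(n, sd = 0.4)
)

hc     <- hclust(as.dist(1 - abs(cor(X))), method = "average")
groups <- cutree(hc, k = 2)   # integer cluster id (1..k) per variable

# First principal component of each cluster (one column per cluster)
cluster_pc1 <- sapply(sort(unique(groups)), function(g) {
  prcomp(X[, groups == g, drop = FALSE], scale. = TRUE)$x[, 1]
})

# Squared correlation of every variable with every cluster component
r2 <- cor(X, cluster_pc1)^2

# R-squared with the variable's own cluster and with the closest other one
own     <- r2[cbind(seq_along(groups), groups)]
nearest <- sapply(seq_along(groups), function(i) max(r2[i, -groups[i]]))

cluster_info <- data.frame(
  variable   = names(groups),
  cluster    = unname(groups),
  r2_own     = own,
  r2_nearest = nearest,
  ratio      = (1 - own) / (1 - nearest)   # lower is better
)
cluster_info
```

Because cluster_info is a plain data frame, it can be filtered or saved with write.csv() like any other dataset, for example to pick the variable with the lowest ratio in each cluster.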


#7

Hi @Rishabh0709, maybe this function (varclus {Hmisc}) is the one you are trying to find?


#8

Thanks @jalFaizy.
I found another package, named "ClustOfVar".
I was able to do the variable clustering and build a basic logistic regression model using 1 variable from each cluster.
Now I want to automate this process.
For example, say I have 3 clusters with 3 variables each.
I want to build logistic regression models using all possible combinations, like:

C1 C2 C3
0  0  1
0  1  0
0  1  1
1  0  2
1  0  3
and so on till
3  3  3

Here, C1, C2 and C3 are the cluster names,
and 0, 1, 2, 3 are the number of variables selected from each cluster.

So I have to create a logistic regression model for every possible combination.
Please let me know how to automate this process.
I hope I have made the problem clear.

Thanks ,
Rishabh


#9

Hi All,
As a continuation of the work above, I found a couple of functions, “combn” and “paste”, which turn out to be very helpful. I was trying to solve the problem above using them.
I’ll post it with a simple example.
I have 2 clusters of variables, namely cluster1 and cluster2:

cluster1 <- c("A", "B")
cluster2 <- c("C", "D")

The expected output is:
“C”
“D”
“C+D”
“A”
“B”
“A+C”
“A+D”
“B+C”
“B+D”
“A+C+D”
“B+C+D”
“A+B”
“A+B+C”
“A+B+D”
“A+B+C+D”

(It’s basically all possible combinations of variable counts for the 2 clusters: “00, 01, 02, 10, 11, 12, 20, 21, 22”. So 00 prints nothing, 01 should print “C”, “D”, 02 should print “C+D”, 10 should print “A”, “B”, 11 should print “A+C”, “A+D”, “B+C”, “B+D”, and so on for all the other combinations.)
Following is the code I have so far to produce the required output:

for (i in 0:length(cluster1)) {
  for (j in 0:length(cluster2)) {
    # print(c(i, j))
    frm1 <- {}
    frm2 <- {}
    if (i == 0 & j == 0) {
      next
    } else if (i == 0 & j > 0) {
      frm2 <- as.matrix(combn(cluster2, j))
    } else if (i > 0 & j == 0) {
      frm1 <- as.matrix(combn(cluster1, i))
    } else {
      frm1 <- as.matrix(combn(cluster1, i))
      frm2 <- as.matrix(combn(cluster2, j))
    }
    print(paste(frm1, collapse = " + "))
    print(paste(frm2, collapse = " + "))
  }
}

Following is the output I got (I have removed some lines like “” and [1], just to focus on the part of the output that matters):

“C + D”
“C + D”
“A + B”
“A + B”
“C + D”
“A + B”
“C + D”
“A + B”
“A + B”
“C + D”
“A + B”
“C + D”

Please help me change my code to produce the expected output.

Also, please note that in the actual case the number of clusters and the number of variables in each cluster will vary.

Thanks


#10

Not the greatest script ever, but this will do the trick:

# Assign parameters
k_clust <- 2
p_var <- 2
n_comb <- k_clust * p_var

# Create a list with all possible factors
tmp_list <- list()
for (i in 1:n_comb) {
	tmp_list[[length(tmp_list) + 1]] <- c("", LETTERS[i]) 
}

# Create a data frame with all possible combinations
df <- expand.grid(tmp_list)

# Merge combinations into a string
for (feature in colnames(df)) {
	df$combinations <- paste(df$combinations, df[, feature], sep = "+")
}

# Additional regex to format the data correctly
df$combinations <- gsub("^\\+*|\\+*$", "", df$combinations)
df$combinations <- gsub("\\++", "\\+", df$combinations)
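
Since the real clusters will vary in number and size, here is an alternative sketch that starts from the cluster vectors themselves instead of LETTERS. It assumes, as in the expected output above, that every non-empty subset of the pooled variables is wanted; the list name clusters is only illustrative.

```r
clusters <- list(cluster1 = c("A", "B"), cluster2 = c("C", "D"))

# Pool the variables from all clusters into one character vector
vars <- unlist(clusters, use.names = FALSE)

# All non-empty subsets of `vars`, for subset sizes 1, 2, ..., length(vars),
# each subset pasted into a single "X+Y+..." string
combos <- unlist(lapply(seq_along(vars), function(m) {
  combn(vars, m, FUN = paste, collapse = "+")
}))
combos
```

The number of combinations is 2^p - 1 for p variables (15 here), so this grows very quickly as clusters get larger.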

#11

awesome @caiotaniguchi
thanks a lot.


#12

@caiotaniguchi
I got stuck again. :frowning:

Now I have to create logistic regression models, using each of the possible combinations as the predictor variable(s).

attach(train)
for (i in 2:length(pred_var_comb)) {
  print(glm(Loan_Status ~ pred_var_comb[i], train, family = "binomial"))
}

When I run this snippet, I get the following error:

Error in model.frame.default(formula = Loan_Status ~ pred_var_comb[i], :
variable lengths differ (found for ‘pred_var_comb[i]’)

Even after googling a lot, I didn’t find any solution for it.

pred_var_comb is a character vector with all possible combinations of predictor variables.

pred_var_comb

“”
“CoapplicantIncome”
“Credit_History”
“CoapplicantIncome+Credit_History”
“Loan_Amount_Term”
“CoapplicantIncome+Loan_Amount_Term”
“Credit_History+Loan_Amount_Term”
“CoapplicantIncome+Credit_History+Loan_Amount_Term”
“Property_Area”
“CoapplicantIncome+Property_Area”
“Credit_History+Property_Area”
“CoapplicantIncome+Credit_History+Property_Area”
“Loan_Amount_Term+Property_Area”
“CoapplicantIncome+Loan_Amount_Term+Property_Area”
“Credit_History+Loan_Amount_Term+Property_Area”
“CoapplicantIncome+Credit_History+Loan_Amount_Term+Property_Area”

Please let me know how to resolve this issue.
Thanks


#13

@Rishabh0709, it’s not possible to concatenate a formula with a string in R; the whole formula must first be built as a string in ‘y ~ x’ form. Then you just need to apply the as.formula function to it. Try the following:

for (combination in pred_var_comb) {
    # Skips the empty entry (you might want to get rid of the blank instead)
    if (combination == "") next
    f <- as.formula(paste0('Loan_Status ~', combination))
    print(glm(f, train, family = "binomial"))
}

Assuming the blank is removed, you could also use a more elegant vectorized solution:

models <- lapply(pred_var_comb, 
                 function(x) glm(as.formula(paste0('Loan_Status ~', x)),
                                 train, family = "binomial"))

#14

Thanks a lot @caiotaniguchi.
You rock!!!
One more thing: how can I store the glm outputs in the loop?
I tried using "assign", but it didn’t work out.


#15

The vectorized solution already stores the models in a list called ‘models’.

For the loop solution, initialize an empty list before the loop and do the following modification:

# Replace this line:
print(glm(f, train, family = "binomial"))

# With this one:
models[[length(models) + 1]] <- glm(f, train, family = "binomial")
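
Put together, the loop version might look like the sketch below. The train data frame and the predictor names x1 and x2 are made-up stand-ins for the real data, and naming each list entry by its combination string is an optional variation on the positional append shown above.

```r
set.seed(3)
# Hypothetical stand-in for the real `train` data frame
train <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
train$Loan_Status <- rbinom(100, 1, plogis(train$x1))

pred_var_comb <- c("x1", "x2", "x1+x2")

models <- list()   # initialise the empty list before the loop
for (combination in pred_var_comb) {
  f <- as.formula(paste0("Loan_Status ~ ", combination))
  # Store the fit under the name of its predictor combination
  models[[combination]] <- glm(f, data = train, family = "binomial")
}

# The stored fits can then be compared, e.g. by AIC
sapply(models, AIC)
```

With named entries, a single fit can be pulled out later with models[["x1+x2"]].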

#16

Thanks a lot. @caiotaniguchi.


#17

@caiotaniguchi
Instead of providing the required number of clusters manually, is there any way to define a maximum number of variable clusters (something like “maxcluster” in VARCLUS (SAS))?


#18

@Rishabh0709, the only method that I know of is the one using WSS and K-Means which I mentioned above. For hierarchical clustering, you could probably use the value of distance between splits to determine the best number of clusters, but it’s more of a hack than a consistent solution.

I actually don’t use clustering much; it never helped any of the prediction models I tried, and it is sloooow.
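
For what it is worth, the hierarchical "hack" mentioned above could be sketched as follows: cut the tree just before the largest jump in merge height, then cap the result at a user-chosen maximum. The data are simulated, the distance 1 - |correlation| is an assumption, and max_clusters is a hypothetical parameter, not an option of hclust().

```r
set.seed(4)
n <- 200
z <- matrix(rnorm(n * 3), ncol = 3)
# Nine variables: three noisy copies of each of three latent factors
X <- z[, rep(1:3, each = 3)] + matrix(rnorm(n * 9, sd = 0.4), ncol = 9)
colnames(X) <- paste0("v", 1:9)

hc <- hclust(as.dist(1 - abs(cor(X))), method = "average")

# hc$height holds the merge heights in increasing order; a large jump means
# the next merge joins clusters that are far apart, so stop just before it
gaps   <- diff(hc$height)
cut_at <- which.max(gaps)                  # number of merges to keep
k      <- length(hc$height) + 1 - cut_at   # clusters left after those merges

max_clusters <- 5                          # hypothetical upper bound
k <- min(k, max_clusters)

groups <- cutree(hc, k = k)
split(names(groups), groups)
```

Like the WSS approach, this is a heuristic; it works when there is one clear jump in the dendrogram and is less reliable when merge heights increase smoothly.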


#19

@Rishabh0709 Can you please share the code for variable selection similar to PROC VARCLUS in SAS?
Thanks in advance