I need help or ideas on how to generate a correlation matrix for a large dataset. I am dealing with a dataset of 1,000,000 customers (rows) and 50 items (columns). Each cell (i, j) is 1 if customer i has bought item j in the past. I want to find out how similar customers are by calculating the correlation between every pair of customers.
A lazy algorithm is to use two nested loops with n(n-1)/2 iterations (I also tried pandas.DataFrame.corr). Doing this, my PC freezes. I am using Python on a Mac (8 GB RAM, 3.24 GHz). I also tried Spark (Scala), and it ran out of memory as well. I was thinking of MapReduce, but a friend told me it won't help with this kind of problem because it requires pairwise computation.
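To make the scale concrete, here is a minimal sketch of the pairwise approach on a small sample, together with a back-of-the-envelope size for the full result. It is not my actual code: the function names `pairwise_customer_corr` and `full_matrix_cost` are made up for illustration, the data is a random 0/1 stand-in (assuming cells are 0 when there was no purchase), and the size estimate assumes a dense matrix of 8-byte floats.

```python
import numpy as np
import pandas as pd


def pairwise_customer_corr(purchases: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation between customers (rows).

    DataFrame.corr() correlates columns, so the frame is transposed first.
    Note: a customer whose row is constant (all 0 or all 1) has zero
    variance, so the correlations for that row come out as NaN.
    """
    return purchases.T.corr()


def full_matrix_cost(n_customers: int, bytes_per_value: int = 8):
    """Number of distinct customer pairs and bytes for a dense n x n matrix."""
    n_pairs = n_customers * (n_customers - 1) // 2
    dense_bytes = n_customers * n_customers * bytes_per_value
    return n_pairs, dense_bytes


# Random stand-in data: 200 customers x 50 items (illustrative only).
rng = np.random.default_rng(0)
sample = pd.DataFrame(rng.integers(0, 2, size=(200, 50)))

corr = pairwise_customer_corr(sample)
print(corr.shape)  # one row and one column per customer

n_pairs, dense_bytes = full_matrix_cost(1_000_000)
print(f"{n_pairs:,} pairs, about {dense_bytes / 1e12:.0f} TB as dense float64")
```

On the small sample this runs fine. For the full 1,000,000 customers, the same arithmetic gives roughly 5 x 10^11 pairs and about 8 TB for the dense result, which is far beyond the 8 GB of RAM I have.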
Any ideas, please?