Correlation matrix on a very large dataset

big_data
matrix
correlation
sparkr
mapreduce

#1

Hi,

I need help or ideas on how to generate a correlation matrix on a large dataset. I am dealing with a dataset of 1,000,000 customers (rows) and 50 items (columns). Each cell (i, j) is 1 if customer i has bought item j in the past. I want to find out how similar customers are by calculating the correlation between customers.

A naive approach is to use two nested loops with n(n-1)/2 iterations (I tried pandas.DataFrame.corr as well). Doing this, my PC freezes. I am using Python on a Mac (8 GB, 3.24 GHz). I also tried Spark (Scala), and it ran out of memory as well. I was thinking of MapReduce, but a friend told me it won't help with this kind of pairwise computation.
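To put rough numbers on the size of the problem, here is a back-of-the-envelope sketch. It assumes the full customer-by-customer matrix would be stored densely as 8-byte floats; that storage format and the variable names are assumptions for illustration only.

```python
# Rough size of a full customer-by-customer correlation matrix.
n_customers = 1_000_000

# Number of distinct customer pairs: n(n-1)/2.
n_pairs = n_customers * (n_customers - 1) // 2

# Bytes needed for a dense n x n matrix of 8-byte floats (assumed storage).
dense_bytes = n_customers ** 2 * 8

print(n_pairs)             # 499999500000
print(dense_bytes / 1e12)  # 8.0 (terabytes)
```

Under that assumption the result alone would be about 8 TB, far more than 8 GB of memory, regardless of how the loops are written.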

Any ideas, please?


#2

Hi,

Instead of calculating the correlation on the whole dataset, you can calculate it on a stratified sample.
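A minimal sketch of that idea in Python with pandas, under some assumptions: the strata are defined by how many items each customer has bought (a hypothetical choice; any grouping that makes sense for the business could be used instead), the function name stratified_customer_corr is made up for illustration, and the demo data is random rather than real purchases.

```python
import numpy as np
import pandas as pd


def stratified_customer_corr(purchases: pd.DataFrame, frac: float, seed: int = 0) -> pd.DataFrame:
    """Customer-by-customer correlation on a stratified sample of rows."""
    # Hypothetical stratum: number of items each customer has bought.
    basket_size = purchases.sum(axis=1)
    # Draw the same fraction of customers from every stratum.
    sample = purchases.groupby(basket_size).sample(frac=frac, random_state=seed)
    # Customers who bought nothing or everything have zero variance,
    # so their correlation is undefined; drop them.
    sample = sample[sample.nunique(axis=1) > 1]
    # np.corrcoef treats each row as one variable, i.e. one customer.
    corr = np.corrcoef(sample.to_numpy(dtype=float))
    return pd.DataFrame(corr, index=sample.index, columns=sample.index)


# Demo on random 0/1 data: 2,000 customers, 50 items.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(0, 2, size=(2000, 50)))
demo_corr = stratified_customer_corr(demo, frac=0.1)
print(demo_corr.shape)
```

The sample size still has to be small enough for the sampled matrix to fit in memory, since the result grows with the square of the number of sampled customers.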

Best!
Ankit Gupta


#3

@kthouz

You could use “strata” (from the R package sampling) for stratified sampling!

Please check the links below for reference:
strata

Stratified Sampling and its application using dplyr

Thanks,
Abhishek Das