I want to perform statistical analysis on various kinds of data sets. The input data sets have different structures (each with a different number of columns and different data types). I want to develop an algorithm that determines how similar a user's input data is to a set of sample data sets and performs the analysis accordingly. Any help on how to proceed, or with the algorithm itself, would be of great use. Thanks in advance!
This type of problem can be approached by matching data types: columns that hold the same kind of data generally share the same data type.
Different sites have their own user databases with distinct columns, but in general they hold the same kinds of data, such as Username, Password, emailId, Contact number, etc., hence each kind can be clustered separately.
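As a rough illustration of the type-matching idea, here is a minimal Python sketch. It assumes each data set is a dict mapping column names to lists of raw string values. The data, the three coarse type labels, and the choice of a multiset Jaccard score are all my own assumptions for the example, not a fixed recipe.

```python
from collections import Counter

def infer_type(values):
    """Return a coarse type label for a column of raw string values."""
    def all_match(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False
    if all_match(int):
        return "integer"
    if all_match(float):
        return "float"
    return "text"

def type_signature(dataset):
    """Count how many columns of each coarse type a data set has."""
    return Counter(infer_type(col) for col in dataset.values())

def signature_similarity(sig_a, sig_b):
    """Multiset Jaccard similarity between two type signatures (0.0 to 1.0)."""
    shared = sum((sig_a & sig_b).values())
    total = sum((sig_a | sig_b).values())
    return shared / total if total else 1.0

def closest_sample(user_data, samples):
    """Return the name of the sample data set with the most similar signature."""
    user_sig = type_signature(user_data)
    return max(
        samples,
        key=lambda name: signature_similarity(user_sig, type_signature(samples[name])),
    )

# Hypothetical data: column name -> list of raw string values.
samples = {
    "users": {
        "Username": ["ann", "bob"],
        "emailId": ["a@x.org", "b@x.org"],
        "Contact number": ["5550101", "5550102"],
    },
    "orders": {
        "Product ID": ["1234", "1235"],
        "Quantity": ["2", "7"],
        "Price": ["9.99", "4.50"],
    },
}
user_data = {
    "login": ["carl", "dee"],
    "mail": ["c@y.org", "d@y.org"],
    "phone": ["5550103", "5550104"],
}

print(closest_sample(user_data, samples))  # prints "users"
```

The column names play no part here; only the mix of data types is compared, so this is best treated as a first filter rather than a final answer.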
Please have a look at SSIS (SQL Server Integration Services) too; it can ease your work and broaden your horizons.
Hope this helps.
Thanks for the response! It is much appreciated.
I had thought of comparing data types, but a matching data type doesn't mean the columns have similar contents. For example, I have a column "Sales" (integer data type) in the input data set, and one of the sample data sets has a column "Product ID" (also integer). The two columns mean different things, but a data type comparison would treat them as the same and perform on "Sales" the analysis intended for "Product ID".
Let me know your thoughts on this.
That is indeed a challenge. You can address it by training an algorithm to recognise what a column really means, using attributes of the column such as value length, and also by considering how the available data behaves.
To us, 1234 (Sales) and 1234 (ProductID) are different, but a machine needs training to understand the difference.
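To make the "attributes of the column" idea concrete, here is a small Python sketch. The two example columns and the three features (share of distinct values, average digit length, share of +1 steps) are assumptions I picked for illustration; real data will likely need other or additional features.

```python
from statistics import mean

def column_profile(values):
    """Summarise a non-empty integer column with a few simple features."""
    n = len(values)
    return {
        # Share of distinct values: identifiers are usually all distinct.
        "unique_ratio": len(set(values)) / n,
        # Average number of digits: identifiers often have a fixed length.
        "mean_length": mean(len(str(abs(v))) for v in values),
        # Share of consecutive steps equal to +1: typical of sequential IDs.
        "sequential_ratio": sum(b - a == 1 for a, b in zip(values, values[1:]))
        / max(n - 1, 1),
    }

# Hypothetical columns: both are integers, so a type check cannot separate them.
sales = [1234, 250, 250, 4100, 90, 1234]
product_id = [1234, 1235, 1236, 1237, 1238, 1239]

sales_profile = column_profile(sales)
id_profile = column_profile(product_id)
print(sales_profile)
print(id_profile)
```

Both columns contain the value 1234, yet their profiles differ clearly. Profiles like these, labelled with what each column actually represents, could serve as the training rows for a classifier.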
Try using the Random Forest algorithm.
Hope you find it useful.