Data analysis and correlations

dataanalysis
analytics

#1

Hello,
I have an analytics problem.

As part of an interview, the employer gave me a sample of data (a CSV file) with no indication of what the variables are. There is only numeric data. He told me to analyze the data and find the relationships between the variables.
This is the first time I have worked with data that has no variable names. Can you help me find the best way to analyze it?

Here are the first 16 lines of the data.

Thank you very much

Best!

"";"A";"B";"C";"D";"E";"G";"H";"I";"J"
"1";0,448161832988262;2;2;114,646534434721;3,30110318594957;-19,599488176302;234,434198671982;20,6532414748726;15,1932268216357
"2";0,432204742450267;1;2;85,7605207392913;3,62111444347777;-1,15239499472225;178,211501645965;16,9600753825141;-0,112499563201146
"3";0,568398675648496;1;2;71,6164756272189;7,69368362802501;24,7537090689179;153,913075791719;28,2888808213224;-0,104468251047106
"4";0,349178678821772;2;2;121,170146136389;3,37847601607964;11,7925265579597;252,551708711339;18,621791522593;6,38200554369286
"5";0,537820372032002;2;1;63,9725158637857;8,09451459803737;45,2679414663975;142,05393538719;22,3210943859884;2,63072085022351
"6";0,311805972829461;2;1;105,86608851945;7,00105784025606;56,7118971691138;207,671845532346;28,1108882120601;6,91708052471079
"7";0,188992844894528;2;1;91,4349746370189;4,22115610670039;113,719882708266;191,935903422313;27,722226329035;1,96456921716711
"8";0,220268326113001;2;1;96,9513989045572;3,30097548353886;41,9359256486865;193,983671256507;21,1632614867063;0,897974894012737
"9";0,745850879931822;2;2;117,920731392395;0,708984633432214;12,0311471051253;252,479427305587;16,1589645742755;13,4663055003645
"10";0,685712045058608;2;1;114,761284685084;-2,05310178438932;78,7495516682929;241,262975712287;29,0455805154816;0,628022028133988
"11";0,288209964288399;2;1;72,581830232794;6,64981545368408;45,2083742548179;148,12776449076;27,9515510556148;7,63068446314225
"12";0,837000070838258;2;1;92,5181661942909;0,0636188206698255;41,882160754504;199,975615311369;13,5301385749224;2,80857064980119
"13";0,725813006050885;2;2;119,512257003768;-1,14732052387933;13,9914669274313;243,515192596543;13,7178765507419;10,7559454559754
"14";0,448982792440802;2;1;76,9548634545701;6,22868004513269;36,4101586275915;159,127996500245;25,6106129292056;20,5549219150393
"15";0,000250126933678985;2;2;78,7042365624531;6,87908300161039;33,3968162297511;155,114586653932;25,8772230949845;2,66564733215586
"16";0,430642496794462;2;1;60,8333601287592;6,01562443738694;38,8547948159146;122,277246640109;22,1027759089393;4,64118405160815


#2

The Santander Customer Satisfaction competition on Kaggle posed a similar problem: the data was numerical and anonymized (no variable names). You could borrow some of the ideas applied there.

Here are some suggestions:

  • First, check whether any columns in your data are correlated, because correlated columns can affect your predictive algorithm.
  • If you can, try to reverse engineer what the columns represent using your intuition.
  • Try visualizing your data and looking for patterns. That way you can understand your data and get a feel for it.
  • Try applying clustering or multivariate analysis to dig deeper into the data.
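
To make the first suggestion concrete, here is a minimal sketch of a pairwise correlation check. It assumes the file is semicolon-delimited with decimal commas, as in the sample posted above, and it uses only columns D, E and H from the first eight rows, rounded to two decimals. The function names (`load_columns`, `pearson`, `strong_pairs`) and the 0.9 threshold are illustrative choices, not something from the thread. It sticks to the standard library; with pandas, `read_csv(sep=";", decimal=",")` followed by `DataFrame.corr()` would likely be shorter.

```python
import csv
import io
from itertools import combinations
from math import sqrt

# Columns D, E and H of the first eight posted rows, rounded to two decimals.
SAMPLE = '''"";"D";"E";"H"
"1";114,65;3,30;234,43
"2";85,76;3,62;178,21
"3";71,62;7,69;153,91
"4";121,17;3,38;252,55
"5";63,97;8,09;142,05
"6";105,87;7,00;207,67
"7";91,43;4,22;191,94
"8";96,95;3,30;193,98
'''


def load_columns(text):
    """Parse semicolon-delimited text with decimal commas into {name: [floats]}."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=";") if row]
    header, body = rows[0][1:], rows[1:]  # the first field of each row is the row index
    return {
        name: [float(row[i + 1].replace(",", ".")) for row in body]
        for i, name in enumerate(header)
    }


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def strong_pairs(columns, threshold=0.9):
    """Return (col_a, col_b, r) for every pair with |r| at or above the threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


columns = load_columns(SAMPLE)
for a, b, r in strong_pairs(columns):
    print(f"{a} vs {b}: r = {r:.3f}")
```

On these eight rows only the D/H pair clears the threshold, which hints that H may be derived from D. Eight rows is far too few to conclude anything, so rerun it on the full file before drawing conclusions.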

Regards


#3

Hi @Cyrine - Adding on to what @jalFaizy has mentioned, here are a few suggestions:

  1. Find pairwise correlations between columns
  2. Find Linear Combinations - For example: if column A + column B = column C, then column C need not be included as a feature in modeling
  3. Use the Variance Inflation Factor (VIF) to check whether a particular column can be predicted from the other columns. A higher VIF means the column is more strongly related to the others.
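
As a rough illustration of points 2 and 3, the sketch below computes VIF by regressing each column on all the others and taking 1 / (1 - R²). The data is made up for the example: column C is built as A + B plus a little noise, and D is unrelated. The helper names (`solve`, `r_squared`, `vif`) are hypothetical, and the least-squares fit is a plain normal-equations solve written with the standard library only. In practice statsmodels' `variance_inflation_factor` would probably be the more usual route. Note that an exact linear combination gives R² = 1 and an infinite VIF, which is why the example adds noise.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def r_squared(y, predictors):
    """R^2 of an ordinary least-squares fit of y on the predictors plus an intercept."""
    rows = [[1.0, *vals] for vals in zip(*predictors)]  # design matrix with intercept
    k = len(rows[0])
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * target for row, target in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(coef * v for coef, v in zip(beta, row)) for row in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - f) ** 2 for t, f in zip(y, fitted))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


def vif(columns):
    """Return {name: VIF}, regressing each column on all the remaining columns."""
    result = {}
    for name, values in columns.items():
        others = [v for other, v in columns.items() if other != name]
        result[name] = 1.0 / (1.0 - r_squared(values, others))
    return result


# Made-up data: C is almost exactly A + B, while D is unrelated noise.
rng = random.Random(0)
a = [rng.gauss(0, 1) for _ in range(200)]
b = [rng.gauss(0, 1) for _ in range(200)]
c = [x + y + rng.gauss(0, 0.1) for x, y in zip(a, b)]
d = [rng.gauss(0, 1) for _ in range(200)]

for name, value in vif({"A": a, "B": b, "C": c, "D": d}).items():
    print(f"{name}: VIF = {value:.1f}")
```

Columns A, B and C should come out with large VIF values because each is nearly determined by the other two, while D should stay close to 1.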

Hope this helps.


#4

Thank you very much. I have taken note of your remarks.

Best regards