How do you understand anonymized data (data without column names)?



For the BNP Kaggle competition, we are given an anonymized dataset containing both categorical and numeric variables, i.e. we are not given the column names of the variables, just the data. So how do you go about understanding this kind of dataset?

Also, it is said that domain knowledge can be key to a good machine learning model. But how do you apply it to this kind of data?


@jalFaizy - With this type of problem you have to experiment with the data itself, because it is anonymized and the input variables have no column names. Yes, this kind of problem is difficult to solve.

Hope this helps!



A few ideas…

Start by finding correlations between columns and dropping the highly correlated ones; those probably represent the same data in the domain…
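
As a rough illustration of the correlation idea, here is a minimal sketch using pandas. The helper name `drop_highly_correlated`, the 0.95 threshold, and the synthetic columns `v1` to `v4` are all assumptions made up for this example; they are not taken from the competition data. It only looks at numeric columns and uses Pearson correlation, so categorical columns are left alone.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df, threshold=0.95):
    """Drop the later column of each numeric pair with |Pearson r| > threshold.

    Hypothetical helper for illustration; the threshold is an arbitrary choice.
    """
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Synthetic stand-in for an anonymized dataset (not real competition data).
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "v1": base,
    "v2": base * 2.0 + rng.normal(scale=0.01, size=200),  # near-copy of v1
    "v3": rng.normal(size=200),                           # independent noise
    "v4": rng.choice(["A", "B", "C"], size=200),          # categorical, ignored
})

reduced, dropped = drop_highly_correlated(demo)
print("dropped:", dropped)
print("remaining:", list(reduced.columns))
```

Which column of a correlated pair to keep is a judgment call; this sketch simply keeps the one that appears first.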

Cluster the data to find groups of similar data points and to see how the points are spread.
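
One possible way to try the clustering idea is k-means from scikit-learn, sketched below. The choice of k-means, the two clusters, and the synthetic blobs are assumptions for illustration only; on real anonymized data you would have to pick the number of clusters yourself (or try several) and encode categorical columns first.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Two well-separated synthetic blobs standing in for anonymized numeric columns.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(100, 3))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(100, 3))
X = np.vstack([blob_a, blob_b])

# Scale first so that no single column dominates the distance computation.
X_scaled = StandardScaler().fit_transform(X)

# n_clusters=2 is an assumption that matches the synthetic data above.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X_scaled)

# Cluster sizes and per-cluster column means give a first feel for the groups.
sizes = np.bincount(labels)
print("cluster sizes:", sizes)
for k in range(2):
    print("cluster", k, "column means:", X[labels == k].mean(axis=0).round(2))
```

Comparing the per-cluster means column by column can hint at which anonymized variables separate the groups.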

Many algorithms project the data points into a high-dimensional space that has no meaning in the domain anyway, so you can treat the anonymized data as if it were already in such a space. :slight_smile: