How to proceed with analysis from a single data file (192 variables/columns) and without proper information about the variables?
Have you tried the Rattle GUI? Please give that a try.
Also try the summary, describe and summarize commands, together with boxplot, plot and histograms, for a first look at each variable.
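As a rough illustration of that kind of first pass, here is a minimal sketch in plain Python (standard library only) that profiles every column of a CSV file: missing values, distinct values and, for numeric columns, min/median/mean/max. The function name profile_columns and the sample columns (cust_id, plan, bill_amount) are made up for the example and are not from the actual data set; the summary command in R gives similar information.

```python
import csv
import io
import statistics


def profile_columns(reader):
    """Return {column: summary dict} for the rows of a csv.DictReader."""
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in columns:
            columns[name].append(row[name])

    profile = {}
    for name, values in columns.items():
        # Treat empty strings (and short rows) as missing values.
        present = [v for v in values if v not in ("", None)]
        summary = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
        try:
            numbers = [float(v) for v in present]
        except ValueError:
            numbers = None  # at least one value is not a number
        if numbers:
            summary.update(
                kind="numeric",
                min=min(numbers),
                median=statistics.median(numbers),
                mean=statistics.fmean(numbers),
                max=max(numbers),
            )
        else:
            summary["kind"] = "text"
        profile[name] = summary
    return profile


# Hypothetical sample standing in for (a small extract of) the real file.
SAMPLE = """cust_id,plan,bill_amount
1,A,10.0
2,B,
3,A,30.0
4,A,20.0
"""

profile = profile_columns(csv.DictReader(io.StringIO(SAMPLE)))
for name, summary in profile.items():
    print(name, summary)
```

This version keeps every value in memory, so it is only meant for a small extract or a sample of the file, not for the full data set in one go.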
There is no shortcut to doing analysis; how to proceed depends on the data itself. Which variables to ignore, which to split into new variables and which to keep as is all depends on the data.
To get a more coherent answer, always share a few details about the data (size, number of rows, what it contains) rather than the number of columns alone.
The only time I have come across this situation is in competitive modeling, which is usually not what happens in real life. Usually, I advise understanding the domain and the data fully before doing any modeling.
If you are in competitive modeling or have got a lot of masked data, then the only alternative is to explore the variables, come up with hypotheses, quickly categorize which variables are significant through crude models and then create a refined model.
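One possible way to do that quick categorization, sketched in plain Python: rank each numeric variable by the absolute value of its Pearson correlation with the target. This is only a crude univariate screen used here as a stand-in for a "crude model" (it misses non-linear effects and interactions), and the variable names var_001 to var_003 and the values are made up for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(features, target):
    """Return (name, |correlation|) pairs, strongest first."""
    scores = {
        name: abs(pearson(values, target))
        for name, values in features.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical masked variables and target, for illustration only.
features = {
    "var_001": [1.0, 2.0, 3.0, 4.0, 5.0],
    "var_002": [5.0, 5.0, 5.0, 5.0, 5.0],
    "var_003": [2.0, 1.0, 4.0, 3.0, 5.0],
}
target = [2.0, 4.0, 6.0, 8.0, 10.0]

ranking = rank_by_correlation(features, target)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Variables near the bottom of such a ranking are candidates to set aside first, but they should be checked again once a refined model is in place.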
Hope this helps,
Kunal
Hi Ajay, the data size is 1 TB. It has 192 columns and 200 million records (rows). It contains customer base information, transaction-related information and billing information.