Let’s say my dataset has 26 categorical variables (A1 through A26) and 5 numerical variables (X1 through X4, plus Y1). The dataset is also very large.
Normally, using my (admittedly limited) business knowledge, I divide this dataset into multiple buckets and, for each bucket, regress Y1 on X1 to X4. If the adjusted R-squared is poor, I plot the data points in Tableau to see whether X1 to X4 exhibit a pattern against Y1. If I see a pattern, I break X1 down into intervals, which creates more independent variables. Most of the time, I slice and dice the data manually and re-run the regression after changing the regression equation based on what I found.
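To make the workflow above concrete, here is a minimal sketch of it in Python with NumPy. Everything in it is an assumption for illustration: the data is synthetic, the two bucket labels "a" and "b" stand in for my hand-made buckets, the cut point at 0 for X1 is made up, and the helper names (adjusted_r2, fit_per_bucket, bin_indicators) are hypothetical. It is not my actual data or tooling.

```python
import numpy as np

def adjusted_r2(X, y):
    """Fit OLS of y on X (plus an intercept) and return the adjusted R-squared."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def fit_per_bucket(buckets, X, y):
    """Run one regression per bucket; return {bucket label: adjusted R-squared}."""
    return {
        str(label): adjusted_r2(X[buckets == label], y[buckets == label])
        for label in np.unique(buckets)
    }

def bin_indicators(x, edges):
    """Replace a numeric column by 0/1 interval indicators (lowest interval dropped)."""
    idx = np.digitize(x, edges)  # interval index in 0..len(edges)
    return np.column_stack(
        [(idx == k).astype(float) for k in range(1, len(edges) + 1)]
    )

# Synthetic stand-in data (assumption): two buckets, four predictors X1..X4.
rng = np.random.default_rng(0)
n = 400
buckets = rng.choice(np.array(["a", "b"]), size=n)
X = rng.normal(size=(n, 4))

# Bucket "a": Y1 is linear in X1..X4. Bucket "b": Y1 jumps when X1 crosses 0.
linear = X @ np.array([1.0, -2.0, 0.5, 3.0])
step = 5.0 * (X[:, 0] > 0)
y = np.where(buckets == "a", linear, step) + rng.normal(scale=0.1, size=n)

# Step 1: regress Y1 on X1..X4 within each bucket and compare adjusted R-squared.
scores = fit_per_bucket(buckets, X, y)

# Step 2: for the poorly fitting bucket, break X1 into intervals and refit.
mask = buckets == "b"
X_b = np.column_stack([bin_indicators(X[mask, 0], edges=[0.0]), X[mask, 1:]])
score_b_binned = adjusted_r2(X_b, y[mask])

print(scores, score_b_binned)
```

In this toy setup the straight-line fit does well in bucket "a" and poorly in bucket "b", and the fit in "b" improves once X1 is replaced by an interval indicator. That is the same loop I currently run by hand, with the plots telling me where to cut.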
Objective: to predict Y1 for the large dataset using the independent variables X1 to X4.
My questions are as follows:
Is there a technique or algorithm for segmenting the large dataset using the categorical variables A1 to A26? (Not every column has to be an input; at a glance, I can rule out about 50% of them as unusable.)
Also, if you encountered such a task, how would you generally go about it? (Just trying to pick your brains on it.)
Thanks in Advance,