Is there a quick way to split a large dataset into smaller buckets to get a better linear regression model for each bucket?



Hi Everyone,

Let’s say my dataset has 26 categorical variables (A1 through A26) and 5 numerical variables (X1 through X4, and Y1). The dataset is also huge.

Normally, using my (admittedly limited) business knowledge, I divide this dataset into multiple buckets and run a regression of Y1 on X1 to X4 for each bucket. If the adjusted R-squared is poor, I plot the data points in Tableau to see whether X1 to X4 exhibit a pattern against Y1. If I see a pattern, I break X1 down into intervals (which creates more independent variables). Most of the time, I slice and dice the data manually and re-run the regression, changing the regression equation based on what I found by hand.
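To make the workflow above concrete, here is a minimal Python sketch of the per-bucket regression step, compared against a single pooled fit. It is an illustration only: the data is synthetic, the bucketing uses a single categorical column standing in for A1, and the helper names `adjusted_r2` and `fit_per_bucket` are made up for this example.

```python
import numpy as np

def adjusted_r2(X, y):
    """Fit y ~ X by ordinary least squares (with intercept); return adjusted R-squared."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def fit_per_bucket(buckets, X, y, min_rows=10):
    """Return {bucket label: adjusted R-squared} for each bucket with enough rows."""
    results = {}
    for label in np.unique(buckets):
        mask = buckets == label
        if mask.sum() >= min_rows:
            results[str(label)] = adjusted_r2(X[mask], y[mask])
    return results

# Synthetic stand-in for the real data: the effect of X1 on Y1 depends on A1.
rng = np.random.default_rng(0)
n = 600
A1 = rng.choice(np.array(["a", "b", "c"]), size=n)
X = rng.normal(size=(n, 4))  # columns play the role of X1..X4
slopes = {"a": 2.0, "b": -1.0, "c": 0.5}
slope = np.array([slopes[str(v)] for v in A1])
Y1 = slope * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.5, size=n)

pooled = adjusted_r2(X, Y1)
per_bucket = fit_per_bucket(A1, X, Y1)

print("pooled adjusted R-squared:", round(pooled, 3))
for label, value in per_bucket.items():
    print("bucket", label, "adjusted R-squared:", round(value, 3))
```

In this toy setup the per-bucket fits do better than the pooled fit because the slope on X1 really does differ by bucket; on real data that is exactly what has to be checked.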

Objective: to predict Y1 for the large dataset using the independent variables X1 to X4.

My questions are as follows:

Is there a technique or algorithm to segment the large dataset using the categorical variables A1 to A26? (Not every column has to be an input. At a glance, I can rule out about 50% of them as unusable.)

Also, if you encountered such a task, how would you generally go about it? (Just trying to pick your brains on it.)

Thanks in advance.


There is a library called biglm in R. I used it when I was faced with a similar situation. Hope it helps.
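biglm is aimed at fitting linear models in bounded memory by feeding the data in chunks. For readers not working in R, the general idea of chunked fitting can be sketched in Python as below. This is not biglm's API or its internal algorithm (as far as I know, biglm uses an incremental QR decomposition, while this sketch simply accumulates the normal equations); the class `ChunkedOLS` and its methods are hypothetical names, and the data is synthetic.

```python
import numpy as np

class ChunkedOLS:
    """Accumulate X'X and X'y chunk by chunk, so the full data never sits in memory."""

    def __init__(self, n_features):
        k = n_features + 1  # +1 for the intercept column
        self.xtx = np.zeros((k, k))
        self.xty = np.zeros(k)

    def update(self, X_chunk, y_chunk):
        """Fold one chunk of rows into the running sums."""
        design = np.column_stack([np.ones(len(X_chunk)), X_chunk])
        self.xtx += design.T @ design
        self.xty += design.T @ y_chunk

    def coef(self):
        """Solve the normal equations; returns [intercept, b1, ..., bp]."""
        return np.linalg.solve(self.xtx, self.xty)

# Synthetic data, processed 1,000 rows at a time.
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 4))
y = 1.5 + X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=5000)

model = ChunkedOLS(n_features=4)
for start in range(0, len(X), 1000):
    model.update(X[start:start + 1000], y[start:start + 1000])

chunked_coef = model.coef()

# Reference fit on the full data, for comparison only.
full_design = np.column_stack([np.ones(len(X)), X])
full_coef, *_ = np.linalg.lstsq(full_design, y, rcond=None)

print(np.round(chunked_coef, 3))
```

Note that this only addresses the size of the dataset; it does not choose the buckets.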