I am stuck while building a model. I have 10 parameters, all of which are categorical variables, and some of them have a large number of unique values (one has 1,335 unique values across 3 lakh, i.e. 300,000, records). The y value to be predicted is a number of days (numerical). I am using RandomForestRegressor and getting an accuracy of around 55-60%. I am not sure whether this is the upper limit or whether I need to change the algorithm itself. Kindly suggest any possible solutions; I am open to any kind of approach.
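For categorical features with a large number of values, you could use leave-one-out target encoding: each row's category is replaced by the mean of the target over all the other rows in that category, so the row's own target value does not leak into its feature.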
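To make the arithmetic concrete, here is a minimal sketch in plain Python. The function name `leave_one_out_encode`, the toy `cities` and `days` lists, and the fallback to the global mean for categories that occur only once are all assumptions made for illustration, not part of the question.

```python
from collections import defaultdict

def leave_one_out_encode(categories, targets):
    """Replace each category with the mean target of the OTHER rows
    that share that category (the current row is left out)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, t in zip(categories, targets):
        sums[cat] += t
        counts[cat] += 1

    global_mean = sum(targets) / len(targets)

    encoded = []
    for cat, t in zip(categories, targets):
        if counts[cat] > 1:
            # Category total minus this row's own target, averaged over the rest.
            encoded.append((sums[cat] - t) / (counts[cat] - 1))
        else:
            # Assumption: a category seen only once falls back to the global mean.
            encoded.append(global_mean)
    return encoded

# Hypothetical toy data: one categorical column and the target in days.
cities = ["A", "A", "A", "B", "B", "C"]
days = [10, 20, 30, 4, 6, 8]

encoded = leave_one_out_encode(cities, days)
print(encoded)  # [25.0, 20.0, 15.0, 6.0, 4.0, 13.0]
```

This turns a column with 1,335 levels into a single numeric column. The leave-one-out step applies to training rows only; for unseen rows at prediction time the plain per-category mean from the training data would typically be used.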
One suggestion is to bin the values into a handful of groups so that only a few distinct values remain. Then apply the usual transformations, such as label encoding or dummy coding, so that the input dimension does not blow up.
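One possible way to do the binning, sketched below, is to keep the frequent values and fold every rare value into a single catch-all group. Binning by frequency is an assumption here (the right grouping depends on what the values mean), and the helper name `group_rare`, the `min_count` threshold and the `"Other"` label are hypothetical.

```python
from collections import Counter

def group_rare(values, min_count, other_label="Other"):
    """Keep values that occur at least min_count times;
    replace all remaining values with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Hypothetical toy column with two frequent and two rare values.
raw = ["x", "x", "x", "y", "y", "z", "w"]
binned = group_rare(raw, min_count=2)
print(binned)  # ['x', 'x', 'x', 'y', 'y', 'Other', 'Other']
```

After this step the column has three distinct values instead of four, so dummy coding produces three columns. On a column with 1,335 levels the reduction can be much larger, depending on the threshold.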
Possible steps to follow:
- Try to find commonalities among the unique values of a variable and combine them under a more general category name. For example, suppose you have a column named education that contains the following unique values:
Basic-4y, Basic-6y, Basic-9y, High-schools, Diploma, graduate, post graduate, etc.
You can combine Basic-4y, Basic-6y and Basic-9y into just Basic without losing the meaning. This step reduces the number of one-hot encoded columns.
- Create dummies for all independent features.
- Use RFE (recursive feature elimination) to rank the one-hot encoded variables by their relevance to the target variable. Build the model on the top-ranked variables to achieve better accuracy.
- Using PCA on categorical variables is not advised.
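The first three steps above can be sketched with pandas and scikit-learn as follows. The toy data frame, the `region` column, the `merge_map` dictionary and the choice to keep 3 features are all made up for illustration; the estimator inside RFE is a small random forest only because the question already uses one.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "education": ["Basic-4y", "Basic-6y", "Basic-9y", "Diploma",
                  "graduate", "post graduate", "Basic-4y", "Diploma"],
    "region": ["north", "south", "north", "east",
               "south", "east", "east", "north"],
    "days": [3, 4, 3, 10, 12, 15, 4, 9],
})

# Step 1: merge closely related levels into one broader category.
merge_map = {"Basic-4y": "Basic", "Basic-6y": "Basic", "Basic-9y": "Basic"}
df["education"] = df["education"].replace(merge_map)

# Step 2: create dummies (one-hot columns) for the independent features.
X = pd.get_dummies(df[["education", "region"]])
y = df["days"]

# Step 3: rank the dummy columns with RFE and keep the top ones.
estimator = RandomForestRegressor(n_estimators=50, random_state=0)
selector = RFE(estimator, n_features_to_select=3)
selector.fit(X, y)

selected = list(X.columns[selector.support_])
print(selected)
```

`selector.ranking_` gives the full ranking (1 marks a selected column), which helps when deciding how many columns to keep on the real data.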
Hope this helps.