Improving model score apart from GBM and random forest

r

#1

Can you suggest ways to improve my model? I have tried regression, GBM, random forests, and SVM, each with different options. Are there any other options?

For XGBoost, the tutorial mentions that it takes only numerical inputs. How should factors be handled in this case?


#2

@pchavan, you can convert the factor variable to numeric and then use it as input to the XGBoost model.

Hope this helps!

Regards,
Hinduja


#3

I thought of doing that, but I don’t understand the sense behind it. For instance, there is low fat (2) and regular (1). As long as they are factors, they make sense. If I convert them to numeric, don’t you think they will lose that sense, since they will be treated like numbers?


#4

When you import a data set into R, R assigns an integer code to each level of a factor variable. XGBoost accepts only numeric variables, so you pass it these codes when building the model; to interpret the results, you can convert the codes back to the factor levels.
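As a minimal sketch of that round trip in base R (the toy vector below is made up and only stands in for a column such as Item_Fat_Content):

```r
# Hypothetical two-level factor; values are illustrative, not from the data set
fat <- factor(c("Regular", "Low Fat", "Low Fat", "Regular"))

# Integer code of each value, i.e. its position in levels(fat)
codes <- as.integer(fat)

# Map the codes back to the original labels for interpretation
labels_back <- levels(fat)[codes]

print(levels(fat))
print(codes)
print(labels_back)
```

The codes carry an ordering (1 < 2) that the original labels may not have, which is the concern raised in the previous post.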

Hope this helps!


#5

@pchavan, I see your point. To address this issue, you can create dummy variables to replace the categorical variables in the data frame. This blog might help you.


#6

@pchavan, you can convert a factor/categorical variable into multiple dummy variables, which are numeric, and then use them in the XGBoost model. Here is a tutorial: http://amunategui.github.io/dummyVar-Walkthrough/
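One way to build such dummy variables with base R alone is `model.matrix`. This is only a sketch: the column names follow the Big Mart data, but the values are made up, and the linked tutorial may use a different function.

```r
# Toy data frame; values are illustrative only
df <- data.frame(
  Item_Fat_Content = factor(c("Low Fat", "Regular", "Low Fat")),
  Outlet_Type      = factor(c("Grocery Store", "Supermarket Type1",
                              "Supermarket Type2")),
  Item_MRP         = c(249.8, 48.3, 141.6)
)

# "~ . - 1" removes the intercept, so the first factor gets one column per
# level; later factors still drop one reference level to avoid redundancy
X <- model.matrix(~ . - 1, data = df)

print(colnames(X))
print(is.numeric(X))
```

The result is a plain numeric matrix, which is the kind of input the XGBoost tutorial asks for.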


#7

Hello Everyone,
Thanks for your help. But I am not really looking for help in creating dummy variables. I have been trying to improve my model for the Big Mart data. Can anyone please share XGBoost R code for this problem (continuous target) or something along similar lines? I am not able to understand the given XGBoost article.

Thanks :slight_smile:


#8

@pchavan,

I would recommend that you go through the whole model-building life cycle again rather than focusing only on XGBoost to improve the model’s accuracy. I always follow the approach below, and it has worked very well for me:

Problem Identification and Hypothesis generation

  1. Identify the problem first (If you have the domain experience, great)

  2. Generate hypotheses (which features could affect the target variable). Caution: you should perform this step without looking at the data.

Data Exploration and Pre processing

  1. Data exploration (uncovering hidden trends), plus missing-value and outlier treatment

  2. Feature engineering (create new variables from existing ones; the hypotheses from step 2 above will give you ideas for new variables)

Model Building

  1. Select the right validation set; it helps you avoid over-fitting

  2. Select the right algorithm (sometimes logistic regression delivers better results than the others)

  3. Tune the parameters using grid search, although with experience you can do this tuning manually

  4. Try multiple algorithms and check both the cross-validation score and the leaderboard score

  5. Build an ensemble of the outputs of multiple algorithms
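To make the validation step concrete, here is a rough sketch of k-fold cross-validation in base R. Everything in it is an assumption for illustration: the data are simulated, `cv_rmse` is a hypothetical helper, and `lm` stands in for whatever algorithm is being compared.

```r
set.seed(42)

# Simulated data: y depends on x1 only; x2 is pure noise
n  <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 3 * df$x1 + rnorm(n)

# Randomly assign each row to one of k folds
k    <- 5
fold <- sample(rep(1:k, length.out = n))

# Hypothetical helper: mean RMSE over the held-out folds for a given formula
cv_rmse <- function(formula, data, fold) {
  target <- all.vars(formula)[1]
  errs <- sapply(sort(unique(fold)), function(i) {
    fit  <- lm(formula, data = data[fold != i, ])
    pred <- predict(fit, newdata = data[fold == i, ])
    sqrt(mean((data[[target]][fold == i] - pred)^2))
  })
  mean(errs)
}

rmse_small <- cv_rmse(y ~ x1, df, fold)
rmse_full  <- cv_rmse(y ~ x1 + x2, df, fold)
print(c(small = rmse_small, full = rmse_full))
```

Comparing candidates on the same folds keeps the comparison fair, and the same loop could be wrapped around a grid of parameter values.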

Finally, make predictions for the test data set. I would also suggest that you understand the pros and cons of an algorithm before using it. This will help you understand which algorithm works for which type of problem (data set).

Hope this helps!

Regards,
Imran


#10

How should I fill in the missing Outlet_Size data? It could help improve the model, but I am not able to impute it because I also want to reorder the levels as Small < Medium < High.


#11

@Vistas,

If you look at the frequency of the missing data, most of it comes from “Grocery store” and Tier 2 outlets.

So I imputed all the missing values as “Small”.
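A small sketch of doing both the imputation and the reordering in base R. It assumes the missing values are stored as NA (in the raw file they may be empty strings instead), and the vector is made up:

```r
# Made-up values; NA marks a missing Outlet_Size
outlet_size <- c("Medium", NA, "High", "Small", NA)

# Impute while the column is still character, then fix the level order
outlet_size[is.na(outlet_size)] <- "Small"
outlet_size <- factor(outlet_size,
                      levels  = c("Small", "Medium", "High"),
                      ordered = TRUE)

print(levels(outlet_size))
print(as.integer(outlet_size))
```

Imputing before creating the factor avoids having to add or rename levels afterwards.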


#12

This little trick did help in improving my score. Rather than using Item_Outlet_Sales as the target / dependent variable, I created a new variable Items_Sold, and used this as the target.

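# Approximate units sold: sales divided by the item's price (MRP)
train$Items_Sold = train$Item_Outlet_Sales / train$Item_MRP
train$Items_Sold = round(train$Items_Sold)
# Drop the original target so it is not used as a feature
train$Item_Outlet_Sales = NULL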

Later on, I multiplied the predicted Items_Sold by Item_MRP to get back Item_Outlet_Sales.
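The full round trip might look like the sketch below. The data are simulated, and `lm` is only a placeholder for whichever model is actually used; neither comes from the original post.

```r
set.seed(1)

# Simulated training data with the two Big Mart columns used in the trick
train <- data.frame(Item_MRP = runif(100, 30, 270))
train$Item_Outlet_Sales <- train$Item_MRP * rpois(100, 15)

# New target: approximate number of units sold
train$Items_Sold <- round(train$Item_Outlet_Sales / train$Item_MRP)

# Placeholder model; any regressor could be fitted on Items_Sold instead
fit <- lm(Items_Sold ~ Item_MRP, data = train)

# Predict units, then multiply by price to return to the sales scale
test <- data.frame(Item_MRP = c(50, 150, 250))
test$Item_Outlet_Sales <- predict(fit, newdata = test) * test$Item_MRP
print(test)
```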

Hope this helps :wink:


#13

Excellent!


#14

Hi Gaurav,

Thanks for the tip, it does help. But thinking out loud, what is the business sense in doing this, considering that all the products are consumable items used in daily life?

Thanks,
Amit


#15

Hi pchavan,

Handling factor/categorical features is an art in machine learning. There are several ways to convert factors into numerical variables:

  1. One-hot encoding
  2. Label encoding
  3. Mean response of the target per level
  4. Frequency of each level
  5. Grouping similar levels together, for example into group 1, 2, 3
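A few of the encodings in this list can be sketched in base R as follows. The data frame is made up, and the new column names (`type_label`, `type_freq`, `type_mean`) are hypothetical:

```r
# Made-up data: one categorical feature and a numeric target
df <- data.frame(
  Outlet_Type = c("Grocery", "Super1", "Super1", "Grocery", "Super2", "Super1"),
  Sales       = c(100, 900, 1100, 300, 1500, 1000),
  stringsAsFactors = FALSE
)

# Label encoding: one integer code per level (levels sorted alphabetically)
df$type_label <- as.integer(factor(df$Outlet_Type))

# Frequency encoding: how often each level occurs
freq <- table(df$Outlet_Type)
df$type_freq <- as.integer(freq[df$Outlet_Type])

# Mean-response encoding: mean of the target within each level
df$type_mean <- ave(df$Sales, df$Outlet_Type, FUN = mean)

print(df)
```

Mean-response encoding uses the target itself, so it is safer to compute it on the training folds only and then apply it to the validation data.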

You also have multiple ways to improve your model.

  1. Hyperparameter optimization, also called parameter tuning.
  2. Treat the missing values.
  3. Find the minimal set of features.
  4. If the above options don’t work, focus more on understanding the data and the relationships among the variables, then create simple features that add more value to your model.

Finally, keep repeating the above steps until you succeed.

Regards,
Ankit Gupta


#16

If a factor has only two levels and you are using a tree-based method (DT, RF, XGBoost, etc.), then one-hot encoding and simple conversion to numeric will return the same result. Is that true?


#17

@pchavan, you can try one-hot encoding (read the sklearn documentation).


#18

Hi Gaurav,

The new feature you created has improved my score a lot. Can you explain the logic behind it and how you arrived at it?

Thanks
Stanley