Explanation of GBM pseudocode




In a recent article, the pseudocode of GBM is described as follows:

  1. Initialize the outcome
  2. Iterate from 1 to total number of trees
    2.1 Update the weights for targets based on previous run (higher for the ones mis-classified)
    2.2 Fit the model on selected sub sample of data
    2.3 Make predictions on the full set of observations
    2.4 Update the output with current results taking into account the learning rate
  3. Return the final output.

It would be helpful if someone could clear up the following points of confusion.

a) Does the total number of trees refer to n_estimators?

b) Sub-point 2.2 says the model is fitted on a sub-sample of the data, while 2.3 refers to predictions on the full set of observations.
So are we predicting on the whole data set using a model built from only the sub-sample?
An explanation with an example would be helpful.

c) An example of warm_start would be helpful.

Thanks in anticipation…


Hi @shan4224,

A couple of suggestions before answering:

  1. It’ll be good if you give a link to the article in such posts so that others reading this know what you are referring to.
  2. You can post a link to this discussion in the comments on the article. This way the author would get a notification that there is a query which he can address. Additionally, other people reading the article and having similar doubts can see the link to the discussion there itself.

Regarding your points, here are my comments:

a) Yes, the total number of trees is what n_estimators sets.

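b) Yes, each tree is built on a random subset of the rows and then used to predict for the entire data set. It is loosely like a train and test split: the rows picked in the random selection act as the training set for that tree, and the remaining rows are ones it has not seen. The prediction has to cover every row because the output in step 2.4 is updated for all observations, not only the sampled ones. I hope that is clear.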
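To make point b) concrete, here is a minimal sketch of the loop from the pseudocode. It is not the article's code and not scikit-learn's implementation. I'm assuming squared-error regression, where step 2.1 (more attention to badly predicted targets) becomes "fit the next tree to the current residuals". The name fit_gbm_sketch and all parameter values are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gbm_sketch(X, y, n_trees=50, learning_rate=0.1, subsample=0.5, seed=0):
    """Toy gradient boosting for squared-error regression (illustrative only)."""
    rng = np.random.default_rng(seed)
    n_rows = X.shape[0]
    init = y.mean()
    prediction = np.full(n_rows, init)            # step 1: initialize the outcome
    trees = []
    for _ in range(n_trees):                      # step 2: one pass per tree
        residuals = y - prediction                # step 2.1 (assumed form): what is still unexplained
        rows = rng.choice(n_rows, size=int(subsample * n_rows), replace=False)
        tree = DecisionTreeRegressor(max_depth=2, random_state=0)
        tree.fit(X[rows], residuals[rows])        # step 2.2: fit on the sub-sample only
        update = tree.predict(X)                  # step 2.3: predict on ALL rows
        prediction += learning_rate * update      # step 2.4: shrink by the learning rate
        trees.append(tree)
    return init, trees, prediction                # step 3: final output


# Synthetic data, purely for demonstration
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

init, trees, prediction = fit_gbm_sketch(X, y)
print("baseline MSE:", np.mean((y - y.mean()) ** 2))
print("boosted MSE: ", np.mean((y - prediction) ** 2))
```

Note that tree.fit sees only the sampled rows, while tree.predict is called on X in full, so the running prediction stays defined for every observation in the next iteration.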

c) Let me give a dummy example here:

Suppose train_predictors and train_target are the input and output, then:

from sklearn.ensemble import GradientBoostingClassifier

# Define a GBM with 1 estimator and warm_start=True (add your other parameters as needed)
gbm = GradientBoostingClassifier(n_estimators=1, warm_start=True)
# Fit the single estimator:
gbm.fit(train_predictors, train_target)

# Now gradually add more trees to the GBM
for n_est in range(2, 1000):
    # Set the new number of estimators:
    gbm.set_params(n_estimators=n_est)
    # Fit the model again. With warm_start=True it keeps the trees already
    # fitted and continues from there, adding only the new one.
    gbm.fit(train_predictors, train_target)

You can try something like this and it should work. Let me know in case you face any challenges.
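If you want something you can paste and run directly, here is a small self-contained sketch of the same warm_start idea. The data comes from make_classification as a stand-in for train_predictors and train_target, and the tree counts (10, 20, 30) are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-ins for train_predictors / train_target (an assumption, not real data)
train_predictors, train_target = make_classification(
    n_samples=300, n_features=10, random_state=0
)

# Start with a single tree and ask scikit-learn to keep it on later fits
gbm = GradientBoostingClassifier(n_estimators=1, warm_start=True, random_state=0)
gbm.fit(train_predictors, train_target)

for n_est in (10, 20, 30):
    gbm.set_params(n_estimators=n_est)          # raise the target number of trees
    gbm.fit(train_predictors, train_target)     # only the missing trees are fitted
    # len(gbm.estimators_) is the number of boosting stages fitted so far
    print(n_est, len(gbm.estimators_), gbm.score(train_predictors, train_target))
```

The printed tree count should track n_estimators, which also illustrates point a).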