How to implement stacking and blending of various models in R

stacking
blending

#1

Hello,

I am trying to understand the concepts of stacking and blending in ensemble methods. I came across the piece of Python code below, used for one of the Kaggle competitions:


I have a few questions:
1. Is the StratifiedKFold function creating a k-fold cross-validation?
2. Is clfs storing all the classifiers to be applied?
3. In the part for i, (train, test) in enumerate(skf): are they trying each classifier on each fold? So if I have k_fold = 10 and 4 classifiers, does this loop run 40 times?
4. I understand that in the last box, clf.fit(X_train, y_train) fits the model. Do the results of each model for each iteration get saved by clf.fit? What is happening in the last 4 lines of the box?
I am sorry if these are very basic questions, but I am new to ML (and Python), and I want to implement this in R. Any help is greatly appreciated.


#2

It seems you have everything figured out already except part 4. The code snippet you pasted uses a logistic regression to “blend” 5 classifiers. In part 4, out-of-sample predictions are obtained for the train set and for the whole test set. In each iteration of the loop you get one part of the train-set prediction (the part used as the validation fold) and a prediction for the whole test set. So if you have 4 folds, you obtain predictions for the full train set exactly once via the stratified K-fold, and 4 sets of predictions for the test set, of which the mean is taken in the last line.

P.S. The best way to understand stuff from Kaggle is to take part in a competition. If you want to do some homework on the above model, the BreastCancer dataset from the R package mlbench should help. In R you can use the caret package's createFolds function to do the stratified sampling, but you have to take a bit more pain to adjust the indexes; see the sketch below.
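For instance, here is a minimal sketch of that homework, assuming the mlbench, caret and randomForest packages are installed. The names blend_train/blend_test and the choice of a random forest plus a logistic regression as first-level models are purely illustrative, not from the original Python script:

```r
library(mlbench)        # BreastCancer data
library(caret)          # createFolds, createDataPartition
library(randomForest)

data(BreastCancer)
dat <- na.omit(BreastCancer[, -1])                      # drop the Id column and rows with NAs
dat[, 1:9] <- lapply(dat[, 1:9], function(x) as.numeric(as.character(x)))

set.seed(1)
in_train <- createDataPartition(dat$Class, p = 0.7, list = FALSE)
train    <- dat[in_train, ]
holdout  <- dat[-in_train, ]                            # plays the role of the Kaggle test set

k     <- 4
folds <- createFolds(train$Class, k = k)                # stratified; list of validation indexes

## First-level models: each entry knows how to fit and how to predict P(malignant)
models <- list(
  rf    = list(fit  = function(d) randomForest(Class ~ ., data = d),
               pred = function(m, d) predict(m, d, type = "prob")[, "malignant"]),
  logit = list(fit  = function(d) glm(Class ~ ., data = d, family = binomial),
               pred = function(m, d) predict(m, d, type = "response")))

blend_train <- matrix(NA, nrow(train),   length(models))  # out-of-fold train predictions
blend_test  <- matrix(0,  nrow(holdout), length(models))  # fold-averaged holdout predictions
colnames(blend_train) <- colnames(blend_test) <- names(models)

for (j in seq_along(models)) {
  for (i in seq_len(k)) {
    val <- folds[[i]]                                   # the "adjust the indexes" part
    fit <- models[[j]]$fit(train[-val, ])
    blend_train[val, j] <- models[[j]]$pred(fit, train[val, ])
    blend_test[, j] <- blend_test[, j] + models[[j]]$pred(fit, holdout) / k
  }
}

## Second level: logistic regression on the first-level predictions only (= blending)
meta_fit   <- glm(Class ~ ., data = data.frame(blend_train, Class = train$Class),
                  family = binomial)
final_prob <- predict(meta_fit, data.frame(blend_test), type = "response")
```

Note that accumulating blend_test with / k inside the fold loop is the same as taking the mean of the k holdout predictions at the end, which is what the last line of the Python script does.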

If you bind (cbind) the probabilities to the train set itself, the process becomes stacking. In blending, as seen above, you apply the second-level classifier to the outputs of the first-level classifiers only.
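Continuing the sketch above, the only change needed to turn the blending into stacking would be to cbind the original features to the first-level probabilities before fitting the second-level model, roughly:

```r
## Stacking: the second-level model sees original features + first-level predictions
stack_train <- cbind(train, blend_train)                        # train still contains Class
stack_test  <- cbind(holdout[, names(holdout) != "Class"], blend_test)

stack_fit  <- glm(Class ~ ., data = stack_train, family = binomial)
stack_prob <- predict(stack_fit, stack_test, type = "response")
```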