Feature Engineering within the Cross Validation



If we are doing feature engineering using, for example, category means, should we do this inside or outside the cross-validation loop?

E.g., for 5-fold cross-validation, should we use mean(“whole_data_set”) for both training and holdout, or should we use mean(“4_training_sets”) for training and mean(“holdout_set”) for holdout?


Hi @c3josh

Really nice question.

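Ideally, if possible, you should not use the holdout sample values to calculate the category means. The same logic applies to your test set: you would not use its values to calculate the means either. Instead, compute the means from the training folds inside each cross-validation iteration and apply those same means to the holdout fold.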
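
To make that concrete, here is a minimal sketch in plain Python of computing category means inside the loop. The helper names (`category_means`, `out_of_fold_encode`), the interleaved fold split, and the toy data are all made up for illustration; they are not from any particular library. In practice you would likely use your usual cross-validation splitter instead.

```python
from collections import defaultdict
from statistics import mean


def category_means(rows):
    """Mean target per category, computed from the given rows only."""
    groups = defaultdict(list)
    for category, target in rows:
        groups[category].append(target)
    return {category: mean(values) for category, values in groups.items()}


def out_of_fold_encode(rows, n_folds=5):
    """Encode each row with a category mean fitted WITHOUT that row's fold."""
    encoded = [None] * len(rows)
    for k in range(n_folds):
        # Hypothetical split: every n_folds-th row, starting at k, is the holdout.
        holdout_idx = list(range(k, len(rows), n_folds))
        holdout_set = set(holdout_idx)
        train_rows = [row for i, row in enumerate(rows) if i not in holdout_set]

        # Means come from the training folds only.
        means = category_means(train_rows)
        # Fallback for categories that never appear in the training folds.
        fallback = mean([target for _, target in train_rows])

        # The holdout fold is transformed with the training-fold means.
        for i in holdout_idx:
            encoded[i] = means.get(rows[i][0], fallback)
    return encoded


# Toy data: (category, target) pairs.
rows = [("a", 1.0), ("b", 0.0), ("a", 0.0), ("b", 1.0), ("a", 1.0),
        ("b", 0.0), ("a", 1.0), ("b", 1.0), ("a", 0.0), ("b", 0.0)]
encoded = out_of_fold_encode(rows, n_folds=5)
```

Note that the value assigned to a holdout row never depends on that row's own target, which is the leakage you are trying to avoid when you compute the mean on the whole data set.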