Clarifying doubts regarding codes




A recently published article included the following code:
#Step 1 : Load the train and test files

train <- read.csv("train_combined.csv")
test <- read.csv("test_combined.csv")

#Step 2 : Specify basic metrics like number of bags/iterations, number of learners/models

num_models <- 24
iterations <- 1000

#Step 3 : Load the library needed for the performance metric (optional)

library(Metrics)  # assumed: the Metrics package provides the rmsle() used in Step 4

#Step 4 : Calculating individual performance of models for establishing benchmarks

rmsle_mat <- matrix(0,num_models,2)
rmsle_mat[,2] <- 1:num_models
for(i in 1:num_models){
  rmsle_mat[i,1] <- rmsle(train[,i],train[,num_models+1])
}
best_model_no <- rmsle_mat[rmsle_mat[,1] == min(rmsle_mat[,1]),2]

Can you please explain how “train_combined.csv” and “test_combined.csv” are formed? If possible, please explain the subsequent code with an example.

Thanks in anticipation.


Hi @shan4224,

Train.csv and test.csv are the training and test data respectively. They can be formed in any way; take a look at the Kaggle Titanic data to get a better understanding.
It is difficult to explain the code without actually running it, but I will try to give you a basic understanding of what it does.
It mainly loops over the 24 models (1:24) and calculates the error (RMSLE) of each one against the actual values in column 25. Whichever model has the least error is selected as the best model (best_model_no), which serves as the benchmark.
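A small worked example may make the loop easier to follow. This is only a sketch on made-up numbers: the toy `train` data frame, its column names, and the hand-written `rmsle()` (a stand-in for the one in the Metrics package) are all assumptions, not taken from the article.

```r
# Stand-in for Metrics::rmsle(): root mean squared logarithmic error.
rmsle <- function(actual, predicted) {
  sqrt(mean((log1p(actual) - log1p(predicted))^2))
}

# Toy version of train_combined.csv: one column of predictions per model,
# followed by a final column holding the actual values (made-up numbers).
train <- data.frame(
  model_1 = c(10, 20, 30, 40),
  model_2 = c(12, 18, 33, 39),
  model_3 = c(5, 40, 10, 80),
  actual  = c(11, 19, 31, 41)
)
num_models <- 3

# Column 1 holds each model's error, column 2 holds the model number.
rmsle_mat <- matrix(0, num_models, 2)
rmsle_mat[, 2] <- 1:num_models
for (i in 1:num_models) {
  rmsle_mat[i, 1] <- rmsle(train[, i], train[, num_models + 1])
}

# The model with the smallest error becomes the benchmark.
best_model_no <- rmsle_mat[which.min(rmsle_mat[, 1]), 2]
print(rmsle_mat)
print(best_model_no)
```

Here `which.min()` replaces the `== min(...)` comparison from the article; it picks the same row when there is a single smallest error and returns only the first one in case of ties.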
Hope this helps!!


Hi… @shuvayan,

Thanks for the prompt reply.
I understand that Train.csv and test.csv are the training and test data respectively. But the article mentions “train_combined.csv” and “test_combined.csv”. I was wondering whether they are the plain train and test data or some combination, as the names suggest.



@shan4224 - I think they are basically train and test data, as in any other Kaggle problem; it is just how the files are named.

Hope this helps!


I think train_combined and test_combined are the sets of final variables to be considered. This might include:

  1. The exclusion of some variables.
  2. New generated variables.
    Also, sometimes the variable to be predicted is added to the test dataset.
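
Another reading that fits the quoted code: the loop treats columns 1 to num_models as predictions and column num_models+1 as the actual value, so a "combined" file could simply be each model's predictions placed side by side with the target. The sketch below shows that idea under this assumption; the toy data, the two `lm()` fits, and the column names are hypothetical and not from the article.

```r
# Hypothetical raw training data (made up for illustration).
set.seed(1)
raw <- data.frame(x = 1:20)
raw$y <- 3 * raw$x + rnorm(20)

# Two simple models standing in for the 24 mentioned in the article.
fit_1 <- lm(y ~ x, data = raw)
fit_2 <- lm(y ~ log(x), data = raw)

# One column of predictions per model, with the actual values last,
# so that column num_models + 1 is the target.
train_combined <- data.frame(
  model_1 = fitted(fit_1),
  model_2 = fitted(fit_2),
  actual  = raw$y
)

# Written to a temporary directory to avoid touching the working directory.
path <- file.path(tempdir(), "train_combined.csv")
write.csv(train_combined, path, row.names = FALSE)

train <- read.csv(path)
str(train)
```

A test_combined file would be built the same way from predictions on the test rows, with or without the target column depending on whether it is known.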