# Clarifying doubts regarding codes

#1

Hi…

In a recently published article, the following code was given:
```r
#Step 1 : Load the train and test files

#Step 2 : Specify basic metrics like number of bags/iterations, number of learners/models

num_models <- 24
iterations <- 1000

#Step 3 : Load the library needed for the performance metric (optional)

library(Metrics)

#Step 4 : Calculating individual performance of models for establishing benchmarks

# Column 1 holds each model's RMSLE, column 2 the model number
rmsle_mat <- matrix(0, num_models, 2)
rmsle_mat[, 2] <- 1:num_models
for (i in 1:num_models) {
  rmsle_mat[i, 1] <- rmsle(train[, i], train[, num_models + 1])
  print(rmsle(train[, i], train[, num_models + 1]))
}
# Pick the model number with the lowest RMSLE and report its error
best_model_no <- rmsle_mat[rmsle_mat[, 1] == min(rmsle_mat[, 1]), 2]
rmsle(train[, best_model_no], train[, num_models + 1])
```

Can you please explain how “train_combined.csv” and “test_combined.csv” are formed? If possible, please explain the subsequent code with an example.

Thanks in anticipation.

#2

Hi @shan4224,

Train.csv and test.csv are the training and test data respectively. They can be formed in any way; take a look at the Kaggle Titanic data to get a better understanding.
It is difficult to explain the code without actually running it, but I will try to give you a basic understanding of what it does.
It mainly goes through the different models (1:24) and calculates the error (RMSLE) for each one. Whichever model has the least error is selected as best_model_no, and the last line reports that model's error on the train data.
Hope this helps!!
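
To make the steps concrete, here is a minimal sketch on made-up data. It assumes, as the indexing `train[,num_models+1]` suggests, that the first `num_models` columns of `train` hold each model's predictions and the next column holds the actual target. The toy data, the use of 3 models instead of 24, and the helper `rmsle_base` (a base R stand-in for `rmsle` from the Metrics package) are illustrative assumptions, not taken from the article.

```r
# Base R stand-in for Metrics::rmsle (root mean squared logarithmic error)
rmsle_base <- function(actual, predicted) {
  sqrt(mean((log1p(actual) - log1p(predicted))^2))
}

set.seed(42)
num_models <- 3   # the article uses 24; 3 keeps the toy example small
n <- 100

# Made-up target plus three sets of predictions with different noise levels
target <- runif(n, min = 1, max = 100)
train <- data.frame(
  model_1 = target * exp(rnorm(n, sd = 0.30)),
  model_2 = target * exp(rnorm(n, sd = 0.05)),
  model_3 = target * exp(rnorm(n, sd = 0.60)),
  target  = target
)

# Column 1: RMSLE of each model; column 2: model number
rmsle_mat <- matrix(0, num_models, 2)
rmsle_mat[, 2] <- 1:num_models
for (i in 1:num_models) {
  rmsle_mat[i, 1] <- rmsle_base(train[, num_models + 1], train[, i])
}
print(rmsle_mat)

# The model with the smallest RMSLE is the benchmark to beat
best_model_no <- rmsle_mat[which.min(rmsle_mat[, 1]), 2]
print(best_model_no)
```

Here `model_2` was generated with the least noise, so it ends up with the lowest RMSLE and is picked as the best single model. The sketch uses `which.min` rather than the equality test from the article so that a tie still returns one model number.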

#3

Hi… @shuvayan,

I understand that Train.csv and test.csv are the training and test data respectively. But the article mentions “train_combined.csv” and “test_combined.csv”. I was wondering whether they are plain train and test data, or some combination, as the names suggest.

Thanks…

#4

@shan4224 I think they are basically train and test data, like in any other Kaggle problem; it is just how the files are named.

Hope this helps!
Regards,
hinduja1234

#5

I think train_combined and test_combined are the sets of final variables to be considered. This might include:

1. The exclusion of some variables.
2. Newly generated variables.

Also, sometimes the variable to be predicted is added to the test dataset.
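
As a rough illustration of that idea, the sketch below builds "combined" train and test sets from made-up raw data by dropping one variable, adding a generated one, and adding a placeholder target column to the test set. All column names and values are hypothetical, and nothing in the thread confirms that the article's files were built this way.

```r
# Hypothetical raw data; none of these columns come from the article
train_raw <- data.frame(
  id       = 1:4,
  age      = c(22, 35, 58, 41),
  fare     = c(7.25, 71.28, 26.55, 8.05),
  survived = c(0, 1, 1, 0)
)
test_raw <- data.frame(
  id   = 5:6,
  age  = c(30, 19),
  fare = c(13.00, 7.90)
)

# Apply the same variable selection and feature generation to both sets
prepare <- function(df) {
  df$id <- NULL                  # 1. exclude a variable
  df$log_fare <- log1p(df$fare)  # 2. add a newly generated variable
  df
}

train_combined <- prepare(train_raw)
test_combined  <- prepare(test_raw)

# Add the variable to be predicted to the test set as an empty placeholder,
# then match the column order of the train set
test_combined$survived <- NA
test_combined <- test_combined[, names(train_combined)]

print(train_combined)
print(test_combined)
```

Both data frames could then be saved with `write.csv(train_combined, "train_combined.csv", row.names = FALSE)` and the matching call for the test set.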