In the Smart Recruits hackathon there are about 1,600 rows where the values of all the Manager variables are missing.
My idea is to create a unique manager ID for each manager (each manager has a unique combination of values across the manager variables) and build a model with that ID as the target variable and all the applicant variables as independent variables.
To deal with these missing rows, this is what I did:
1. Split the data set into ‘missing’ (at least one value in the row is missing) and ‘non-missing’ (no value in the row is missing) data sets.
2. Extract all the values of the Manager variables from the non-missing data set.
3. Assign an ID to each unique manager.
4. Merge the IDs back so that each row in the non-missing data set carries its manager's ID.
5. Build a model with the ID as the target variable and all the applicant variables as independent variables.
6. Predict the corresponding ID for each row in the ‘missing’ data set.
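On a small made-up data frame, steps 1 to 4 look roughly like this. It is only a sketch: the column names and values are invented for illustration (the real data has many more manager variables), and `manager_id` is a hypothetical name for the new ID column.

```r
# Toy data: invented columns and values, not the hackathon's real schema.
toy <- data.frame(
  ID             = 1:6,
  Applicant_Age  = c(25, 31, 40, 28, 35, 22),
  Manager_Grade  = c(3, 3, NA, 4, 4, NA),
  Manager_Status = c("Confirmation", "Confirmation", NA,
                     "Probation", "Probation", NA),
  stringsAsFactors = FALSE
)

manager_cols <- c("Manager_Grade", "Manager_Status")

# Step 1: a row goes to 'notfilled' if any of its values is NA.
is_complete <- complete.cases(toy)
filled      <- toy[is_complete, ]
notfilled   <- toy[!is_complete, ]

# Steps 2-3: one row per unique combination of manager values, numbered 1..n.
managers <- unique(filled[, manager_cols])
managers$manager_id <- seq_len(nrow(managers))

# Step 4: attach the ID to every complete row by joining on the manager columns.
filled <- merge(filled, managers, by = manager_cols)
filled$manager_id <- factor(filled$manager_id)

print(filled)
```

Using `seq_len(nrow(managers))` instead of a hard-coded count keeps the numbering correct if the number of unique managers changes.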
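library(dplyr)         # provides %>% and select()
library(randomForest)

# Give the test rows a placeholder target value so train and test can be stacked
test$Business_Sourced = 2
combo = rbind(train, test)

# Rows with no missing values vs. rows with at least one missing value
filled = na.omit(combo)
notfilled = subset(combo, !(ID %in% filled$ID))

# One row per unique combination of manager variables
y = filled %>% select(Office_PIN, Manager_DOJ:Manager_Num_Products2)
# table(y)   # cross-tabulating 14 columns at once is likely too large to build
manager = unique(y)
manager$newvar <- seq_len(nrow(manager))   # 6641 unique managers in my data

# Attach the manager ID to every row of the non-missing data
z = merge(manager, filled,
          by = c("Office_PIN", "Manager_DOJ", "Manager_Joining_Designation",
                 "Manager_Current_Designation", "Manager_Grade", "Manager_Status",
                 "Manager_Gender", "Manager_DoB", "Manager_Num_Application",
                 "Manager_Num_Coded", "Manager_Business", "Manager_Num_Products",
                 "Manager_Business2", "Manager_Num_Products2"))

# Keep the manager ID plus the applicant variables
Applicant_with_managerID = z %>%
  select(Office_PIN, newvar:Applicant_Qualification, Same_Locality:Applicant_Age)
Applicant_with_managerID$newvar = as.factor(Applicant_with_managerID$newvar)
str(Applicant_with_managerID)
Applicant_with_managerID$Applicant_City_PIN = as.numeric(Applicant_with_managerID$Applicant_City_PIN)
Applicant_with_managerID$Office_PIN = as.numeric(Applicant_with_managerID$Office_PIN)

# Classify the manager ID from the applicant variables (ID column excluded)
model1 = randomForest(newvar ~ . - ID, data = Applicant_with_managerID, ntree = 50)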
Since there are more than 6,000 classes to predict (6,641 unique managers), I’m unable to fit this model.
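One thing that may be worth checking with this many classes is how many rows each manager ID actually has, since a class that appears only once gives the model very little to learn from. A minimal sketch on invented IDs (the values below are toy data, not the hackathon's):

```r
# Hypothetical manager IDs for eight applicants (toy values).
manager_id <- factor(c(1, 1, 1, 2, 3, 3, 4, 5))

# Number of rows available for each class.
class_sizes <- table(manager_id)

# Share of classes that appear only once.
singleton_share <- mean(as.vector(class_sizes) == 1)

print(class_sizes)
print(singleton_share)
```

On the real data the same two lines would run on `Applicant_with_managerID$newvar`.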
What should I do? Is my approach right? Should I change my model?