Imputing missing values in the Smart recruits hackathon



Hello People,

In the Smart Recruits hackathon there are about 1600 rows where all the Manager variables are missing.
My idea is to create a unique ID for each manager (each manager has a unique combination of values across the Manager variables) and then build a model with that ID as the target variable and all the applicant variables as independent variables.
This is what I did:

  1. Split the data set into ‘missing’ (at least one value in the row is missing) and ‘non missing’ (none of the values are missing) datasets.

  2. Extract the Manager variables from the non missing data set.

  3. Assign an ID to each unique manager.

  4. Merge the IDs back so that each row in the non missing data set carries its manager’s ID.

  5. Create a model where the target variable is the ID and the independent variables are all the applicant variables.

  6. Predict the corresponding ID for each row in the ‘missing’ dataset.

The code:

    library(dplyr)         # %>% and select()
    library(randomForest)

    # Placeholder target so that test has the same columns as train
    test$Business_Sourced = 2
    combo = rbind(train, test)
    # Complete rows vs. rows with at least one missing value
    filled = na.omit(combo)
    notfilled = subset(combo, !(ID %in% filled$ID))

    # One row per unique manager profile, numbered 1..n (n was 6641 here)
    y = filled %>% select(Office_PIN, Manager_DOJ:Manager_Num_Products2)
    manager = unique(y)
    manager$newvar <- seq_len(nrow(manager))
    # Attach the manager ID (newvar) to every complete row
    z = merge(manager, filled, by = c("Office_PIN", "Manager_DOJ", "Manager_Joining_Designation", "Manager_Current_Designation", "Manager_Grade", "Manager_Status", "Manager_Gender", "Manager_DoB", "Manager_Num_Application", "Manager_Num_Coded", "Manager_Business", "Manager_Num_Products", "Manager_Business2", "Manager_Num_Products2"))
    # Keep Office_PIN, newvar and the columns used as predictors
    Applicant_with_managerID = z %>% select(Office_PIN, newvar:Applicant_Qualification, Same_Locality:Applicant_Age)
    Applicant_with_managerID$newvar = as.factor(Applicant_with_managerID$newvar)
    Applicant_with_managerID$Applicant_City_PIN = as.numeric(Applicant_with_managerID$Applicant_City_PIN)
    Applicant_with_managerID$Office_PIN = as.numeric(Applicant_with_managerID$Office_PIN)
    # Classify the manager ID from everything except the row ID
    model1 = randomForest(newvar ~ . - ID, data = Applicant_with_managerID, ntree = 50)

Since there are more than 6000 classes to predict (6641 unique managers), I’m unable to create a model.
What should I do? Is my approach right? Should I change my model?



Hi @B.Rabbit,

Please humor me on this: you have the applicant variables as independent variables. How will applicant information help you find out who the manager is? Something like Office_PIN would help a bit more.

Also, in the code (if I understand correctly), you take Manager_Num_Products into account when building the manager ID. That would be wrong, as that feature changes over time even when the manager is the same! Try choosing only those features which would help you pinpoint the manager.
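
To make the point concrete, here is a minimal sketch of building the ID from time-invariant columns only. The column names come from the thread, but the toy values, the `stable_cols` choice and the `Manager_ID` name are assumptions for illustration; which columns are really stable has to be checked against the data.

```r
# Toy data: column names follow the thread, values are invented.
filled <- data.frame(
  Office_PIN           = c(842001, 842001, 842001, 800001),
  Manager_DOJ          = c("11/10/2005", "11/10/2005", "3/2/2007", "5/6/2004"),
  Manager_DoB          = c("2/17/1978", "2/17/1978", "1/9/1980", "7/1/1975"),
  Manager_Gender       = c("M", "M", "F", "M"),
  Manager_Num_Products = c(2, 5, 0, 1),  # changes over time for one manager
  stringsAsFactors = FALSE
)

# Assumption: these columns do not change over time for a given manager.
stable_cols <- c("Office_PIN", "Manager_DOJ", "Manager_DoB", "Manager_Gender")

# One key string per row; distinct keys are numbered in order of appearance.
key <- do.call(paste, c(filled[stable_cols], sep = "|"))
filled$Manager_ID <- match(key, unique(key))

# Compare with the number of "managers" found when a time-varying column
# is included in the key.
n_stable <- length(unique(key))
n_all    <- nrow(unique(filled[c(stable_cols, "Manager_Num_Products")]))
```

In this toy data the first two rows are the same manager at two points in time, so the stable key gives them one ID, while a key that includes Manager_Num_Products would split them into two.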

Hope that helps.


Hi @jalFaizy,
As you can see in the code, I did take Office_PIN. Still, the model is performing badly. With respect to Manager_Num_Products, you’re right. I’ll remove it. My idea is that even if the applicant variables aren’t directly related to the manager ID, things like Office_PIN and the applicant PIN could help me point to an ID. Even if they don’t, I’ll just overfit so as to impute the missing values (maybe that won’t be a problem, as the test and train data come from the same data). What do you think?



Here are some of my ideas for filling missing values:

  • Extract city/state information from Office_PIN (first two or three digits) and look at how it relates to the manager ID.
  • Hypothesis: If a manager is on probation, he would stay on probation for some time. So that can be a feature.
  • Hypothesis: If an applicant has worked for the company before and has applied again, he will likely work under the same manager as the previous time.
  • Look at the trend in how many applications a manager gets per month, and use that trend to fill the missing values (because it’s time series data). The same can be applied to Manager_Num_Coded.
  • (A somewhat complex approach) Find the behavior of a manager with respect to prospective applicants, i.e. what kind of applicants a manager prefers. For this, you have to do extensive data exploration (and this may not be fruitful, as the data is small).
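
The first idea above could be sketched roughly like this. The toy values, the `Manager_ID` column and the rule "fill only when a prefix has exactly one known manager" are assumptions for illustration, not something verified on the hackathon data.

```r
# Toy data: values are invented; Manager_ID is a hypothetical column holding
# an already assigned manager ID, NA where the manager is unknown.
df <- data.frame(
  Office_PIN = c(842001, 842005, 800001, 800020, 842001),
  Manager_ID = c(1, 1, 2, 3, NA)
)

# First two / three digits of the PIN as coarse region keys.
# sprintf avoids scientific notation for values such as 800000.
pin <- sprintf("%.0f", df$Office_PIN)
df$PIN_2 <- substr(pin, 1, 2)
df$PIN_3 <- substr(pin, 1, 3)

# How many distinct known managers share each 3-digit prefix?
known <- df[!is.na(df$Manager_ID), ]
managers_per_prefix <- tapply(known$Manager_ID, known$PIN_3,
                              function(x) length(unique(x)))

# Where a prefix maps to exactly one known manager, use that manager as a
# candidate fill for rows with a missing ID; leave the other rows as NA.
single <- names(managers_per_prefix)[managers_per_prefix == 1]
lookup <- tapply(known$Manager_ID, known$PIN_3, function(x) x[1])
fill   <- is.na(df$Manager_ID) & df$PIN_3 %in% single
df$Manager_ID[fill] <- unname(lookup[df$PIN_3[fill]])
```

If most prefixes turn out to hold many managers, the prefix is probably more useful as a feature in a model than as a direct lookup.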