The Smart Recruits Hackathon Discussion


#1

Hi,

Is there a solution provided for the Smart Recruits (23-24 July) hackathon? Problem: Fintro is a financial distribution company. Over the last 10 years, they have created an offline distribution channel across India. They sell financial products to consumers by hiring agents into their network. These agents are freelancers and earn a commission when they make a product sale.


#2

Hello @hbhargava2 ,

Python solution (rank 2): Mr. SudalaiRajkumar’s code

R solution (rank 1): Mr. Rohan Rao’s code

Hope that helps :slight_smile:


#3

Thanks. Yes, it is helpful.


#4

Here’s my winning solution, which scored 0.8856 (public) / 0.7658 (private) and ranked 1st on the public LB and private LB.

I have a single XGBoost model with 14 features. The most important feature is the ordering of the applications. This could either be a trend where applications received towards the end of the day are more likely to be rejected, or it could be a data preparation issue. Either way, I think it was an interesting pattern to catch, and it was some simple plots/summaries that caught my eye. It gave me the boost from 0.65 to 0.85. The feature is Order_Percentile in my code.

Here’s my complete code… it runs in less than a minute on my 4GB MacBook Air and is barely 30 lines. I have expanded it for easier reading.
Hope you find it useful.

Thanks again to AV for a nice fruitful weekend hackathon :slight_smile:

Maybe I’ll write a longer blog post on this later if time permits this week.
Edit: Done - http://rohanrao91.blogspot.in/2016/08/the-smart-recruit.html
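
For readers following along, here is a minimal sketch of what an order-percentile feature of this kind might look like. It is written in plain Python rather than R and is not the code from the post. It assumes that the order is simply the row position in the file as provided, and that the percentile is that position rescaled to the range 0 to 1 within each Application_Receipt_Date, using the per-date minimum and maximum order. The function name and the sample dates are made up for illustration.

```python
def order_percentile(receipt_dates):
    """Rescale each row's position to [0, 1] within its receipt date.

    receipt_dates: list of date strings, in the original file order.
    Assumption: "order" means the row index in that list.
    """
    # First pass: record the smallest and largest row index seen per date.
    bounds = {}
    for order, date in enumerate(receipt_dates):
        lo, hi = bounds.get(date, (order, order))
        bounds[date] = (min(lo, order), max(hi, order))

    # Second pass: position of each row between its date's min and max.
    percentiles = []
    for order, date in enumerate(receipt_dates):
        lo, hi = bounds[date]
        if hi == lo:
            # Only one application that day: no spread to rescale over.
            percentiles.append(0.0)
        else:
            percentiles.append((order - lo) / (hi - lo))
    return percentiles


# Made-up example: three applications on one day, two on the next.
dates = ["7/23/2016", "7/23/2016", "7/23/2016", "7/24/2016", "7/24/2016"]
print(order_percentile(dates))  # [0.0, 0.5, 1.0, 0.0, 1.0]
```

A value near 1.0 marks an application that appears late in its day's block of rows, which is the kind of signal the post describes.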


#5

Hi Rohan,

I was wondering about 2 things:

  1. Why would you do this:

X_train <- subset(X_train, !Manager_Joining_Designation %in% c("Level 7", "Other"))

Is it because the test data does not have those instances?

How do you know that they are noise?

  2. What made you think that the ordering of applications within a given day might be important? Is it based on ID?


#6

Here’s my [submission for the challenge](https://github.com/faizankshaikh/AV_SmartRecruits) (Public rank 60+, private rank 11)

It’s not much, but at least the model was stable.

The things that helped me were

  • Model Validation strategy
    • Divided the data into two parts according to time (Application_Receipt_Date) and did the validation
  • Features with overall applicant/manager performance
    • Applicant_Experience
    • Manager_Experience
    • Manager_All_Time_Business, etc
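
A time-based validation split like the one in the first bullet can be sketched as follows. This is plain Python, not the code from the linked repository. It assumes each row is a dict with an Application_Receipt_Date string in month/day/year format (the format used in the R code quoted elsewhere in this thread); the function name and the cutoff date are hypothetical.

```python
from datetime import datetime


def time_split(rows, cutoff, date_key="Application_Receipt_Date", fmt="%m/%d/%Y"):
    """Split rows into (train, valid) by receipt date.

    Rows dated before `cutoff` go to train, the rest to valid, so the
    validation set always lies in the "future" relative to training.
    """
    cutoff_date = datetime.strptime(cutoff, fmt)
    train, valid = [], []
    for row in rows:
        row_date = datetime.strptime(row[date_key], fmt)
        (train if row_date < cutoff_date else valid).append(row)
    return train, valid


# Made-up rows; the cutoff is an arbitrary choice for illustration.
rows = [
    {"ID": 1, "Application_Receipt_Date": "04/16/2007"},
    {"ID": 2, "Application_Receipt_Date": "06/15/2008"},
    {"ID": 3, "Application_Receipt_Date": "01/02/2008"},
]
train, valid = time_split(rows, cutoff="01/01/2008")
print(len(train), len(valid))
```

Splitting on a date rather than at random mimics how the test set is separated from the training set in time, which is presumably why it gives a more honest estimate of the leaderboard score.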

Overall it was an excellent experience. Thanks AV!


#8

@jalFaizy Your private rank is better than your public rank. What does that reflect?


#9

It means that the model was stable (public score: 0.62, private score: 0.63).

You want your model to be stable, so that it performs the same on new data. So a higher-scoring model might not be a ‘better’ model if it does not generalize.


#10

Hi Rohan,
Hearty congratulations on the win. It was really brilliant that you caught the pattern in the data. When I read the problem statement, I was of the impression that the response variable (y) was whether or not the hired applicant was able to bring in business at the end of 3 months of joining the company. But your comment suggests that you treated it as an application accept/reject problem, and you accounted for the pattern in application submission.

Could you kindly let me know how much the score changed when this pattern was not included in your analysis? Thanks in advance.

Best.


#11

Hello People,

In the Smart Recruits hackathon there are about 1600 rows where all the manager variables are missing. I split the data into two datasets, one with the missing values and one without.
My idea is to create a unique manager ID for each manager (each manager has unique values across all the manager variables) and build a model with the ID as the target variable and all the applicant variables as independent variables.

I used a random forest model to achieve the above, but I’m not getting good results.
Is the approach right? What other model should I use? There are 6400 unique managers.
Link to my code
Regards

@jalFaizy @Rohan_Rao please help


#12

I’m new to data science and R. Rohan, can you please explain this:
X_train_order <- X_train[, .(Max_Order = max(Order),
                             Min_Order = min(Order)), .(Application_Receipt_Date)]
Also, can you explain on what basis you ignored certain columns like occupation, manager gender, etc.?
Thanks
buvana


#13

I am not sure what you are trying to achieve by creating the manager ID as the target variable. This looks absurd. As far as splitting the data is concerned, that is not required for decision-tree-based methods like random forest or XGBoost.


#14

@buvana: have you tried running the code to see what’s happening?


#15

Hello @munitech4u,
There are 1600 rows of data where all the manager variable values are missing. So I split the data into two (one dataset which has all the manager variables filled in and another which doesn’t). Now say I give an ID to each manager and build a model that takes all the variables other than the manager variables and outputs a manager ID. I can then use that model to fill in the missing values in the other dataset. That’s what I’m trying to achieve.

Regards


#16

I did, but I don’t understand why this is needed or what happens to the data because of this code.


#17

Hello Rohan, congratulations on your success. I am not an expert in R coding like you. I am trying to run the code which you wrote.

I copied and pasted your code from the GitHub link and got an error.

This is what I copied and pasted:

X_train <- sourav[, `:=`(ID = NULL,
    Office_PIN = as.numeric(Office_PIN),
    Application_Receipt_Date = as.numeric(as.Date("2016-01-01") - as.Date(Application_Receipt_Date, "%m/%d/%Y")),
    Applicant_City_PIN = as.numeric(Applicant_City_PIN),
    Applicant_Gender = ifelse(Applicant_Gender == "M", -1, ifelse(Applicant_Gender == "F", 1, 0)),
    Applicant_Age = as.numeric(as.Date("2016-01-01") - as.Date(Applicant_BirthDate, "%m/%d/%Y"))/365,
    Applicant_BirthDate = NULL,
    Applicant_Marital_Status = NULL,
    Applicant_Occupation = as.numeric(as.factor(Applicant_Occupation)),
    Applicant_Qualification = NULL,
    Manager_Experience = as.numeric(as.Date("2016-01-01") - as.Date(Manager_DOJ, "%m/%d/%Y"))/365,
    Manager_DOJ = NULL,
    Manager_Joining_Designation = as.numeric(as.factor(Manager_Joining_Designation)),
    Manager_Current_Designation = as.numeric(as.factor(Manager_Current_Designation)),
    Manager_Grade = NULL,
    Manager_Status = NULL,
    Manager_Gender = NULL,
    Manager_Age = as.numeric(as.Date("2016-01-01") - as.Date(Manager_DoB, "%m/%d/%Y"))/365,
    Manager_DoB = NULL,
    Manager_Num_Application = as.numeric(Manager_Num_Application),
    Manager_Num_Coded = as.numeric(Manager_Num_Coded),
    Manager_Business = as.numeric(Manager_Business),
    Manager_Num_Products = as.numeric(Manager_Num_Products),
    Manager_Business2 = NULL,
    Manager_Num_Products2 = NULL)]

Please tell me why I am getting this error. Could anyone who is an expert in R please help me out?