Setting up the target variable in a classification problem



Hi -
Currently I’m working on a human resources project where I need to predict attrition.
I have the list of active employees as of Sep’17 and the lists of churned employees for each year from 2012 onwards. I need to predict which employees will leave between Oct’17 and Mar’18.

My approach has been:

  1. Tag all employees active as of Sep’17 as active
  2. Tag all employees who churned in 2017 as inactive
  3. Split data into train and validation
  4. Build model and check performance
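
Steps 1 to 3 above could be sketched roughly as follows, using only the Python standard library. The record layout (employee ids plus `(emp_id, exit_date)` pairs), the 0/1 encoding for active/inactive, the 30% validation share, and the helper names `tag_employees` and `split_train_validation` are all assumptions made for illustration; the data is made up.

```python
import random
from datetime import date


def tag_employees(active_ids, exits):
    """Label active employees 0 and employees who exited in 2017 as 1.

    active_ids: ids of employees active as of Sep'17 (hypothetical layout)
    exits: (emp_id, exit_date) pairs for churned employees (hypothetical layout)
    Exits from years other than 2017 are left out of the labelled set.
    """
    labels = {emp_id: 0 for emp_id in active_ids}
    for emp_id, exit_date in exits:
        if exit_date.year == 2017:
            labels[emp_id] = 1
    return labels


def split_train_validation(labels, validation_fraction=0.3, seed=0):
    """Shuffle the labelled ids and split them into train and validation lists."""
    ids = sorted(labels)
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * validation_fraction)
    return ids[n_val:], ids[:n_val]


# Made-up example data
active = ["e1", "e2", "e3"]
exits = [
    ("e4", date(2017, 2, 10)),
    ("e5", date(2016, 11, 3)),
    ("e6", date(2017, 8, 1)),
]

labels = tag_employees(active, exits)
train_ids, val_ids = split_train_validation(labels)
```

With real data, a split that keeps the share of churned employees similar in both sets would probably be preferable to a plain random shuffle.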

Is this approach correct? Are there alternative approaches to predicting employee churn?


The request is to predict which employees will leave between October and March, so you need to frame the problem differently.

The method you describe will predict who may leave, but not when. The target has to take the timing of the exit into account.


Okay. Since I intend to predict all employees who will churn in the next 6 months,
we could tag all exits from Apr’17 to Sep’17 as 1 (have churned) and all other exits (Jan’17 to Mar’17) as active.
In this case, only the people who churned in the last 6 months are tagged as inactive, and the rest, who may have churned earlier, are still tagged as active.
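
A minimal sketch of this window-based tagging, assuming each 2017 exit is stored with its exit date. The names `WINDOW_START`, `WINDOW_END` and `label_exit` and the sample records are hypothetical; only the Apr’17 to Sep’17 window comes from the description above.

```python
from datetime import date

# Six-month window taken from the description: Apr'17 to Sep'17 inclusive
WINDOW_START = date(2017, 4, 1)
WINDOW_END = date(2017, 9, 30)


def label_exit(exit_date, window_start=WINDOW_START, window_end=WINDOW_END):
    """Return 1 if the exit falls inside the window, otherwise 0 (treated as active)."""
    return 1 if window_start <= exit_date <= window_end else 0


# Made-up exit records for 2017
exits_2017 = {
    "e4": date(2017, 2, 10),  # before the window, tagged as active
    "e6": date(2017, 8, 1),   # inside the window, tagged as churned
    "e7": date(2017, 4, 1),   # first day of the window, tagged as churned
}

window_labels = {emp_id: label_exit(d) for emp_id, d in exits_2017.items()}
```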

Here I bring a time interval into the prediction. Is this a better way, or do I need to approach the problem in yet another manner?