Clarification on Employee Absenteeism prediction Kaggle dataset

python

#1

Hello,
What is the best approach to answer the two questions below from the given data, without using time series analysis?

  1. What changes should the company make to reduce absenteeism?
  2. How much loss can we project for each month of 2011 if the same absenteeism trend continues?

In the variable description, column 21, "Absenteeism time in hours", is listed as the target variable.

If I use a regression model, how do I then answer the two questions above?

Can I use K-means clustering here? The dataset contains both continuous and categorical variables, so how should I proceed if I use K-means?

Please let me know an effective way to answer these two questions (I should not use time series analysis here).

The data and other details are available at the URL below.

I have to submit this project, but I don’t know which approach to follow.


#2

Hi @chandrakanth98,

You can extract features from the time column (such as day, month, year, is_weekend, etc.). All of these extracted features will be numerical (integers or binary flags rather than truly continuous values), so you can apply common machine learning algorithms to this data to make predictions.
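A minimal sketch of this kind of feature extraction, using only the standard library. The dates are made up, and the dataset may store time differently (for example as separate month and weekday columns), so treat the field names as assumptions.

```python
from datetime import date

def extract_date_features(d: date) -> dict:
    """Turn one calendar date into simple numeric features."""
    return {
        "day": d.day,
        "month": d.month,
        "year": d.year,
        "day_of_week": d.weekday(),           # Monday=0 ... Sunday=6
        "is_weekend": int(d.weekday() >= 5),  # 1 for Saturday or Sunday
    }

# Hypothetical absence dates, for illustration only.
absence_dates = [date(2010, 7, 5), date(2010, 7, 10)]
features = [extract_date_features(d) for d in absence_dates]

for row in features:
    print(row)
```

Each resulting dictionary can be used as one row of model input alongside the other columns.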

You have to find out the reasons for absenteeism. These could be long working hours, a workplace that is far away, or a particular age group that takes leave often (I have not seen the dataset; these are general assumptions). Create a list of hypotheses and then use the data to validate them (correlation, feature importance, and other data analysis techniques).
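As a rough illustration of checking such hypotheses with correlation, the sketch below ranks candidate features by the absolute Pearson correlation with absence hours. The rows and the column names (`distance_km`, `age`, `absent_hours`) are invented for the example, and a strong correlation only flags a candidate reason; it does not prove a cause.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up rows; column names are hypothetical, not the dataset's own.
records = [
    {"distance_km": 5,  "age": 28, "absent_hours": 2},
    {"distance_km": 10, "age": 45, "absent_hours": 3},
    {"distance_km": 20, "age": 33, "absent_hours": 5},
    {"distance_km": 30, "age": 50, "absent_hours": 8},
    {"distance_km": 40, "age": 38, "absent_hours": 9},
]

target = [r["absent_hours"] for r in records]
correlations = {
    name: pearson([r[name] for r in records], target)
    for name in ("distance_km", "age")
}

# Strongest relationships (positive or negative) first.
ranked = sorted(correlations.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, r in ranked:
    print(f"{name}: r = {r:+.2f}")
```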

Is there a variable in the dataset that tells you how ‘monthly losses’ relate to ‘Absenteeism time in hours’?


#3

Related thread:

@lakshveer and @ashishsharma93 would probably be able to help.


#4

Hi AishwaryaSingh,
Thanks for your reply.

I did some analysis on the data. Please let me know whether the approach below is correct for question 1.

1. What changes should the company make to reduce absenteeism?

First I will divide the data into clusters, and then I will find the average of ‘Absenteeism in Hours’ for each cluster.

For the cluster with the highest average ‘Absenteeism in Hours’, I will analyse the data and try to infer the reason. (If required, I will use PCA to reduce dimensionality.)

If this approach is correct, can I use the K-means clustering algorithm, given that I have both continuous and categorical variables in the data?
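One possible way to handle the mixed data, sketched under assumptions: one-hot encode the categorical columns and z-score the numeric ones so that Euclidean distance is meaningful, run plain K-means, then average the absence hours per cluster. The rows, column names, and the choice of k=2 are invented for illustration, and the centroids are seeded with the first k rows only to keep the example reproducible. An algorithm designed for mixed data, such as k-prototypes, would be an alternative.

```python
from statistics import mean, pstdev

# Made-up rows; column names are hypothetical.
rows = [
    {"age": 25, "distance_km": 5,  "education": "high_school", "absent_hours": 2},
    {"age": 45, "distance_km": 40, "education": "graduate",    "absent_hours": 16},
    {"age": 27, "distance_km": 8,  "education": "high_school", "absent_hours": 3},
    {"age": 50, "distance_km": 35, "education": "graduate",    "absent_hours": 24},
]
numeric_cols = ["age", "distance_km"]
categorical_cols = ["education"]  # the target (absent_hours) is left out of the features

def encode(rows):
    """Z-score the numeric columns and one-hot encode the categorical ones."""
    stats = {}
    for c in numeric_cols:
        values = [r[c] for r in rows]
        stats[c] = (mean(values), pstdev(values) or 1.0)  # guard against zero spread
    levels = {c: sorted({r[c] for r in rows}) for c in categorical_cols}
    vectors = []
    for r in rows:
        vec = [(r[c] - stats[c][0]) / stats[c][1] for c in numeric_cols]
        for c in categorical_cols:
            vec += [1.0 if r[c] == level else 0.0 for level in levels[c]]
        vectors.append(vec)
    return vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, n_iter=20):
    """Plain Lloyd's algorithm, seeded with the first k points."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[j] = [mean(col) for col in zip(*members)]
    return labels

labels = kmeans(encode(rows), k=2)

# Average absence hours per cluster, to spot the high-absence group.
hours_by_cluster = {}
for r, lab in zip(rows, labels):
    hours_by_cluster.setdefault(lab, []).append(r["absent_hours"])
cluster_means = {lab: mean(h) for lab, h in hours_by_cluster.items()}
print(cluster_means)
```

On real data, k would need to be chosen (for example by comparing several values), and the seeding would normally be randomised.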

For the second question:

  2. How much loss can we project for each month of 2011 if the same absenteeism trend continues?

I think ‘loss’ here means ‘Absenteeism in Hours’.

Since I don’t have 2011 test data to predict on, I don’t know which approach to follow.
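One simple option that avoids a time series model, sketched here under assumptions: treat "same trend continues" as "each month of 2011 looks like the historical average for that calendar month". The records and the `HOURLY_COST` constant are hypothetical; the thread does not say how hours translate into monetary loss, so that conversion is only a placeholder.

```python
from collections import defaultdict

# Made-up (year, month, absent_hours) records, for illustration only.
history = [
    (2009, 1, 40), (2010, 1, 60),
    (2009, 2, 30), (2010, 2, 34),
    (2010, 3, 90),
]
HOURLY_COST = 20.0  # hypothetical cost of one absent hour; not given in the thread

# Total absent hours for each (year, month) pair.
totals = defaultdict(float)
for year, month, hours in history:
    totals[(year, month)] += hours

# Collect the monthly totals by calendar month across the observed years.
per_month = defaultdict(list)
for (year, month), total in totals.items():
    per_month[month].append(total)

# Projection for 2011: the historical average total for each calendar month.
projected_hours_2011 = {m: sum(v) / len(v) for m, v in sorted(per_month.items())}
projected_loss_2011 = {m: h * HOURLY_COST for m, h in projected_hours_2011.items()}

for month in projected_hours_2011:
    print(month, projected_hours_2011[month], projected_loss_2011[month])
```

If ‘loss’ is meant to be hours rather than money, `projected_hours_2011` alone is the answer and the cost factor can be dropped.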