How to get the trend variable in multiple linear regression?

machine_learning

#1

Hi,

I have multivariate time series data on employee absenteeism that I want to analyze using multiple linear regression.

The objective is to find how much loss we can project for each month of 2011 if the same trend of absenteeism continues.

I need a trend variable that can be included in the regression equation, but I don’t know exactly how to obtain it.

Data Set.zip (20.4 KB)

Your help will be much appreciated!

Thanks in advance


#2

Hi @lakshveer

This is not a time series problem, since the data is not collected at fixed time intervals (the data is not time dependent). Group the data by the month column so that you have two columns: month and absenteeism in hours.

Sr.No. Month   Absenteeism (hour)
 1       7         119
 2       8         132
 3       9

… and so on

With this data, you will be able to predict the absenteeism (hours) for each month in 2011. You can then add up the monthly values to get the total hours for the year.
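As a rough illustration of the grouping step described above, here is a minimal Python sketch. The `records` list and the `monthly_totals` helper are made up for the example; the real data set would be read from the attached file, and its column names may differ.

```python
from collections import defaultdict

# Hypothetical absence records as (month, absenteeism_hours) pairs.
# These numbers are invented for illustration, not taken from the data set.
records = [(7, 4), (7, 8), (8, 2), (8, 16), (9, 3)]

def monthly_totals(rows):
    """Sum absenteeism hours per month and return them ordered by month."""
    totals = defaultdict(int)
    for month, hours in rows:
        totals[month] += hours
    return dict(sorted(totals.items()))

totals = monthly_totals(records)
print(totals)                 # one total per month
print(sum(totals.values()))   # total hours over all months
```

The resulting month-to-hours mapping is the two-column table sketched above, and summing its values gives the yearly total.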


#3

Hi @AishwaryaSingh

Thanks for the reply!
Apart from adding the month column, should I not also include a trend variable when forecasting absenteeism hours? It would take the trends in the other variables into consideration, which seems more logical to me, since using only the month can neglect the information provided by the other variables.
I briefly went through the book Forecasting: Principles and Practice, where I found some information about the trend variable. The specific page is https://www.otexts.org/fpp/5/2, section "Example: Australian quarterly beer production".

However, I don’t know exactly how this variable is calculated. Do you have any idea how it is computed?

Many thanks!


#4

Hi Lakshveer,

I am currently working on the same data set. Could you please help me with a few queries?
I need to know how you treated the missing values in the target variable “absenteeism”.
Did you use linear regression for that? How accurate was that model?
Also, did you apply any technique to reduce the number of levels in the variable “reasons of absence”?

Your help is much appreciated.


#5

Hi @ashishsharma93,

The data set I had did not have any missing values in the target variable. You can use mean or kNN imputation to impute missing values.
I used MAPE and RMSE to evaluate the performance of the model. I was getting around 27% MAPE and an RMSE of around 30.
Training the model on the reduced levels of reason of absence doesn’t make much sense. However, you can group them into categories to get some insight into how the reason of absence relates to the number of absenteeism hours.
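For reference, the two metrics mentioned in this post (MAPE and RMSE) can be computed as in the sketch below. This is only a plain-Python illustration of the standard definitions; the `actual` and `predicted` values are invented and are not the results from the model discussed here.

```python
import math

def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Undefined (division by zero) if any actual value is 0.
    """
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Invented monthly hours, purely for illustration.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

print(mape(actual, predicted))
print(rmse(actual, predicted))
```

Note that MAPE is a relative error (a percentage) while RMSE is in hours, so the two numbers are not directly comparable.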


#6

Thanks for the info, @lakshveer.

But I looked at the data set you attached in this thread, and it does contain missing values. Please check.

Also, what approach did you use for the problem statement: how much loss can we project for every month in 2011 if the same trend of absenteeism continues?


#7

Hi @ashishsharma93,

Ah, yes, I forgot: there were a few missing values in the target variable. I used kNN imputation to impute those.

You need to group total absenteeism hours by month and then forecast for the future months.
You can use tslm() in R.
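On the original question of how the trend variable is obtained: in the linear-trend regression described in Forecasting: Principles and Practice, the trend predictor is simply the time index t = 1, 2, ..., T, and tslm() (from the R forecast package) builds it when `trend` appears in the formula. The sketch below shows the same idea in plain Python for a trend-only model. The function names and the monthly values are made up for illustration; a full model could add seasonal dummies or other predictors as extra columns.

```python
def fit_linear_trend(y):
    """Fit y_t = b0 + b1 * t by ordinary least squares, with t = 1..T.

    Requires at least two observations.
    """
    n = len(y)
    t = list(range(1, n + 1))  # the trend variable: 1, 2, ..., T
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    b1 = sxy / sxx              # slope: change in y per period
    b0 = y_mean - b1 * t_mean   # intercept
    return b0, b1

def forecast_trend(b0, b1, n_obs, horizon):
    """Extend the fitted line to periods T+1, ..., T+horizon."""
    return [b0 + b1 * t for t in range(n_obs + 1, n_obs + horizon + 1)]

# Invented monthly absenteeism totals, purely for illustration.
monthly_hours = [10.0, 12.0, 14.0, 16.0]
b0, b1 = fit_linear_trend(monthly_hours)
print(forecast_trend(b0, b1, len(monthly_hours), 2))
```

Forecasting future months then amounts to plugging the next values of t into the fitted equation.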