Hi Anshul,
This is an interesting question. Here is how I’d have approached this problem.
- Define the dependent variable: the time the vehicle takes to repair. Based on the data you provided, I’d take the difference between “Vehicle Arrival Date” and “Work completion date” in days, though it could be calculated in hours as well.
- Assuming “Task details” is free text, I’d group it into a small set of categories after eyeballing the data.
- I’d create a variable that captures the age of the vehicle, probably the difference between the “Vehicle registration date” and today’s date (or the arrival date, if you want the age at the time of service). Another way to capture vehicle usage is a ratio between vehicle age and odometer reading, for example distance driven per year.
- A few additional variables I’d look for:
  - Details of the mechanic who worked on the vehicle (such as experience in years and some kind of rating based on past record).
  - Average waiting time before the vehicle gets in for service.
  - Day of the week, which can probably also affect service time.
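To make the variable ideas above concrete, here is a rough sketch in plain Python. The record, its field names, and the keyword map are all made up for illustration; your real columns and categories will differ, and with a full dataset you would likely do the same thing in pandas.

```python
from datetime import datetime

# Hypothetical service record; the field names loosely mirror the columns
# mentioned above and the values are invented for illustration.
record = {
    "arrival": "2023-03-06 09:30",
    "completion": "2023-03-08 15:30",
    "registration": "2018-03-06",
    "odometer_km": 60000,
    "task_details": "Replace front brake pads and check oil leak",
}

# Hypothetical keyword-to-category map, the kind you might build after
# eyeballing the free-text task details.
TASK_KEYWORDS = {"brake": "brakes", "oil": "engine", "tyre": "tyres"}


def categorize_task(text):
    """Return the sorted categories whose keyword appears in the text."""
    lowered = text.lower()
    found = {cat for kw, cat in TASK_KEYWORDS.items() if kw in lowered}
    return sorted(found) or ["other"]


def build_features(rec):
    """Derive the target and a few candidate predictors from one record."""
    arrival = datetime.strptime(rec["arrival"], "%Y-%m-%d %H:%M")
    completion = datetime.strptime(rec["completion"], "%Y-%m-%d %H:%M")
    registration = datetime.strptime(rec["registration"], "%Y-%m-%d")

    # Dependent variable: repair time, available in hours or days.
    repair_hours = (completion - arrival).total_seconds() / 3600
    # Vehicle age measured at arrival (an assumption; today's date also works).
    age_years = (arrival - registration).days / 365.25
    return {
        "repair_hours": repair_hours,
        "repair_days": repair_hours / 24,
        "vehicle_age_years": age_years,
        "km_per_year": rec["odometer_km"] / age_years,
        "arrival_weekday": arrival.weekday(),  # 0 = Monday ... 6 = Sunday
        "task_categories": categorize_task(rec["task_details"]),
    }


features = build_features(record)
print(features)
```

Mechanic experience and average waiting time would come from other tables, so they are left out of this sketch.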
Once I have these variables in place, I can think about building a model. Since my dependent variable is continuous, I’ll start with linear regression. Based on the model summary, I’ll try to improve the variables used in the model.
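As a minimal sketch of that first regression, here is an ordinary least squares fit with NumPy. The data is synthetic and the coefficients used to generate it are my own assumptions, purely so the example runs; a package such as statsmodels would give you a fuller model summary than the R-squared computed here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for two engineered features (made-up data).
vehicle_age = rng.uniform(0, 15, n)    # years
mechanic_exp = rng.uniform(1, 20, n)   # years of experience

# Assumed relationship, used only to generate toy repair times in days.
repair_days = 1.0 + 0.3 * vehicle_age - 0.05 * mechanic_exp + rng.normal(0, 0.5, n)

# Design matrix with an intercept column, then the least squares fit.
X = np.column_stack([np.ones(n), vehicle_age, mechanic_exp])
coef, *_ = np.linalg.lstsq(X, repair_days, rcond=None)

# R-squared: share of the variance in repair time explained by the model.
pred = X @ coef
ss_res = np.sum((repair_days - pred) ** 2)
ss_tot = np.sum((repair_days - repair_days.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print("intercept, age, experience:", coef)
print("R^2:", r_squared)
```

The fitted coefficients should land close to the values used to generate the data, which is a quick sanity check that the pipeline is wired correctly before you point it at real records.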
In the process of improving the model, I’d also do some simple exploratory analysis to figure out whether each variable is worth including. My personal preference is to do this after the first model, since a first model with all variables included gives you a basic idea of what you’re looking at.
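One simple exploratory check is the correlation of each candidate variable with repair time. The numbers below are toy values I made up, and the variable names are hypothetical. Keep in mind that correlation only picks up linear relationships, so a low value is a hint, not proof that a variable is useless.

```python
import numpy as np

# Made-up toy values for six repair jobs.
repair_days = np.array([1.0, 2.0, 2.5, 4.0, 5.0, 6.5])
candidates = {
    "vehicle_age_years": np.array([1.0, 3.0, 5.0, 7.0, 9.0, 12.0]),
    "waiting_hours": np.array([5.0, 1.0, 4.0, 2.0, 5.0, 3.0]),
}

# Pearson correlation of each candidate with the dependent variable.
correlations = {
    name: float(np.corrcoef(values, repair_days)[0, 1])
    for name, values in candidates.items()
}

for name, r in correlations.items():
    print(f"{name}: r = {r:.2f}")
```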
Once you’re done, you can start looking at more advanced models such as XGBoost or random forest. However, move on to these only when you’re dead sure that regression can no longer improve your predictions.
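One way to judge whether a more complex model is worth it is to compare cross-validated error against the regression baseline. This is a sketch using scikit-learn on synthetic data (the features and the generating formula are assumptions of mine); xgboost's `XGBRegressor` could be passed to the same helper if you have that package installed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 300

# Synthetic features: vehicle age and mechanic experience (made-up data).
X = np.column_stack([rng.uniform(0, 15, n), rng.uniform(1, 20, n)])
# Assumed linear relationship plus noise, used only to generate toy targets.
y = 1.0 + 0.3 * X[:, 0] - 0.05 * X[:, 1] + rng.normal(0, 0.5, n)


def cv_mae(model):
    """Mean absolute error in days, averaged over 5 cross-validation folds."""
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    return -scores.mean()


mae_linear = cv_mae(LinearRegression())
mae_forest = cv_mae(RandomForestRegressor(n_estimators=100, random_state=0))

print(f"linear regression MAE: {mae_linear:.3f}")
print(f"random forest MAE:     {mae_forest:.3f}")
```

If the advanced model does not clearly beat the regression on held-out data, the simpler model is usually the better choice since it is easier to explain.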
Hope this helps as a first step.
Peace!