Predict the duration a vehicle takes to service



I am new to data science and am working on a POC where I have to predict how long a vehicle will take to repair once it is sent to the garage.

I have vehicle details consisting of the following information:

Vehicle Make
Vehicle Registration Date
Vehicle Type
Vehicle Arrival Date
Work Completion Date
Task Details
Work Area Description
Odometer Reading
Each vehicle is worked on for multiple tasks while it is in the garage. A task can be either planned or unplanned.

What additional information would be needed to approach this problem? Please also suggest an approach and an algorithm I could use to work towards a solution.


Hi Anshul,

This is an interesting question. Here is how I’d approach this problem.

  1. Define a dependent variable: the time the vehicle takes to repair. Based on the data you have, I’d take the difference between “Vehicle Arrival Date” and “Work Completion Date” in days, although it could be calculated in hours as well.

  2. Assuming that “Task Details” is free text, I’d group it into a small set of categories based on eyeballing the data.

  3. I’d create a variable that captures the age of the vehicle, for example the difference between “Vehicle Registration Date” and today’s date. Another way to capture vehicle usage as a variable is the ratio of vehicle age to “Odometer Reading”.

  4. A few additional variables that I’d look for:

  • Details of the mechanic who worked on the vehicle (such as experience in years and some kind of rating based on past record)

  • The average waiting time before the vehicle gets in for service

  • The day of the week, which can probably also have an effect on service time
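The variables in steps 1 to 4 could be derived along the following lines. This is only a rough sketch using the Python standard library; in practice pandas would be the usual tool. The record keys, the keyword map, and the odometer unit (km) are all hypothetical. The sketch also makes two assumptions of its own: vehicle age is measured at the arrival date rather than today, and usage is expressed as km per day, which is the inverse of the ratio mentioned in step 3.

```python
from datetime import date

# Hypothetical keyword-to-category map; real categories would come from
# reading through the actual "Task Details" text.
TASK_KEYWORDS = {
    "brake": "brakes",
    "engine": "engine",
    "oil": "engine",
    "tyre": "tyres",
    "tire": "tyres",
}


def categorize_task(task_details: str) -> str:
    """Map free-text task details to a coarse category by keyword match."""
    text = task_details.lower()
    for keyword, category in TASK_KEYWORDS.items():
        if keyword in text:
            return category
    return "other"


def build_features(record: dict) -> dict:
    """Derive the dependent variable and a few predictors from one record."""
    arrival = record["arrival_date"]
    # Assumption: age is measured at arrival, not at today's date.
    age_days = (arrival - record["registration_date"]).days
    return {
        # Step 1: dependent variable, in whole days.
        "repair_days": (record["completion_date"] - arrival).days,
        # Step 2: coarse category for the free-text task description.
        "task_category": categorize_task(record["task_details"]),
        # Step 3: vehicle age and usage (odometer assumed to be in km).
        "vehicle_age_days": age_days,
        "km_per_day": record["odometer_km"] / age_days if age_days > 0 else 0.0,
        # Step 4: day of week of arrival, 0 = Monday ... 6 = Sunday.
        "arrival_weekday": arrival.weekday(),
    }


# Invented record, purely for illustration.
example = {
    "registration_date": date(2015, 1, 1),
    "arrival_date": date(2020, 1, 1),
    "completion_date": date(2020, 1, 4),
    "task_details": "Replace front brake pads",
    "odometer_km": 91300,
}
features = build_features(example)
print(features)
```

Mechanic details and waiting time are left out here because they are not in the data described in the question; they would be added as extra keys once available.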

Once I have these variables in place, I can think about building a model. Since the dependent variable is continuous, I’ll start with linear regression. Based on the model summary, I’ll try to improve the variables that have been used in the model.
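To make the linear regression step concrete, here is a small dependency-free sketch that fits ordinary least squares by solving the normal equations. In practice you would more likely use a library such as scikit-learn or statsmodels, which also gives you the model summary. The feature names and the rows below are invented purely for illustration, and the function assumes the features are not perfectly collinear.

```python
def fit_ols(rows, targets):
    """Fit ordinary least squares by solving (X^T X) b = X^T y.

    Returns [intercept, coef_1, ..., coef_k]. Assumes X^T X is invertible,
    i.e. no feature is an exact linear combination of the others.
    """
    X = [[1.0] + [float(v) for v in row] for row in rows]  # intercept column
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coefs, row):
    """Predict the target for one feature row."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Made-up, noise-free rows: (vehicle_age_years, odometer_thousand_km, task_count)
rows = [(2, 30, 1), (5, 80, 2), (8, 90, 1), (3, 60, 3), (10, 150, 2), (6, 70, 4)]
repair_days = [3.6, 7.1, 7.8, 6.7, 11.0, 9.4]

coefs = fit_ols(rows, repair_days)
print("intercept and coefficients:", coefs)
print("prediction for a new vehicle:", predict(coefs, (4, 50, 2)))
```

Categorical variables such as the task category or the day of week would need to be turned into indicator (0/1) columns before being passed to a fit like this.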

In the process of improving the model, I’d also do some simple exploratory analysis to figure out whether each variable is worth including. My personal preference, however, is to do this after the first model, since a first model with all variables included gives you a basic idea of what you’re looking at.

Once you’re done, you can start looking at more advanced models such as XGBoost or random forest. However, move on to these models only when you’re sure that regression can no longer improve your predictions.

Hope this helps as a first step.
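One simple form of that exploratory check is to look at how strongly each numeric variable is correlated with the repair time. Below is a minimal sketch of the Pearson correlation in plain Python; the column names and values are invented for illustration. A low correlation does not by itself prove a variable is useless, since the relationship may be non-linear.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up columns, purely for illustration.
repair_days = [3, 7, 8, 6, 11, 9]
candidates = {
    "vehicle_age_years": [2, 5, 8, 3, 10, 6],
    "arrival_weekday": [0, 3, 1, 4, 2, 0],
}

for name, values in candidates.items():
    print(f"{name}: r = {pearson(values, repair_days):.2f}")
```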
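To judge whether a more advanced model is really an improvement, both models need to be scored on the same held-out data. Here is a rough sketch of such a comparison using mean absolute error. The predictions are invented, and the 5% minimum gain is an arbitrary threshold chosen only for illustration.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def is_worth_switching(actual, baseline_pred, candidate_pred, min_gain=0.05):
    """True if the candidate cuts the baseline's MAE by at least min_gain.

    min_gain is a relative reduction (0.05 = 5%) and is an arbitrary choice.
    """
    baseline_mae = mean_absolute_error(actual, baseline_pred)
    candidate_mae = mean_absolute_error(actual, candidate_pred)
    return candidate_mae <= baseline_mae * (1 - min_gain)


# Made-up held-out repair times (days) and predictions from two models.
actual_days = [3, 5, 2, 8]
regression_pred = [4, 4, 3, 6]
advanced_pred = [3, 5, 3, 7]

print("regression MAE:", mean_absolute_error(actual_days, regression_pred))
print("advanced MAE:", mean_absolute_error(actual_days, advanced_pred))
print("switch?", is_worth_switching(actual_days, regression_pred, advanced_pred))
```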