How to replace missing observations while modeling?


I am working on the "Bike Sharing Demand" knowledge competition on Kaggle. The training data contains the hourly demand for bikes from the 1st to the 19th of each month for the years 2011 and 2012, but some observations are missing. Ideally there should be 19 × 24 × (12 × 2) = 10944 records, but the training data set has only 10886, i.e. 58 missing observations.

Please advise: should I include these missing observations in the model with 0 demand, or can I ignore them because there are so few?




This is a very common scenario in predictive modeling. Values can be missing for various reasons, and the right treatment depends on why they are missing. Here are a few hypotheses that come to mind:

  1. Can you see any pattern in the days / hours which are missing? Are these public holidays? Particular hours on specific days, for example half days on public holidays? If you can find a pattern like this, then I would assume that the service was shut during those hours and would use 0 demand as the prediction as well.

  2. If you can't find any pattern and the observations are missing completely at random, then you can fill them in several ways. You can assume demand to be the same as in the previous hour, the average of the hour before and after, or the demand at the same time on the previous day / week, and see what works for you.
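Both checks can be sketched in pandas. This is only a rough illustration on toy data, not the competition file: the helper names (`expected_hours`, `missing_hours`, `missing_pattern`, `fill_options`) are made up for this example, and in practice `observed` would be the parsed timestamp column of the training set.

```python
import pandas as pd

HOUR = pd.Timedelta(hours=1)


def expected_hours(start="2011-01-01", end="2012-12-31 23:00", last_day=19):
    """All hourly timestamps that fall on days 1..last_day of each month."""
    full = pd.date_range(start, end, freq=HOUR)
    return full[full.day <= last_day]


def missing_hours(observed, expected):
    """Expected timestamps that never appear in the observed data."""
    return expected.difference(pd.DatetimeIndex(observed))


def missing_pattern(missing):
    """Count the gaps by hour of day and by weekday to look for structure."""
    by_hour = pd.Series(missing.hour).value_counts().sort_index()
    by_weekday = pd.Series(missing.dayofweek).value_counts().sort_index()
    return by_hour, by_weekday


def fill_options(demand):
    """Candidate fills for a Series indexed by expected hours (NaN = missing)."""
    # Carry the previous row forward (this also carries across month gaps).
    prev_hour = demand.ffill()
    # Linear interpolation equals the mean of the hour before and the hour
    # after when a single hour is missing.
    neighbours = demand.interpolate(method="linear")
    # Look up the same hour one day earlier; stays NaN if that hour is absent.
    day_before = demand.reindex(demand.index - pd.Timedelta(days=1)).to_numpy()
    prev_day = demand.fillna(pd.Series(day_before, index=demand.index))
    return pd.DataFrame(
        {"prev_hour": prev_hour, "neighbours": neighbours, "prev_day": prev_day}
    )


# Toy data standing in for the training set: two days with one hour removed.
expected = expected_hours("2011-01-01", "2011-01-02 23:00")
observed = expected.delete(30)  # drops 2011-01-02 06:00
demand = pd.Series(range(len(observed)), index=observed, dtype=float)

gaps = missing_hours(observed, expected)
by_hour, by_weekday = missing_pattern(gaps)
options = fill_options(demand.reindex(expected))
print(gaps)
print(options.loc[gaps])
```

If the counts from `missing_pattern` pile up on particular hours or weekdays, that points towards the first case; if they look evenly spread, you can compare the columns of `fill_options` by validation score and keep whichever works best.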

Overall, this is a small number anyway, so you can also simply ignore the missing values.

Hope this helps