Sharing the approach for MiniHack-Time Series Problem!

hackathon
time_series

#1

Hi all top performers,

I request you all to share the approach or strategy you used during the competition, or any insights you have gained from it since. It would be a great learning opportunity for the rest of us.

Sir SRK has already shared his approach, so we can discuss that as well.

Thanks!

Happy Learning!


#2

Hi top performers, please share your approach.


#3

Hi,

Here is my approach to the mini datahack time series problem, which scored around 226 on the private leaderboard (rank 2).

I started with data exploration and noticed right off the bat that all the counts were even numbers. Moreover, we had hourly observations whose values increased steadily over the whole dataset, so there is a trend. However, the maximum value was not at the tail of the dataset.

To improve accuracy, I divided the count by two and multiplied my predictions back by 2 at the end of the prediction step. I first fed the halved count to an ARIMA function, but that initial solution was nowhere near the top 50. I then came across a built-in R function, “HoltWinters”, which combines Holt’s and Winters’ time series models. Just the model itself (at its default values) brought me inside the top 50. After that it was all parameter tuning based on these observations:

  1. There appeared to be a seasonality which was multiplicative in nature. Hence the data did not increase linearly.
  2. The count remained at 2 for the first fraction of observations. As such the trend did not start immediately along with the data.
  3. There appeared to be an impact of the day of the week on the data. It seemed to indicate a cyclic behavior in general.
  4. As I had recently become proficient with Winters’ model, I knew the individual equations it uses and tuned the alpha (level), beta (trend) and gamma (seasonality) smoothing parameters as well.
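
For readers who want to see what the three smoothing parameters actually do, here is a minimal sketch of textbook multiplicative Holt-Winters in plain Python. It is not the R `HoltWinters` code used in this post: the initialisation (first season for the level and seasonal factors, first two seasons for the trend) is an assumption, and the function names `holt_winters_multiplicative` and `forecast_even_counts` are made up for illustration. Rounding the halved forecast before doubling is also an assumption, used here only to keep predictions even.

```python
def holt_winters_multiplicative(series, period, alpha, beta, gamma, horizon):
    """Triple exponential smoothing with multiplicative seasonality.

    series must be strictly positive and hold at least two full seasons.
    """
    if len(series) < 2 * period:
        raise ValueError("need at least two full seasons of data")

    # Assumed initialisation: level from the first season's mean, trend from
    # the average per-step change between the first two seasons.
    first = series[:period]
    second = series[period:2 * period]
    level = sum(first) / period
    trend = (sum(second) - sum(first)) / (period * period)
    seasonal = [y / level for y in first]

    for t in range(period, len(series)):
        y = series[t]
        s = seasonal[t % period]
        prev_level = level
        # alpha smooths the level (deseasonalised observation vs. projection).
        level = alpha * (y / s) + (1 - alpha) * (prev_level + trend)
        # beta smooths the trend (change in level).
        trend = beta * (level - prev_level) + (1 - beta) * trend
        # gamma smooths the seasonal factor for this position in the cycle.
        seasonal[t % period] = gamma * (y / level) + (1 - gamma) * s

    n = len(series)
    return [(level + h * trend) * seasonal[(n + h - 1) % period]
            for h in range(1, horizon + 1)]


def forecast_even_counts(counts, period, alpha, beta, gamma, horizon):
    """Model count / 2, then scale back so every prediction is even."""
    halves = [c / 2 for c in counts]
    preds = holt_winters_multiplicative(halves, period, alpha, beta, gamma, horizon)
    return [2 * round(p) for p in preds]
```

Tuning then amounts to trying different `alpha`, `beta` and `gamma` values (each between 0 and 1) and comparing the error on a held-out tail of the series.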

All these observations together took me to the top 10 on the public leaderboard and rank 2 on the private leaderboard.
I believe a few outliers and a bit of approximation accounted for the remaining RMSE, and further tuning could have reduced the error even more.

It seems that XGBoost captures these intricacies by itself and fits a time series well too. However, time series models are no less capable and got me the next best position on the leaderboard, though they require data exploration in a different light.


#4

Hi Madhur,

First of all, thanks for your elaborate reply on how you approached the problem. Everything made perfect sense to me.

Could you also tell us why ARIMA failed and why Holt-Winters, i.e. ets(), worked better than ARIMA here?

Moreover, XGBoost and regression achieved better accuracy, but from a business point of view (which requires many assumptions to hold), is that the right approach to go for?

Thanks once again!


#5

Hi Rahul,

The analysis of the time series should decide which forecasting model best fits the data set. With sufficient inspection, ARIMA could also be applied to the given data, provided p (autoregressive order), d (degree of differencing) and q (moving-average order) are estimated correctly. However, a plain ARIMA model does not incorporate multiplicative seasonality, which seemed to be the case for this time series. I built my first model with ARIMA and then moved to Holt-Winters because of the multiplicative seasonality.
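
To make "multiplicative seasonality" concrete, here is a small Python sketch on made-up numbers (not the competition data). It shows that the seasonal swing grows with the level in the raw series but becomes constant after a log transform, which is a common workaround for additive models. Nothing in this thread says a log transform was used in the competition; the helper name `seasonal_swings` is hypothetical.

```python
import math


def seasonal_swings(series, period):
    """Max minus min within each complete season."""
    return [max(series[i:i + period]) - min(series[i:i + period])
            for i in range(0, len(series) - period + 1, period)]


# Hypothetical data: the same seasonal factors applied to a rising level.
factors = [0.5, 1.0, 1.5, 1.0]
levels = [10, 20, 40]
series = [lvl * f for lvl in levels for f in factors]

# Raw swings scale with the level: multiplicative seasonality.
raw = seasonal_swings(series, 4)

# After taking logs, every season has the same swing: additive seasonality.
logged = seasonal_swings([math.log(y) for y in series], 4)
```

Here `raw` grows from season to season while every entry of `logged` is the same, because log(level × factor) = log(level) + log(factor).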
XGBoost currently performs very well on problems like this, as the article based on SRK’s solution shows. Time series models can first be used to understand the data, and XGBoost and regression (since this data had a trend) can then be deployed, again with correct parameter estimation, to reduce the error. From a business perspective, if I had really been part of the Unicorn investors group, I would apply more models than just these to get several estimates and set an upper and a lower bound on the forecasts. A range, rather than an exact estimate, is usually more useful in forecasting because it lets us prepare for very low or very high demand (and for events such as strikes).
At the same time, value estimates (such as revenue) would require accurate models that are also interpretable. Whether one uses ARIMA, Holt-Winters, XGBoost or aggregate models, the result should be interpretable. In a nutshell, the right approach depends on your goals and on whether you can justify the choice of model.
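
The forecast range mentioned above can be sketched very simply: take the forecasts of several models and keep the lowest and highest value at each step. This is only one possible way to form such bounds, not a statistical prediction interval; the function name `forecast_range`, the model labels and the numbers are all made up for illustration.

```python
def forecast_range(model_forecasts):
    """Per-step (low, high) bounds across several models' forecasts.

    model_forecasts maps a model label to its list of forecasts;
    every list must cover the same horizon.
    """
    lengths = {len(f) for f in model_forecasts.values()}
    if len(lengths) != 1:
        raise ValueError("need at least one model, all with the same horizon")
    # zip(*...) groups the forecasts of all models step by step.
    return [(min(step), max(step)) for step in zip(*model_forecasts.values())]


# Hypothetical forecasts for the next two periods.
bounds = forecast_range({
    "arima": [100, 120],
    "holt_winters": [90, 130],
    "xgboost": [110, 125],
})
```

With these made-up numbers the first step is bounded by 90 and 110 and the second by 120 and 130, which is the kind of range one could plan capacity around.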


#6

Hi @Madhur_Modi,
One question: based on my understanding of the HoltWinters documentation, gamma should not be used in this case as the frequency of the TS is less than 1.
If I am wrong, could you please point me to some resources to learn about this parameter?

Thanks,
Monica


#7

You probably used the frequency function. I felt that its result was being affected by noise and that the underlying frequency is probably much higher.
You can see this from a plot or by looking through the data manually: there are indeed repeating patterns. Gamma is the smoothing parameter for the seasonal (cyclic) component. You could search for Winters’ model online or read about it in any supply chain or operations book. I personally learned the concepts from Supply Chain Management by Sunil Chopra.
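
Besides plotting, one rough way to check for repeating patterns is to compare the autocorrelation at a few candidate cycle lengths. This is a swapped-in technique, not what was done in the competition, and the helper names `autocorrelation` and `best_period` are hypothetical. On trending data the series should be detrended first (for example by differencing), otherwise the trend dominates every lag.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return 0.0  # a constant series has no pattern to detect
    cov = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return cov / var


def best_period(series, candidates):
    """Candidate cycle length with the highest autocorrelation."""
    return max(candidates, key=lambda lag: autocorrelation(series, lag))
```

For hourly data, natural candidates would be 24 (daily cycle) and 168 (weekly cycle), assuming the series covers several weeks.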


#8

Thank you @Madhur_Modi
Yes, you are right, there is a definite pattern.
To get a score of 141 for the practice problem I had to apply seasonality outside of the predict function.
I wanted to know whether it is possible to capture the hourly and weekday patterns with the HoltWinters function itself.
Please let me know your thoughts.
Thanks,
Monica
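
For anyone wondering what applying seasonality outside the predict step might look like, here is a minimal Python sketch under stated assumptions. It is not Monica's actual code: the helper names `seasonal_indices` and `apply_seasonality` are made up, and the indices are simple ratios to the overall mean, which is only reasonable once the trend has been removed or is small. Using a single cycle of 168 slots (24 hours × 7 days) is one way to fold the hourly and weekday patterns into one seasonal index.

```python
def seasonal_indices(series, period):
    """Mean of each slot in the cycle divided by the overall mean."""
    if len(series) < period:
        raise ValueError("need at least one full cycle of data")
    overall = sum(series) / len(series)
    sums = [0.0] * period
    counts = [0] * period
    for t, y in enumerate(series):
        sums[t % period] += y
        counts[t % period] += 1
    return [(sums[k] / counts[k]) / overall for k in range(period)]


def apply_seasonality(baseline, indices, start):
    """Scale a non-seasonal baseline forecast by the seasonal indices.

    start is the time index of the first forecast step, so the
    right slot of the cycle is used for each step.
    """
    period = len(indices)
    return [b * indices[(start + h) % period] for h, b in enumerate(baseline)]
```

A typical use would be to forecast the level and trend with any non-seasonal model, then call `apply_seasonality(baseline, seasonal_indices(history, 168), len(history))` on hourly data.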


#9

I’ve used the frequency function in other cases, not in the competition, and found that it gives a good approximation of the pattern cycle. For the competition data, however, it returned 1, so I relied on manual exploration only.
I think the best practice is to first use the frequency function to see whether the built-in functions can find the cycle for us, and then use data exploration to try other possibilities. I think that if the data were cleaned (which I didn’t do in the competition due to the time crunch), the built-in function’s result would agree with what the exploration shows.


#10

Hi,

Is it possible for any one of you to share the datasets (training and test) and also the solution to the problem in R?