Time Series Forecasting with Categorical Variables

I am not able to provide the exact values of the dataset due to data privacy issues.

The variables I am using in my dataset are: Year (2011 to 2014), X (Categorical Variable), Y (Response Variable).
My goal is to forecast Y of the entire dataset and then pass the X as a parameter to get the forecast for that categorical variable…

Can you please suggest which methods would be ideal for this scenario? Any help in this matter would be highly appreciated.

Can you please explain the problem a bit more? Maybe you can give an example. I’m not able to get it with the information provided.

@Aarshay, I am trying to forecast the Infant Mortality Rate (IMR) of a state from the last 4 years of data.

The variables in the dataset are: IMR (Response Variable), Year and District (Categorical Variable).

I am trying to forecast the IMR with respect to time and district for the entire data and then find the forecast of a particular district by passing that district as a parameter.

Please let me know if any additional information is required in this regard.


Thanks for providing the additional information. Just for clarification, I’ll rephrase what I understood.

If there are say 10 districts, then you’ll have 10x4=40 observations in the data. Even if there are more districts, we have only 4 time-periods. I think this data is too small for making a time series model like ARIMA.

Also, I would approach this problem the opposite way round. I would simply forecast each district using its CAGR (Compound Annual Growth Rate) over the last 4 years and then aggregate the district forecasts to get the overall forecast for the state.

A single district is expected to follow its CAGR more closely than the whole state does. This is because the dynamics (influencing factors) are more uniform within a district than across a state.
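
For illustration, here is a minimal sketch of the CAGR idea in Python. The district names and yearly figures are made up, and the helper names `cagr` and `forecast_next` are hypothetical. It assumes one value per year, so 4 years of data give 3 growth periods.

```python
def cagr(values):
    """Compound annual growth rate over a sequence of yearly values."""
    if len(values) < 2 or values[0] <= 0 or values[-1] <= 0:
        raise ValueError("need at least two positive yearly values")
    periods = len(values) - 1  # 4 yearly values span 3 growth periods
    return (values[-1] / values[0]) ** (1 / periods) - 1


def forecast_next(values, steps=1):
    """Project the last observed value forward at the CAGR."""
    rate = cagr(values)
    return [values[-1] * (1 + rate) ** k for k in range(1, steps + 1)]


# Made-up yearly figures for two hypothetical districts (2011 to 2014).
history = {
    "district_a": [50.0, 48.0, 46.0, 44.0],
    "district_b": [40.0, 40.0, 39.0, 38.0],
}

# One-year-ahead forecast per district.
forecasts = {name: forecast_next(vals)[0] for name, vals in history.items()}
for name, value in forecasts.items():
    print(name, round(value, 2))
```

Because IMR is a rate, the district forecasts would probably need to be weighted (for example by live births) rather than simply summed to get a state-level figure. That weighting is not shown here.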

This is just my thought. Happy to discuss further.


@Aarshay, Thanks for the reply.

The data is monthly. I have 30*4*12 = 1440 observations in the dataset. Please suggest a suitable technique for this scenario.

My apologies, I should have given more details in my previous email.


No worries…

In this case, you have 12*4=48 observations for each district. As far as I know, you can make a time series model for 1 district at a time and project each one individually. If you take all of them together, it becomes more of a predictive modeling (regression) problem. So there are 2 options:

  1. Make an ARIMA or some suitable time-series forecast for each district and then combine the data at state level. You should choose this if you have both seasonality and trend in your data.
  2. If the trend (increasing or decreasing) isn’t much, then you can model it as a predictive modeling problem by creating features of month, district and state (if you have multiple states). This approach will be able to model the behaviour of different months but won’t be able to capture the overall trend if any.
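
Option 2 can be sketched roughly as follows. This is only an illustration on synthetic data: the function name `fit_additive` and the districts "A" and "B" are hypothetical. It assumes a balanced panel (every district observed in every month), in which case a least-squares fit with additive district and month dummy features reduces to grand mean + district effect + month effect. Note that there is no trend term, which matches the limitation mentioned in option 2.

```python
from collections import defaultdict
from statistics import mean


def fit_additive(rows):
    """Fit y ~ district + month on a balanced panel.

    rows: iterable of (district, month, y) tuples.
    Returns a predict(district, month) function.
    """
    rows = list(rows)
    grand = mean(y for _, _, y in rows)

    by_district = defaultdict(list)
    by_month = defaultdict(list)
    for district, month, y in rows:
        by_district[district].append(y)
        by_month[month].append(y)

    # Effects are deviations of each group mean from the grand mean.
    district_effect = {d: mean(v) - grand for d, v in by_district.items()}
    month_effect = {m: mean(v) - grand for m, v in by_month.items()}

    def predict(district, month):
        return grand + district_effect[district] + month_effect[month]

    return predict


# Synthetic data: 2 hypothetical districts, 4 years of monthly values,
# built as district level + a seasonal offset, with no trend and no noise.
levels = {"A": 40.0, "B": 50.0}
rows = [
    (district, month, level + (month - 6.5))
    for district, level in levels.items()
    for _year in range(2011, 2015)
    for month in range(1, 13)
]

predict = fit_additive(rows)
print(predict("A", 1), predict("B", 12))
```

For option 1, the same per-district loop would instead fit a time-series model (ARIMA or similar) to each district's 48 monthly values.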

Please note that I’m no expert in time-series analysis and you should discuss the approach with others at your firm if you choose to adopt one of these.


Hi @krishnamreddy

If I had this problem, I would first do a decomposition, which gives the trend, seasonal and residual components. Based on the seasonal component, I would decide, as @Aarshay mentioned, whether to use the month or a longer period (for example, a quarter) as a variable, if necessary.
Then go brute force with a linear model or a GAM (quadratic or cubic). I bet you get less than 10% error, and I think you can test this in about an hour. If you are using R, look at stl() for the decomposition.
Hope this helps.
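
To show what the decomposition step produces, here is a minimal Python sketch on a synthetic monthly series. It uses a classical additive decomposition (centered moving average), which is simpler than the loess-based method behind R's stl(), so treat it as an approximation of the idea and not as a replacement. The function name `decompose_additive` is hypothetical, and the code assumes an even period such as 12.

```python
from statistics import mean


def decompose_additive(y, period=12):
    """Classical additive decomposition: y = trend + seasonal + residual."""
    n = len(y)
    if period % 2 != 0 or n < 2 * period:
        raise ValueError("needs an even period and at least two full cycles")
    half = period // 2

    # Trend: centered moving average with half weights on both end points.
    trend = [None] * n
    for t in range(half, n - half):
        window = y[t - half : t + half + 1]
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / period

    # Seasonal: average detrended value per position in the cycle,
    # shifted so the seasonal effects sum to zero over one period.
    detrended = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            detrended[t % period].append(y[t] - trend[t])
    raw = [mean(vals) for vals in detrended]
    offset = mean(raw)
    seasonal = [raw[t % period] - offset for t in range(n)]

    # Residual: what is left where the trend is defined.
    residual = [
        None if trend[t] is None else y[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, residual


# Synthetic series for one hypothetical district: 48 months,
# a slow linear decline plus a repeating zero-sum seasonal pattern.
pattern = [3, 2, 1, 0, -1, -2, -3, -2, -1, 0, 1, 2]
series = [50 - 0.1 * t + pattern[t % 12] for t in range(48)]

trend, seasonal, residual = decompose_additive(series, period=12)
print(round(trend[10], 3), round(seasonal[0], 3))
```

If the seasonal component turns out to be weak, a coarser variable such as quarter may be enough for the linear model or GAM.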

© Copyright 2013-2022 Analytics Vidhya