How to use VARMAX with categorical, multivariate time series



I have two different time series:

```
timestamp  location
```

location is lat/long/floor – these are “significant location changes” from a mobile device, and not every few meters or anything


```
timestamp  app_name  metric (0-1)
```

these apps will be things like dieting apps

I’d like to correlate the two, to see if any location changes (or lack thereof) are predictive of improved metrics in apps.

I know that later, I will be comparing two RNNs, LSTM and ESN, to see if trying to build out a well-tuned LSTM is worth it… that is later. For now, I need to simply get a statistical (classical ML) baseline – like with VARMAX.

I have generated mock data: several thousand rows for three apps and three users over about a year of use. It is designed to have positive and negative correlations, both for metrics with app-location pairs and for the use/disuse of apps leading to a lack of metrics. There is a decay function in these apps, so disuse trends down, but the general trend is strongly positive. The data actually uses handy names rather than lat/long/floor location data objects.
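To give a feel for the decay idea, here's a rough sketch of the kind of generator I mean — all names, probabilities, and constants below are made up for illustration, not my actual generator:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical: daily metric for one user/app pair, trending up while
# the app is used, decaying while it is unused.
days = pd.date_range("2023-01-01", periods=365, freq="D")
metric = np.empty(len(days))
value = 0.5
for i in range(len(days)):
    used = rng.random() < 0.8                        # app used ~80% of days
    if used:
        value = min(1.0, value + 0.01 + rng.normal(0, 0.02))
    else:
        value = max(0.0, value * 0.95)               # decay on disuse
    metric[i] = value

df = pd.DataFrame({"ts": days, "user": "user_1", "app": "app_1", "metric": metric})
```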

I have loaded these in a Jupyter notebook and generated crosstab rows to convert location categories to columns, so I have rows like:

```
ts           user    home  work  relatives
<timestamp>  user_1  0     0     1
```

and in another DF:

```
ts           user    app    metric
<timestamp>  user_1  app_1  0.3
```
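For concreteness, the one-hot step I did is roughly this (a tiny made-up sample; the location names are just from my mock data):

```python
import pandas as pd

# Hypothetical mini-example of the "significant location change" events
loc = pd.DataFrame({
    "ts": pd.to_datetime(["2023-01-01 08:00", "2023-01-01 09:00", "2023-01-01 18:00"]),
    "user": ["user_1"] * 3,
    "location": ["home", "work", "relatives"],
})

# One-hot the location category into 0/1 columns, keeping ts/user
loc_wide = pd.get_dummies(loc, columns=["location"], prefix="", prefix_sep="", dtype=int)

# The app metrics stay long-form for now
metrics = pd.DataFrame({
    "ts": pd.to_datetime(["2023-01-01 12:00"]),
    "user": ["user_1"],
    "app": ["app_1"],
    "metric": [0.3],
})
```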

With VARMAX itself, I have only written a "hello world" of sorts with statsmodels.tsa.statespace.varmax. With it, there was one measurement for each series at every timestamp. I'm not sure how to do it with this data, though.

The problem I am facing now is data normalization: if I am to put rows in for each timestamp in both data sets, then I need to normalize the metric data with a rolling average or something; no big deal there. But on the other side is 1/0 categorical data. How do I "smooth" that out?
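By "rolling average" on the metric side I mean something like this (the window size is arbitrary, and the series is fake):

```python
import numpy as np
import pandas as pd

# Hypothetical: a metric already resampled to a daily grid for one user/app
metric = pd.Series(
    np.random.default_rng(1).random(30),
    index=pd.date_range("2023-01-01", periods=30, freq="D"),
)

# 7-day rolling mean; min_periods=1 avoids NaNs at the start
smoothed = metric.rolling(window=7, min_periods=1).mean()
```

That part is easy; it's the 0/1 columns where I don't see an equivalent.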

Can anyone explain what I need to do? And am I on the right path here? Or should I be converting the categorical data to a range value (based on the mean of all metric data near it, with some time-based decay… or something)?


Hi @roberto33 , were you able to figure out the solution? I had the same question and could not find an answer.

Also, I am new to MTS and VAR and wanted help with a few other questions.

  1. How do I make a multivariate time series stationary? Do I deal with each series separately? I came across Johansen’s test to check the stationarity.
  2. How do I evaluate the model? If I create a separate validation set, should the RMSE be calculated for each series, or row-wise, or is there another method?
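To make question 2 concrete, the two options I mean are these (with made-up arrays for a 2-series validation set):

```python
import numpy as np

# Hypothetical: validation targets and predictions, shape (n_steps, n_series)
actual = np.array([[0.20, 0.50], [0.30, 0.60], [0.40, 0.40]])
pred   = np.array([[0.25, 0.45], [0.35, 0.55], [0.30, 0.50]])

err = actual - pred

# Option A: one RMSE per series (column-wise)
rmse_per_series = np.sqrt((err ** 2).mean(axis=0))

# Option B: one pooled RMSE across all series and timestamps
rmse_pooled = np.sqrt((err ** 2).mean())
```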


I do have a solution, I think, but I have yet to test how predictive the model will be with the data that I have. I realized that I need to have samples at regular time intervals, across all data points, for the statsmodels ARIMA family. That changed things considerably: location “change” data becomes “current location” data after ffill(), and it shifts from one column to another over time. It is non-stationary at that point; the traditional means of dealing with that is differencing, and there’s an automatic method for differencing. That said, it looks like statsmodels VARMAX is already enforcing stationarity by default:
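The resample/ffill step I described looks roughly like this (interval and names are just from my mock data, and differencing shown as the simple first-difference):

```python
import pandas as pd

# Hypothetical: sparse "significant location change" events for one user
events = pd.DataFrame({
    "ts": pd.to_datetime(["2023-01-01 08:00", "2023-01-02 09:30", "2023-01-04 18:00"]),
    "location": ["home", "work", "home"],
}).set_index("ts")

# One-hot, resample to a regular daily grid, then forward-fill so the
# "change" events become a "current location" indicator over time
current = (
    pd.get_dummies(events["location"], dtype=int)
    .resample("D").last()
    .ffill()
    .astype(int)
)

# First-differencing is the usual fix for the resulting non-stationarity
differenced = current.diff().dropna()
```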

enforce_stationarity [¶](https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.varmax.VARMAX.html#statsmodels.tsa.statespace.varmax.VARMAX.enforce_stationarity)

boolean, optional – Whether or not to transform the AR parameters to enforce stationarity in the autoregressive component of the model. Default is True.

enforce_invertibility

boolean, optional – Whether or not to transform the MA parameters to enforce invertibility in the moving average component of the model. Default is True.

This article has a nice intro to forecasting time series: machinelearningmastery make-predictions-time-series-forecasting-python (make a URL of that; new users can only have 2 links in a post). The first section, on selecting the model to use, uses mean squared error to evaluate. There are also fit summaries available, I believe.

If that model is not predictive (and assuming my data is properly generated to actually be predictive), then I will need to do something to the model or move away from ARIMA. The thing is, I was hoping to use ARIMA to demonstrate that classical machine learning is insufficient, so if I have to use an RNN then… I still kinda accomplish my goal with this. Other options include:

- arXiv:1705.04378v2: an ESN RNN approach (simpler than LSTM to train, less predictive for complex data)
- doi:10.1111/j.1467-9892.2007.00537.x: “logistic smooth-transition regression”, for multivariate, mixed continuous and categorical data. I’d love to find an implementation of this somewhere.