Generating a multivariate time series model with some of the underlying trends

Hi all

I have a dataset covering 3 months of entries, aggregated at the daily level, for which I am trying to build a multivariate time series model. The data has 8 variables and I need to predict all of them, so I thought a multivariate time series model would be the best fit.

I tried following the article at Analytics Vidhya, but there are a few things I do not understand. I am new to modelling and still learning.

  1. The predictions are way off the validation set. The month-wise pattern is not followed: the predicted values just continue the increasing trend and do not take into account that a new month is starting.
  2. I used the Johansen test (coint_johansen) to check stationarity, but I think it is not the only test that should be used.

Can anyone advise me on the best way to model this kind of trend?
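
One possible way to make the model aware that a new month is starting, sketched here under assumptions and not as a recommendation for this dataset: estimate an average profile per day of month, subtract it before fitting, and add it back onto the forecasts. With only three months of data each day of month has about three observations, so the profile will be noisy. The data below is synthetic, and the names `bookings`, `day_of_month_profile`, `remove_profile` and `add_profile` are hypothetical.

```python
import numpy as np
import pandas as pd


def day_of_month_profile(frame):
    """Mean of each column for every day of month present in the data."""
    return frame.groupby(frame.index.day).mean()


def remove_profile(frame, profile):
    """Subtract the day-of-month mean from each row."""
    return frame - profile.reindex(frame.index.day).to_numpy()


def add_profile(frame, profile):
    """Add the day-of-month mean back, e.g. onto forecasts."""
    return frame + profile.reindex(frame.index.day).to_numpy()


# Synthetic stand-in: a made-up column that climbs within a month and resets.
idx = pd.date_range("2019-01-01", periods=90, freq="D")
rng = np.random.default_rng(1)
demo = pd.DataFrame(
    {"bookings": idx.day.to_numpy() * 10.0 + rng.normal(size=90)},
    index=idx,
)

profile = day_of_month_profile(demo)
residual = remove_profile(demo, profile)   # the series a model would be fitted on
restored = add_profile(residual, profile)  # the same step would apply to forecasts
print(residual.describe())
```

The residual no longer contains the within-month ramp, so a model fitted on it does not have to learn the reset at the start of each month. Whether this helps on the real data would need to be checked against the validation set.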

Problem statement - Predict the variables for the next 15 days based on the last 3 months of data (values aggregated on a daily basis).

Dataset -
FinalDataset_model copy.csv (5.0 KB)

Code -

import pandas as pd
# import matplotlib.pyplot as plt
from statsmodels.tsa.vector_ar.var_model import VAR
from statsmodels.tsa.vector_ar.vecm import coint_johansen
import numpy as np
from sklearn.metrics import mean_squared_error

#read the data
df = pd.read_csv("FinalDataset_to_Model.csv")

#check the dtypes

df['Flight_Date'] = pd.to_datetime(df.Flight_Date , format = '%d/%m/%y')
data = df.drop(['Flight_Date'], axis=1)
data.index = df.Flight_Date

#since the test works for only 12 variables, I have randomly dropped
#in the next iteration, I would drop another and check the eigenvalues
johan_test_temp = data
res = coint_johansen(johan_test_temp,-1,1).eig

#creating the train and validation set
train = data[:int(0.8*(len(data)))]
valid = data[int(0.8*(len(data))):]

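#fit the model
model = VAR(endog=train, freq=train.index.inferred_freq)
model_fit = model.fit()

# make prediction on validation, starting from the last k_ar training rows
# (avoids relying on model_fit.y, which may be unavailable in newer statsmodels versions)
prediction = model_fit.forecast(train.values[-model_fit.k_ar:], steps=len(valid))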

cols = data.columns

#convert the forecast array to a dataframe aligned with the validation set
#(all columns are filled, not only the first 4)
pred = pd.DataFrame(prediction, index=valid.index, columns=cols)

#check rmse for each variable (column), not for each row
for col in cols:
    print("\n\npred - ", pred[col])
    print("valid - ", valid[col])
    print('rmse value for', col, 'is : ', np.sqrt(mean_squared_error(valid[col], pred[col])))

Can anyone advise on this?


Hi, can anyone help me with this? Any help would be appreciated.

© Copyright 2013-2019 Analytics Vidhya