I have a dataset covering 3 months, aggregated at the daily level, for which I am trying to build a multivariate time series model. The data has 8 variables, all of which I need to predict, so a multivariate time series model seemed like the best fit.
I followed the article on Analytics Vidhya, but there are a few things I don't understand. I am new to modelling and still learning.
- The predictions are way off from the validation set. The month-wise pattern is not being followed, and the predicted values just follow an increasing trend. The model does not take into account that a new month is starting.
- I used the Johansen test (coint_johansen) to check for stationarity, but I don't think it is the only test I should be using.
Can anyone advise on the best way to model this kind of trend?
Problem statement - Predict the variables for the next 15 days based on the last 3 months of data (values aggregated on a daily basis).
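For a 15-day horizon with a pattern that resets each month, a simple day-of-month baseline can be a useful reference point before trying a VAR. The sketch below is one possible baseline, not a recommended final model: it forecasts each variable as its historical average for the same day of the month. The function name `day_of_month_baseline` and the toy columns `a` and `b` are made up for illustration.

```python
import numpy as np
import pandas as pd


def day_of_month_baseline(history, horizon=15):
    """Forecast each column as its historical mean for the same day of month."""
    # Average profile per calendar day (1..31) over the available months.
    profile = history.groupby(history.index.day).mean()
    future_index = pd.date_range(history.index[-1] + pd.Timedelta(days=1),
                                 periods=horizon, freq="D")
    # Days with no history (e.g. a 31st never seen) fall back to the column mean.
    forecast = profile.reindex(future_index.day).fillna(history.mean())
    forecast.index = future_index
    return forecast


# Toy data (assumption): column "a" grows within a month and resets on the 1st.
idx = pd.date_range("2019-01-01", periods=90, freq="D")  # 1 Jan to 31 Mar 2019
history = pd.DataFrame({"a": idx.day * 10, "b": np.full(90, 5.0)}, index=idx)

baseline = day_of_month_baseline(history, horizon=15)
print(baseline.head())
```

Any model that cannot beat this baseline on the validation set is probably not capturing the monthly reset.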
FinalDataset_model copy.csv (5.0 KB)
import numpy as np
import pandas as pd
# import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.vector_ar.var_model import VAR
from statsmodels.tsa.vector_ar.vecm import coint_johansen

# read the data
df = pd.read_csv("FinalDataset_to_Model.csv")

# check the dtypes
print(df.dtypes)

# use the flight date as the index
df['Flight_Date'] = pd.to_datetime(df.Flight_Date, format='%d/%m/%y')
data = df.drop(['Flight_Date'], axis=1)
data.index = df.Flight_Date

# the Johansen test works for at most 12 variables;
# this dataset has 8, so none are dropped
res = coint_johansen(data, -1, 1).eig
print(res)

# create the train and validation sets (80/20 split in time order)
split = int(0.8 * len(data))
train = data[:split]
valid = data[split:]

# fit the model (default lag order)
model = VAR(endog=train, freq=train.index.inferred_freq)
model_fit = model.fit()

# make predictions on the validation period, starting from the last k_ar training rows
prediction = model_fit.forecast(model_fit.endog[-model_fit.k_ar:], steps=len(valid))

# put the forecasts in a DataFrame aligned with the validation set (all columns)
cols = data.columns
pred = pd.DataFrame(prediction, index=valid.index, columns=cols)

# check rmse per variable (column-wise, not row-wise)
for col in cols:
    print("\n\npred - ", pred[col].values)
    print("valid - ", valid[col].values)
    rmse = np.sqrt(mean_squared_error(valid[col], pred[col]))
    print('rmse value for', col, 'is : ', rmse)
Any advice would be appreciated.