I am predicting the number of vehicles at 4 traffic junctions.
I have the following columns in my dataset:
At first glance this may look like a time series regression problem, but the data as given seems to fit a linear regression problem.
So I applied linear regression as follows:
I used get_dummies extensively on all the columns, creating dummy variables for the 31 days of the month, 24 hours, 7 days of the week and 4 junction IDs.
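To make the encoding step concrete, here is a minimal sketch on a toy frame. The column names (`DateTime`, `Junction`, `Vehicles`) and the sample rows are assumptions made up for illustration and may not match the real dataset; only `pd.get_dummies` and the four encoded features come from the description above.

```python
import pandas as pd

# Hypothetical toy data: column names and values are assumptions,
# not the real dataset.
df = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2017-01-01 00:00",
        "2017-01-01 01:00",
        "2017-01-02 13:00",
        "2017-01-15 23:00",
    ]),
    "Junction": [1, 2, 3, 4],
    "Vehicles": [15, 6, 9, 3],
})

# Derive the calendar features from the timestamp.
features = pd.DataFrame({
    "day": df["DateTime"].dt.day,            # day of month, 1..31
    "hour": df["DateTime"].dt.hour,          # hour of day, 0..23
    "weekday": df["DateTime"].dt.dayofweek,  # Monday=0 .. Sunday=6
    "junction": df["Junction"],              # junction ID
})

# One indicator column per distinct value of each feature.
train_data = pd.get_dummies(
    features, columns=["day", "hour", "weekday", "junction"]
)
train_vehicles = df["Vehicles"]

print(train_data.shape)
```

If every day, hour, weekday and junction value occurs in the full data, this encoding would produce 31 + 24 + 7 + 4 = 66 indicator columns. In each group the indicators sum to 1 for every row, so together with an intercept the columns are linearly dependent; passing `drop_first=True` to `pd.get_dummies` is one way to avoid that.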
Then I fitted a linear regression model as follows:
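    import math

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # Hold out part of the data for evaluation (default split is 75% / 25%)
    x_train, x_test, y_train, y_test = train_test_split(train_data, train_vehicles)

    # clf was not defined in the snippet; a plain LinearRegression is assumed
    clf = LinearRegression()
    clf.fit(x_train, y_train)

    pred = clf.predict(x_test)
    pred.shape  # got result as (12030,)

    # Round each prediction up to a whole number of vehicles
    result = [math.ceil(x) for x in pred]

    # RMSE on the held-out set
    score = mean_squared_error(y_test, result)
    rmse = math.sqrt(score)
    print('RMSE is :', rmse)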
I get an RMSE of 10.636853077462394.
My questions are:
Since the RMSE is on the lower side, can I say this model is decent?
Is there any other approach I can use on this dataset?
Do I need to check for collinearity?
How can I check whether multiple variables are interrelated?
Should I try non-linear regression on this dataset?