Is linear regression a good fit for this data?

linear_regression
machine_learning
data_science
python

#1

I am predicting the number of vehicles in 4 traffic junctions.

I have the following columns in my dataset:

  1. DateTime
  2. Junction_ID
  3. Number_of_vehicles

At first glance, this may look like a time series regression problem, but the data given seems to suit plain linear regression.

So, I have applied linear regression in the following manner:

  • Used get_dummies extensively for all the columns. I created dummy variables for the 31 days of the month, 24 hours, 7 days of the week and 4 junction IDs.

  • Then applied a linear regression model as follows:

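      import math

      from sklearn.linear_model import LinearRegression
      from sklearn.metrics import mean_squared_error
      from sklearn.model_selection import train_test_split

      # train_data: dummy-encoded features, train_vehicles: Number_of_vehicles
      x_train, x_test, y_train, y_test = train_test_split(train_data, train_vehicles)

      # clf is assumed to be a plain LinearRegression estimator
      clf = LinearRegression()
      clf.fit(x_train, y_train)

      pred = clf.predict(x_test)
      pred.shape  # got result as (12030,)

      # Round each prediction up to a whole number of vehicles
      result = [math.ceil(x) for x in pred]

      score = mean_squared_error(y_test, result)
      rmse = math.sqrt(score)
      print('RMSE is :', rmse)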
    

I am getting an RMSE of 10.636853077462394.
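
For reference, the dummy encoding described above could be built roughly like this. It is only a minimal sketch on synthetic data: the column names follow the dataset description, while the `build_features` helper and the generated timestamps are assumptions made for illustration, not the code actually used.

```python
import numpy as np
import pandas as pd


def build_features(df):
    """One-hot encode day of month, hour, day of week and junction ID."""
    dt = pd.to_datetime(df["DateTime"])
    parts = pd.DataFrame(
        {
            "day": dt.dt.day,
            "hour": dt.dt.hour,
            "weekday": dt.dt.dayofweek,
            "junction": df["Junction_ID"],
        },
        index=df.index,
    )
    # columns= forces dummy encoding even though these columns are integers
    return pd.get_dummies(parts, columns=["day", "hour", "weekday", "junction"])


# Synthetic stand-in for the real data: hourly rows for 4 junctions.
n_hours = 24 * 62  # long enough for all 31 days of the month to appear
stamps = pd.Timestamp("2017-01-01") + pd.to_timedelta(np.arange(n_hours), unit="h")
raw = pd.DataFrame(
    {
        "DateTime": stamps.repeat(4),
        "Junction_ID": np.tile([1, 2, 3, 4], n_hours),
    }
)

features = build_features(raw)
print(features.shape)  # one column per day, hour, weekday and junction level
```

With every level present, this yields 31 + 24 + 7 + 4 = 66 dummy columns.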

My questions are:

  • Since the RMSE is on the lower side, can I say this model is decent?

  • Is there any other approach I can use on this dataset?

  • Do I need to check for collinearity?

  • How can I check whether multiple variables are interrelated?

  • Should I go for non-linear regression on this dataset?


#2

The little bit of knowledge I can share from my experience:

  1. RMSE is always on the same scale as your dependent variable. This means there is no absolute good or bad threshold; you have to define one relative to your dependent variable. So compare the RMSE with the typical size of the target (for example its mean or range) before deciding whether it is low: an RMSE of about 10.6 is small if a junction usually carries hundreds of vehicles, but large if it carries around 20.

  2. I do not have much experience with time series problems, so please do more research on that part of the answer. What I can suggest is to try different regression models, such as polynomial regression and SVR, calculate the RMSE of each model, and then decide which approach is better.

  3. If you have a large number of features, you should definitely check for collinearity, and if some variables are collinear, remove one of them. This usually gives more stable coefficients and can give better predictions.

  4. You can check whether multiple variables are interrelated by computing the correlation between them.

  5. As suggested in point 2, you can also try non-linear regression and compare the RMSE of each model.
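
Points 1, 3 and 4 can be checked in a few lines. What follows is a minimal sketch on synthetic data, not the original dataset: the generated `hour`, `junction` and `vehicles` values are assumptions for illustration. It compares the RMSE against the mean of the target and against a naive baseline, then looks at relationships between the dummy features. Pairwise correlation only catches pairs of related columns, so the sketch also checks the matrix rank, which catches dependencies spread over several columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: hour of day and junction drive the vehicle count.
n = 4000
hour = rng.integers(0, 24, n)
junction = rng.integers(1, 5, n)
vehicles = 20 + 10 * junction + 15 * np.sin(hour / 24 * 2 * np.pi) + rng.normal(0, 5, n)

frame = pd.DataFrame({"hour": hour, "junction": junction})

# Full dummy set: within each group the columns sum to 1 in every row,
# so the groups are linearly dependent on each other.
X_full = pd.get_dummies(frame, columns=["hour", "junction"], dtype=float)
# drop_first=True removes one level per group and breaks that dependency.
X = pd.get_dummies(frame, columns=["hour", "junction"], drop_first=True, dtype=float)

x_train, x_test, y_train, y_test = train_test_split(X, vehicles, random_state=0)
model = LinearRegression().fit(x_train, y_train)

# Point 1: judge RMSE relative to the target's scale and to a naive baseline.
rmse = np.sqrt(mean_squared_error(y_test, model.predict(x_test)))
baseline_pred = np.full(len(y_test), y_train.mean())  # always predict the mean
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))
print("RMSE:", rmse)
print("RMSE / mean of target:", rmse / y_test.mean())
print("Baseline RMSE:", baseline_rmse)

# Point 4: pairwise correlations; values near +1 or -1 flag related pairs.
corr = X_full.corr()
print(corr.iloc[:3, :3])

# Point 3: a rank below the number of columns means exact collinearity.
rank_full = np.linalg.matrix_rank(X_full.to_numpy())
rank_reduced = np.linalg.matrix_rank(X.to_numpy())
print("full dummies:", rank_full, "of", X_full.shape[1], "columns")
print("drop_first:  ", rank_reduced, "of", X.shape[1], "columns")
```

The same comparison loop can be repeated with other estimators (for example SVR) to see whether a non-linear model lowers the RMSE on held-out data.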