Difference in train and test values

linear_regression
machine_learning
python

#1

Hello,

Below are the shapes of the train and test datasets -

X_train.shape, y_train.shape, X_test.shape
((548, 7), (548,), (548, 6))

After running a Linear Regression, I am getting an error as follows -
ValueError: shapes (548,6) and (7,) not aligned: 6 (dim 1) != 7 (dim 0)

What is the reason behind it, and how can I correct it?


#2

Hello @ASHISH_17

I suppose you have not removed the target variable from your X_train, which leaves it with one more column (7) than X_test (6). Remove that column and try again.

Hope this helps.
Shubham
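
The mismatch described above can be reproduced in a few lines. This is only a sketch: it uses plain NumPy least squares as a stand-in for the Linear Regression model, the data is random and hypothetical, and only the shapes (548 rows, 6 features plus the target) come from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 548 rows, 6 real features, and a target built from them.
features = rng.normal(size=(548, 6))
target = features @ np.arange(1.0, 7.0)

# Mistake: the target column is still inside X_train, so it has 7 columns.
X_train_bad = np.column_stack([features, target])
coef_bad, *_ = np.linalg.lstsq(X_train_bad, target, rcond=None)  # shape (7,)

# X_test only has the 6 real features, so the dot product cannot line up.
X_test = rng.normal(size=(548, 6))
try:
    np.dot(X_test, coef_bad)
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print(err)  # a "shapes ... not aligned" message

# Fix: fit on the 6 feature columns only, then predict on X_test.
X_train = features
coef, *_ = np.linalg.lstsq(X_train, target, rcond=None)
predictions = np.dot(X_test, coef)
print(predictions.shape)
```

The same idea applies with a DataFrame: drop the target column from X_train before fitting.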


#3

Thanks for the reply. I found the mistake.

Can you tell me how to change the ‘object’ dtype to ‘int’?
The column is a combination of a string and an integer, in the form id100001.

Would splitting off the ‘id’ prefix and keeping the integer part work?
Or is there an alternative?

Thanks


#4

Yeah, the best option is to extract the numeric part after ‘id’ and then convert it to the ‘int’ dtype.
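
A minimal sketch of that conversion with pandas, assuming every value follows the id100001 pattern from the post. The sample values and the column name id_num are made up for illustration.

```python
import pandas as pd

# Hypothetical sample in the format described in the thread.
df = pd.DataFrame({"id": ["id100001", "id100002", "id100015"]})
print(df["id"].dtype)  # object

# Pull out the run of digits and convert it to an integer dtype.
df["id_num"] = df["id"].str.extract(r"(\d+)", expand=False).astype("int64")

print(df["id_num"].tolist())
print(df["id_num"].dtype)
```

If some rows do not contain digits, str.extract gives NaN for them and the astype call fails, so those rows would need handling first.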


#5

Hi @shubham.jain,

How to correct this error -

ValueError: shapes (622764,10) and (11,) not aligned: 10 (dim 1) != 11 (dim 0)

I am pasting the features that I have used in the train and test -
feature=['id','vendor_id','passenger_count','pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','mm_pickup','dow_pickup','hh_pickup','distance_km','speed']

X_train=train[features]
y_train=train['trip_duration']

feature_cols=['vendor_id','passenger_count','pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','mm_pickup','dow_pickup','hh_pickup','distance_km']
X_test=test[feature_cols]

trip_duration is the target variable.

Kindly help me get out of this dilemma.

Thanks


#6

@ASHISH_17
The feature and feature_cols lists that you created should contain the same features for modelling. As far as I can see, you have included ‘id’ in feature but not in feature_cols, which is the reason for the error.

So remove ‘id’ from the feature list, as it would not be a useful feature anyway.
Hope this solves your problem.
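
One way to avoid this kind of mismatch is to define a single list and use it for both train and test. The sketch below uses the column names from the post, but the two small DataFrames and their values are hypothetical stand-ins for the real data. Note that the feature list as pasted also contains ‘speed’, which feature_cols lacks, so a shared list guards against that difference too.

```python
import pandas as pd

# One shared list of feature names, taken from the thread ('id' left out).
feature_cols = [
    'vendor_id', 'passenger_count',
    'pickup_longitude', 'pickup_latitude',
    'dropoff_longitude', 'dropoff_latitude',
    'mm_pickup', 'dow_pickup', 'hh_pickup', 'distance_km',
]

# Tiny made-up frames standing in for the real train and test data.
train = pd.DataFrame(
    {col: [1.0, 2.0, 3.0] for col in ['id'] + feature_cols + ['trip_duration']}
)
test = pd.DataFrame({col: [1.0, 2.0] for col in ['id'] + feature_cols})

# Select features with the same list on both sides.
X_train = train[feature_cols]
y_train = train['trip_duration']
X_test = test[feature_cols]

# Cheap guard before fitting: same columns, same order.
assert list(X_train.columns) == list(X_test.columns)
print(X_train.shape, X_test.shape)  # (3, 10) (2, 10)
```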


#7

Hey @shubham.jain, I have removed the ‘id’ variable and successfully got the result.

I wanted to know: can’t we use a different number of features for the train and test sets?


#8

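No, the train and test sets must contain the same features, in the same order, because the fitted model has exactly one coefficient per training feature.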


#9

Thanks a lot for helping me @shubham.jain

I am facing another problem.
I have grouped the pickup datetime by day of the year.
In total there are 180 days, but instead of getting days ranging from 1 to 180, I am getting days from -128 to 127.

Can you tell me what could be the problem?

I have used dt.dayofyear

I was getting the correct order of days, but when I plotted a line plot and compared it with the test data, it showed days from -128 to 127.
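
The range -128 to 127 is exactly the range of a signed 8-bit integer (int8). One possible cause, and this is an assumption since the thread does not show the code, is that the day-of-year column was cast to int8 somewhere, for example by a memory-reduction step. The sketch below uses hypothetical dates to show how such a cast wraps values above 127.

```python
import numpy as np
import pandas as pd

# Hypothetical pickup dates covering 180 consecutive days.
dates = pd.Series(pd.date_range("2016-01-01", periods=180, freq="D"))
day = dates.dt.dayofyear  # values 1..180

# Assumed culprit: a cast to int8, which can only hold -128..127.
# Values above 127 wrap around silently (128 becomes -128, 180 becomes -76).
wrapped = day.to_numpy().astype(np.int8)
print(wrapped.min(), wrapped.max())

# A wider integer type keeps the original range.
safe = day.astype("int16")
print(safe.min(), safe.max())
```

Checking the dtype of the column in both train and test would confirm or rule out this explanation.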