Number of features of the model must match the input

I built a linear regression model with scikit-learn whose inputs include dummy variables (created from categorical variables). I saved the model to a pickle file and loaded it in another session.

# On scikit-learn >= 0.23, sklearn.externals.joblib is gone; use "import joblib" instead.
from sklearn.externals import joblib
loaded_model = joblib.load('LR_Model.pkl')

When performing out-of-sample validation on a new data set (much smaller than the training set), I again transform the categorical variables into dummy variables.

for column in validation_df.columns:
    if validation_df[column].dtype == object:
        dummyCols = pd.get_dummies(validation_df[column])
        validation_df = validation_df.join(dummyCols)
        del validation_df[column]

When I try to predict on the new data set (Y_predicted = loaded_model.predict(validation_df)),
I get the error below (full error in image):


ValueError: shapes (1349,1000) and (2017,2) not aligned: 1000 (dim 1) != 2017 (dim 0)

When I switch from Linear Regression to Extra Trees, the error changes slightly (see attached image).

I know the reason is that the number of columns in the training data does not equal the number of columns in the out-of-sample validation data set (because the dummy variables are created from categorical variables). Is there a solution or workaround that would let the model predict on new data with a different number of columns?

It seems your validation data doesn’t contain all the categories that appear in the training data, so you end up with fewer dummy variables than the model was trained on.
I think you should create the dummy variables before splitting your data into train, validation and test parts.
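
If the model is already trained and pickled, so encoding before the split is no longer an option, one possible workaround is to store the training column list and reindex the encoded validation data onto it. The snippet below is only a sketch under assumptions: the toy frames and the names train_df, train_columns, X_train and X_valid are made up for illustration, and it calls pd.get_dummies on the whole frame (which prefixes the dummy names with the column name) rather than column by column as in the question. The same reindex step should work with either encoding as long as training and validation are encoded the same way.

```python
import pandas as pd

# Toy data (hypothetical): the validation set lacks "green" and "blue"
# and contains "purple", which the training set never saw.
train_df = pd.DataFrame({"color": ["red", "green", "blue", "red"],
                         "size": [1.0, 2.0, 3.0, 4.0]})
validation_df = pd.DataFrame({"color": ["red", "purple"],
                              "size": [5.0, 6.0]})

# Encode the training data and remember its column layout.
# In practice this list could be saved alongside the pickled model.
X_train = pd.get_dummies(train_df)
train_columns = X_train.columns.tolist()

# Encode the validation data, then force it onto the training layout:
# dummy columns missing from validation are added and filled with 0,
# and columns for categories unseen in training are dropped.
X_valid = pd.get_dummies(validation_df)
X_valid = X_valid.reindex(columns=train_columns, fill_value=0)

print(X_train.shape, X_valid.shape)
```

After this step X_valid has the same columns, in the same order, as the data the model was fitted on, so predict no longer fails on the shape. Note that rows with an unseen category (here "purple") end up with all of that variable's dummies set to 0, which may or may not be acceptable for your model.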

© Copyright 2013-2019 Analytics Vidhya