SVR: Predicted values are way off from the actual ones despite high R-squared and low MSE

svm
machine_learning
regression

#1

I am stuck on an SVM regression problem. Please help.

I have trained an SVR model using scikit-learn that predicts the future price of bitcoin from its closing price on previous dates. I converted each date into a delta, in days, from the first available date using the following code:

btc['Date'] = pd.to_datetime(btc['Date'])     
btc['date_delta'] = (btc['Date'] - btc['Date'].min())  / np.timedelta64(1,'D')
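
To show what this conversion produces, here is a minimal sketch on a toy frame. The dates below are made up for illustration and are not from my dataset; only the two conversion lines mirror the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical dates, chosen only to illustrate the conversion
toy = pd.DataFrame({"Date": ["2020-01-01", "2020-01-02", "2020-01-04"]})
toy["Date"] = pd.to_datetime(toy["Date"])

# Days elapsed since the earliest date, as a float
toy["date_delta"] = (toy["Date"] - toy["Date"].min()) / np.timedelta64(1, "D")

print(toy["date_delta"].tolist())  # [0.0, 1.0, 3.0]
```

So the earliest date maps to 0.0 and every later date maps to its offset in days.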

My dataframe’s head looks something like this:

date_delta Close
1654.0 7144.38
1653.0 7022.76

Then I split the data into training and test sets as follows:

msk = np.random.rand(len(btc_select)) < 0.8
btc_train = btc_select[msk]
btc_test = btc_select[~msk]

and apply min-max scaling to both sets before training the model, fitting the scaler on the training set only:

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(btc_train)
btc_train = scaler.transform(btc_train)
btc_test = scaler.transform(btc_test)
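
For context, a minimal sketch of what the fitted scaler stores, using toy numbers rather than my data and assuming the columns are ordered [date_delta, Close]. Each column gets its own min and max, and a value outside the training range maps outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy training data, columns assumed to be [date_delta, Close]
train = np.array([[0.0, 100.0], [50.0, 300.0], [100.0, 500.0]])
scaler = MinMaxScaler().fit(train)

# One min and one max per column, learned from the training rows only
print(scaler.data_min_, scaler.data_max_)

# A date just past the training range scales to slightly above 1
later = scaler.transform([[101.0, 0.0]])
print(later[0, 0])
```

This matches what I see further down, where a date_delta of 1654.0 transforms to a value slightly above 1.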

My model is trained using the following function, and I find that the polynomial kernel gives the best result:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

def predict_prices(dates_train, prices_train, dates_test, price_test):
    # SVR expects a 2-D feature matrix, so reshape the dates into one column
    dates_train = np.reshape(dates_train, (len(dates_train), 1))
    dates_test = np.reshape(dates_test, (len(dates_test), 1))

    # Fit one model per kernel on the scaled training data
    svr_lin = SVR(kernel='linear', C=1e3)
    svr_poly = SVR(kernel='poly', C=1e3, degree=8)
    svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.8)
    svr_lin.fit(dates_train, prices_train)
    svr_poly.fit(dates_train, prices_train)
    svr_rbf.fit(dates_train, prices_train)

    # Plot each model's fit against the training points
    plt.figure(figsize=(14, 10))
    plt.scatter(dates_train, prices_train, color='black', label='Data')
    plt.plot(dates_train, svr_rbf.predict(dates_train), color='red', label='RBF model')
    plt.plot(dates_train, svr_lin.predict(dates_train), color='green', label='Linear model')
    plt.plot(dates_train, svr_poly.predict(dates_train), color='blue', label='Polynomial model')
    plt.xlabel('Date')
    plt.ylabel('Price')
    plt.title('Support Vector Regression')
    plt.legend()
    plt.show()

    # score() returns R-squared on the held-out test set
    print('Lin Score:', svr_lin.score(dates_test, price_test))
    print('Poly Score:', svr_poly.score(dates_test, price_test))
    print('Rbf Score:', svr_rbf.score(dates_test, price_test))

    # 6-fold cross-validation; the scores are negated MSE values, in scaled units
    scores = cross_val_score(svr_poly, dates_train, prices_train, cv=6, scoring='neg_mean_squared_error')
    accuracy = metrics.r2_score(price_test, svr_poly.predict(dates_test))
    print('R-Squared Value for the Polynomial Kernel:', accuracy)
    print('Cross Validation Mean Squared Error for the Polynomial Kernel:', scores)
    return svr_poly

I got the following R-squared and cross-validation scores:

Lin Score: 0.3290332147578777
Poly Score: 0.8724266575682722
Rbf Score: 0.836449334307112
R-Squared Value for the Polynomial Kernel: 0.8724266575682722
Cross Validation Mean Squared Error for the Polynomial Kernel: [-0.13853584 -0.00069995 -0.00043713 -0.00041959 -0.00341142 -0.00352207]

But when I predict the BTC price for a single data point, by transforming its date_delta and then inverse-transforming the predicted output, the result is way off. I need help understanding what is going wrong.

transform_inp = scaler.transform([[1654.0,0.0]])
transform_inp[0,0]
1.000604960677556

# predict() expects a 2-D array of shape (n_samples, n_features)
predicted_val = model.predict(transform_inp[:, [0]])
predicted_val
array([0.73674025])

Now, doing the inverse transform, I get the following:

scaler.inverse_transform([[predicted_val[0],0]])
array([[1217.83164131,   68.43      ]])

The output is 1217 USD, which is way off from the actual price of 7144 USD. Can you please tell me what is wrong here?
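
For reference, a minimal sketch of how `inverse_transform` handles each column, using toy numbers rather than my data and assuming the scaler was fit on columns ordered [date_delta, Close]. It only illustrates the scaler's per-column behaviour and may or may not be the cause of the result above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data, not the real BTC series: columns assumed to be [date_delta, Close]
toy = np.array([[0.0, 100.0], [50.0, 300.0], [100.0, 500.0]])
scaler = MinMaxScaler().fit(toy)

scaled_price = 0.75  # stand-in for a model prediction in scaled units

# Value placed in column 0: rescaled with the date_delta min and max
in_date_col = scaler.inverse_transform([[scaled_price, 0.0]])
# Value placed in column 1: rescaled with the Close min and max
in_price_col = scaler.inverse_transform([[0.0, scaled_price]])

print(in_date_col)   # [[ 75. 100.]]
print(in_price_col)  # [[  0. 400.]]
```

In this toy case the same scaled value comes back as 75 when it sits in the first column and as 400 when it sits in the second, and a 0 in the Close column comes back as the minimum training price.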