Regression and k-fold values in Python

crossvalidation
regression
python

#1

Hello everyone,
I am very new to the ML field and have just started to get my hands dirty by applying some techniques. I am learning by doing and reading up on things as they come up. I would appreciate expert input, as I am stuck on a question I could not find an answer to.

I have created a regression model and was trying to check whether it underfits or overfits by running CV on it in Python (sklearn). This is the first algorithm I have built, so please bear with the novice questions. I have modelled a polynomial relation between x and y using a regression model.
Data set link: http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset#

Below is my code:

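import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import linear_model, metrics
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import cross_val_score

data = pd.read_csv("location")

# Feature: temperature; target: rental count
x = data.temp.values
y = data.cnt.values

# sklearn expects 2-D arrays of shape (n_samples, n_features)
x = x.reshape(len(x), 1)
y = y.reshape(len(y), 1)

# Hold out everything after the first 482 rows as a test set
train_x = x[:482]
test_x = x[482:]

train_y = y[:482]
test_y = y[482:]

regr = linear_model.LinearRegression()
# The model is fitted on the transformed feature x**4.2 ...
clf = regr.fit(train_x**4.2, train_y)
# ... but cross_val_score refits a copy on the untransformed x, one score per fold
scores = cross_val_score(clf, x, y, cv=10)

print('Coefficients: \n', regr.coef_)
# This is the mean of the squared residuals (MSE), computed on untransformed test_x
print("Residual sum of squares: %.2f"
    % np.mean((regr.predict(test_x) - test_y) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(test_x, test_y))
print(scores)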

Below is the output:
Residual sum of squares: 1906498.12
Variance score: 0.26
[-14.82381293 -0.29423447 -13.56067979 -1.6288903 -0.31632439
0.53459687 -1.34069996 -1.61042692 -4.03220519 -0.24332097]

What I need to understand is what these different values mean. In all the code I have seen online, everyone takes the mean of them. Why?

Is this the right approach to validate whether the model is appropriate? Please share your expert input.

Thanks for your help!!

Best


#2

Hi @amandeepsharma89,

Basically, you ran a 10-fold CV, so you get 10 scores (one for each fold). Taking their mean is the correct way to get a single average CV score for the model.
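
To make the per-fold scores concrete, here is a minimal sketch of summarizing them. It uses synthetic data (an assumption, since the bike-sharing CSV is not reproduced here), and the variable names and noise level are illustrative only. For a regressor, cross_val_score reports R^2 by default, so each value is the R^2 on one held-out fold; a negative value means that fold was predicted worse than a constant equal to the fold's mean target would predict it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data (assumption): y is linear in x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(200, 1))
y = 3.0 * x[:, 0] + rng.normal(0.0, 0.1, size=200)

# cv=10 splits the data into 10 folds; each fold is held out once,
# so one score (R^2 by default for a regressor) comes back per fold
scores = cross_val_score(LinearRegression(), x, y, cv=10)

mean_score = scores.mean()  # single summary of out-of-sample performance
std_score = scores.std()    # spread across folds: how stable that summary is

print(scores)
print("Mean R^2: %.3f (std %.3f)" % (mean_score, std_score))
```

Reporting the standard deviation next to the mean is a common habit: a large spread across folds suggests the average alone may be hiding unstable behaviour.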

I think you are doing things right here; however, there are many other methods of model validation that you might want to explore.
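
As one example of another validation setup (a sketch, not the only option), the folds can be shuffled with KFold and the power transform wrapped in a pipeline so that it is applied identically during fitting and scoring. In the code posted in #1 the model is fitted on train_x**4.2 but scored on the untransformed x, which a pipeline avoids. The synthetic data, the helper name power_features, and the random seeds are assumptions for illustration; the exponent 4.2 is taken from the original post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def power_features(x):
    # Hypothetical helper: the same power transform used in the post
    return x ** 4.2


# Synthetic stand-in for the real data (assumption), with x in [0, 1]
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(300, 1))
y = 5.0 * x[:, 0] ** 4.2 + rng.normal(0.0, 0.05, size=300)

# The pipeline applies the transform before every fit and every predict
model = make_pipeline(FunctionTransformer(power_features), LinearRegression())

# Shuffled 10-fold split; random_state fixes the shuffle for repeatability
cv = KFold(n_splits=10, shuffle=True, random_state=0)

r2_scores = cross_val_score(model, x, y, cv=cv)
# sklearn returns the negated MSE so that higher is always better; flip the sign back
mse_scores = -cross_val_score(model, x, y, cv=cv, scoring="neg_mean_squared_error")

print("Mean R^2: %.3f" % r2_scores.mean())
print("Mean MSE: %.4f" % mse_scores.mean())
```

Whether shuffling is appropriate depends on the data. For time-ordered records such as daily counts, an ordered scheme like sklearn's TimeSeriesSplit may be a better fit.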

Hope this helps.

Regards,
Aayush