I am very new to the ML field and have just started to get my hands dirty by applying some techniques. I am learning by doing and reading up on things as they come up. I would appreciate expert input, as I am stuck on a question I have not been able to answer.
I have created a model using regression and was trying to check whether it underfits or overfits by running cross-validation (CV) on it using Python (sklearn). This is the first model I have created, so please bear with the novice questions. I have used a polynomial relation between x and y in the regression model.
Data set link: http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset#
Below is my code:
    import matplotlib.pyplot as plt
    import seaborn as sns
    import pandas as pd
    import numpy as np
    from sklearn import linear_model, cross_validation
    import math

    data = pd.read_csv("location")
    from sklearn import metrics

    x = data.temp.values
    y = data.cnt.values
    x = x.reshape(len(x), 1)
    y = y.reshape(len(y), 1)

    train_x = x[:482]
    test_x = x[482:]
    train_y = y[:482]
    test_y = y[482:]

    regr = linear_model.LinearRegression()
    clf = regr.fit(train_x**4.2, train_y)
    scores = cross_validation.cross_val_score(clf, x, y, cv=10)

    print('Coefficients: \n', regr.coef_)
    print("Residual sum of squares: %.2f" % np.mean((regr.predict(test_x) - test_y) ** 2))
    #Explained variance score: 1 is perfect prediction
    print('Variance score: %.2f' % regr.score(test_x, test_y))
    print(scores)
Below is the output:
Residual sum of squares: 1906498.12
Variance score: 0.26
[-14.82381293 -0.29423447 -13.56067979 -1.6288903 -0.31632439
0.53459687 -1.34069996 -1.61042692 -4.03220519 -0.24332097]
What I need to understand is what these different values mean. In all the code I have seen online, everyone takes the mean of them. Why?
Is this the right approach to validate whether the model is appropriate? Please provide your expert input.
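For reference, the "take the mean" pattern I keep seeing online looks roughly like the sketch below. It is only an illustration: it uses synthetic data rather than the bike sharing data, the names x_demo, y_demo and fold_scores are made up, and it assumes a newer sklearn where cross_val_score lives in sklearn.model_selection rather than in the cross_validation module I used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (NOT the bike sharing data): y is roughly linear in x.
rng = np.random.RandomState(0)
x_demo = rng.uniform(0, 1, size=(200, 1))
y_demo = 3.0 * x_demo.ravel() + rng.normal(scale=0.1, size=200)

# cv=10 returns one score per held-out fold, so 10 values in total.
# For a regressor the default score is R^2 (the estimator's own .score()).
fold_scores = cross_val_score(LinearRegression(), x_demo, y_demo, cv=10)

# The step I see everywhere: collapse the per-fold scores into a mean (and spread).
print("Per-fold scores:", fold_scores)
print("Mean: %.2f, std: %.2f" % (fold_scores.mean(), fold_scores.std()))
```

As far as I can tell, each entry is the score on one held-out fold, which would explain why I get 10 numbers with cv=10, but I am not sure how to interpret them or their mean.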
Thanks for your help!!