Why are datasets split into train and test data?

data
data_mining
data_science

#1

Hey,
I have been working on the bike sharing problem on Kaggle. I noticed that the dataset has been split into train and test data.
Why is it split, instead of analysing the whole dataset together?

Thanks


#2

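To avoid overfitting. If a model is fitted and evaluated on the same data, you cannot tell whether it has learned the underlying relationship or just memorized that particular data, so part of the data is held out and used only for testing.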

In statistics and machine learning, overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.

In order to avoid overfitting, it is necessary to use additional techniques (e.g. cross-validation, regularization, early stopping, pruning, Bayesian priors on parameters, or model comparison) that can indicate when further training is not resulting in better generalization. The basis of some techniques is either (1) to explicitly penalize overly complex models, or (2) to test the model’s ability to generalize by evaluating its performance on a set of data not used for training, which is assumed to approximate the typical unseen data that a model will encounter.
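
To make point (2) concrete, here is a minimal Python sketch. It does not use the bike sharing data; the data is synthetic (noisy samples of a straight line), and the function names, the 70/30 split, and the two models are all assumptions chosen for illustration. It compares a simple least-squares line with a "memorizer" (1-nearest-neighbour) that just returns the target of the closest training point.

```python
import random

def make_data(n, rng):
    """Synthetic data for illustration only: y = 2x + 1 plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + 1 + rng.gauss(0, 2)) for x in xs]

def fit_line(data):
    """Ordinary least squares fit of y = a*x + b (a simple model)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b

def fit_memorizer(data):
    """1-nearest-neighbour: predicts the y of the closest training x.
    It reproduces the training data exactly, noise included."""
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

def mse(model, data):
    """Mean squared error of a model on a list of (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
data = make_data(1000, rng)
rng.shuffle(data)
train, test = data[:700], data[700:]  # hold out 30% that the models never see

line = fit_line(train)
memo = fit_memorizer(train)

results = {
    "line": (mse(line, train), mse(line, test)),
    "memorizer": (mse(memo, train), mse(memo, test)),
}
for name, (train_err, test_err) in results.items():
    print(f"{name:10s} train MSE = {train_err:.2f}   test MSE = {test_err:.2f}")
```

Judged only on the training data, the memorizer looks perfect, because its training error is zero. On the held-out test data its error is clearly worse than the line's, since it has fitted the noise as well as the trend. If you analysed the whole dataset together, you would have no held-out rows to reveal that.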