I am assuming that your Insurance model is a classification model. The data splitting operation is called partitioning. You can split (partition) the data set into 2 or 3 pieces, depending on how your modeling algorithm works. Random sampling can be used to create the data sets (most tools provide several ways to partition data sets). If the algorithm automatically splits the incoming data set internally into training and testing data sets (as in KNIME), then you need to split your data set into only two pieces: a modeling data set (to submit to the algorithm) and a validation data set. Naming varies between tools: what one tool calls the “testing set” another calls the “validation set”. If your algorithm requires 2 input data sets (as in IBM SPSS Modeler), you must set the partitioning process node to produce 3 data sets. In any event, the overall process requires 3 data sets: (1) a training set; (2) a testing set; and (3) a validation set.

The model is trained with the training set, and the prediction error is calculated internally after the first iteration of the model, using the testing data set. Then a training parameter is changed slightly, and another iteration of the model is run. Many iterations of the algorithm are run in this way, and the best model (the one with the lowest prediction error on the testing set) is kept. Machine learning algorithms learn case by case, not globally from summary statistics such as means and standard deviations.
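As a rough illustration of the three-way random partitioning described above, here is a minimal Python sketch using only the standard library. The function name `partition`, the 60/20/20 proportions, and the fixed seed are assumptions chosen for the example, not settings taken from KNIME or IBM SPSS Modeler; your tool's partitioning node will have its own options.

```python
import random


def partition(rows, train_frac=0.6, test_frac=0.2, seed=42):
    """Randomly split rows into training, testing, and validation lists.

    The fractions are illustrative assumptions; whatever is left after the
    training and testing shares becomes the validation set.
    """
    if not 0 < train_frac + test_frac < 1:
        raise ValueError("train_frac + test_frac must be between 0 and 1")

    shuffled = list(rows)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded shuffle = repeatable random sample

    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)

    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # held out until modeling is finished
    return train, test, validation


# Stand-in for 100 insurance records (hypothetical data: just record IDs).
records = list(range(100))
train, test, validation = partition(records)
print(len(train), len(test), len(validation))
```

If your algorithm does its own internal training/testing split, you would only need two pieces: combine `train` and `test` into the modeling data set and keep `validation` aside.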

After the modeling algorithm is finished and predictions are produced, the third data set (the validation set) is used to calculate the prediction accuracy. You can’t use the training or testing data set for accuracy checking, or you would be checking the accuracy with the same data that was used to build the model (circular reasoning). Calculate the Sensitivity (the accuracy for predicting the positive target class: true positives divided by all actual positives), the Specificity (the accuracy for predicting the negative target class: true negatives divided by all actual negatives), and the mean of Sensitivity and Specificity, which is usually called the balanced accuracy. Note that this is not the same as the overall accuracy (the proportion of all cases predicted correctly) unless the two classes are the same size. You can look up fuller definitions on the web.
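To make these accuracy measures concrete, here is a small Python sketch that computes them from actual and predicted class labels on the validation set. The function names and the ten example labels are made up for illustration, and the code assumes a binary target coded as 1 (positive) and 0 (negative), with at least one case in each class.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary target."""
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1   # positive case predicted as positive
            else:
                fn += 1   # positive case missed
        else:
            if p == positive:
                fp += 1   # negative case wrongly flagged as positive
            else:
                tn += 1   # negative case predicted as negative
    return tp, fn, tn, fp


def validation_metrics(actual, predicted, positive=1):
    """Sensitivity, specificity, their mean, and the overall accuracy."""
    tp, fn, tn, fp = confusion_counts(actual, predicted, positive)
    sensitivity = tp / (tp + fn)   # accuracy on the positive class
    specificity = tn / (tn + fp)   # accuracy on the negative class
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "overall_accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Hypothetical validation-set labels: 4 positive cases and 6 negative cases.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = validation_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up example the mean of Sensitivity and Specificity differs slightly from the proportion of cases predicted correctly, because the two classes are not the same size.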

Bob Nisbet

Data Science Instructor

University of California, Irvine