How to test your prediction for an insurance company dataset

Hello everyone,

I have a dataset from an insurance company for my data science class project. My ultimate business objective in this project is to sell more insurance policies to existing customers/customer segments.

First, I want to cluster my customers with a k-means model on their RFM scores and then use the Apriori algorithm to find association rules within these clusters (a rough sketch of this pipeline is included below). From that, I can identify which customers/customer segments I can sell more policies to. However, my teacher wants me to test my prediction, and he said that since the policies are repeated every year, I cannot simply split my data so that the last 3 months are the test data and the remaining 9 months are the training data. To sum up, he wants me to test my prediction in an accurate way. How can I test my model in this specific case?
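To make that concrete, here is a minimal Python sketch of the plan; the file name, the RFM column names, the one-hot policy columns, and the number of clusters are placeholders rather than details of my real data (k-means from scikit-learn, Apriori from mlxtend):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from mlxtend.frequent_patterns import apriori, association_rules

df = pd.read_csv("customers.csv")  # placeholder file: one row per customer

# 1. Cluster customers on their (standardized) RFM scores with k-means.
rfm = StandardScaler().fit_transform(df[["Recency", "Frequency", "Monetary"]])
df["segment"] = KMeans(n_clusters=4, random_state=42).fit_predict(rfm)

# 2. Mine association rules over the policies held, per segment
#    (assumes one-hot policy columns such as "has_auto", "has_home", ...).
policy_cols = [c for c in df.columns if c.startswith("has_")]
for seg, grp in df.groupby("segment"):
    frequent = apriori(grp[policy_cols].astype(bool),
                       min_support=0.1, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
    print(f"Segment {seg}:")
    print(rules[["antecedents", "consequents", "support", "confidence"]])
```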

I am assuming that your insurance model is a classification model. The data-splitting operation is called partitioning. You can split (partition) the data set into 2 or 3 pieces, depending on how your modeling algorithm works. Random sampling can be used to create the data sets (most tools provide several ways to partition data sets). If the algorithm automatically splits the incoming data set internally into training and testing sets (like KNIME), then you need to split your data into only two pieces: a modeling data set (to submit to the algorithm) and a validation data set (some tools call the testing data set the "validation set" and some call it the "testing set"). If your algorithm requires 2 input data sets (like IBM SPSS Modeler), you must set the partitioning node to produce 3 data sets.

In any event, the overall process requires 3 data sets: (1) the training set; (2) the testing set; and (3) the validation set. The model is trained on the training set, and the prediction error is calculated internally after the first iteration of the model using the testing data set. Then a training parameter is changed slightly, and another iteration of the model is run. Many iterations of the algorithm are run in this way, and the best model is kept. Machine learning algorithms learn case by case, not globally with means and standard deviations.
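If you are working in Python rather than a tool like KNIME or SPSS Modeler, a minimal sketch of the random three-way partition described above could look like the following (the 60/20/20 proportions, file name, and target column are assumptions for illustration, not requirements):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("policies.csv")      # placeholder file name
X = df.drop(columns=["bought_more"])  # assumed feature columns
y = df["bought_more"]                 # assumed binary target

# Hold out 20% of the rows as the validation set (used only at the end).
X_model, X_valid, y_model, y_valid = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# Split the remaining 80% into training and testing sets
# (75/25 of it, i.e. 60% / 20% of the original data).
X_train, X_test, y_train, y_test = train_test_split(
    X_model, y_model, test_size=0.25, stratify=y_model, random_state=42)
```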
After the modeling algorithm is finished and predictions are produced, the 3rd data set (the validation set) is used to calculate the prediction accuracy. You can't use the training or testing data set for accuracy checking, or you would be checking the accuracy with the same data that was used to train the model (a logical tautology). Calculate the Sensitivity (the accuracy of predicting the positive target class), the Specificity (the accuracy of predicting the negative target class), and the balanced overall accuracy (the mean of Sensitivity and Specificity). Look up the definitions on the web.
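As a concrete illustration, assuming a fitted classifier `model` and the hold-out validation set from the partitioning sketch above, the three figures can be read off the confusion matrix:

```python
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_valid)      # predictions on the validation set
tn, fp, fn, tp = confusion_matrix(y_valid, y_pred).ravel()

sensitivity = tp / (tp + fn)                         # accuracy on the positive class
specificity = tn / (tn + fp)                         # accuracy on the negative class
balanced_accuracy = (sensitivity + specificity) / 2  # mean of the two

print(f"Sensitivity:       {sensitivity:.3f}")
print(f"Specificity:       {specificity:.3f}")
print(f"Balanced accuracy: {balanced_accuracy:.3f}")
```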
Bob Nisbet
Data Science Instructor
University of California, Irvine