ANN model via R - H2O


#1

Dear all, I built an ANN model in R using the H2O package, aiming at predicting users fitting a certain profile. It is a classification model based on 1 and 0 (1 means the user fits the profile, 0 means they do not). As is customary with machine learning algorithms, I split the data into train and test sets where the 1/0 labels were already available (it is in fact a supervised model), and I came up with a very high accuracy (circa 88%) as well as high sensitivity, specificity, precision, and negative predictive value (NPV). Everything looked beautiful when I applied the model to the test data after training it on the training set. The problem starts when I apply the same model, which worked so well on the test set, to a new data set of users where the 1/0 labels are unknown: since this data is refreshed daily and covers users registered between 90 and 30 days ago (just like the test dataset), I had the chance to compare all users flagged 1/0 one day with the same data set the day after, or a few days after.
To my disappointment, I found that a good deal of users who were flagged as 1 the previous day were flagged as 0 the next day, despite little or no change across all variables.
Basically, since the daily refresh captures all users registered between 30 and 90 days ago, the same users overlap from one report to the next, so I can compare the model's results day over day. Although accuracy is high within the 0 population, the model performed very poorly on the 1s. Which makes me wonder: how reliable is this model? All the evaluation metrics scored high on the test data, but when applied to the fresh data (mirroring the same variables as the test data) it performed very poorly on the 1s… Can anyone help me understand whether there is an issue, and what it is? Thanks
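For context, the day-over-day comparison I describe looks roughly like this (a minimal sketch in base R; `user_id` and `flag` are placeholder column names, and the two data frames stand in for two scored snapshots of the same rolling 30-90 day window):

```r
# Two scored snapshots of the same rolling window, taken on
# consecutive days (placeholder data and column names).
yesterday <- data.frame(user_id = c(1, 2, 3, 4),
                        flag    = c(1, 1, 0, 0))
today     <- data.frame(user_id = c(1, 2, 3, 4),
                        flag    = c(0, 1, 0, 0))

# Join the two snapshots on user_id to line up each user's flags.
both <- merge(yesterday, today, by = "user_id",
              suffixes = c("_yday", "_today"))

# Users flagged 1 yesterday that flipped to 0 today.
flipped   <- subset(both, flag_yday == 1 & flag_today == 0)
flip_rate <- nrow(flipped) / sum(both$flag_yday == 1)
print(flip_rate)
```

In my case a large fraction of yesterday's 1s flip to 0, even though the underlying variables barely move.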


#2

Have you checked whether your dataset is unbalanced?
See whether you have roughly equal numbers of ones and zeros in your target variable.
If not, try stratified sampling, or use oversampling and undersampling methods to balance the data.
If the model still performs badly after balancing, it might be due to overfitting, so use one of the dropout activation functions in h2o…
Below is a link to a kernel I wrote on Kaggle for balancing the data (it might help, as it is in R):

https://www.kaggle.com/himakund/cc-fraud-detection-balancing-the-unbalanced
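The balancing and dropout suggestions above can be sketched directly with h2o's own arguments (a minimal sketch, not runnable as-is: `train` and the `"target"` column are placeholders for your own frame, and it assumes a running H2O cluster; `balance_classes` does the minority-class oversampling for you):

```r
library(h2o)
h2o.init()

# 'train' is your training H2OFrame; "target" is the 0/1 response
# column (placeholder names). Make sure the response is a factor so
# h2o treats this as classification rather than regression.
train$target <- as.factor(train$target)

model <- h2o.deeplearning(
  x = setdiff(names(train), "target"),
  y = "target",
  training_frame = train,
  activation = "RectifierWithDropout",  # dropout variant of the activation
  hidden = c(64, 64),
  hidden_dropout_ratios = c(0.5, 0.5),  # dropout rate per hidden layer
  input_dropout_ratio = 0.1,            # dropout on the input layer
  balance_classes = TRUE,               # oversample the minority class
  seed = 1234
)
```

Note that `hidden_dropout_ratios` only takes effect with one of the `...WithDropout` activations, and `balance_classes` only resamples the training frame, so your test metrics still reflect the original class distribution.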