Fraud predictive model


#1

Hi,
I would like to build a fraud prediction model for financial services (fraud rate approximately 0.1%) that predicts which business entities will experience fraud in the next 24 hours. I have 100 million transactions covering only 3 months, but my server cannot process more than 20 million transactions.
I cannot use a full year of data, and even if I use only the 3 months of data, I cannot process it. When the data is too big, will sampling work effectively? If so, what are the best sampling techniques in this situation?


#2

Hi,
The ratio of fraud to non-fraud is what you should look at, since you have one heavily under-represented class. Depending on the method you use, you can work with a ratio of 1 fraud to 6 or 10 non-fraud, for example with random forest, and still get good results.
If you can process 1 million observations, you will have about 1 fraud for every 10 non-fraud (0.1% of 100 million transactions is roughly 100,000 frauds), so take all the fraud and sample the non-fraud. The non-fraud sample could be stratified, I suppose, based on clusters of non-fraud transactions, but that depends on your population.
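
To make the "take all the fraud, sample the rest" idea concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not a tested recipe for this data set: the function names `keep_probability` and `undersample`, the `fraud` field, and the synthetic demo rows are all hypothetical, and simple random (Bernoulli) sampling in one streaming pass is used instead of the cluster-based stratification mentioned above. The streaming form is chosen so the full 100 million rows never have to sit in memory at once.

```python
import random
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def keep_probability(fraud_rate: float, nonfraud_per_fraud: float) -> float:
    """Chance of keeping each non-fraud row so that, in expectation,
    the sample holds `nonfraud_per_fraud` non-fraud rows per fraud row."""
    p = nonfraud_per_fraud * fraud_rate / (1.0 - fraud_rate)
    return min(1.0, p)


def undersample(
    rows: Iterable[T],
    is_fraud: Callable[[T], bool],
    fraud_rate: float,
    nonfraud_per_fraud: float = 10.0,
    seed: int = 0,
) -> Iterator[T]:
    """Yield every fraud row and a random subset of the non-fraud rows.

    Rows are read one at a time, so the input can be a file reader or
    a database cursor rather than an in-memory list.
    """
    rng = random.Random(seed)
    p = keep_probability(fraud_rate, nonfraud_per_fraud)
    for row in rows:
        if is_fraud(row) or rng.random() < p:
            yield row


# Synthetic demo data (hypothetical): 200,000 rows with ~0.1% fraud.
demo_rng = random.Random(42)
demo_rows = [{"id": i, "fraud": demo_rng.random() < 0.001} for i in range(200_000)]

sample = list(undersample(demo_rows, lambda r: r["fraud"], fraud_rate=0.001))
n_fraud = sum(r["fraud"] for r in sample)
n_nonfraud = len(sample) - n_fraud
print(f"kept {n_fraud} fraud and {n_nonfraud} non-fraud rows")
```

With a 0.1% fraud rate and a target of 10 non-fraud per fraud, each non-fraud row is kept with probability of about 1%, so 100 million transactions would shrink to roughly 1.1 million rows. Because every non-fraud row has the same chance of being kept, any grouping of the non-fraud rows stays proportionally represented on average; a stratified version would apply a separate keep probability per cluster.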
Hope this helps
Alain