I would like to build a fraud prediction model for financial services (fraud rate approx. 0.1%) to predict which business entities will have fraud in the next 24 hours. I have 100 million transactions for only 3 months of data, but my server cannot process more than 20 million transactions.
I cannot use a full year's worth of data, and even if I try to use only 3 months of data, I cannot process it. If the data is too big, will sampling work effectively? If yes, what are the best sampling techniques in this situation?
The ratio of fraud to non-fraud is what you should look at, as you will have one heavily imbalanced class. Depending on the method you use, you can work with a ratio of 1 to 6 or 1 to 10, for example with random forest, and still get good results.
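As a rough check on what those ratios mean for the numbers in the question (100 million transactions, about 0.1% fraud), here is a small sketch of the arithmetic. The function name `sample_size` is made up for illustration, and it assumes every fraud row is kept and only the non-fraud rows are sampled down.

```python
def sample_size(total_rows, fraud_rate, nonfraud_per_fraud):
    """Return (n_fraud, n_nonfraud, n_total) for a sample that keeps every
    fraud row and `nonfraud_per_fraud` non-fraud rows per fraud row."""
    n_fraud = round(total_rows * fraud_rate)
    # Cannot keep more non-fraud rows than actually exist.
    n_nonfraud = min(n_fraud * nonfraud_per_fraud, total_rows - n_fraud)
    return n_fraud, n_nonfraud, n_fraud + n_nonfraud


# Figures taken from the question: 100 million rows, ~0.1% fraud.
print(sample_size(100_000_000, 0.001, 6))   # (100000, 600000, 700000)
print(sample_size(100_000_000, 0.001, 10))  # (100000, 1000000, 1100000)
```

Under these assumptions both totals are far below the 20 million rows the server is said to handle.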
If you can process about 1 million observations, you could have 1 fraud for every 10 non-fraud: you take all the fraud and sample the non-fraud. The non-fraud sample could be stratified, I suppose, based on clusters of non-fraud transactions, but that depends on your population.
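The "take all the fraud, sample the non-fraud" step can be sketched as a single pass over the data, so the full set never has to fit on the server at once. This is only a minimal illustration in plain Python: the names `keep_probability` and `undersample`, the `is_fraud` field, and the synthetic rows are all assumptions, not part of the original setup.

```python
import random


def keep_probability(fraud_rate, nonfraud_per_fraud):
    """Chance of keeping each non-fraud row so that the sample ends up with
    roughly `nonfraud_per_fraud` non-fraud rows per fraud row."""
    p = nonfraud_per_fraud * fraud_rate / (1.0 - fraud_rate)
    return min(1.0, p)


def undersample(rows, is_fraud, keep_prob, seed=0):
    """Stream over `rows`, yielding every fraud row and a random subset
    of the non-fraud rows (each kept with probability `keep_prob`)."""
    rng = random.Random(seed)
    for row in rows:
        if is_fraud(row) or rng.random() < keep_prob:
            yield row


# Synthetic stand-in for the transaction stream: 1 fraud per 1000 rows.
rows = ({"id": i, "is_fraud": i % 1000 == 0} for i in range(200_000))
p = keep_probability(fraud_rate=0.001, nonfraud_per_fraud=10)
sample = list(undersample(rows, lambda r: r["is_fraud"], p))

n_fraud = sum(r["is_fraud"] for r in sample)
print(n_fraud, len(sample) - n_fraud)  # all 200 fraud rows, about 2000 non-fraud
```

To stratify the non-fraud sample by cluster, one option would be to use a different keep probability per cluster; that is not shown here.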
Hope this helps.