I need help with one of my key projects, which I am working on for Mercedes-Benz. I support their after-sales business, and we use analytics solutions to help them.
The project I need help with is an audit tool. At MB, multiple auditors visit dealerships once every few months to audit warranty claims for fraud and recover money. Right now, they pick claims for audit either randomly or based on their past experience. This is a tedious approach, and they miss a lot of claims that could return money.
I have to apply data science to their historical data to find patterns, so that a model can tell them which claims they should audit. But I have a few challenges:
- This is a highly imbalanced dataset: only 1.2% of audited claims have resulted in actual debits.
- One claim can have multiple labor operations/parts. My understanding of machine learning so far is that we need a key column to build the model around, but because a claim has multiple part/labor lines, I am not sure how to proceed.
- There are some text columns with useful information that should be used as features.
- There are around 500 dealers and more than 10,000 damage codes in this dataset. How can I use analytics on categorical variables with such high cardinality?
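To make the multiple-lines-per-claim problem concrete, here is a minimal sketch of what I was considering: rolling the part/labor lines up to one row per claim, so the claim ID becomes the key column. All column names (`claim_id`, `line_type`, `line_cost`) are hypothetical, not our real schema:

```python
import pandas as pd

# Hypothetical line-level data: one row per part/labor line within a claim.
lines = pd.DataFrame({
    "claim_id":  [1, 1, 2, 3, 3, 3],
    "line_type": ["part", "labor", "part", "part", "part", "labor"],
    "line_cost": [120.0, 80.0, 45.0, 300.0, 15.0, 60.0],
})

# Aggregate to one row per claim so the model has a single key column.
claims = lines.groupby("claim_id").agg(
    n_lines=("line_cost", "size"),
    total_cost=("line_cost", "sum"),
    max_line_cost=("line_cost", "max"),
    n_parts=("line_type", lambda s: (s == "part").sum()),
)
print(claims)
```

Is this claim-level aggregation the right way to go, or should I model at the line level and combine the scores afterwards?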
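For the imbalance, one idea I had was to reweight the rare positive class instead of throwing away negatives, and to evaluate with PR-AUC rather than accuracy. A minimal sketch on synthetic data (the real features and debit rate come from our dataset; everything here is made up to mimic the 1.2% positive rate):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for claim-level features, ~1.2% positives.
X = rng.normal(size=(5000, 8))
y = (rng.random(5000) < 0.012).astype(int)
X[y == 1] += 0.8  # give positives a weak signal so there is something to learn

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" upweights the rare positives; at this imbalance,
# PR-AUC (average precision) is more informative than accuracy.
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]
print("PR-AUC:", average_precision_score(y_te, scores))
```

Would class weighting like this be enough, or should I also look at resampling approaches?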
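For the text columns and the high-cardinality categoricals, I was thinking of TF-IDF for the free text plus an encoder for dealer/damage codes, combined in one pipeline. A sketch with a tiny made-up table (column names like `complaint_text` and `dealer_id` are hypothetical; one-hot is shown only because it is built into scikit-learn, and I suspect target encoding would scale better to 10,000 damage codes):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny hypothetical claim-level table; the real columns will differ.
df = pd.DataFrame({
    "complaint_text": [
        "engine noise at cold start", "oil leak near gasket",
        "brake squeal front left", "engine noise under load",
        "oil leak after service", "brake pedal soft",
    ],
    "dealer_id": ["D001", "D002", "D001", "D003", "D002", "D003"],
    "debit": [1, 0, 0, 1, 0, 1],
})

pre = ColumnTransformer([
    # TF-IDF turns free-text complaint notes into sparse numeric features.
    ("text", TfidfVectorizer(), "complaint_text"),
    # One-hot works for ~500 dealers; for 10k damage codes, target encoding
    # may be a better fit, but that needs a separate library.
    ("dealer", OneHotEncoder(handle_unknown="ignore"), ["dealer_id"]),
])

model = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])
model.fit(df[["complaint_text", "dealer_id"]], df["debit"])
preds = model.predict_proba(df[["complaint_text", "dealer_id"]])[:, 1]
print(preds)
```

Is this a sensible direction for mixing text and categorical features, and what encoding would you recommend for the damage codes?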
Can somebody please look at it and help me out?