I’m new to machine learning and data science and am trying to apply some of what I’ve learned to real data at my company. I’ve attached the data I’m working with. It is a pivot table showing the cumulative number of insurance claims captured by the end of each dev month.
As an example using 201708: by the end of August, 12 523 claims had been captured where the “incident” occurred in August (dev month 0). By the end of September, this number had increased to 13 996, i.e. 1 473 claims were captured in September where the “incident” occurred in August. If you plot this out, you get a big jump in the first month, after which the total flattens out (since it is unusual to report a claim several months after the incident occurred). The goal is to predict what the value will be when we reach dev month 10.
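To make the cumulative vs. per-month distinction concrete, here is a minimal sketch that recovers the claims newly captured in each dev month from the cumulative totals. It uses only the two 201708 values quoted above; the function name `incremental_counts` is just an illustrative choice.

```python
# Cumulative claim counts for incident month 201708, keyed by dev month.
# Only the two values quoted in the post are used here.
cumulative = {0: 12523, 1: 13996}


def incremental_counts(cum):
    """Return the claims newly captured in each dev month, given cumulative totals."""
    months = sorted(cum)
    # The first dev month has no predecessor, so its increment is its total.
    inc = {months[0]: cum[months[0]]}
    for prev, cur in zip(months, months[1:]):
        inc[cur] = cum[cur] - cum[prev]
    return inc


increments = incremental_counts(cumulative)
print(increments)  # {0: 12523, 1: 1473}
```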
To do this I’ve restructured my data (attached) and built a regression model on top of it. I take the natural log of the claim counts, which I found improves the fit.
I’m wondering whether there is a better way to do this. The type of data makes me think I should be using some sort of time series model, but I’m not sure how I would then account for the cumulative nature of the counts. Any assistance or advice on this would be greatly appreciated.
example_data.csv (1.6 KB)
restructured_data.csv (9.5 KB)