Predicting mobile data usage


I am trying to predict a user's data usage at any point in time. For example, I should be able to predict how much data you have consumed since your last purchase. I have a dataset that shows:

  1. Date of purchase of data
  2. Amount Paid
  3. Data bought

I have been working through this dataset. Are these variables enough? Which algorithm would work best? (I am thinking of multiple linear regression.)




Do you have the actual values for some rows?

Could you give more details about the dataset, i.e. the number of rows and columns in the train and test sets respectively?


Yes, there are actual values. I have 3 columns and up to 1,000,000 rows. The columns are:

  1. Date of purchase of data
  2. Amount Paid
  3. Data bought.

10/01/2018 13:15:00 | 5000 | 10G
13/12/2018 08:30:30 | 3500 | 6.5G


You certainly need to create more variables from the existing ones. From the date of purchase you can calculate the number of days from purchase to today, or the month of purchase. From the amount paid and the data bought you can create another feature, for instance their ratio (amount/data). Think about what other features you can build along the same lines (reading articles on feature engineering will give you ideas).
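
To make the features above concrete, here is a minimal sketch using only the Python standard library and the two sample rows posted earlier. It rests on a few assumptions: the dates are day/month/year (13/12/2018 cannot have month 13), the trailing "G" means gigabytes, and "today" is pinned to a fixed date so the result is reproducible. The function and key names are made up for illustration.

```python
from datetime import datetime

# The two sample rows from the thread: (purchase date, amount paid, data bought).
RAW_ROWS = [
    ("10/01/2018 13:15:00", "5000", "10G"),
    ("13/12/2018 08:30:30", "3500", "6.5G"),
]


def build_features(row, today):
    """Derive days since purchase, purchase month and amount/data ratio."""
    date_str, amount_str, data_str = row
    # Assumption: day/month/year ordering.
    purchased = datetime.strptime(date_str, "%d/%m/%Y %H:%M:%S")
    amount = float(amount_str)
    # Assumption: the trailing "G" denotes gigabytes.
    data_gb = float(data_str.rstrip("G"))
    return {
        "days_since_purchase": (today - purchased).days,  # whole days only
        "purchase_month": purchased.month,
        "amount_per_gb": amount / data_gb,
    }


# Assumption: a fixed reference date stands in for "today".
today = datetime(2019, 1, 1)
features = [build_features(row, today) for row in RAW_ROWS]

for feature_row in features:
    print(feature_row)
```

With a million rows you would probably do the same thing column-wise in a dataframe library rather than row by row, but the derived features are the same.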

You might find the article below on feature tools helpful. Also, you cannot decide which algorithm is best without exploring the data. Create a benchmark model first, then try different models to see what works best.
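
As a rough sketch of the benchmark-first workflow: score a trivial model that always predicts the mean, then check whether a candidate model beats it on held-out rows. Everything in the example is hypothetical. The toy numbers are invented, the `used_gb` target is an assumed column of observed usage, and the one-feature straight line is just one possible candidate, not a recommendation.

```python
from statistics import mean

# Hypothetical toy data, NOT from the thread: (days_since_purchase, used_gb).
train = [(1, 0.4), (3, 1.1), (5, 2.1), (8, 3.0), (10, 4.2)]
test = [(2, 0.8), (6, 2.4), (9, 3.5)]


def mean_absolute_error(predict, rows):
    """Average absolute gap between predictions and observed usage."""
    return mean(abs(predict(x) - y) for x, y in rows)


# Benchmark: ignore the features and always predict the training mean.
train_mean = mean(y for _, y in train)


def baseline(days):
    return train_mean


# Candidate: ordinary least-squares line fitted on the training rows.
x_bar = mean(x for x, _ in train)
slope = (
    sum((x - x_bar) * (y - train_mean) for x, y in train)
    / sum((x - x_bar) ** 2 for x, _ in train)
)
intercept = train_mean - slope * x_bar


def linear(days):
    return intercept + slope * days


baseline_mae = mean_absolute_error(baseline, test)
linear_mae = mean_absolute_error(linear, test)
print(f"baseline MAE: {baseline_mae:.3f}")
print(f"linear MAE:   {linear_mae:.3f}")
```

Any model you try later should be compared against the same benchmark on the same held-out rows. If it cannot beat the mean predictor, the features probably need more work before the choice of algorithm matters.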