How do I train and test on a dataset with multiple records per patient?



I would like to know how to deal with a dataset that contains multiple records for the same subject. For example, suppose I have a list of patients and want to predict the risk of disease using attributes such as weight, heart rate measured every hour, and other tests. The same patient then appears multiple times in the dataset, as below:

Patient   Weight   HeartRate   RiskScore
0001      45       90          9.011
0001      47       93          9.23
0002      69       100         4.33
0002      68       101         3.44
0002      69       103         2.43

With the training set above, I want to predict the risk score for any new patient with similar data. How do I train on such data? Do I need to average the values for each patient, or is there an algorithm that can handle this kind of dataset directly?
Can you please help with this?


Hi @anwarm1,

I need just one clarification. How can one patient have different Weight, HeartRate and RiskScore values? Were these measurements taken at different points in time? If so, is there any time-related information in the dataset?


Yes, weight is measured every few hours, so it is a time-based measurement. For example, in a serious illness a blood test is done every morning and evening, so you end up with multiple values for the same patient. These values are sometimes important for predicting things like the severity of the disease.


That’s an interesting problem statement. My concern is that if you merge the values, you will lose a lot of information. For instance, if a patient's weight keeps decreasing, the risk score is likely to shoot up. But if you take the average, you lose this trend.
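To make the concern about averaging concrete, here is a minimal sketch (not a definitive method) that collapses each patient to one row but keeps a trend feature next to the mean. It assumes pandas and NumPy are available, uses the toy rows from the question, and treats row order as time order, which is an assumption because the table has no time column. The helper name `slope` and the feature names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data copied from the table in the question.
df = pd.DataFrame({
    "Patient": ["0001", "0001", "0002", "0002", "0002"],
    "Weight": [45, 47, 69, 68, 69],
    "HeartRate": [90, 93, 100, 101, 103],
    "RiskScore": [9.011, 9.23, 4.33, 3.44, 2.43],
})

def slope(values):
    """Least-squares slope of the values against their row order (0, 1, 2, ...).

    Assumes the rows of each patient are already sorted by time.
    """
    y = np.asarray(values, dtype=float)
    if len(y) < 2:
        return 0.0  # a single measurement carries no trend
    x = np.arange(len(y))
    return float(np.polyfit(x, y, 1)[0])

# One row per patient: the mean plus the trend of each measurement.
features = df.groupby("Patient").agg(
    weight_mean=("Weight", "mean"),
    weight_slope=("Weight", slope),
    hr_mean=("HeartRate", "mean"),
    hr_slope=("HeartRate", slope),
)
print(features)
```

A falling weight shows up as a negative `weight_slope` even when two patients share the same `weight_mean`, which is exactly the information a plain average throws away.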

I have a few questions. Do all patients have roughly the same number of rows, say two, three or four? If so, you can create more columns such as 'weight_1', 'weight_2', 'weight_3'. For patients who have only two weight values, repeat the second value in the third column. For those who have four, fill the third column with the average of the 3rd and 4th values.
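The padding and averaging rule can be sketched in plain Python roughly as follows. The function name `to_fixed_width` is hypothetical, and extending the rule to patients with more than four values (averaging everything from the third value onward) is my assumption, not something stated above.

```python
def to_fixed_width(values, n_cols=3):
    """Squeeze a variable-length list of measurements into n_cols columns.

    - Fewer than n_cols values: repeat the last value to fill the gap.
    - More than n_cols values: keep the first n_cols - 1 values and put
      the average of all remaining values in the last column.
    """
    values = list(values)
    if not values:
        raise ValueError("need at least one measurement")
    if len(values) <= n_cols:
        return values + [values[-1]] * (n_cols - len(values))
    head = values[: n_cols - 1]
    tail = values[n_cols - 1:]
    return head + [sum(tail) / len(tail)]

# Patient 0001 has two weights, patient 0002 has three (from the question).
print(to_fixed_width([45, 47]))          # second value repeated in column 3
print(to_fixed_width([69, 68, 69]))      # already the right width
print(to_fixed_width([70, 71, 72, 74]))  # made-up patient with four values
```

Each returned list maps to the columns 'weight_1', 'weight_2', 'weight_3' in order.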


Unfortunately, in real-world scenarios the number of observations (in this case, weight) is not the same for all patients. It could be in some cases, but not all the time.
OK, so having separate columns is a good idea. Can we also have the columns represent the time frame, e.g. ‘weight_1’ for the 1st hour, ‘weight_2’ for the 2nd hour, and so on?
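For what it's worth, one way this hour-indexed layout might look in pandas is sketched below. The `Hour` column is hypothetical (the table in the question has no time column), so treat this as an illustration under that assumption rather than a recipe.

```python
import pandas as pd

# Long format: one row per patient per hour.
long_df = pd.DataFrame({
    "Patient": ["0001", "0001", "0002", "0002", "0002"],
    "Hour": [1, 2, 1, 2, 3],          # hypothetical time index
    "Weight": [45, 47, 69, 68, 69],   # weights from the question
})

# Wide format: one row per patient, one column per hour.
wide = long_df.pivot(index="Patient", columns="Hour", values="Weight")
wide.columns = [f"weight_{h}" for h in wide.columns]
print(wide)
```

Hours with no measurement for a patient come out as NaN (here, 'weight_3' for patient 0001), so they would still need to be filled in or handled by a model that tolerates missing values.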


I did assume that the number of weights would not be the same for every patient; my earlier example had one patient with 2 weights, a second with 3, and a third with 4. Can you tell me the maximum and minimum number of rows per ID?


One of the attributes has around 60 values, and the count might range from 30 to 60 at most.