Predictor Variables in my data are dependent on each other. How to build an efficient model?


I am trying to build a model using logistic regression (scikit learn package) on python to predict the output of next combination of code . Below is the sample of data I am using for model building (Actual training dataset is of more than 500 thousand observations with more than 3000 unique codes).


For the identifier 123 processed with code A1597, the output at code level is 0. Similarly, for same identifier, the output of B6578 is 1 at a code level. At the identifier level for 123, the combined overall output is 0 (if for an identifier, any output at code level is 0, the overall output will be 0. If all codes are having output as 1, the overall output will be 1).

Note: 0=No, 1=Yes, output=code level output, overall_output=identifier level output

I have classified my data into binary values as shown below.

If I have a combination of codes and any code is having the output as 0, then it will be because of other code(s) within the same identifier.

My question is, how would I build the model in such a way when the same combination occurs in future, it will predict the code level output based on the historical pattern? (For ex.: if a code combination for an upcoming identifier is A1597 & B6578, the output for A1597 should be predicted as 0 and for B6578, it should be predicted as 1.)


If your data is deterministic, you don’t even need a learning algorithm. You can use a matrix of Identifier x Code and populate it with the code level output. Here is an example:

For the overall output you can just make use of the sum of each Identifier row. If the sum of the input is equal to the sum of the Identifier row, then the overall output is 1, otherwise the output is 0.