Do the output values of different statistical models (logistic regression and XGBoost) carry the same meaning and measurement?


#1

Hi, I am new here. I tried researching this on the Internet but still can’t find a good answer, so here goes:

I am training a few models using Python and have split my data set into A and B.

  1. First, I train a separate logistic regression model on each of data sets A and B.

  2. Second, I train an XGBoost model on data set B.

For 1), can I directly compare and interpret the outputs of both models? Can I say that a score of 0.5 from model A is the same as a score of 0.5 from model B?

For 2), can I directly compare the values from logistic regression and XGBoost? Does 0.5 mean the same thing?

Your advice is very much appreciated.

Thank you. Cheers


#2

Hi @GsWong,

Let me restate your case. I am assuming that you have two datasets, A and B. You have trained model1 = LogisticRegression() on A and on B, and then trained model2 = XGBClassifier() on B. Now you want to compare these models based on their predicted output. Please correct me if any of these assumptions are wrong.

  1. No, the 0.5 is just a metric value. It’s entirely possible that the dataset has imbalanced classes or biased instances. You could compare the two models if they were trained on samples of the same dataset. If the datasets are widely different, you cannot draw any conclusions. Please note that you could also compare the models on future data from the same source (where datasets A and B were collected). If the score on future data is close to the scores on A and B, the model is said to generalize.

  2. Again, you could compare model1 and model2 under the following conditions.

    • Both models are trained on the same dataset (B, A, or A+B)
    • The evaluation metric used to score the models is the same (for example, accuracy for both model1 and model2)
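
The two conditions above can be sketched as follows. This is only a minimal illustration, not the original poster's setup: the data is synthetic (`make_classification`), and scikit-learn's `GradientBoostingClassifier` is swapped in for XGBoost so that the snippet runs with scikit-learn alone. `xgboost.XGBClassifier` follows the same `fit`/`predict_proba` interface and could be used in its place.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for "dataset B"; the real data is not available here.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Condition 1: both models are trained on the same data.
models = {
    "model1 (logistic regression)": LogisticRegression(max_iter=1000),
    # Stand-in for XGBoost (assumption, so that only scikit-learn is needed).
    "model2 (gradient boosting)": GradientBoostingClassifier(random_state=0),
}

# Condition 2: both models are scored with the same metrics
# on the same held-out rows.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # predicted P(class 1)
    results[name] = {
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
        "log_loss": log_loss(y_test, proba),
    }

for name, scores in results.items():
    print(name, scores)
```

Because both models see the same training rows and are scored on the same test rows with the same metrics, any difference in the numbers reflects the models rather than the data.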

#3

Hi @Shaz13

Thank you for taking the time to answer my questions. Your assumption is correct. I have done more research and now understand that the output of logistic regression is a probability (reference 1). So is the output of XGBoost (reference 2).
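
As a small illustration of why both outputs fall between 0 and 1: logistic regression passes a linear score through the logistic (sigmoid) function, and XGBoost with its `binary:logistic` objective applies the same function to the summed tree outputs. The sketch below uses hypothetical raw scores, not values produced by either library.

```python
import math


def sigmoid(raw_score):
    """Map a raw score (log-odds) to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-raw_score))


# Hypothetical raw scores; a raw score of 0 maps to exactly 0.5.
for raw in (-2.0, 0.0, 2.0):
    print(raw, round(sigmoid(raw), 3))
```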

So, based on my findings and going back to my Q1 and Q2, shouldn't they be comparable, since the outputs are strictly probabilities?

Let’s say the following:
Data set A: Female and other X variables
Data set B: Male and other X variables
Y (dependent variable): 1 or 0; 1 if they graduate with first class honors, 0 if they graduate without first class honors.

Can I then say that 0.5 means a 0.5 chance that the person will graduate with first class honors, for all 3 models?
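
Whether a score of 0.5 really behaves like a 50% chance for a given model can be checked empirically rather than assumed. A minimal sketch, using made-up scores and outcomes (the function name `calibration_table` and all the data are hypothetical): group the predictions into bins and compare the mean predicted score in each bin with the observed fraction of positives.

```python
def calibration_table(scores, outcomes, n_bins=5):
    """Return (mean score, observed positive rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, outcomes):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((s, y))
    table = []
    for members in bins:
        if not members:
            continue
        mean_score = sum(s for s, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        table.append((mean_score, positive_rate, len(members)))
    return table


# Hypothetical model scores and true labels (1 = first class honors, 0 = not).
scores = [0.1, 0.15, 0.3, 0.35, 0.5, 0.5, 0.5, 0.5, 0.7, 0.9]
outcomes = [0, 0, 0, 1, 1, 0, 1, 0, 1, 1]

for mean_score, positive_rate, count in calibration_table(scores, outcomes):
    print(f"mean score {mean_score:.2f}  observed rate {positive_rate:.2f}  n={count}")
```

If the observed rate in the bin around 0.5 is close to 0.5 for each of the three models on held-out data, the scores can reasonably be read the same way. If it is not, the same number may mean different things for different models.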

Reference 1: https://en.wikipedia.org/wiki/Logistic_regression
Reference 2: https://github.com/dmlc/xgboost/tree/master/demo/binary_classification

Again, many thanks for your time.