XGBoost only predicts NaN after removing all NaNs from the training data in Python

machine_learning
xgboost
data_science
python

#1

Hi, I asked a question on StackOverflow, but it went unanswered, so I decided to try it here.

Hello!
I'm trying to get my code to work. It used to run without errors, but after I changed some things in my data it stopped producing any output. The predictor seems to predict NaNs, which I find strange, as none of the input values are NaN. The error below is raised when I run xgb.train on a sample of 5,000 rows of the dataset (which has over 300,000 observations). On a smaller sample, the error does not occur.
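Before digging further, it may help to verify that assumption directly: a quick check like the one below (with made-up data) counts NaNs in both the features and the label, since a NaN hiding in the label alone is enough to make every prediction NaN.

```python
import numpy as np
import pandas as pd

# Made-up example: the features are clean, but the label hides one NaN.
X = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
y = pd.Series([0.5, np.nan, 1.5])

print(X.isnull().sum().sum())  # total NaNs in the features: 0
print(y.isnull().sum())        # NaNs in the label: 1
```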

The code I ran:

Statadata = pd.read_stata('figtemp.dta')
Statadata = Statadata.drop(Statadata[(Statadata['periodf'] == 3) | (Statadata['periodf'] == 4)].index)
Statadata = Statadata.drop(Statadata[(Statadata['periods'] == 3) | (Statadata['periods'] == 4)].index)
Statadata.drop(Statadata[Statadata['zcstscoreela'].isnull()].index, inplace=True)
Statadata.drop(Statadata[Statadata['zcstscoremath'].isnull()].index, inplace=True)

eng = Statadata[Statadata['department'] == 'english']
eng = eng.drop(eng[eng['zcstscoreelaprior'].isnull()].index)

math = Statadata[Statadata['department'] == 'math']
math = math.drop(math[math['zcstscoremathprior'].isnull()].index)

y_en_gpa = eng['gpatotal']
y_en_cst = eng['zcstscoreela']
X_en = eng.copy()
del X_en['gpatotal']
del X_en['zcstscoremath']
del X_en['zcstscoreela']
del X_en['pareduccode']
del X_en['cstscoreela']
del X_en['cstscoremath']

y_math_gpa = math['gpatotal']
y_math_cst = math['zcstscoremath']
X_math = math.copy()
del X_math['gpatotal']
del X_math['zcstscoremath']
del X_math['zcstscoreela']
del X_math['pareduccode']
del X_math['cstscoreela']
del X_math['cstscoremath']

English:

Deleting the columns and rows with missing values:

missing_en = X_en.isnull().sum()
missingbool_en = missing_en < 25
selected_en = X_en.columns[missingbool_en]
selected_en = X_en[selected_en]
selected_en = selected_en.dropna(axis=0)
y_en_cst = y_en_cst[selected_en.index]
y_en_gpa = y_en_gpa[selected_en.index]

Math:

Deleting the columns and rows with missing values:

missing_math = X_math.isnull().sum()
missingbool_math = missing_math < 25
selected_math = X_math.columns[missingbool_math]
selected_math = X_math[selected_math]
selected_math = selected_math.dropna(axis=0)
y_math_cst = y_math_cst[selected_math.index]
y_math_gpa = y_math_gpa[selected_math.index]

columns_to_overwrite = ['department', 'crsnamef', 'markf', 'crsnames', 'marks', 'cstlevelela', 'cstlevelmath', 'status', 'grade', 'gpaavg']
columns_to_overwrite2 = ['markf', 'crsnames', 'marks', 'cstlevelela', 'cstlevelmath', 'status', 'grade']

new_en = pd.get_dummies(selected_en['crsnamef'])
for i in columns_to_overwrite2:
    nieuw_en = pd.get_dummies(selected_en[i])
    new_en = new_en.merge(nieuw_en, left_index=True, right_index=True, suffixes=['_1', '_2'])

selected_en = selected_en.drop(labels=columns_to_overwrite, axis="columns")
selected_en = new_en.merge(selected_en, left_index=True, right_index=True)

Math:

Creating the dummy variables for the categorical string variables:

new_math = pd.get_dummies(selected_math['crsnamef'])
for i in columns_to_overwrite2:
    nieuw_math = pd.get_dummies(selected_math[i])
    new_math = new_math.merge(nieuw_math, left_index=True, right_index=True, suffixes=['_1', '_2'])

selected_math = selected_math.drop(labels=columns_to_overwrite, axis="columns")
selected_math = new_math.merge(selected_math, left_index=True, right_index=True)
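As a side note on the dummy-variable loops above: pandas can also one-hot encode several categorical columns in a single call, which avoids the repeated merges and the _1/_2 suffix bookkeeping. A minimal sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({
    "markf": ["A", "B", "A"],
    "grade": [9, 10, 9],
    "score": [1.0, 2.0, 3.0],
})

# Encode only the listed columns; the rest pass through unchanged.
# Dummy columns are prefixed with the source column name, so two
# columns sharing a category value can no longer collide.
encoded = pd.get_dummies(df, columns=["markf"])
print(list(encoded.columns))  # ['grade', 'score', 'markf_A', 'markf_B']
```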

X_train_math_gpa, X_test_math_gpa, y_train_math_gpa, y_test_math_gpa = train_test_split(selected_math, y_math_gpa, random_state=4)
X_train_math_cst, X_test_math_cst, y_train_math_cst, y_test_math_cst = train_test_split(selected_math, y_math_cst, random_state=4)

paramstest2 = {
    'max_depth': 8,
    'min_child_weight': 3,
    'gamma': 0.4,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
}
data_train = xgb.DMatrix(X_train_math_gpa, label=y_train_math_gpa)
data_test = xgb.DMatrix(X_test_math_gpa, label=y_test_math_gpa)

model = xgb.train(paramstest2, data_train, 5000, evals=[(data_test, "test")], verbose_eval=100, early_stopping_rounds=50)

It gives me the following error:
[13:24:16] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 26 extra nodes, 6 pruned nodes, max_depth=6
[0] test-rmse:nan
Will train until test-rmse hasn't improved in 50 rounds.
[13:24:16] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 66 extra nodes, 2 pruned nodes, max_depth=8
[13:24:16] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 36 extra nodes, 46 pruned nodes, max_depth=8
[13:24:16] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 44 pruned nodes, max_depth=6
[13:24:16] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 24 extra nodes, 92 pruned nodes, max_depth=7
[13:24:16] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 80 pruned nodes, max_depth=7
[13:24:17] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 50 pruned nodes, max_depth=4
[13:24:17] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 92 pruned nodes, max_depth=5
[13:24:17] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 102 pruned nodes, max_depth=5
[13:24:17] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 112 pruned nodes, max_depth=5
[13:24:17] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6

[13:24:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 170 pruned nodes, max_depth=0
[13:24:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 206 pruned nodes, max_depth=0
[13:24:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 160 pruned nodes, max_depth=0
[13:24:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 178 pruned nodes, max_depth=0
[13:24:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes, 142 pruned nodes, max_depth=3
[13:24:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 154 pruned nodes, max_depth=0
[13:24:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 188 pruned nodes, max_depth=0
[13:24:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 150 pruned nodes, max_depth=0
[13:24:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 160 pruned nodes, max_depth=0
[13:24:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 166 pruned nodes, max_depth=0
[13:24:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 182 pruned nodes, max_depth=0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/catlinbruys/PycharmProjects/Bachelor_Thesis/venv/lib/python3.6/site-packages/xgboost/training.py", line 204, in train
    xgb_model=xgb_model, callbacks=callbacks)
  File "/Users/catlinbruys/PycharmProjects/Bachelor_Thesis/venv/lib/python3.6/site-packages/xgboost/training.py", line 99, in _train_internal
    evaluation_result_list=evaluation_result_list))
  File "/Users/catlinbruys/PycharmProjects/Bachelor_Thesis/venv/lib/python3.6/site-packages/xgboost/callback.py", line 247, in callback
    best_msg = state['best_msg']
KeyError: 'best_msg'

What can I do to solve this problem? I really need a solution, as it is for a very important project. Thanks!


#3

But why do I keep getting errors? When I run the xgb.fit() and xgb.predict() methods, the model only predicts NaNs. How is this possible?


#4

Hi @kateb4,

Can you share the dataset you are working on?


#5

I found the error: I still had some NaNs left in my y_math_gpa, and for some reason XGBoost then predicted everything as NaN. After removing them, my code ran smoothly.
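For anyone who lands here with the same symptom: a NaN anywhere in the label vector is enough for XGBoost to report test-rmse:nan, even when the feature matrix is clean. A minimal sketch (with made-up stand-ins for the thread's selected_math and y_math_gpa) of dropping those rows while keeping the features aligned:

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the thread's selected_math / y_math_gpa.
X = pd.DataFrame({"f1": [1.0, 2.0, 3.0, 4.0], "f2": [5.0, 6.0, 7.0, 8.0]})
y = pd.Series([0.1, np.nan, 0.3, 0.4])

# Keep only rows whose label is present; indexing X with the same
# boolean mask keeps features and labels aligned by index.
mask = y.notnull()
X_clean, y_clean = X[mask], y[mask]

print(len(X_clean), len(y_clean))  # 3 3
```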


#6