Exception during xgboost prediction: can not initialize DMatrix from DMatrix - python-3.x

I trained an XGBoost model in Python using the Scikit-Learn API and serialized it with the pickle library. I uploaded the model to ML Engine, but when I try to run online predictions, I get the following exception:
Prediction failed: Exception during xgboost prediction: can not initialize DMatrix from DMatrix
An example of the json I'm using for prediction is the following:
{
"instances":[
[
24.90625,
21.6435643564356,
20.3762376237624,
24.3679245283019,
30.2075471698113,
28.0947368421053,
16.7797359774725,
14.9262079299572,
17.9888028979966,
15.3333284503293,
19.6535308744024,
17.1501961307627,
0.0,
0.0,
0.0,
0.0,
0.0,
509.0,
497.0,
439.0,
427.0,
407.0,
1.0,
1.0,
1.0,
1.0,
1.0,
2.0,
23.0,
10.0,
58.0,
11.0,
20.0,
23.3617021276596,
23.3617021276596,
23.3617021276596,
23.3617021276596,
23.3617021276596,
23.9423076923077,
26.3082269243683,
23.6212606363851,
22.6752334301282,
27.4343583104833,
34.0090408101173,
11.1991944104063,
7.33420726455092,
8.15160392948917,
11.4119236389594,
17.9429092915607,
18.0573102225845,
32.8902876598084,
-0.00286123032904149,
-0.00286123032904149,
-0.00286123032904149,
-0.00286123032904149,
-0.00286123032904149,
-0.0028328611898017,
0.0534138904223018,
0.0534138904223018,
0.0534138904223018,
0.0534138904223018,
0.0534138904223018,
0.0531491870801522
]
]
}
I use the following code to train my model:
def _train_model(X, y):
    clf = xgb.XGBClassifier(max_depth=6,
                            learning_rate=0.01,
                            n_estimators=100,
                            n_jobs=-1)
    clf.fit(X, y)
    return clf
Where X and y are both numpy.ndarray:
Type of X: <class 'numpy.ndarray'> Type of y: <class 'numpy.ndarray'>
Also I'm using xgboost 0.72.1, Python 3.5 and ML runtime 1.9.
Does anyone know what the source of the problem might be?
Thanks!

Seems like the issue is due to the pickling. I was able to reproduce it and am working on a fix, but in the meantime could you try exporting your classifier as below instead?
clf._Booster.save_model('./model.bst')
That should unblock you for now. If it doesn't, feel free to reach out to cloudml-feedback@google.com.

I also faced a similar problem (a feature mismatch) when I tried to score test data using a trained XGBoost model that had been dumped in .pkl format.
However, after saving the model in .bst format, I was able to score the same data without any issues. There appears to be a difference between how the .pkl and .bst formats round-trip an XGBoost model; a sketch of the two export paths follows.
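To make the two export paths concrete, here is a minimal sketch using the clf returned by _train_model above; the file names are illustrative:
import pickle

# Pickling the whole sklearn wrapper -- the path that triggered the errors above
with open('model.pkl', 'wb') as f:
    pickle.dump(clf, f)

# Saving only the underlying Booster in XGBoost's native format
clf._Booster.save_model('model.bst')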

Going a little further, and answering kuza's question above on loading the saved model:
Save the model:
clf._Booster.save_model('./model.bst')
Load the saved model:
model = xgboost.Booster({'nthread': 4}) # initialize before loading model
model.load_model('./model.bst') # load model
This cleared up two issues I had with pickling the model. The first was a cryptic exception: ValueError: feature_names mismatch.
Also check whether you are calling predict_proba on the loaded model and getting a strange exception. The fix for that was to use the plain predict function instead of predict_proba.
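For illustration, a minimal sketch of scoring with the reloaded booster; X_new is a hypothetical feature array, and since a raw Booster has no predict_proba, predict is used (for a binary objective it already returns probabilities):
import numpy as np
import xgboost as xgb

model = xgb.Booster({'nthread': 4})  # initialize before loading
model.load_model('./model.bst')

# The native Booster API expects a DMatrix, not a raw array
dmat = xgb.DMatrix(np.asarray(X_new))
probs = model.predict(dmat)  # probabilities for binary:logistic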

Related

Get feature importance with PySpark and XGboost

I have trained a model using XGBoost and PySpark:
params = {
    'eta': 0.1,
    'gamma': 0.1,
    'missing': 0.0,
    'treeMethod': 'gpu_hist',
    'maxDepth': 10,
    'maxLeaves': 256,
    'growPolicy': 'depthwise',
    'objective': 'binary:logistic',
    'minChildWeight': 30.0,
    'lambda_': 1.0,
    'scalePosWeight': 2.0,
    'subsample': 1.0,
    'nthread': 1,
    'numRound': 100,
    'numWorkers': 1,
}
classifier = XGBoostClassifier(**params).setLabelCol(label).setFeaturesCols(features)
model = classifier.fit(train_data)
When I try to get the feature importance using
model.nativeBooster.getFeatureScore()
it returns the following error:
Py4JError: An error occurred while calling o2167.getFeatureScore. Trace:
py4j.Py4JException: Method getFeatureScore([]) does not exist
Is there a correct way of getting feature importance when using XGBoost with PySpark?
I am a newbie in this field. I happened to encounter what you are experiencing. You may want to try using: model.nativeBooster.getScore("", "gain") or model.nativeBooster.getFeatureScore('').
My 'model' is of type "sparkxgb.xgboost.XGBoostClassificationModel".
Regards
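For reference, a minimal sketch of turning the returned score map into a ranked list, assuming the model trained above and that the returned map behaves like a Python dict:
# getScore maps feature name -> importance (here by total gain)
scores = model.nativeBooster.getScore("", "gain")
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for name, gain in ranked:
    print(name, gain)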

XGBoost: OS Error : [WinError -529697949] Windows Error 0xe06d7363 running XGBClassifier with large dataset, CPU Mode

Getting this error while trying to run XGBClassifier with GridSearchCV for hyperparameter optimization. I have seen this issue opened on GitHub, but it was closed and marked resolved with no solution provided. Has anyone actually found a solution to this error?
My dataset:
X = np array with 350000 rows and 1715 columns (after one hot encoding)
y = 350000 rows and 1 column (target)
My Code:
X = train.drop(['Breakage'], axis=1) # features (read from dataframe)
y = train['Breakage'] # target (read from dataframe)
X = X.as_matrix() # convert to np array
y = y.as_matrix() # convert to np array
y = np.reshape(y, (-1, 1)) # reshape array
X = X.astype('uint8') # change dtype to avoid overcommit error in Windows
y = y.astype('uint8') # change dtype to avoid overcommit error in Windows
# define estimators and learning rate
model = XGBClassifier()
n_estimators = [100, 200, 300, 400, 500]
learning_rate = [0.0001, 0.001, 0.01, 0.1]
# GridSearchCV
param_grid = dict(learning_rate=learning_rate, n_estimators=n_estimators)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, y)
The output Error:
OSError: [WinError -529697949] Windows Error 0xe06d7363
Can anyone please tell me what I am doing wrong?
I had the same problem as yours.
This answer will help you:
Windows Error using XGBoost with python
Changing from the sklearn API's xgb.fit() to the native xgb.train() solved the problem; a sketch follows.
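To make that concrete, a minimal sketch of the native training API; the parameter values are illustrative rather than an exact translation of the grid search above, and X and y are the arrays prepared in the question:
import xgboost as xgb

# The native API consumes DMatrix objects instead of raw numpy arrays
dtrain = xgb.DMatrix(X, label=y)

params = {
    'objective': 'binary:logistic',
    'eta': 0.1,  # learning rate
    'max_depth': 6,
}
booster = xgb.train(params, dtrain, num_boost_round=300)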
I had this issue too, and for me it was something to do with the xgboost installation.
I was using code that had already worked before, and although memory usage was high, it was not a memory error.
What worked for me was (command-line sketch after the list):
Close the IDE
Uninstall xgboost
Upgrade pip (not sure if necessary)
Reinstall xgboost
Rebuild and install the Python project
Run again
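A minimal sketch of those steps for a pip-based setup (adjust to your environment):
pip uninstall xgboost
python -m pip install --upgrade pip
pip install xgboost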
My inspiration came from this question:
Windows Error 0xe06d7363 when using Cross Validation XGboost

Cannot clone object <tensorflow.python.keras.wrappers.scikit_learn.KerasClassifier object

This is with regard to TF 2.0.
Please find below my code that performs GridSearch along with Cross Validation using sklearn.model_selection.GridSearchCV for the mnist dataset that works perfectly fine.
# Build function to create the model, required by KerasClassifier
def create_model(optimizer_val='RMSprop', hidden_layer_size=16, activation_fn='relu',
                 dropout_rate=0.1, regularization_fn=tf.keras.regularizers.l1(0.001),
                 kernel_initializer_fn=tf.keras.initializers.glorot_uniform,
                 bias_initializer_fn=tf.keras.initializers.zeros):
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(units=hidden_layer_size, activation=activation_fn,
                              kernel_regularizer=regularization_fn,
                              kernel_initializer=kernel_initializer_fn,
                              bias_initializer=bias_initializer_fn),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(units=hidden_layer_size, activation='softmax',
                              kernel_regularizer=regularization_fn,
                              kernel_initializer=kernel_initializer_fn,
                              bias_initializer=bias_initializer_fn)
    ])
    model.compile(optimizer=optimizer_val, loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model
# Create the model with the wrapper
model = tf.keras.wrappers.scikit_learn.KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, verbose=2)
# Initialize the parameter grid
nn_param_grid = {
    'epochs': [10],
    'batch_size': [128],
    'optimizer_val': ['Adam', 'SGD'],
    'hidden_layer_size': [128],
    'activation_fn': ['relu'],
    'dropout_rate': [0.2],
    'regularization_fn': ['l1', 'l2', 'L1L2'],
    'kernel_initializer_fn': ['glorot_normal', 'glorot_uniform'],
    'bias_initializer_fn': [tf.keras.initializers.zeros]
}
# Perform GridSearchCV
grid = GridSearchCV(estimator=model, param_grid=nn_param_grid, verbose=2, cv=3,scoring=precision_custom,return_train_score=False,n_jobs=-1)
grid_result = grid.fit(x_train, y_train)
My idea is to pass different optimizers with different learning rates, say Adam with learning rates 0.1, 0.01 and 0.001. I also want to try out SGD with different learning rates and momentum values.
In that case, when I pass 'optimizer_val': [tf.keras.optimizers.Adam(0.1)], I get the error given below:
Cannot clone object <tensorflow.python.keras.wrappers.scikit_learn.KerasClassifier object at 0x7fe08b210e10>, as the constructor either does not set or modifies parameter optimizer_val
Please advise as to how can I rectify this error.
This is an sklearn bug. You should downgrade your scikit-learn version:
conda install scikit-learn==0.21.2
After that it works.
You can fix the issue by changing the lists into tuples, as in the grid below.
If an entry holds only a single value, a list works fine.
# Initialize the parameter grid
nn_param_grid = {
    'epochs': [10],
    'batch_size': [128],
    'optimizer_val': ('Adam', 'SGD'),
    'hidden_layer_size': [128],
    'activation_fn': ['relu'],
    'dropout_rate': [0.2],
    'regularization_fn': ('l1', 'l2', 'L1L2'),
    'kernel_initializer_fn': ('glorot_normal', 'glorot_uniform'),
    'bias_initializer_fn': [tf.keras.initializers.zeros]
}
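As for grid-searching optimizers together with learning rates (the original goal), a common pattern is to keep the grid values as plain strings and floats and construct the optimizer inside build_fn, so sklearn can clone the estimator. A minimal sketch; the learning_rate_val parameter and the reduced model are illustrative, not from the original post:
import tensorflow as tf

def create_model(optimizer_val='Adam', learning_rate_val=0.001):
    # Build the optimizer from plain, clonable values
    if optimizer_val == 'Adam':
        optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate_val)
    else:
        optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate_val)
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Grid values are plain strings/floats, so cloning the wrapper works
nn_param_grid = {
    'optimizer_val': ('Adam', 'SGD'),
    'learning_rate_val': (0.1, 0.01, 0.001),
}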
Found this comment online and it helped!
For those who are getting the following error from the statement above:
Cannot clone object <keras.wrappers.scikit_learn.KerasClassifier object at 0x7f93ddc5d1d0>, as the constructor either does not set or modifies parameter layers
Change layers from a list of lists to a list of tuples:
layers => [(20,), (45, 30, 15), (40, 20)]
Don't forget the trailing comma in (20,), otherwise another error/warning will appear:
FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: TypeError: 'int' object is not iterable
This is because a single value in parentheses without a comma is treated as an int, not a tuple.
Only installing TensorFlow 2.8 helped with this issue. Note that at the time it was available only via pip (Anaconda offered TensorFlow 2.7, while PyPI offered TensorFlow 2.8).
To check your version of TensorFlow, type conda list tensorflow:
(base) C:\Users\User> conda list tensorflow-gpu
# Name Version Build Channel
tensorflow-gpu 2.4.1 pyhd8ed1ab_3 conda-forge
To uninstall, type conda uninstall tensorflow, and to install version 2.8 type:
pip install tensorflow-gpu==2.8

ValueError: Circular reference detected in LightGBM

I get the following error (ValueError: Circular reference detected) when I train a LightGBM model:
# Train the model
import lightgbm as lgb

lgb_train = lgb.Dataset(x_train, y_train)
lgb_val = lgb.Dataset(x_test, y_test)

parameters = {
    'application': 'binary',
    'objective': 'binary',
    'metric': 'auc',
    'is_unbalance': 'true',
    'boosting': 'gbdt',
    'num_leaves': 31,
    'feature_fraction': 0.5,
    'bagging_fraction': 0.5,
    'bagging_freq': 20,
    'learning_rate': 0.05,
    'verbose': 0
}

model = lgb.train(parameters,
                  lgb_train,
                  valid_sets=lgb_val,
                  num_boost_round=5000,
                  early_stopping_rounds=100)
y_pred = model.predict(x_test)
If you used the cut or qcut functions for binning and did not encode afterwards (one-hot encoding, label encoding, ...), this may be the cause of the error. Try applying an encoding.
I hope it works.
I had what might be the same problem.
Post the whole traceback to make sure.
For me it was a problem serializing to JSON, which LightGBM does under the hood to save the booster for later use.
Check your dataset for any date/datetime columns, or anything that remotely looks like a date, and either drop them or convert them to something JSON can handle.
Mine had all been converted to categorical dtype by some poorly written Pandas code of mine; I usually do the initial GBM run fairly fast-n-dirty to see which variables show up as important. LightGBM happily built the data binaries for training (it would have thrown an error up front had they still been datetime or timedelta dtypes), ran the training just fine, reported an AUC, and then failed after the last training step, when it dumped the categoricals to JSON. It was maddening, with a cryptic traceback.
Hope this helps.
If you have any timedelta variable in the dataset, convert it into an int using the .dt.days attribute, as in the sketch below. I faced the same issue; it is the one reported on LightGBM's GitHub.
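A minimal sketch of that conversion; the DataFrame df and the column duration are hypothetical:
import pandas as pd

# Convert a timedelta column to whole days (int) so the booster's
# JSON dump no longer chokes on it
df['duration'] = df['duration'].dt.days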

Get survival function with Spark ML

I am training an accelerated failure time (AFT) model with PySpark (from pyspark.ml.regression import AFTSurvivalRegression).
Now I want to apply the model to new data and get the probability that the event happens before time t (the survival function). Which method should I use? The documentation is not clear to me: https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.regression.AFTSurvivalRegression
For example, if I do the following:
from pyspark.ml.regression import AFTSurvivalRegression
from pyspark.ml.linalg import Vectors
training = spark.createDataFrame([
    (1.218, 1.0, Vectors.dense(1.560, -0.605)),
    (2.949, 0.0, Vectors.dense(0.346, 2.158)),
    (3.627, 0.0, Vectors.dense(1.380, 0.231)),
    (0.273, 1.0, Vectors.dense(0.520, 1.151)),
    (4.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor", "features"])
quantileProbabilities = [0.25, 0.75]
aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
                            quantilesCol="quantiles")
model = aft.fit(training)
model.transform(training).show(truncate=False)
I get the following output:
Does it mean that, for the first line, P(event happens between 0.832 and 9.48) = 50%?
Thanks
