Python XGBoost prediction discrepancies with DMatrix - python-3.x

I found there are two problems with xgboost predictions. I trained the model with XGBClassifier and tried to load the model using Booster for prediction. I found that:
Predictions are slightly different between xgb.Booster and xgb.XGBClassifier, see below.
Predictions are different between a list and a numpy array when used to build a DMatrix, see below.
Some of the differences are quite big. I am not sure why this is happening, or which prediction should be the source of truth.
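A minimal sketch of the comparison being described (the model file name and data are hypothetical); with identical float32 inputs, the two prediction paths should agree:

import numpy as np
import xgboost as xgb

X = np.random.rand(20, 4).astype(np.float32)
y = np.array([0, 1] * 10)

clf = xgb.XGBClassifier(n_estimators=10)
clf.fit(X, y)
clf.save_model("model.json")  # hypothetical file name

booster = xgb.Booster()
booster.load_model("model.json")

p_clf = clf.predict_proba(X)[:, 1]          # sklearn wrapper path
p_boost = booster.predict(xgb.DMatrix(X))   # raw Booster path
print(np.max(np.abs(p_clf - p_boost)))      # ~0 when the inputs match exactly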

For the second question: your data types can change when you convert a list to a numpy array (depending on the numpy version you're using). For example, on numpy 1.19.5, try converting the list ["1", 1] to a numpy array and look at the result.
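A minimal sketch of the coercion described above; every element is silently promoted to one common dtype, which changes what DMatrix sees:

import numpy as np

arr = np.array(["1", 1])
print(arr, arr.dtype)   # ['1' '1'] -- both elements became strings (e.g. dtype '<U21')

# Forcing a numeric dtype up front keeps list and array inputs consistent:
features = [[1, 2.5], [3, 4.0]]
arr_float = np.asarray(features, dtype=np.float32)
print(arr_float.dtype)  # float32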

Related

Passing a python list to keras model.fit

Right now I'm using Keras, and training my model works perfectly fine, but I have to pass my data as a numpy ndarray. So I have to convert my list of data to a numpy ndarray first and then pass it to Keras for training. When I try to pass my Python list/array, even though it has the same shape as the numpy array, I get back errors. Is there any way to not use numpy for this, or am I stuck with it?
Can you explain your problem further? What is the error message you are getting, and are you getting this error during training or predicting?
Also, if you could post some code samples, that would help too.
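A minimal sketch of the usual workaround (the model and data shapes here are hypothetical): wrap the lists with np.asarray at the call site rather than maintaining a separate ndarray copy:

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(2,))])
model.compile(optimizer="adam", loss="mse")

x_list = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
y_list = [0.0, 1.0, 0.0]

# Convert only at the boundary; Keras expects array-like numeric input.
model.fit(np.asarray(x_list, dtype=np.float32),
          np.asarray(y_list, dtype=np.float32),
          epochs=2, verbose=0)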

Separate a tensorflow dataset into different outputs in tensorflow 2

I have a dataset with 3 tensor outputs of data, label and path:
import tensorflow as tf  # tensorflow version 2.1
data = tf.constant([[0,1],[1,2],[2,3],[3,4],[4,5],[5,6],[6,7],[7,8],[8,9],[9,0]], name='data')
labels = tf.constant([0,1,0,1,0,1,0,1,0,1], name='label')
path = tf.constant(['p0','p1','p2','p3','p4','p5','p6','p7','p8','p9'], name='path')
my_dataset = tf.data.Dataset.from_tensor_slices((data, labels, path))
I want to separate my_dataset back into 3 datasets of data, labels and paths (or 3 tensors) without iterating over it and without converting it to numpy.
In tensorflow 1.x this is done simply with
d, l, p = my_dataset.make_one_shot_iterator().get_next()
and then converting the tensors to datasets. How do I do it in tensorflow 2?
Thanks!
The solution I found does not look very "pythonic", but it works.
I used the map() method:
data = my_dataset.map(lambda x, y, z: x)
labels = my_dataset.map(lambda x, y, z: y)
paths = my_dataset.map(lambda x, y, z: z)
After this separation, the order of the labels stays the same.
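As a quick sanity check of that claim, a minimal sketch using the datasets created above:

# Elements come back in the original order after the map() split.
for d, l, p in zip(data.take(3), labels.take(3), paths.take(3)):
    print(d.numpy(), l.numpy(), p.numpy())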

Numpy arrays used in training in TF1/Keras have much lower accuracy in TF2

I had a neural net in Keras that performed well. With the deprecations that came with Tensorflow 2, I had to rewrite the model, and now it gives me worse accuracy metrics.
My suspicion is that tf2 wants you to use their data structure to train models; they give an example of how to go from numpy to tf.data.Dataset here.
So I did:
train_dataset = tf.data.Dataset.from_tensor_slices((X_train_deleted_nans, y_train_no_nans))
train_dataset = train_dataset.shuffle(SHUFFLE_CONST).batch(BATCH_SIZE)
Once the training starts, I get this warning:
2019-10-04 23:47:56.691434: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
Appending .repeat() to the creation of my tf.data.Dataset solved the error, as suggested by duysqubix in his solution posted here:
https://github.com/tensorflow/tensorflow/issues/32817#issuecomment-539200561
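A minimal runnable sketch of the fix (the arrays, constants, and model below are hypothetical stand-ins for the question's code); since .repeat() makes the dataset infinite, fit() needs an explicit epoch length:

import numpy as np
import tensorflow as tf

# Hypothetical stand-ins for the question's arrays and constants.
X_train_deleted_nans = np.random.rand(100, 4).astype(np.float32)
y_train_no_nans = np.random.randint(0, 2, size=100)
SHUFFLE_CONST, BATCH_SIZE = 100, 10

train_dataset = tf.data.Dataset.from_tensor_slices(
    (X_train_deleted_nans, y_train_no_nans))
train_dataset = train_dataset.shuffle(SHUFFLE_CONST).batch(BATCH_SIZE).repeat()

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# With .repeat() the dataset never ends, so tell fit() how many batches
# make up one epoch.
model.fit(train_dataset, epochs=2,
          steps_per_epoch=len(X_train_deleted_nans) // BATCH_SIZE)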

Machine Learning Linear Regression - Sklearn

I'm new to the machine learning domain, and I have some doubts about linear regression.
1: While practicing prediction with the sklearn linear regression model, I get the error below.
Code:
sklearn.linear_model.LinearRegression.predict(25)
Error:
"ValueError: Expected 2D array, got scalar array instead: array=25. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
Do I need to pass a 2-D array? I checked the sklearn documentation page and haven't found anything about a version update.
Running my code on Kaggle:
https://www.kaggle.com/aman9d/bikesharingdemand-upx/
2: Is the index of the dataset going to affect the model's score (weights)?
First of all, you should post the code as you actually use it:
# import, instantiate, fit
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X, y)
# use the predict method
linreg.predict(25)
because what you posted in the question is not executable: predict is not a static method of the LinearRegression class.
When you fit a model, the first step is to establish what kind of data the input will be; in your case it will look like X. That means that if you pass the model something with a different shape than X, it will raise an error.
In your example, X seems to be a pd.DataFrame instance with only 1 column. This is interchangeable with a 2-dimensional array of shape (number of examples, number of features), so if you try:
linreg.predict([[25]])
it should work.
For example, if you were doing a regression with more than 1 feature (i.e., column), let's say temp and humidity, your input would look like this:
linreg.predict([[25, 56]])
I hope this helps; always keep in mind the shape of your data.
Documentation: LinearRegression fit
X : array-like or sparse matrix, shape (n_samples, n_features)
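Putting it together, a minimal runnable sketch of the fix (the training data here is hypothetical):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[10], [20], [30]])  # shape (n_samples, n_features)
y = np.array([1.0, 2.0, 3.0])

linreg = LinearRegression()
linreg.fit(X, y)

# 2-D input: one sample with one feature.
print(linreg.predict([[25]]))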

'scipy.sparse.coo.coo_matrix' as input for models

I am using sklearn.preprocessing.OneHotEncoder to handle categorical values in my model. I noticed the result of the transform method is a sparse matrix (scipy.sparse.coo.coo_matrix), and it seems that it can be used directly for training the model (in this case, Ridge regression).
When dealing with toy problems (<100 examples), it seems to make no difference whether I feed the model the sparse matrix or the corresponding np array (matrix.toarray()), but with larger datasets (>20K examples) it looks to me like the conversion to an array is absolutely required.
I know there are some issues when using sparse matrices (for example, np.dot does not support them), and I am wondering whether this is another such case and whether I should therefore always use np arrays as input to scikit-learn models.
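For reference, a minimal sketch showing the encoder's sparse output being fed straight to Ridge (the categorical data here is hypothetical); estimators that support sparse input, Ridge included, accept it without a .toarray() conversion:

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import Ridge

X_cat = np.array([["red"], ["blue"], ["green"], ["blue"]])
y = np.array([1.0, 2.0, 3.0, 2.5])

enc = OneHotEncoder()               # sparse output by default
X_sparse = enc.fit_transform(X_cat)

model = Ridge()
model.fit(X_sparse, y)              # no .toarray() needed
print(model.predict(X_sparse))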
