Calibration with xgboost - scikit-learn

I'm wondering if I can do calibration in xgboost. To be more specific, does xgboost come with an existing calibration implementation like scikit-learn does, or is there a way to plug an xgboost model into scikit-learn's CalibratedClassifierCV?
As far as I know, this is the common procedure in sklearn:
# Train random forest classifier, calibrate on validation data and evaluate
# on test data
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

clf = RandomForestClassifier(n_estimators=25)
clf.fit(X_train, y_train)
clf_probs = clf.predict_proba(X_test)

# Calibrate the already-fitted classifier on a held-out validation set
sig_clf = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
sig_clf.fit(X_valid, y_valid)
sig_clf_probs = sig_clf.predict_proba(X_test)
sig_score = log_loss(y_test, sig_clf_probs)
print("Calibrated score is", sig_score)
If I put an xgboost tree model into the CalibratedClassifierCV an error will be thrown (of course):
RuntimeError: classifier has no decision_function or predict_proba method.
Is there a way to integrate the excellent calibration module of scikit-learn with xgboost?
Appreciate your insightful ideas!

Answering my own question: an xgboost GBT can be integrated with scikit-learn by writing a wrapper class like the one below.
import numpy as np
import xgboost as xgb
from sklearn.metrics import log_loss


class XGBoostClassifier():
    def __init__(self, num_boost_round=10, **params):
        self.clf = None
        self.num_boost_round = num_boost_round
        self.params = params
        self.params.update({'objective': 'multi:softprob'})

    def fit(self, X, y, num_boost_round=None):
        num_boost_round = num_boost_round or self.num_boost_round
        # Map arbitrary labels to the consecutive integer class ids xgboost expects
        self.label2num = dict((label, i) for i, label in enumerate(sorted(set(y))))
        dtrain = xgb.DMatrix(X, label=[self.label2num[label] for label in y])
        self.clf = xgb.train(params=self.params, dtrain=dtrain, num_boost_round=num_boost_round)
        return self

    def predict(self, X):
        num2label = dict((i, label) for label, i in self.label2num.items())
        Y = self.predict_proba(X)
        y = np.argmax(Y, axis=1)
        return np.array([num2label[i] for i in y])

    def predict_proba(self, X):
        dtest = xgb.DMatrix(X)
        return self.clf.predict(dtest)

    def score(self, X, y):
        # Higher is better, so invert the log loss
        Y = self.predict_proba(X)
        return 1 / log_loss(y, Y)

    def get_params(self, deep=True):
        return self.params

    def set_params(self, **params):
        if 'num_boost_round' in params:
            self.num_boost_round = params.pop('num_boost_round')
        if 'objective' in params:
            del params['objective']
        self.params.update(params)
        return self
See full example here.
Please don't hesitate to provide a smarter way of doing this!
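As a rough usage sketch (my own, not the linked full example), the wrapper is fitted and then calibrated on held-out data like any other estimator; note that depending on your scikit-learn version, CalibratedClassifierCV may also expect a classes_ attribute on a prefit estimator, which this minimal wrapper does not set:
# Hypothetical usage: X_train/y_train, X_valid/y_valid, X_test as in the question above.
clf = XGBoostClassifier(num_boost_round=100, eta=0.1, max_depth=4, num_class=2)
clf.fit(X_train, y_train)

sig_clf = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
sig_clf.fit(X_valid, y_valid)
sig_clf_probs = sig_clf.predict_proba(X_test)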

A note from the hellscape that is July 2020:
You no longer need a wrapper class. The predict_proba method is built into the xgboost scikit-learn Python API. I'm not sure exactly when it was added, but it is certainly there from v1.0.0 onward.
Note: this is of course only true for classes that would have a predict_proba method. For example, XGBRegressor doesn't; XGBClassifier does.
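As an illustration (a minimal sketch of my own, reusing the X_train/X_valid/X_test names from the question), the native sklearn wrapper can be passed straight to CalibratedClassifierCV:
from xgboost import XGBClassifier
from sklearn.calibration import CalibratedClassifierCV

# XGBClassifier implements predict_proba, so it plugs into the sklearn calibration API.
xgb_clf = XGBClassifier(n_estimators=100, max_depth=4)
xgb_clf.fit(X_train, y_train)

cal_clf = CalibratedClassifierCV(xgb_clf, method="isotonic", cv="prefit")
cal_clf.fit(X_valid, y_valid)
calibrated_probs = cal_clf.predict_proba(X_test)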

Related

How do I log a confusion matrix into Wandb?

I'm using pytorch lightning, and at the end of each epoch, I create a confusion matrix from torchmetrics.ConfusionMatrix (see code below). I would like to log this into Wandb, but the Wandb confusion matrix logger only accepts y_targets and y_predictions. Does anyone know how to extract the accumulated y_targets and y_predictions from the confusion matrix, or alternatively give Wandb my updated confusion matrix in a way that it can be processed into, e.g., a heatmap within wandb?
class ClassificationTask(pl.LightningModule):
    def __init__(self, model, lr=1e-4, augmentor=augmentor):
        super().__init__()
        self.model = model
        self.lr = lr
        self.save_hyperparameters()  # not being used at the moment, good to have there in the future
        self.augmentor = augmentor
        self.matrix = torchmetrics.ConfusionMatrix(num_classes=9)
        self.y_trues = []
        self.y_preds = []

    def training_step(self, batch, batch_idx):
        x, y = batch
        x = self.augmentor(x)  # .to('cuda')
        y_pred = self.model(x)
        loss = F.cross_entropy(y_pred, y)  # weights=class_weights_tensor
        acc = accuracy(y_pred, y)
        metrics = {"train_acc": acc, "train_loss": loss}
        self.log_dict(metrics)
        return loss

    def validation_step(self, batch, batch_idx):
        loss, acc = self._shared_eval_step(batch, batch_idx)
        metrics = {"val_acc": acc, "val_loss": loss}
        self.log_dict(metrics)
        return metrics

    def _shared_eval_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        loss = F.cross_entropy(y_hat, y)
        acc = accuracy(y_hat, y)
        self.matrix.update(y_hat, y)
        return loss, acc

    def validation_epoch_end(self, outputs):
        confusion_matrix = self.matrix.compute()
        wandb.log({"my_conf_mat_id": confusion_matrix})

    def configure_optimizers(self):
        return torch.optim.Adam(self.model.parameters(), lr=self.lr)
I'm actually working on the same issue at the moment. I found this feature request for PyTorch Lightning's metrics package, which may be of help. I think a possible solution is to use torchmetrics' confusion matrix, incorporate it into your train/val/test steps, and log it from there.
https://github.com/Lightning-AI/metrics/issues/880
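For reference, here is a minimal sketch of my own (not from the linked issue) that reuses the y_trues/y_preds lists from the question's class and hands raw labels to wandb.plot.confusion_matrix, which W&B renders as an interactive confusion-matrix plot:
# Sketch: collect predictions/targets in the eval step, then hand them to wandb,
# which builds the confusion matrix itself from raw labels. These methods would
# replace the corresponding ones in ClassificationTask above.
def _shared_eval_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.model(x)
    loss = F.cross_entropy(y_hat, y)
    acc = accuracy(y_hat, y)
    # Keep raw labels around so wandb can build the matrix itself
    self.y_preds.extend(y_hat.argmax(dim=1).cpu().tolist())
    self.y_trues.extend(y.cpu().tolist())
    return loss, acc

def validation_epoch_end(self, outputs):
    wandb.log({
        "conf_mat": wandb.plot.confusion_matrix(
            y_true=self.y_trues,
            preds=self.y_preds,
            class_names=[str(i) for i in range(9)],  # placeholder class names
        )
    })
    self.y_preds.clear()
    self.y_trues.clear()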

What should I think about when writing a custom loss function?

I'm trying to get my toy network to learn a sine wave.
I output (via tanh) a number between -1 and 1, and I want the network to minimise the following loss, where self(x) are the predictions.
loss = -torch.mean(self(x)*y)
This should be equivalent to trading a stock with a sinusoidal price, where self(x) is our desired position, and y are the returns of the next time step.
The issue I'm having is that the network doesn't learn anything. It does work if I change the loss function to be torch.mean((self(x)-y)**2) (MSE), but this isn't what I want. I'm trying to focus the network on 'making a profit', not making a prediction.
I think the issue may be related to the convexity of the loss function, but I'm not sure, and I'm not certain how to proceed. I've experimented with differing learning rates, but alas nothing works.
What should I be thinking about?
Actual code:
%load_ext tensorboard
import matplotlib.pyplot as plt; plt.rcParams["figure.figsize"] = (30,8)
import torch;from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F;import pytorch_lightning as pl
from torch import nn, tensor

def piecewise(x): return 2*(x>0)-1

class TsDs(torch.utils.data.Dataset):
    def __init__(self, s, l=5): super().__init__();self.l,self.s=l,s
    def __len__(self): return self.s.shape[0] - 1 - self.l
    def __getitem__(self, i): return self.s[i:i+self.l], torch.log(self.s[i+self.l+1]/self.s[i+self.l])
    def plt(self): plt.plot(self.s)

class TsDm(pl.LightningDataModule):
    def __init__(self, length=5000, batch_size=1000): super().__init__();self.batch_size=batch_size;self.s = torch.sin(torch.arange(length)*0.2) + 5 + 0*torch.rand(length)
    def train_dataloader(self): return DataLoader(TsDs(self.s[:3999]), batch_size=self.batch_size, shuffle=True)
    def val_dataloader(self): return DataLoader(TsDs(self.s[4000:]), batch_size=self.batch_size)

dm = TsDm()

class MyModel(pl.LightningModule):
    def __init__(self, learning_rate=0.01):
        super().__init__();self.learning_rate = learning_rate
        self.conv1 = nn.Conv1d(1,5,2)
        self.lin1 = nn.Linear(20,3);self.lin2 = nn.Linear(3,1)
        # self.network = nn.Sequential(nn.Conv1d(1,5,2),nn.ReLU(),nn.Linear(20,3),nn.ReLU(),nn.Linear(3,1), nn.Tanh())
        # self.network = nn.Sequential(nn.Linear(5,5),nn.ReLU(),nn.Linear(5,3),nn.ReLU(),nn.Linear(3,1), nn.Tanh())
    def forward(self, x):
        out = x.unsqueeze(1)
        out = self.conv1(out)
        out = out.reshape(-1,20)
        out = nn.ReLU()(out)
        out = self.lin1(out)
        out = nn.ReLU()(out)
        out = self.lin2(out)
        return nn.Tanh()(out)
    def step(self, batch, batch_idx, stage):
        x, y = batch
        loss = -torch.mean(self(x)*y)
        # loss = torch.mean((self(x)-y)**2)
        print(loss)
        self.log("loss", loss, prog_bar=True)
        return loss
    def training_step(self, batch, batch_idx): return self.step(batch, batch_idx, "train")
    def validation_step(self, batch, batch_idx): return self.step(batch, batch_idx, "val")
    def configure_optimizers(self): return torch.optim.SGD(self.parameters(), lr=self.learning_rate)

#logger = pl.loggers.TensorBoardLogger(save_dir="/content/")
mm = MyModel(0.1);trainer = pl.Trainer(max_epochs=10)
# trainer.tune(mm, dm)
trainer.fit(mm, datamodule=dm)
If I understand you correctly, I think you are trying to maximize the unnormalized correlation between the network's prediction, self(x), and the target value y.
As you mention, the problem is the convexity of the loss with respect to the model weights. One way to see the problem is to consider the model to be a simple linear predictor w'*x, where w is the vector of model weights, w' its transpose, and x the input feature vector (assume a scalar prediction for now). Then, if you look at the derivative of the loss with respect to the weight vector (i.e., the gradient), you'll find that it no longer depends on w!
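A quick numerical check of that claim (my own illustration, not from the original post): for the linear predictor, the gradient of -mean((w'x)*y) is -mean(x*y) componentwise, no matter what w is.
import torch

x = torch.randn(100, 3)   # toy features
y = torch.randn(100)      # toy next-step returns

for scale in (0.1, 10.0):                          # two very different weight vectors
    w = (scale * torch.ones(3)).requires_grad_()
    loss = -torch.mean((x @ w) * y)
    loss.backward()
    print(w.grad)                                  # identical both times: -mean(x * y) per component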
One way to fix this is to change the loss to
loss = -torch.mean(torch.square(self(x)*y))
or
loss = -torch.mean(torch.abs(self(x)*y))
You will have another big problem, however: these loss functions encourage unbounded growth of the model weights. In the linear case, one solves this with a Lagrangian relaxation of a hard constraint on, for example, the norm of the model weight vector. I'm not sure how this would be done with neural networks, as each layer would need its own Lagrangian parameter...
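For illustration only (a rough sketch of my own, not a proper Lagrangian treatment; lambda_reg is a hypothetical penalty weight), one blunt way to keep the weights bounded is to add a norm penalty inside the questioner's step method:
lambda_reg = 1e-3  # hypothetical penalty strength, needs tuning

def step(self, batch, batch_idx, stage):
    x, y = batch
    profit_loss = -torch.mean(torch.abs(self(x) * y))
    # Penalize the squared L2 norm of all parameters so the loss cannot be
    # improved simply by growing the weights without bound.
    weight_norm = sum(p.pow(2).sum() for p in self.parameters())
    loss = profit_loss + lambda_reg * weight_norm
    self.log("loss", loss, prog_bar=True)
    return loss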

Visualize the output of Vgg16 model by TSNE plot?

I need to visualize the output of a VGG16 model which classifies 14 different classes.
I loaded the trained model and replaced the classifier layer with an Identity() layer, but it doesn't categorize the output.
Here is the snippet:
The number of samples here is 1000 images.
epoch = 800
PATH = 'vgg16_epoch{}.pth'.format(epoch)
checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']

class Identity(nn.Module):
    def __init__(self):
        super(Identity, self).__init__()
    def forward(self, x):
        return x

model.classifier._modules['6'] = Identity()
model.eval()

logits_list = numpy.empty((0,4096))
targets = []
with torch.no_grad():
    for step, (t_image, target, classess, image_path) in enumerate(test_loader):
        t_image = t_image.cuda()
        target = target.cuda()
        target = target.data.cpu().numpy()
        targets.append(target)
        logits = model(t_image)
        print(logits.shape)
        logits = logits.data.cpu().numpy()
        print(logits.shape)
        logits_list = numpy.append(logits_list, logits, axis=0)
        print(logits_list.shape)

tsne = TSNE(n_components=2, verbose=1, perplexity=10, n_iter=1000)
tsne_results = tsne.fit_transform(logits_list)
target_ids = range(len(targets))
plt.scatter(tsne_results[:,0], tsne_results[:,1], c=target_ids, cmap=plt.cm.get_cmap("jet", 14))
plt.colorbar(ticks=range(14))
plt.legend()
plt.show()
Here is what this script produced (plot image not shown): I am not sure why I have all colors in each cluster!
The VGG16 outputs over 25k features to the classifier. I believe that's too much for t-SNE. It's a good idea to include a new nn.Linear layer to reduce this number, so t-SNE may work better. In addition, I'd recommend two different ways to get the features from the model:
The best way to get them, regardless of the model, is by using the register_forward_hook method. You may find a notebook here with an example.
If you don't want to use a hook, I'd suggest the following. After loading your model, you may use this class to extract the features:
class FeatNet(nn.Module):
    def __init__(self, vgg):
        super(FeatNet, self).__init__()
        self.features = nn.Sequential(*list(vgg.children())[:-1])
    def forward(self, img):
        return self.features(img)
Now, you just need to instantiate FeatNet(vgg) and call it on img to get the features.
To include the feature reducer, as I suggested before, you need to retrain your model doing something like:
class FeatNet(nn.Module):
    def __init__(self, vgg):
        super(FeatNet, self).__init__()
        self.features = nn.Sequential(*list(vgg.children())[:-1])
        self.feat_reducer = nn.Sequential(
            nn.Linear(25088, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU()
        )
        self.classifier = nn.Linear(1024, 14)
    def forward(self, img):
        x = self.features(img)
        x = torch.flatten(x, 1)  # flatten the 512x7x7 feature map to 25088 before the Linear layer
        x_r = self.feat_reducer(x)
        return self.classifier(x_r)
Then, you can run your model returning x_r, that is, the reduced features. As I said, 25k features are too much for t-SNE. Another way to reduce this number is to use PCA instead of nn.Linear. In that case, you send the 25k features to PCA and then train t-SNE on the PCA's output. I prefer using nn.Linear, but you need to test to check which one gives you a better result.
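As a rough sketch of the PCA route (my own illustration, reusing logits_list and targets from the question's loop; coloring by the concatenated per-sample labels also avoids the length mismatch between target_ids and the t-SNE points):
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import numpy
import matplotlib.pyplot as plt

# Compress the extracted features with PCA first, then run t-SNE on the result.
labels = numpy.concatenate(targets)                        # flatten the per-batch target arrays
feats_50d = PCA(n_components=50).fit_transform(logits_list)
tsne_results = TSNE(n_components=2, perplexity=10).fit_transform(feats_50d)

plt.scatter(tsne_results[:, 0], tsne_results[:, 1], c=labels, cmap=plt.cm.get_cmap("jet", 14))
plt.colorbar(ticks=range(14))
plt.show()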

Python 3: How to evaluate the Adam gradient in TensorFlow 2.0? I would like to replace my implementation

I have the following code that is working well. However, I strongly believe that the TensorFlow 2.0 implementation of the Adam gradient is more efficient than my naive implementation.
How can I replace the evaluation of the Adam gradient with the TensorFlow 2.0 implementation?
import tensorflow as tf
import numpy as np

def linearModelGenerator(numberSamples):
    x = tf.random.normal(shape=(numberSamples,))
    y = 3*tf.ones(shape=(numberSamples,)) + tf.constant(5.0) * x + tf.random.normal(shape=(numberSamples,), stddev=0.01)
    return x, y

class Adam:
    def __init__(self, shapes, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-07):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.shapes = shapes
        self.m = np.shape(shapes)[0]
        self.listM = []
        self.listV = []
        self.t = 0
        for i in range(self.m):
            if np.isscalar(shapes[i]):
                self.listM.append(0)  # append(tf.zeros(shapes[i]))
                self.listV.append(0)  # append(tf.zeros(shapes[i]))
            else:
                self.listM.append(tf.zeros(shapes[i]))
                self.listV.append(tf.zeros(shapes[i]))

    def evalGradient(self, *args):
        adamGrad = []
        self.t = self.t + 1
        for i in range(self.m):
            grad = args[i]
            self.listM[i] = self.beta1*self.listM[i] + (1-self.beta1)*grad
            self.listV[i] = self.beta2*self.listV[i] + (1-self.beta2)*(grad*grad)
            hatM = self.listM[i] / (1-(self.beta1)**self.t)
            hatV = self.listV[i] / (1-(self.beta2)**self.t)
            adamGrad.append(hatM / (tf.math.sqrt(hatV) + (tf.ones(np.shape(hatV))*self.epsilon)))
        return adamGrad

class LinearModel:
    def __init__(self):
        self.weight = tf.Variable(-1.0)
        self.bias = tf.Variable(-1.0)
    def __call__(self, x):
        return self.weight * x + self.bias

def loss(y, pred):
    return tf.reduce_mean(tf.square(y - pred))

def trainAdam(linear_model, adam, x, y):
    with tf.GradientTape() as t:
        current_loss = loss(y, linear_model(x))
    gradWeight, gradBias = t.gradient(current_loss, [linear_model.weight, linear_model.bias])
    gradAdamList = adam.evalGradient(gradWeight, gradBias)
    gradAdamWeight = gradAdamList[0]
    gradAdamBias = gradAdamList[1]
    linear_model.weight.assign_sub(adam.lr * gradAdamWeight)
    linear_model.bias.assign_sub(adam.lr * gradAdamBias)

if __name__ == "__main__":
    numberSamples = 100
    x, y = linearModelGenerator(numberSamples)
    linear_model = LinearModel()
    epochs = 1000
    shapes = []
    shapes.append(1)
    shapes.append(1)
    adam = Adam(shapes, lr=0.1)
    for epoch_count in range(epochs):
        real_loss = loss(y, linear_model(x))
        trainAdam(linear_model, adam, x, y)
        print('w', linear_model.weight.numpy())
        print('bias', linear_model.bias.numpy())
        print('real_loss', real_loss.numpy())
I would like to keep the general structure of the code, but replace the Adam gradient implementation.
The built-in optimizers in TensorFlow 2 can not only be used with tf.keras.Model.fit(), but also with tf.GradientTape(). With the latter, you can just call its apply_gradients() method directly. The optimizer object will keep track of the accumulators and running moments internally. Roughly, your code can be modified as follows:
adam = tf.optimizers.Adam(learning_rate)

def trainAdam(linear_model, adam, x, y):
    with tf.GradientTape() as t:
        current_loss = loss(y, linear_model(x))
    gradWeight, gradBias = t.gradient(current_loss, [linear_model.weight, linear_model.bias])
    adam.apply_gradients(zip([gradWeight, gradBias], [linear_model.weight, linear_model.bias]))
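With that, the hand-written Adam class is no longer needed at all; a minimal sketch of the adjusted driver code (keeping the structure of the question, with the same lr=0.1) might look like this:
# The custom Adam class and the `shapes` bookkeeping can be removed entirely;
# tf.optimizers.Adam keeps its own first/second moment estimates per variable.
if __name__ == "__main__":
    x, y = linearModelGenerator(100)
    linear_model = LinearModel()
    adam = tf.optimizers.Adam(learning_rate=0.1)
    for epoch_count in range(1000):
        trainAdam(linear_model, adam, x, y)
    print('w', linear_model.weight.numpy())
    print('bias', linear_model.bias.numpy())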

pyspark and pytorch: unable to retrieve gradients on the driver

I'm new to pytorch and I'm trying to explore the feasibility of its usage with spark (for now I'm working in spark standalone).
For now I'm struggling with a very specific topic.
Let's start with a very simple model:
# linmodel.py
import torch
import torch.nn as nn
import numpy as np

def standardize(x):
    return (x - np.mean(x)) / np.std(x)

def add_noise(y):
    rnd = np.random.randn(y.shape[0])
    return y + rnd

def cost(target, predicted):
    cost = torch.sum((torch.t(target) - predicted) ** 2)
    return cost

class LinModel(nn.Module):
    def __init__(self, in_size, out_size):
        super(LinModel, self).__init__()  # always call parent's init
        self.linear = nn.Linear(in_size, out_size, bias=False)  # layer parameters
    def forward(self, x):
        return self.linear(x)
This defines a basic linear model, along with some utility functions.
The goal is to approximate a target matrix and to keep track of how the gradients behave.
I'm trying to achieve the following:
1. create my target matrix
2. split the inputs on the workers
3. instantiate models and optimizer on the workers
4. compute the approximation on subsets of input
5. retrieve the gradients for further analysis
And everything works fine until point 5.
Here's the code:
# test.py
import torch
import torch.nn as nn
import numpy as np
import torch.optim
from torch.autograd import Variable
from pyspark import SparkContext
import linmodel

def prepare_input(nsamples=400):
    Xold = np.linspace(0, 1000, nsamples).reshape([nsamples, 1])
    X = linmodel.standardize(Xold)
    W = np.random.randint(1, 10, size=(5, 1))
    Y = W.dot(X.T)  # target
    for i in range(Y.shape[1]):
        Y[:, i] = linmodel.add_noise(Y[:, i])
    x = Variable(torch.from_numpy(X), requires_grad=False).type(torch.FloatTensor)
    y = Variable(torch.from_numpy(Y), requires_grad=False).type(torch.FloatTensor)
    print("created torch variables {} {}".format(x.size(), y.size()))
    return x, y, W

def initialize(tup):
    x, y = tup[0]  # data
    m, o = tup[1]  # model and optimizer
    model, optimizer = torch_step(x, y, m, o)
    # here we have the gradients
    print('gradient: {}'.format([param.grad.data for param in model.parameters()]))
    return (x, y), (model, optimizer)

def create_model():
    model = linmodel.LinModel(1, 5)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    return model, optimizer

def torch_step(x, y, model, optimizer):
    prediction = model(x)
    loss = linmodel.cost(y, prediction)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return model, optimizer

def main(sc, num_partitions=4):
    x, y, W = prepare_input()
    parts_x = list(torch.split(x, int(x.size()[0] / num_partitions)))
    parts_y = list(torch.split(y, int(x.size()[0] / num_partitions), 1))
    rdd_models = sc.parallelize([create_model() for _ in range(num_partitions)]).repartition(num_partitions)
    rdd_x = sc.parallelize(parts_x).repartition(num_partitions)
    rdd_y = sc.parallelize(parts_y).repartition(num_partitions)
    parts = rdd_x.zip(rdd_y)  # [((100x1), (5x100)), ...]
    full = parts.zip(rdd_models).map(initialize).cache()
    models_out = full.map(lambda x: x[1][0]).collect()
    test_model = models_out[0]
    print(type(test_model))
    print('gradient: {}'.format([param.grad.data for param in test_model.parameters()]))

if __name__ == '__main__':
    sc = SparkContext(appName='test')
    main(sc)
As you can see from the comments, when the function initialize is mapped over the full RDD, the gradients are computed; you can find them if you inspect the logs of the executors.
When I collect the result and try to access the very same attribute on the driver, I receive an AttributeError: 'NoneType' object has no attribute 'data',
meaning that all the model.grad attributes are set to None.
I'm sure I'm missing something big here, but I cannot see it.
Any hint is appreciated.
Thanks a lot.
There are two major mistakes in your approach (in my view):
Since you want distributed training, instantiating the model separately in all of the executors is wrong. You should instantiate the model on the head node (the node where the Spark driver is located) and then distribute that model to all the executors. Each executor then independently does a forward pass, calculates the gradients on its portion of the data, and passes the gradients back to the head node for the weight update (the weight update has to be serialized). The updated network is then scattered to the executors again for the next iteration.
A much bigger concern is that I am not sure the gradient buffers are copied from the executors to the head node when you perform .collect(); that may be why model.grad is set to None. To begin debugging, I suggest you use only one executor (and one partition) and then perform a .collect() to see if the gradient buffers are being copied. Or, if you are good at Java or Scala, you can look at the collect() method's implementation.
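One way to check that second point (a minimal sketch of my own, not tested against the original cluster setup) is to return the gradients explicitly as plain numpy arrays from the mapped function, instead of relying on them surviving the pickling of the model objects:
# Sketch: what reaches the driver is then ordinary numpy data rather than live autograd buffers.
def initialize(tup):
    (x, y), (m, o) = tup
    model, optimizer = torch_step(x, y, m, o)
    grads = [param.grad.data.numpy() for param in model.parameters()]
    return (x, y), (model, optimizer), grads

full = parts.zip(rdd_models).map(initialize).cache()
grads_out = full.map(lambda t: t[2]).collect()   # list of per-partition gradient arrays
print(grads_out[0])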
Hope this helps.....
