Is there any cluster evaluation method implemented in PyTorch?

I have found a k-means implementation in PyTorch that uses GPUs and is 30 times faster than on CPUs. Is there any cluster evaluation method (such as the Silhouette score, Dunn's index, ...) implemented in PyTorch, preferably one that uses GPUs?

I implemented the silhouette score in PyTorch based on a NumPy implementation by Alexandre Abraham (https://gist.github.com/AlexandreAbraham/5544803):
https://github.com/maxschelski/pytorch-cluster-metrics
After installing the package you can calculate the silhouette score as follows:
from torchclustermetrics import silhouette
score = silhouette.score(X, labels)
Here X is the multi-dimensional data (a NumPy array or PyTorch tensor, with the first dimension for samples) and labels is a 1D array with a label for each sample.
I tested the code with PyTorch 1.10.1 (cuda11.3_cudnn8_0).
In my hands it gave an approximately 30-fold speed-up on a GPU compared to the scikit-learn implementation.
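For anyone curious what this involves under the hood, here is a minimal sketch of the silhouette computation in plain PyTorch tensor ops (a simplified illustration, not the package's actual code: it materializes the full O(n^2) distance matrix and assumes every cluster has at least two members):
import torch

def silhouette_score(X: torch.Tensor, labels: torch.Tensor) -> float:
    # Pairwise Euclidean distances between all samples: (n, n).
    dists = torch.cdist(X.float(), X.float())
    n = X.shape[0]
    a = torch.zeros(n, device=X.device)                  # mean intra-cluster distance
    b = torch.full((n,), float("inf"), device=X.device)  # min mean inter-cluster distance
    for c in labels.unique():
        mask = labels == c
        size = int(mask.sum())
        sum_to_c = dists[:, mask].sum(dim=1)             # summed distance to cluster c
        a[mask] = sum_to_c[mask] / max(size - 1, 1)      # exclude the zero self-distance
        mean_to_c = sum_to_c / size
        b[~mask] = torch.minimum(b[~mask], mean_to_c[~mask])
    s = (b - a) / torch.maximum(a, b)
    return s.mean().item()

device = "cuda" if torch.cuda.is_available() else "cpu"
X = torch.randn(500, 8, device=device)
labels = torch.randint(3, (500,), device=device)
print(silhouette_score(X, labels))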

Related

Is there a way to plot loss function vs iterations for sklearn implementation of SVM?

I'm using an sklearn implementation of a support vector machine for a binary classification problem. I want to store the loss function value at each iteration and plot it to visualize model performance. Is this possible?
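sklearn's SVC does not expose a per-iteration loss history, but one hedged workaround is to train a linear SVM with SGDClassifier(loss="hinge") one epoch at a time via partial_fit and record the hinge loss after each epoch (a different solver than SVC, so the curve is indicative rather than identical):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import hinge_loss

X, y = make_classification(n_samples=1000, random_state=0)  # toy binary data
clf = SGDClassifier(loss="hinge", random_state=0)

losses = []
for epoch in range(50):
    clf.partial_fit(X, y, classes=np.unique(y))             # one pass over the data
    losses.append(hinge_loss(y, clf.decision_function(X)))  # loss after this epoch

# losses can now be plotted, e.g. plt.plot(losses)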

DCGANs discriminator accuracy metric using PyTorch

I am implementing DCGANs using PyTorch.
It works well in that I can get generated images of reasonable quality, but now I want to evaluate the health of the GAN models using metrics, mainly the ones introduced by this guide: https://machinelearningmastery.com/practical-guide-to-gan-failure-modes/
Their implementation uses Keras, whose SDK lets you define the metrics you want when you compile the model (see https://keras.io/api/models/model/), in this case the accuracy of the discriminator, i.e. the percentage of images it correctly identifies as real or generated.
With the PyTorch SDK, I can't seem to find a similar feature that would help me easily acquire this metric from my model.
Does PyTorch provide functionality to define and extract common metrics from a model?
Pure PyTorch does not provide metrics out of the box, but it is very easy to define them yourself.
Also, there is no such thing as "extracting metrics from a model". Metrics measure something (in this case the discriminator's accuracy); they are not inherent to the model.
Binary accuracy
In your case, you are looking for a binary accuracy metric. The code below works with either logits (unnormalized probabilities output by the discriminator, probably a last nn.Linear layer without activation) or probabilities (a last nn.Linear followed by a sigmoid activation):
import typing

import torch

class BinaryAccuracy:
    def __init__(
        self,
        logits: bool = True,
        reduction: typing.Callable[[torch.Tensor], torch.Tensor] = torch.mean,
    ):
        # Logits cross the decision boundary at 0, probabilities at 0.5.
        self.threshold = 0.0 if logits else 0.5
        self.reduction = reduction

    def __call__(self, y_pred, y_true):
        # Flatten both tensors so a (N, 1) prediction is compared with (N,)
        # targets elementwise instead of broadcasting to (N, N).
        correct = (y_pred.flatten() > self.threshold) == y_true.flatten().bool()
        return self.reduction(correct.float())
Usage:
metric = BinaryAccuracy()
target = torch.randint(2, size=(64,))
outputs = torch.randn(size=(64, 1))
print(metric(outputs, target))
PyTorch Lightning or other third-party libraries
You can also use PyTorch Lightning or another framework built on top of PyTorch that defines common metrics like accuracy for you.
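For example, assuming the torchmetrics package (which Lightning builds on) is installed, a ready-made version of the metric above is available:
import torch
from torchmetrics.classification import BinaryAccuracy

metric = BinaryAccuracy()              # thresholds at 0.5; raw logits get sigmoided
target = torch.randint(2, size=(64,))
outputs = torch.rand(64)               # probabilities in [0, 1]
print(metric(outputs, target))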

Stratified cross validation with Pytorch

My goal is to do binary classification using a neural network.
The problem is that the dataset is unbalanced: I have 90% of class 1 and 10% of class 0.
To deal with it I want to use stratified cross-validation.
The problem is that I am working with PyTorch; I can't find any example, the documentation doesn't cover it, and I'm a student, quite new to neural networks.
Can anybody help?
Thank you!
The easiest way I've found is to do your stratified splits before passing your data to the PyTorch Dataset and DataLoader. That lets you avoid having to port all your code to skorch, which can break compatibility with some cluster computing frameworks.
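A minimal sketch of that approach (names and shapes here are illustrative): sklearn's StratifiedKFold produces the per-fold indices, and PyTorch only ever sees Subsets of the dataset:
import torch
from sklearn.model_selection import StratifiedKFold
from torch.utils.data import DataLoader, Subset, TensorDataset

X = torch.randn(1000, 20)                                   # toy features
y = torch.cat([torch.ones(900), torch.zeros(100)]).long()   # 90/10 class imbalance
dataset = TensorDataset(X, y)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X.numpy(), y.numpy())):
    train_loader = DataLoader(Subset(dataset, train_idx.tolist()),
                              batch_size=64, shuffle=True)
    val_loader = DataLoader(Subset(dataset, val_idx.tolist()), batch_size=64)
    # ... train on train_loader and evaluate on val_loader for this fold ...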
Have a look at skorch. It's a scikit-learn compatible neural network library that wraps PyTorch. It provides CVSplit for cross-validation, or you can use sklearn directly.
From the docs:
from sklearn.model_selection import cross_val_predict

net = NeuralNetClassifier(
    module=MyModule,
    train_split=None,
)
y_pred = cross_val_predict(net, X, y, cv=5)
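Note that train_split=None disables skorch's own internal validation split, so cross_val_predict fully controls the folds; for a classifier, an integer cv uses StratifiedKFold by default, which gives exactly the stratification asked for above.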

How to make polynomial features using sparse matrix in Scikit-learn

I am using scikit-learn to convert my training data to polynomial features and then fit it to a linear model.
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

model = Pipeline([('poly', PolynomialFeatures(degree=3)),
                  ('linear', LinearRegression(fit_intercept=False))])
model.fit(X, y)
But it throws an error:
TypeError: A sparse matrix was passed, but dense data is required
I know my data is in sparse matrix format, and when I try to convert it to a dense matrix I get a memory error because my data is huge (~50k rows). Because of this amount of data I can't convert it to a dense matrix.
I also found a GitHub issue where this feature was requested, but it is still not implemented.
So can someone please tell me how to use the sparse data format with PolynomialFeatures in scikit-learn without converting it to dense format?
This is a new feature in the upcoming 0.20 version of sklearn. See Release History - V0.20 - Enhancements. If you really want to test it out, you can install the development version by following the instructions in Sklearn - Advanced Installation - Install Bleeding Edge.
Since version 0.21.0, the PolynomialFeatures class accepts CSR matrices for degrees 2 and 3. The method laid out here is used, and the computation is much, much faster than with a CSC matrix or dense input (assuming the data is sparse to any reasonable degree, even slightly).
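A quick way to verify this on a recent scikit-learn (>= 0.21), with illustrative sizes:
import scipy.sparse as sp
from sklearn.preprocessing import PolynomialFeatures

X = sp.random(50000, 20, density=0.01, format="csr", random_state=0)  # sparse input
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)     # stays sparse; no dense blow-up
print(type(X_poly), X_poly.shape)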
While we are waiting for the latest update of sklearn, you can find an implementation of sparse interactions here:
https://github.com/drivendataorg/box-plots-sklearn/blob/master/src/features/SparseInteractions.py
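In the same spirit (a simplified sketch, not the linked code), pairwise interaction features can be built directly on a sparse matrix by multiplying column pairs elementwise and stacking the results:
import scipy.sparse as sp

def sparse_interactions(X):
    # Column slicing is cheap on CSC matrices.
    X = sp.csc_matrix(X)
    cols = [X]
    for i in range(X.shape[1]):
        for j in range(i + 1, X.shape[1]):
            cols.append(X[:, i].multiply(X[:, j]))  # elementwise product stays sparse
    return sp.hstack(cols).tocsr()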

KNearestNeighbour is not running in multithread in Scikit-Learn

I'm successfully using scikit-learn on my machine. I'm experimenting with an Anaconda implementation (that relies on MKL for multithreading) and an OpenBLAS implementation.
I'd really like to use a parallel version of the k-nearest neighbour classifier, and according to https://github.com/scikit-learn/scikit-learn/pull/4009, sklearn merged these changes a year ago, in version 0.17.
Multithreading works successfully for PCA and all NumPy operations; I can tell it is working from the high number of threads I see during dot products and PCA. But when I launch KNN, it takes around 10 minutes.
I'm classifying a high-dimensional dataset of MNIST (image digits), so I'm doing PCA to get vectors of dimension 35-50, and then a nonlinear expansion, which gives vectors of dimension 600-1000. That's why I need parallelism so badly.
My version of sklearn is:
print('The scikit-learn version is {}.'.format(sklearn.__version__))
The scikit-learn version is 0.18.1.
I'm using Python 3 and this is a sample of the code:
from sklearn.neighbors import KNeighborsClassifier

def classify_knn(train, test, train_labels):
    clf = KNeighborsClassifier(algorithm='ball_tree')
    clf = clf.fit(train, train_labels)
    return clf.predict(test)
I've tried with and without algorithm='ball_tree'. No one should be using Python 2.7 in 2017, and neither do I.
Just passing n_jobs=-1 as a parameter solved the issue.
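Applied to the snippet above, that is:
from sklearn.neighbors import KNeighborsClassifier

def classify_knn(train, test, train_labels):
    # n_jobs=-1 parallelizes the neighbor queries across all CPU cores.
    clf = KNeighborsClassifier(algorithm='ball_tree', n_jobs=-1)
    clf.fit(train, train_labels)
    return clf.predict(test)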
