I'm successfully using scikit-learn on my machine. I'm experimenting with an Anaconda installation (which relies on MKL for multithreading) and an OpenBLAS installation.
I'd really like to use a parallel version of the k-nearest neighbours classifier, and according to https://github.com/scikit-learn/scikit-learn/pull/4009 , sklearn merged those changes a year ago, in version 0.17.
Multithreading works successfully for PCA and all NumPy operations; I can tell it is working from the high number of threads I see during dot products and PCA. When I launch KNN, however, it takes around 10 minutes.
I'm classifying a high-dimensional dataset, MNIST (image digits). So I do PCA to get vectors of dimension 35-50, and then a nonlinear expansion, which gives me vectors of dimension 600-1000. That's why I need parallelism so badly.
My version of sklearn is:
print('The scikit-learn version is {}.'.format(sklearn.__version__))
The scikit-learn version is 0.18.1.
I'm using Python 3, and this is a sample of the code:
from sklearn.neighbors import KNeighborsClassifier

def classify_knn(train, test, train_labels):
    # Fit a k-NN classifier and return predictions for the test set
    clf = KNeighborsClassifier(algorithm='ball_tree')
    clf = clf.fit(train, train_labels)
    return clf.predict(test)
I've tried with and without 'ball_tree'. No one should be using Python 2.7 in 2017, and neither am I.
Just passing n_jobs=-1 as a parameter solved the issue.
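For reference, a minimal sketch of the fix applied to the function above (n_jobs=-1 tells KNeighborsClassifier to use all available CPU cores for the neighbour search):

from sklearn.neighbors import KNeighborsClassifier

def classify_knn(train, test, train_labels):
    # n_jobs=-1 parallelises the neighbour search across all CPU cores
    clf = KNeighborsClassifier(algorithm='ball_tree', n_jobs=-1)
    clf.fit(train, train_labels)
    return clf.predict(test)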
Related
I have found a k-means implementation in PyTorch that uses GPUs and is about 30 times faster than on CPUs. Is there any clustering metric (such as the silhouette score, Dunn's index, ...), preferably implemented in PyTorch, that uses GPUs?
I implemented the silhouette score in PyTorch based on a numpy implementation from Alexandre Abraham (https://gist.github.com/AlexandreAbraham/5544803):
https://github.com/maxschelski/pytorch-cluster-metrics
After installing the package you can calculate the silhouette score as follows:
from torchclustermetrics import silhouette
score = silhouette.score(X, labels)
With X being the multi-dimensional data (NumPy array or PyTorch tensor; first dimension for samples) and labels being a 1D array of labels for each sample.
I tested the code on PyTorch 1.10.1 (cuda11.3_cudnn8_0). In my hands it gave an approximately 30-fold speed-up on a GPU compared to the scikit-learn implementation.
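A rough way to reproduce that comparison is sketched below (it assumes the torchclustermetrics API shown above and times it against sklearn.metrics.silhouette_score on the same random data):

import time
import numpy as np
from sklearn.metrics import silhouette_score
from torchclustermetrics import silhouette

# Synthetic data: 10,000 samples, 64 features, 10 cluster labels
X = np.random.rand(10000, 64).astype(np.float32)
labels = np.random.randint(0, 10, size=10000)

t0 = time.time()
cpu_score = silhouette_score(X, labels)   # scikit-learn, CPU
t1 = time.time()
gpu_score = silhouette.score(X, labels)   # PyTorch, GPU if available
t2 = time.time()

print('sklearn: {:.4f} in {:.2f}s'.format(cpu_score, t1 - t0))
print('torch:   {:.4f} in {:.2f}s'.format(gpu_score, t2 - t1))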
I am using scikit-learn to convert my training data to polynomial features and then fit it to a linear model.
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

model = Pipeline([('poly', PolynomialFeatures(degree=3)),
                  ('linear', LinearRegression(fit_intercept=False))])
model.fit(X, y)
But it throws an error
TypeError: A sparse matrix was passed, but dense data is required
I know my data is in sparse matrix format, and when I try to convert it to a dense matrix I get a memory error, because the data is huge (~50k rows); at that size I simply can't densify it.
I also found GitHub issues where this feature is requested, but it is still not implemented.
So can someone tell me how to use sparse data with PolynomialFeatures in scikit-learn without converting it to dense format?
This is a new feature in the upcoming 0.20 version of sklearn. See Release History - V0.20 - Enhancements. If you really want to test it out, you can install the development version by following the instructions in Sklearn - Advanced Installation - Install Bleeding Edge.
Since version 0.21.0, the PolynomialFeatures class accepts CSR matrices for degrees 2 and 3. The method laid out here is used, and the computation is much, much faster than with a CSC or dense input (assuming the data is sparse to any reasonable degree, even slightly).
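A minimal sketch of that sparse path, assuming scikit-learn >= 0.21 (the random data is illustrative; the input stays in CSR format and the expanded output comes back sparse as well):

import scipy.sparse as sp
from sklearn.preprocessing import PolynomialFeatures

# Random sparse input: 10,000 samples, 50 features, 1% density, CSR format
X = sp.random(10000, 50, density=0.01, format='csr', random_state=0)

poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)   # no dense conversion at any point
print(X_poly.shape, type(X_poly))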
While we are waiting for the latest update of sklearn, you can find an implementation of sparse interactions here:
https://github.com/drivendataorg/box-plots-sklearn/blob/master/src/features/SparseInteractions.py
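If I read that file correctly, it is a scikit-learn-style transformer, so usage would presumably look like the sketch below (the SparseInteractions class and its degree parameter come from that repository, not from scikit-learn itself):

import scipy.sparse as sp
from SparseInteractions import SparseInteractions  # module from the linked repo

X = sp.random(1000, 50, density=0.05, format='csr', random_state=0)
# Adds interaction terms up to the given degree while keeping the data sparse
X_interactions = SparseInteractions(degree=2).fit_transform(X)
print(X_interactions.shape)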
Spark ML LinearRegression seems to regress against a single label.
LabeledPoint(label: Double, features: Array[Double])
https://spark.apache.org/docs/0.8.1/api/mllib/org/apache/spark/mllib/regression/LabeledPoint.html
However, for my problem I need to predict a vector, e.g.
LabeledPoint(label: Array[Double], features: Array[Double])
Is there a way for me to do this? (This is supported in scikit-learn, and I am trying to do it in Spark.)
PS 1: If this is not possible in MLlib directly, is there a tutorial on how to implement this from scratch using Spark?
PS 2: My output label is a 60-element vector, so I could run LinearRegression 60 times and then run 60 predictions, but that seems like a hack.
There is no native implementation as far as I know, but if you look at the scikit-learn documentation for multioutput regression, it says that the "strategy consists of fitting one regressor per target. Since each target is represented by exactly one regressor it is possible to gain knowledge about the target by inspecting its corresponding regressor".
This means a potential implementation is to fit one regression model per target and parallelise that step, distributing the per-target fits so they run at the same time, as sketched below.
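A minimal sketch of that one-model-per-target strategy in PySpark (the toy DataFrame, the column names 'features' and y0/y1, and the two-target setup are illustrative assumptions; the same loop would cover all 60 targets):

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.getOrCreate()

# Toy data: a 'features' vector column plus one numeric column per target
rows = [
    (Vectors.dense([1.0, 0.0]), 1.0, 2.0),
    (Vectors.dense([0.0, 1.0]), 2.0, 1.0),
    (Vectors.dense([1.0, 1.0]), 3.0, 3.0),
    (Vectors.dense([2.0, 1.0]), 4.0, 5.0),
]
df = spark.createDataFrame(rows, ['features', 'y0', 'y1'])

# Fit one independent LinearRegression per target column
models = [
    LinearRegression(featuresCol='features', labelCol='y{}'.format(i)).fit(df)
    for i in range(2)
]

# Each fitted model contributes one component of the predicted vector
for i, model in enumerate(models):
    print('target y{}:'.format(i))
    model.transform(df).select('prediction').show()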
I am building a simple model on a Titan X, but the result is different every time.
It is a simple model:
1. Average the embeddings of the words
2. Feed the average into a hidden layer
3. Binary classification using softmax
In each training run the cost value is subtly different, and the prediction results are subtly different too (e.g. first attempt: 56.43, second attempt: 56.34, ...).
The strange thing is that if I do not update the embedding layer, all runs give identical results.
However, when testing on a GTX 970, training behaves the same (the cost value is subtly different), but the prediction results are identical across runs.
I know cuDNN can cause this, but I'm not using a CNN architecture, just averaging, ReLU and softmax...
Please help me if you know about this problem.
The environment is as follows.
1. GPU: TITAN X
2. theano version: 0.9.0
3. OS: Ubuntu 14.04
Thanks....
Does Python's scikit-learn have any regression models that work well with sparse data?
I was poking around and found this "sparse linear regression" module, but it seems outdated. (It's so old that scikit-learn was called 'scikits-learn' at the time, I think.)
Most scikit-learn regression models, both linear (such as Ridge, Lasso, ElasticNet) and non-linear (e.g. RandomForestRegressor), support both dense and sparse input data in recent versions of scikit-learn (0.16.0 is the latest stable version at the time of writing).
Edit: if you are unsure, check the docstring of the fit method of the class of interest.
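For example, here is a minimal sketch fitting Ridge directly on a CSR matrix (the random data is just for illustration; the input is never densified):

import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import Ridge

# Sparse design matrix: 10,000 samples, 1,000 features, 1% density
X = sp.random(10000, 1000, density=0.01, format='csr', random_state=0)
y = np.random.rand(10000)

model = Ridge(alpha=1.0)
model.fit(X, y)              # accepts scipy.sparse input directly
print(model.predict(X[:5]))  # predictions for the first five samples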