FastText 0.9.2 - why is recall 'nan'?

I trained a supervised model in FastText using the Python interface and I'm getting weird results for precision and recall.
First, I trained a model:
model = fasttext.train_supervised("train.txt", wordNgrams=3, epoch=100, pretrainedVectors=pretrained_model)
Then I get results for the test data:
def print_results(N, p, r):
    print("N\t" + str(N))
    print("P#{}\t{:.3f}".format(1, p))
    print("R#{}\t{:.3f}".format(1, r))
But the results are always odd, because they show precision and recall #1 as identical, even for different datasets, e.g. one output is:
N 46425
P#1 0.917
R#1 0.917
Then when I look for the precision and recall for each label, I always get recall as 'nan':
And the output is:
{'__label__1': {'precision': 0.9202150724134941, 'recall': nan, 'f1score': 1.8404301448269882}, '__label__5': {'precision': 0.9134956983264135, 'recall': nan, 'f1score': 1.826991396652827}}
Does anyone know why this might be happening?
P.S.: To try a reproducible example of this behavior, please refer to and run it with FastText 0.9.2

It looks like FastText 0.9.2 has a bug in the computation of recall, and that should be fixed with this commit.
Installing a "bleeding edge" version of FastText e.g. with
pip install git+ --quiet
and rerunning your code should get rid of the nan values in the recall computation.
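If upgrading is not possible right away, per-label recall can be cross-checked by hand from the gold and predicted labels. This is only a rough sketch using the standard definitions of precision, recall and F1. The helper name `per_label_metrics` and the toy label lists are made up for illustration and are not part of the FastText API.

```python
from collections import Counter


def per_label_metrics(true_labels, predicted_labels):
    """Compute precision, recall and F1 per label (hypothetical helper)."""
    true_pos = Counter()    # correct predictions per label
    pred_count = Counter()  # how often each label was predicted
    gold_count = Counter()  # how often each label occurs in the gold data

    for gold, pred in zip(true_labels, predicted_labels):
        gold_count[gold] += 1
        pred_count[pred] += 1
        if gold == pred:
            true_pos[gold] += 1

    metrics = {}
    for label in set(gold_count) | set(pred_count):
        tp = true_pos[label]
        # A label that was never predicted (or never occurs) has an undefined ratio.
        precision = tp / pred_count[label] if pred_count[label] else float("nan")
        recall = tp / gold_count[label] if gold_count[label] else float("nan")
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1score": f1}
    return metrics


# Toy data, invented for illustration only.
gold = ["__label__1", "__label__1", "__label__5", "__label__5"]
pred = ["__label__1", "__label__5", "__label__5", "__label__5"]
metrics = per_label_metrics(gold, pred)
for label in sorted(metrics):
    print(label, metrics[label])
```

With a trained model, the predicted labels could be collected by calling `model.predict(text)` on each test line and taking the top label.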


Keras prediction output's decimal point without scientific notation?

I have a question about Keras that came up while studying CNNs.
How do I print the results of the model's prediction in fixed-point decimal notation?
The results of my current model are array([[6.527474e-05, 5.269228e-05, 9.998820e-01]], dtype=float32)
I want my model output displayed like this:
9.998820e-01 -> 0.9998820
Both are actually the same value:
9.998820e-01 and 0.9998820
You can enable this option to display the numbers without scientific notation when you print NumPy arrays.
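The option the answer refers to is not named in the text. One NumPy setting that has this effect is `suppress=True` in `np.set_printoptions`, so the sketch below assumes that is what was meant. The array is copied from the question, and `precision=7` is an arbitrary choice for illustration.

```python
import numpy as np

# Prediction values copied from the question.
pred = np.array([[6.527474e-05, 5.269228e-05, 9.998820e-01]], dtype=np.float32)

# suppress=True asks NumPy to print small floats in fixed-point notation
# instead of scientific notation. It only changes how arrays are displayed.
np.set_printoptions(suppress=True, precision=7)
fixed_repr = np.array2string(pred)
print(fixed_repr)

# A single value can also be formatted explicitly as a string.
single = "{:.7f}".format(float(pred[0, 2]))
print(single)
```

The stored values are unchanged either way. Only their printed representation differs.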

How to make polynomial features using sparse matrix in Scikit-learn

I am using Scikit-learn to convert my training data to polynomial features and then fit a linear model to it.
model = Pipeline([('poly', PolynomialFeatures(degree=3)),
                  ('linear', LinearRegression(fit_intercept=False))])
model = model.fit(X, y)  # X is the sparse training matrix (variable name assumed)
But it throws an error
TypeError: A sparse matrix was passed, but dense data is required
I know my data is in sparse matrix format. When I try to convert it to a dense matrix, I get a memory error because my data is huge (50k~), so converting it to a dense matrix is not an option.
I also found GitHub issues where this feature is requested, but it is still not implemented.
So can someone please tell me how to use the sparse data format with PolynomialFeatures in Scikit-learn without converting it to dense format?
This is a new feature in the upcoming 0.20 version of sklearn. See Release History - V0.20 - Enhancements. If you really want to test it out, you can install the development version by following the instructions in Sklearn - Advanced Installation - Install Bleeding Edge.
Since version 0.21.0, the PolynomialFeatures class accepts CSR matrices for degrees 2 and 3. The method laid out here is used, and the computation is much, much faster than if the input is a CSC matrix or dense (assuming the data's sparse to any reasonable degree - even slightly).
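As a minimal sketch of the behavior described for 0.21.0 and later, the pipeline from the question can be fitted directly on a CSR matrix. The random data, shapes and density below are invented for illustration and are far smaller than the dataset in the question.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy sparse data (assumed shapes, for illustration only).
rng = np.random.RandomState(0)
X = sparse.random(200, 20, density=0.05, format="csr", random_state=rng)
y = rng.rand(200)

model = Pipeline([
    ("poly", PolynomialFeatures(degree=3)),
    ("linear", LinearRegression(fit_intercept=False)),
])
model.fit(X, y)  # no conversion to a dense array is needed

# The expanded feature matrix stays sparse when the input is CSR.
X_poly = model.named_steps["poly"].transform(X)
print(sparse.issparse(X_poly), X_poly.shape)
```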
While we are waiting for the latest update of Sklearn - you can find an implementation of sparse interaction here:

KNearestNeighbour is not running in multithread in Scikit-Learn

I'm successfully using scikit-learn on my machine. I'm experimenting with an Anaconda installation (which relies on MKL for multithreading) and an OpenBLAS installation.
I'd really like to use a parallel version of the k-nearest neighbour classifier, and according to , sklearn should have merged these changes 1 year ago, in version 0.17.
Multithreading works successfully for PCA and for all numpy operations. I can tell multithreading is working from the high number of threads I see when I do dot products and PCA. When I launch KNN, it takes around 10 minutes.
I’m classifying a high-dimensional dataset derived from MNIST (image digits). I’m doing PCA to get vectors of dimension 35-50, and then a nonlinear expansion, which gives vectors of dimension 600-100. That’s why I need parallelism so badly.
My version of sklearn is:
print('The scikit-learn version is {}.'.format(sklearn.__version__))
The scikit-learn version is 0.18.1.
I'm using python3 and this is a sample of the code:
def classify_knn(train, test, train_labels):
    clf = KNeighborsClassifier(algorithm='ball_tree')
    clf = clf.fit(train, train_labels)
    return clf.predict(test)
I've tried with and without 'ball_tree'. No one should be using Python 2.7 in 2017, and neither am I.
Just passing as a parameter
n_jobs = -1
solved the issue.
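For reference, here is a sketch of the function from the question with that parameter added. The random arrays only stand in for the real PCA-expanded MNIST features, and their shapes are made up.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def classify_knn(train, test, train_labels):
    # n_jobs=-1 lets the neighbor search use all available CPU cores.
    clf = KNeighborsClassifier(algorithm="ball_tree", n_jobs=-1)
    clf.fit(train, train_labels)
    return clf.predict(test)


# Placeholder data (assumed shapes, for illustration only).
rng = np.random.RandomState(0)
train = rng.rand(500, 50)
train_labels = rng.randint(0, 10, size=500)
test = rng.rand(20, 50)

predictions = classify_knn(train, test, train_labels)
print(predictions.shape)
```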

Theano: my code always produces different results on GPU (Titan X)

I am building a simple model on a Titan X, but the result is different every time.
The model is simple:
1. Averaging embedding of words
2. Using a hidden layer input
3. Binary Classifier using softmax
In each training run, the cost value is subtly different.
The prediction results are also subtly different (e.g. first attempt: 56.43, second attempt: 56.34, ...).
The strange thing is that if I do not update the embedding layer, the results are identical across all runs.
However, when testing with a GTX 970, the cost value during training is likewise subtly different,
but the prediction results are identical across all runs.
I know that cuDNN can cause this, but I'm not using a CNN architecture, just averaging, ReLU and softmax.
Please help if you know what causes this.
The environment is as follows.
2. theano version: 0.9.0
3. OS: Ubuntu 14.04

How to calculate the mean training score using GridSearchCV in Scikit-Learn

I would like to plot the mean validation score against the mean training score for a linear support vector machine, in a similar fashion to what is done here:
However, when running similar code, the parameter compute_training_score does not seem to exist.
This parameter is also not documented [1]. I checked the current master branch on GitHub and it does not seem to have been committed yet.
I am using Scikit-learn 0.14.1.
I am a bit confused here. Is there a branch or tag that I need in order to get the same functionality, or is there an alternative way to calculate this?
The code in question:
param_grid = {'C': 10. ** np.arange(-3, 4)}
grid_search = GridSearchCV(svm, param_grid=param_grid, cv=3, verbose=3, compute_training_score=True).fit(X_train, y_train)  # X_train: variable name assumed
plt.plot([c.mean_validation_score for c in grid_search.cv_scores_], label="validation error")
plt.plot([c.mean_training_score for c in grid_search.cv_scores_], label="training error")
plt.xticks(np.arange(len(param_grid['C'])), param_grid['C'])
plt.xlabel("C")
plt.ylabel("Accuracy")
plt.legend(loc='best')
If I run the same code without the offending parameter I get:
AttributeError: '_CVScoreTuple' object has no attribute 'mean_training_score'
mean_validation_score and mean_training_score will be available in the next scikit-learn release, 0.15. You need to install from GitHub to get it.
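For readers on a much newer scikit-learn than the versions discussed in this thread, the same information is exposed through `return_train_score=True` and the `cv_results_` dictionary. This is a minimal sketch under that assumption. The synthetic dataset and the choice of `LinearSVC` are placeholders, not the asker's actual setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in for the training data (illustration only).
X_train, y_train = make_classification(n_samples=200, n_features=20, random_state=0)

param_grid = {"C": 10.0 ** np.arange(-3, 4)}
grid_search = GridSearchCV(
    LinearSVC(max_iter=10000),
    param_grid=param_grid,
    cv=3,
    return_train_score=True,  # also record scores on the training folds
)
grid_search.fit(X_train, y_train)

# One mean score per value of C, averaged over the CV folds.
mean_train = grid_search.cv_results_["mean_train_score"]
mean_valid = grid_search.cv_results_["mean_test_score"]
for C, train_score, valid_score in zip(param_grid["C"], mean_train, mean_valid):
    print("C={:g}\ttrain={:.3f}\tvalidation={:.3f}".format(C, train_score, valid_score))
```

The two arrays can then be passed to `plt.plot` in the same way as the list comprehensions in the question.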
