I trained a supervised model in FastText using the Python interface and I'm getting weird results for precision and recall.
First, I trained a model:
model = fasttext.train_supervised("train.txt", wordNgrams=3, epoch=100, pretrainedVectors=pretrained_model)
Then I get results for the test data:
def print_results(N, p, r):
    print("N\t" + str(N))
    print("P#{}\t{:.3f}".format(1, p))
    print("R#{}\t{:.3f}".format(1, r))

print_results(*model.test('test.txt'))
But the results are always odd: precision and recall at 1 (P@1 and R@1) are reported as identical, even for different datasets. For example, one output is:
N 46425
P#1 0.917
R#1 0.917
Then, when I look at the precision and recall for each label, recall always comes out as 'nan':
print(model.test_label('test.txt'))
And the output is:
{'__label__1': {'precision': 0.9202150724134941, 'recall': nan, 'f1score': 1.8404301448269882}, '__label__5': {'precision': 0.9134956983264135, 'recall': nan, 'f1score': 1.826991396652827}}
Does anyone know why this might be happening?
P.S.: For a reproducible example of this behavior, please refer to https://github.com/facebookresearch/fastText/issues/1072 and run it with FastText 0.9.2.
It looks like FastText 0.9.2 has a bug in the computation of recall, and that should be fixed with this commit.
Installing a "bleeding edge" version of FastText, e.g. with
pip install git+https://github.com/facebookresearch/fastText.git@b64e359d5485dda4b4b5074494155d18e25c8d13 --quiet
and rerunning your code should get rid of the nan values in the recall computation.
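If you want to sanity-check the per-label numbers after upgrading, one option is to recompute them by hand from the model's predictions. This is only a rough sketch, assuming test.txt uses the usual one-label-per-line "__label__X some text" format; it is not part of the fastText API:
def manual_label_metrics(model, path):
    # Collect true and predicted (top-1) labels for every test line
    true_labels, pred_labels = [], []
    with open(path) as f:
        for line in f:
            label, _, text = line.strip().partition(' ')
            true_labels.append(label)
            pred_labels.append(model.predict(text)[0][0])
    # Precision / recall per label from simple counts
    for label in sorted(set(true_labels)):
        tp = sum(t == p == label for t, p in zip(true_labels, pred_labels))
        fp = sum(p == label and t != label for t, p in zip(true_labels, pred_labels))
        fn = sum(t == label and p != label for t, p in zip(true_labels, pred_labels))
        precision = tp / (tp + fp) if tp + fp else float('nan')
        recall = tp / (tp + fn) if tp + fn else float('nan')
        print(label, round(precision, 3), round(recall, 3))

manual_label_metrics(model, 'test.txt')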
Related
I have a question about a CNN I'm studying with Keras.
How do I print the model's prediction results in fixed-point decimal notation instead of scientific notation?
The results of my current model are array([[6.527474e-05, 5.269228e-05, 9.998820e-01]], dtype=float32)
I want the output displayed like this:
9.998820e-01 -> 0.9998820
9.998820e-01 and 0.9998820 are actually the same number; only the display format differs.
You can enable this option to display the numbers without scientific notation when you print NumPy arrays:
np.set_printoptions(suppress=True)
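For example, with the array from the question (the exact printed digits depend on NumPy's print precision settings):
import numpy as np

probs = np.array([[6.527474e-05, 5.269228e-05, 9.998820e-01]], dtype=np.float32)
np.set_printoptions(suppress=True)   # suppress scientific notation when printing
print(probs)                         # prints something like [[0.00006527 0.00005269 0.99988198]]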
I am using scikit-learn to convert my training data to polynomial features and then fit a linear model.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
model = Pipeline([('poly', PolynomialFeatures(degree=3)),
                  ('linear', LinearRegression(fit_intercept=False))])
model.fit(X, y)
But it throws an error:
TypeError: A sparse matrix was passed, but dense data is required
My data is in sparse matrix format, and when I try to convert it to a dense matrix I get a memory error because the data is huge (~50k rows). Because of this, I cannot convert it to a dense matrix.
I also found GitHub issues where this feature is requested, but it has not been implemented yet.
So can someone please tell me how to use sparse data with PolynomialFeatures in scikit-learn without converting it to a dense format?
This is a new feature in the upcoming 0.20 version of sklearn. See Release History - V0.20 - Enhancements. If you really want to test it out, you can install the development version by following the instructions in Sklearn - Advanced Installation - Install Bleeding Edge.
Since version 0.21.0, the PolynomialFeatures class accepts CSR matrices for degrees 2 and 3. The method laid out here is used, and the computation is much, much faster than with a CSC matrix or dense input (assuming the data is sparse to any reasonable degree, even slightly).
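As a quick illustration of the sparse path (with scikit-learn >= 0.21; the random matrix here is just a stand-in for your data):
from scipy import sparse
from sklearn.preprocessing import PolynomialFeatures

# Stand-in for a large, sparse training matrix
X = sparse.random(50000, 30, density=0.01, format='csr', random_state=0)

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)   # stays sparse, no dense conversion
print(X_poly.shape, type(X_poly))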
While we are waiting for the latest update of sklearn, you can find an implementation of sparse interactions here:
https://github.com/drivendataorg/box-plots-sklearn/blob/master/src/features/SparseInteractions.py
I'm successfully using scikit-learn on my machine. I'm experimenting with an Anaconda installation (which relies on MKL for multithreading) and an OpenBLAS installation.
I'd really like to use a parallel version of the k-nearest neighbour classifier, and according to https://github.com/scikit-learn/scikit-learn/pull/4009, sklearn merged these changes a year ago, in version 0.17.
Multithreading works successfully for PCA and all numpy operations; I can tell it is working from the high number of threads I see when I do dot products and PCA. But when I launch KNN, it takes around 10 minutes.
I'm classifying a high-dimensional MNIST dataset (image digits), so I'm doing PCA to get vectors of dimension 35-50 and then a nonlinear expansion, which gives vectors of dimension 600-100. That's why I need parallelism so badly.
My version of sklearn is:
print('The scikit-learn version is {}.'.format(sklearn.__version__))
The scikit-learn version is 0.18.1.
I'm using python3 and this is a sample of the code:
def classify_knn(train, test, train_labels):
    clf = KNeighborsClassifier(algorithm='ball_tree')
    clf = clf.fit(train, train_labels)
    return clf.predict(test)
I've tried with and without 'ball_tree'. No one should be using Python 2.7 in 2017, and neither do I.
Just passing the parameter
n_jobs = -1
solved the issue.
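Applied to the function from the question, that would look roughly like this:
from sklearn.neighbors import KNeighborsClassifier

def classify_knn(train, test, train_labels):
    # n_jobs=-1 uses all available cores for the neighbour search
    clf = KNeighborsClassifier(algorithm='ball_tree', n_jobs=-1)
    clf.fit(train, train_labels)
    return clf.predict(test)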
I am training a simple model on a Titan X, but the result is different every time.
It is a simple model (a minimal sketch follows the list below):
1. Average the word embeddings
2. Feed the average into a hidden layer
3. Binary classification with a softmax output
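For reference, here is a minimal NumPy sketch of the forward pass described above; all names, shapes, and the exact layer sizes are illustrative assumptions, not the actual Theano code:
import numpy as np

rng = np.random.RandomState(0)
vocab_size, emb_dim, hidden_dim = 1000, 50, 64
E  = rng.randn(vocab_size, emb_dim)    # embedding matrix
W1 = rng.randn(emb_dim, hidden_dim)    # hidden layer weights
W2 = rng.randn(hidden_dim, 2)          # binary softmax weights

def forward(word_ids):
    x = E[word_ids].mean(axis=0)       # 1. average the word embeddings
    h = np.maximum(0.0, x.dot(W1))     # 2. hidden layer with ReLU
    logits = h.dot(W2)                 # 3. softmax over the two classes
    p = np.exp(logits - logits.max())
    return p / p.sum()

print(forward([3, 17, 42]))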
In each training run, the cost values are subtly different, and the prediction results are also subtly different (e.g. first attempt: 56.43, second attempt: 56.34, ...).
The strange thing is that if I do not update the embedding layer, all runs give identical results.
However, when testing on a GTX 970, training behaves the same way (cost values subtly different), but the prediction results are identical across all runs.
I know cuDNN can cause this kind of non-determinism, but I'm not using a CNN architecture, just averaging, ReLU, and softmax.
Please help if you know anything about this problem.
The environment is as follows.
1. GPU: TITAN X
2. theano version: 0.9.0
3. OS: Ubuntu 14.04
Thanks....
I would like to plot the mean validation score vs. the mean training score for a linear support vector machine, in a similar fashion to what is done here: http://youtu.be/9qg9__n4X2A?t=20m33s
However, when running similar code, the parameter compute_training_scores does not seem to exist.
This parameter is also not documented [1]. I checked the current master branch on GitHub and it does not seem to have been committed yet.
I am using scikit-learn 0.14.1.
I am a bit confused. Is there a branch or tag that I need in order to get the same functionality, or is there an alternative way to calculate this?
The code in question:
param_grid = {'C': 10. ** np.arange(-3, 4)}
grid_search = GridSearchCV(svm, param_grid=param_grid, cv=3, verbose=3, compute_training_score=True)
grid_search.fit(X_train, y_train);
plt.plot([c.mean_validation_score for c in grid_search.cv_scores_], label="validation error")
plt.plot([c.mean_training_score for c in grid_search.cv_scores_], label="training error")
plt.xticks(np.arange(6), param_grid['C']); plt.xlabel("C"); plt.ylabel("Accuracy");plt.legend(loc='best');
If I run the same code without the offending parameter I get:
AttributeError: '_CVScoreTuple' object has no attribute 'mean_training_score'
[1] http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html
mean_validation_score and mean_training_score will be available in the next scikit-learn release, 0.15. You need to install from GitHub to get them.
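For later scikit-learn versions (0.19 and up), roughly the same plot can be produced with return_train_score and cv_results_. A sketch, assuming a linear SVC and the X_train, y_train, and param_grid from the question:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

param_grid = {'C': 10. ** np.arange(-3, 4)}
grid_search = GridSearchCV(LinearSVC(), param_grid=param_grid, cv=3,
                           return_train_score=True)
grid_search.fit(X_train, y_train)

# Mean scores per parameter setting, averaged over the CV folds
plt.plot(grid_search.cv_results_['mean_test_score'], label="validation score")
plt.plot(grid_search.cv_results_['mean_train_score'], label="training score")
plt.xticks(np.arange(len(param_grid['C'])), param_grid['C'])
plt.xlabel("C")
plt.ylabel("Accuracy")
plt.legend(loc='best')
plt.show()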