How to reuse the classifier in a pickled pipeline in sklearn? - nlp

I have read the answer in another post: https://stackoverflow.com/a/25794131/4566048
There the classifier is pickled, but what about the TfidfVectorizer? How can I use it from the pickled pipeline? Since I need it to transform my feature vectors, I still need to use it, right?

After some digging around, I seem to have solved the problem. I will answer my own question here in case it helps anyone with the same doubt in the future.
I found that saving only the classifier is not enough; the CountVectorizer and TfidfTransformer used for feature extraction need to be saved as well for it to work.
Hope that helps!
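Concretely, the simplest route is to pickle the whole fitted Pipeline so the vectorizer, transformer, and classifier travel together. A minimal sketch, with toy training data as a placeholder:

    import pickle
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline

    # Placeholder training data, purely for illustration.
    train_docs = ["good movie", "bad movie", "great film", "terrible film"]
    train_labels = ["pos", "neg", "pos", "neg"]

    pipeline = Pipeline([
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier()),
    ])
    pipeline.fit(train_docs, train_labels)

    # Save the vectorizer, transformer, and classifier in one object ...
    with open("text_clf.pkl", "wb") as f:
        pickle.dump(pipeline, f)

    # ... and later reload it; the fitted vocabulary and idf weights come back
    # with it, so raw text can be fed straight to predict().
    with open("text_clf.pkl", "rb") as f:
        loaded = pickle.load(f)
    print(loaded.predict(["some new document"]))

This avoids having to pickle the CountVectorizer and TfidfTransformer separately.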

Related

How to save the model from the best iteration in xgboost?

I am using XGBClassifier for image classification, and I am new to machine learning and xgboost. I recently learned that the model I am saving with the pickle library after a certain number of iterations is from the last iteration, not the best iteration. Can anyone tell me how I can save the model from the best iteration? I am, of course, using early stopping.
I apologize if I have made any mistakes in asking the question. I need the solution as soon as possible because I need it for my thesis.
To those pointing me to older questions about the best iteration: my question is different. I want to save the best iteration in pickle format so that I can use it in the future, not just call predict later in the same code.
Thank you.
Use joblib dump/load to save and load the model, and get the model's booster to retrieve the best iteration.
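A minimal sketch of that approach, assuming a recent xgboost (1.6+ sklearn API) and a synthetic data set purely for illustration; the file name is a placeholder:

    import joblib
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    clf = XGBClassifier(n_estimators=500, early_stopping_rounds=20)
    clf.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

    # joblib persists the whole fitted estimator, including the booster and the
    # bookkeeping from early stopping.
    joblib.dump(clf, "xgb_best.joblib")

    loaded = joblib.load("xgb_best.joblib")
    best = loaded.get_booster().best_iteration  # best round found by early stopping
    print("best iteration:", best)

    # Use only the trees up to (and including) the best iteration when predicting.
    preds = loaded.predict(X_val, iteration_range=(0, best + 1))

In older xgboost versions early_stopping_rounds is passed to fit() instead of the constructor, but the joblib save/load and best_iteration idea is the same.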

Custom Loss Function with Spacy Textcat

I've been looking around for a while now. I would like to know if it's possible to modify/customize the loss function of the spaCy text categorizer.
For instance, when you want to distill a model (say, BERT) and add a regression term to the loss being optimized (against the probabilities of each class instead of only the labels), I don't understand where I should look. I tried to explore some spaCy code, but there is only a function to get the loss.
If someone knows where to look to see the loss function and change it (by writing a subclass, for instance), that would be nice!
Thanks
Arnault
spaCy is ultimately built on top of Thinc, so if you want to do custom work, you should tinker with Thinc, not spaCy. spaCy typically allows you to initialize a pipe with a raw Thinc model.
This is especially true since spaCy's philosophy is to provide one implementation that works well, not necessarily a super-customizable framework.
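As a starting point, here is a hedged sketch (assuming spaCy 3.x, where the relevant method is TextCategorizer.get_loss(examples, scores)): subclass the component and override get_loss, for example to regress against teacher probabilities for distillation. The user_data["teacher_probs"] key is a hypothetical place to store the soft labels, and registering the subclass as a pipe factory is omitted for brevity:

    import numpy
    from spacy.pipeline import TextCategorizer

    class DistillTextCategorizer(TextCategorizer):
        def get_loss(self, examples, scores):
            # One row of teacher probabilities per example, aligned with self.labels.
            targets = numpy.asarray(
                [eg.reference.user_data["teacher_probs"] for eg in examples],
                dtype="float32",
            )
            d_scores = (scores - targets) / len(examples)  # gradient of squared error
            loss = float((d_scores ** 2).sum())            # scalar value for logging
            return loss, d_scores

get_loss must return both the scalar loss and the gradient with respect to the scores; the component's update() then backpropagates that gradient through the underlying Thinc model.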

sklearn TfidfVectorizer stop_words_

Is there a way to get the tf and idf for the terms in the stop_words_ attribute of sklearn's TfidfVectorizer (not the stop_words parameter)?
They are already calculated, so the model should have these values; has anyone ever used them? If not, I guess I have to hack the internal code and get them myself, correct?
[UPDATE]
For anyone who ends up on this question: what I ended up doing was hacking sklearn/feature_extraction/text.py and exporting the words and values as tuples for the CountVectorizer class, rather than just the words.
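If you would rather not patch the library, one hedged alternative (assuming scikit-learn 1.0+ for get_feature_names_out) is to fit a second vectorizer whose vocabulary is exactly the pruned terms and read their tf/idf from that model; the toy documents are placeholders:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are friends",
    ]

    # First pass: max_df pruning moves overly common terms into stop_words_.
    vec = TfidfVectorizer(max_df=0.6)
    vec.fit(docs)
    pruned = sorted(vec.stop_words_)  # terms dropped by max_df/min_df/max_features
    print("pruned terms:", pruned)

    if pruned:
        # Second pass: restrict the vocabulary to the pruned terms to recover
        # their statistics without touching sklearn internals.
        pruned_vec = TfidfVectorizer(vocabulary=pruned)
        tfidf = pruned_vec.fit_transform(docs)  # tf-idf weights for the pruned terms
        idf = dict(zip(pruned_vec.get_feature_names_out(), pruned_vec.idf_))
        print("idf of pruned terms:", idf)

Note this recomputes the values rather than exporting the ones the first vectorizer discarded, which was the point of the patched text.py.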

What are the methods to check whether my model fits the data (without using graphs)?

I am working on a binary logistic regression data set in Python. I want to know if there are any numerical methods to calculate how well the model fits the data.
Please don't include graphical methods like plotting, etc.
Thanks :)
Read through section 3.3.2, "Classification metrics", in the sklearn documentation:
http://scikit-learn.org/stable/modules/model_evaluation.html
Hope it helps.
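For a concrete starting point, here is a minimal sketch of numeric (non-graphical) fit measures for a binary logistic regression; the synthetic data set is purely for illustration:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, f1_score, log_loss, roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    pred = model.predict(X_test)                 # hard class predictions
    proba = model.predict_proba(X_test)[:, 1]    # probability of the positive class

    print("accuracy:", accuracy_score(y_test, pred))
    print("f1 score:", f1_score(y_test, pred))
    print("log loss:", log_loss(y_test, proba))
    print("ROC AUC :", roc_auc_score(y_test, proba))

Log loss and ROC AUC use the predicted probabilities, so they say more about calibration and ranking than accuracy alone.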

Image augmentation for a Siamese CNN

I have a task to compare two images and check whether they belong to the same class (using a Siamese CNN). Because I have a really small data set, I want to use Keras's ImageDataGenerator.
I have read through the documentation and understand the basic idea. However, I am not quite sure how to apply it to my use case, i.e. how to generate two images together with a label saying whether they are in the same class or not.
Any help would be greatly appreciated.
P.S. I can think of a much more convoluted process using sklearn's extract_patches_2d, but I feel there is a more elegant solution to this.
Edit: It looks like creating my own data generator may be the way to go. I will try this approach; a sketch of one is below.
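One hedged way to do that (assuming tf.keras): a keras.utils.Sequence that samples same-class and different-class pairs and applies ImageDataGenerator's random_transform to each image independently. The images/labels arrays and the augmentation settings are placeholders:

    import numpy as np
    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    from tensorflow.keras.utils import Sequence

    class SiamesePairGenerator(Sequence):
        def __init__(self, images, labels, batch_size=32):
            self.images = images              # array of shape (n, h, w, c)
            self.labels = np.asarray(labels)
            self.batch_size = batch_size
            self.augmenter = ImageDataGenerator(rotation_range=15,
                                                width_shift_range=0.1,
                                                height_shift_range=0.1,
                                                horizontal_flip=True)

        def __len__(self):
            return len(self.images) // self.batch_size

        def __getitem__(self, idx):
            left, right, same = [], [], []
            for _ in range(self.batch_size):
                i = np.random.randint(len(self.images))
                if np.random.rand() < 0.5:   # positive pair: same class
                    candidates = np.where(self.labels == self.labels[i])[0]
                else:                        # negative pair: different class
                    candidates = np.where(self.labels != self.labels[i])[0]
                j = np.random.choice(candidates)
                left.append(self.augmenter.random_transform(self.images[i]))
                right.append(self.augmenter.random_transform(self.images[j]))
                same.append(int(self.labels[i] == self.labels[j]))
            return [np.array(left), np.array(right)], np.array(same)

    # model.fit(SiamesePairGenerator(images, labels), epochs=10)  # two-input model

Each batch yields two image tensors plus a 0/1 label indicating whether the pair comes from the same class, which is the format a contrastive or binary-cross-entropy Siamese head expects.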
