sklearn TfidfVectorizer stop_words_ - scikit-learn

Is there a way to get the tf and idf values for the terms in the stop_words_ attribute of sklearn's TfidfVectorizer (not the stop_words parameter)?
They are already calculated during fitting, so the model should have these values, but has anyone ever used them? If not, I guess I have to hack the internal code and extract them myself, correct?
[UPDATE]
For anyone who might end up on this question, as an update: what I ended up doing was hacking sklearn/feature_extraction/text.py and exporting the words together with their values as tuples from the CountVectorizer class, rather than just the words.
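For reference, one workaround that avoids touching scikit-learn's source is to fit a second, unpruned vectorizer on the same corpus and look up the terms from the fitted stop_words_ set in it. A minimal sketch (it assumes scikit-learn >= 1.0 for get_feature_names_out, and it is an alternative to the patch described above, not the same thing):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs play",
]

# Pruned vectorizer: terms dropped by max_df/min_df/max_features end up
# in the fitted stop_words_ attribute
pruned = TfidfVectorizer(max_df=0.5)
pruned.fit(docs)

# Unpruned vectorizer over the same corpus keeps every term, so its idf_
# values also cover the words the pruned vectorizer threw away
full = TfidfVectorizer()
full.fit(docs)

idf_by_term = dict(zip(full.get_feature_names_out(), full.idf_))
for term in sorted(pruned.stop_words_):
    print(term, idf_by_term[term])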

Related

tensorflow seq2seq model outputting the same output

I am developing an encoder-decoder model to predict titles for lecture transcripts, but the model predicts the same title no matter what the input is. Any idea what may be causing this?
If you would like this solved, I strongly recommend providing your code as an example, ideally including your loss, accuracy, or anything else that makes the problem easier for people to relate to. That said, here are some conditions that can lead to this problem: 1) your code is not doing what you intended somewhere; 2) LSTMs sometimes suffer from exploding or vanishing gradients, and although they were designed to fix the problems a plain RNN structure faces, they still run into them from time to time; 3) you forgot to shuffle your dataset before training, which makes your model learn the same pattern over and over (a minimal shuffling sketch follows below). If none of the above fits your case, try to provide your code and dataset information to make things clearer.
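On point 3, a rough tf.data shuffling sketch; transcript_ids and title_ids are hypothetical, already-tokenized id tensors:

import tensorflow as tf

# Hypothetical padded id matrices of shape (num_examples, max_len)
dataset = tf.data.Dataset.from_tensor_slices((transcript_ids, title_ids))

# Shuffle with a buffer roughly the size of the training set (if it fits
# in memory) and reshuffle every epoch so batches differ between epochs
dataset = dataset.shuffle(buffer_size=10_000, reshuffle_each_iteration=True)
dataset = dataset.batch(64)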

Use pretrained embedding in Spanish with Torchtext

I am using Torchtext in an NLP project. I have a pretrained embedding in my system, which I'd like to use. Therefore, I tried:
my_field.vocab.load_vectors(my_path)
But, apparently, this only accepts the names of a short list of pre-accepted embeddings, for some reason. In particular, I get this error:
Got string input vector "my_path", but allowed pretrained vectors are ['charngram.100d', 'fasttext.en.300d', ..., 'glove.6B.300d']
I found some people with similar problems, but the solutions I can find so far are "change Torchtext source code", which I would rather avoid if at all possible.
Is there any other way in which I can work with my pretrained embedding? A solution that lets me use a different Spanish pretrained embedding would also be acceptable.
Some people seem to think it is not clear what I am asking. So, if the title and final question are not enough: "I need help using a pre-trained Spanish word-embedding in Torchtext".
It turns out there is a relatively simple way to do this without changing Torchtext's source code. Inspiration from this Github thread.
1. Create numpy word-vector tensor
You need to load your embedding so you end up with a numpy array with dimensions (number_of_words, word_vector_length):
my_vecs_array[word_index] should return your corresponding word vector.
IMPORTANT. The indices (word_index) for this array MUST be taken from Torchtext's word-to-index dictionary (field.vocab.stoi). Otherwise Torchtext will point to the wrong vectors!
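A minimal sketch of this step, assuming a hypothetical pretrained dict that maps each word to its numpy vector (for example, loaded from a Spanish fastText .vec file):

import numpy as np

word_vector_length = 300  # must match your pretrained vectors
vocab_size = len(my_field.vocab)

# Rows are ordered by Torchtext's own word-to-index mapping (stoi);
# words without a pretrained vector are left as zero rows
my_vecs_array = np.zeros((vocab_size, word_vector_length), dtype=np.float32)
for word, word_index in my_field.vocab.stoi.items():
    if word in pretrained:
        my_vecs_array[word_index] = pretrained[word]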
Don't forget to convert to tensor:
my_vecs_tensor = torch.from_numpy(my_vecs_array)
2. Load array to Torchtext
I don't think this step is really necessary because of the next one, but it lets you keep the Torchtext field with both the dictionary and the vectors in one place.
my_field.vocab.set_vectors(my_field.vocab.stoi, my_vecs_tensor, word_vector_length)
3. Pass weights to model
In your model you will declare the embedding like this:
my_embedding = torch.nn.Embedding(vocab_len, word_vect_len)
Then you can load your weights using:
my_embedding.weight = torch.nn.Parameter(my_field.vocab.vectors, requires_grad=False)
Use requires_grad=True if you want to train the embedding, use False if you want to freeze it.
EDIT: It turns out there is another way that looks a bit easier! The improvement is that apparently you can pass the pretrained word vectors directly during the vocabulary-building step, which takes care of steps 1-2 here.
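For completeness, a sketch of that easier route with the legacy Field API used above; torchtext.vocab.Vectors accepts any word2vec/fastText-style text file, and the file name and train_dataset below are hypothetical:

from torchtext.vocab import Vectors

# Any text file with "word v1 v2 ... vN" lines works, e.g. a Spanish
# fastText embedding; parsed vectors are cached in ./vector_cache
spanish_vectors = Vectors(name="spanish_embeddings.vec", cache="./vector_cache")

# build_vocab aligns vocab.vectors with vocab.stoi automatically
my_field.build_vocab(train_dataset, vectors=spanish_vectors)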

How to stop training some specific weights in TensorFlow

I'm just beginning to learn TensorFlow and I have some problems with it. In the training loop I want to ignore the small weights and stop training them, so I've assigned these small weights to zero. I searched the tf API and found that tf.Variable(weight, trainable=False) can stop a weight from being trained, and I'd like to use it whenever a weight's value equals zero. I tried to use .eval(), but I got the exception ValueError("Cannot evaluate tensor using eval(): No default session is registered"). I have no idea how to get the value of a variable inside the training loop. Another way would be to modify tf.train.GradientDescentOptimizer(), but I don't know how to do that. Has anyone implemented something like this, or can you suggest other methods? Thanks in advance!
Are you looking to apply regularization to the weights?
There is an apply_regularization method in the API that you can use to accomplish that.
See: How to exactly add L1 regularisation to tensorflow error function
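A rough sketch with the TF1 contrib API that the linked answer refers to; loss and weights below are hypothetical placeholders for your own tensors:

import tensorflow as tf

# L1 regularization pushes small weights toward exactly zero
l1_reg = tf.contrib.layers.l1_regularizer(scale=0.005)
penalty = tf.contrib.layers.apply_regularization(l1_reg, weights_list=[weights])
total_loss = loss + penalty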
I don't know of any use case for stopping the training of some variables; it's probably not what you should do.
Anyway, calling tf.Variable() (if I understood you correctly) is not going to help, because it's called just once, when the graph is defined. The first argument is initial_value: as the name suggests, it's used only during initialization.
Instead, you can use tf.assign like this:
with tf.Session() as session:
    assign_op = var.assign(0)
    session.run(assign_op)
It will update the variable during the session, which is what you're asking for.
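If the goal really is to freeze individual entries that have been set to zero, another option (not from the answer above, just a sketch assuming TF1 and hypothetical loss and weights tensors) is to mask the gradients before applying them:

import tensorflow as tf

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
grads_and_vars = optimizer.compute_gradients(loss, var_list=[weights])

masked = []
for grad, var in grads_and_vars:
    # 1.0 where the weight should keep training, 0.0 where it is frozen
    mask = tf.cast(tf.not_equal(var, 0.0), grad.dtype)
    masked.append((grad * mask, var))

train_op = optimizer.apply_gradients(masked)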

how to reuse the classifier in the pickled pipeline in sklearn?

I have read the answer in another post https://stackoverflow.com/a/25794131/4566048
The classifier is pickled, but how about the TfidfVectorizer? How can I use it from the pickled pipeline? Since I need it to transform my feature vector, I still need to use it, right?
After some digging around, I seem to have solved the problem. I will answer my own question here in case it helps anyone with the same doubt in the future.
I found that saving only the classifier is not enough; the CountVectorizer and TfidfTransformer used for feature extraction need to be saved as well for it to work.
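A minimal sketch of saving everything in one go by pickling the whole Pipeline; train_texts, train_labels and the SGDClassifier choice are placeholders:

import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier()),
])
pipeline.fit(train_texts, train_labels)

# The fitted vectorizer and transformer travel with the classifier
joblib.dump(pipeline, "text_clf.joblib")

loaded = joblib.load("text_clf.joblib")
predictions = loaded.predict(["some new document"])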
Hope that helps!

How can I convert probability into score?

I am now working on a document recommendation program and I am kinda stuck here.
For each document, I have a score assigned according to the user's actions. Then, when a new document comes in, I need to predict how much the user will like it and rerank all the documents according to their scores. My solution is to use a threshold to divide those scores into "recommend" and "not recommend". Then Naive Bayes or other classification models can either give me a label or return the probability of that label (I am using the NLTK package for the text analytics).
Am I on the right track? My question is: when I get that probability, how can I convert it into the score I use for ranking? Or should I use logistic regression in scikit-learn instead?
Thanks!
It sounds like you are trying to force a ranking problem into a classification problem. What you really want to do is learn how to rank the documents given a "query".
I would suggest trying out something like the SVM-Rank algorithm. It takes as input a set of "recommended" and "not recommended" vectors and then learns how to rank them so that the recommended ones come first. There is also a simple python tool in dlib you can use to do it. See here for an example: http://dlib.net/svm_rank.py.html
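Adapted from the linked dlib example, a tiny sketch of the idea; the two-dimensional vectors stand in for real document feature vectors:

import dlib

data = dlib.ranking_pair()
# Feature vectors for documents the user liked vs. did not like
data.relevant.append(dlib.vector([1, 0]))
data.nonrelevant.append(dlib.vector([0, 1]))

trainer = dlib.svm_rank_trainer()
trainer.c = 10
rank = trainer.train(data)

# A higher output means "rank this document earlier"
print(rank(data.relevant[0]))
print(rank(data.nonrelevant[0]))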
