I am using TfidfVectorizer with the following parameters:
smooth_idf=False, sublinear_tf=False, norm=None, analyzer='word', ngram_range=(1,2)
I am vectorizing the following text: "red sun, pink candy. Green flower."
Here is the output of get_feature_names():
['candy', 'candy green', 'coffee', 'flower', 'green', 'green flower', 'hate', 'icecream', 'like', 'moon', 'pink', 'pink candy', 'red', 'red sun', 'sun', 'sun pink']
Since "candy" and "green" are part of the separate sentences, why is "candy green" n-gram created?
Is there a way to prevent the creation of n-grams that span multiple sentences?
It depends on how you are passing that text to TfidfVectorizer!
If it is passed as a single document, TfidfVectorizer will only keep words that contain 2 or more alphanumeric characters (the default token_pattern). Punctuation is completely ignored and always treated as a token separator. So your text becomes:
['red', 'sun', 'pink', 'candy', 'green', 'flower']
Now from these tokens, ngrams are generated.
Since TfidfVectorizer is a bag-of-words technique that works on the words appearing in a document, it does not keep any information about the structure or order of words within a single document.
If you want the sentences to be treated separately, then you should detect the sentences yourself and pass them as different documents.
Otherwise, pass your own analyzer and n-gram generator to the TfidfVectorizer.
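For example, here is a minimal sketch of the first option, fitting only on this one text and splitting sentences with a naive regex (purely for illustration):

import re
from sklearn.feature_extraction.text import TfidfVectorizer

text = "red sun, pink candy. Green flower."
# Naive sentence split on ., ! or ? -- use a proper sentence tokenizer for real data.
sentences = [s for s in re.split(r'[.!?]+\s*', text) if s]
vectorizer = TfidfVectorizer(smooth_idf=False, sublinear_tf=False, norm=None,
                             analyzer='word', ngram_range=(1, 2))
vectorizer.fit(sentences)
print(vectorizer.get_feature_names())  # get_feature_names_out() on newer scikit-learn
# 'candy green' no longer appears, because n-grams are generated per sentence
# ('sun pink' remains, since the comma does not end a sentence).

Note that fitting per sentence also changes the document frequencies, and therefore the IDF weights, compared to fitting on the whole text as one document.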
For more information on how TfidfVectorizer actually works, see my other answer:
sklearn TfidfVectorizer : Generate Custom NGrams by not removing stopword in them
Models like BERT generate contextual embeddings for words with different contextual meanings, like 'bank', 'left'.
I don't understand which contextual embedding the model chooses to use at test time. Given a test sentence for classification, when we load the pre-trained BERT, how do we initialize the word (token) embedding so that it uses the right contextual embedding rather than the other embeddings of the same word?
More specifically, there is a convert_to_id() function which converts a word to an id. How does that one id represent the correct contextual embedding for the input sentence at test time? Thank you.
I searched all over online but only found explanations of the difference between static and contextual embeddings. The high-level concept is easy to grasp, but how it is actually achieved is unclear. I also searched for code examples, but the convert_to_id() function confused me further, as I explained in my question.
TL;DR There's only one embedding for the word "left." There's also no way to know which meaning the word has if the sequence is only one word. BERT uses the representation of its start-of-sequence token (i.e., [CLS]) to represent each sequence, and that representation will differ depending on what context the word "left" is used in.
Given your example of text classification, the input sentence is first tokenized using the WordPiece tokenizer, and the [CLS] token's representation is fed to a feedforward layer for classification.
You can't really demonstrate context with single words, so I'll use two different sentences:
I left my house this morning.
You'll see my house on your left.
The steps to performing text classification typically are:
Tokenize your input text and receive the necessary input_ids and attention_masks.
Feed this tokenized input into your model and receive the outputs.
Feed the [CLS] token's representation to a classifier layer (typically a feedforward network).
The two sentences are tokenized to (using bert-base-uncased):
['[CLS]', 'i', 'left', 'my', 'house', 'this', 'morning', '.', '[SEP]']
['[CLS]', 'you', "'", 'll', 'see', 'my', 'house', 'on', 'your', 'left', '.', '[SEP]']
The [CLS] token's representation for each sentence will be different because the sentences have different words (i.e., contexts). The result is therefore different.
>>> from transformers import AutoModel, AutoTokenizer
>>> bert = AutoModel.from_pretrained('bert-base-uncased')
>>> tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
>>> sentence1 = "I left my house this morning."
>>> sentence2 = "You'll see my house on your left."
>>> sentence1_inputs = tokenizer(sentence1, return_tensors='pt')
>>> sentence2_inputs = tokenizer(sentence2, return_tensors='pt')
>>> sentence1_outputs = bert(**sentence1_inputs)
>>> sentence2_outputs = bert(**sentence2_inputs)
>>> cls1 = sentence1_outputs[1]
>>> cls2 = sentence2_outputs[1]
>>> print(cls1[0][:5])
tensor([-0.8832, -0.3484, -0.8044, 0.6536, 0.6409], grad_fn=<SliceBackward0>)
>>> print(cls2[0][:5])
tensor([-0.8791, -0.4069, -0.8994, 0.7371, 0.7010], grad_fn=<SliceBackward0>)
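The same is true of the word "left" itself: its token id is identical in both sentences, but the hidden state BERT produces for it differs with the surrounding context. Continuing the session above (a quick sketch; the exact value will vary):

>>> import torch
>>> left_id = tokenizer.convert_tokens_to_ids('left')  # one id, regardless of context
>>> pos1 = sentence1_inputs['input_ids'][0].tolist().index(left_id)
>>> pos2 = sentence2_inputs['input_ids'][0].tolist().index(left_id)
>>> left1 = sentence1_outputs[0][0, pos1]  # contextual vector of 'left' in sentence 1
>>> left2 = sentence2_outputs[0][0, pos2]  # contextual vector of 'left' in sentence 2
>>> torch.cosine_similarity(left1, left2, dim=0)  # below 1.0: same id, different contextual vectors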
I would like to identify compound phrases in one corpus (e.g. (w_1, w_2) in Corpus 1) which not only appear significantly more often than their constituents (e.g. (w_1), (w_2) in Corpus 1) within that corpus, but also more often than they do in a second corpus (e.g. (w_1, w_2) in Corpus 2). Consider the following informal example. I have two corpora, each consisting of a set of documents:
[['i', 'live', 'in', 'new', 'york'], ['new', 'york', 'is', 'busy'], ...]
[['los', 'angeles', 'is', 'sunny'], ['los', 'angeles', 'has', 'bad', 'traffic'], ...].
In this case, I would like new_york to be detected as a compound phrase. However, when corpus 2 is replaced by
[['i', 'go', 'to', 'new', 'york'], ['i', 'like', 'new', 'york'], ...],
I would like new_york to be relatively disregarded.
I could just use a ratio between n-gram scores of corresponding phrases in the two corpora, but I don't see how to scale this to general n. Normally, phrase detection for n-grams with n > 2 is done by recursing on n and gradually inserting compound phrases into the documents by thresholding a score function. This ensures that at step n, if you want to score the n-gram (w_1, ..., w_n), you can always normalize by the constituent m-grams for m < n. But with a different corpus, these constituents are not guaranteed to appear.
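To make the bigram case of that ratio concrete, here is a minimal sketch (corpus1 and corpus2 stand for lists of token lists like the examples above, and the PMI estimate is deliberately crude); scaling this to general n is exactly the open part of the question:

from collections import Counter
import math

def bigram_pmi(corpus):
    # corpus: list of token lists, e.g. [['i', 'live', 'in', 'new', 'york'], ...]
    unigrams, bigrams = Counter(), Counter()
    for doc in corpus:
        unigrams.update(doc)
        bigrams.update(zip(doc, doc[1:]))
    n = sum(unigrams.values())
    return {(w1, w2): math.log((c / n) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
            for (w1, w2), c in bigrams.items()}

pmi1, pmi2 = bigram_pmi(corpus1), bigram_pmi(corpus2)
# Contrastive log-ratio: high only if the bigram is cohesive in corpus 1 and not in corpus 2.
score = {bg: v - pmi2.get(bg, 0.0) for bg, v in pmi1.items()}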
A reference to the literature or a relevant hack will be appreciated.
I'm using the CONLL2003 dataset to generate word embeddings using Word2vec and Glove.
The number of words returned by word2vecmodel.wv.vocab is much smaller than the number of words in glove.dictionary.
Here is the code:
Word2Vec:
word2vecmodel = Word2Vec(result, size=100, window=5, sg=1)
X = word2vecmodel[word2vecmodel.wv.vocab]
w2vwords = list(word2vecmodel.wv.vocab)
Output len(w2vwords) = 4653
Glove:
from glove import Corpus
from glove import Glove
import numpy as np
corpus = Corpus()
nparray = []
allwords = []
no_clusters=500
corpus.fit(result, window=5)
glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)
Output: len(glove.dictionary) = 22833
The input is a list of sentences. For example:
result[1:5] =
[['Peter', 'Blackburn'],
 ['BRUSSELS', '1996-08-22'],
 ['The', 'European', 'Commission', 'said', 'Thursday', 'disagreed', 'German', 'advice', 'consumers', 'shun', 'British', 'lamb', 'scientists', 'determine', 'whether', 'mad', 'cow', 'disease', 'transmitted', 'sheep', '.'],
 ['Germany', "'s", 'representative', 'European', 'Union', "'s", 'veterinary', 'committee', 'Werner', 'Zwingmann', 'said', 'Wednesday', 'consumers', 'buy', 'sheepmeat', 'countries', 'Britain', 'scientific', 'advice', 'clearer', '.']]
There are totally 13517 sentences in the result list.
Can someone please explain why the list of words for which the embeddings are created are drastically different in size?
You haven't mentioned which Word2Vec implementation you're using, but I'll assume you're using the popular Gensim library.
Like the original word2vec.c code released by Google, Gensim Word2Vec uses a default min_count parameter of 5, meaning that any words appearing fewer than 5 times are ignored.
The word2vec algorithm needs many varied examples of a word's usage in different contexts to generate strong word-vectors. When words are rare, they fail to get very good word-vectors themselves: the few examples only show a few uses that may be idiosyncratic compared to what a larger sampling would show, and they can't be subtly balanced against many other word representations in the manner that's best.
But further, given that in typical word-distributions there are many such low-frequency words, altogether they also tend to make the word-vectors for other, more-frequent words worse. The lower-frequency words are, comparatively, 'interference' that absorbs training state/effort to the detriment of other, more-important words. (At best, you can offset this effect a bit by using more training epochs.)
So, discarding low-frequency words is usually the right approach. If you really need vectors for those words, obtaining more data so that those words are no longer rare is the best approach.
You can also use a lower min_count, including as low as min_count=1 to retain all words. But often, discarding such rare words is better for whatever end-purpose the word-vectors will be used for.
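A quick way to confirm that min_count accounts for the gap, assuming the same result list and the older Gensim API (size=, wv.vocab) used in the question:

from gensim.models import Word2Vec

# min_count=1 keeps every word, like the GloVe corpus.dictionary does.
model_all = Word2Vec(result, size=100, window=5, sg=1, min_count=1)
print(len(model_all.wv.vocab))  # should now be much closer to len(glove.dictionary)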
My goal is to find similarities between a word and a document. For example, I want to find the similarity between "new" and a document, for simplicity, say "Hello World!".
I used word2vec from gensim, but the problem is it does not find the similarity for an unseen word. Thus, I tried to use fastText from gensim as it can find similarity for words that are out of vocabulary.
Here is a sample of my document data:
[['This', 'is', 'the', 'only', 'rule', 'of', 'our', 'household'],
 ['If', 'you', 'feel', 'a', 'presence', 'standing', 'over', 'you', 'while', 'you', 'sleep', 'do'],
 ['NOT', 'open', 'your', 'eyes'],
 ['Ignore', 'it', 'and', 'try', 'to', 'fall', 'asleep'],
 ['This', 'may', 'sound', 'a', 'bit', 'like', 'the', 'show', 'Bird', 'Box', 'from', 'Netflix']]
I simply train data like this:
from gensim.models.fasttext import FastText
model = FastText(sentences_cleaned)
Then, I want to find the similarity between, say, "rule" and this document.
model.wv.most_similar("rule")
However, fastText gives me this:
[('the', 0.1334390938282013),
('they', 0.12790171802043915),
('in', 0.12731242179870605),
('not', 0.12656228244304657),
('and', 0.11071767657995224),
('of', 0.08563747256994247),
('I', 0.06609072536230087),
('that', 0.05195673555135727),
('The', 0.002402491867542267),
('my', -0.009009800851345062)]
Obviously, it should have "rule" as the top similarity, since the word "rule" appears in the first sentence of the document. I also tried stemming/lemmatization, but that doesn't work either.
Was my input format correct? I've seen lots of documents using .cor or .bin formats and I don't know what those are.
Thanks for any reply!
model.wv.most_similar('rule') asks that model's set of word-vectors (.wv) to return the words most similar to 'rule'. That is, you've provided neither a document (multiple words) as a query, nor is there any way for the FastText model to return either a document itself or the name of any document. Only words, as it has done.
While FastText trains on texts – lists of word-tokens – it only models words/subwords. So it's unclear what you expected instead: the answer is of the proper form.
Those results don't look very much like 'rule', but you'll only get good results from FastText (and similar word2vec algorithms) if you train them with lots of varied data showing many subtly-contrasting, realistic uses of the relevant words.
How many texts, with how many words, are in your sentences_cleaned data? (How many uses of 'rule' and related words?)
In any real FastText/Word2Vec/etc. model, trained with adequate data and parameters, no single sentence (like your 1st sentence) can tell you much about what the results "should" be. That only emerges from the full, rich dataset.
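Separately, if the underlying goal really is a word-to-document similarity score, one common workaround (not something most_similar gives you) is to compare the word's vector against the mean of the document's word vectors, for example with Gensim's n_similarity. A minimal sketch, assuming the model trained above:

doc = ['This', 'is', 'the', 'only', 'rule', 'of', 'our', 'household']
# Cosine similarity between the mean vector of ['rule'] and the mean vector of the document's words.
print(model.wv.n_similarity(['rule'], doc))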
I'm looking for a simple "hack" to implement the following idea: I want to have a specific word appear artificially in the context of every word (the underlying goal is to try and use word2vec for supervised sentence classification).
An example is best:
Say I have the sentence: "The dog is in the garden", and a window of 1.
So we would get the following pairs of (target, context):
(dog, The), (dog, is), (is, dog), (is, in), etc.
But what I would like to feed to the word2vec algorithm is this:
(dog, The), (dog, is), **(dog, W)**, (is, dog), (is, in), **(is, W)**, etc.,
as if my word W were in the context of every word, where W is a word of my choosing that is not in the existing vocabulary.
Is there an easy way to do this in R or Python?
I imagine you have a list of sentences and a label for each sentence:
sentences = [
    ["The", "dog", "is", "in", "the", "garden"],
    ["The", "dog", "is", "not", "in", "the", "garden"],
]
Then you create the word-context pairs:
word_context = [("dog", "The"), ("dog", "is"), ("is", "dog"), ("is", "in") ...]
Now, since each sentence has a label, you can add that label to the context of all the words in it:
labels = [
    "W1",
    "W2",
]
word_labels = [
    (word, label)
    for sent, label in zip(sentences, labels)
    for word in sent
]
word_context += word_labels
This works unless you need to keep the order of the word-context pairs (the label pairs are simply appended at the end)!
Take a look at the 'Paragraph Vectors' algorithm – implemented as the class Doc2Vec in Python gensim. In it, each text example gets an extra pseudoword that essentially floats over the full example, contributing itself to every skip-gram-like (called PV-DBOW in Paragraph Vectors) or CBOW-like (called PV-DM in Paragraph Vectors) training-context.
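A minimal sketch of that idea with gensim's Doc2Vec, reusing the sentences and labels placeholders from the other answer (Gensim 4 API; the names and parameter values are illustrative only):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = [["The", "dog", "is", "in", "the", "garden"],
             ["The", "dog", "is", "not", "in", "the", "garden"]]
labels = ["W1", "W2"]

# Each tag behaves like the floating pseudoword W: it participates in every training context.
docs = [TaggedDocument(words=sent, tags=[label]) for sent, label in zip(sentences, labels)]
model = Doc2Vec(docs, vector_size=50, window=1, min_count=1, dm=0, dbow_words=1, epochs=100)
print(model.dv['W1'])  # learned vector for the label pseudoword (model.docvecs on older Gensim)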
Also take a look at Facebook's 'FastText' paper and library. It's essentially an extension of word2vec in two different directions:
First, it has the option of learning vectors for subword fragments (character n-grams) so that future unknown words can get rough-guess vectors from their subwords.
Second, it has the option of trying to predict not just nearby-words during vector-training, but also known-classification-labels for the containing text example (sentence). As a result, the learned word-vectors may be better for subsequent classification of other future sentences.
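A minimal sketch of that supervised mode with the fasttext Python package (the training file name and labels here are hypothetical; fasttext expects one example per line, prefixed with __label__):

import fasttext

# train.txt contains lines such as:
#   __label__W1 The dog is in the garden
#   __label__W2 The dog is not in the garden
model = fasttext.train_supervised(input="train.txt", wordNgrams=2, epoch=25)
print(model.predict("The dog is in the garden"))  # predicted label(s) and probabilities
print(model.get_word_vector("dog"))               # word vector shaped by the label-prediction objective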