sklearn Vectorizer (NLP task) : Generating Custom NGrams which are capable of scaling up for n >= 3 - scikit-learn

I would like to build a vectorizer in sklearn that can scale up to higher values of n, where n is the number of words treated as a single vocabulary element.
My idea is that for n = 1 and n = 2 my custom vectorizer behaves the same as the sklearn vectorizers, but for n >= 3 only the first and last words are kept and the words in between are replaced by a placeholder, so that "I am good" and "Harry will play" become "I x good" and "Harry x play".
Example: suppose I want to build a vectorizer that scales up to n = 4. Take the sentence "Harry will play tomorrow".
Then "Harry will play tomorrow" breaks down into:
all vocabulary elements of length 1 and 2, plus "Harry x play", "will x tomorrow" and "Harry x x tomorrow".
Since the size of this vocabulary is of the same order as for n = 2, and elements of the form "A x B" will not be any rarer than "A B", I believe this model may scale better and give performance benefits.
I searched the net for a way to do this, and while there are many tutorials on building custom vectorizers, all of them end up using the pre-implemented n-gram method.
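One possible way to sketch the scheme described above is a plain Python function that emits the unigrams, the bigrams, and the "gapped" n-grams for n >= 3. The function name `gapped_ngrams`, the whitespace tokenization, and the literal placeholder "x" are assumptions made for illustration, not part of any library.

```python
def gapped_ngrams(text, max_n=4, placeholder="x"):
    """Return unigrams, bigrams, and gapped n-grams (3 <= n <= max_n).

    For n >= 3 only the first and last token of each window are kept;
    every token in between is replaced by `placeholder`.
    """
    tokens = text.split()  # naive whitespace tokenization (assumption)
    features = list(tokens)  # n = 1
    features += [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]  # n = 2
    for n in range(3, max_n + 1):
        gap = [placeholder] * (n - 2)
        for i in range(len(tokens) - n + 1):
            features.append(" ".join([tokens[i], *gap, tokens[i + n - 1]]))
    return features


features = gapped_ngrams("Harry will play tomorrow")
```

A function like this can be passed as the `analyzer` argument of `CountVectorizer` or `TfidfVectorizer`, since both accept a callable that maps a raw document to a sequence of features. Note that a callable analyzer bypasses the built-in preprocessing, so lowercasing or stop-word removal would have to happen inside the function.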

Related

How to generate GloVe embeddings for POS tags? Python

For a sentence analysis task, I would like to take the sequence of POS tags associated with the sentence and feed it to my model as if the POS tags are words.
I am using GloVe to make representations of each word in the sentence and SpaCy to generate POS tags. However, GloVe embeddings do not make much sense for POS tags, so I will have to somehow create embeddings for each POS tag. What is the best way to create embeddings for POS tags, so that I can feed POS sequences into my model in the same way I would feed sentences? Could anyone point to code examples of how to do this with GloVe in Python?
Added context
My task is a binary classification of sentence pairs, based on their resemblance (similar meaning vs different meaning).
I would like to use POS tags as words, so that the POS tags serve as an additional bit of information to compare the sentences. My current model does not use an LSTM as a way to predict sequences.
Most word embedding models still rely on an underlying assumption that the meaning of a word is induced by its usage context. For example, learning a word2vec embedding with skipgram or continuous bag of words formulations implicitly assumes a model in which the representation vector of the word is based on the context words that co-occur with the target word, specifically by learning to create embeddings that best solve the classification task of distinguishing pairs of words that contextually co-occur from random pairs of words (so-called negative sampling).
But if the input is changed to be a sequence of discrete labels (POS tags), this assumption doesn't seem like it needs to remain accurate or reasonable. Part of speech labels have an assigned meaning that is not really induced by the context of being surrounded by other part of speech labels, so it's unlikely that standard learning tasks which are used to produce word embeddings would work when treating POS labels as if they were words from a much smaller vocabulary.
What is the overall sentence analysis task in your situation?
Added after question was updated with the learning task at hand.
Let's assume you can create POS input vectors for each sentence example. If there are N different POS labels possible, it means your input will consist of one vector from word embeddings and another vector of length N, where the value in component i represents the number of terms in the input sentence that possess POS label P_i.
For example, let's pretend the only POS labels possible are 'article', 'noun' and 'verb', and you have a sentence with ['article', 'noun', 'verb', 'noun']. Then this transforms into [1, 2, 1], and probably you want to normalize it by the length of the sentence. Let's call this input pos1 for sentence number 1 and pos2 for sentence number 2.
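As a minimal sketch of that transformation, the count vector can be built with a `Counter`. The helper name `pos_count_vector` and the three-label set are assumptions taken from the toy example above, not from any tagger's real label inventory.

```python
from collections import Counter


def pos_count_vector(pos_tags, label_set, normalize=True):
    """Count how often each POS label occurs, in the order given by label_set."""
    counts = Counter(pos_tags)
    vec = [float(counts[label]) for label in label_set]
    if normalize and pos_tags:
        # Divide by the sentence length so sentences of different lengths are comparable.
        vec = [c / len(pos_tags) for c in vec]
    return vec


labels = ["article", "noun", "verb"]            # toy label set from the example
tags = ["article", "noun", "verb", "noun"]      # POS sequence of one sentence
raw = pos_count_vector(tags, labels, normalize=False)   # [1.0, 2.0, 1.0]
pos1 = pos_count_vector(tags, labels)                   # [0.25, 0.5, 0.25]
```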
Let's call the word embedding vector input for sentence 1 as sentence1. sentence1 will be calculated by looking up each word embedding from a separate source, like a pretrained word2vec model or fastText or GloVe, and summing them up (using continuous bag of words). And the same for sentence2.
It's assumed that your batches of training data would already be processed into these vector formats, so a given single input would be a 4-tuple of vectors: the looked up CBOW embedding vector for sentence 1, same for sentence 2, and the calculated discrete representation vector for POS labels of sentence 1, and same for sentence 2.
A model that could work from this data might be like this:
from keras.layers import Activation, Concatenate, Dense, Input
from keras.models import Model

# Placeholder shapes (assumptions for illustration): replace them with the
# dimension of your word vectors and the number N of POS labels.
word_embedding_shape = (300,)
pos_vector_shape = (17,)

sentence1 = Input(shape=word_embedding_shape)
sentence2 = Input(shape=word_embedding_shape)
pos1 = Input(shape=pos_vector_shape)
pos2 = Input(shape=pos_vector_shape)

# Note: just choosing 128 as an embedding space dimension or intermediate
# layer size... in your real case, you'd choose these shape params
# based on what you want to model or experiment with. They don't mean
# anything here.
sentence1_branch = Dense(128)(sentence1)
sentence1_branch = Activation('relu')(sentence1_branch)
# ... do whatever other sentence1-only stuff
sentence2_branch = Dense(128)(sentence2)
sentence2_branch = Activation('relu')(sentence2_branch)
# ... do whatever other sentence2-only stuff
pos1_embedding = Dense(128)(pos1)
pos1_branch = Activation('relu')(pos1_embedding)
# ... do whatever other pos1-only stuff
pos2_embedding = Dense(128)(pos2)
pos2_branch = Activation('relu')(pos2_embedding)
# ... do whatever other pos2-only stuff

# Concatenate is a layer: instantiate it first, then call it on the list of tensors.
unified = Concatenate()([sentence1_branch, sentence2_branch,
                         pos1_branch, pos2_branch])
# ... do dense layers, whatever, to the concatenated intermediate
# representations

# finally boil it down to whatever final prediction task you are using,
# whether it is predicting a sentence similarity score (Dense(1)),
# or predicting a binary label that indicates whether the sentence
# pairs are similar or not (Dense(2) then followed by softmax activation,
# or Dense(1) followed by some type of probability activation like sigmoid).
# Assume your data is binary labeled for similar sentences...
output = Activation('softmax')(Dense(2)(unified))

# compile() is a method of a Model, not of a tensor, so wrap the graph first.
model = Model(inputs=[sentence1, sentence2, pos1, pos2], outputs=[output])
# With a 2-unit softmax output the labels must be one-hot encoded.
# The optimizer is only an example; add other compile parameters as needed.
model.compile(loss='binary_crossentropy', optimizer='adam')
# Do training to learn the weights...

# A separate model that will just produce the embedding output
# from a POS input vector, relying on weights learned from the
# training process.
pos_embedding_model = Model(inputs=[pos1], outputs=[pos1_embedding])

Use features based on tf idf score for text classification using naive bayes (sklearn)

I am learning to implement text classification (into two classes) using tfidf and naive bayes by referring to this blog and sklearn tfidf
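Below is the code snippet:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB

# documents (list of str), labels (NumPy array) and stop_words are defined elsewhere.
kf = StratifiedKFold(n_splits=5)
totalNB = 0
totalMatNB = np.zeros((2, 2))
for train_index, test_index in kf.split(documents, labels):
    X_train = [documents[i] for i in train_index]
    X_test = [documents[i] for i in test_index]
    y_train, y_test = labels[train_index], labels[test_index]
    # Fit the vocabulary and idf weights on the training fold only.
    vectorizer = TfidfVectorizer(min_df=2, max_df=0.2, use_idf=True, stop_words=stop_words)
    train_corpus_tf_idf = vectorizer.fit_transform(X_train)
    test_corpus_tf_idf = vectorizer.transform(X_test)
    model2 = MultinomialNB()
    model2.fit(train_corpus_tf_idf, y_train)
    result2 = model2.predict(test_corpus_tf_idf)
    # Accumulate the confusion matrix and the number of correct predictions.
    totalMatNB = totalMatNB + confusion_matrix(y_test, result2)
    totalNB = totalNB + sum(y_test == result2)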
The above code works as expected.
I have read the documentation, but I am still confused about min_df and max_df.
How can I select features for classification based on their tf-idf score, i.e. filter the features by tf-idf score?
For example:
use the features whose tf-idf score is greater than x [ score(feature) > x ]
use the features whose tf-idf score is between x and y [ y > score(feature) > x ] or [ y >= score(feature) >= x ]
When training the vectorizer, setting specific values for min_df and max_df is supposed to help you tweak the eventual tf-idf representation to best suit your needs by limiting the vocabulary. It also reduces the dimension of the vector representation, which is usually a good thing since these vectors tend to be huge.
Setting a high min_df value will remove relatively infrequent terms from the representation. If your eventual model is not supposed to care too much about very rare terms, this would be a good thing.
Setting a low max_df will remove relatively frequent terms from the representation. If your eventual model doesn't care about words that are used in many contexts (e.g. "the", "or", "and") then this would be a good thing. Note that both parameters accept either an integer, interpreted as an absolute number of documents, or a float between 0.0 and 1.0, interpreted as a proportion of documents, so a "low" max_df can be a small integer or a float close to 0.
Important note: your suggestion of filtering features after the fact based on their tf-idf weight is a totally different thing. Setting min_df and max_df when fitting the vectorizer limits the eventual vocabulary based on document frequency across the entire training sample, whereas the tf-idf weight in a given vector is a document-specific value (since it is also affected by the term frequency in that specific document).
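To make the document-frequency idea concrete, here is a small pure-Python sketch that mimics the documented meaning of min_df and max_df. It is not sklearn's implementation; the function name `filter_vocabulary` and the toy corpus are made up for illustration.

```python
def filter_vocabulary(tokenized_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df].

    Integers are treated as absolute document counts, floats as
    proportions of the corpus, following the convention described above.
    """
    n_docs = len(tokenized_docs)
    df = {}
    for doc in tokenized_docs:
        for term in set(doc):  # count each term at most once per document
            df[term] = df.get(term, 0) + 1
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(term for term, count in df.items() if low <= count <= high)


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]
# "the" appears in 4 of 4 documents (above 75%), and "dog", "ran", "bird",
# "flew" appear in only one document (below min_df=2), so all are dropped.
vocab = filter_vocabulary(docs, min_df=2, max_df=0.75)
```

The decision is made once per term for the whole corpus. A tf-idf weight, by contrast, differs from document to document, which is why thresholding on it is a separate operation.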
Hope this helps!

How Word Mover's Distance (WMD) uses word2vec embedding space?

According to the WMD paper, the method is inspired by the word2vec model and uses the word2vec vector space for moving document 1 towards document 2 (in the context of the Earth Mover's Distance metric). From the paper:
Assume we are provided with a word2vec embedding matrix X ∈ R^{d×n} for a finite size vocabulary of n words. The i-th column, x_i ∈ R^d, represents the embedding of the i-th word in d-dimensional space. We assume text documents are represented as normalized bag-of-words (nBOW) vectors, d ∈ R^n. To be precise, if word i appears c_i times in the document, we denote d_i = c_i / Σ_{j=1}^{n} c_j. An nBOW vector d is naturally very sparse as most words will not appear in any given document. (We remove stop words, which are generally category independent.)
I understand the concept from the paper; however, I couldn't understand from the Gensim code how WMD uses the word2vec embedding space.
Can someone explain it in a simple way? Does it calculate the word vectors in a different way? I couldn't see where in this code the word2vec embedding matrix is used.
WMD function from Gensim:
def wmdistance(self, document1, document2):
    # Remove out-of-vocabulary words.
    len_pre_oov1 = len(document1)
    len_pre_oov2 = len(document2)
    document1 = [token for token in document1 if token in self]
    document2 = [token for token in document2 if token in self]
    dictionary = Dictionary(documents=[document1, document2])
    vocab_len = len(dictionary)
    # Sets for faster look-up.
    docset1 = set(document1)
    docset2 = set(document2)
    # Compute distance matrix.
    distance_matrix = zeros((vocab_len, vocab_len), dtype=double)
    for i, t1 in dictionary.items():
        for j, t2 in dictionary.items():
            if t1 not in docset1 or t2 not in docset2:
                continue
            # Compute Euclidean distance between word vectors.
            distance_matrix[i, j] = sqrt(np_sum((self[t1] - self[t2])**2))

    def nbow(document):
        d = zeros(vocab_len, dtype=double)
        nbow = dictionary.doc2bow(document)  # Word frequencies.
        doc_len = len(document)
        for idx, freq in nbow:
            d[idx] = freq / float(doc_len)  # Normalized word frequencies.
        return d

    # Compute nBOW representation of documents.
    d1 = nbow(document1)
    d2 = nbow(document2)
    # Compute WMD.
    return emd(d1, d2, distance_matrix)
For the purposes of WMD, a text is considered a bunch of 'piles' of meaning. Those piles are placed at the coordinates of the text's words – and that's why WMD calculation is dependent on a set of word-vectors from another source. Those vectors position the text's piles.
The WMD is then the minimal amount of work needed to shift one text's piles to match another text's piles. And the measure of the work needed to shift from one pile to another is the Euclidean distance between those piles' coordinates.
You could just try a naive shifting of the piles: look at the first word from text A, shift it to the first word from text B, and so forth. But that's unlikely to be the cheapest shifting, which would likely try to match nearer words, to send the 'meaning' on the shortest possible paths. So actually calculating the WMD is an iterative optimization problem, significantly more expensive than just a simple Euclidean distance or cosine distance between two points.
That optimization is done inside the emd() call in the code you excerpt. But what that optimization requires is the pairwise distances between all words in text A, and all words in text B – because those are all the candidate paths across which meaning-weight might be shifted. You can see those pairwise distances calculated in the code to populate the distance_matrix, using the word-vectors already loaded in the model and accessible via self[t1], self[t2], etc.
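To see where the word vectors enter, the two inputs handed to the optimizer can be reproduced on a toy example. This is only a sketch: the 2-D vectors in `vectors` are invented for illustration (a real model would supply them through something like self[t1]), and the final optimization step performed by emd() is deliberately left out.

```python
import math

# Hypothetical 2-D word vectors, made up for illustration only.
vectors = {
    "obama": (1.0, 0.0),
    "president": (1.0, 1.0),
    "speaks": (0.0, 2.0),
    "greets": (0.0, 3.0),
}
doc1 = ["obama", "speaks", "speaks"]
doc2 = ["president", "greets"]

# Shared vocabulary of both documents, with a fixed index per term.
vocab = sorted(set(doc1) | set(doc2))
index = {term: i for i, term in enumerate(vocab)}


def nbow(document):
    """Normalized bag-of-words: the 'pile sizes' of one document."""
    d = [0.0] * len(vocab)
    for term in document:
        d[index[term]] += 1.0 / len(document)
    return d


# Pairwise Euclidean distances between words of doc1 and words of doc2.
# This is the only place where the embedding space is used.
distance_matrix = [[0.0] * len(vocab) for _ in vocab]
for t1 in set(doc1):
    for t2 in set(doc2):
        distance_matrix[index[t1]][index[t2]] = math.dist(vectors[t1], vectors[t2])

d1, d2 = nbow(doc1), nbow(doc2)
```

Here d1 and d2 say how much mass sits on each word, and distance_matrix says how expensive each possible move is. An optimal-transport solver would then choose the cheapest way to turn d1 into d2.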

ALS model - how to generate full_u * v^t * v?

I'm trying to figure out how an ALS model can predict values for new users in between it being updated by a batch process. In my search, I came across this stackoverflow answer. I've copied the answer below for the reader's convenience:
You can get predictions for new users using the trained model (without updating it):
To get predictions for a user in the model, you use its latent representation (vector u of size f, the number of factors), which is multiplied by the product latent factor matrix (the matrix made of the latent representations of all products, a bunch of vectors of size f) and gives you a score for each product. For new users, the problem is that you don't have access to their latent representation (you only have the full representation of size M, the number of different products), but what you can do is use a similarity function to compute a similar latent representation for this new user by multiplying it by the transpose of the product matrix.
i.e. if your user latent matrix is u and your product latent matrix is v, then for user i in the model you get scores by doing: u_i * v
For a new user, you don't have a latent representation, so take the full representation full_u and do: full_u * v^t * v
This will approximate the latent factors for the new user and should give reasonable recommendations (if the model already gives reasonable recommendations for existing users).
To answer the question of training, this allows you to compute predictions for new users without having to redo the heavy computation of the model, which you can now do only once in a while. So you can run your batch processing at night and still make predictions for new users during the day.
Note: MLLIB gives you access to the matrices u and v.
The quoted text above is an excellent answer, however, I'm struggling to understand how to programmatically implement this solution. For example, the matrix u and v can be obtained with:
# pyspark example
# omitted for brevity ... loading movielens 1M ratings
model = ALS.train(ratings, rank, numIterations, lambdaParam)
matrix_u = model.userFeatures()
print(matrix_u.take(2)) # take a look at the dataset
This returns:
[
(2, array('d', [0.26341307163238525, 0.1650490164756775, 0.118405282497406, -0.5976635217666626, -0.3913084864616394, -0.1379186064004898, -0.3866392970085144, -0.1768060326576233, -0.38342711329460144, 0.48550787568092346, -0.18867433071136475, -0.02757863700389862, 0.1410026103258133, 0.11498363316059113, 0.03958914801478386, 0.034536730498075485, 0.08427099883556366, 0.46969038248062134, -0.8230801224708557, -0.15124185383319855, 0.2566414773464203, 0.04326820373535156, 0.19077207148075104, 0.025207923725247383, -0.02030213735997677, 0.1696728765964508, 0.5714617967605591, -0.03885050490498543, -0.09797532111406326, 0.29186877608299255, -0.12768596410751343, -0.1582849770784378, 0.01933656632900238, -0.09131495654582977, 0.26577943563461304, -0.4543033838272095, -0.11789630353450775, 0.05775507912039757, 0.2891307771205902, -0.2147761881351471, -0.011787488125264645, 0.49508437514305115, 0.5610293745994568, 0.228189617395401, 0.624510645866394, -0.009683617390692234, -0.050237834453582764, -0.07940001785755157, 0.4686132073402405, -0.02288617007434368])),
(4, array('d', [-0.001666820957325399, -0.12487432360649109, 0.1252429485321045, -0.794727087020874, -0.3804478347301483, -0.04577340930700302, -0.42346617579460144, -0.27448347210884094, -0.25846347212791443, 0.5107921957969666, 0.04229479655623436, -0.10212298482656479, -0.13407345116138458, -0.2059325873851776, 0.12777331471443176, -0.318756639957428, 0.129398375749588, 0.4351944327354431, -0.9031049013137817, -0.29211774468421936, -0.02933369390666485, 0.023618215695023537, 0.10542935132980347, -0.22032295167446136, -0.1861676126718521, 0.13154461979866028, 0.6130356192588806, -0.10089754313230515, 0.13624103367328644, 0.22037173807621002, -0.2966669499874115, -0.34058427810668945, 0.37738317251205444, -0.3755438029766083, -0.2408779263496399, -0.35355791449546814, 0.05752146989107132, -0.15478627383708954, 0.3418906629085541, -0.6939512491226196, 0.4279302656650543, 0.4875738322734833, 0.5659542083740234, 0.1479463279247284, 0.5280753970146179, -0.24357643723487854, 0.14329688251018524, -0.2137598991394043, 0.011986476369202137, -0.015219110995531082]))
]
I can also do similar to get the v matrix:
matrix_v = model.productFeatures()
print(matrix_v.take(2)) # take a look at the dataset
This results in:
[
(2, array('d', [0.019985994324088097, 0.0673416256904602, -0.05697149783372879, -0.5434763431549072, -0.40705952048301697, -0.18632276356220245, -0.30776089429855347, -0.13178342580795288, -0.27466219663619995, 0.4183739423751831, -0.24422742426395416, -0.24130797386169434, 0.24116989970207214, 0.06833088397979736, -0.01750543899834156, 0.03404173627495766, 0.04333991929888725, 0.3577033281326294, -0.7044714689254761, 0.1438472419977188, 0.06652364134788513, -0.029888223856687546, -0.16717877984046936, 0.1027146726846695, -0.12836599349975586, 0.10197233408689499, 0.5053384900093079, 0.019304445013403893, -0.21254844963550568, 0.2705852687358856, -0.04169371724128723, -0.24098040163516998, -0.0683765709400177, -0.09532768279314041, 0.1006036177277565, -0.08682398498058319, -0.13584329187870026, -0.001340558985248208, 0.20587041974067688, -0.14007550477981567, -0.1831497997045517, 0.5021498203277588, 0.3049483597278595, 0.11236990243196487, 0.15783801674842834, -0.044139936566352844, -0.14372406899929047, 0.058535050600767136, 0.3777201473712921, -0.045475270599126816])),
(4, array('d', [0.10334215313196182, 0.1881643384695053, 0.09297363460063934, -0.457258403301239, -0.5272660255432129, -0.0989445373415947, -0.2053477019071579, -0.1644461452960968, -0.3771175146102905, 0.21405018866062164, -0.18553146719932556, 0.011830524541437626, 0.29562288522720337, 0.07959598302841187, -0.035378433763980865, -0.11786794662475586, -0.11603366583585739, 0.3776192367076874, -0.5124108791351318, 0.03971947357058525, -0.03365595266222954, 0.023278912529349327, 0.17436474561691284, -0.06317273527383804, 0.05118614062666893, 0.4375131130218506, 0.3281322419643402, 0.036590900272130966, -0.3759073317050934, 0.22429685294628143, -0.0728025734424591, -0.10945595055818558, 0.0728464275598526, 0.014129920862615108, -0.10701996833086014, -0.2496117204427719, -0.09409723430871964, -0.11898282915353775, 0.18940524756908417, -0.3211393356323242, -0.035668935626745224, 0.41765937209129333, 0.2636736035346985, -0.01290816068649292, 0.2824321389198303, 0.021533429622650146, -0.08053319901227951, 0.11117415875196457, 0.22975310683250427, 0.06993964314460754]))
]
However, I'm not sure how to progress from this to full_u * v^t * v
This new user is not in the matrix U, so you don't have its latent representation in 'k' factors; you only know its full representation, i.e., all its ratings. full_u here means all of the new user's ratings in a dense format (not the sparse format the ratings come in), e.g.:
[0 2 0 0 0 1 0] if user u has rated item 2 with a 2 and item 6 with a 1.
Then you can get v pretty much like you did and turn it into a NumPy matrix, for instance:
pf = model.productFeatures()
Vt = np.matrix(np.asarray(pf.values().collect()))
Then it is just a matter of multiplying:
full_u*Vt*Vt.T
Vt and V are transposed compared to the other answer but that's just a matter of convention.
Note that the Vt*Vt.T product is fixed, so if you are going to use this for multiple new users it will be computationally more efficient to pre-compute it. Actually for more than one user it would be better to put all their ratings in bigU (in the same format as my one new user example) and just do the matrix product:
bigU*Vt*Vt.T to get all the ratings for all the new users. Might still be worth checking that the product is done in the most efficient way in terms of number of operations.
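A small NumPy sketch of the batched product, under stated assumptions: the random `Vt` below merely stands in for the collected product factors (one row per product, as in the snippet above), and the ratings in `bigU` are invented. It compares the precomputed form with the alternative grouping of the same product.

```python
import numpy as np

rng = np.random.default_rng(0)
n_products, k, n_new_users = 7, 3, 2

# Stand-in for the collected product factors: one row of k factors per product.
Vt = rng.normal(size=(n_products, k))

# Dense ratings of the new users, one row per user (made-up values).
bigU = np.zeros((n_new_users, n_products))
bigU[0, 1] = 2.0  # first user: [0 2 0 0 0 1 0], as in the example above
bigU[0, 5] = 1.0
bigU[1, 3] = 4.0

# Option A: precompute the fixed (n_products x n_products) product once.
projection = Vt @ Vt.T
scores_a = bigU @ projection

# Option B: go through the k-dimensional latent space first.
scores_b = (bigU @ Vt) @ Vt.T
```

Both groupings give the same scores. When the number of factors k is much smaller than the number of products, option B needs fewer multiplications per user and avoids storing a products-by-products matrix, so it may be the cheaper choice.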
Just a word of warning. People talk about the user and product matrices like they are left and right singular vectors. But as far as I understand, the method used to find U and V is an optimization of a straight squared error cost function, which makes none of the orthogonality guarantees of SVD.
In other words, think algebraically about what the above answer claims. We have a full ratings matrix R, an n by p matrix of ratings for n users over p products. We decompose it with k latent factors to approximate R = UV, where the rows of U, an n by k matrix, are the latent user representations, and the columns of V, a k by p matrix, are the latent product representations. In order to find latent user representations for a matrix R of entirely new users without refitting the model, we need:
R = U V
R V^{-1} = U V V^{-1}
R V^{-1} = U I_{k}
R V^{-1} = U
where I_{k} is the k dimensional identity matrix and V^{-1} is the p by k right inverse of V. The tip above assumes that V^{T} = V^{-1}. This is not guaranteed. And in general there is no guarantee that assuming this is true will give you anything but nonsense answers.
Let me know if I'm missing something in the optimization method behind MLLib's CF implementation. Is there a trick in the ALS model that guarantees orthogonality that I'm missing?
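The algebra above can be checked numerically on an idealized toy case. The sketch below assumes a noise-free, fully observed R = U V with random factors. Real ratings are sparse and ALS is regularized, so this illustrates the argument only and is not a recipe for MLlib.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, k, n_products = 4, 3, 6

U = rng.normal(size=(n_users, k))      # latent user representations (rows)
V = rng.normal(size=(k, n_products))   # latent product representations (columns)
R = U @ V                              # idealized, noise-free ratings

# Using the transpose as if it were an inverse gives U (V V^T), not U.
U_via_transpose = R @ V.T

# The Moore-Penrose pseudo-inverse is a right inverse of V when V has full row rank.
U_via_pinv = R @ np.linalg.pinv(V)
```

In this setting the pseudo-inverse recovers U, while the transpose recovers it only if V V^T happens to be the identity, which nothing in the squared-error objective enforces.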

How to use vector representation of words (as obtained from Word2Vec,etc) as features for a classifier?

I am familiar with using BOW features for text classification, wherein we first find the size of the vocabulary for the corpus which becomes the size of our feature vector. For each sentence/document, and for all its constituent words, we then put 0/1 depending on the absence/presence of that word in that sentence/document.
However, now that I am trying to use vector representation of each word, is creating a global vocabulary essential?
Suppose the size of the vectors is N (usually between 50 and 500). The naive way of generalizing traditional BOW is to replace each 0 bit (in BOW) with N zeros and each 1 bit (in BOW) with the real vector (say from Word2Vec). The size of the feature vector would then be N * |V| (compared to |V| in BOW, where |V| is the size of the vocabulary). This simple generalization should work fine given a decent number of training instances.
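A minimal sketch of this naive generalization, assuming a toy vocabulary of three words with invented 2-dimensional vectors (so N = 2 and |V| = 3):

```python
# Hypothetical 2-dimensional word vectors (N = 2), for illustration only.
word_vectors = {"good": [0.9, 0.1], "bad": [-0.8, 0.2], "movie": [0.1, 0.7]}
vocabulary = sorted(word_vectors)  # fixed order: ["bad", "good", "movie"]
N = 2


def naive_features(tokens):
    """One slot of N numbers per vocabulary word: its vector if present, zeros if absent."""
    present = set(tokens)
    features = []
    for word in vocabulary:
        features.extend(word_vectors[word] if word in present else [0.0] * N)
    return features


x = naive_features(["good", "movie"])  # length N * |V| = 6
```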
To make the feature vectors smaller, people use various techniques like using recursive combination of vectors with various operations. (See Recursive/Recurrent Neural Network and similar tricks, for example: http://web.engr.illinois.edu/~khashab2/files/2013_RNN.pdf or http://papers.nips.cc/paper/4204-dynamic-pooling-and-unfolding-recursive-autoencoders-for-paraphrase-detection.pdf )
To get a fixed length feature vector for each sentence, although the number of words in each sentence is different, do as follows:
tokenize each sentence into constituent words
for each word get word vector (if it is not there ignore the word)
average all the word vectors you got
this will always give you a d-dim vector (d is word vector dim)
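Below is the code snippet:
import numpy as np

def getWordVecs(words, w2v_model, dim=300):
    vecs = []
    for word in words:
        word = word.replace('\n', '')
        try:
            vecs.append(w2v_model[word].reshape((1, dim)))
        except KeyError:
            # Ignore words that have no vector in the model.
            continue
    if not vecs:
        # No known words at all: fall back to a zero vector (one possible choice).
        return np.zeros(dim, dtype='float')
    vecs = np.concatenate(vecs).astype('float')
    # Average the word vectors, as described in the steps above.
    final_vec = np.mean(vecs, axis=0)
    return final_vec
words is the list of tokens obtained after tokenizing a sentence, and w2v_model is the loaded word-vector model (dim is the dimension of its vectors).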
