Custom tokenizer not working in CountVectorizer sklearn - scikit-learn

I am trying to make a CountVectorizer with a custom tokenizer function, and I am facing a weird problem with it. In the code below, temp_tok is a list of 5 values which is used as the vocabulary later.
temp_tok = ["or", "Normal sinus rhythm", "sinus", "anuj", "Normal sinus"]
def tokenize(text):
    return [temp_tok[0], temp_tok[1], "sinus", "Normal sinus"]
def tokenize2(text):
    return [i for i in temp_tok if i in text]
text = "Normal sinus rhythm"
The output of both functions for text is the same:
tokenize(text)
output = ['or', 'Normal sinus rhythm', 'sinus', 'Normal sinus']
But when I build vectorizers with these tokenizers, the one using tokenize2 gives unexpected output. My vocabulary is temp_tok for both. I experimented with ngram_range but it does not help.
vectorizer = CountVectorizer(vocabulary=temp_tok, tokenizer=tokenize)
vectorizer2 = CountVectorizer(vocabulary=temp_tok, tokenizer=tokenize2)
While vectorizer.transform([text]) gives the expected output, vectorizer2.transform([text]) gives 1 only for "or" and "sinus":
vectorizer.transform(["Normal sinus rhythm"]).toarray()
array([[1, 1, 1, 0, 1]])
vectorizer.transform(["Normal sinus rhythm"]).toarray()
array([[1, 0, 1, 0, 0]])
I also tried passing a dictionary instead of the list temp_tok as the vocabulary to CountVectorizer, but it doesn't help. Is this an sklearn problem or am I doing something wrong?

CountVectorizer converts the text to lower case before passing it to the tokenizer. So tokenize2, which matches the capitalized strings in temp_tok against the lowercased text, does not work, while tokenize works because it simply returns fixed strings.
This can be seen by adding a print function in tokenize2.
def tokenize2(text):
    print(text)
    return [i for i in temp_tok if i in text]
A good solution would be to change the elements in temp_tok to lower case (or to pass lowercase=False to CountVectorizer); any other technique that handles the lower-case/capitalized mismatch would work as well.
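For illustration, a minimal sketch of the first option, lowercasing the vocabulary so it matches the lowercased input (the expected output in the comment assumes the default lowercase=True):
from sklearn.feature_extraction.text import CountVectorizer

temp_tok = ["or", "Normal sinus rhythm", "sinus", "anuj", "Normal sinus"]
temp_tok_lower = [t.lower() for t in temp_tok]  # match the lowercased input text

def tokenize2(text):
    # text arrives already lowercased because lowercase=True by default
    return [i for i in temp_tok_lower if i in text]

vectorizer2 = CountVectorizer(vocabulary=temp_tok_lower, tokenizer=tokenize2)
print(vectorizer2.transform(["Normal sinus rhythm"]).toarray())
# expected: [[1 1 1 0 1]], the same result as with tokenize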

Related

sklearn dcg_score not working as expected

This is my code:
from sklearn.metrics import dcg_score
true_relevance = np.asarray([[10]])
scores = np.asarray([[.1]])
dcg_score(true_relevance, scores)
This code should produce 10 as the dcg_score. The formula from Wikipedia gives 10/log2(2) = 10. But instead I get ValueError: Only ('multilabel-indicator', 'continuous-multioutput', 'multiclass-multioutput') formats are supported. Got binary instead
Did anyone encounter this?
Since computing dcg on a single element is not meaningful, the sklearn library requires at least two y_true and y_score elements in the corresponding arrays.
You can check this by exploring the sklearn code (or through debugging): https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b611bf873bd5836748647221480071a87/sklearn/utils/multiclass.py#L158
Like:
true_relevance = np.asarray([[10, 5]])
scores = np.asarray([[.1, .2]])
dcg_score(true_relevance, scores)
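To see what dcg_score computes once there are at least two elements, here is a hand-computed version of the discounted sum (a sketch assuming the default linear gain and log2 discount; the printed value should be roughly 11.31):
import numpy as np
from sklearn.metrics import dcg_score

true_relevance = np.asarray([[10, 5]])
scores = np.asarray([[.1, .2]])

# dcg_score sums the true relevances in the order induced by the scores,
# discounted by 1 / log2(rank + 1) with ranks starting at 1
order = np.argsort(-scores[0])                    # ranking by descending score
ranked = true_relevance[0][order]                 # [5, 10]
manual_dcg = np.sum(ranked / np.log2(np.arange(2, ranked.size + 2)))

print(manual_dcg)                                 # roughly 11.31
print(dcg_score(true_relevance, scores))          # should agree with the manual value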

Confusion in Pre-processing text for Roberta Model

I want to apply the RoBERTa model for text similarity. Given a pair of sentences, the input should be in the format <s> A </s></s> B </s>. I figured out two possible ways to generate the input ids, namely
a)
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
list1 = tokenizer.encode('Very severe pain in hands')
list2 = tokenizer.encode('Numbness of upper limb')
sequence = list1+[2]+list2[1:]
In this case, sequence is [0, 12178, 3814, 2400, 11, 1420, 2, 2, 234, 4179, 1825, 9, 2853, 29654, 2]
b)
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
list1 = tokenizer.encode('Very severe pain in hands', add_special_tokens=False)
list2 = tokenizer.encode('Numbness of upper limb', add_special_tokens=False)
sequence = [0]+list1+[2,2]+list2+[2]
In this case, sequence is [0, 25101, 3814, 2400, 11, 1420, 2, 2, 487, 4179, 1825, 9, 2853, 29654, 2]
Here 0 represents the <s> token and 2 represents the </s> token. I'm not sure which is the correct way to encode the two sentences for calculating sentence similarity using the RoBERTa model.
The easiest way is probably to use the function provided by HuggingFace's tokenizers themselves, namely the text_pair argument of the encode function. This allows you to feed in two sentences directly, which will give you the desired output:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
sequence = tokenizer.encode(text='Very severe pain in hands',
                            text_pair='Numbness of upper limb',
                            add_special_tokens=True)
This is especially convenient if you are dealing with very long sequences, as the encode function automatically truncates them according to the truncation_strategy argument. You obviously don't have to worry about this if your sequences are short.
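For example, in more recent transformers versions the truncation behaviour is controlled via truncation and max_length instead (an assumption about the installed version; the argument names differ across releases):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
# truncation=True trims the pair down to max_length tokens, special tokens included
sequence = tokenizer.encode(text='Very severe pain in hands',
                            text_pair='Numbness of upper limb',
                            add_special_tokens=True,
                            truncation=True,
                            max_length=10)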
Alternatively, you can make use of the more explicit build_inputs_with_special_tokens() function of the RobertaTokenizer, which can be added to your example like so:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
list1 = tokenizer.encode('Very severe pain in hands', add_special_tokens=False)
list2 = tokenizer.encode('Numbness of upper limb', add_special_tokens=False)
sequence = tokenizer.build_inputs_with_special_tokens(list1, list2)
Note that in this case you still have to generate the sequences list1 and list2 without any special tokens, as you have already done correctly.
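As a quick sanity check, continuing from the snippet above, you can decode the resulting ids and confirm the <s> A </s></s> B </s> layout:
# Decoding the ids should show the expected special-token layout
print(tokenizer.decode(sequence))
# expect something like: <s>Very severe pain in hands</s></s>Numbness of upper limb</s>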

Feed stop_words with bigrams in sklearn CountVectorizer()

Is there an inexpensive and easy way to prevent sklearn's CountVectorizer from only stopping unigrams with the stop_words parameter, and make it stop bigrams as well? What I mean is illustrated in the following snippet:
from sklearn.feature_extraction.text import CountVectorizer
texts = ['hello this is text number one yes yes',
         'hello this is text number two stackflow']
stop_words = {'hello this'}
model = CountVectorizer(analyzer='word',
                        ngram_range=(1, 2),
                        max_features=3,
                        stop_words=stop_words)
doc_vectors = model.fit_transform(texts).toarray()
print(doc_vectors)
print(model.get_feature_names())
So what this code does, is output the following:
>>> [[1 1 1]
>>> [1 1 1]]
>>> ['hello', 'hello this', 'is']
As you can see, I wanted the bigram 'hello this' to be excluded (it's fed to stop_words). I've seen a few posts that use pipelines or custom analyzers, and I've browsed through the documentation, but isn't there an easier way around this problem?
Thanks!
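For reference, the custom-analyzer route mentioned above would look roughly like this: a sketch that wraps the default word/n-gram analyzer and filters its output against an n-gram stop set (not necessarily the cheaper shortcut being asked for):
from sklearn.feature_extraction.text import CountVectorizer

texts = ['hello this is text number one yes yes',
         'hello this is text number two stackflow']
ngram_stop_words = {'hello this'}

# Build the default analyzer once, then filter the n-grams it produces
base = CountVectorizer(analyzer='word', ngram_range=(1, 2))
base_analyzer = base.build_analyzer()

def filtered_analyzer(doc):
    return [ng for ng in base_analyzer(doc) if ng not in ngram_stop_words]

model = CountVectorizer(analyzer=filtered_analyzer, max_features=3)
doc_vectors = model.fit_transform(texts).toarray()
print(doc_vectors)
print(model.get_feature_names())  # 'hello this' no longer appears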

Visualise word2vec generated from gensim using t-sne

I have trained a doc2vec and a corresponding word2vec on my own corpus using gensim. I want to visualise the word2vec with t-SNE, with the words shown. As in, each dot in the figure should also have the "word" with it.
I looked at a similar question here: t-sne on word2vec
Following it, I have this code:
import gensim
import gensim.models as g
from sklearn.manifold import TSNE
import re
import matplotlib.pyplot as plt
modelPath="/Users/tarun/Desktop/PE/doc2vec/model3_100_newCorpus60_1min_6window_100trainEpoch.bin"
model = g.Doc2Vec.load(modelPath)
X = model[model.wv.vocab]
print(len(X))
print(X[0])
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X[:1000,:])
plt.scatter(X_tsne[:, 0], X_tsne[:, 1])
plt.show()
This gives a figure with dots but no words. That is, I don't know which dot represents which word. How can I display the word with the dot?
Two parts to the answer: how to get the word labels, and how to plot the labels on a scatterplot.
Word labels in gensim's word2vec
In current gensim, model.wv.key_to_index (model.wv.vocab in older versions) is a dict keyed by the vocabulary words. To load the data into X for t-SNE, I made one change.
vocab = list(model.wv.key_to_index)
X = model.wv[vocab]
This accomplishes two things: (1) it gets you a standalone vocab list for the final dataframe to plot, and (2) when you index model, you can be sure that you know the order of the words.
Proceed as before with
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X)
Now let's put X_tsne together with the vocab list. This is easy with pandas, so import pandas as pd if you don't have that yet.
df = pd.DataFrame(X_tsne, index=vocab, columns=['x', 'y'])
The vocab words are the indices of the dataframe now.
I don't have your dataset, but in the other SO you mentioned, an example df that uses sklearn's newsgroups would look something like
x y
politics -1.524653e+20 -1.113538e+20
worry 2.065890e+19 1.403432e+20
mu -1.333273e+21 -5.648459e+20
format -4.780181e+19 2.397271e+19
recommended 8.694375e+20 1.358602e+21
arguing -4.903531e+19 4.734511e+20
or -3.658189e+19 -1.088200e+20
above 1.126082e+19 -4.933230e+19
Scatterplot
I like the object-oriented approach to matplotlib, so this starts out a little different.
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(df['x'], df['y'])
Lastly, the annotate method will label coordinates. The first two arguments are the text label and the 2-tuple. Using iterrows(), this can be very succinct:
for word, pos in df.iterrows():
    ax.annotate(word, pos)
[Thanks to Ricardo in the comments for this suggestion.]
Then do plt.show() or fig.savefig(). Depending on your data, you'll probably have to mess with ax.set_xlim and ax.set_ylim to see into a dense cloud; a sketch of that follows below. [Figure: the newsgroup example without any tweaking.]
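For example, continuing from the fig and ax above (the limits here are arbitrary placeholders; the right values depend entirely on your t-SNE output):
# Zoom into a denser region of the cloud; adjust the limits to your data
ax.set_xlim(-25, 25)
ax.set_ylim(-25, 25)
fig.savefig('word2vec_tsne.png', dpi=150)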
You can modify dot size, color, etc., too. Happy fine-tuning!
With the following, you can convert your model to TSV files and then use an embedding-projector page (for example TensorFlow's Embedding Projector) for visualization.
word_tensors_tsv = 'word_tensors.tsv'   # placeholder path for the vectors TSV
word_meta_tsv = 'word_meta.tsv'         # placeholder path for the word metadata TSV
with open(word_tensors_tsv, 'wb') as file_vector, open(word_meta_tsv, 'wb') as file_metadata:
    for word in model.wv.vocab:
        file_metadata.write((word + '\n').encode('utf-8', errors='replace'))
        vector_row = '\t'.join(str(x) for x in model[word])
        file_vector.write((vector_row + '\n').encode('utf-8', errors='replace'))
:)

tf-idf results analysis with python

I am trying to produce tf-idf on a plain corpus of about 200k tokens. I first produced a count vector for term frequency, then produced the tf-idf matrix and got the following results. My code is
from sklearn.feature_extraction.text import TfidfVectorizer
with open("D:\history.txt", encoding='utf8') as infile:
    contents = infile.readlines()
# define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=1.0, max_features=200000,
                                   min_df=0.0,
                                   use_idf=True, ngram_range=(1, 3))
tfidf_matrix = tfidf_vectorizer.fit_transform(contents)  # fit the vectorizer to contents
print(tfidf_matrix)
Results
(0, 8371) 0.0296607326158
(0, 27755) 0.159032195629
(0, 59369) 0.0871403881289
: :
(551, 64746) 0.0324104689629
(551, 10118) 0.0324104689629
(551, 9308) 0.0324104689629
While I want to get results in the following form
(551, good ) 0.0324104689629
You can use the indices of the sparse output tfidf_matrix together with TfidfVectorizer.get_feature_names() to produce the output you want:
features = tfidf_vectorizer.get_feature_names()
indices = zip(*tfidf_matrix.nonzero())
for row, column in indices:
    print('(%d, %s) %f' % (row, features[column], tfidf_matrix[row, column]))
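If a tabular view is more convenient, the same information can be collected into a pandas DataFrame (a sketch; the column names are just for illustration):
import pandas as pd

features = tfidf_vectorizer.get_feature_names()
rows, cols = tfidf_matrix.nonzero()
df = pd.DataFrame({'doc': rows,
                   'term': [features[c] for c in cols],
                   'tfidf': tfidf_matrix[rows, cols].A1})
print(df.sort_values('tfidf', ascending=False).head())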