sklearn DictVectorizer() throwing error with a dictionary as input - scikit-learn

I'm fairly new to sklearn's DictVectorizer, and I'm trying to create a function where DictVectorizer will output feature names from a list of bigrams that I have used to form a feature dictionary. The input to my function is a string, and the function should return a list of bigrams formed into dictionaries (something like this):
def features(sentence: Text) -> List[Dict[Text, Union[Text, int]]]:
    # my feature dict needs to have 'bigram' as the key,
    # and the values will be the bigrams themselves,
    # in the form "w[i]-w[i+1]"
    bigrams: List[Dict[Text, Union[Text, int]]] = []
    # here is my code:
    bigrams = {'bigram': i for j in sentence
               for i in zip(j.split(" ")[:-1], j.split(" ")[1:])}
    return bigrams
from sklearn.feature_extraction import DictVectorizer

vect = DictVectorizer(sparse=False)
text = str()
feature_catalog = features(text)
vect.fit(feature_catalog)
print(sorted(vect.get_feature_names_out()))
Everything works fine until the code advances into the DictVectorizer internals (hidden in the class itself). This is what I get:
AttributeError Traceback (most recent call last)
/var/folders/pl/k80fpf9s4f9_3rp8hnpw5x0m0000gq/T/ipykernel_3804/266218402.py in <module>
22 features = get_feature(text)
23
---> 24 vectorizer.fit(features)
25
26 print(sorted(vectorizer.get_feature_names()))
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/feature_extraction/_dict_vectorizer.py in fit(self, X, y)
159
160 for x in X:
--> 161 for f, v in x.items():
162 if isinstance(v, str):
163 feature_name = "%s%s%s" % (f, self.separator, v)
AttributeError: 'str' object has no attribute 'items'
Any ideas? This is ultimately going to be used as part of a larger processing effort on a corpus.
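For what it's worth (this fix is my own sketch, not from the original thread): DictVectorizer.fit expects an iterable of dicts, one per sample, whereas the comprehension above builds a single dict, so fit iterates over its string keys and calls .items() on a string. Something along these lines, with each bigram wrapped in its own dict, avoids the error:
from typing import Dict, List, Text, Union
from sklearn.feature_extraction import DictVectorizer

def features(sentence: Text) -> List[Dict[Text, Union[Text, int]]]:
    # One dict per bigram, joined as "w[i]-w[i+1]", so fit() receives a list of dicts.
    words = sentence.split(" ")
    return [{"bigram": f"{a}-{b}"} for a, b in zip(words[:-1], words[1:])]

vect = DictVectorizer(sparse=False)
vect.fit(features("the quick brown fox"))
print(sorted(vect.get_feature_names_out()))
# ['bigram=brown-fox', 'bigram=quick-brown', 'bigram=the-quick']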

Related

How to encode empty string using BERT

I have recently been trying to encode an empty string with CamemBERT (a BERT model for French). I wasn't sure how to do that. If I try to simply encode an empty string,
from transformers import CamembertModel, CamembertTokenizer
import torch
tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
camembert = CamembertModel.from_pretrained("camembert-base")
tokenized_sentence = tokenizer.tokenize("")
encoded_sentence = tokenizer.encode(tokenized_sentence, return_tensors='pt')
embeddings = camembert(encoded_sentence)
embeddings.last_hidden_state.squeeze()[0] # embedding of the CLS token
I get the error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-21-553400f369a8> in <module>
1 # Tokenize in sub-words with SentencePiece
2 tokenized_sentence = tokenizer.tokenize("")
----> 3 encoded_sentence = tokenizer.encode(tokenized_sentence, return_tensors='pt')
4 embeddings = camembert(encoded_sentence)
5 embeddings.last_hidden_state.squeeze()[0] # embeddings.last_hidden_state[0][0]
~/anaconda3/envs/r_nlp2/lib/python3.8/site-packages/transformers/tokenization_utils_base.py in encode(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, return_tensors, **kwargs)
2057 ``convert_tokens_to_ids`` method).
2058 """
-> 2059 encoded_inputs = self.encode_plus(
2060 text,
2061 text_pair=text_pair,
~/anaconda3/envs/r_nlp2/lib/python3.8/site-packages/transformers/tokenization_utils_base.py in encode_plus(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2376 )
2377
-> 2378 return self._encode_plus(
2379 text=text,
2380 text_pair=text_pair,
~/anaconda3/envs/r_nlp2/lib/python3.8/site-packages/transformers/tokenization_utils.py in _encode_plus(self, text, text_pair, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
459 )
460
--> 461 first_ids = get_input_ids(text)
462 second_ids = get_input_ids(text_pair) if text_pair is not None else None
463
~/anaconda3/envs/r_nlp2/lib/python3.8/site-packages/transformers/tokenization_utils.py in get_input_ids(text)
446 )
447 else:
--> 448 raise ValueError(
449 f"Input {text} is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
450 )
ValueError: Input [] is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
Which I think is expected behavior. I have tried with spaCy's French transformer model but have also been unsuccessful. Here's the code I used for spaCy:
from transformers import BertTokenizer, BertModel
import spacy
#!python -m spacy download fr_dep_news_trf
trf_fr = spacy.load("fr_dep_news_trf")
example = trf_fr("")
example._.trf_data.tensors[1].flatten() # embedding of the CLS token
And the error is
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-27-c53de04d2e6f> in <module>
1 example = trf_fr("")
----> 2 example._.trf_data.tensors[1].flatten()
IndexError: list index out of range
simply because the model returns [].
I guess that at this point my question is theoretical: what would be the best, or at least a good, way to encode an empty string using CamemBERT or spaCy? Would "forcing" the model to return a vector of zeros be a good thing? Would returning "impossible" values such as (10, ..., 10) be a good possibility? Should I force the tokenizer to create a sequence of [PAD] tokens? In this case, how would I implement that using spaCy and/or CamemBERT?
Thanks!
PS : I'm using
Python 3.8.10
spaCy 3.0.6
transformers 4.6.1
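One possible workaround (my own suggestion, not from the thread): the ValueError comes from encoding a pre-tokenized empty list; passing the raw empty string to the tokenizer still yields the special tokens (<s> and </s> for CamemBERT), so the model can return a <s>/CLS embedding for "nothing". A minimal sketch, assuming transformers 4.6.1 as above:
from transformers import CamembertModel, CamembertTokenizer
import torch

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
camembert = CamembertModel.from_pretrained("camembert-base")

# Encode the raw empty string instead of tokenizer.tokenize("") (which yields []).
encoded = tokenizer("", return_tensors="pt")  # only the special tokens <s> </s>
with torch.no_grad():
    output = camembert(**encoded)

cls_embedding = output.last_hidden_state[0, 0]  # embedding of the <s>/CLS token
print(cls_embedding.shape)  # torch.Size([768])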

TypeError: in method 'IndexIDMap_add_with_ids', argument 4 of type 'faiss::IndexIDMapTemplate< faiss::Index >::idx_t const *'

I'm trying to do semantic search with pre-trained BERT models and transformers. I'm using Facebook AI's Faiss library.
The code is:
encoded_data = model.encode(df.Plot.tolist())
encoded_data = np.asarray(encoded_data.astype('float32'))
index = faiss.IndexIDMap(faiss.IndexFlatIP(768))
index.add_with_ids(encoded_data, np.array(range(0, len(encoded_data))))
faiss.write_index(index, 'movie_plot.index')
The error it returns is:
TypeError Traceback (most recent call last)
<ipython-input-19-c09b9ccadf2a> in <module>
----> 1 index.add_with_ids(encoded_data, np.array(range(0, len(encoded_data))))
2 faiss.write_index(index, 'movie_plot.index')
~\t5\lib\site-packages\faiss\__init__.py in replacement_add_with_ids(self, x, ids)
233
234 assert ids.shape == (n, ), 'not same nb of vectors as ids'
--> 235 self.add_with_ids_c(n, swig_ptr(x), swig_ptr(ids))
236
237 def replacement_assign(self, x, k, labels=None):
~\t5\lib\site-packages\faiss\swigfaiss.py in add_with_ids(self, n, x, xids)
4950
4951 def add_with_ids(self, n, x, xids):
-> 4952 return _swigfaiss.IndexIDMap_add_with_ids(self, n, x, xids)
4953
4954 def add(self, n, x):
TypeError: in method 'IndexIDMap_add_with_ids', argument 4 of type 'faiss::IndexIDMapTemplate< faiss::Index >::idx_t const *'
When I ran the same program in Google Colab, no error was returned. I'm now running this program locally on a Windows 10 PC.
I found the answer: we have to convert the np.array(range(0, len(encoded_data))) to int64:
encoded_data = model.encode(df.Plot.tolist())
encoded_data = np.asarray(encoded_data.astype('float32'))
index = faiss.IndexIDMap(faiss.IndexFlatIP(768))
ids = np.array(range(0, len(df)))
ids = np.asarray(ids.astype('int64'))
index.add_with_ids(encoded_data, ids)
faiss.write_index(index, 'movie_plot.index')
You might convert to float32 after calling np.asarray(encoded_data), like this:
np.asarray(encoded_data).astype('float32')
Faiss add_with_ids() only accepts ids of np.int64 dtype.
I didn't find Python documentation of this dtype requirement, but this link https://faiss.ai/cpp_api/struct/structfaiss_1_1Index.html (although it's for the C++ API) shows the id data type (idx_t).
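As a small variation on the fix above (my own sketch; the random embeddings below are hypothetical stand-ins for model.encode output), np.arange with an explicit dtype builds the int64 ids in one step:
import faiss
import numpy as np

# Dummy embeddings standing in for model.encode(...) output (hypothetical data).
encoded_data = np.random.rand(100, 768).astype('float32')

# Faiss expects int64 ids; np.arange with an explicit dtype avoids the extra cast.
ids = np.arange(len(encoded_data), dtype='int64')

index = faiss.IndexIDMap(faiss.IndexFlatIP(768))
index.add_with_ids(encoded_data, ids)
faiss.write_index(index, 'movie_plot.index')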

How to fix this ValueError?

I am trying to run Python code, mostly based on the NLTK book, for n-gram POS tagging of a Gujarati-language text from my GujaratiTextCorpus. I encountered a ValueError.
I am working with Python 3.7.3 on Windows 10. I use Jupyter Notebook through Anaconda. I am a beginner in Python. I studied the answers available on stackoverflow.com to fix my ValueError, but could not solve it.
import nltk
f = open('C:\\Users\\BHOGAYATA\\Documents\\GujaratiPosTagging\\cts260.txt', encoding = 'utf8')
raw = f.read()
train2_sents = nltk.sent_tokenize(raw)
text2 = nltk.Text(train2_sents)
train2_sents
import nltk
f = open('C:\\Users\\BHOGAYATA\\Documents\\GujaratiPosTagging\\txt42_sents.txt', encoding = 'utf8')
raw = f.read()
bs_sents = nltk.sent_tokenize(raw)
text3 = nltk.Text(bs_sents)
bs_sents
unigram_tagger = nltk.UnigramTagger(train2_sents)
unigram_tagger.tag(bs_sents)
I expected that the words of the two Gujarati sentences would be POS Tagged. I found the following error messages:
ValueError
Traceback (most recent call last)
<ipython-input-3-5fae0b92393e> in <module>
11 text3 = nltk.Text(bs_sents)
12 bs_sents
---> 13 unigram_tagger = nltk.UnigramTagger(train2_sents)
14 unigram_tagger.tag(bs_sents)
15
~\Anaconda3\lib\site-packages\nltk\tag\sequential.py in __init__(self, train, model, backoff, cutoff, verbose)
344
345 def __init__(self, train=None, model=None, backoff=None, cutoff=0, verbose=False):
--> 346 NgramTagger.__init__(self, 1, train, model, backoff, cutoff, verbose)
347
348 def encode_json_obj(self):
~\Anaconda3\lib\site-packages\nltk\tag\sequential.py in __init__(self, n, train, model, backoff, cutoff, verbose)
293
294 if train:
--> 295 self._train(train, cutoff, verbose)
296
297 def encode_json_obj(self):
~\Anaconda3\lib\site-packages\nltk\tag\sequential.py in _train(self, tagged_corpus, cutoff, verbose)
181 fd = ConditionalFreqDist()
182 for sentence in tagged_corpus:
--> 183 tokens, tags = zip(*sentence)
184 for index, (token, tag) in enumerate(sentence):
185 # Record the event.
ValueError: not enough values to unpack (expected 2, got 1)
It says the variable you are passing has one value to unpack, but you are expecting two.
Ex:
for a, b in [("a", "b")]:
    print("a:", a, "b:", b)
This will work.
for a, b in [("a")]:
    print("a:", a, "b:", b)
This will not work.
Edit:
Look at your UnigramTagger. Its first argument must be a list of tagged sentences, of type
list(list(tuple(str, str)))
You are giving it train2_sents, which is just a list of untagged sentence strings (list(str)) produced by nltk.sent_tokenize, so there are no (token, tag) pairs for the tagger to unpack.
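For illustration only (the sample tokens and tags below are hypothetical, not from the Gujarati corpus), this is the shape UnigramTagger expects for training and tagging:
import nltk

# UnigramTagger expects a list of tagged sentences: list(list(tuple(str, str))).
train_sents = [
    [("I", "PRON"), ("read", "VERB"), ("books", "NOUN")],
    [("She", "PRON"), ("writes", "VERB")],
]
unigram_tagger = nltk.UnigramTagger(train_sents)

# tag() takes a list of tokens (one sentence), not a list of raw sentence strings.
print(unigram_tagger.tag(["She", "read", "books"]))
# [('She', 'PRON'), ('read', 'VERB'), ('books', 'NOUN')]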

issues storing and extracting arrays in numpy file

I am trying to store arrays in a numpy file; however, when I try to extract them and use them, I get an error about setting an array element with a sequence.
These are the two arrays; I'm unsure which one is causing the issue.
X = [[1,2,3],[4,5,6],[7,8,9]]
y = [0,1,2,3,4,5,6....]
When I retrieve them and use them, the values come back as:
X: array(list[1,2,3],list[4,5,6],list[7,8,9])
y = array([0,1,2,3,4,5...])
Here is the code:
vectors = np.array(X)
labels = np.array(y)
When retrieving them and running t-SNE:
visualisations = TSNE(n_components=2).fit_transform(X,y)
I get the following error:
ValueError Traceback (most recent call last)
<ipython-input-11-244f99341167> in <module>()
----> 1 visualisations = TSNE(n_components=2).fit_transform(X,y)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\manifold\t_sne.py in fit_transform(self, X, y)
856 Embedding of the training data in low-dimensional space.
857 """
--> 858 embedding = self._fit(X)
859 self.embedding_ = embedding
860 return self.embedding_
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\manifold\t_sne.py in _fit(self, X, skip_num_points)
658 else:
659 X = check_array(X, accept_sparse=['csr', 'csc', 'coo'],
--> 660 dtype=[np.float32, np.float64])
661 if self.method == 'barnes_hut' and self.n_components > 3:
662 raise ValueError("'n_components' should be inferior to 4 for the "
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
431 force_all_finite)
432 else:
--> 433 array = np.array(array, dtype=dtype, order=order, copy=copy)
434
435 if ensure_2d:
ValueError: setting an array element with a sequence.
Assuming I understand you correctly, you need to package the first group in a list; something like this:
import numpy as np
#X = [[1,2,3],[4,5,6],[7,8,9]]
#y = [0,1,2,3,4,5,6, 7, 8, 9]
X = np.array([[1,2,3],[4,5,6],[7,8,9]])
y = np.array([0,1,2,3,4,5, 6, 7, 8, 9])
array(list[1,2,3],list[4,5,6],list[7,8,9])
is a 1d object dtype array. To get that from
[[1,2,3],[4,5,6],[7,8,9]]
requires more than np.array([[1,2,3],[4,5,6],[7,8,9]]); either the list elements have to vary in size, or you have to initialize an object array and copy the list values to it.
In any case, fit_transform cannot handle that kind of array; it expects a 2d array with a numeric dtype. Notice the parameters to the check_array call in the traceback.
If all the list elements of X are the same size, then
X = np.stack(X)
should turn it into a 2d numeric array.
I suspect X was that 1d object array type before saving. By itself save/load should not turn a 2d numeric array into an object one.
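A minimal sketch of that repair path (my own illustration; the object array below is built deliberately to mimic what a ragged or pickled list can produce):
import numpy as np
from sklearn.manifold import TSNE

# Deliberately build a 1d object array of lists, mimicking what the question shows.
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
X = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    X[i] = row
y = np.array([0, 1, 2])

# fit_transform rejects the object array ("setting an array element with a sequence"),
# but stacking the equal-length rows gives a 2d numeric array it can use.
X = np.stack(X).astype('float64')               # shape (3, 3)
emb = TSNE(n_components=2, perplexity=2).fit_transform(X, y)
print(emb.shape)                                # (3, 2)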

Sympy TypeError: cannot determine truth value of Relational when using sympy

I'm learning SymPy. Here is the problem I ran into:
from sympy import *

x = symbols('x', real=True)
h = symbols('h', real=True)
f = symbols('f', cls=Function)
f_diff = f(x).diff(x, 1)
sym_dexpr = f_diff.subs(f(x), x*exp(-x**2)).doit()
expr_diff = as_finite_diff(f_diff, [x, x - h, x - 2*h, x - 3*h])
w = Wild('w')
c = Wild('c')
patterns = [arg.match(c*f(w)) for arg in expr_diff.args]
coefficients = [t[c] for t in sorted(patterns, key=lambda t: t[w])]
print(coefficients)
But I got the following error:
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 coefficients = [t[c] for t in sorted(patterns, key=lambda t:t[w])]
      2 print(coefficients)
C:\Program Files\Anaconda3\lib\site-packages\sympy\core\relational.py in __nonzero__(self)
    193
    194     def __nonzero__(self):
--> 195         raise TypeError("cannot determine truth value of Relational")
    196
    197     __bool__ = __nonzero__
TypeError: cannot determine truth value of Relational
I am using Windows 7, Python 3.5.2 and Anaconda 3.
Thank you.
The problem is the sort you perform on patterns.
sorted(patterns, key=lambda t:t[w]) attempts to return patterns sorted by every item's value for the key w, yet these values can not be compared with each other.
Why is that? Because they are "relational" values, meaning they depend on the values of the variables in them. Let's check:
>>> [t[w] for t in patterns]
[-h + x, -3*h + x, -2*h + x, x]
Is -h + x greater than -3*h + x, or the other way around? Well, that depends on what h and x are, and since SymPy can't determine the order of these values, you get an error.
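One way around it (my own suggestion, assuming h is meant as a positive step size so the stencil points have a natural order) is to sort by a numeric key, substituting concrete values into each matched expression:
# Note: as_finite_diff matches the question's (older) SymPy; in newer versions
# use f(x).diff(x).as_finite_difference([...]) instead.
from sympy import Function, Wild, symbols, as_finite_diff

x, h = symbols('x h', real=True)
f = symbols('f', cls=Function)
f_diff = f(x).diff(x, 1)
expr_diff = as_finite_diff(f_diff, [x, x - h, x - 2*h, x - 3*h])

w, c = Wild('w'), Wild('c')
patterns = [arg.match(c*f(w)) for arg in expr_diff.args]

# Substituting x=0, h=1 turns x, x-h, x-2*h, x-3*h into 0, -1, -2, -3,
# which Python can compare, so the sort no longer asks SymPy to decide
# the truth of a Relational.
ordered = sorted(patterns, key=lambda t: t[w].subs({x: 0, h: 1}))
coefficients = [t[c] for t in ordered]
print(coefficients)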
