How to encode an empty string using BERT

I have recently been trying to encode an empty string with CamemBERT (a BERT model for French), but I wasn't sure how to do that. If I try to simply encode an empty string,
from transformers import CamembertModel, CamembertTokenizer
import torch
tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
camembert = CamembertModel.from_pretrained("camembert-base")
tokenized_sentence = tokenizer.tokenize("")
encoded_sentence = tokenizer.encode(tokenized_sentence, return_tensors='pt')
embeddings = camembert(encoded_sentence)
embeddings.last_hidden_state.squeeze()[0] # embedding of the CLS token
I get the error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-21-553400f369a8> in <module>
1 # Tokenize in sub-words with SentencePiece
2 tokenized_sentence = tokenizer.tokenize("")
----> 3 encoded_sentence = tokenizer.encode(tokenized_sentence, return_tensors='pt')
4 embeddings = camembert(encoded_sentence)
5 embeddings.last_hidden_state.squeeze()[0] # embeddings.last_hidden_state[0][0]
~/anaconda3/envs/r_nlp2/lib/python3.8/site-packages/transformers/tokenization_utils_base.py in encode(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, return_tensors, **kwargs)
2057 ``convert_tokens_to_ids`` method).
2058 """
-> 2059 encoded_inputs = self.encode_plus(
2060 text,
2061 text_pair=text_pair,
~/anaconda3/envs/r_nlp2/lib/python3.8/site-packages/transformers/tokenization_utils_base.py in encode_plus(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2376 )
2377
-> 2378 return self._encode_plus(
2379 text=text,
2380 text_pair=text_pair,
~/anaconda3/envs/r_nlp2/lib/python3.8/site-packages/transformers/tokenization_utils.py in _encode_plus(self, text, text_pair, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
459 )
460
--> 461 first_ids = get_input_ids(text)
462 second_ids = get_input_ids(text_pair) if text_pair is not None else None
463
~/anaconda3/envs/r_nlp2/lib/python3.8/site-packages/transformers/tokenization_utils.py in get_input_ids(text)
446 )
447 else:
--> 448 raise ValueError(
449 f"Input {text} is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
450 )
ValueError: Input [] is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
Which I think is expected behavior. I have also tried spaCy's French transformer model, but was unsuccessful with that as well. Here's the code I used for spaCy:
from transformers import BertTokenizer, BertModel
import spacy
#!python -m spacy download fr_dep_news_trf
trf_fr = spacy.load("fr_dep_news_trf")
example = trf_fr("")
example._.trf_data.tensors[1].flatten() # embedding of the CLS token
And the error is
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-27-c53de04d2e6f> in <module>
1 example = trf_fr("")
----> 2 example._.trf_data.tensors[1].flatten()
IndexError: list index out of range
simply because the model returns [].
I guess that at this point my question is theoretical: what would be the best, or at least a good, way to encode an empty string using CamemBERT or spaCy? Would "forcing" the model to return a vector of zeros be a good thing? Would returning "impossible" values such as (10, ..., 10) be a good possibility? Should I force the tokenizer to create a sequence of [PAD] tokens? In that case, how would I implement it using spaCy and/or CamemBERT?
Thanks!
PS : I'm using
Python 3.8.10
spaCy 3.0.6
transformers 4.6.1
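For reference, a minimal sketch of one possible workaround, assuming the versions above: pass the raw empty string to the tokenizer instead of the empty token list produced by tokenizer.tokenize(""). The tokenizer then emits only the special <s> and </s> tokens, so the model still returns a hidden state for the CLS position.
from transformers import CamembertModel, CamembertTokenizer
import torch
tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
camembert = CamembertModel.from_pretrained("camembert-base")
# Encoding the raw string "" (not a token list) is accepted; with the default
# add_special_tokens=True the sequence contains only <s> and </s>.
encoded_sentence = tokenizer("", return_tensors="pt")
with torch.no_grad():
    embeddings = camembert(**encoded_sentence)
cls_embedding = embeddings.last_hidden_state[0, 0]  # embedding of the <s>/CLS token
print(cls_embedding.shape)  # expected: torch.Size([768]) for camembert-base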

Related

sklearn DictVectorizer() throwing error with a dictionary as input

I'm fairly new to sklearn's DictVectorizer, and am trying to create a function where DictVectorizer will output feature names from a list of bigrams that I have used to form a feature dictionary. The input to my function is a string, and the function should return a list of bigrams formed into dictionaries (something like this).
def features(str) -> List[Dict[Text, Union[Text, int]]]:
    # my feature dictionary should have 'bigram' as the key, and the values will be the bigrams themselves. your feature dict needs to have "bigram" as a key
    # bigram: a form of "w[i]-w[i+1]"
    # This is my bigram list (as structured above)
    bigrams: List[Dict[Text, Union[Text, int]]] = []
    # here is my code:
    bigrams = {'bigram': i for j in sentence for i in zip(j.split(" ")[:-1], j.split(" ")[1:])}
    return bigrams
vect = DictVectorizer(sparse=False)
text = str()
feature_catalog = features(text)
vect.fit(feature_catalog)
print(sorted(vectorizer.get_feature_names_out()))
Everything works fine until the code advances to the DictVectorizer blocks (hidden in the class itself). This is what I get:
AttributeError Traceback (most recent call last)
/var/folders/pl/k80fpf9s4f9_3rp8hnpw5x0m0000gq/T/ipykernel_3804/266218402.py in <module>
22 features = get_feature(text)
23
---> 24 vectorizer.fit(features)
25
26 print(sorted(vectorizer.get_feature_names()))
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/feature_extraction/_dict_vectorizer.py in fit(self, X, y)
159
160 for x in X:
--> 161 for f, v in x.items():
162 if isinstance(v, str):
163 feature_name = "%s%s%s" % (f, self.separator, v)
AttributeError: 'str' object has no attribute 'items'
Any ideas? This is ultimately going to be used as part of a larger processing effort on a corpus.
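For what it's worth, a minimal sketch of the input shape DictVectorizer expects, assuming the goal is one {'bigram': ...} dict per bigram (the example sentence is made up): fit needs an iterable of dicts, so returning a single dict makes it iterate over the dict's keys, which are strings, and that is what triggers 'str' object has no attribute 'items'.
from typing import Dict, List, Text
from sklearn.feature_extraction import DictVectorizer
def features(sentence: Text) -> List[Dict[Text, Text]]:
    words = sentence.split(" ")
    # One dict per bigram, with "bigram" as the key and "w[i]-w[i+1]" as the value.
    return [{"bigram": f"{w1}-{w2}"} for w1, w2 in zip(words[:-1], words[1:])]
vect = DictVectorizer(sparse=False)
feature_catalog = features("the quick brown fox")
vect.fit(feature_catalog)
print(sorted(vect.get_feature_names_out()))
# ['bigram=brown-fox', 'bigram=quick-brown', 'bigram=the-quick']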

Setting `remove_unused_columns=False` causes error in HuggingFace Trainer class

I am training a model using HuggingFace Trainer class. The following code does a decent job:
!pip install datasets
!pip install transformers
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, AutoTokenizer
dataset = load_dataset('glue', 'mnli')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)
def preprocess_function(examples):
    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True, padding=True)
encoded_dataset = dataset.map(preprocess_function, batched=True)
args = TrainingArguments(
    "test-glue",
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    remove_unused_columns=True
)
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    tokenizer=tokenizer
)
trainer.train()
However, setting remove_unused_columns=False results in the following error:
ValueError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
704 if not is_tensor(value):
--> 705 tensor = as_tensor(value)
706
ValueError: too many dimensions 'str'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
8 frames
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
720 )
721 raise ValueError(
--> 722 "Unable to create tensor, you should probably activate truncation and/or padding "
723 "with 'padding=True' 'truncation=True' to have batched tensors with the same length."
724 )
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
Any suggestions are highly appreciated.
It fails because the value in line 705 is a list of str, which corresponds to the hypothesis column, and hypothesis is one of the ignored_columns in trainer.py.
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
704 if not is_tensor(value):
--> 705 tensor = as_tensor(value)
See the below snippet from trainer.py for the remove_unused_columns flag:
def _remove_unused_columns(self, dataset: "datasets.Dataset", description: Optional[str] = None):
    if not self.args.remove_unused_columns:
        return dataset
    if self._signature_columns is None:
        # Inspect model forward signature to keep only the arguments it accepts.
        signature = inspect.signature(self.model.forward)
        self._signature_columns = list(signature.parameters.keys())
        # Labels may be named label or label_ids, the default data collator handles that.
        self._signature_columns += ["label", "label_ids"]
    columns = [k for k in self._signature_columns if k in dataset.column_names]
    ignored_columns = list(set(dataset.column_names) - set(self._signature_columns))
There could be a potential pull request to HuggingFace to provide a fallback option when the flag is False, but in general it looks like the flag's implementation is not complete; for example, it can't be used with TensorFlow.
In any case, it doesn't hurt to keep it True unless there is some special need.
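If you really do need remove_unused_columns=False, a minimal sketch of one way around the error, assuming the standard GLUE/MNLI column names (premise, hypothesis, idx): drop the raw string columns yourself after tokenization so that only tensor-convertible columns reach the data collator.
# Remove the raw text columns manually; the tokenized columns and the label remain.
encoded_dataset = encoded_dataset.remove_columns(["premise", "hypothesis", "idx"])
args = TrainingArguments(
    "test-glue",
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    remove_unused_columns=False
)
trainer = Trainer(model, args, train_dataset=encoded_dataset["train"], tokenizer=tokenizer)
trainer.train()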

Don't understand error message (basic sklearn command)

I'm new to Python and programming in general, and I wanted to practice a little bit with linear regression in one variable.
I'm currently following this tutorial:
https://www.youtube.com/watch?v=8jazNUpO3lQ&list=PLeo1K3hjS3uvCeTYTeyfe0-rN5r8zn9rw&index=2
and I am doing exactly what he does.
I did, however, encounter an error when running the code below
(for simplicity, I put '--' in front of the lines that are output; I used Jupyter Notebook).
At the end I encountered a long list of errors when running 'reg.predict(3300)'.
I don't understand what went wrong.
Can someone help me out?
Cheers!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
df = pd.read_csv("homeprices.csv")
df
--area price
0 2600 550000
1 3000 565000
2 3200 610000
3 3600 680000
4 4000 725000
%matplotlib inline
plt.xlabel('area(sqr ft)')
plt.ylabel('price(US$)')
plt.scatter(df.area, df.price, color='red', marker = '+')
--<matplotlib.collections.PathCollection at 0x2e823ce66a0>
reg = linear_model.LinearRegression()
reg.fit(df[['area']],df.price)
--LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
reg.predict(3300)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-ad5a8409ff75> in <module>
----> 1 reg.predict(3300)
~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in predict(self, X)
211 Returns predicted values.
212 """
--> 213 return self._decision_function(X)
214
215 _preprocess_data = staticmethod(_preprocess_data)
~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in _decision_function(self, X)
194 check_is_fitted(self, "coef_")
195
--> 196 X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
197 return safe_sparse_dot(X, self.coef_.T,
198 dense_output=True) + self.intercept_
~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
543 "Reshape your data either using array.reshape(-1, 1) if "
544 "your data has a single feature or array.reshape(1, -1) "
--> 545 "if it contains a single sample.".format(array))
546 # If input is 1D raise error
547 if array.ndim == 1:
ValueError: Expected 2D array, got scalar array instead:
array=3300.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Try reg.predict([[3300]]). The API used to allow scalar values, but now you need to give a 2D array.
In reg.fit(df[['area']], df.price), the features X were passed as a 2D array (a one-column DataFrame), so we need to pass a 2D array for X in reg.predict too. Hence:
reg.predict([[3300]])
"Expected 2D array, got scalar array instead" is written right in the error message, so kindly change the call to:
reg.predict([[3300]])
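As a short illustration, a sketch assuming the reg and df from the question above; either a nested list or an explicit reshape gives predict the 2D shape (n_samples, n_features) it expects.
import numpy as np
# Both calls pass a 2D array of shape (1, 1): one sample, one feature.
print(reg.predict([[3300]]))
print(reg.predict(np.array([3300]).reshape(-1, 1)))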

How to fix this ValueError?

I am trying to run Python code, mostly based on the NLTK book, for n-gram POS tagging of a Gujarati language text from my GujaratiTextCorpus. I encountered a ValueError.
I am working with Python 3.7.3 on Windows 10. I use Jupyter Notebook through Anaconda. I am a beginner in Python. I studied the answers available on stackoverflow.com to fix my ValueError, but could not solve it.
import nltk
f = open('C:\\Users\\BHOGAYATA\\Documents\\GujaratiPosTagging\\cts260.txt', encoding = 'utf8')
raw = f.read()
train2_sents = nltk.sent_tokenize(raw)
text2 = nltk.Text(train2_sents)
train2_sents
import nltk
f = open('C:\\Users\\BHOGAYATA\\Documents\\GujaratiPosTagging\\txt42_sents.txt', encoding = 'utf8')
raw = f.read()
bs_sents = nltk.sent_tokenize(raw)
text3 = nltk.Text(bs_sents)
bs_sents
unigram_tagger = nltk.UnigramTagger(train2_sents)
unigram_tagger.tag(bs_sents)
I expected that the words of the two Gujarati sentences would be POS tagged, but I got the following error message:
ValueError                                Traceback (most recent call last)
<ipython-input-3-5fae0b92393e> in <module>
11 text3 = nltk.Text(bs_sents)
12 bs_sents
---> 13 unigram_tagger = nltk.UnigramTagger(train2_sents)
14 unigram_tagger.tag(bs_sents)
15
~\Anaconda3\lib\site-packages\nltk\tag\sequential.py in __init__(self, train, model, backoff, cutoff, verbose)
344
345 def __init__(self, train=None, model=None, backoff=None, cutoff=0, verbose=False):
--> 346 NgramTagger.__init__(self, 1, train, model, backoff, cutoff, verbose)
347
348 def encode_json_obj(self):
~\Anaconda3\lib\site-packages\nltk\tag\sequential.py in __init__(self, n, train, model, backoff, cutoff, verbose)
293
294 if train:
--> 295 self._train(train, cutoff, verbose)
296
297 def encode_json_obj(self):
~\Anaconda3\lib\site-packages\nltk\tag\sequential.py in _train(self, tagged_corpus, cutoff, verbose)
181 fd = ConditionalFreqDist()
182 for sentence in tagged_corpus:
--> 183 tokens, tags = zip(*sentence)
184 for index, (token, tag) in enumerate(sentence):
185 # Record the event.
ValueError: not enough values to unpack (expected 2, got 1)
It says the variable you are passing has one value to unpack but you are expecting two.
Ex:
for a, b in [("a", "b")]:
    print("a:", a, "b:", b)
This will work.
for a, b in [("a")]:
    print("a:", a, "b:", b)
This will not work, because ("a") is just the string "a", so there is only one value to unpack.
Edit:
Look at your UnigramTagger. For its first argument it takes a list of tagged sentences of type
list(list(tuple(str, str)))
but you are giving it train2_sents, which comes from nltk.sent_tokenize and is therefore just a list of plain sentence strings:
list(str)
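A minimal sketch of the expected input format (the words and tags below are placeholders, not real training data): train on a list of tagged sentences, each a list of (word, tag) tuples, then call tag() on a plain list of tokens.
import nltk
# Training data: a list of sentences, each a list of (word, tag) tuples.
train_sents = [
    [("this", "DET"), ("is", "VERB"), ("a", "DET"), ("test", "NOUN")],
    [("another", "DET"), ("test", "NOUN")],
]
unigram_tagger = nltk.UnigramTagger(train_sents)
# tag() takes a list of word tokens, not a raw sentence string.
print(unigram_tagger.tag(["a", "test"]))
# [('a', 'DET'), ('test', 'NOUN')]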

Tree classifier to graphviz ERROR

I had made a tree classifier named model and tried to use the export_graphviz function like this:
export_graphviz(decision_tree=model,
                out_file='NT_model.dot',
                feature_names=X_train.columns,
                class_names=model.classes_,
                leaves_parallel=True,
                filled=True,
                rotate=False,
                rounded=True)
For some reason my run had raised this exception:
TypeError Traceback (most recent call last)
<ipython-input-298-40fe56bb0c85> in <module>()
6 filled=True,
7 rotate=False,
----> 8 rounded=True)
C:\Users\yonatanv\AppData\Local\Continuum\Anaconda3\lib\site-
packages\sklearn\tree\export.py in export_graphviz(decision_tree, out_file,
max_depth, feature_names, class_names, label, filled, leaves_parallel,
impurity, node_ids, proportion, rotate, rounded, special_characters)
431 recurse(decision_tree, 0, criterion="impurity")
432 else:
--> 433 recurse(decision_tree.tree_, 0,
criterion=decision_tree.criterion)
434
435 # If required, draw leaf nodes at same depth as each other
C:\Users\yonatanv\AppData\Local\Continuum\Anaconda3\lib\site-
packages\sklearn\tree\export.py in recurse(tree, node_id, criterion, parent,
depth)
319 out_file.write('%d [label=%s'
320 % (node_id,
--> 321 node_to_str(tree, node_id,
criterion)))
322
323 if filled:
C:\Users\yonatanv\AppData\Local\Continuum\Anaconda3\lib\site-
packages\sklearn\tree\export.py in node_to_str(tree, node_id, criterion)
289 np.argmax(value),
290 characters[2])
--> 291 node_string += class_name
292
293 # Clean up any trailing newlines
TypeError: ufunc 'add' did not contain a loop with signature matching types
dtype('<U90') dtype('<U90') dtype('<U90')
My hyperparameters for the visualization are these:
print(model)
DecisionTreeClassifier(class_weight={1.0: 10, 0.0: 1}, criterion='gini',
max_depth=7, max_features=None, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=50,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=0, splitter='best')
print(model.classes_)
[ 0. , 1. ]
Help would be most appreciated!
As you can see here in the documentation of export_graphviz, the class_names parameter works with strings, not floats or ints:
class_names : list of strings, bool or None, optional (default=None)
Try converting model.classes_ to a list of strings before passing it to export_graphviz.
Try class_names=['0', '1'] or class_names=['0.0', '1.0'] in the call to export_graphviz().
For a more general solution, use:
class_names=[str(x) for x in model.classes_]
But is there a specific reason you are passing float values as y in model.fit()? That is mostly not required in a classification task. Are your actual y labels floats, or are you converting string labels to numeric before fitting the model?
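Putting the two suggestions together, a sketch assuming the same model and X_train as in the question:
from sklearn.tree import export_graphviz
# class_names must be strings; model.classes_ holds floats here, so convert first.
export_graphviz(decision_tree=model,
                out_file='NT_model.dot',
                feature_names=X_train.columns,
                class_names=[str(x) for x in model.classes_],
                leaves_parallel=True,
                filled=True,
                rotate=False,
                rounded=True)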
