I am training a model using the HuggingFace Trainer class. The following code works as expected:
!pip install datasets
!pip install transformers
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, AutoTokenizer
dataset = load_dataset('glue', 'mnli')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)
def preprocess_function(examples):
    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True, padding=True)
encoded_dataset = dataset.map(preprocess_function, batched=True)
args = TrainingArguments(
    "test-glue",
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    remove_unused_columns=True
)
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    tokenizer=tokenizer
)
trainer.train()
However, setting remove_unused_columns=False results in the following error:
ValueError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
704 if not is_tensor(value):
--> 705 tensor = as_tensor(value)
706
ValueError: too many dimensions 'str'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
8 frames
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
720 )
721 raise ValueError(
--> 722 "Unable to create tensor, you should probably activate truncation and/or padding "
723 "with 'padding=True' 'truncation=True' to have batched tensors with the same length."
724 )
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
Any suggestions are highly appreciated.
It fails because the value at line 705 is a list of str, namely the raw hypothesis column. hypothesis is one of the ignored_columns in trainer.py, so with remove_unused_columns=False it is no longer dropped and reaches the tensor conversion.
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
704 if not is_tensor(value):
--> 705 tensor = as_tensor(value)
See the below snippet from trainer.py for the remove_unused_columns flag:
def _remove_unused_columns(self, dataset: "datasets.Dataset", description: Optional[str] = None):
    if not self.args.remove_unused_columns:
        return dataset
    if self._signature_columns is None:
        # Inspect model forward signature to keep only the arguments it accepts.
        signature = inspect.signature(self.model.forward)
        self._signature_columns = list(signature.parameters.keys())
        # Labels may be named label or label_ids, the default data collator handles that.
        self._signature_columns += ["label", "label_ids"]
    columns = [k for k in self._signature_columns if k in dataset.column_names]
    ignored_columns = list(set(dataset.column_names) - set(self._signature_columns))
A pull request to HuggingFace could add a fallback for the case where the flag is False. In general, though, the flag's implementation looks incomplete; for example, it can't be used with TensorFlow.
On the other hand, it doesn't hurt to keep it True unless there is some special need.
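To make the column filtering above concrete, here is a minimal standalone sketch of the same idea. The forward function and the column list are hypothetical stand-ins (a BERT-style signature and the MNLI columns after tokenization); this is not the actual Trainer code path:

```python
import inspect


def forward(input_ids=None, attention_mask=None, token_type_ids=None, labels=None):
    """Hypothetical stand-in for model.forward; only its signature matters here."""
    return None


def split_columns(column_names, forward_fn):
    """Split dataset columns into those the model accepts and those it would ignore."""
    signature_columns = list(inspect.signature(forward_fn).parameters.keys())
    # As in the trainer.py snippet, label columns are always kept.
    signature_columns += ["label", "label_ids"]
    kept = [c for c in column_names if c in signature_columns]
    ignored = [c for c in column_names if c not in signature_columns]
    return kept, ignored


# Assumed column names for the tokenized MNLI dataset.
columns = ["premise", "hypothesis", "label", "idx",
           "input_ids", "token_type_ids", "attention_mask"]
kept, ignored = split_columns(columns, forward)
print(kept)     # ['label', 'input_ids', 'token_type_ids', 'attention_mask']
print(ignored)  # ['premise', 'hypothesis', 'idx']
```

The string columns premise and hypothesis end up in the ignored list, which is why they have to be dropped (by the flag or by hand) before batches are converted to tensors.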
I have recently been trying to encode an empty string with CamemBERT (BERT model for French). I wasn't sure on how to do that. If I try to simply encode an empty string,
from transformers import CamembertModel, CamembertTokenizer
import torch
tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
camembert = CamembertModel.from_pretrained("camembert-base")
tokenized_sentence = tokenizer.tokenize("")
encoded_sentence = tokenizer.encode(tokenized_sentence, return_tensors='pt')
embeddings = camembert(encoded_sentence)
embeddings.last_hidden_state.squeeze()[0] # embedding of the CLS token
I get the error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-21-553400f369a8> in <module>
1 # Tokenize in sub-words with SentencePiece
2 tokenized_sentence = tokenizer.tokenize("")
----> 3 encoded_sentence = tokenizer.encode(tokenized_sentence, return_tensors='pt')
4 embeddings = camembert(encoded_sentence)
5 embeddings.last_hidden_state.squeeze()[0] # embeddings.last_hidden_state[0][0]
~/anaconda3/envs/r_nlp2/lib/python3.8/site-packages/transformers/tokenization_utils_base.py in encode(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, return_tensors, **kwargs)
2057 ``convert_tokens_to_ids`` method).
2058 """
-> 2059 encoded_inputs = self.encode_plus(
2060 text,
2061 text_pair=text_pair,
~/anaconda3/envs/r_nlp2/lib/python3.8/site-packages/transformers/tokenization_utils_base.py in encode_plus(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2376 )
2377
-> 2378 return self._encode_plus(
2379 text=text,
2380 text_pair=text_pair,
~/anaconda3/envs/r_nlp2/lib/python3.8/site-packages/transformers/tokenization_utils.py in _encode_plus(self, text, text_pair, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
459 )
460
--> 461 first_ids = get_input_ids(text)
462 second_ids = get_input_ids(text_pair) if text_pair is not None else None
463
~/anaconda3/envs/r_nlp2/lib/python3.8/site-packages/transformers/tokenization_utils.py in get_input_ids(text)
446 )
447 else:
--> 448 raise ValueError(
449 f"Input {text} is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
450 )
ValueError: Input [] is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
I think this is expected behavior. I have also tried spaCy's French transformer model, without success. Here's the code I used for spaCy:
from transformers import BertTokenizer, BertModel
import spacy
#!python -m spacy download fr_dep_news_trf
trf_fr = spacy.load("fr_dep_news_trf")
example = trf_fr("")
example._.trf_data.tensors[1].flatten() # embedding of the CLS token
And the error is
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-27-c53de04d2e6f> in <module>
1 example = trf_fr("")
----> 2 example._.trf_data.tensors[1].flatten()
IndexError: list index out of range
simply because the model returns [].
I guess that at this point, my question is theoretical: what would be the best, or at least a good, way to encode an empty string using CamemBERT or spaCy? Would "forcing" the model to return a vector of zeros be a good idea? Would returning "impossible" values such as (10, ..., 10) be an option? Should I force the tokenizer to create a sequence of [PAD] tokens? In that case, how would I implement it using spaCy and/or CamemBERT?
Thanks!
PS : I'm using
Python 3.8.10
spaCy 3.0.6
transformers 4.6.1
I'm trying to pass each of HuggingFace's ...ForMaskedLM models to FitBert for a fill-in-the-blank task, to see which pretrained model yields the best result on the data I've prepared. With the Reformer module, however, I get an error saying that I need to set 'config.is_decoder=False', and I don't really understand what this means (this is my first time using HuggingFace). I tried passing a ReformerConfig(is_decoder=False) to the model but still get the same error. How can I fix this?
My code:
pretrained_weights = ['google/reformer-crime-and-punishment',
                      'google/reformer-enwik8']
configurations = ReformerConfig(is_decoder=False)
for weight in pretrained_weights:
    print(weight)
    model = ReformerForMaskedLM(configurations).from_pretrained(weight)
    tokenizer = ReformerTokenizer.from_pretrained(weight)
    fb = FitBert(model=model, tokenizer=tokenizer)
    predicts = []
    for _, row in df.iterrows():
        predicts.append(fb.rank(row['question'], options=[row['1'], row['2'], row['3'], row['4']])[0])
    print(weight,':', np.sum(df.anwser==predicts) / df.shape[0])
Error:
AssertionError Traceback (most recent call last)
<ipython-input-5-a6016e0015ba> in <module>()
4 for weight in pretrained_weights:
5 print(weight)
----> 6 model = ReformerForMaskedLM(configurations).from_pretrained(weight)
7 tokenizer = ReformerTokenizer.from_pretrained(weight)
8 fb = FitBert(model=model, tokenizer=tokenizer)
/usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
1032
1033 # Instantiate model.
-> 1034 model = cls(config, *model_args, **model_kwargs)
1035
1036 if state_dict is None and not from_tf:
/usr/local/lib/python3.7/dist-packages/transformers/models/reformer/modeling_reformer.py in __init__(self, config)
2304 assert (
2305 not config.is_decoder
-> 2306 ), "If you want to use `ReformerForMaskedLM` make sure `config.is_decoder=False` for bi-directional self-attention."
2307 self.reformer = ReformerModel(config)
2308 self.lm_head = ReformerOnlyLMHead(config)
AssertionError: If you want to use `ReformerForMaskedLM` make sure `config.is_decoder=False` for bi-directional self-attention.
You can override individual configuration values by loading the model config separately and passing it as the config parameter of from_pretrained(). This ensures that the model is built from the checkpoint's configuration plus the changes you have made:
from transformers import ReformerConfig, ReformerForMaskedLM
config = ReformerConfig.from_pretrained('google/reformer-crime-and-punishment')
print(config.is_decoder)
config.is_decoder=False
print(config.is_decoder)
model = ReformerForMaskedLM.from_pretrained('google/reformer-crime-and-punishment', config=config)
Output:
True
False
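Why the original ReformerForMaskedLM(configurations).from_pretrained(weight) call did not help can be illustrated with toy classes. from_pretrained is a classmethod (the traceback shows its cls argument), so calling it on an instance does not use that instance's config. ToyConfig and ToyModel below are hypothetical stand-ins written for this sketch, not transformers classes:

```python
class ToyConfig:
    """Hypothetical stand-in for a model config."""

    def __init__(self, is_decoder=True):
        self.is_decoder = is_decoder


class ToyModel:
    """Hypothetical stand-in for a pretrained model class."""

    # Stands in for the value stored alongside the checkpoint.
    CHECKPOINT_IS_DECODER = True

    def __init__(self, config):
        self.config = config

    @classmethod
    def from_pretrained(cls, name, config=None):
        # A classmethod only sees the class, never the instance it was called on.
        if config is None:
            config = ToyConfig(is_decoder=cls.CHECKPOINT_IS_DECODER)
        return cls(config)


custom = ToyConfig(is_decoder=False)

# Pattern from the question: the instance built with `custom` is discarded.
wrong = ToyModel(custom).from_pretrained("some-checkpoint")
# Pattern from the answer: the config is passed to from_pretrained itself.
right = ToyModel.from_pretrained("some-checkpoint", config=custom)

print(wrong.config.is_decoder)  # True
print(right.config.is_decoder)  # False
```

In this sketch, only the second call carries the modified is_decoder value into the constructed model.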
I'm trying to use the zip function to pair the column names with the coefficients of the log_model I created, using np.transpose to reshape the coefficients.
My code:
# Create LogisticRegression model object
log_model = LogisticRegression()
# Fit our data into that object
log_model.fit(X,Y)
# Check your accuracy
log_model.score(X,Y)
This code worked just fine as I was able to check the accuracy of my model.
However, the following code is where I get my error.
Erroneous code:
coeff_df = DataFrame(zip(X.columns,np.transpose(log_model.coef_)))
Error message:
TypeError Traceback (most recent call last)
<ipython-input-147-a4e0ad234518> in <module>()
1 # Use zip to bring the column names and the np.transpose function to bring together the coefficients from the model
----> 2 coeff_df = DataFrame(zip(X.columns,np.transpose(log_model.coef_)))
~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
387 mgr = self._init_dict({}, index, columns, dtype=dtype)
388 elif isinstance(data, collections.Iterator):
--> 389 raise TypeError("data argument can't be an iterator")
390 else:
391 try:
TypeError: data argument can't be an iterator
What am I doing wrong? Sorry, newbie here. I'm following a Udemy tutorial on data visualization with Python. My lecturer is using Python 2, but I've been able to manage with Python 3 by researching and making the necessary conversions so that my code still works. Any suggestions would be greatly appreciated.
For binary classification, only one linear model is used. In the case of multiclass classification, there will be one linear model for each class. Assuming your X is a pd.DataFrame, you can proceed as follows:
output = pd.DataFrame(log_model.coef_, columns=X.columns)
Rows represent linear models for different classes, columns represent coefficients.
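As for the TypeError in the question itself: in Python 3, zip returns a lazy iterator rather than a list (as it did in Python 2), and the pandas version in the traceback rejects iterators. A minimal sketch of the difference, using only the standard library; the column names and coefficient values are made up for illustration:

```python
from collections.abc import Iterator

# Hypothetical stand-ins for X.columns and np.transpose(log_model.coef_).
columns = ["age", "income"]
coefs = [[0.5], [-1.2]]

pairs = zip(columns, coefs)
print(isinstance(pairs, Iterator))  # True: Python 3 zip is an iterator

# Materializing the iterator gives the list that Python 2's zip returned.
rows = list(zip(columns, coefs))
print(rows)  # [('age', [0.5]), ('income', [-1.2])]
```

Wrapping the zip call in list(...) before handing it to DataFrame is therefore the usual Python 3 conversion of the tutorial's code.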
I would like to perform a multinomial logistic regression but I can't set threshold and thresholds parameters correctly. Consider the following DF:
from pyspark.ml.linalg import DenseVector
test_train_df = (
sqlc
.createDataFrame([(0, DenseVector([-1.0, 1.2, 0.7])),
(0, DenseVector([3.1, -2.0, -2.9])),
(1, DenseVector([1.0, 0.8, 0.3])),
(1, DenseVector([4.2, 1.4, -1.7])),
(0, DenseVector([-1.9, 2.5, -2.3])),
(2, DenseVector([2.6, -0.2, 0.2])),
(1, DenseVector([0.3, -3.4, 1.8])),
(2, DenseVector([-1.0, -3.5, 4.7]))],
['label', 'features'])
)
My label has 3 classes, so I have to set thresholds (plural, whose default is None) rather than threshold (singular, whose default is 0.5). So I write:
from pyspark.ml import classification as cl
test_logit_abst = (
cl.LogisticRegression()
.setFamily('multinomial')
.setThresholds([.5, .5, .5])
)
Then I would like to fit the model on my DF:
test_logit = test_logit_abst.fit(test_train_df)
but when executing this last command I get an error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
~/anaconda3/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
Py4JJavaError: An error occurred while calling o3769.fit.
: java.lang.IllegalArgumentException: requirement failed: Logistic Regression found inconsistent values for threshold and thresholds. Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.
During handling of the above exception, another exception occurred:
IllegalArgumentException Traceback (most recent call last)
<ipython-input-211-8f3443f41b6b> in <module>()
----> 1 test_logit = test_logit_abst.fit(test_train_df)
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/base.py in fit(self, dataset, params)
62 return self.copy(params)._fit(dataset)
63 else:
---> 64 return self._fit(dataset)
65 else:
66 raise ValueError("Params must be either a param map or a list/tuple of param maps, "
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit(self, dataset)
263
264 def _fit(self, dataset):
--> 265 java_model = self._fit_java(dataset)
266 return self._create_model(java_model)
267
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit_java(self, dataset)
260 """
261 self._transfer_params_to_java()
--> 262 return self._java_obj.fit(dataset._jdf)
263
264 def _fit(self, dataset):
~/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
77 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
78 if s.startswith('java.lang.IllegalArgumentException: '):
---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
80 raise
81 return deco
IllegalArgumentException: 'requirement failed: Logistic Regression found inconsistent values for threshold and thresholds. Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.'
The error says threshold is set. This looks strange, as the documentation says that setting thresholds (plural) clears threshold (singular), so that the value 0.5 should be deleted.
So, how to clear threshold since no clearThreshold() exists?
In order to achieve this I tried to clear threshold this way:
logit_abst = (
cl.LogisticRegression()
.setFamily('multinomial')
.setThresholds([.5, .5, .5])
.setThreshold(None)
)
This time the fit command works, I even obtain the model intercept and coefficients:
test_logit.interceptVector
DenseVector([65.6445, 31.6369, -97.2814])
test_logit.coefficientMatrix
DenseMatrix(3, 3, [-76.4534, -19.4797, -79.4949, 12.3659, 4.642, 4.1057, 64.0876, 14.8377, 75.3892], 1)
But if I try to get thresholds (plural) from test_logit_abst I get an error:
test_logit_abst.getThresholds()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-214-fc1c8617ce80> in <module>()
----> 1 test_logit_abst.getThresholds()
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/classification.py in getThresholds(self)
363 if not self.isSet(self.thresholds) and self.isSet(self.threshold):
364 t = self.getOrDefault(self.threshold)
--> 365 return [1.0-t, t]
366 else:
367 return self.getOrDefault(self.thresholds)
TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'
What does this mean?
As a further detail, curiously (and incomprehensibly to me), inverting the order of the parameter settings produces the first error I posted above:
logit_abst = (
cl.LogisticRegression()
.setFamily('multinomial')
.setThreshold(None)
.setThresholds([.5, .5, .5])
)
Why does changing the order of the "set" instructions change the output as well?
It is a messy situation indeed...
The short answer is:
setThresholds (plural) not clearing the threshold (singular) seems to be a bug
For multinomial classification (i.e. number of classes > 2), setThresholds does not do what you expect (and arguably you don't need it)
If all you need is to keep the "thresholds" at their "default" value of 0.5, you don't have a problem - simply don't use any relevant argument or setThresholds statement
If you really need to apply different decision thresholds to different classes in multinomial classification, you will have to do it manually, by post-processing the respective probabilities, i.e. the probability column in the transformed dataframe (it works OK though with setThreshold(s) for binary classification)
And now for the long answer...
Let's start with binary classification, adapting the toy data from the docs:
spark.version
# u'2.2.0'
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
bdf = sc.parallelize([
    Row(label=1.0, features=Vectors.dense(0.0, 5.0)),
    Row(label=0.0, features=Vectors.dense(1.0, 2.0)),
    Row(label=1.0, features=Vectors.dense(2.0, 1.0)),
    Row(label=0.0, features=Vectors.dense(3.0, 3.0))]).toDF()
blor = LogisticRegression(threshold=0.7, thresholds=[0.3, 0.7])
We don't need to set thresholds (plural) here - threshold=0.7 is enough, but it will be useful when illustrating the differences with setThreshold below.
blorModel = blor.fit(bdf) # works OK
blor.getThreshold()
# 0.7
blor.getThresholds()
# [0.3, 0.7]
blorModel.transform(bdf).show(truncate=False) # transform the training data
Here is the result:
+---------+-----+------------------------------------------+----------------------------------------+----------+
|features |label|rawPrediction |probability |prediction|
+---------+-----+------------------------------------------+----------------------------------------+----------+
|[0.0,5.0]|1.0 |[-1.138455151184087,1.138455151184087] |[0.242604109995602,0.757395890004398] |1.0 |
|[1.0,2.0]|0.0 |[-0.6056346859838877,0.6056346859838877] |[0.35305562698104337,0.6469443730189567]|0.0 |
|[2.0,1.0]|1.0 |[0.26586039040308496,-0.26586039040308496]|[0.5660763559614698,0.4339236440385302] |0.0 |
|[3.0,3.0]|0.0 |[1.6453673835702176,-1.6453673835702176] |[0.8382639556951765,0.16173604430482344]|0.0 |
+---------+-----+------------------------------------------+----------------------------------------+----------+
What is the meaning of thresholds=[0.3, 0.7]? The answer lies in the 2nd row, where the prediction is 0.0 despite the fact that the probability is higher for 1.0 (0.65): 0.65 is indeed higher than 0.35, but it is lower than the threshold we have set for this class (0.7), hence the row is not classified as such.
Let's now try the seemingly identical operation, but with setThreshold(s) instead:
blor2 = (LogisticRegression()
.setThreshold(0.7)
.setThresholds([0.3, 0.7]) ) # works OK
blorModel2 = blor2.fit(bdf)
[...]
IllegalArgumentException: u'requirement failed: Logistic Regression getThreshold found inconsistent values for threshold (0.5) and thresholds (equivalent to 0.7)'
Nice, eh?
setThresholds (plural) seems indeed to have cleared our value of threshold (0.7) set in the previous line, as claimed in the docs, but it seemingly did so only to restore it to its default value of 0.5...
Omitting .setThreshold(0.7) gives the first error you report yourself (not shown).
Inverting the order of the parameter settings resolves the issue (!!!) and, moreover, renders both getThreshold (singular) and getThresholds (plural) operational (in contrast with your case):
blor2 = (LogisticRegression()
.setThresholds([0.3, 0.7])
.setThreshold(0.7) )
blorModel2 = blor2.fit(bdf) # works OK
blor2.getThreshold()
# 0.7
blor2.getThresholds()
# [0.30000000000000004, 0.7]
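The slightly odd 0.30000000000000004 is consistent with getThresholds deriving the pair from the singular threshold as [1.0-t, t], the expression visible in the traceback from the question. A plain-Python sketch of that expression (the helper name is made up for illustration) reproduces the value, and also the TypeError seen when threshold was set to None:

```python
def thresholds_from_threshold(t):
    """Mirror the [1.0-t, t] expression shown in the question's traceback."""
    return [1.0 - t, t]


print(thresholds_from_threshold(0.7))  # [0.30000000000000004, 0.7]
print(thresholds_from_threshold(0.8))  # [0.19999999999999996, 0.8]

# With threshold set to None, the subtraction itself fails.
try:
    thresholds_from_threshold(None)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)  # unsupported operand type(s) for -: 'float' and 'NoneType'
```

The extra digits are ordinary binary floating-point rounding, not a sign that a different threshold was stored.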
Let's move now to the multinomial case; we'll stick again to the example in the docs, with data from the Spark Github repo (they should also be available locally, in your $SPARK_HOME/data/mllib/sample_multiclass_classification_data.txt, but I am working on a Databricks notebook); it is a 3-class case, with labels in {0.0, 1.0, 2.0}.
data_path ="/FileStore/tables/sample_multiclass_classification_data.txt"
mdf = spark.read.format("libsvm").load(data_path)
Similarly to the binary case above, where the elements of our thresholds (plural) sum up to 1, let's ask for a threshold of 0.8 for class 2:
mlor = (LogisticRegression()
.setFamily("multinomial")
.setThresholds([0, 0.2, 0.8])
.setThreshold(0.8) )
mlorModel= mlor.fit(mdf) # works OK
mlor.getThreshold()
# 0.8
mlor.getThresholds()
# [0.19999999999999996, 0.8]
Looks fine, but let's ask for a prediction in the (training) dataset:
mlorModel.transform(mdf).show(truncate=False)
I have singled out only one row - it should be the 2nd from the end of the full output:
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
|label|features |rawPrediction |probability |prediction|
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
[...]
|0.0 |(4,[0,1,2,3],[0.111111,-0.333333,0.38983,0.166667]) |[36.67790353804905,-74.71196613173531,38.034062593686244]|[0.20486526556822454,8.619113376801409E-50,0.7951347344317755] |2.0 |
[...]
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
Scrolling to the right, you'll see that despite the fact that the probability for class 2.0 here (0.795) is below the threshold we have set (0.8), the row is indeed predicted as 2.0 - in contrast with the binary case demonstrated above...
So, what to do? Simply remove all the threshold-related statements; you don't need them - even setFamily is unnecessary, as the algorithm will detect by itself that you have more than 2 classes. This will give results identical to the above:
mlor = LogisticRegression() # works OK - no family, no threshold(s)
To summarize:
(1) In both the binary & multinomial cases, what is actually returned by the algorithm is a vector of probabilities of length equal to the number of classes, with elements summing up to 1.
(2) In the binary case only, Spark allows you to go one step further: instead of naively selecting the highest-probability class as the prediction, you can apply a user-defined threshold; this setting might be useful e.g. in cases with imbalanced data.
(3) This threshold(s) setting actually has no effect in the multinomial case, where Spark will always return as prediction the class with the highest probability.
Despite the mess in the documentation (about which I have argued elsewhere) and the possibility of some bugs, let me say about (3) that this design choice is not unjustifiable; as it has been nicely argued elsewhere (emphasis in the original):
the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
Although the above argument was made for the binary case, it fully holds for the multinomial one, too...
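The manual post-processing mentioned in the short answer could look roughly like the sketch below. The decision rule is an assumption chosen for illustration (pick the most probable class among those whose probability reaches their threshold, and fall back to the plain argmax if none does); it is not Spark's implementation, and the function name is hypothetical. It operates on plain lists such as the values found in the probability column:

```python
def predict_with_thresholds(probabilities, thresholds):
    """Return the index of the most probable class that meets its threshold.

    Assumed rule for illustration only: falls back to the plain argmax
    when no class reaches its threshold.
    """
    eligible = [i for i, (p, t) in enumerate(zip(probabilities, thresholds)) if p >= t]
    candidates = eligible if eligible else range(len(probabilities))
    return max(candidates, key=lambda i: probabilities[i])


# Binary rows from the table above, thresholds [0.3, 0.7].
print(predict_with_thresholds([0.242604109995602, 0.757395890004398], [0.3, 0.7]))   # 1
print(predict_with_thresholds([0.35305562698104337, 0.6469443730189567], [0.3, 0.7]))  # 0

# The multinomial row singled out above, thresholds [0, 0.2, 0.8].
row = [0.20486526556822454, 8.619113376801409e-50, 0.7951347344317755]
print(predict_with_thresholds(row, [0, 0.2, 0.8]))  # 0
```

Under this rule the two binary rows reproduce the predictions in the table, while the multinomial row is no longer assigned to class 2, because 0.795 falls short of the 0.8 threshold.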