HuggingFace - GPT2 Tokenizer configuration in config.json - pytorch

A fine-tuned GPT-2 model has been uploaded to the Hugging Face model hub for inference.
The following error is observed during inference:
Can't load tokenizer using from_pretrained, please update its configuration: Can't load tokenizer for 'bala1802/model_1_test'. Make sure that: - 'bala1802/model_1_test' is a correct model identifier listed on 'https://huggingface.co/models' - or 'bala1802/model_1_test' is the correct path to a directory containing relevant tokenizer files
Below is the config.json file for the fine-tuned Hugging Face model:
{
"_name_or_path": "gpt2",
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"gradient_checkpointing": false,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_inner": null,
"n_layer": 12,
"n_positions": 1024,
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"transformers_version": "4.3.2",
"use_cache": true,
"vocab_size": 50257
}
Should I also configure the GPT-2 tokenizer in the config.json file, similar to the "model_type": "gpt2" entry?

Your repository does not contain the files required to create a tokenizer; it looks like you have only uploaded the model files. Create an instance of the tokenizer you used for training and save the required files with save_pretrained():
from transformers import GPT2Tokenizer
# Recreate the tokenizer used during fine-tuning and write its files to disk
t = GPT2Tokenizer.from_pretrained("gpt2")
t.save_pretrained('/SOMEFOLDER/')
Output:
('/SOMEFOLDER/tokenizer_config.json',
'/SOMEFOLDER/special_tokens_map.json',
'/SOMEFOLDER/vocab.json',
'/SOMEFOLDER/merges.txt',
'/SOMEFOLDER/added_tokens.json')
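After saving, these files (vocab.json, merges.txt, tokenizer_config.json, special_tokens_map.json) need to be uploaded to the same model repository. As a minimal sketch, assuming a recent transformers version that provides push_to_hub and that you are logged in with huggingface-cli login and have write access to the repo, the tokenizer can also be pushed directly:
from transformers import GPT2Tokenizer
# Recreate the tokenizer that was used for fine-tuning
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Uploads vocab.json, merges.txt, tokenizer_config.json, etc. to the model repo
tokenizer.push_to_hub("bala1802/model_1_test")
Once the tokenizer files are present in the repo, AutoTokenizer.from_pretrained('bala1802/model_1_test') should load without the error above.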

Related

Understanding the config file of paraphrase mpnet base v2?

Here is the config file of the paraphrase-mpnet-base-v2 transformer model. I would like to understand, with examples, the meaning of the hidden_size and num_hidden_layers parameters.
{
"_name_or_path": "old_models/paraphrase-mpnet-base-v2/0_Transformer",
"architectures": [
"MPNetModel"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"eos_token_id": 2,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,
"model_type": "mpnet",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 1,
"relative_attention_num_buckets": 32,
"transformers_version": "4.7.0",
"vocab_size": 30527
}
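As a rough illustration (the hub checkpoint name sentence-transformers/paraphrase-mpnet-base-v2 is an assumption about where this config comes from), the two parameters can be read straight from the config: hidden_size is the width of every token's hidden representation, and num_hidden_layers is the number of stacked transformer encoder blocks.
from transformers import AutoConfig
# Load the same config from the hub (assumed checkpoint name)
config = AutoConfig.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
print(config.hidden_size)        # 768 -> each token is represented by a 768-dimensional vector
print(config.num_hidden_layers)  # 12  -> 12 transformer encoder layers are stacked on top of each other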

Unrecognized configuration class <class 'transformers.models.bert.configuration_bert.BertConfig'> for this kind of AutoModel: AutoModelForSeq2SeqLM

Model type should be one of BartConfig, PLBartConfig, BigBirdPegasusConfig, M2M100Config, LEDConfig, BlenderbotSmallConfig, MT5Config, T5Config, PegasusConfig, MarianConfig, MBartConfig, BartConfig, BlenderbotConfig, FSMTConfig, XLMProphetNetConfig, ProphetNetConfig, EncoderDecoderConfig.
I am trying to load a fine-tuned BERT model for machine translation using AutoModelForSeq2SeqLM, but it can't recognize the configuration class.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
# Fails with the error above: the checkpoint's config is a BertConfig
model = AutoModelForSeq2SeqLM.from_pretrained('/content/drive/MyDrive/Models/CSE498')
Config File
{
"_name_or_path": "ckiplab/albert-tiny-chinese",
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.0,
"bos_token_id": 101,
"classifier_dropout": null,
"classifier_dropout_prob": 0.1,
"down_scale_factor": 1,
"embedding_size": 128,
"eos_token_id": 102,
"gap_size": 0,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.0,
"hidden_size": 312,
"initializer_range": 0.02,
"inner_group_num": 1,
"intermediate_size": 1248,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"net_structure_type": 0,
"num_attention_heads": 12,
"num_hidden_groups": 1,
"num_hidden_layers": 4,
"num_memory_blocks": 0,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"tokenizer_class": "BertTokenizerFast",
"torch_dtype": "float32",
"transformers_version": "4.18.0",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 30522
}
This is because BERT itself is not a seq2seq (encoder-decoder) model, so AutoModelForSeq2SeqLM cannot build a model from its configuration. You can consider using a pre-trained BART instead.
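As a minimal sketch of that suggestion (facebook/bart-base is just one example checkpoint, not the asker's model), BART loads cleanly through AutoModelForSeq2SeqLM because its config class is one of the accepted encoder-decoder types:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# BART is an encoder-decoder model, so its BartConfig is accepted by AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")
If reusing the existing BERT weights matters, another option is to pair two checkpoints into a seq2seq model with EncoderDecoderModel.from_encoder_decoder_pretrained and fine-tune that for translation.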

Hyperopt spark 3.0 issues

I am running Databricks runtime 8.1 (includes Apache Spark 3.1.1, Scala 2.12) and trying to get Hyperopt working as described in
https://docs.databricks.com/applications/machine-learning/automl-hyperparam-tuning/hyperopt-spark-mlflow-integration.html
When I try to run
spark_trials = SparkTrials()
I get the following error:
py4j.Py4JException: Method maxNumConcurrentTasks([]) does not exist
Is there anything special I need to do to get this working?
Here is the cluster I am using
{
"autoscale": {
"min_workers": 1,
"max_workers": 2
},
"cluster_name": "mlops_tiny_ml",
"spark_version": "8.2.x-cpu-ml-scala2.12",
"spark_conf": {},
"aws_attributes": {
"first_on_demand": 1,
"availability": "SPOT_WITH_FALLBACK",
"zone_id": "us-west-2b",
"instance_profile_arn": "arn:aws:iam::112437402463:instance-profile/databricks_instance_role_s3",
"spot_bid_price_percent": 100,
"ebs_volume_type": "GENERAL_PURPOSE_SSD",
"ebs_volume_count": 3,
"ebs_volume_size": 100
},
"node_type_id": "m4.large",
"driver_node_type_id": "m4.large",
"ssh_public_keys": [],
"custom_tags": {},
"spark_env_vars": {},
"autotermination_minutes": 120,
"enable_elastic_disk": false,
"cluster_source": "UI",
"init_scripts": [],
"cluster_id": "0xxxxxt404"
}
This is the code I am using:
https://docs.databricks.com/applications/machine-learning/automl-hyperparam-tuning/hyperopt-model-selection.html
Hyperopt is only included in the Databricks ML runtimes, not in the standard runtimes. You can verify this by comparing the release notes of each runtime: DBR 8.1 vs. DBR 8.1 ML.
And from the docs:
Databricks Runtime for Machine Learning incorporates MLflow and Hyperopt, two open source tools that automate the process of model selection and hyperparameter tuning.
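For reference, here is a minimal sketch of the intended usage once the cluster runs an ML runtime (the objective function and search space are placeholders, not taken from the linked notebook):
from hyperopt import SparkTrials, fmin, hp, tpe
# Placeholder objective: minimise a simple quadratic
def objective(x):
    return (x - 3) ** 2
search_space = hp.uniform("x", -10, 10)
# SparkTrials distributes trials across the cluster's workers;
# it requires an ML runtime where Hyperopt and its Spark integration are installed
spark_trials = SparkTrials(parallelism=2)
best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
            max_evals=20, trials=spark_trials)
print(best)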

How to read predict() result in Tensorflowjs using a SavedModel

Code using tfjs-node:
const model = await tf.node.loadSavedModel(modelPath);
const data = fs.readFileSync(imgPath);
const tfimage = tf.node.decodeImage(data, 3);
const expanded = tfimage.expandDims(0);
const result = model.predict(expanded);
console.log(result);
for (const r of result) {
console.log(r.dataSync());
}
Output:
(8) [Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]
Float32Array(100) [48700, 48563, 48779, 48779, 49041, 48779, ...]
Float32Array(400) [0.10901492834091187, 0.18931034207344055, 0.9181075692176819, 0.8344497084617615, ...]
Float32Array(100) [61, 88, 65, 84, 67, 51, 62, 20, 59, 9, 18, ...]
Float32Array(9000) [0.009332209825515747, 0.003941178321838379, 0.0005068182945251465, 0.001926332712173462, 0.0020033419132232666, 0.000742495059967041, 0.022082984447479248, 0.0032682716846466064, 0.05071520805358887, 0.000018596649169921875, ...]
Float32Array(100) [0.6730095148086548, 0.1356855034828186, 0.12674063444137573, 0.12360832095146179, 0.10837388038635254, 0.10075071454048157, ...]
Float32Array(1) [100]
Float32Array(196416) [0.738592267036438, 0.4373246729373932, 0.738592267036438, 0.546840488910675, -0.010780575685203075, 0.00041256844997406006, 0.03478313609957695, 0.11279871314764023, -0.0504981130361557, -0.11237315833568573, 0.02907072752714157, 0.06638012826442719, 0.001794634386897087, 0.0009463857859373093, ...]
Float32Array(4419360) [0.0564018189907074, 0.016801774501800537, 0.025803595781326294, 0.011671125888824463, 0.014013528823852539, 0.008442580699920654, ...]
How do I read the predict() response for object detection? I was expecting a dictionary with num_detections, detection_boxes, detection_classes, etc. as described here.
I also tried using tf.execute(), but it throws the following error: UnhandledPromiseRejectionWarning: Error: execute() of TFSavedModel is not supported yet.
I'm using efficientdet/d0 downloaded from here.
dataSync() only returns the tensor's values. If you want the object describing each result (shape, dtype, and so on) rather than the raw values, just console.log(result). When you expand the result in the browser's console, each tensor should look something like this:
Tensor {
"dataId": Object {},
"dtype": "float32",
"id": 160213,
"isDisposedInternal": false,
"kept": false,
"rankType": "2",
"scopeId": 365032,
"shape": Array [
1,
3,
],
"size": 3,
"strides": Array [
3,
],
}
The output of your console.log(result) contains 8 tensors, which is expected. You are looping over each of the results, and the outputs should correspond to the following order:
['num_detections', 'detection_boxes', 'detection_classes', 'detection_scores', 'raw_detection_boxes', 'raw_detection_scores', 'detection_anchor_indices', 'detection_multiclass_scores']

How to load Stanfordnlp pipeline without printing the load processor messages

I am trying to get dependency relations for words using stanfordnlp. I have downloaded the English models and am able to load them to get the dependency relations for the words in a text. However, it also prints all of the load-process messages.
Sample code:
import stanfordnlp
config = {
'processors': 'tokenize,pos,lemma,depparse', # Comma-separated list of processors to use
'lang': 'en', # Language code for the language to build the Pipeline in
'tokenize_model_path': 'C:\\path\\stanfordnlp_resources\\en_ewt_models\\en_ewt_tokenizer.pt',
'pos_model_path': 'C:\\path\\stanfordnlp_resources\\en_ewt_models\\en_ewt_tagger.pt',
'pos_pretrain_path': 'C:\\path\\stanfordnlp_resources\\en_ewt_models\\en_ewt.pretrain.pt',
'lemma_model_path': 'C:\\path\\stanfordnlp_resources\\en_ewt_models\\en_ewt_lemmatizer.pt',
'depparse_model_path': 'C:\\path\\stanfordnlp_resources\\en_ewt_models\\en_ewt_parser.pt',
'depparse_pretrain_path': 'C:\\path\\stanfordnlp_resources\\en_ewt_models\\en_ewt.pretrain.pt'
}
text = 'The weather is nice today.'
# This downloads the English models for the neural pipeline
nlp = stanfordnlp.Pipeline(**config) # This sets up a default neural pipeline in English
doc = nlp(text)
doc.sentences[0].print_dependencies()
>>>
Use device: cpu
---
Loading: tokenize
With settings:
{'model_path': 'C:\\path\\stanfordnlp_resources\\en_ewt_models\\en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: pos
With settings:
{'model_path': 'C:\\path\\stanfordnlp_resources\\en_ewt_models\\en_ewt_tagger.pt', 'pretrain_path': 'C:\\path\\stanfordnlp_resources\\en_ewt_models\\en_ewt.pretrain.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: lemma
With settings:
{'model_path': 'C:\\path\\stanfordnlp_resources\\en_ewt_models\\en_ewt_lemmatizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Building an attentional Seq2Seq model...
Using a Bi-LSTM encoder
Using soft attention for LSTM.
Finetune all embeddings.
[Running seq2seq lemmatizer with edit classifier]
---
Loading: depparse
With settings:
{'model_path': 'C:\\path\\stanfordnlp_resources\\en_ewt_models\\en_ewt_parser.pt', 'pretrain_path': 'C:\\path\\stanfordnlp_resources\\en_ewt_models\\en_ewt.pretrain.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Done loading processors!
---
('The', '2', 'det')
('weather', '4', 'nsubj')
('is', '4', 'cop')
('nice', '0', 'root')
('today', '4', 'obl:tmod')
('.', '4', 'punct')
I installed stanfordnlp using Anaconda and am working in Jupyter notebooks. Is there a way to suppress these messages, since I only need the dependencies?
If you only want to get rid of those lines in Jupyter notebooks, you can simply clear the output right after calling the pipeline:
from IPython.display import clear_output
...
nlp = stanfordnlp.Pipeline(**config)
clear_output()
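If you also need this outside of notebooks (for example in a plain script), here is a sketch using only the standard library; it assumes the messages are written to stdout (redirect_stderr can be added the same way if some appear on stderr):
import contextlib
import io
import stanfordnlp
# Discard everything the pipeline prints to stdout while the processors load
with contextlib.redirect_stdout(io.StringIO()):
    nlp = stanfordnlp.Pipeline(**config)
doc = nlp('The weather is nice today.')
doc.sentences[0].print_dependencies()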
