Lookup table not working in training data of Rasa NLU - nlp

I have examples for a particular intent that also mark the entity, and I want the model to recognize other words that could be entities for that intent, but it fails to recognize them.
## intent: frequency
* what is the frequency of [region](field)?
* what's the frequency of [region](field)?
* frequency of [region](field)?
* [region](field)s frequency?
* [region](field) frequency?
* frequency [region](field)?
## lookup: field
* price
* phone type
* region
So when I enter the text "What is the frequency of region?" I get the output
{'intent': {'name': 'frequency', 'confidence': 0.9517087936401367},
'entities': [{'start': 17, 'end': 23, 'value': 'region',
'entity': 'field', 'confidence': 0.9427971487440825,
'extractor': 'CRFEntityExtractor'}], 'text': 'What is the frequency of region?'}
but when I enter the text "What is the frequency of price?" I get the output
{'intent': {'name': 'frequency', 'confidence': 0.9276150465011597},
'entities': [], 'text': 'What is the frequency of price?'}

According to the Rasa NLU documentation, for lookup tables to work you need to include a few examples from the lookup table in your training data.
Also note that "phone type" and "region" are different patterns: "phone type" has two words, whereas "region" is a single word. Keeping this in mind, I have extended your dataset as follows:
## intent: frequency
* what is the frequency of [region](field)?
* what is the frequency of [city](field)?
* what is the frequency of [work](field)?
* what's the frequency of [phone type](field)?
* what is the frequency of [phone type](field)?
* frequency of [region](field)?
* frequency of [phone type](field)?
* [region](field)s frequency?
* [region](field) frequency?
* frequency [region](field)?
Now when I tried all the examples you mentioned, they worked even though "price" was not included in the dataset, because the patterns were all covered.
Enter a message: What is the frequency of price?
{
"intent": {
"name": "frequency",
"confidence": 0.966820478439331
},
"entities": [
{
"start": 25,
"end": 30,
"value": "price",
"entity": "field",
"confidence": 0.7227365687405007,
"extractor": "CRFEntityExtractor"
}
]
}
I recommend using https://github.com/rodrigopivi/Chatito for generating a simple dataset; it makes things easier and generates synonyms etc. automatically.
Also, in case you don't know, you can point to large lookup tables stored in files, such as:
## lookup: city
data/lookups/city_lookup.txt
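Such a lookup file is just a plain text list with one entry per line, for example (illustrative contents for data/lookups/city_lookup.txt, not from the original post):
London
Paris
New York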

Use the following pipeline in config.yml:
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: CRFEntityExtractor
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    entity_recognition: False
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

Related

NaN reward after hyperparameter optimization (ray, gym)

I launched a HyperOpt search on a custom Gym environment.
This is my code:
from ray import air, tune
from ray.tune.search.hyperopt import HyperOptSearch

config = {
    "env": "affecta",
    "sgd_minibatch_size": 1000,
    "num_sgd_iter": 100,
    "lr": tune.uniform(5e-6, 5e-2),
    "lambda": tune.uniform(0.6, 0.99),
    "vf_loss_coeff": tune.uniform(0.6, 0.99),
    "kl_target": tune.uniform(0.001, 0.01),
    "kl_coeff": tune.uniform(0.5, 0.99),
    "entropy_coeff": tune.uniform(0.001, 0.01),
    "clip_param": tune.uniform(0.4, 0.99),
    "train_batch_size": 200,  # episode length
    # "monitor": True,
    # "model": {"free_log_std": True},
    "num_workers": 6,
    "num_gpus": 0,
    # "rollout_fragment_length": 3,
    # "batch_mode": "complete_episodes",
}
current_best_params = [{
    'lr': 5e-4,
}]
config = explore(config)  # explore() is a custom helper defined elsewhere
optimizer = HyperOptSearch(metric="episode_reward_mean", mode="max", n_initial_points=20, random_state_seed=7, space=config)
# optimizer = ConcurrencyLimiter(optimizer, max_concurrent=4)
tuner = tune.Tuner(
    "PPO",
    tune_config=tune.TuneConfig(
        # metric="episode_reward_mean",  # the metric we want to study
        # mode="max",                    # maximize the metric
        search_alg=optimizer,
        # num_samples repeats the entire config 'num_samples' times == number of trials in the 'Status' output
        num_samples=10,
    ),
    run_config=air.RunConfig(stop={"training_iteration": 3}, local_dir="test_avec_inoffensifs"),
    # limits the number of iterations for each hyperparameter combination
)
results = tuner.fit()
The problem is that the dataframes returned at each iteration of the hyperopt algorithm contain nan values for rewards...
I tried using several environments, and it is still the same.
Thank you in advance :)
The returned rewards are independent of the HP optimization algorithm.
If the train_batch_size is 200 but you have tiny rollout fragment lengths, you probably run into an issue related to num_workers*rollout_fragment_length only being 18. So you collect very few samples (18!) on every iteration, train on them, but there is never a full episode to calculate the mean reward from, even after three iterations.
Collecting complete episodes, a larger rollout_fragment_length and/or a lower train_batch_size should do the trick.
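A minimal sketch of that kind of adjustment (values are illustrative, not tuned) would be to sample whole episodes and make each worker's fragment at least one episode long, so that num_workers * rollout_fragment_length actually contains complete episodes:
# illustrative sampling settings, applied to the config dict from the question
config.update({
    "batch_mode": "complete_episodes",
    "rollout_fragment_length": 200,  # one full episode per worker, matching the episode length above
})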

HuggingFace-Transformers --- NER single sentence/sample prediction

I am trying to predict with the NER model, as in the tutorial from huggingface (it contains only the training+evaluation part).
I am following this exact tutorial here : https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb
The training works flawlessly, but the problems that I have begin when I try to predict on a simple sample.
from transformers import AutoTokenizer, AutoModel

model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
loaded_model = AutoModel.from_pretrained('./my_model_own_custom_training.pth',
                                         from_tf=False)
input_sentence = "John Nash is a great mathematician, he lives in France"
tokenized_input_sentence = tokenizer([input_sentence],
                                     truncation=True,
                                     is_split_into_words=False,
                                     return_tensors='pt')
predictions = loaded_model(tokenized_input_sentence["input_ids"])[0]
The predictions tensor is of shape (1, 13, 768).
How can I arrive at a final result of the form [John <-> 'B-PER', … France <-> 'B-LOC'], where B-PER and B-LOC are two ground-truth labels representing the tags for a person and a location respectively?
The result of the prediction is:
torch.Size([1, 13, 768])
If I write:
print(predictions.argmax(axis=2))
tensor([613, 705, 244, 620, 206, 206, 206, 620, 620, 620, 477, 693, 308])
I get the tensor above.
However I would have expected to get the tensor representing the ground truth [0…8] labels from the ground truth annotations.
Summary when loading the model:
loading configuration file ./my_model_own_custom_training.pth/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForTokenClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6,
    "LABEL_7": 7,
    "LABEL_8": 8
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.8.1",
  "vocab_size": 30522
}
The answer is a bit trickier than expected (huge credits to Niels Rogge).
Firstly, loading models in huggingface-transformers can be done in (at least) two ways:
AutoModel.from_pretrained('./my_model_own_custom_training.pth', from_tf=False)
AutoModelForTokenClassification.from_pretrained('./my_model_own_custom_training.pth', from_tf=False)
It seems that, depending on the task at hand, different AutoModel subclasses need to be used. In the scenario I posted, it is AutoModelForTokenClassification that has to be used.
After that, a solution to obtain the predictions would be to do the following:
# tokenize the sentence and do a forward pass through the token-classification model
encoding = tokenizer(input_sentence, return_tensors="pt")
outputs = model(**encoding)
logits = outputs.logits          # shape: (1, sequence_length, num_labels)
predictions = logits.argmax(-1)  # predicted label id for each sub-token
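To turn those ids into tag names such as B-PER or B-LOC, they can be mapped through the model's id2label (a sketch; it assumes the model was saved with the real tag names rather than the generic LABEL_0 … LABEL_8 shown above, and note that each wordpiece gets its own prediction):
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist())
predicted_labels = [model.config.id2label[p.item()] for p in predictions[0]]
for token, label in zip(tokens, predicted_labels):
    print(token, label)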

How to set the label names when using the Huggingface TextClassificationPipeline?

I am using a fine-tuned Huggingface model (on my company data) with the TextClassificationPipeline to make class predictions. The labels that this pipeline predicts default to LABEL_0, LABEL_1 and so on. Is there a way to supply the label mappings to the TextClassificationPipeline object so that the output reflects them?
Env:
tensorflow==2.3.1
transformers==4.3.2
Sample Code:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # or any {'0', '1', '2'}
from transformers import TextClassificationPipeline, TFAutoModelForSequenceClassification, AutoTokenizer
MODEL_DIR = "path\to\my\fine-tuned\model"
# Feature extraction pipeline
model = TFAutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
pipeline = TextClassificationPipeline(model=model,
                                      tokenizer=tokenizer,
                                      framework='tf',
                                      device=0)
result = pipeline("It was a good watch. But a little boring.")[0]
Output:
In [2]: result
Out[2]: {'label': 'LABEL_1', 'score': 0.8864616751670837}
The simplest way to add such a mapping is to edit the model's config.json so that it contains an id2label field, as below:
{
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "id2label": {
    "0": "negative",
    "1": "positive"
  },
  "attention_dropout": 0.1,
  .
  .
}
An in-code way to set this mapping is to pass the id2label param in the from_pretrained call, as below:
model = TFAutoModelForSequenceClassification.from_pretrained(MODEL_DIR, id2label={0: 'negative', 1: 'positive'})
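With the mapping supplied, re-creating the pipeline with this model should then report the readable names (a sketch; the exact score depends on your fine-tuned model):
pipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer, framework='tf', device=0)
result = pipeline("It was a good watch. But a little boring.")[0]
# expected form: {'label': 'positive', 'score': ...}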
Here is the GitHub issue I raised to get this added to the documentation of transformers.XForSequenceClassification.

Identify the best GridSearchCV scoring metric for food prediction in XGBoost

I am using GridSearchCV to find the best parameters for tuning XGBoost for a food prediction algorithm.
I am struggling to identify the scoring metric that leads to the best profit (sales margin minus wastage costs), as this is ultimately what I am looking for. Running the script below and evaluating on data I reserved for testing, I noticed that optimizing for R2 seems to yield a higher profit than optimizing for RMSE, but I cannot find an explanation that would guide me to the best scoring method.
Here some infos on the situation:
It costs me 6 USD to produce the product and I sell it for 9 USD, so my margin is 3 USD. My wastage cost is therefore 6 USD multiplied by (production minus sales quantity), whereas my earnings are the sales quantity multiplied by 3 USD.
Example: I produce 100, sell 70 and waste 30, so my earnings are 70*3 - 30*6 = 30.
So I have an imbalance between sales and wastage.
Main Question: Which scoring metric puts a higher penalty weight on the over-prediction?
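For reference, the profit logic above can be written as a custom scoring function (a sketch, assuming production equals the predicted quantity and sales are capped by actual demand) that could be plugged into GridSearchCV via make_scorer:
import numpy as np
from sklearn.metrics import make_scorer

def profit(y_true, y_pred):
    sales = np.minimum(y_true, y_pred)         # can only sell what was both produced and demanded
    waste = np.clip(y_pred - y_true, 0, None)  # over-production is wasted
    return np.sum(3 * sales - 6 * waste)       # 3 USD margin per sale, 6 USD per wasted unit

profit_scorer = make_scorer(profit, greater_is_better=True)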
My current code:
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

X = consumption[feature_names]
y = consumption['Meal1']
data_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    'min_child_weight': [1, 2],
    'gamma': [0.05, 0.06],
    'colsample_bytree': [0.22, 0.23],
    'n_estimators': range(28, 29),
    'max_depth': range(3, 8),
    'reg_alpha': range(1, 2),
    'reg_lambda': range(1, 2),
    'subsample': [0.7, 0.8, 0.9],
    'learning_rate': [0.1, 0.2],
}
fixed_params = {'objective': 'reg:squarederror', 'booster': 'gbtree'}
# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor(**fixed_params)
# Perform grid search: grid_mse
grid_mse = GridSearchCV(estimator=gbm, param_grid=gbm_param_grid, scoring="r2", cv=5, verbose=1)
# Fit grid_mse to the data
grid_mse.fit(X, y)
# Print the best parameters and the best cross-validated score
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest Score found: ", np.sqrt(np.abs(grid_mse.best_score_)))

Issue with Monte Carlo analysis with uncertainty on LCIA

I am trying to run a Monte Carlo analysis with uncertainty on the characterization factors. The code runs without errors, but the results are identical for every iteration. The calculation works fine for a plain LCA (without Monte Carlo).
Here is the code:
Definition of a sample LCIA method
some_exchange = bw.Database('biosphere3').random()
my_cf = [(some_exchange.key,
{"amount": 10,
"uncertainty_type": 4,
"minimum": 0,
"maximum": 20}
)]
uncertain_method = bw.Method(("fake", "method", "with uncertainty"))
uncertain_method.write(my_cf)
Definition of a simple activity
simple_LCI_db = bw.Database('simple LCI db')
simple_LCI_db.write(
{('simple LCI db', 'some_code'):
{'name': 'fake activity',
'unit': 'amount',
'exchanges':
[
{'input': ('simple LCI db', 'some_code'),
'amount': 1,
'type': 'production'},
{'input': some_exchange.key,
'amount': 1,
'type': 'biosphere'},
]
},
})
Monte Carlo code
mc = bw.MonteCarloLCA({('simple LCI db', 'some_code'):1}, ('fake', 'method', 'with uncertainty'))
next(mc)
Is there something wrong with the uncertainty definition?
Thanks for your help!
You simply need to define your uncertainty dictionary slightly differently: in Brightway, the uncertainty type key is written without the underscore, i.e.
my_cf = [(some_exchange.key,
{"amount": 10,
"uncertainty type": 4, #and not "uncertainty_type"
"minimum": 0,
"maximum": 20}
)]
You can see the schema for the uncertainty dictionary in the Brightway documentation.
You wrote it the way it is defined in the stats_arrays documentation. I do not know why they differ, i.e. why one uses "uncertainty type" and the other "uncertainty_type", but just remove the underscore and your code will work.
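As a quick check (a sketch reusing the objects defined above), rewriting the method with the corrected key and drawing a few Monte Carlo samples should now give varying scores:
uncertain_method.write(my_cf)  # my_cf now uses "uncertainty type"
mc = bw.MonteCarloLCA({('simple LCI db', 'some_code'): 1}, ('fake', 'method', 'with uncertainty'))
print([next(mc) for _ in range(5)])  # expect five different LCIA scores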
