Issue with Monte Carlo analysis with uncertainty on LCIA

I am trying to run a Monte Carlo analysis with uncertainty on the characterization factors. The code runs without error, but the results are identical for every iteration. The calculation works fine as a plain LCA.
Here is the code:
Definition of a sample LCIA method:
some_exchange = bw.Database('biosphere3').random()
my_cf = [(some_exchange.key,
          {"amount": 10,
           "uncertainty_type": 4,
           "minimum": 0,
           "maximum": 20}
          )]
uncertain_method = bw.Method(("fake", "method", "with uncertainty"))
uncertain_method.write(my_cf)
Definition of a simple activity:
simple_LCI_db = bw.Database('simple LCI db')
simple_LCI_db.write(
    {('simple LCI db', 'some_code'):
        {'name': 'fake activity',
         'unit': 'amount',
         'exchanges': [
             {'input': ('simple LCI db', 'some_code'),
              'amount': 1,
              'type': 'production'},
             {'input': some_exchange.key,
              'amount': 1,
              'type': 'biosphere'},
         ]
        },
    })
Monte Carlo code:
mc = bw.MonteCarloLCA({('simple LCI db', 'some_code'): 1},
                      ('fake', 'method', 'with uncertainty'))
next(mc)
Is there something wrong with the uncertainty definition?
Thanks for your help!

You simply need to define your uncertainty dictionary slightly differently: in Brightway, the uncertainty type key is written without the underscore, i.e.
my_cf = [(some_exchange.key,
          {"amount": 10,
           "uncertainty type": 4,  # and not "uncertainty_type"
           "minimum": 0,
           "maximum": 20}
          )]
You can see the schema for the uncertainty dictionary in the Brightway framework in the Brightway documentation.
You wrote it the way it is defined in the stats_arrays documentation. I do not know why the two differ, i.e. why one uses "uncertainty type" and the other "uncertainty_type", but just remove the underscore and your code will work.
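As a quick check, here is a minimal sketch (reusing the objects defined in the question) of the corrected characterization factor and a few Monte Carlo iterations; the scores should now differ between iterations:
my_cf = [(some_exchange.key,
          {"amount": 10,
           "uncertainty type": 4,  # space instead of underscore
           "minimum": 0,
           "maximum": 20}
          )]
uncertain_method = bw.Method(("fake", "method", "with uncertainty"))
uncertain_method.write(my_cf)

mc = bw.MonteCarloLCA({('simple LCI db', 'some_code'): 1},
                      ('fake', 'method', 'with uncertainty'))
print([next(mc) for _ in range(5)])  # five different scores expected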

Related

Tune hyperparameters of XGBRanker

I am trying to optimize the hyperparameters of my XGBoost Ranker model, but I can't get it to work.
Here is what my table (df in the code) looks like:
query | relevance | features
------|-----------|----------
1     | 5         | 5.4.7....
1     | 3         | 6........
2     | 5         | 3........
2     | 3         | 8........
3     | 2         | 1........
Then I split my table into train and test sets, with only one query in the test set:
from sklearn.model_selection import GroupShuffleSplit

gss = GroupShuffleSplit(test_size=1, n_splits=1).split(df, groups=df['query'])
X_train_inds, X_test_inds = next(gss)

train_data = df.iloc[X_train_inds]
X_train = train_data.drop(columns=["relevance"])
Y_train = train_data.relevance

test_data = df.iloc[X_test_inds]
X_test = test_data.drop(columns=["relevance"])
Y_test = test_data.relevance
and build the groups array, which holds the number of rows per query:
groups = train_data.groupby('query').size().to_frame('size')['size'].to_numpy()
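For illustration (my own check, ignoring the train/test split and assuming all five example rows land in train_data), the grouping logic above would give:
print(groups)  # array([2, 2, 1]) -- two rows for query 1, two for query 2, one for query 3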
And then I run my model and try to optimize the hyperparameters with RandomizedSearchCV:
import sklearn.metrics
import xgboost as xgb
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

param_dist = {'n_estimators': randint(40, 1000),
              'learning_rate': uniform(0.01, 0.59),
              'subsample': uniform(0.3, 0.6),
              'max_depth': [3, 4, 5, 6, 7, 8, 9],
              'colsample_bytree': uniform(0.5, 0.4),
              'min_child_weight': [0.05, 0.1, 0.02]
              }

scoring = sklearn.metrics.make_scorer(sklearn.metrics.ndcg_score, k=10,
                                      greater_is_better=True)

model = xgb.XGBRanker(
    tree_method='hist',
    booster='gbtree',
    objective='rank:ndcg')

clf = RandomizedSearchCV(model,
                         param_distributions=param_dist,
                         cv=5,
                         n_iter=5,
                         scoring=scoring,
                         error_score=0,
                         verbose=3,
                         n_jobs=-1)

clf.fit(X_train, Y_train, group=groups)
I then get the following error message, which seems to be related to my construction of groups, but I don't see why (note that without the random search the model works):
Check failed: group_ptr_.back() == num_row_ (11544 vs. 9235) : Invalid group structure. Number of rows obtained from groups doesn't equal to actual number of rows given by data.
Same problem as here: Tuning XGBRanker produces error for groups.

HuggingFace-Transformers --- NER single sentence/sample prediction

I am trying to predict with an NER model, following the tutorial from huggingface (it contains only the training + evaluation part).
I am following this exact tutorial: https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb
The training works flawlessly, but my problems begin when I try to predict on a simple sample.
model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
loaded_model = AutoModel.from_pretrained('./my_model_own_custom_training.pth',
                                         from_tf=False)

input_sentence = "John Nash is a great mathematician, he lives in France"
tokenized_input_sentence = tokenizer([input_sentence],
                                     truncation=True,
                                     is_split_into_words=False,
                                     return_tensors='pt')
predictions = loaded_model(tokenized_input_sentence["input_ids"])[0]
predictions is of shape (1, 13, 768).
How can I arrive at a final result of the form [JOHN <-> 'B-PER', … France <-> 'B-LOC'], where B-PER and B-LOC are two ground-truth labels, representing the tags for a person and a location respectively?
The result of the prediction is:
torch.Size([1, 13, 768])
If I write:
print(predictions.argmax(axis=2))
tensor([613, 705, 244, 620, 206, 206, 206, 620, 620, 620, 477, 693, 308])
I get the tensor above.
However, I would have expected a tensor of label ids in [0…8], matching the ground-truth annotations.
Summary when loading the model:
loading configuration file ./my_model_own_custom_training.pth/config.json
Model config DistilBertConfig {
  "name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForTokenClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6,
    "LABEL_7": 7,
    "LABEL_8": 8
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights": true,
  "transformers_version": "4.8.1",
  "vocab_size": 30522
}
The answer is a bit trickier than expected (huge credit to Niels Rogge).
Firstly, loading models in huggingface-transformers can be done in (at least) two ways:
AutoModel.from_pretrained('./my_model_own_custom_training.pth', from_tf=False)
AutoModelForTokenClassification.from_pretrained('./my_model_own_custom_training.pth', from_tf=False)
It seems that, depending on the task at hand, different AutoModel classes need to be used. In the scenario I posted, it is AutoModelForTokenClassification that has to be used.
After that, a solution to obtain the predictions would be to do the following:
# forward pass (`encoding` is the tokenizer output, e.g. tokenized_input_sentence above)
outputs = model(**encoding)
logits = outputs.logits
predictions = logits.argmax(-1)
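To then go from label ids to tag strings, here is a minimal sketch of my own (reusing model, encoding, and predictions from above, and assuming the fine-tuned config carries a meaningful id2label mapping; note the tokens are wordpieces, including the special [CLS]/[SEP] tokens):
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist())
predicted_labels = [model.config.id2label[p.item()] for p in predictions[0]]
print(list(zip(tokens, predicted_labels)))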

Hyperopt tuning parameters get stuck

I'm trying to tune the parameters of an SVM with the hyperopt library.
Often, when I execute this code, the progress bar stops and the code gets stuck.
I do not understand why.
Here is my code:
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn import svm
from sklearn.metrics import precision_recall_fscore_support
from sklearn.preprocessing import normalize

X_train = normalize(X_train)

def hyperopt_train_test(params):
    if 'decision_function_shape' in params:
        if params['decision_function_shape'] == "ovo":
            params['break_ties'] = False
    clf = svm.SVC(**params)
    y_pred = clf.fit(X_train, y_train).predict(X_test)
    return precision_recall_fscore_support(y_test, y_pred, average='macro')[0]

space4svm = {
    'C': hp.uniform('C', 0, 20),
    'kernel': hp.choice('kernel', ['linear', 'sigmoid', 'poly', 'rbf']),
    'degree': hp.uniform('degree', 10, 30),
    'gamma': hp.uniform('gamma', 10, 30),
    'coef0': hp.uniform('coef0', 15, 30),
    'shrinking': hp.choice('shrinking', [True, False]),
    'probability': hp.choice('probability', [True, False]),
    'tol': hp.uniform('tol', 0, 3),
    'decision_function_shape': hp.choice('decision_function_shape', ['ovo', 'ovr']),
    'break_ties': hp.choice('break_ties', [True, False])
}

def f(params):
    print(params)
    precision = hyperopt_train_test(params)
    return {'loss': -precision, 'status': STATUS_OK}

trials = Trials()
best = fmin(f, space4svm, algo=tpe.suggest, max_evals=35, trials=trials)
print('best:')
print(best)
I would suggest restricting your parameter space and seeing if that works. Fix the probability parameter to False and check whether the model trains. Also note that, per the documentation, gamma takes either {'scale', 'auto'} or a non-negative float.
Also, print out your params at every iteration to better understand which combination is causing the model to get stuck.
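For example, a minimal sketch of such a restricted space (reusing the objective f from the question; the concrete ranges and choices below are illustrative assumptions, not tuned values):
space4svm_restricted = {
    'C': hp.uniform('C', 0, 20),
    'kernel': hp.choice('kernel', ['linear', 'rbf']),
    'gamma': hp.choice('gamma', ['scale', 'auto']),    # values allowed by the docs
    'probability': hp.choice('probability', [False]),  # fixed to False
    'tol': hp.uniform('tol', 1e-4, 1e-2),
}
trials = Trials()
best = fmin(f, space4svm_restricted, algo=tpe.suggest, max_evals=35, trials=trials)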

Optimization of predictions from sklearn model (e.g. RandomForestRegressor)

Has anyone used any optimization method on top of fitted sklearn models?
What I'd like to do is fit a model on training data and, using this model, find the combination of input values for which the model predicts the largest value.
Some example, simplified code:
import pandas as pd

df = pd.DataFrame({
    'temperature': [10, 15, 30, 20, 25, 30],
    'working_hours': [10, 12, 12, 10, 30, 15],
    'sales': [4, 7, 6, 7.3, 10, 8]
})

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
X = df.drop(['sales'], axis=1)
y = df['sales']
model.fit(X, y)
Our baseline is a simple loop that predicts over all combinations of the variables:
import numpy as np

results = pd.DataFrame(columns=['temperature', 'working_hours', 'sales_predicted'])

for temp in np.arange(1, 100.01, 1):
    for work_hours in np.arange(1, 60.01, 1):
        results = pd.concat([
            results,
            pd.DataFrame({
                'temperature': temp,
                'working_hours': work_hours,
                'sales_predicted': model.predict(np.array([temp, work_hours]).reshape(1, -1))
            })
        ])

print(results.sort_values(by='sales_predicted', ascending=False))
With that approach it's difficult or impossible to:
* do it fast (brute-force method)
* implement constraints involving dependencies between two or more variables
We tried the PuLP and Pyomo libraries, but neither allows passing model.predict as an objective function; both return the error:
TypeError: float() argument must be a string or a number, not 'LpVariable'
Does anyone have an idea how we can get rid of the loop and use something else?
When people talk about optimizing fitted sklearn models, they usually mean maximizing accuracy or other performance metrics. So if you are trying to maximize your predicted value, you can definitely make your code more efficient, as shown below.
You are collecting all the predictions in a big results dataframe and then sorting it. Instead, you can track the best value of your target variable (sales_predicted) on the fly with a simple if check. Just change your loop to this:
max_sales_predicted = 0
for temp in np.arange(1, 100.01, 1):
    for work_hours in np.arange(1, 60.01, 1):
        sales_predicted = model.predict(np.array([temp, work_hours]).reshape(1, -1))
        if sales_predicted > max_sales_predicted:
            max_sales_predicted = sales_predicted
            desired_temp = temp
            desired_work_hours = work_hours
This way you only keep a specification when it produces a prediction that exceeds the current best, and otherwise do nothing.
The result of my code is the same as yours, i.e. a max_sales_predicted value of 9.2. desired_temp and desired_work_hours now give you the specification that produces that maximum. Hope this helps.
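A further sketch of my own (not part of the original answer): since the model is already fitted, you can also build the whole grid once and call model.predict a single time, which is typically much faster than predicting row by row inside the loop.
temps, hours = np.meshgrid(np.arange(1, 100.01, 1), np.arange(1, 60.01, 1))
grid = np.column_stack([temps.ravel(), hours.ravel()])  # all (temperature, working_hours) pairs
preds = model.predict(grid)                             # one batched prediction
best_temp, best_hours = grid[preds.argmax()]
print(best_temp, best_hours, preds.max())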

Lookup table not working in training data of Rasa NLU

I have examples for a particular intent that also mark the entity, and I want the model to recognize other words from the lookup table as entities for that intent, but it fails to recognize them.
## intent: frequency
* what is the frequency of [region](field)?
* what's the frequency of [region](field)?
* frequency of [region](field)?
* [region](field)s frequency?
* [region](field) frequency?
* frequency [region](field)?
## lookup: field
* price
* phone type
* region
So when I enter the text "What is the frequency of region?" I get the output
{'intent': {'name': 'frequency', 'confidence': 0.9517087936401367},
'entities': [{'start': 17, 'end': 23, 'value': 'region',
'entity': 'field', 'confidence': 0.9427971487440825,
'extractor': 'CRFEntityExtractor'}], 'text': 'What is the frequency of region?'}
but when I enter the text "What is the frequency of price?" I get the output
{'intent': {'name': 'frequency', 'confidence': 0.9276150465011597},
'entities': [], 'text': 'What is the frequency of price?'}
According to the Rasa NLU documentation, in order for lookup tables to work, you need to include a few examples from the lookup table in your training data.
Also, note that "phone type" and "region" follow different patterns, because "phone type" has two words while "region" is a single word. Keeping this in mind, I have extended your dataset as follows:
## intent: frequency
* what is the frequency of [region](field)?
* what is the frequency of [city](field)?
* what is the frequency of [work](field)?
* what's the frequency of [phone type](field)?
* what is the frequency of [phone type](field)?
* frequency of [region](field)?
* frequency of [phone type](field)?
* [region](field)s frequency?
* [region](field) frequency?
* frequency [region](field)?
Now when I tried all the examples you mentioned, they worked even though "price" was not included in the dataset, because the patterns were all covered.
Enter a message: What is the frequency of price?
{
  "intent": {
    "name": "frequency",
    "confidence": 0.966820478439331
  },
  "entities": [
    {
      "start": 25,
      "end": 30,
      "value": "price",
      "entity": "field",
      "confidence": 0.7227365687405007,
      "extractor": "CRFEntityExtractor"
    }
  ]
}
I recommend using https://github.com/rodrigopivi/Chatito for generating simple datasets; it makes things easier and generates synonyms etc. automatically.
Also, just in case you don't know, you can point a lookup to a file for large lists, such as
## lookup:city
data/lookups/city_lookup.txt
Use the following pipeline in config.yml:
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: CRFEntityExtractor
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    entity_recognition: False
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
