Efficient way of calculating Monte Carlo results for different impact assessment methods in Brightway

I am trying to do a comparative Monte Carlo calculation with Brightway2 using different impact assessment methods. I thought about using the switch_method method to be more efficient, since the technosphere matrix is the same for a given iteration. However, I am getting an assertion error. Code to reproduce it could look like this:
import brightway2 as bw
bw.projects.set_current('ei35') # project with ecoinvent 3.5
db = bw.Database("ei_35cutoff")
# select two different transport activities to compare
activity_name = 'transport, freight, lorry >32 metric ton, EURO4'
for activity in bw.Database("ei_35cutoff"):
    if activity['name'] == activity_name:
        truckE4 = bw.Database("ei_35cutoff").get(activity['code'])
        print(truckE4['name'])
        break
activity_name = 'transport, freight, lorry >32 metric ton, EURO6'
for activity in bw.Database("ei_35cutoff"):
    if activity['name'] == activity_name:
        truckE6 = bw.Database("ei_35cutoff").get(activity['code'])
        print(truckE6['name'])
        break
demands = [{truckE4: 1}, {truckE6: 1}]
# impact assessment method:
recipe_midpoint = [method for method in bw.methods.keys()
                   if method[0] == "ReCiPe Midpoint (H)"]
mc_mm = bw.MonteCarloLCA(demands[0], recipe_midpoint[0])
next(mc_mm)
If I try to switch methods, I get the assertion error:
mc_mm.switch_method(recipe_midpoint[1])
assert mc_mm.method==recipe_midpoint[1]
mc_mm.redo_lcia()
next(mc_mm)
Am I doing something wrong here?

I usually store characterization factor matrices in a temporary dict and multiply these characterization factors with the LCI resulting from the MonteCarloLCA directly.
import brightway2 as bw
import numpy as np
# Generate objects for analysis
bw.projects.set_current("my_mcs")
my_db = bw.Database('db')
my_act = my_db.random()
my_demand = {my_act:1}
my_methods = [bw.methods.random() for _ in range(2)]
I wrote this simple function to get characterization factor matrices for the product system I will generate in the MonteCarloLCA. It uses a temporary "sacrificial LCA" object that will have the same A and B matrices as the MonteCarloLCA.
This may seem like a waste of time, but it is only done once, and it will make the Monte Carlo iterations quicker and simpler.
def get_C_matrices(demand, list_of_methods):
    """Return a dict with {method tuple: cf_matrix} for a list of methods.

    Uses a "sacrificial LCA" with exactly the same demand as will be used
    in the MonteCarloLCA.
    """
    C_matrices = {}
    sacrificial_LCA = bw.LCA(demand)
    sacrificial_LCA.lci()
    for method in list_of_methods:
        sacrificial_LCA.switch_method(method)
        C_matrices[method] = sacrificial_LCA.characterization_matrix
    return C_matrices
Then:
# Create array that will store mc results.
# Shape is (number of methods, number of iterations)
my_iterations = 10
mc_scores = np.empty(shape=[len(my_methods), my_iterations])
# Instantiate MonteCarloLCA object
my_mc = bw.MonteCarloLCA(my_demand)
# Get characterization factor matrices
my_C_matrices = get_C_matrices(my_demand, my_methods)
# Generate results
for iteration in range(my_iterations):
    next(my_mc)
    for i, m in enumerate(my_methods):
        mc_scores[i, iteration] = (my_C_matrices[m] * my_mc.inventory).sum()
All your results are in mc_scores. Each row corresponds to a method, each column to an MC iteration.
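For example, a quick way to summarize those results afterwards (just a sketch, reusing the names defined above):
# Mean and standard deviation of the scores for each method
for i, m in enumerate(my_methods):
    print(m, mc_scores[i].mean(), mc_scores[i].std())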

Not very elegant, but try this:
iterations = 10
simulations = []
for _ in range(iterations):
    mc_mm = bw.MonteCarloLCA(demands[0], recipe_midpoint[0])
    next(mc_mm)
    mcresults = []
    for i in demands:
        print(i)
        for m in recipe_midpoint[0:3]:
            mc_mm.switch_method(m)
            print(mc_mm.method)
            mc_mm.redo_lcia(i)
            print(mc_mm.score)
            mcresults.append(mc_mm.score)
    simulations.append(mcresults)
CC_truckE4 = [i[1] for i in simulations]  # Climate Change, truck E4
CC_truckE6 = [i[1+3] for i in simulations]  # Climate Change, truck E6
from matplotlib import pyplot as plt
plt.plot(CC_truckE4, CC_truckE6, 'o')
If you then run a test doing the simulation twice for the same demand vector, by setting demands = [{truckE4: 1}, {truckE4: 1}], and plot the result, you should get a straight line. This means that you are doing dependent sampling, re-using the same technosphere matrix for each demand vector and for each LCIA method. I am not 100% sure of this, but I hope it answers your question.
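As a sketch of that sanity check (it reuses the names from the snippet above; the second demand deliberately repeats truckE4):
demands = [{truckE4: 1}, {truckE4: 1}]
# re-run the simulation loop above with these demands, then:
plt.plot(CC_truckE4, CC_truckE6, 'o')  # with dependent sampling, the points fall on y = x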

Related

Pyspark Kernel Density Estimation over multiple groups in parallel

Given a dataset consisting of money transactions, I am trying to use kernel density estimation to form clusters of transactions by their transaction amount. To do this, I identify the local minima of the density and use these as boundaries for the different clusters. I am able to do this on the whole dataset.
However, now I want to use KDE again, but on groups of data. That is, I want to estimate a separate kernel density for each group of transactions. The transactions are grouped on the basis of the counter-party bank account from which they are sent. Currently, I use a naive approach where I just loop over all counter parties. However, this is very inefficient, and as I am using Spark I would like to be able to do this in parallel. I am not sure how to do this, as I am quite new to PySpark.
Any suggestions on how to do this?
Code that executes KDE over all data
import numpy as np
from pyspark.sql import functions as f
from pyspark.mllib.stat import KernelDensity
from scipy.signal import argrelextrema
from matplotlib.pyplot import plot
from bisect import bisect

dat_rdd = sdf_pos.select("amount").rdd
dat_rdd_amounts = dat_rdd.map(lambda x: float(x[0]))
kd = KernelDensity()
kd.setBandwidth(10.0)
kd.setSample(dat_rdd_amounts)
s = np.linspace(0, 3000, num=50)
e = kd.estimate(s)
mi = argrelextrema(e, np.less)[0]
print("Minima:", s[mi])
minima_array = f.array([f.lit(i) for i in s[mi]])
user_func = f.udf(bisect)
sdf_pos = sdf_pos.withColumn("amount_group",
                             user_func(minima_array, f.col("amount")).cast('integer'))
Code that executes KDE separately for each group
counter_parties = sdf_pos.select("CP").distinct().collect()
sdf_pos = sdf_pos.withColumn("minima_array", f.array(f.lit(-1)))
dat_rdd = sdf_pos.select(["amount", "CP"]).rdd
for cp in counter_parties:
    dat_rdd_amounts = dat_rdd.filter(lambda y: y[1] == cp[0]).map(lambda x: float(x[0]))
    kd = KernelDensity()
    kd.setBandwidth(10.0)
    kd.setSample(dat_rdd_amounts)
    s = np.linspace(0, 3000, num=50)
    e = kd.estimate(s)
    mi = argrelextrema(e, np.less)[0]
    minima_array = f.array([f.lit(i) for i in s[mi]])
    sdf_pos = sdf_pos.withColumn("minima_array",
                                 f.when(f.col("CP") == cp[0], minima_array).otherwise(f.col("minima_array")))
user_func = f.udf(bisect)
sdf_pos = sdf_pos.withColumn("amount_group", user_func(f.col("minima_array"), f.col("amount")))

Generate High, Medium, Low categories from a skewed distribution

I have been working on a Churn Prediction use case in Python using XGBoost. The model, trained on parameters like Age, Tenure, last 6 months' income, etc., predicts whether an employee is likely to leave, looked up by employee ID.
Additionally, if users want to see why the ML system categorised an employee as such, they can see the features that contributed to this, which are extracted from the model via the eli5 library.
So to make this more explainable to the users, we had created some ranges for each feature:
Tenure (in days)
[0-100] = High Risk
[101-300] = Medium Risk
[301-800] = Low Risk
To define these ranges we've analysed the distribution of each feature and manually defined the ranges for our use in the system. We saw the impact of each feature on the target variable IsTerminated in the training data. An example is the Tenure distribution plot, where the green bars represent the employees who were terminated or left and the pink ones those who didn't.
So the question is: as time passes and new data is added to the model, such features' risk ranges will change. In the case of Tenure, if an employee has a tenure of 780 days, after a month their tenure feature would show 810. Obviously, we keep the upper end of "Low Risk" open ended. But the real problem is: how can we define the internal boundaries / ranges programmatically?
EDIT: Thanks for the clarification. I have changed the answer.
It is important to realize that you are trying to project a selection in multi-dimensional space into a 1D space. You will not always be able to see a clear separation like the one you got. There are also various ways to do this; here I made a simple example that could help your client interpret the model, but it does not, of course, represent the full complexity of the model.
You did not provide any sample data, so I will generate some from the breast cancer dataset.
First let's import what we need:
from sklearn import datasets
from xgboost import XGBClassifier
import pandas as pd
import numpy as np
Now import the dataset and train a very simple XGBoost model:
cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target
xgb_model = XGBClassifier(n_estimators=5,
                          objective="binary:logistic",
                          random_state=42)
xgb_model.fit(X, y)
There are multiple ways to solve this.
One approach is to bin on the probability given by the model: you decide which probabilities you consider to be "High Risk", "Medium Risk" and "Low Risk", and the intervals of the data can be classified accordingly. In this example I considered low to be 0 <= p <= 0.5, medium to be 0.5 < p <= 0.8, and high to be 0.8 < p <= 1.
First you have to calculate the probability for each prediction. I would suggest using the test set for that, to avoid bias from possible model overfitting.
y_prob = pd.DataFrame(xgb_model.predict_proba(X))[0]
df = pd.DataFrame(X, columns=cancer.feature_names)
# Stores the probability of a malignant cancer
df['probability'] = y_prob
Then you have to bin your data and calculate average probabilities for each of those bins. I would suggest binning using np.histogram_bin_edges's automatic bin calculation:
def calculate_mean_prob(feat):
    """Calculates mean probability for a feature value, binning it."""
    # Bins from the automatic rules from numpy, check docs for details
    bins = np.histogram_bin_edges(df[feat], bins='auto')
    binned_values = pd.cut(df[feat], bins)
    return df['probability'].groupby(binned_values).mean()
Now you can classify each bin following what you would consider to be a low/medium/high probability:
def classify_probability(prob, medium=0.5, high=0.8, fillna_method='ffill'):
    """Classify the output of each bin into a risk group,
    according to the probability.
    Following these rules:
        0 <= p <= medium: Low risk
        medium < p <= high: Medium risk
        high < p <= 1: High Risk
    If a bin has no entries, it will be filled using fillna with the method
    specified in fillna_method.
    """
    risk = pd.cut(prob, [0., medium, high, 1.0], include_lowest=True,
                  labels=['Low Risk', 'Medium Risk', 'High Risk'])
    risk.fillna(method=fillna_method, inplace=True)
    return risk
This will return you the risk for each bin that you divided your data. Since you will probably have multiple bins that have consecutive values, you might want to merge the consecutive pd.Interval bins. The code for that is shown below:
def sum_interval(i1, i2):
    if i2 is None:
        return None
    if i1.right == i2.left:
        return pd.Interval(i1.left, i2.right)
    return None

def sum_intervals(args):
    """Given a list of pd.Intervals,
    returns a list summing consecutive intervals."""
    result = list()
    current_interval = args[0]
    for next_interval in list(args[1:]) + [None]:
        # Try to sum the current interval and the next interval.
        # The None is necessary for the last interval.
        sum_int = sum_interval(current_interval, next_interval)
        if sum_int is not None:
            # Update the current_interval in case it is
            # possible to sum
            current_interval = sum_int
        else:
            # Otherwise start a new interval
            result.append(current_interval)
            current_interval = next_interval
    if len(result) == 1:
        return result[0]
    return result
def combine_bins(df):
    # Group them by label
    grouped = df.groupby(df).apply(lambda x: sorted(list(x.index)))
    # Sum each category in intervals, if consecutive
    merged_intervals = grouped.apply(sum_intervals)
    return merged_intervals
Now you can combine all the functions to calculate the bins for each feature:
def generate_risk_class(feature, medium=0.5, high=0.8):
    mean_prob = calculate_mean_prob(feature)
    classification = classify_probability(mean_prob, medium=medium, high=high)
    merged_bins = combine_bins(classification)
    return merged_bins
For example, generate_risk_class('worst radius') results in:
Low Risk (7.93, 17.3]
Medium Risk (17.3, 18.639]
High Risk (18.639, 36.04]
But in case you get features which are not so good discriminators (or that do not separate the high/low risk linearly), you will have more complicated regions. For example generate_risk_class('mean symmetry') results in:
Low Risk [(0.114, 0.209], (0.241, 0.249], (0.272, 0.288]]
Medium Risk [(0.209, 0.225], (0.233, 0.241], (0.249, 0.264]]
High Risk [(0.225, 0.233], (0.264, 0.272], (0.288, 0.304]]

Problems regarding Pyomo provided math functions

I am trying to solve a maximization problem using Pyomo which has a recursive relationship. I am trying to maximize the revenue from a battery and it involves updating the state of charge of the battery every hour (which is the recursive relationship here). I am using the following code:
import pyomo
import numpy as np
from pyomo.environ import *
import pandas as pd

model = ConcreteModel()
N = 24  # number of hours
lmpdata = np.random.randint(1, 10, 24)  # LMP data (to be imported from MISO/PJM)
R = 0  # discount
eta_s = 0.99  # self-discharge efficiency
eta_c = 0.95  # round-trip efficiency
gammas_min = 0.1  # fraction of energy capacity to reserve for discharging
gammas_max = 0.05  # fraction of energy capacity to reserve for charging
S_bar = 50  # energy capacity
Q_bar = 50  # energy charge/discharge rating
model.qd = Var(range(N), within=NonNegativeReals)  # variables for energy sold at time t
model.qr = Var(range(N), within=NonNegativeReals)  # variables for energy purchased at time t
model.obj = Objective(expr=sum((model.qd[i]-model.qr[i])*lmpdata[i]*np.exp(-R*(i+1)) for i in range(N)), sense=maximize)  # objective function
model.SOC = np.zeros(N)  # state of charge (s(t) in Sandia's model)
model.SOC[0] = 25  # SOC at hour 0

# recursion relation describing the SOC
def con_rule1(model, i):
    model.SOC[i] = eta_s*model.SOC[i-1] + eta_c*model.qr[i-1] - model.qd[i-1]
    return (eta_s*model.SOC[i-1] + eta_c*model.qr[i-1] - model.qd[i-1] == model.SOC[i])

model.con1 = Constraint(range(1, N), rule=con_rule1)
#model.con2 = Constraint(expr = eta_s*SOC[N-1] + eta_c*model.qr[N-1] - model.qd[N-1] == SOC[0]) #SOC relation for the last hour

# SOC boundaries
def con_rule2(model, i):
    return (gammas_min*S_bar <= eta_s*model.SOC[i] + eta_c*model.qr[i] - model.qd[i] <= (1-gammas_max)*S_bar)

model.con3 = Constraint(range(N), rule=con_rule2)

# limits the total energy charged over each time step to the energy
# charge limit (derived from the power limit)
# It restricts the throughput based on the power rating
def con_rule3(model, i):
    return (0 <= model.qr[i]+model.qd[i] <= Q_bar)

model.con4 = Constraint(range(N), rule=con_rule3)

def pyomo_postprocess(options=None, instance=None, results=None):
    model.qd.display()
    model.qr.display()
    model.pprint()
However, when I try to run the code, I am getting the following error:
Implicit conversion of Pyomo NumericValue type `<class 'pyomo.core.kernel.expr_coopr3._SumExpression'>' to a float is
disabled. This error is often the result of using Pyomo components as
arguments to one of the Python built-in math module functions when
defining expressions. Avoid this error by using Pyomo-provided math
functions.
I could not find any reference to Pyomo's math functions in its documentation. It would be great if anyone could help me solve this problem!
Pyomo defines its own set of math module functions for operations like exp, log, sin, etc. If you want to use any of these functions in your Pyomo expressions you should make sure they are the ones provided by Pyomo and not from some other Python package. I think the issue with your model is that you are using np.exp in your Objective function. The Pyomo math functions are automatically imported when you import pyomo.environ so you should be able to replace np.exp with exp to get the Pyomo-defined function.
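For example, the objective from the question could be written as below (a sketch; exp is among the intrinsic functions that pyomo.environ already exports, so the explicit import is only for clarity):
from pyomo.environ import exp  # Pyomo's intrinsic exp, safe inside expressions

model.obj = Objective(
    expr=sum((model.qd[i] - model.qr[i]) * lmpdata[i] * exp(-R * (i + 1))
             for i in range(N)),
    sense=maximize)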

Doc2Vec.infer_vector keeps giving different result everytime on a particular trained model

I am trying to follow the official Doc2Vec Gensim tutorial mentioned here - https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
I modified the code in line 10 to determine the best matching document for the given query, and every time I run it I get a completely different result set. My new code in line 10 of the notebook is:
inferred_vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
rank = [docid for docid, sim in sims]
print(rank)
Every time I run this piece of code, I get a different set of documents matching the query "only you can prevent forest fires". The difference is stark, and the results just do not seem to match.
Is Doc2Vec not a suitable match for querying and information extraction? Or are there bugs?
Looking into the code: in infer_vector you are using parts of the algorithm that are non-deterministic. Initialization of the word vectors is deterministic (see the code of seeded_vector below), but when we look further, i.e., at the random sampling of words and at negative sampling (updating only a sample of word vectors per iteration), these can cause non-deterministic output (thanks @gojomo).
def seeded_vector(self, seed_string):
    """Create one 'random' vector (but deterministic by seed_string)"""
    # Note: built-in hash() may vary by Python version or even (in Py3.x) per launch
    once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)
    return (once.rand(self.vector_size) - 0.5) / self.vector_size
Set negative=0 to avoid randomization:
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [list('asdf'), list('asfasf')]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(documents)]
model = Doc2Vec(documents, vector_size=20, window=5, min_count=1, negative=0, workers=6, epochs=10)
a = list('test sample')
b = list('testtesttest')
for s in (a, b):
    v1 = model.infer_vector(s)
    for i in range(100):
        v2 = model.infer_vector(s)
        assert np.all(v1 == v2), "Failed on %s" % ''.join(s)

How to merger NaiveBayesClassifier object in NLTK

I am working on a project using the NLTK toolkit. With the hardware I have, I am only able to run the classifier on a small data set. So I divided the data into smaller chunks, ran the classifier on each of them, and stored all these individual objects in a pickle file.
Now for testing I need the whole object as one to get better results. So my question is: how can I combine these objects into one?
objs = []
while True:
    try:
        f = open(picklename, "rb")
        objs.extend(pickle.load(f))
        f.close()
    except EOFError:
        break
Doing this does not work, and it gives the error: TypeError: 'NaiveBayesClassifier' object is not iterable.
NaiveBayesClassifier code :
classifier = nltk.NaiveBayesClassifier.train(training_set)
I am not sure about the exact format of your data, but you cannot simply merge different classifiers. A Naive Bayes classifier stores a probability distribution based on the data it was trained on, and you cannot merge probability distributions without access to the original data.
If you look at the source code here: http://www.nltk.org/_modules/nltk/classify/naivebayes.html
an instance of the classifier stores:
self._label_probdist = label_probdist
self._feature_probdist = feature_probdist
these are calculated in the train method using relative frequency counts (e.g. P(L_1) = (# of L1 in training set) / (# of labels in training set)). To combine the two, you would want to get (# of L1 in Train 1 + Train 2) / (# of labels in T1 + T2).
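As a toy illustration with made-up counts: if chunk 1 has 30 of 100 labels equal to L1, and chunk 2 has 10 of 50, then:
# Hypothetical counts, only to show why counts (not probabilities) must be combined
p1 = 30 / 100                         # 0.30 within chunk 1
p2 = 10 / 50                          # 0.20 within chunk 2
p_combined = (30 + 10) / (100 + 50)   # ~0.267, not the naive average 0.25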
However, the Naive Bayes procedure isn't too hard to implement from scratch, especially if you follow the train source code in the link above. Here is an outline, using the NaiveBayes source code:
Store 'FreqDist' objects for each subset of the data for the labels and features.
label_freqdist = FreqDist()
feature_freqdist = defaultdict(FreqDist)
feature_values = defaultdict(set)
fnames = set()

# Count up how many times each feature value occurred, given
# the label and featurename.
for featureset, label in labeled_featuresets:
    label_freqdist[label] += 1
    for fname, fval in featureset.items():
        # Increment freq(fval|label, fname)
        feature_freqdist[label, fname][fval] += 1
        # Record that fname can take the value fval.
        feature_values[fname].add(fval)
        # Keep a list of all feature names.
        fnames.add(fname)

# If a feature didn't have a value given for an instance, then
# we assume that it gets the implicit value 'None.' This loop
# counts up the number of 'missing' feature values for each
# (label,fname) pair, and increments the count of the fval
# 'None' by that amount.
for label in label_freqdist:
    num_samples = label_freqdist[label]
    for fname in fnames:
        count = feature_freqdist[label, fname].N()
        # Only add a None key when necessary, i.e. if there are
        # any samples with feature 'fname' missing.
        if num_samples - count > 0:
            feature_freqdist[label, fname][None] += num_samples - count
            feature_values[fname].add(None)

# Use pickle to store label_freqdist, feature_freqdist, feature_values
Combine those using their built-in 'add' method. This will allow you to get the relative frequency across all the data.
all_label_freqdist = FreqDist()
all_feature_freqdist = defaultdict(FreqDist)
all_feature_values = defaultdict(set)

for file in train_labels:
    f = open(file, "rb")
    all_label_freqdist += pickle.load(f)
    f.close()
# Combine the default dicts for features similarly
Use the 'estimator' to create a probability distribution.
estimator = ELEProbDist  # the estimator class, as used in NaiveBayesClassifier.train
label_probdist = estimator(all_label_freqdist)

# Create the P(fval|label, fname) distribution
feature_probdist = {}
for ((label, fname), freqdist) in all_feature_freqdist.items():
    probdist = estimator(freqdist, bins=len(all_feature_values[fname]))
    feature_probdist[label, fname] = probdist

classifier = NaiveBayesClassifier(label_probdist, feature_probdist)
The classifier will now combine the counts across all the data and produce what you need.
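As a quick sanity check on the combined model (the feature names here are made up; use featuresets shaped like your training data):
# Classify one new instance and inspect the model (hypothetical featureset)
print(classifier.classify({"contains(money)": True, "length>5": False}))
classifier.show_most_informative_features(5)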
