What does the characterized inventory matrix of a MultiLCA object represent in Brightway2? I would have expected to find several of these matrices in the object, one for each combination of activity and impact assessment method.
For a simple LCA object, the sum of all the elements of the characterized inventory matrix gives the total impact of that activity. But that does not seem to be the case for MultiLCA objects, e.g.:
import brightway2 as bw

# impact assessment methods
i2002 = [('IMPACT 2002+ (Endpoint)', 'resources', 'total'),
         ('IMPACT 2002+ (Endpoint)', 'climate change', 'climate change'),
         ('IMPACT 2002+ (Endpoint)', 'human health', 'total'),
         ('IMPACT 2002+ (Endpoint)', 'ecosystem quality', 'total')]

# ten random activities as functional units
fu = []
for j in range(1, 11):
    fu.append({bw.Database('ei_33c').random(): 1})

testsetup_i2002 = {'inv': fu, 'ia': i2002}
bw.calculation_setups['testsetup_i2002'] = testsetup_i2002
mlca_test = bw.MultiLCA('testsetup_i2002')
result = mlca_test.lca.characterized_inventory.sum()
The result is different from the scores (or the sum of scores) obtained from
mlca_test.results()
You can see the MultiLCA source code, which is relatively simple. It overwrites the characterized_inventory matrix at each step of the calculation, and only stores the numeric scores in a Numpy array: self.results = np.zeros((len(self.func_units), len(self.methods))). To get what you want, i.e. separate characterized inventory matrices for each combination of functional unit and LCIA method, you would have to write your own subclass. Here is one example:
from bw2calc.multi_lca import *


class PersistentMultiLCA(MultiLCA):
    def __init__(self, cs_name):
        if not calculation_setups:
            raise ImportError
        assert cs_name in calculation_setups
        try:
            cs = calculation_setups[cs_name]
        except KeyError:
            raise ValueError(
                "{} is not a known `calculation_setup`.".format(cs_name)
            )
        self.func_units = cs['inv']
        self.methods = cs['ia']
        self.lca = LCA(demand=self.all, method=self.methods[0])
        self.lca.lci(factorize=True)
        self.method_matrices = []
        self.results = {}
        for method in self.methods:
            self.lca.switch_method(method)
            self.method_matrices.append(self.lca.characterization_matrix)
        for row, func_unit in enumerate(self.func_units):
            self.lca.redo_lci(func_unit)
            for col, cf_matrix in enumerate(self.method_matrices):
                self.lca.characterization_matrix = cf_matrix
                self.lca.lcia_calculation()
                self.results[row, col] = self.lca.characterized_inventory.copy()
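A minimal usage sketch, assuming the 'testsetup_i2002' setup from the question has been registered; summing one stored matrix should give the single score for that functional unit/method pair:
mlca = PersistentMultiLCA('testsetup_i2002')

# Characterized inventory matrix for the first functional unit under the
# second LCIA method; its sum is that single score
matrix = mlca.results[0, 1]
print(matrix.sum())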
A federal agency, the Centers for Medicare and Medicaid Services (CMS), imposes regulations on nursing homes. However, nursing homes are inspected by state agencies for compliance with those regulations, and fines for violations can vary widely between states.
Let's develop a very simple initial model to predict the amount of fines a nursing home might expect to pay based on its location. Fill in the class definition of the custom estimator, StateMeanEstimator, below.
I am getting the following error:
Your solution did not match the expected type: 53 * number
Specifically, solution[52] did not match {'type': 'number'}:
None
Here is my code:
import pandas as pd
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin


class GroupMeanEstimator(BaseEstimator, RegressorMixin):
    def __init__(self, grouper):
        self.grouper = grouper
        self.group_averages = {}

    def fit(self, X, y):
        # Use self.group_averages to store the average penalty by group
        Xy = X.join(y)
        state_mean_series = Xy.groupby(self.grouper)[y.name].mean()
        for row in pd.DataFrame(state_mean_series).itertuples():
            self.group_averages[row[0]] = row[1]
        return self

    def predict(self, X):
        # Return a list of predicted penalties based on the group of each sample in X
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        dictionary = self.group_averages
        group = self.grouper
        list_of_predictions = []
        for row in X.itertuples():
            # .get returns None for any state not seen during fit
            prediction = dictionary.get(row.STATE)
            list_of_predictions.append(prediction)
        return list_of_predictions
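The None in the error message suggests that at least one state in the test set was never seen during fit, so dictionary.get returns None for it. A minimal sketch of a fallback; using the mean of the group means is an assumption, not part of the original exercise:
    def predict(self, X):
        # Hypothetical fix: fall back to the mean of the group means for unseen states
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        fallback = sum(self.group_averages.values()) / len(self.group_averages)
        return [self.group_averages.get(row.STATE, fallback)
                for row in X.itertuples()]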
I've written a custom class to group elements of a dataset, fit each group, and then run predictions for each group based on the fitted model. I want to be able to return the coefficients of each fitting (presumably in a dictionary), so that I can refer back to them and plot the line of best fit for each.
Calling the standard .coef_ or .get_params methods does not work because the items these methods attempt to retrieve are groupby objects. Alternatively, I tried to introduce the following:
def get_coefs():
    coefs_dict = {}
    for name, values in dataframe.groupby(self.groupby_column):
        coefs_dict[name] = self.drugs_dict[name].coefs_
    return coefs_dict
But I get the following:
<bound method GroupbyEstimator.get_coefs of GroupbyEstimator(groupby_column='ndc',
pipeline_factory=<function pipeline_factory at 0x0000018DAD207268>)>
Here's the class I've written:
from sklearn import base
import numpy as np
import pandas as pd


class GroupbyEstimator(base.BaseEstimator, base.RegressorMixin):
    def __init__(self, groupby_column, pipeline_factory):
        self.groupby_column = groupby_column
        self.pipeline_factory = pipeline_factory

    def fit(self, dataframe, label):
        self.drugs_dict = {}
        self.label = label
        dataframe = pd.get_dummies(dataframe)
        for name, values in dataframe.groupby(self.groupby_column):
            y = values[label]
            X = values.drop(columns=[label, self.groupby_column], axis=1)
            self.drugs_dict[name] = self.pipeline_factory().fit(X, y)
        return self

    def get_coefs():
        self.coefs_dict = {}
        self.coefs_dict[name] = self.drugs_dict[name].named_steps["lin_reg"].coef_
        return self.coefs_dict

    def predict(self, test_data):
        price_pred_list = []
        for idx, row in test_data.iterrows():
            name = row[self.groupby_column]
            regression_coefs = self.drugs_dict[name]
            row = pd.DataFrame(row).T
            X = row.drop(columns=[self.label, self.groupby_column], axis=1).values.reshape(1, -1)
            drug_price_pred = regression_coefs.predict(X)
            price_pred_list.append([name, drug_price_pred])
        return price_pred_list
Expected result is a dictionary of the format:
{drug_a: [coefficient_1, coefficient_2,...coefficient_n],
drug_b: [coefficient_1, coefficient_2,...coefficient_n],
drug_c: [coefficient_1, coefficient_2,...coefficient_n]}
The pipeline factory is like this. I'll be building this out with alternative regressors, PCA, GridSearchCV, etc. at a later time (so long as I can get the parameters out of the groupby objects for the individual regressions).
def pipeline_factory():
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LinearRegression
    return Pipeline([
        ('lin_reg', LinearRegression())
    ])
EDIT: Added the get_coefs method as suggested. Unfortunately, as displayed above, it is still returning the same error.
The problem is with self.drugs_dict, which is a dictionary of Pipeline objects, so you can't use coef_ on them directly. coef_ is an attribute of the estimator object, which in your case is a LinearRegression object. So the correct way of accessing the coefficients is self.drugs_dict[name].named_steps["lin_reg"].coef_ instead of self.drugs_dict[name].coefs_ in your get_coefs() method.
While @Parthasarathy Subburaj led me to the right answer, here's the completed code for anyone who may be looking for a similar solution:
from sklearn import base
import numpy as np
import pandas as pd


class GroupbyEstimator(base.BaseEstimator, base.RegressorMixin):
    def __init__(self, groupby_column, pipeline_factory):
        # groupby_column is the column to group by;
        # pipeline_factory can be called to produce estimators
        self.groupby_column = groupby_column
        self.pipeline_factory = pipeline_factory

    def fit(self, dataframe, label):
        # Create an estimator and fit it with the portion of the data in each group
        self.drugs_dict = {}
        self.label = label
        self.coefs_dict = {}
        # OneHotEncoder had problems with the data, so getting the dummies with pandas here
        dataframe = pd.get_dummies(dataframe)
        for name, values in dataframe.groupby(self.groupby_column):
            y = values[label]
            X = values.drop(columns=[label, self.groupby_column], axis=1)
            self.drugs_dict[name] = self.pipeline_factory().fit(X, y)
            self.coefs_dict[name] = self.drugs_dict[name].named_steps["lin_reg"].coef_
        return self

    def get_coefs(self):
        return self.coefs_dict

    def predict(self, test_data):
        price_pred_list = []
        for idx, row in test_data.iterrows():
            name = row[self.groupby_column]  # get drug name from drug column
            regression_coefs = self.drugs_dict[name]  # get the fitted pipeline for this drug
            row = pd.DataFrame(row).T
            X = row.drop(columns=[self.label, self.groupby_column], axis=1).values.reshape(1, -1)
            drug_price_pred = regression_coefs.predict(X)  # predict with the per-drug model
            price_pred_list.append([name, drug_price_pred])
        return price_pred_list
The TL;DR of the comments is that the dictionary holding model names and coefficients needs to be created in the fit method using sklearn's .named_steps on the desired step of the pipeline, and then returned by a separate method (in this case get_coefs).
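For illustration, a quick usage sketch with made-up data; the column names 'ndc', 'units', 'price' and all values are hypothetical:
# Toy data: numeric drug code, one numeric feature, and the target price
df = pd.DataFrame({
    'ndc': [111, 111, 222, 222],
    'units': [1.0, 2.0, 1.5, 3.0],
    'price': [10.0, 19.0, 5.0, 11.0],
})

model = GroupbyEstimator('ndc', pipeline_factory).fit(df, 'price')
print(model.get_coefs())
# expected: {111: array([9.]), 222: array([4.])}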
I've generated random streets as Shapely LineString objects using the following code:
from random import randint

from shapely.geometry import LineString


class StreetNetwork():
    def __init__(self):
        self.street_coords = []
        self.coords = {}

    def gen_street_coords(self, length, coordRange):
        min_, max_ = coordRange
        for i in range(length):
            street = LineString(((randint(min_, max_), randint(min_, max_)),
                                 (randint(min_, max_), randint(min_, max_))))
            self.street_coords.append(street)
If I use:
street_network = StreetNetwork()
street_network.gen_street_coords(10, [-50, 50])
I get an image like so: [plot of the ten random street segments]
I've been looking at the following question, which seems similar. I now want to iterate through my list of street_coords and split a street in two whenever it crosses another street, but as I am unfamiliar with Shapely, I am struggling to use the intersects function and to find the coordinates of the points of intersection.
It is rather simple to check the intersection of two LineString objects. To avoid getting empty geometries, I suggest checking for intersection before computing it. Something like this:
from shapely.geometry import LineString, Point


def get_intersections(lines):
    point_intersections = []
    line_intersections = []  # if two lines are equal, the intersection is a complete line!
    lines_len = len(lines)
    # the index handling avoids computing the same intersection twice
    for i in range(lines_len):
        for j in range(i + 1, lines_len):
            l1, l2 = lines[i], lines[j]
            if l1.intersects(l2):
                intersection = l1.intersection(l2)
                if isinstance(intersection, LineString):
                    line_intersections.append(intersection)
                elif isinstance(intersection, Point):
                    point_intersections.append(intersection)
                else:
                    raise Exception('What happened?')
    return point_intersections, line_intersections
With the example:
l1 = LineString([(0,0), (1,1)])
l2 = LineString([(0,1), (1,0)])
l3 = LineString([(5,5), (6,6)])
l4 = LineString([(5,5), (6,6)])
my_lines = [l1, l2, l3, l4]
print get_intersections(my_lines)
I got:
[<shapely.geometry.point.Point object at 0x7f24f00a4710>,
<shapely.geometry.linestring.LineString object at 0x7f24f00a4750>]
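To actually split a street in two at such a point, shapely.ops.split (available in Shapely 1.6+) can be applied to each intersection point found above; a minimal sketch:
from shapely.ops import split

line = LineString([(0, 0), (1, 1)])
cutter = Point(0.5, 0.5)

# split returns a GeometryCollection of the resulting segments
pieces = split(line, cutter)
for piece in pieces.geoms:
    print(piece)
# LINESTRING (0 0, 0.5 0.5)
# LINESTRING (0.5 0.5, 1 1)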
I have a review dataset and I want to process it using NLP techniques. I did all the preprocessing steps (removing stop words, stemming, etc.). My problem is that some words are run together, and my function doesn't recognize them. Here is an example:
Great services. I had a nicemeal and I love it a lot.
How can I correct nicemeal to nice meal?
Peter Norvig has a nice solution to the word segmentation problem that you are encountering. Long story short, he uses a large dataset of word (and bigram) frequencies and some dynamic programming to split long strings of connected words into their most likely segmentation.
You can download the zip file with the source code and the word frequencies and adapt it to your use case. Here is the relevant bit, for completeness (note the original is Python 2 code):
import operator


def memo(f):
    "Memoize function f."
    table = {}
    def fmemo(*args):
        if args not in table:
            table[args] = f(*args)
        return table[args]
    fmemo.memo = table
    return fmemo

@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first] + segment(rem) for first, rem in splits(text))
    return max(candidates, key=Pwords)

def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:])
            for i in range(min(len(text), L))]

def Pwords(words):
    "The Naive Bayes probability of a sequence of words."
    return product(Pw(w) for w in words)

#### Support functions (p. 224)

def product(nums):
    "Return the product of a sequence of numbers."
    return reduce(operator.mul, nums, 1)

class Pdist(dict):
    "A probability distribution estimated from counts in datafile."
    def __init__(self, data=[], N=None, missingfn=None):
        for key, count in data:
            self[key] = self.get(key, 0) + int(count)
        self.N = float(N or sum(self.itervalues()))
        self.missingfn = missingfn or (lambda k, N: 1./N)
    def __call__(self, key):
        if key in self: return self[key]/self.N
        else: return self.missingfn(key, self.N)

def datafile(name, sep='\t'):
    "Read key,value pairs from file."
    for line in file(name):
        yield line.split(sep)

def avoid_long_words(key, N):
    "Estimate the probability of an unknown word."
    return 10./(N * 10**len(key))

N = 1024908267229  ## Number of tokens
Pw = Pdist(datafile('count_1w.txt'), N, avoid_long_words)
You can also use the segment2 method as it uses bigrams and is much more accurate.
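Applied to the word from the question (assuming count_1w.txt is in the working directory), this should produce the split you want:
print(segment('nicemeal'))
# expected: ['nice', 'meal']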
I have 2 pandas Series that I know are the same length, where each element is a set. I want to figure out a computationally efficient way to get the element-wise union of the sets in these two Series. I've created a simplified version of the code with fake, short Series to play with below. This implementation is a VERY inefficient way of doing this. There has GOT to be a faster way. My real Series are much longer and I have to do this operation hundreds of thousands of times.
import pandas as pd

set_series_1 = pd.Series([{1, 2, 3}, {'a', 'b'}, {2.3, 5.4}])
set_series_2 = pd.Series([{2, 4, 7}, {'a', 'f', 'g'}, {0.0, 15.6}])

n = set_series_1.shape[0]
for i in range(0, n):
    set_series_1[i] = set_series_1[i].union(set_series_2[i])
print(set_series_1)

0          set([1, 2, 3, 4, 7])
1             set([a, b, g, f])
2    set([0.0, 2.3, 15.6, 5.4])
dtype: object
I've tried combining the Series into a data frame and using the apply function, but I get an error saying that sets are not supported as dataframe elements.
After testing several options, I finally came up with a good one: pir4 below.
Testing:
def jed1(s1, s2):
    s = s1.copy()
    n = s1.shape[0]
    for i in range(n):
        s[i] = s2[i].union(s1[i])
    return s

def pir1(s1, s2):
    return pd.Series([item.union(s2[i]) for i, item in enumerate(s1.values)], s1.index)

def pir2(s1, s2):
    return pd.Series([item.union(s2[i]) for i, item in s1.iteritems()], s1.index)

def pir3(s1, s2):
    return s1.apply(list).add(s2.apply(list)).apply(set)

def pir4(s1, s2):
    return pd.Series([set.union(*z) for z in zip(s1, s2)])
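For example, on the Series from the question (note pir4 builds a fresh default index, so it assumes the inputs use a plain RangeIndex):
# Element-wise union of the two example Series
result = pir4(set_series_1, set_series_2)
print(result[0])  # set([1, 2, 3, 4, 7])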