Brightway2 - Get LCA scores of immediate exchanges

I'm having some problems with the post-processing analysis of my LCA results from Brightway2. After running an LCA calculation, if I type, for example, top_activities(), I get a list of activities and their associated scores; however, none of those activities/scores are the ones associated directly with my functional unit (they appear to be exchanges of my exchanges...).
How can I get the LCA scores of the exchanges (both technosphere and biosphere) I defined when constructing my Functional Unit?
Thanks!

I've found that the best way to get aggregated results for your foreground model in Brightway is the bw2analyzer.traverse_tagged_databases() function rather than top_activities(). Details are in the docs.
It's designed to calculate the upstream impacts of the elements of your foreground model and then aggregate them based on a tag it finds in each activity, e.g. if you add 'tag': 'use phase' or 'tag': 'processing' to your activities, you can aggregate impact results by life cycle stage.
BUT you can change the default label it looks for, so instead of tag you can tell it to look for name - that'll give you the aggregated upstream impact of each of the activities in your foreground model. It returns a dictionary with the names of your tags as keys and impacts as values. It also returns a graph of your foreground system, which you can use to create some cool tree/bullseye charts - see the docs for the format.
Here's the function you need:
results, graph = traverse_tagged_databases(functional_unit, method, label='name')
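A minimal usage sketch (the database name, activity code, and method here are placeholders for your own foreground setup):

from brightway2 import Database
from bw2analyzer.tagged import traverse_tagged_databases

functional_unit = {Database("my_foreground").get("assembly"): 1}  # hypothetical foreground activity
method = ('IPCC 2013', 'climate change', 'GWP 100a')

results, graph = traverse_tagged_databases(functional_unit, method, label='name')
for name, score in results.items():
    print(name, score)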
Here are a couple of examples of the kinds of visualisations you can make using the data traverse_tagged_databases gives you:
Waterfall chart example from the results dictionary
Bullseye chart example from the tagged graph

It is pretty easy to traverse the supply chain manually, and everyone wants to do this in a slightly different way, so it isn't built into Brightway yet. Here is a simple example:
from brightway2 import *

# Pick a random activity as the functional unit and a random LCIA method
func_unit = Database("ecoinvent 3.4 cutoff").random()
lca = LCA({func_unit: 1}, methods.random())
lca.lci()
lca.lcia()
print(func_unit)

# Score each direct technosphere input by redoing the LCIA with that
# exchange as the new demand
for exc in func_unit.technosphere():
    lca.redo_lcia({exc.input: exc['amount']})
    print(exc.input, exc['amount'], lca.score)
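The question also asks about direct biosphere exchanges. Those don't need a new LCIA calculation; a rough sketch (assuming the legacy bw2calc dictionaries, where the characterization matrix is diagonal) is to multiply each exchange amount by its characterization factor:

# Direct biosphere exchanges: score contribution = characterization factor * amount
for exc in func_unit.biosphere():
    row = lca.biosphere_dict[exc.input.key]    # row index of this flow
    cf = lca.characterization_matrix[row, row]
    print(exc.input, exc['amount'], cf * exc['amount'])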

Related

How to deal with end of life scenarios on Brightway?

I am currently working on a project about life cycle models for vehicles in Brightway. The models I am using are inspired by models in the software SimaPro. All the life cycle processes are created fine except for the end-of-life scenarios. In SimaPro the end-of-life scenarios are described with percentages of recycled mass for each type of product (plastics, aluminium, glass, etc.), but I can't find how to translate this into Brightway. Do you have ideas on how to deal with these end-of-life scenarios in Brightway? Thank you for your answer.
Example of the definition of an end of life scenario on Simapro
There are many different ways to model end-of-life, depending on what kind of abstraction you choose to map to the matrix math at the heart of Brightway. There is always some impedance between our intuitive understanding of physical systems and the computational models we work with. Brightway doesn't have any built-in functionality to calculate fractions of material inputs, but you can do this manually by adding an appropriate EoL activity for each input to your vehicle. This can be in the vehicle activity itself, or as a separate activity. You could also write functions that would add these activities automatically, though my guess is that manual addition makes more sense, as you can more easily check the reasonableness of the linked EoL activities.
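For example, a minimal sketch of the manual approach (all names and numbers here are hypothetical - adjust the database, activity, and fractions to your own project):

from brightway2 import Database

vehicle = Database("my_foreground").get("vehicle")   # your vehicle activity
eol_aluminium = Database("ecoinvent 3.4 cutoff").search("treatment of aluminium scrap")[0]

aluminium_mass = 120.0      # kg of aluminium in the vehicle (assumed)
recycled_fraction = 0.6     # from the SimaPro EoL scenario

# Link the recycled share of the aluminium to the EoL treatment activity
vehicle.new_exchange(
    input=eol_aluminium,
    amount=aluminium_mass * recycled_fraction,
    type="technosphere",
).save()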
One thing to be aware of is that, depending on the background database you are using, the sign of the EoL activity might not be what you expect. Again, what we think of is not necessarily what fits into the model. For example, aluminium going to a recycling centre is a physical output of an activity, and all outputs have positive signs (inputs have negative signs in the matrix, but Brightway sets this sign based on the type of the exchange). However, ecoinvent models EoL treatment activities as negative inputs (which is identical to positive outputs - the negatives cancel). I would build a simple system to make sure you are getting the results you expect before working on more complex systems.

Monte Carlo LCA on activities that have parameters with uncertainty from a SimaPro project returns constant value (no uncertainty)

I've imported a project from SimaPro in which almost every activity uses parameters that have uncertainty. When I run a Monte Carlo LCA on any of these in Brightway, the results are constant, as though the quantities have no uncertainty (the snippet shows 10 steps, but it's the same for 2000 steps).
import brightway2 as bw

# Import the SimaPro CSV export and write it with parameters activated
sp = bw.SimaProCSVImporter(fp, name="All Param")
sp.apply_strategies()
sp.statistics()  # returns 0 unlinked
sp.write_database(activate_parameters=True)

spdb = bw.Database("All Param")
imported_material = [act for act in spdb if 'Imported' in act['name']][0]

# Run a short Monte Carlo on one activity
mciter = 10
mc_imported = bw.MonteCarloLCA({imported_material: 1},
                               ('IPCC 2013', 'climate change', 'GWP 100a'))
scores = [next(mc_imported) for _ in range(mciter)]
scores
[0.015027544172490276,
0.015027544172490276,
...
0.015027544172490276,
0.015027544172490276]
I'm at a loss, as everything loads without errors, and looking at the activities and exchanges shows the expected formulas, parameters and uncertainties on the parameters.
I suspect the issue might be related to the distinction between active and passive parameters described in the documentation, but I do not see how to designate these parameters as (all) "active" beyond xxx.write_database(activate_parameters=True), as in the parameterized dataset example notebook. I also don't see how to list which parameters are active or passive, so the issue might be something else entirely.
What do I need to do to get my parameterized activities to incorporate the uncertainty from the parameters in the MC LCA? Any help would be most appreciated!
For what it's worth, they do work in the SimaPro project whence they come - the uncertainty analysis uses the uncertainty on the parameters - so I don't think the issue is in the originating project.
Thank you for any guidance you can provide!
Parameterized inventories generally don't work in Monte Carlo, as the Monte Carlo class is focused on data point uncertainties described by PDFs. There is a separate project called presamples which allows parameterized inventories to be used in Monte Carlo through some pre-calculations - however, it doesn't have great documentation yet. Look in the presamples docs, in particular at ParameterizedBrightwayModel.
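A rough sketch of the consumption side, assuming you have already built a presamples package from your parameter group with ParameterizedBrightwayModel (presamples_path below is a placeholder for that package's path):

# Feed the pre-sampled parameter values into the Monte Carlo matrices
mc = bw.MonteCarloLCA({imported_material: 1},
                      ('IPCC 2013', 'climate change', 'GWP 100a'),
                      presamples=[presamples_path])
scores = [next(mc) for _ in range(mciter)]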
Note: check your parameter names and formulas from SimaPro; Brightway is stricter about what it allows (e.g. Python is case-sensitive and has more reserved words).

Spark Item Similarity Interpretation (Cross-Similarity and Similarity)

I've been using Spark Item Similarity through Mahout by following the steps in this article:
https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html
I was able to clean my data, set up a local-only Spark/Hadoop node and all that.
Now, my question lies more in the interpretation of the matrices. I've tried some Google queries with limited success.
I'm creating a multi-modal recommender, and one of my datasets is very similar to the Mahout example.
Example input:
Customer ActionName Product
11064612 view 241505
11086047 purchase 110915
11121878 view CERT_DL
11149030 purchase CERT_FS
11104130 view 111401
The output of Mahout is two sets of matrices: a similarity matrix and a cooccurrence matrix.
This is my similarity matrix (I assume Mahout uses my "filter1" purchases):
791207-WP 791520-WP:11.350536461453885 791520:9.547158147208393 76130142:7.938639976084232 711215:7.0641921646893024 751309:6.805891904514283
So how would I interpret this? If someone purchased 791207-WP they could be interested in 791520-WP? (so I'd use the left part against purchases of a customer and rank products in the right part?).
The row for 791520-WP looks like this:
791520-WP 76151220:18.954662238247693 791604-WP:13.951210170984268
So, in theory, I'd recommend 76151220 to someone who bought 791520-WP, correct?
Part 2 of the question is interpreting the cross-similarity matrix. Remember my filter2 is "views".
How would I interpret this:
790907 76120956:14.2824428207241 791500-LXQ2:13.864741460885853 190907:10.735807818360627
I take this matrix as "someone who visited the 76120956 web page ended up purchasing 790907". So I should promote 790907 to customers who bought 76120956 and maybe even add a link between these 2 products on our site, for example.
Or is it "people who visited the webpage of 790907 ended up buying 76120956"?
My plan is not to use these as-is. I'll still use RowSimilarity and different sources to rank products - but I'm missing the basic interpretation of the outputs from mahout.
If you know of any documentation that clarifies this, that would be a great asset to have.
Thank you.
In both cases the matrix is telling you that the item-id key is similar to the listed items, with the LLR value attached to each similar item. Similar in the sense that similar users purchased the items. In the second case it is saying that similar people viewed the items, and these views also appear to have led to a purchase of the same item.
Cooccurrence works for purchases alone, cross-occurrence adds the check to make sure the view also correlated with a purchase. This allows you to use both for recommendations.
The output is meant to be used with a search engine generally and you would use a user's history of purchases and views as a 2 field query against the matrices, one in each field.
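For instance, a hedged sketch of that query pattern with the Python Elasticsearch client (the index and field names here are made up; they depend on how you index the indicator matrices):

from elasticsearch import Elasticsearch

es = Elasticsearch()
user_purchases = "791207-WP 791520"   # this user's purchase history as item ids
user_views = "241505 CERT_DL"         # this user's view history as item ids

query = {
    "query": {
        "bool": {
            "should": [
                {"match": {"purchase_indicators": user_purchases}},
                {"match": {"view_indicators": user_views}},
            ]
        }
    }
}
hits = es.search(index="items", body=query)
# The top-scoring hits are the recommendations, ranked by how well each item's
# indicators match this user's purchase and view history.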
There are analogous methods to find item-based recommendations.
Better yet, use something like the Universal Recommender here: actionml.com/docs/ur with PredictionIO for an end-to-end system.

How to use secondary user actions to improve recommendations with Spark ALS?

Is there a way to use secondary user actions derived from the user click stream to improve recommendations when using Spark Mllib ALS?
I have gone through the explicit and implicit feedback based example mentioned here : https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html that uses the same ratings RDD for the train() and trainImplicit() methods.
Does this mean I need to call trainImplicit() on the same model object with an RDD(user, item, action) for each secondary user action? Or train multiple models, retrieve recommendations based on each action, and then combine them linearly?
For additional context, the crux of the question is whether Spark ALS can model secondary actions like Mahout's spark-itemsimilarity job. Any pointers would help.
Disclaimer: I work with Mahout's Spark Item Similarity.
ALS does not work well for multiple actions in general. First, an illustration. The way multiple actions are consumed in ALS is to weight one above the other, for instance buy = 5, view = 3. ALS was designed in the days when ratings seemed important and predicting them was the question. We now know that ranking is more important. In any case, ALS uses the predicted ratings/weights to rank results. This means that a view is really telling ALS nothing, since what does a rating of 3 mean? Like? Dislike? ALS tries to get around this by adding a regularization parameter, and this will help in deciding whether 3 is a like or not.
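A minimal PySpark sketch of that weighting approach (events is an assumed RDD of (user, item, action) tuples with integer ids; the parameter values are illustrative only):

from pyspark.mllib.recommendation import ALS, Rating

weights = {"purchase": 5.0, "view": 3.0}
ratings = events.map(lambda x: Rating(x[0], x[1], weights[x[2]]))

# Implicit-feedback ALS treats the weights as confidence values, not ratings
model = ALS.trainImplicit(ratings, rank=10, iterations=10, lambda_=0.01, alpha=0.01)
recs = model.recommendProducts(11064612, 10)   # top 10 items for one user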
But the problem is more fundamental than that, it is one of user intent. When a user views a product (using the above ecom type example) how much "buy" intent is involved? From my own experience there may be none or there may be a lot. The product was new, or had a flashy image or other clickbait. Or I'm shopping and look at 10 things before buying. I once tested this with a large ecom dataset and found no combination of regularization parameter (used with ALS trainImplicit) and action weights that would beat the offline precision of "buy" events used alone.
So if you are using ALS, check your results before assuming that combining different events will help. Using two models with ALS doesn't solve the problem either because from buy events you are recommending that a person buy something, from view (or secondary dataset) you are recommending a person view something. The fundamental nature of intent is not solved. A linear combination of recs still mixes the intents and may very well lead to decreased quality.
What Mahout's Spark Item Similarity does is correlate views with buys--actually it correlates a primary action, one where you are clear about user intent, with other actions or information about the user. It builds a correlation matrix that in effect scrubs out the views that did not correlate with buys. We can then use that data. This is a very powerful idea, because now almost any user attribute, or action (virtually the entire clickstream), may be used in making recs, since the correlation is always tested. Often there is little correlation, but that's ok - removing such data from the calculation is an optimization, since it would add very little to the recs.
BTW if you find integration of Mahout's Spark Item Similarity daunting compared to using MLlib ALS, I'm about to donate an end-to-end implementation as a template for Prediction.io, all of which is Apache licensed open source.

Understanding Lucene Queries

I am interested in knowing a little more specifically about how Lucene queries are scored. In their documentation, they mention the VSM. I am familiar with VSM, but it seems inconsistent with the types of queries they allow.
I tried stepping through the source code for BooleanScorer2 and BooleanWeight, to no real avail.
My question is: can somebody step through the execution of a BooleanScorer and explain how it combines queries?
Also, is there a way to simply send out several terms and just get the raw tf.idf score for those terms, the way it is described in the documentation?
The place to start is http://lucene.apache.org/java/3_3_0/api/core/org/apache/lucene/search/Similarity.html
I think it clears up your inconsistency: Lucene combines the Boolean model (BM) of Information Retrieval with the Vector Space Model (VSM) of Information Retrieval - documents "approved" by BM are scored by VSM.
The next thing to look at is Searcher.explain, which can give you a string explaining how the score for a (query, document) pair is calculated.
Tracing through the execution of BooleanScorer can be challenging, I think; it's probably easiest to understand BooleanScorer2 first, which uses subscorers like ConjunctionScorer/DisjunctionSumScorer, and to think of BooleanScorer as an optimization.
If this is confusing, then start even simpler, at TermScorer. Personally I look at it "bottom-up" anyway:
A Query creates a Weight valid across the whole index: this incorporates boost, idf, queryNorm, and even, confusingly, the boosts of any 'outer'/'parent' queries (such as a BooleanQuery) that hold the term. This weight is computed a single time.
A Weight creates a Scorer (e.g. TermScorer) for each index segment. For a single term, this scorer has everything it needs in the formula except what is document-dependent: the within-document term frequency (tf), which it must read from the postings, and the document's length normalization value (norm). So this is why TermScorer scores a document as weight * sqrt(tf) * norm. In practice this is cached for tf values < 32, so that scoring most documents is a single multiply.
BooleanQuery really doesn't do "much" except that its scorers are responsible for nextDoc()'ing and advance()'ing the subscorers; when the Boolean model is satisfied, it combines the scores of the subscorers, applying the coordination factor (coord()) based on how many subscorers matched.
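Putting those pieces together, the practical formula documented for the classic Similarity (useful as a reference rather than something you need to trace through the code) is:

score(q,d) = coord(q,d) * queryNorm(q) * sum over t in q of ( tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) )

where tf(t in d) = sqrt(frequency of t in d), and norm(t,d) folds the document length normalization and any index-time boosts into a single encoded value.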
In general, it's definitely difficult to trace through how Lucene scores documents because, in all released forms, the Scorers are responsible for two things: matching and calculating scores. In Lucene's trunk (http://svn.apache.org/repos/asf/lucene/dev/trunk/) these are now separated, such that a Similarity is basically responsible for all aspects of scoring, and this is separate from matching. So the API there might be easier to understand, maybe harder, but at least you can refer to implementations of many other scoring models (BM25, language models, divergence from randomness, information-based models) if you get confused: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/java/org/apache/lucene/search/similarities/

Resources