LCA in Brightway: Is mass balance maintained during a Monte Carlo simulation?

I am running a Monte Carlo analysis for a single ecoinvent cut-off process (cement production) in Activity Browser/Brightway. 1) Is there a way to perform a mass balance check for each run? Or 2) does Brightway already perform a mass balance check in the background?
If all the uncertainties are sampled independently in each run from their predefined probability distributions, I imagine that a matrix-based LCA formulation with uncertainty can't maintain a mass balance without additional steps. Is this correct?
I believe this also links to a recent question, "Overestimated Monte Carlo results in Brightway", and the Chemical imbalance blog post (https://chris.mutel.org/chemical-imbalance.html).

An excellent question! No, mass (or other) balances are not preserved by any LCA software (at least that I know of), as the data formats do not give us information on correlated uncertainty or joint distributions.
Brightway 2.5 introduces ways to fix this problem through the use of uncertainty arrays instead of independent PDFs, and there is a publication about this under submission. You can get a sense of how this approach works in Aleksandra Kim's repo.
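To make the problem concrete, here is a minimal sketch using only the Python standard library. The unit process, its masses, and the function names are all made up for illustration, and this is not Brightway's API or the method from the publication mentioned above. It only contrasts independent draws with a draw in which one flow is derived from the other.

```python
import random

# Hypothetical unit process, balanced at its central values:
# 0.7 kg of A + 0.3 kg of B -> 1.0 kg of product.
MEAN_A, MEAN_B, PRODUCT = 0.7, 0.3, 1.0

def independent_draw(rng, sigma=0.1):
    """Sample each input from its own lognormal PDF, ignoring the other input."""
    a = MEAN_A * rng.lognormvariate(0.0, sigma)
    b = MEAN_B * rng.lognormvariate(0.0, sigma)
    return a, b

def correlated_draw(rng, sigma=0.1):
    """Sample A, then derive B so that A + B always equals the product mass."""
    a = min(MEAN_A * rng.lognormvariate(0.0, sigma), PRODUCT)
    return a, PRODUCT - a

def imbalance(a, b):
    """Mass of inputs minus mass of product, in kg."""
    return (a + b) - PRODUCT

rng = random.Random(42)
independent = [imbalance(*independent_draw(rng)) for _ in range(1000)]
correlated = [imbalance(*correlated_draw(rng)) for _ in range(1000)]
worst_independent = max(abs(x) for x in independent)
worst_correlated = max(abs(x) for x in correlated)
print(f"worst imbalance, independent draws: {worst_independent:.3f} kg")
print(f"worst imbalance, correlated draws:  {worst_correlated:.3g} kg")
```

The same `imbalance` check could in principle be run after every Monte Carlo iteration on the sampled exchange amounts, if you know which exchanges carry the mass you want to balance.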


How to understand the answer_start parameter of Squad dataset for training BERT-QA model + practical implications for creating custom dataset?

I am in the process of creating a custom dataset to benchmark the accuracy of the 'bert-large-uncased-whole-word-masking-finetuned-squad' model for my domain, to understand if I need to fine-tune further, etc.
When looking at the different Question Answering datasets on the Hugging Face site (squad, adversarial_qa, etc.), I see that the answer is commonly formatted as a dictionary with keys: answer (the text) and answer_start (char index where answer starts).
I'm trying to understand:
The intuition behind how the model uses the answer_start when calculating the loss, accuracy, etc.
If I need to go through the process of adding this to my custom dataset (easier to run model evaluation code, etc?)
If so, is there a programmatic way to do this to avoid manual effort?
Any help or direction would be greatly appreciated!
Code example to show format:
import datasets
ds = datasets.load_dataset('squad')
train = ds['train']
print('Example: \n')
print(train['answers'][0])
Your question is a bit broad to give you a specific answer, but I will try my best to point you in some directions.
The intuition behind how the model uses the answer_start when
calculating the loss, accuracy, etc.
There are different types of QA tasks/datasets. The ones you mentioned (SQuAD and adversarial_qa) belong to the field of extractive question answering. There, a model must select a span from a given context that answers the given question. For example:
context = 'Second, Democrats have always elevated their minority floor leader to the speakership upon reclaiming majority status. Republicans have not always followed this leadership succession pattern. In 1919, for instance, Republicans bypassed James R. Mann, R-IL, who had been minority leader for eight years, and elected Frederick Gillett, R-MA, to be Speaker. Mann "had angered many Republicans by objecting to their private bills on the floor;" also he was a protégé of autocratic Speaker Joseph Cannon, R-IL (1903–1911), and many Members "suspected that he would try to re-centralize power in his hands if elected Speaker." More recently, although Robert H. Michel was the Minority Leader in 1994 when the Republicans regained control of the House in the 1994 midterm elections, he had already announced his retirement and had little or no involvement in the campaign, including the Contract with America which was unveiled six weeks before voting day.'
question='How did Republicans feel about Mann in 1919?'
answer='angered' #-> starting at character 365
A simple approach that is often used today is a linear layer that predicts the answer start and answer end from the last hidden state of a transformer encoder. The last hidden state holds one vector for each input token (tokens != words), and the linear layer is trained to assign high probabilities to tokens that could be the start or end of the answer span. To train a model with your data, the loss function needs to know which tokens should get a high probability (i.e. the start and end tokens of the answer).
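As a rough illustration of how a character-level answer_start becomes token-level training targets, here is a sketch with a toy whitespace tokenizer. The function name and the tokenizer are assumptions made for this example; real models split text into subwords, but the mapping idea is the same.

```python
import re

def char_span_to_token_span(context, answer_text, answer_start):
    """Map a character-level answer span to (start_token, end_token) indices."""
    answer_end = answer_start + len(answer_text)
    # Toy tokenizer: each run of non-whitespace characters is one token,
    # stored as its (start, end) character offsets in the context.
    offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", context)]
    start_token = end_token = None
    for i, (tok_start, tok_end) in enumerate(offsets):
        # First token that ends after the answer begins.
        if start_token is None and tok_end > answer_start:
            start_token = i
        # Last token that begins before the answer ends.
        if tok_start < answer_end:
            end_token = i
    return start_token, end_token

context = "Mann had angered many Republicans by objecting to their bills."
answer = "angered many"
token_span = char_span_to_token_span(context, answer, context.find(answer))
print(token_span)  # (2, 3): the tokens "angered" and "many"
```

These two indices are what a start/end classification loss would be trained against.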
If I need to go through the process of adding this to my custom
dataset (easier to run model evaluation code, etc?)
You should go through this process; otherwise, how should someone know where the answer starts in your context? They can of course infer it programmatically, but what if your answer string appears twice in the context? Providing an answer start position avoids confusion and allows your users to use it right away with one of the many extractive question answering scripts that are already available out there.
If so, is there a programmatic way to do this to avoid manual effort?
You could simply loop through your dataset and use str.find:
context.find(answer)
Output:
365
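A slightly fuller sketch of that loop, which also flags the ambiguous case where the answer string occurs more than once (or not at all). The flat record layout with `context` and `answer` keys and the `needs_review` flag are assumptions for illustration; adapt them to your own dataset format.

```python
def find_all(context, answer):
    """Return every character index at which `answer` occurs in `context`."""
    positions = []
    start = context.find(answer)
    while start != -1:
        positions.append(start)
        start = context.find(answer, start + 1)
    return positions

def add_answer_start(example):
    """Attach answer_start to one record, or flag it for manual review if ambiguous."""
    positions = find_all(example["context"], example["answer"])
    annotated = dict(example)
    # Only trust the position when the answer occurs exactly once.
    annotated["answer_start"] = positions[0] if len(positions) == 1 else None
    annotated["needs_review"] = len(positions) != 1
    return annotated

records = [
    {"context": "The cat sat on the mat.", "answer": "cat"},
    {"context": "A cat saw another cat.", "answer": "cat"},
]
annotated = [add_answer_start(r) for r in records]
for row in annotated:
    print(row["answer_start"], row["needs_review"])
```

Only the flagged records then need a manual look.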

Statistical method to compare urban vs rural matched siblings

I am writing a study protocol for my master's thesis. The study seeks to compare the rates of non-communicable diseases (NCDs) and their risk factors, and to determine the effects of rural-to-urban migration. Sibling pairs will be identified from a rural area. One of the siblings should have participated in the rural NCD survey which is currently ongoing in the area. The other sibling should have left the area and reported moving to a city. Data will be collected by completing a questionnaire on demographics, family history, medical history, diet, alcohol consumption, smoking, and physical activity. This will be done for both the rural and urban sibling, with data on the amount of time spent in urban areas fur
The outcomes, which are binary (whether one has a condition or not), are: 1. diabetic, 2. hypertensive, 3. obese
What statistical method can I use to compare the outcomes (stated above) between the two groups, considering that the siblings were matched (one urban sibling for every rural sibling)?
What statistical methods can also be used to explore associations between the amount of time spent in urban residence and the outcomes?
Given that your main aim is to compare the frequencies of two nominal distributions, a chi-square test seems to be the method of choice for your first question. However, it should be mentioned that a chi-square test is about the most basic test for differences between samples. If you are studying medicine (or a related field), a chi-square test is fine because it is also frequently applied by researchers in this field. If you are studying psychology or sociology (or a related field), I'd advise highlighting the limitations of the test in the discussion section, since it mostly tests your distributions against the distributions expected by chance.
Regarding your second question, a logistic regression would be applicable since it allows binary variables both as independent variables (predictors) and as the dependent variable. However, if you have other interval-scaled variables (e.g. age, weight, etc.), you could also use t-tests or ANOVAs to investigate differences in these variables with respect to the presence of specific diseases (i.e. is diabetic or not).
Overall, this matter strongly depends on what you mean by "association". Classically, "association" refers to correlations or linear regression (for which you need interval-scaled variables on "both sides"), but given your data structure, the aforementioned methods are a better fit.
How you actually calculate these tests depends on the statistics software used.
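If you want to see what the chi-square test computes before picking software, here is a small standard-library sketch for a 2x2 table. The counts are invented for illustration. Note that this treats the urban and rural groups as independent samples, as the plain chi-square test does, so it does not use the sibling pairing.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if group and outcome were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Made-up counts: rows = urban, rural; columns = hypertensive, not hypertensive.
table = [[30, 70], [18, 82]]
statistic, p_value = chi_square_2x2(table)
print(f"chi-square = {statistic:.3f}, p = {p_value:.4f}")
```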

Calculate expected color temperature of daylight

I have a location (latitude/longitude) and a timestamp (year/month/day/hour/minute).
Assuming clear skies, is there an algorithm to loosely estimate the color temperature of sunlight at that time and place?
If I know what the weather was at that time, is there a suggested way to modify the color temperature for the amount of cloud cover at that time?
I suggest taking a look at this paper which has nice practical implementation for CG applications:
A Practical Analytic Model for Daylight, by A. J. Preetham, Peter Shirley, and Brian Smits
Abstract
Sunlight and skylight are rarely rendered correctly in computer
graphics. A major reason for this is high computational expense.
Another is that precise atmospheric data is rarely available. We
present an inexpensive analytic model that approximates full spectrum
daylight for various atmospheric conditions. These conditions are
parameterized using terms that users can either measure or estimate.
We also present an inexpensive analytic model that approximates the
effects of atmosphere (aerial perspective). These models are fielded
in a number of conditions and intermediate results verified against
standard literature from atmospheric science. Our goal is to achieve
as much accuracy as possible without sacrificing usability.
Both compressed postscript and pdf files of the paper are available.
Example code is available.
Color images from the paper are shown below.
Link-only answers are discouraged, but I cannot post either a sufficient portion of the article or a complete C++ code snippet here, as both are far too big. Both are currently available by following the link.

How do I measure the distribution of an attribute of a given population?

I have a catalog of 900 applications.
I need to determine how their reliability is distributed as a whole (i.e., is it normally distributed?).
I can measure the reliability of an individual application.
How can I determine the reliability of the group as a whole without measuring each one?
That's a pretty open-ended question! Overall, distribution fitting can be quite challenging and works best with large samples (hundreds or even thousands). It's generally better to pick a modeling distribution based on known characteristics of the process you're attempting to model than to try purely empirical fitting.
If you're going to go empirical, for a start you could take a random sample, measure the reliability scores (whatever you're using for that) of your sample, sort them, and plot them vs normal quantiles. If they fall along a relatively straight line the normal distribution is a plausible model, and you can estimate sample mean and variance to parameterize it. You can apply the same idea of plotting vs quantiles from other proposed distributions to see if they are plausible as well.
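Here is a minimal sketch of that quantile comparison using only the standard library. The scores are simulated stand-ins for measurements on 60 randomly chosen applications, and the sample size, function names, and the use of a correlation coefficient as a "straightness" summary are my assumptions, not part of any standard recipe. In practice you would plot the points and look at them.

```python
import math
import random
from statistics import NormalDist, mean

def normal_qq_points(sample):
    """Pair each sorted observation with the matching standard normal quantile."""
    ordered = sorted(sample)
    n = len(ordered)
    std_normal = NormalDist()
    # Plotting positions (i + 0.5) / n keep probabilities strictly inside (0, 1).
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

def straightness(points):
    """Pearson correlation of the Q-Q points; values near 1 suggest a near-straight line."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in points)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(1)
# Simulated reliability scores standing in for real measurements.
scores = [rng.gauss(0.95, 0.02) for _ in range(60)]
points = normal_qq_points(scores)
r = straightness(points)
print(f"Q-Q correlation: {r:.3f}")
```

A high correlation alone does not prove normality; curvature at the ends of the plot is exactly the tail behavior to look at.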
Watch out for behavior in the tails, in particular. Pretty much by definition the tails occur rarely and may be under-represented in your sample. Like all things statistical, the larger the sample size you can draw on the better your results will be.
I'd also add that my prior belief would be that a normal distribution wouldn't be a great fit. Your reliability scores probably fall on a bounded range and tend to cluster towards one side or the other of that range. If they tend to the high end, I'd predict that they get lopped off at the top of the range and have a long tail to the low side, and vice versa if they tend to the low end.

Compare and Contrast Monte-Carlo Method and Evolutionary Algorithms

What's the relationship between the Monte Carlo method and evolutionary algorithms? On the face of it they seem to be unrelated simulation methods used to solve complex problems. Which kinds of problems is each best suited for? Can they solve the same set of problems? What is the relationship between the two (if there is one)?
"Monte Carlo" is, in my experience, a heavily overloaded term. People seem to use it for any technique that uses a random number generator: global optimization, scenario analysis (Google "Excel Monte Carlo simulation"), stochastic integration (the Pi calculation that everybody uses to demonstrate MC). I believe, because you mentioned evolutionary algorithms in your question, that you are talking about Monte Carlo techniques for mathematical optimization: you have some sort of fitness function with several input parameters and you want to minimize (or maximize) that function.
If your function is well behaved (there is a single, global minimum that you will arrive at no matter which inputs you start with) then you are best off using a deterministic minimization technique such as the conjugate gradient method. Many machine learning classification techniques involve finding parameters that minimize the least squares error for a hyperplane with respect to a training set. The function that is being minimized in this case is a smooth, well-behaved paraboloid in n-dimensional space. Calculate the gradient and roll downhill. Easy peasy.
If, however, your input parameters are discrete (or if your fitness function has discontinuities) then it is no longer possible to calculate gradients accurately. This can happen if your fitness function is calculated using tabular data for one or more variables (if variable X is less than 0.5 use this table, else use that table). Alternatively, you may have a program that you got from NASA that is made up of 20 modules written by different teams that you run as a batch job. You supply it with input and it spits out a number (think black box). Depending on the input parameters that you start with, you may end up in a false minimum. Global optimization techniques attempt to address these types of problems.
Evolutionary Algorithms form one class of global optimization techniques. Global optimization techniques typically involve some sort of "hill climbing" (accepting a configuration with a higher (worse) fitness function). This hill climbing typically involves some randomness/stochastic-ness/monte-carlo-ness. In general, these techniques are more likely to accept less optimal configurations early on and, as the optimization progresses, they are less likely to accept inferior configurations.
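To illustrate the "accept worse configurations early, less so later" idea, here is a small simulated-annealing-style sketch (not an evolutionary algorithm). The cost function, cooling schedule, step size, and all names are choices made for this example only.

```python
import math
import random

def accept(current_cost, candidate_cost, temperature, rng):
    """Metropolis-style rule: always accept improvements, sometimes accept worse moves."""
    if candidate_cost <= current_cost:
        return True
    # Worse moves are accepted with a probability that shrinks as temperature drops.
    return rng.random() < math.exp((current_cost - candidate_cost) / temperature)

def anneal(cost, start, steps=5000, t_start=2.0, t_end=0.01, seed=0):
    """Random-walk search that tolerates uphill moves early and rarely later."""
    rng = random.Random(seed)
    current = best = start
    for step in range(steps):
        # Geometric cooling: temperature shrinks from t_start to t_end.
        temperature = t_start * (t_end / t_start) ** (step / (steps - 1))
        candidate = current + rng.gauss(0.0, 0.5)
        if accept(cost(current), cost(candidate), temperature, rng):
            current = candidate
        if cost(current) < cost(best):
            best = current
    return best

def bumpy(x):
    """Toy cost with several local minima; the lowest one is near x = -0.5."""
    return x * x + 3.0 * math.sin(3.0 * x)

best = anneal(bumpy, start=4.0)
print(f"best x = {best:.3f}, cost = {bumpy(best):.3f}")
```

At high temperature the walk crosses the barriers between minima freely; as the temperature falls it settles into whichever basin it is in.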
Evolutionary algorithms are loosely based on evolutionary analogies. Simulated annealing is based upon analogies to annealing in metals. Particle swarm techniques are also inspired by biological systems. In all cases you should compare results to a simple random (a.k.a. "Monte Carlo") sampling of configurations. This will often yield equivalent results.
My advice is to start off using a deterministic gradient-based technique since they generally require far fewer function evaluations than stochastic/monte-carlo techniques. When you hear hoof steps think horses not zebras. Run the optimization from several different starting points and, unless you are dealing with a particularly nasty problem, you should end up with roughly the same minimum. If not, then you might have zebras and should consider using a global optimization method.
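The multi-start check can be sketched in a few lines. Plain fixed-step gradient descent stands in here for whatever deterministic method you actually use, and the two toy functions, the learning rate, and the tolerance are illustrative assumptions.

```python
def gradient_descent(grad, x0, learning_rate=0.01, steps=2000):
    """Plain fixed-step gradient descent in one dimension."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

def multi_start(grad, starts, tolerance=1e-3):
    """Run the optimizer from several starting points and check whether they agree."""
    minima = [gradient_descent(grad, x0) for x0 in starts]
    agree = max(minima) - min(minima) < tolerance
    return minima, agree

def convex_grad(x):
    """Gradient of f(x) = (x - 2)^2, which has a single minimum at x = 2."""
    return 2.0 * (x - 2.0)

def double_well_grad(x):
    """Gradient of f(x) = (x^2 - 1)^2, which has minima at x = -1 and x = +1."""
    return 4.0 * x * (x * x - 1.0)

starts = [-2.0, -0.5, 0.5, 2.0]
convex_minima, convex_agree = multi_start(convex_grad, starts)
well_minima, well_agree = multi_start(double_well_grad, starts)
print("convex:", convex_agree, [round(m, 3) for m in convex_minima])
print("double well:", well_agree, [round(m, 3) for m in well_minima])
```

When the starts disagree, as in the double-well case, that is the hint that you may have zebras.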
Well, I think "Monte Carlo methods" is the general name for methods which use random numbers in order to solve optimization problems. In this sense, even evolutionary algorithms are a type of Monte Carlo method if they use random numbers (and in fact they do).
Other Monte Carlo methods are: Metropolis, Wang-Landau, parallel tempering, etc.
OTOH, evolutionary methods use 'techniques' borrowed from nature such as mutation, cross-over, etc.
