Connecting exchange names and codes to LCA inventory results - brightway

I'm getting into Brightway2 for some energy system modeling and I'm still getting used to all of the concepts.
I've created a small custom demo database and run lca.lci() and lca.lcia(). lca.inventory and lca.characterized_inventory both return sparse matrices of the results. My question, which may be very simple, is how to connect the values in the matrix to the exchange names and keys. I.e., if I wanted to print the results to a file, how would I match the exchanges to the inventory values?
Thanks.

To really understand what is going on, it is useful to understand the difference between "intermediate" data (stored as structured text files) and "processed" data (stored as numpy structured arrays). These concepts are described both here and here.
However, to answer your question directly: what each row and column stands for in the different matrices and arrays (e.g. the lca.inventory matrix, lca.supply_array, lca.characterized_inventory) is recorded in a set of dictionaries associated with your LCA object. These are:
activity_dict: Columns in the technosphere matrix
product_dict: Rows in the technosphere matrix
biosphere_dict: Rows in the biosphere matrix
For example, lca.product_dict yields, in the case of an LCA I just did:
{('ei32_CU_U', '671c1ae85db847083176b9492f000a9d'): 8397,
('ei32_CU_U', '53398faeaf96420408204e309262b8c5'): 536,
('ei32_CU_U', 'fb8599da19dabad6929af8c3a3c3bad6'): 7774,
('ei32_CU_U', '28b3475e12e4ed0ec511cbef4dc97412'): 3051, ...}
with the key in the dictionary being the actual product in my inventory database and the value being the row index in the demand_array or the supply_array.
More useful may be the reverse of these dictionaries. Say you want to know what a value in e.g. your supply_array refers to; you can create a reverse dictionary using a dict comprehension:
inv_product_dict = {v: k for k, v in lca.product_dict.items()}
and then simply use it directly to obtain the information you are after. Say you want to know what row index 10 of the supply_array refers to; you can simply do inv_product_dict[10], which in my case yields ('ei32_CU_U', '4110733917e1fcdc7c55af3b3f068c72')
The same logic applies to biosphere (or elementary) flows, found in lca.biosphere_dict (in LCA parlance, rows in the B matrix), and to activities, found in lca.activity_dict (columns of the A or B matrices).
Note that you can generate the reverse of the activity_dict/product_dict/biosphere_dict simultaneously using lca.reverse_dict(). The syntax then is:
rev_act_dict, rev_product_dict, rev_bio_dict = lca.reverse_dict()
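As a minimal sketch of how the reverse dictionaries let you write results to a file: the keys and values below are invented stand-ins; in a real run they come from lca.biosphere_dict, lca.activity_dict, and the nonzeros of lca.characterized_inventory.

```python
# Toy stand-ins for lca.biosphere_dict and lca.activity_dict:
# {(database, code): matrix index}. Real keys come from your databases.
biosphere_dict = {("biosphere3", "co2"): 0, ("biosphere3", "ch4"): 1}
activity_dict = {("demo_db", "electricity"): 0, ("demo_db", "steel"): 1}

# Reverse them: matrix index -> (database, code)
rev_bio = {v: k for k, v in biosphere_dict.items()}
rev_act = {v: k for k, v in activity_dict.items()}

# Stand-in for the nonzero entries of lca.characterized_inventory; with a
# real LCA you would do coo = lca.characterized_inventory.tocoo() and
# iterate zip(coo.row, coo.col, coo.data).
nonzeros = [(0, 0, 0.53), (1, 1, 0.02)]

lines = []
for row, col, value in nonzeros:
    flow_key = rev_bio[row]   # which biosphere flow this row represents
    act_key = rev_act[col]    # which activity this column represents
    lines.append(f"{flow_key}\t{act_key}\t{value}")
# `lines` can now be written out, e.g. with open(...).writelines(...)
```

From there you can look up human-readable names for each key in your databases before writing.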

Related

How to sort rows in Excel without having repeated data together

I have a table of data with many repeated entries.
I have to sort the rows randomly, but without having identical names next to each other, as shown here:
How can I do that in Excel?
Perfect case for a recursive LAMBDA.
In Name Manager, define RandomSort as
=LAMBDA(ζ,
LET(
ξ, SORTBY(ζ, RANDARRAY(ROWS(ζ))),
λ, TAKE(ξ, , 1),
κ, SUMPRODUCT(N(DROP(λ, -1) = DROP(λ, 1))),
IF(κ = 0, ξ, RandomSort(ζ))
)
)
then enter
=RandomSort(A2:B8)
within the worksheet somewhere. Replace A2:B8 - which should be your data excluding the headers - as required.
If no solution is possible then you will receive a #NUM! error. I didn't get round to adding a clause to determine whether a certain combination of names has a solution or not.
This is just an attempt, because the question might need clarification or more sample data to understand the actual scenario. The main idea is to generate a random ordering of the input, then distribute it evenly by name. This ensures no repetition of consecutive names; it is not the only possible ordering (the problem may have multiple valid combinations), but it is a valid one. The solution is volatile (every time Excel recalculates, a new output is generated) because RANDARRAY is a volatile function.
In cell D2, you can use the following formula:
=LET(rng, A2:B8, m, ROWS(rng), seq, SEQUENCE(m),
idx, SORTBY(seq, RANDARRAY(m,,1,m, TRUE)), rRng, INDEX(rng, idx,{1,2}),
names, INDEX(rRng,,1), nCnts, MAP(seq, LAMBDA(s, ROWS(FILTER(names,
(names=INDEX(names,s)) * (seq<=s))))), SORTBY(rRng, nCnts))
Here is the output:
Update
Looking at @JosWoolley's approach, the generation of the random sorting can be simplified, so the resulting formula becomes:
=LET(rng, A2:B8, m, ROWS(rng), seq, SEQUENCE(m), rRng,SORTBY(rng, RANDARRAY(m)),
names, TAKE(rRng,,1), nCnts, MAP(seq, LAMBDA(s, ROWS(FILTER(names,
(names=INDEX(names,s)) * (seq<=s))))), SORTBY(rRng, nCnts))
Explanation
The LET function is used for easy reading and composition. The name idx represents a random sequence of the input index positions. The name rRng represents the input rng sorted randomly. This sorting doesn't ensure consecutive names are distinct.
In order to ensure consecutive names are not repeated, we enumerate repeated names (nCnts) using MAP. This is similar to the idea provided by @cybernetic.nomad in the comment section, but adapted for an array version (we cannot use COUNTIF because it requires a range). Finally, we use SORTBY with the map result (nCnts) as the by_array input argument, so that names are distributed evenly and no consecutive names are the same. Every time Excel recalculates, you will get an output with the names distributed evenly in a different way.
Not sure if it's worth posting this, but I might as well share the results of my research, such as it is. The problem is similar to that of re-arranging the characters in a string so that no identical characters are adjacent. The method is simply to insert whichever of the remaining names has the highest frequency at this point and is not the same as the previous name, then reduce its frequency once it has been used. It's fairly easy to implement this in Excel, even in Excel 2019. So, with the initial frequencies in D2:D8 for convenience, using COUNTIF:
=COUNTIF(A$2:A$8,A2)
You can use this formula in (say) F2 and pull it down:
=INDEX(A$2:A$8,MATCH(MAX((D$2:D$8-COUNTIF(F$1:F1,A$2:A$8))*(A$2:A$8<>F1)),(D$2:D$8-COUNTIF(F$1:F1,A$2:A$8))*(A$2:A$8<>F1),0))
and similarly in G2 to get the ages:
=INDEX(B$2:B$8,MATCH(MAX((D$2:D$8-COUNTIF(F$1:F1,A$2:A$8))*(A$2:A$8<>F1)),(D$2:D$8-COUNTIF(F$1:F1,A$2:A$8))*(A$2:A$8<>F1),0))
I'm fairly sure this will always produce a correct result if one is possible.
HOWEVER, there is no randomness built into this method. You can see, if I extend it to more data, that in the first several rows the most common name simply alternates with the other two names:
Having said that, this is a bit of a worst case scenario (a lot of duplication) and it may not look too bad with real data, so it may be worth considering this approach along with the other two methods.
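For what it's worth, the same greedy method can be sketched outside Excel (Python here; the names are invented sample data): repeatedly pick the most frequent remaining name that differs from the previous pick.

```python
from collections import Counter

def spread_names(names):
    """Greedy re-arrangement: always take the most frequent remaining
    name that is not equal to the previous one."""
    counts = Counter(names)
    result = []
    prev = None
    for _ in range(len(names)):
        # remaining names, excluding whatever was just placed
        candidates = [n for n in counts if counts[n] > 0 and n != prev]
        if not candidates:
            return None  # no valid arrangement reachable from this state
        pick = max(candidates, key=lambda n: counts[n])
        counts[pick] -= 1
        result.append(pick)
        prev = pick
    return result

out = spread_names(["Ann", "Ann", "Ann", "Bob", "Cat"])
```

As in the Excel version, there is no randomness here; a random tie-break among equally frequent candidates would mirror the RANDARRAY-based answers.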

Application of a custom function to generate iterations across a distance range

Bit of a complex query but I will try to articulate it as best I can:
Essentially I have a list of objects for which I am trying to work out the probability of said items hitting particular points below on the seabed having been dropped at the surface. So there are two steps I need guidance on:
I need to define a custom function, ERF(a,b), in which I need to refer to specified values dependent on the Item Number (see the column names) in order to use them as multipliers:
These multipliers can be found in a dictionary, Lat_Dev (please ignore the dataframe column reference, which was a previous attempt at coding a solution; the info is now stored in a dictionary).
The function then needs to be repeated for iterations between a and b with a step size of 0.1 m, over the range 0-100 m. a is the lower limit and b the upper limit (e.g. a = 0.1 m and b = 0.2 m). This is done for each column (i.e. Item_Num).
Hopefully the above is clear enough. Cheers,
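Since the question doesn't show the data or the exact formula, the following is only a hedged sketch of the iteration pattern: the Lat_Dev values and the band formula (a scaled error-function difference) are placeholders, not the questioner's actual model.

```python
import math

# Hypothetical multipliers per item (stand-in for the question's Lat_Dev dict)
Lat_Dev = {"Item_1": 0.8, "Item_2": 1.2}

def erf_band(a, b, item):
    """Probability of landing between depths a and b for one item.
    The real formula isn't given in the question; a scaled error-function
    difference is used here purely as a placeholder."""
    s = Lat_Dev[item]
    return 0.5 * (math.erf(b * s) - math.erf(a * s))

# Repeat the function over successive 0.1 m bands from 0 to 100 m, per item
step, limit = 0.1, 100.0
n = int(limit / step)                           # 1000 bands
edges = [round(i * step, 10) for i in range(n + 1)]
results = {item: [erf_band(a, b, item) for a, b in zip(edges, edges[1:])]
           for item in Lat_Dev}
```

The dict comprehension plays the role of "done for each column": each Item_Num gets its own list of 1000 band values.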

Turn list of tuples into binary tensors?

I have a list of tuples in the form below. Each tuple represents a pair of movies that a given user liked. Together, the tuples capture every combination of movie likes found in my data.
[(movie_a,movie_b),...(movie_a,movie_b)]
My task is to create movie embeddings, similar to word embeddings. The idea is to train a single hidden layer NN to predict the most likely movie which any user might like, given a movie supplied. Much like word embeddings, the task is inconsequential; it's the weight matrix I'm after, which maps movies to vectors.
Reference: https://arxiv.org/vc/arxiv/papers/1603/1603.04259v2.pdf
In total, there are 19,000,000 tuples (training examples). Likewise, there are 9,000 unique movie IDs in my data. My initial goal was to create an input variable X, where each row represented a unique movie_id and each column represented a unique observation. In any given column, only one cell would be set to 1, with all other values set to 0.
As an intermediate step, I tried creating a matrix of zeros of the right dimensions
X = np.zeros([9000,19000000])
Understandably, my computer crashed, simply trying to allocate sufficient memory to X.
Is there a memory efficient way to pass my list of values into PyTorch, such that a binary vector is created for every training example?
Likewise, I tried randomly sampling 500,000 observations. But similarly, passing (9000, 500000) to np.zeros() resulted in another crash.
My university has a GPU server available, and that's my next stop. But I'd like to know if there's a memory efficient way that I should be doing this, especially since I'll be using shared resources.
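One memory-saving pattern (a sketch, not necessarily the best option): store every example as a pair of integer indices and materialize dense one-hot vectors only for the current mini-batch, never for the whole dataset. The data below is a toy stand-in; 19M index pairs as int64 cost roughly 300 MB, versus terabytes for the full 9,000 × 19,000,000 dense matrix.

```python
import numpy as np

num_movies = 9000   # unique movie IDs mentioned in the question

# Each training example is just (input movie index, target movie index).
pairs = np.array([(3, 7), (1, 3), (7, 1)], dtype=np.int64)

def one_hot_batch(indices, size):
    """Build dense one-hot rows for a single mini-batch only."""
    batch = np.zeros((len(indices), size), dtype=np.float32)
    batch[np.arange(len(indices)), indices] = 1.0
    return batch

x = one_hot_batch(pairs[:, 0], num_movies)   # shape (3, 9000)
```

In PyTorch specifically, torch.nn.Embedding accepts integer indices directly, so for learning embeddings you may not need explicit one-hot vectors at all.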

How to apply an sklearn pipeline to a list of features depending on availability

I have a pandas dataframe with 10 features (e.g., all floats). Given the different characteristics of the features (e.g., their means), the dataframe can be broken into 4 subsets: mean < 0, mean within (0, 1), mean within (1, 100), and mean >= 100.
For each subset a different pipeline will be applied; however, the subsets may not always all be present. For example, the dataset might contain only mean < 0; or only mean < 0 and mean within (1, 100); or all 4 subsets.
The question is how to apply the pipelines depending on the availability of the subsets.
The problem is that there will be 7 different combinations in total:
all subsets exist, only 3 exist, only 2 exist, or only 1 exists.
How can I assign different pipelines depending on the availability of the subsets without using nested if/else (10 if/else)?
if subset1 exists:
make_column_transformer((pipeline1, subset1))
elif subset2 exists:
make_column_transformer((pipeline2, subset2))
elif subset3 exists:
make_column_transformer((pipeline3, subset3))
elif subset1 and subset2 exist:
make_column_transformer((pipeline1, subset1), (pipeline2, subset2))
elif subset3 and subset2 exist:
make_column_transformer((pipeline3, subset3), (pipeline2, subset2))
elif subset1 and subset3 exist:
make_column_transformer((pipeline1, subset1), (pipeline3, subset3))
elif subset1 and subset2 and subset3 exist:
make_column_transformer((pipeline1, subset1), (pipeline2, subset2), (pipeline3, subset3))
Is there a better way to avoid this nested if/else (considering the case where we have 10 different subsets)?
The way to apply different transformations to different sets of features is with ColumnTransformer [1]. You could then have lists of column names, which can be filled based on the conditions you want. Each transformer will take the columns listed in its list, for example cols_mean_lt0 = [...], etc.
Having said that, your approach doesn't look good to me. You probably want to scale the features so they all have the same mean and std. Depending on the algorithm you'll use, this may be mandatory or not.
[1] https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
EDIT:
ColumnTransformer takes transformers, which are tuples of (name, transformer, columns). What you want is multiple transformers, each of which will process different columns. The columns in the tuple can be indicated by 'string or int, array-like of string or int, slice, boolean mask array or callable'. This is where I suggest you pass a list of columns.
This way, you can have three transformers, one for each of your cases. To indicate which columns you want each transformer to process, you just have to create three lists, one per transformer. Each column will correspond to one of the lists. This is simple to do: in a loop, check each column's mean, then append the column name to the list for the corresponding transformer.
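A sketch of that list-building idea, assigning each column to a bucket by its mean and passing only the non-empty buckets to ColumnTransformer. The bucket names, sample data, and choice of scalers are placeholders, not the questioner's actual pipelines.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data: one column per mean range (the ">= 100" bucket is empty here)
df = pd.DataFrame({"a": [-1.0, -2.0], "b": [0.2, 0.4], "c": [50.0, 60.0]})

buckets = {"lt0": [], "0to1": [], "1to100": [], "ge100": []}
for col in df.columns:
    m = df[col].mean()
    if m < 0:
        buckets["lt0"].append(col)
    elif m < 1:
        buckets["0to1"].append(col)
    elif m < 100:
        buckets["1to100"].append(col)
    else:
        buckets["ge100"].append(col)

# Placeholder pipelines, one per bucket
pipelines = {"lt0": StandardScaler(), "0to1": MinMaxScaler(),
             "1to100": StandardScaler(), "ge100": MinMaxScaler()}

# Only buckets that actually contain columns become transformers,
# so no if/else chain over subset combinations is needed.
transformers = [(name, pipelines[name], cols)
                for name, cols in buckets.items() if cols]
ct = ColumnTransformer(transformers)
out = ct.fit_transform(df)
```

Because empty buckets are simply skipped, the same code handles any of the 7 (or 2^10 - 1) availability combinations.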
Hope this helps!

Search selection

For a C# program that I am writing, I need to compare similarities in two entities (can be documents, animals, or almost anything).
Based on certain properties, I calculate the similarities between the documents (or entities).
I put their similarities in a table as below
X Y Z
A|0.6 |0.5 |0.4
B|0.6 |0.4 |0.2
C|0.6 |0.3 |0.6
I want to find the best matching pairs (e.g. AX, BY, CZ) based on the highest similarity score. A higher score indicates greater similarity.
My problem arises when there is a tie between similarity values. For example, AX and CZ have both 0.6. How do I decide which two pairs to select? Are there any procedures/theories for this kind of problems?
Thanks.
In general, tie-breaking methods are going to depend on the context of the problem. In some cases, you want to report all the tying results. In other situations, you can use an arbitrary means of selection such as which one is alphabetically first. Finally, you may choose to have a secondary characteristic which is only evaluated in the case of a tie in the primary characteristic.
Additionally, you can always report one or more and then alert the user that there was a tie, letting them decide for themselves.
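The secondary-characteristic idea amounts to a compound sort key, sketched here in Python for brevity (the question's code is C#); the pair names and scores are invented sample data.

```python
# Candidate (entity, point, similarity) triples - invented sample data
pairs = [("A", "X", 0.6), ("B", "Y", 0.4), ("C", "Z", 0.6)]

# Primary key: similarity, descending. Secondary key: alphabetical order
# of the pair, consulted only when similarities tie (e.g. AX vs CZ at 0.6).
ranked = sorted(pairs, key=lambda p: (-p[2], p[0], p[1]))
```

The same pattern carries over to C# via OrderByDescending(...).ThenBy(...) in LINQ.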
In this case, the similarities you should be looking for are:
- Value
- Row
- Column
Objects which have any of the above in common are "similar". You could assign a weighting to each property, so that objects which have the same value are more similar than objects which are in the same column. Also, objects which have the same value and are in the same column are more similar than objects with just the same value.
Depending on whether there are any natural ranges occurring in your data, you could also consider comparing ranges. For example, two numbers in the range 0-0.5 might be somewhat similar.