How to manually link in Brightway2 an imported exchange, given I have found the correct one in ecoinvent - brightway

I have been linking my data automatically with
import functools
from bw2io.strategies import link_iterable_by_fields

sp.apply_strategy(functools.partial(
    link_iterable_by_fields,
    other=Database("ecoinvent 3.2 cutoff"),
    kind="technosphere",
    fields=["reference product", "name", "unit", "location"]
))
sp.statistics()
When I list the remaining unlinked datasets from the bw2io.importers.simapro_csv.SimaProCSVImporter instance, it outputs e.g.:
Electricity, low voltage {ENTSO-E}| market group for | Alloc Rec, U kilowatt hour ('Electricity/heat',)
Given that I found the dataset in ecoinvent:
'market group for electricity, low voltage' (kilowatt hour, ENTSO-E, None)
How do I link these datasets together?

This is a dataset from ecoinvent 3.2, for which bw2io does not yet have the migration data for the "special" SimaPro names. Normally, conversion from SimaPro names (e.g. Electricity, low voltage {ENTSO-E}| market group for | Alloc Rec, U) to ecoinvent activity names and reference products would be handled by the migration simapro-ecoinvent-3. But this doesn't work in this case:
In [4]: Migration('simapro-ecoinvent-3').load()['Electricity, low voltage {ENTSO-E}| market group for | Alloc Rec, U']
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
You can write your own migration:
migration_data = {
    'fields': ['name'],
    'data': [
        (
            # First element is input data in the order of `fields` above
            ('Electricity, low voltage {ENTSO-E}| market group for | Alloc Rec, U',),
            # Second element is new values
            {
                'name': 'market group for electricity, low voltage',
                'reference product': 'electricity, low voltage',
                'location': 'ENTSO-E',
            }
        )
    ]
}
Migration("new-ecoinvent").write(
    migration_data,
    description="New datasets in ecoinvent 3.2"
)
And then apply this migration to your unlinked data:
sp.migrate("new-ecoinvent")
Migration only changes the data used to link; you will still have to apply link_iterable_by_fields to actually link against ecoinvent 3.2.
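For example, a minimal sketch of the full sequence, reusing the importer sp and the linking call from the question:

sp.migrate("new-ecoinvent")
sp.apply_strategy(functools.partial(
    link_iterable_by_fields,
    other=Database("ecoinvent 3.2 cutoff"),
    kind="technosphere",
    fields=["reference product", "name", "unit", "location"]
))
sp.statistics()  # the ENTSO-E exchange should now be linked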

Related

Delta live tables data quality checks -Retain failed records

There are 3 types of quality checks in Delta live tables:
expect (retain invalid records)
expect_or_drop (drop invalid records)
expect_or_fail (fail on invalid records)
I want to retain invalid records, but I also want to keep track of them. So, by using expect, can I query the invalid records, or is it just for keeping stats like "n records were invalid"?
expect just records that you had some problems, so you have some statistics about your data quality in the pipeline. But it's not very useful in practice.
Native quarantine functionality is still not available, that's why there is the recipe in the cookbook. Although it's not exactly what you need, you can still build on top of it, especially if you take into account the second part of the recipe that explicitly adds a Quarantine column - we can combine it with expect to get statistics into UI:
import dlt
from pyspark.sql.functions import expr

rules = {}
quarantine_rules = {}
...
quarantine_rules = "NOT({0})".format(" AND ".join(rules.values()))

@dlt.table(
    name="partitioned_farmers_market",
    partition_cols=['Quarantine']
)
@dlt.expect_all(rules)
def get_partitioned_farmers_market():
    return (
        dlt.read("raw_farmers_market")
        .withColumn("Quarantine", expr(quarantine_rules))
        .select("MarketName", "Website", "Location", "State",
                "Facebook", "Twitter", "Youtube", "Organic", "updateTime",
                "Quarantine")
    )
Another approach would be to use the first part of the recipe (that uses expect_all_or_drop) and just union both tables (it's better to mark the valid/invalid tables with the temporary = True marker), as in the sketch below.
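A minimal sketch of that union approach, assuming a hypothetical rule set and table names (they are not part of the original recipe):

import dlt
from pyspark.sql.functions import expr, lit

rules = {"valid_website": "Website IS NOT NULL"}  # hypothetical rule
quarantine_rules = "NOT({0})".format(" AND ".join(rules.values()))

@dlt.table(temporary=True)
@dlt.expect_all_or_drop(rules)
def farmers_market_valid():
    # only records satisfying every rule survive the drop
    return dlt.read("raw_farmers_market")

@dlt.table(temporary=True)
def farmers_market_invalid():
    # records violating at least one rule
    return dlt.read("raw_farmers_market").where(expr(quarantine_rules))

@dlt.table(name="farmers_market_all")
def farmers_market_all():
    # union both, flagging quarantined records so they stay queryable
    return (
        dlt.read("farmers_market_valid").withColumn("Quarantine", lit(False))
        .unionByName(dlt.read("farmers_market_invalid").withColumn("Quarantine", lit(True)))
    )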

Get all elementary flows generated by an activity in Brightway

I would like to access to all elementary flows generated by an activity in Brightway in a table that would gather the flows and the amounts.
Let's assume a random activity:
lca = bw.LCA({random_act: 2761}, method)
lca.lci()
lca.lcia()
lca.inventory
I have tried several ways but none works:
I have tried to export my LCI with brightway2-io, but some errors appear that I cannot solve:
bw2io.export.excel.lci_matrices_to_excel(db_name) returns an error when computing the biosphere matrix data for a specific row:
--> 120 bm_sheet.write_number(bio_lookup[row] + 1, act_lookup[col] + 1, value)
122 COLUMNS = (
123 u"Index",
124 u"Name",
(...)
128 u"Location",
129 )
131 tech_sheet = workbook.add_worksheet("technosphere-labels")
KeyError: 1757
I have also tried to manually get the amount of a specific elementary flow. For example, let's say I want to compute the total amount of Aluminium needed for the activity. To do so, I try this:
flow_Al = Database("biosphere3").search("Aluminium, in ground")
(I only want the resource Aluminium that is extracted as an ore, from the ground.)
amount_Al = 0
row = lca.biosphere_dict[flow_Al]
col_indices = lca.biosphere_matrix[row, :].tocoo()
amount_consumers_lca = [lca.inventory[row, index] for index in col_indices.col]
for j in amount_consumers_lca:
    amount_Al = amount_Al + j
amount_Al
This works, but the final amount is too low and probably isn't what I'm looking for...
How can I solve this?
Thank you
This will work on Brightway 2 and 2.5:
import pandas as pd
import bw2data as bd
import warnings

def create_inventory_dataframe(lca, cutoff=None):
    array = lca.inventory.sum(axis=1)
    if cutoff is not None and not (0 < cutoff < 1):
        warnings.warn(f"Ignoring invalid cutoff value {cutoff}")
        cutoff = None
    total = array.sum()
    include = lambda x: abs(x / total) >= cutoff if cutoff is not None else True
    if hasattr(lca, 'dicts'):
        mapping = lca.dicts.biosphere
    else:
        mapping = lca.biosphere_dict
    data = []
    for key, row in mapping.items():
        amount = array[row, 0]
        if include(amount):
            data.append((bd.get_activity(key), row, amount))
    data.sort(key=lambda x: abs(x[2]))
    return pd.DataFrame([{
        'row_index': row,
        'amount': amount,
        'name': flow.get('name'),
        'unit': flow.get('unit'),
        'categories': str(flow.get('categories'))
    } for flow, row, amount in data])
The cutoff doesn't make sense for the inventory database, but it can be adapted for the LCIA result (characterized_inventory) as well.
Once you have a pandas DataFrame you can filter or export easily.
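For example, a minimal usage sketch, assuming the random_act and method objects from the question:

lca = bw.LCA({random_act: 2761}, method)
lca.lci()
df = create_inventory_dataframe(lca)
# filter and export, e.g. keep only flows whose name mentions Aluminium
df[df['name'].str.contains('Aluminium')].to_excel('aluminium_flows.xlsx')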

BigQuery Struct Aggregation

I am processing an ETL job on BigQuery, where I am trying to reconcile data where there may be conflicting sources. I first used array_agg(distinct my_column ignore nulls) to find out where reconciliation was needed, and next I need to prioritize data per column based on the source.
I thought to array_agg(struct(data_source, my_column)) and hoped I could easily extract the preferred source's data for a given column. However, with this method, I failed to aggregate the data as a struct and instead aggregated it as an array of structs.
Consider the simplified example below, where I prefer to get job_title from HR and dietary_pref from Canteen:
with data_set as (
select 'John' as employee, 'Senior Manager' as job_title, 'vegan' as dietary_pref, 'HR' as source
union all
select 'John' as employee, 'Manager' as job_title, 'vegetarian' as dietary_pref, 'Canteen' as source
union all
select 'Mary' as employee, 'Marketing Director' as job_title, 'pescatarian' as dietary_pref, 'HR' as source
union all
select 'Mary' as employee, 'Marketing Manager' as job_title, 'gluten-free' as dietary_pref, 'Canteen' as source
)
select employee,
array_agg(struct(source, job_title)) as job_title,
array_agg(struct(source, dietary_pref)) as dietary_pref,
from data_set
group by employee
The data I get for John with regard to the job title is:
[{'source':'HR', 'job_title':'Senior Manager'}, {'source': 'Canteen', 'job_title':'Manager'}]
Whereas I am trying to achieve:
[{'HR' : 'Senior Manager', 'Canteen' : 'Manager'}]
With a struct output, I was hoping to then easily access the preferred source using my_struct.my_preferred_source. In this particular case I hope to invoke job_title.HR and dietary_pref.Canteen.
Hence, in pseudo-SQL, I imagine I would write:
select employee,
AGGREGATE_JOB_TITLE_AS_STRUCT(source, job_title).HR as job_title,
AGGREGATE_DIETARY_PREF_AS_STRUCT(source, dietary_pref).Canteen as dietary_pref,
from data_set group by employee
The output would then be:
I'd like help here solving this. Perhaps that's the wrong approach altogether, but given the more complex data set I am dealing with I thought this would be the preferred approach (albeit failed).
Open to alternatives. Please advise. Thanks
Notes: I edited this post after Mikhail's answer, which solved my problem using a slightly different method than I expected, and added more details on my intent to use a single struct per employee
Consider below
select employee,
array_agg(struct(source as job_source, job_title) order by if(source = 'HR', 1, 2) limit 1)[offset(0)].*,
array_agg(struct(source as dietary_source, dietary_pref) order by if(source = 'HR', 2, 1) limit 1)[offset(0)].*
from data_set
group by employee
If applied to the sample data in your question, the output is:
Update:
Use below for the clarified output:
select employee,
array_agg(job_title order by if(source = 'HR', 1, 2) limit 1)[offset(0)] as job_title,
array_agg(dietary_pref order by if(source = 'HR', 2, 1) limit 1)[offset(0)] as dietary_pref
from data_set
group by employee
with output
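If you really do want a single struct per employee keyed by source (so that job_title.HR works), a hedged sketch using conditional aggregation, assuming the only sources are 'HR' and 'Canteen':

select employee,
  struct(
    max(if(source = 'HR', job_title, null)) as HR,
    max(if(source = 'Canteen', job_title, null)) as Canteen
  ) as job_title,
  struct(
    max(if(source = 'HR', dietary_pref, null)) as HR,
    max(if(source = 'Canteen', dietary_pref, null)) as Canteen
  ) as dietary_pref
from data_set
group by employee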

what is the key to improve speed when we need action operation in bigdata processing by Spark?

I have recently been using Spark 1.5.1 to process Hadoop data. However, my experience with Spark has not been good, because action operations (e.g. .count(), .collect()) are very slow. My task can be described as follows:
I have a dataframe like this:
----------------------------
trans item_code item_qty
----------------------------
001 A 2
001 B 3
002 A 4
002 B 6
002 C 10
003 D 1
----------------------------
I need to find association rules of two items, e.g. one of A will result in one and a half of B with confidence of 0.8. The desired result dataframe is like this:
----------------------------
item1 item2 conf coef
----------------------------
A B 0.8 1.5
B A 1.0 0.67
A C 0.7 2.5
----------------------------
My method is to use FP-growth to generate frequent item sets first, and then filter out the item sets with one item and the item sets with two items. After that I can calculate the confidence of one item resulting in another. For example, having (itemset=[A], support=0.4), (itemset=[B], support=0.2), (itemset=[A,B], support=0.2), I can generate the association rules (rule=(A->B), confidence=0.5) and (rule=(B->A), confidence=1.0), as in the small sketch below.
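A minimal sketch of that confidence arithmetic, using the hypothetical support values above:

# hypothetical support values from the example above
supports = {('A',): 0.4, ('B',): 0.2, ('A', 'B'): 0.2}
conf_A_to_B = supports[('A', 'B')] / supports[('A',)]  # = 0.5
conf_B_to_A = supports[('A', 'B')] / supports[('B',)]  # = 1.0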
However, when I broadcast the one-item frequent item sets as a dictionary, the .collectAsMap action is really very slow. I tried to use .join and it is even slower. I even need to wait hours to see rdd.count(). I know we should avoid any use of action operations in Spark, but sometimes it is unavoidable. So I am curious what the key is to improving speed when we face action operations.
My code is here:
#!/usr/bin/python
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.mllib.fpm import FPGrowth
import time

# read raw data from database
def read_data():
    sql = """select t.orderno_nosplit,
                    t.prod_code,
                    t.item_code,
                    sum(t.item_qty) as item_qty
             from ioc_fdm.fdm_dwr_ioc_fcs_pk_spu_item_f_chain t
             group by t.prod_code, t.orderno_nosplit, t.item_code"""
    data = sql_context.sql(sql)
    return data.cache()
# calculate quantity coefficient of two items
def qty_coef(item1, item2):
    sql = """select t1.item, t1.qty from table t1
             where t1.trans in
                 (select t2.trans from spu_table t2 where t2.item = '%s')
             and t1.trans in
                 (select t3.trans from spu_table t3 where t3.item = '%s')""" % (item1, item2)
    df = sql_context.sql(sql)
    qty_item1 = df.filter(df.item_code == item1).agg({"item_qty": "sum"}).first()[0]
    qty_item2 = df.filter(df.item_code == item2).agg({"item_qty": "sum"}).first()[0]
    coef = float(qty_item2) / qty_item1
    return coef
def train(prod):
    spu = total_spu.filter(total_spu.prod_code == prod)
    print 'data length', spu.count(), time.strftime("%H:%M:%S")
    supp = 0.1
    conf = 0.7
    sql_context.registerDataFrameAsTable(spu, 'spu_table')
    sql_context.cacheTable('spu_table')
    print 'table register over', time.strftime("%H:%M:%S")
    trans_sets = spu.rdd.repartition(32).map(lambda x: (x[0], x[2])).groupByKey().mapValues(list).values().cache()
    print 'trans group over', time.strftime("%H:%M:%S")
    model = FPGrowth.train(trans_sets, supp, 10)
    print 'model train over', time.strftime("%H:%M:%S")
    model_f1 = model.freqItemsets().filter(lambda x: len(x[0]) == 1)
    model_f2 = model.freqItemsets().filter(lambda x: len(x[0]) == 2)
    # register model_f1 as dictionary
    model_f1_tuple = model_f1.map(lambda (U, V): (tuple(U)[0], V))
    model_f1Map = model_f1_tuple.collectAsMap()
    # convert model_f1Map to broadcast
    bc_model = sc.broadcast(model_f1Map)
    # generate association rules
    model_f2_conf = model_f2.map(lambda x: (x[0][0], x[0][1],
                                            float(x[1]) / bc_model.value[x[0][0]],
                                            float(x[1]) / bc_model.value[x[0][1]]))
    print 'conf calculation over', time.strftime("%H:%M:%S")
    model_f2_conf_flt = model_f2_conf.flatMap(lambda x: (x[0], x[1]))
    # filter the association rules by confidence threshold
    model_f2_conf_flt_ftr = model_f2_conf_flt.filter(lambda x: x[2] >= conf)
    # calculate the quantity coefficient for the filtered association rules
    # since we cannot use nested sql operations in rdd, I have to collect the rules to a list first
    asso_list = model_f2_conf_flt_ftr.map(lambda x: list(x)).collect()
    print 'coef calculation over', time.strftime("%H:%M:%S")
    for row in asso_list:
        row.append(qty_coef(row[0], row[1]))
    # rewrite the list to a dataframe
    asso_df = sql_context.createDataFrame(asso_list, ['item1', 'item2', 'conf', 'coef'])
    sql_context.clearCache()
    path = "hdfs:/user/hive/wilber/%s" % (prod)
    asso_df.write.mode('overwrite').parquet(path)
if __name__ == '__main__':
    sc = SparkContext()
    sql_context = HiveContext(sc)
    prod_list = sc.textFile('hdfs:/user/hive/wilber/prod_list').collect()
    total_spu = read_data()
    print 'spu read over', time.strftime("%H:%M:%S")
    for prod in list(prod_list):
        print 'prod', prod
        train(prod)

What is the best practice when importing 2 simapro datasets in brightway2 to merge them together

I have been importing one SimaPro CSV dataset with a recipe
sp = SimaProCSVImporter("recipe.CSV","recipe")
sp.migrate("simapro-ecoinvent-3")
sp.apply_strategies()
and another SimaPro CSV dataset with 4 specific unit processes for some of the ingredients in the first dataset.
sp2 = SimaProCSVImporter("ingredients.CSV","ingredients")
sp2.migrate("simapro-ecoinvent-3")
sp2.apply_strategies()
By matching all exchanges of the ingredients with ecoinvent I am able to do impact assessments.
sp2.match_database("ecoinvent 3.2 cutoff",ignore_categories=True)
db = sp2.write_database()
lca = LCA(
demand={db.random(): 1},
method=('IPCC 2013', 'GWP', '100 years'),
)
lca.lci()
lca.lcia()
lca.score
As a next step I have matched the recipe dataset, first against ecoinvent and then against the ingredients dataset.
sp.match_database("ecoinvent 3.2 cutoff",ignore_categories=True)
sp.match_database("ingredients",ignore_categories=True)
db2 = sp.write_database()
When I want to do the LCA calculation:
lca = LCA(
demand={db2.random(): 1},
method=('IPCC 2013', 'GWP', '100 years'),
)
lca.lci()
lca.lcia()
lca.score
I get the following error:
Technosphere matrix is not square: 12917 rows and 12921 products.
What did I do wrong, what is the best practice?
Hard to say without seeing the actual data. Are you checking .statistics() each time to make sure there aren't any unlinked exchanges before writing the database? The warning message is a bit confusing (now fixed in 1.3.5), but you have too many products (rows) and not enough activities (columns). The most probable way this could happen is if you have an activity with multiple products, but again, impossible to say more or suggest fixes without seeing the actual data.
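A minimal sketch of that check, reusing the importer from the question (statistics() and the unlinked attribute are standard on bw2io importers):

sp.match_database("ecoinvent 3.2 cutoff", ignore_categories=True)
sp.match_database("ingredients", ignore_categories=True)
sp.statistics()               # shows how many exchanges are still unlinked
unlinked = list(sp.unlinked)  # inspect any remaining unlinked exchanges
if not unlinked:
    db2 = sp.write_database() # only write once everything is linked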
