De-aggregating ecoinvent processes in Brightway2

Let's say I want to use an ecoinvent process for an automobile, and that the process model includes impacts for producing the car, maintenance, road maintenance, fuel, etc. And let's assume that I want to model the automobile without the fuel, because I want to model the use of a different fuel. Can I tell Brightway to calculate impacts for the automobile minus the fuel?

There are at least two ways to get the results you want. Let's say your inventory datasets look like this:
[
    {
        'code': 'car',
        'database': 'example',
        'exchanges': [{
            'input': ('example', 'fuel'),
            'amount': 1
        }]
    }, {
        'code': 'fuel',
        'database': 'example',
    }
]
Then you can either construct a new data set and subtract the fuel:
{
    'code': 'car w/out fuel',
    'database': 'example',
    'exchanges': [{
        'input': ('example', 'car'),
        'amount': 1
    }, {
        'input': ('example', 'fuel'),
        'amount': -1
    }]
}
And then use this dataset as your functional unit. Alternatively, you could subtract the fuel input directly in your functional unit passed to the LCA class:
LCA({('example', 'car'): 1, ('example', 'fuel'): -1})
You could also save this modified functional unit in a calculation setup.
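For concreteness, here is a rough sketch of what running such a calculation could look like; the LCIA method key below is only a placeholder, and the snippet assumes the "example" database and a suitable method are already installed:
from brightway2 import LCA, calculation_setups
# Placeholder LCIA method key -- substitute whatever method you actually use
method_key = ('IPCC 2013', 'climate change', 'GWP 100a')
functional_unit = {('example', 'car'): 1, ('example', 'fuel'): -1}
lca = LCA(functional_unit, method=method_key)
lca.lci()
lca.lcia()
print(lca.score)
# Or store the same functional unit in a calculation setup for later reuse
calculation_setups['car without fuel'] = {'inv': [functional_unit], 'ia': [method_key]}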
Responding to a comment about the ease of manipulating datasets: there isn't really a simple way. It is very difficult to define generic rules for working with inventory datasets, as inputs are structured very differently from one industry sector to another. To answer the specific comment, you could do something like:
from brightway2 import *
db = Database("ecoinvent 3.2 cutoff")
car = db.search('transport, passenger car, large size, diesel')[0]
new_car = car.copy()
for exc in new_car.exchanges():
    if 'diesel, low-sulfur' in exc.input['name']:
        exc.delete()
But this would require that you examine the search terms manually to make sure you get the behaviour that you want. In an ideal world, we would have a domain-specific language for manipulating datasets in simple ways, but I don't know what that would look like yet.

Related

Is there an efficient way to filter the maximum numbers in a list

I'm just curious to know if there is a better way to find all the maximum numbers in a list.
As an example, I have a list of dictionaries, each with a key ("info") whose value is another dictionary, as shown below. I want to find the maximum of the numbers ("count") in that list. Once the maximum is found, the corresponding "code" of each matching element has to be determined.
I am able to do it as below, but for a really large list the execution time will be high.
codes = [{"code": 1, "info": {"status": "running", "count": 4}},
         {"code": 2, "info": {"status": "running", "count": 1}},
         {"code": 3, "info": {"status": "running", "count": 2}},
         {"code": 4, "info": {"status": "running", "count": 4}},
         {"code": 5, "info": {"status": "running", "count": 4}}
         ]
count_max = 0
filtered_codes = []
for i in range(len(codes)):
    if codes[i]['info']['count'] == count_max:
        filtered_codes.append(codes[i]['code'])
    if codes[i]['info']['count'] > count_max:
        filtered_codes.clear()
        filtered_codes.append(codes[i]['code'])
        count_max = codes[i]['info']['count']
print(filtered_codes)
Output:
[1, 4, 5]  # 1, 4 and 5 are the codes of the dictionaries with count == 4 (the maximum among all counts)
So, is there any better way to do this? Maybe using filter, lambda, or max?
Edit: Someone had posted the code below as an answer, which was later deleted:
max_count = max(info["info"]["count"] for info in codes)
filtered_codes = [val["code"] for val in filter(lambda x: x["info"]["count"] == max_count, codes)]
So I ran both versions on a list of length 1000000.
The first approach (posted by me) took 1.4539954662322998 s.
The second approach took 0.4440000057220459 s.
So is the second approach the best solution?
Your code is already the most asymptotically efficient way to do this. It also has the advantage that it can operate on a stream -- using max and then filter would require multiple passes. The main advantage of using these builtins would be readability.
If you really need performance though, you probably want to use a library like numpy or write your code in C.
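For example, here is a quick sketch of how the same filtering could be vectorized with numpy, assuming the counts and codes fit comfortably in memory as flat arrays:
import numpy as np
# Pull the counts and codes into flat arrays (one pass each)
counts = np.array([d["info"]["count"] for d in codes])
code_ids = np.array([d["code"] for d in codes])
# Boolean mask selecting the rows whose count equals the maximum
filtered_codes = code_ids[counts == counts.max()].tolist()
print(filtered_codes)  # [1, 4, 5]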
Regarding the updated question:
Yeah I mean they are both linear time asymptotically. Which one is faster in the real world depends on which kinds of input you see, as well as cache and interpreter quirks. For example, the second one will definitely be way faster for specific datasets like count = [1,1,1,1,1,1,...,1,2]. The first example would be way faster streaming a 10TiB random input list from a remote service over a slow network. To do a true comparison you would need to define what space of datasets you want to test over.
Is solution 2 the "best" for the input data you generated? Well I mean there are more ways to optimize -- for example you can spin up threads and partition the input list for the filtering once you determine the max. Depends on how far you want to go.
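To illustrate that last point, here is a rough sketch of partitioning the filtering step once the maximum is known; the chunking scheme, worker count and function names are made up for illustration, and for CPU-bound pure-Python work processes are usually a better bet than threads:
from concurrent.futures import ProcessPoolExecutor
from itertools import chain, repeat

def filter_chunk(chunk, max_count):
    # Keep only the codes whose count equals the already-known maximum
    return [d["code"] for d in chunk if d["info"]["count"] == max_count]

def parallel_filter(codes, max_count, workers=4):
    size = -(-len(codes) // workers)  # ceiling division
    chunks = [codes[i:i + size] for i in range(0, len(codes), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(filter_chunk, chunks, repeat(max_count))
    return list(chain.from_iterable(results))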
Personally I would prefer to write (and read) the second example. Because it works and is easy to understand. Going even further it's probably more idiomatic to use if rather than filter in a comprehension, i.e.
max_count = max(info["info"]["count"] for info in codes)
filtered_codes = [val["code"] for val in codes if val["info"]["count"] == max_count]
Like I said before though if you actually care about super high performance on large data you obviously don't want to be using pure Python.

How to reconstruct text entities with Hugging Face's transformers pipelines without IOB tags?

I've been looking to use Hugging Face's pipelines for NER (named entity recognition). However, it returns the entity labels in inside-outside-beginning (IOB) format at the level of individual word pieces, so I'm not able to map the output of the pipeline back to my original text. Moreover, the outputs are split according to the BERT tokenization (the default model is BERT-large).
For example:
from transformers import pipeline
nlp_bert_lg = pipeline('ner')
print(nlp_bert_lg('Hugging Face is a French company based in New York.'))
The output is:
[{'word': 'Hu', 'score': 0.9968873858451843, 'entity': 'I-ORG'},
{'word': '##gging', 'score': 0.9329522848129272, 'entity': 'I-ORG'},
{'word': 'Face', 'score': 0.9781811237335205, 'entity': 'I-ORG'},
{'word': 'French', 'score': 0.9981815814971924, 'entity': 'I-MISC'},
{'word': 'New', 'score': 0.9987512826919556, 'entity': 'I-LOC'},
{'word': 'York', 'score': 0.9976728558540344, 'entity': 'I-LOC'}]
As you can see, New York is broken up into two tags.
How can I map Hugging Face's NER Pipeline back to my original text?
Transformers version: 2.7
On the 17th of May, a new pull request (https://github.com/huggingface/transformers/pull/3957) with what you are asking for was merged, so now our life is much easier. You can use it in the pipeline like this:
ner = pipeline('ner', grouped_entities=True)
and your output will be as expected. At the moment you have to install from the master branch since there is no new release yet. You can do it via
pip install git+git://github.com/huggingface/transformers.git#48c3a70b4eaedab1dd9ad49990cfaa4d6cb8f6a0
Unfortunately, as of now (version 2.6, and I think even with 2.7), you cannot do that with the pipeline feature alone, since the __call__ function invoked by the pipeline just returns a list (see the code here). This means you'd have to do a second tokenization step with an "external" tokenizer, which defeats the purpose of the pipelines altogether.
Instead, you can make use of the second example posted in the documentation, just below the sample similar to yours. For the sake of future completeness, here is the code:
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

label_list = [
    "O",       # Outside of a named entity
    "B-MISC",  # Beginning of a miscellaneous entity right after another miscellaneous entity
    "I-MISC",  # Miscellaneous entity
    "B-PER",   # Beginning of a person's name right after another person's name
    "I-PER",   # Person's name
    "B-ORG",   # Beginning of an organisation right after another organisation
    "I-ORG",   # Organisation
    "B-LOC",   # Beginning of a location right after another location
    "I-LOC"    # Location
]

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very " \
           "close to the Manhattan Bridge."

# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")

outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)

print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])
This returns exactly what you are looking for. Note that the CoNLL annotation scheme describes the tagging as follows in its original paper:
Each line contains four fields: the word, its part-of-speech tag, its chunk tag and its named entity tag. Words tagged with O are outside of named entities and the I-XXX tag is used for words inside a named entity of type XXX. Whenever two entities of type XXX are immediately next to each other, the first word of the second entity will be tagged B-XXX in order to show that it starts another entity. The data contains entities of four types: persons (PER), organizations (ORG), locations (LOC) and miscellaneous names (MISC). This tagging scheme is the IOB scheme originally put forward by Ramshaw and Marcus (1995).
Meaning, if you are unhappy with the (still split) entities, you can concatenate all the subsequent I- tagged entities, or B- followed by I- tags. It is not possible in this scheme that two different (immediately neighboring) entities are both tagged with only the I- tags.
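If you decide to do that concatenation yourself, a minimal sketch of such a post-processing step could look like the following; it simply merges consecutive word pieces and same-type I- tags from the (token, label) pairs produced above, and the helper name is made up:
def group_entities(tagged_tokens):
    # Merge consecutive word pieces / I- tags into (text, type) entities
    entities = []
    for token, label in tagged_tokens:
        if label == "O":
            continue
        tag, ent_type = label.split("-", 1)
        if token.startswith("##") and entities:
            # Continuation of the previous word piece
            entities[-1] = (entities[-1][0] + token[2:], entities[-1][1])
        elif tag == "I" and entities and entities[-1][1] == ent_type:
            # The same entity type continues: extend the previous span
            entities[-1] = (entities[-1][0] + " " + token, ent_type)
        else:
            # A B- tag or a new entity type starts a new span
            entities.append((token, ent_type))
    return entities

tagged = [(token, label_list[prediction])
          for token, prediction in zip(tokens, predictions[0].tolist())]
print(group_entities(tagged))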
If you're looking at this in 2022:
the grouped_entities keyword is now deprecated
you should use aggregation_strategy instead: the default is None; you're looking for simple, first, average, or max -> see the documentation of the AggregationStrategy class
from transformers import pipeline
import pandas as pd
text = 'Hugging Face is a French company based in New York.'
tagger = pipeline(task='ner', aggregation_strategy='simple')
named_ents = tagger(text)
pd.DataFrame(named_ents)
[{'entity_group': 'ORG',
'score': 0.96934015,
'word': 'Hugging Face',
'start': 0,
'end': 12},
{'entity_group': 'MISC',
'score': 0.9981816,
'word': 'French',
'start': 18,
'end': 24},
{'entity_group': 'LOC',
'score': 0.9982121,
'word': 'New York',
'start': 42,
'end': 50}]

Python Google Map API gives OSM_type that does not match documentation

I was using Google's Map API to extract address information for latitude and longitude coordinates from within Python. As shown in the code below, there is an attribute called osm_type that I believe stands for "open street maps type". But when I google for documentation, I find just "type", and none of the lists I am finding include "way" as one of the expected values for type. Does anyone know where I can get a list of valid values for osm_type?
Code:
from geopy.geocoders import Nominatim
geolocator = Nominatim()
from geopy.exc import GeocoderTimedOut
import time
lat = 43.2335233435383
lon = -70.9108497973799
location = geolocator.reverse(str(lat) + ", " + str(lon), timeout=10)
print(location.raw)
Output:
{'address': {'city': 'Dover',
'country': 'United States of America',
'country_code': 'us',
'county': 'Strafford County',
'house_number': '155',
'postcode': '03820',
'road': 'Long Hill Road',
'state': 'New Hampshire'},
'boundingbox': ['43.233423343538',
'43.233623343538',
'-70.91094979738',
'-70.91074979738'],
'display_name': '155, Long Hill Road, Dover, Strafford County, New Hampshire, 03820, United States of America',
'lat': '43.2335233435383',
'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
'lon': '-70.9108497973799',
'osm_id': '18868744',
'osm_type': 'way',
'place_id': '201786637'}
As far as I can tell, the specific code you are using does not support returning the type. However, running a few spot checks with your code using other latitudes and longitudes, I found that a McDonald's in NY produced an osm_type of "node" instead of "way", and "way" comes up a lot. I got "node" with: Latitude=40.730949, Longitude=-74.001083.
If, instead of checking the Google Maps documentation, you look at the OpenStreetMap documentation, you will see that this field most likely defines the type of data rather than the type of address. These types are defined at the given URL:
node
way
relation
That should answer what "way" means for you in this context.
Accessing the type as defined for the Google Maps API will probably require a different piece of code (if it's possible at all). The command you are using does not appear to have that field in its output.
This page on location searches would seem to indicate that "type" is a field used in a location search to narrow down what you are looking for. These types do not appear to be part of the API call you are using.
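For reference, here is a minimal sketch of reading osm_type straight out of the raw Nominatim response; it mirrors the code from the question, using the Dover coordinates and the New York coordinates discussed above:
from geopy.geocoders import Nominatim
geolocator = Nominatim()
# (lat, lon) pairs: the house in Dover, NH and the McDonald's in New York
for lat, lon in [(43.2335233435383, -70.9108497973799), (40.730949, -74.001083)]:
    location = geolocator.reverse(str(lat) + ", " + str(lon), timeout=10)
    print(location.raw.get('osm_type'), location.raw.get('osm_id'))  # e.g. 'way' ..., 'node' ...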

How to represent Objects in Functional Python?

I don't have experience with Python.
This is something that has been bothering me for a few days. I tried to find an answer, but unfortunately I didn't succeed. Since I am reading data from a file into memory, I thought of a way to represent it as:
students [{id: [firstname,lastname,password]}, {id: [firstname,lastname,password]}]
How does this approach seem to you? And how could I iterate through this and check for login credentials? It's Python 3.
You define students as an array of objects (each is a dict): good
You define each student's attributes as an array: not so good.
That array is the value of a property with the name id: confusing.
The more natural way to store such data is like this:
students = [{
    'id': 1,
    'firstname': 'Mary',
    'lastname': 'Jones',
    'password': 'secret'
}, {
    'id': 2,
    'firstname': 'Liam',
    'lastname': 'McEnzie',
    'password': '9#4t&$X'
}]
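To answer the second part of the question, with that structure a credential check becomes a simple lookup; here is a rough sketch (the function name is made up, and storing plain-text passwords is only for illustration):
def check_login(students, student_id, password):
    # Find the first student with a matching id, or None if there is none
    student = next((s for s in students if s['id'] == student_id), None)
    return student is not None and student['password'] == password

print(check_login(students, 1, 'secret'))  # True
print(check_login(students, 2, 'wrong'))   # False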
If you are just wanting to bind data fields together given a regular naming scheme, a namedtuple will work great. It has the added benefit of being immutable, if you are trying to maintain a purely functional approach:
from collections import namedtuple
# namedtuple is actually a class-factory:
Student = namedtuple("Students", ['id','firstname','lastname','password'])
# notice i'm using a plain tuple as a container - again, we want immutability
# for a functional approach
students = (
    Student(id=1, firstname='Rick', lastname='Sanchez', password='rickkyticky'),
    Student(id=2, firstname='Morty', lastname='Smith', password='password'),
    Student(id=3, firstname='Summer', lastname='Smith', password='tinkles')
)
You can iterate over this like you would almost anything else in Python, using for-loops or, if you'd rather be more functional, using comprehensions (in this case, really a generator-expression fed into a tuple constructor):
>>> for s in students:
... print(s.password)
...
rickkyticky
password
tinkles
>>> passwords = tuple(s.password for s in students)
>>> passwords
('rickkyticky', 'password', 'tinkles')
>>>
namedtuple is faster and significantly more memory efficient than a dict.

Adding JSON data together

Let's say I have 2 JSON objects (dictionaries):
first_dict={"features": [{"attributes": {"id": "KA88457","name": "Steven", "phone":"+6590876589"}}]}
second_dict={"features": [{"attributes": {"id": "KA88333","name": "John", "phone":"+6590723456"}}]}
I want to add them so that I have something like this:
{"features": [{"attributes": {"id": "KA88457","name": "Steven", "phone":"+6590876589"}}], 'features': [{'attributes': {'id': 'KA88333', 'name': 'John', 'phone': '+6590723456'}}]}
If I use first_dict.update(second_dict), I get the following. How do I fix that?
{'features': [{'attributes': {'id': 'KA88333', 'name': 'John', 'phone': '+6590723456'}}]}
Since both dicts have the same key, "features", you'll need to rename one of the keys and add it and its values into one of the dictionaries. This is one way to avoid your merge conflict. E.g.:
second_dict={"features": [{"attributes": {"id": "KA88333","name": "John", "phone":"+6590723456"}}]}
first_dict={"features": [{"attributes": {"id": "KA88457","name": "Steven", "phone":"+6590876589"}}]}
temp_var = second_dict['features']
first_dict['features2'] = temp_var
Merged data in first_dict:
{'features': [{'attributes': {'id': 'KA88457', 'name': 'Steven', 'phone': '+6590876589'}}], 'features2': [{'attributes': {'id': 'KA88333', 'name': 'John', 'phone': '+6590723456'}}]}
According to RFC 7159, you cannot have duplicate names inside objects:
An object structure is represented as a pair of curly brackets
surrounding zero or more name/value pairs (or members). A name is a
string. A single colon comes after each name, separating the name
from the value. A single comma separates a value from a following
name. The names within an object SHOULD be unique.
However, the original JSON standard, ECMA-404, doesn't say anything about duplicate names, and most JSON libraries (including Python 3's json module) don't support this feature.
Another reason you can't do this is that you're trying to have two different values for a single key in your dictionary (which is basically a hash map).
If you really need this, you have to write your own serializer or maybe find a JSON library for Python that supports it.
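If you do go down that road, here is a crude sketch of what hand-rolling such a serializer could look like; the helper name is made up, and note that the resulting string cannot be parsed back into a dict with both keys by the standard json module:
import json

def dumps_with_duplicate_keys(*dicts):
    # Serialize each dict, strip its outer braces, and splice the members together
    members = ", ".join(json.dumps(d)[1:-1] for d in dicts)
    return "{" + members + "}"

print(dumps_with_duplicate_keys(first_dict, second_dict))
# '{"features": [...], "features": [...]}'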

Resources