Accessing exchange data in Brightway2 database object - brightway

I have a question about accessing exchange data using the Brightway database object. Suppose I have imported Brightway2 as bw and am in a project where there is an LCI database:
[In] bw.databases
[Out] Brightway2 databases metadata with 2 objects:
biosphere3
ecoinvent 3_2 APOS
I can get information on activities:
[In] ei32 = bw.Database('ecoinvent 3_2 APOS')
someActivity = ei32.get('00c71af952a1248552fc2bfa727bb6b5')
someActivity
[Out] 'market for transport, freight, inland waterways, barge with reefer, cooling' (ton kilometer, GLO, None)
It seems I have access to the following data:
[In] list(someActivity)
[Out] ['database',
'production amount',
'name',
'reference product',
'classifications',
'activity',
'location',
'filename',
'parameters',
'code',
'authors',
'paramters',
'comment',
'flow',
'type',
'unit',
'activity type']
Notice that there is no 'exchanges'. In fact, while this works:
[In] someActivity.get('location')
[Out] 'GLO'
or, equivalently:
[In] someActivity['location']
[Out] 'GLO'
Changing 'location' to 'exchanges' yields nothing (first syntax) or a KeyError (second syntax).
And yet I have seen this syntax in Brightway code:
exchanges = ds.get('exchanges', [])
For now, my only way of accessing exchange data is to .load() the database (which loads the entire database into a dictionary), create an activity key, and access the exchanges as follows:
[In] ei32Loaded = ei32.load()
activities = sorted(ei32Loaded.keys())
ei32Loaded[activities[42]]['exchanges']
[Out] [{'activity': '0fb6238a-e252-4d19-a417-c569ce5e2729', 'amount': xx,
...}]
It works fine, but I know the exchange data is in the database, so I'm sure there must exist a method to get to it without loading. At the very least, I'd like to know why someActivity.get('exchanges', []) does not work for me.
Thanks!

Brightway2 uses a SQLite database to store LCI data (at least most of the time - other backends are possible, but SQLite is the default option). In the SQLite database, there are two tables, ActivityDataset and ExchangeDataset. An ActivityDataset describes an object in the supply chain graph (not strictly limited to transforming activities), and ExchangeDataset describes a numerical relationship between two ActivityDatasets. See their schema definition.
When you use Database('foo').get('bar') or get_activity(('foo', 'bar')), you create an Activity, which is a proxy object for interacting with the database. The Activity object exposes a number of useful methods and handles some "magic" - for example, updating an ActivityDataset should also update the search index, which is a completely separate database.
Instantiating an Activity loads the data that is in the ActivityDataset row. There are no real requirements or limits on what can be included, but one thing that is definitely not included is exchanges. Exchanges are loaded lazily, i.e. only when needed.
Some of the useful methods that Activity includes are exchange filters. For example, .technosphere() returns an iterator over all exchanges for which this Activity is the output and whose exchange type is technosphere - in LCA parlance, the technosphere inputs of the activity. Similarly, .upstream() exposes the exchanges which consume this activity. Activity also includes:
.exchanges(): All exchanges for which this activity is an output.
.biosphere(): All exchanges for which this activity is an output, and are of the type biosphere.
.production(): All exchanges for which this activity is an output, and are of the type production.
All these methods are iterators - they won't retrieve data from the database until they are iterated over. These are also methods, not data attributes of the activity, i.e. they are not accessed like foo['technosphere'], but rather foo.technosphere().
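For example, here is a minimal sketch (not from the original answer, reusing the database name and activity code from the question) of how these iterators are consumed:
import brightway2 as bw

ei32 = bw.Database('ecoinvent 3_2 APOS')
someActivity = ei32.get('00c71af952a1248552fc2bfa727bb6b5')

# Nothing is fetched from the ExchangeDataset table until the iterator is consumed
for exc in someActivity.technosphere():
    print(exc.input, exc['amount'])

n_exchanges = len(list(someActivity.exchanges()))  # all exchanges with this activity as their output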
Exchange types are used to determine where and in which matrices the numeric exchange values are to be placed during an LCA computation.
The referenced case where exchanges = ds.get('exchanges', []) appears in the IO library, where data is being imported and processed but is not yet linked by ExchangeDatasets or stored in the SQLite database at all - when importing and processing inventory data, the data is a plain Python dictionary, not a fancy combination of Activity and Exchange objects.
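To make the contrast concrete, a dataset at that import stage is just a nested dict (an illustrative, made-up example), so .get('exchanges', []) simply reads a key that is already present:
ds = {
    'name': 'some imported activity',
    'unit': 'kilogram',
    'location': 'GLO',
    'exchanges': [
        {'name': 'some input', 'amount': 1.0, 'type': 'technosphere'},
    ],
}
exchanges = ds.get('exchanges', [])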

Related

How to get all transaction data from the entire Ethereum network using web3py

I'm trying to run some analysis on cryptocurrency (e.g. Bitcoin, Ethereum) data but am having trouble finding data sources. For example, I'd like to collect transaction data such as input address, output address, transaction time, transaction amount, etc. for Ethereum.
I've found that I can access Ethereum data with web3py, but is it possible to get data for ALL transactions that have been made recently in the entire Ethereum network, not just the transactions connected to my own wallet (address)? For example, I'd like to get data on all Ethereum transactions that occurred today.
Also, must I have my own Ethereum wallet (address) in order to access the data with web3py? I wonder whether I need a specific address as a starting point, or whether I can just scrape the data without creating a wallet.
Thanks.
For example, I'd like to collect transaction data such as input address, output address, transaction time, transaction amount, etc. for Ethereum.
You can iterate over all blocks and transactions using the web3.eth.get_block call. You will, however, need to parse the transaction content yourself.
To access all the data, it is recommended that you run your own node to have the maximum network bandwidth for JSON-RPC calls.
Also, must I have my own Ethereum wallet (address) in order to access the data with web3py?
An address is just derived from a random number, and you do not need to generate one.
The following code should help you access the most recent blocks assuming you already have an Infura Project ID:
from web3 import Web3
import pandas as pd

ethereum_mainnet_endpoint = f'https://mainnet.infura.io/v3/{INFURA_PROJ_ID}'
web3 = Web3(Web3.HTTPProvider(ethereum_mainnet_endpoint))
assert web3.isConnected()
# ethBlocks was not defined in the original post; as one assumption, pull the latest 10 blocks:
ethBlocks = [dict(web3.eth.get_block(web3.eth.block_number - i)) for i in range(10)]
eth_block_df = pd.DataFrame(ethBlocks).set_index('number')
Once you've accessed the most recent transactions, you can loop through each of the transaction hashes and create a new dataset with it:
def decoder(txns):
    # Convert the HexBytes transaction hashes of a block into '0x...' hex strings
    block = []
    for i in txns:
        hash = '0x' + bytes(i).hex()
        block.append(hash)
    return block

eth_block_df['transactions_0x'] = eth_block_df['transactions'].apply(lambda x: decoder(x))

def transaction_decoder(hashes):
    """
    Generates a list of ETH transactions per row
    """
    txn_dets = []
    for i in hashes:
        txn = web3.eth.get_transaction(str(i))
        txn_dets.append(dict(txn))
    return txn_dets

def transaction_df(series):
    """
    Converts a list of lists of Ethereum transactions into a single DataFrame.
    """
    obj = series.apply(transaction_decoder)
    main = []
    for row in obj:
        for txn in row:
            main.append(txn)
    eth_txns_df = pd.DataFrame(main, columns=main[0].keys())
    return eth_txns_df
eth_txns_df = transaction_df(eth_block_df['transactions_0x'])
print(eth_txns_df.shape)
I used this code recently for a project I'm still working on, so it's probably not the most efficient or cleanest solution, but it gets the job done.
Hope that helps!

unit conversion when importing datasets from excel files in brightway

I am trying to create some activities using the excel importer. My activity has a technosphere flow of 0.4584 MWh of Production of electricity by gas from the previously imported EXIOBASE 3.3.17 hybrid database. The activity of Production of electricity by gas is in TJ in the database.
I ran the import without problems, something like:
ei = ExcelImporter(path_to_my_excel)
ei.apply_strategies()
ei.match_database(fields = ['name','location'])
ei.match_database(db_name = 'EXIOBASE 3.3.17 hybrid', fields = ['name','location'])
ei.match_database(db_name = 'biosphere3', fields = ['name','categories'])
ei.write_project_parameters()
ei.write_database(activate_parameters=True)
but if I iterate over the technosphere flows of my activity consuming natural gas, it says it uses 0.4584 TJ of Production of electricity by gas (the same unit as the activity of production of electricity by gas, but the same amount I entered in MWh). I was kind of hoping for some unit conversion under the hood, perhaps using bw2io.units.UNITS_NORMALIZATION.
Should we always express the units of exchanges in the same units as the activity they link to? Is there an existing strategy to do the unit conversion for us? Thanks!
This line: ei.match_database(db_name = 'EXIOBASE 3.3.17 hybrid', fields = ['name','location']) is telling the program to match, but not to match based on units.
You can get the desired result with a migration; see an example here (in the section Fixing units for passenger cars).
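For illustration only - the migration name, the field values, and the 1 MWh = 0.0036 TJ factor below are assumptions based on this question, not taken from the linked example - a unit-fixing migration could look roughly like this:
from bw2io import Migration

migration_data = {
    'fields': ['name', 'unit'],
    'data': [
        (
            ('Production of electricity by gas', 'megawatt hour'),  # how the exchange appears in the Excel file
            {'unit': 'terajoule', 'multiplier': 0.0036},            # 1 MWh = 0.0036 TJ; amounts get rescaled
        ),
    ],
}
Migration('mwh-to-tj-electricity').write(migration_data, description='Convert MWh electricity exchanges to TJ')
ei.migrate('mwh-to-tj-electricity')  # run before ei.write_database()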

Python Data saving performance

I've got a bottleneck with saving data, and I would appreciate some senior advice.
I have an API from which I receive financial data that looks like this: GBPUSD 2020-01-01 00:00:01.001 1.30256 1.30250. My target is to write this data directly into the database as fast as possible.
Inputs:
Python 3.8
PostgreSQL 12
Redis Queue (Linux)
SQLAlchemy
The incoming data structure, as shown above, comes as one dictionary: {symbol: {datetime: (price1, price2)}}. All of the data comes in as strings.
The API streams 29 symbols, so I can receive, for example, from 30 to 60+ values of different symbols in just one second.
How it works now:
I receive a new value in the dictionary;
All new values of each symbol, as they come in, are stored in one dict variable - data_dict;
Next I look up that dictionary by symbol key and last value, and send the data to the Redis Queue - data_dict[symbol][last_value].enqueue(save_record, args=(datetime, price1, price2)). Up to this point everything works fine and fast.
When it reaches the Redis worker, there is the save_record function:
"
from sqlalchemy import create_engine, MetaData, Table

def save_record(Datetime, price1, price2, Instr, adf):
    # Parameters
    # ----------
    # Datetime : 'string' : Datetime value
    # price1   : 'string' : Bid value
    # price2   : 'string' : Ask value
    # Instr    : 'string' : Symbol (table) to save into
    # adf      : 'string' : Credentials for the database engine
    # -------
    # result   :          : Execute save command to the database
    engine = create_engine(adf)
    meta = MetaData(bind=engine, reflect=True)
    table_obj = Table(Instr, meta)
    insert_state = table_obj.insert().values(Datetime=Datetime, price1=price1, price2=price2)
    with engine.connect() as conn:
        conn.execute(insert_state)
When the last line of the function executes, it takes 0.5 to 1 second to write that row into the database:
12:49:23 default: DT.save_record('2020-00-00 00:00:01.414538', 1.33085, 1.33107, 'USDCAD', 'postgresql cred') (job_id_1)
12:49:24 default: Job OK (job_id_1)
12:49:24 default: DT.save_record('2020-00-00 00:00:01.422541', 1.56182, 1.56213, 'EURCAD', 'postgresql cred') (job_id_2)
12:49:25 default: Job OK (job_id_2)
The queued jobs that insert each row directly into the database are the bottleneck, because I can insert only 1-2 values per second while I can receive over 60 values per second. If I run this saving, it starts to build up a huge queue (the maximum I got was 17,000 records in the queue after 1 hour of listening to the API), and the queue keeps growing.
I'm currently using only 1 queue and 17 workers. This makes my PC's CPU run at 100%.
So the question is how to optimize this process and not create a huge queue. Maybe save some sequence of records, e.g. as JSON, and then insert it into the DB, or store incoming data in separate variables?
Sorry if something is unclear - ask and I'll answer.
--UPD--
So here's a short review of some experiments:
Move engine and meta out of the function:
Due to my architecture, the API application is located on Windows 10 and the Redis Queue on Linux. There was an issue with moving meta and engine out of the function: it returns a TypeError (this does not depend on the OS); there is a little info about it here.
Insert multiple rows in a batch:
This approach seemed to be the simplest and easiest - and it is! Basically, I just created a dictionary, data_dict = {'data_pack': []}, to start storing incoming values there. Then I check whether more than 20 values per symbol have already been written; if so, I send that batch to the Redis Queue, and it takes 1.5 seconds to write it to the database. Then I delete the sent records from data_dict, and the process continues. So thanks to Mike Organek for the good advice.
This approach is quite sufficient for my purposes, and at the same time I can say that this tech stack provides really good flexibility!
Every time you call save_record you re-create the engine and (reflected) meta objects, both of which are expensive operations. Running your sample code as-is gave me a throughput of
20 rows inserted in 4.9 seconds
Simply moving the engine = and meta = statements outside of the save_record function (and thereby only calling them once) improved throughput to
20 rows inserted in 0.3 seconds
Additional note: It appears that you are storing the values for each symbol in a separate table, i.e. 'GBPUSD' data in a table named GBPUSD, 'EURCAD' data in a table named EURCAD, etc. That is a "red flag" suggesting bad database design. You should store all of the data in a single table with a column for the symbol.
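As a rough sketch of both suggestions combined (the connection string, table name, and column layout are assumptions, not from the original post), it could look something like this:
from sqlalchemy import create_engine, MetaData, Table

# Created once at module import and reused by every job, instead of per call
engine = create_engine('postgresql://user:password@localhost/marketdata')  # hypothetical credentials
meta = MetaData(bind=engine, reflect=True)
quotes = Table('quotes', meta)  # hypothetical single table with a 'symbol' column

def save_records(rows):
    # rows: list of dicts like {'symbol': 'GBPUSD', 'Datetime': ..., 'price1': ..., 'price2': ...}
    with engine.connect() as conn:
        conn.execute(quotes.insert(), rows)  # one batched INSERT instead of one per row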

How to do always necessary pre processing / cleaning with intake?

I have a use case where:
I always need to apply a pre-processing step to the data before being able to use it. (Because the naming etc. don't follow community conventions enforced by some software further down the processing chain.)
I cannot change the raw data. (Because it might be in a repo I don't control, or because it's too big to duplicate, ...)
If I aim at providing a user with the easiest and most transparent way of obtaining the data in a pre-processed way, I can see two ways of doing this:
1. Load unprocessed data with intake and apply the pre-processing immediately:
import intake
from my_tools import pre_process
cat = intake.open_catalog('...')
raw_df = cat.some_data.read()
df = pre_process(raw_df)
2. Apply the pre-processing step with the .read() call.
Catalog:
sources:
some_data:
args:
urlpath: "/path/to/some_raw_data.csv"
description: "Some data (already preprocessed)"
driver: csv
preprocess: my_tools.pre_process
And:
import intake
cat = intake.open_catalog('...')
df = cat.some_data.read()
Option 2. is not possible in Intake right now; Intake was designed to be "load" rather than "process", so we've avoided the pipeline idea for now, but we might come back to it in the future.
However, you have a couple of options within Intake that you could consider alongside Option 1., above:
make your own driver, which implements the load and any processing exactly how you like. Writing drivers is pretty easy, and can involve arbitrary code/complexity
write an alias-type driver, which takes the output of an entry in the same catalog and does something to it. See the docs and code for pointers.
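For instance, a bare-bones custom driver that wraps pandas and applies pre_process before returning the data might look roughly like this (a sketch only; the class name, driver name, and schema details are assumptions, not from the Intake docs):
import pandas as pd
from intake.source.base import DataSource, Schema
from my_tools import pre_process

class PreprocessedCSV(DataSource):
    """Load a CSV with pandas and apply pre_process before handing it to the user."""
    container = 'dataframe'
    name = 'preprocessed_csv'  # the driver name you would reference from the catalog
    version = '0.0.1'
    partition_access = False

    def __init__(self, urlpath, metadata=None):
        super().__init__(metadata=metadata)
        self._urlpath = urlpath
        self._df = None

    def _get_schema(self):
        return Schema(datashape=None, dtype=None, shape=None,
                      npartitions=1, extra_metadata={})

    def _get_partition(self, i):
        if self._df is None:
            self._df = pre_process(pd.read_csv(self._urlpath))
        return self._df

    def read(self):
        return self._get_partition(0)

    def _close(self):
        self._df = None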

Is there a performance difference between `dedupe.match(generator=True)` and `dedupe.matchBlocks()` for large datasets?

I'm preparing to run dedupe on a fairly large dataset (400,000 rows) with Python. In the documentation for the DedupeMatching class, there are both the match and matchBlocks functions. For match, the docs suggest only using it on small to moderately sized datasets. From looking through the code, I can't work out how matchBlocks in tandem with block_data performs better than just match on larger datasets when generator=True in match.
I've tried running both methods on a small-ish dataset (10,000 entities) and didn't notice a difference.
data_d = {'id1': {'name': 'George Bush', 'address': '123 main st.'},
          'id2': {'name': 'Bill Clinton', 'address': '1600 pennsylvania ave.'},
          ...
          'id10000': {...}}
then either method A:
blocks = deduper._blockData(data_d)
clustered_dupes = deduper.matchBlocks(blocks, threshold=threshold)
or method B
clustered_dupes = deduper.match(data_d, threshold=threshold, generator=True)
Then the computationally intensive part is running a for-loop over the clustered_dupes object:
cluster_membership = {}
for (cluster_id, cluster) in enumerate(clustered_dupes):
    # Do something with each cluster_id, like below
    cluster_membership[cluster_id] = cluster
I expect/wonder if there is a performance difference. If so, could you point me to the code that shows that and explain why?
There is no difference between calling _blockData and then matchBlocks versus just calling match. Indeed, if you look at the code, you'll see that match calls those two methods.
The reason why matchBlocks is exposed is that _blockData can take a lot of memory, and you may want to generate the blocks another way, such as taking advantage of a relational database.
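In other words, the relationship is roughly this (a sketch of what is described above, not the actual dedupe source code):
def match(deduper, data_d, threshold=0.5):
    # match() is essentially a convenience wrapper around the two-step version
    blocks = deduper._blockData(data_d)   # may hold all the blocks in memory at once
    return deduper.matchBlocks(blocks, threshold=threshold)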
