Store/access activity metadata in a datapackage created with bw2data (Brightway)

When I define a datapackage with bw2data, how can I access metadata associated with the activities? Let's say I create a simple database:
import bw2data as bd

# biosphere
bio_db = bd.Database("mini_biosphere")
bio_db.register()

co2 = bio_db.new_activity(
    code='CO2',
    name='carbon dioxide',
    categories=('air',),
    type='emission',
    unit='kg',
)
co2.save()

ch4 = bio_db.new_activity(
    code='CH4',
    name='methane',
    categories=('air',),
    type='emission',
    unit='kg',
)
ch4.save()
# technosphere
a_key = ("testdb", "a")
b_key = ("testdb", "b")

act_a_def = {
    'name': 'a',
    'unit': 'kilogram',
    'comment': 'just saying',
    'exchanges': [
        {"input": co2.key, "type": "biosphere", "amount": 10},
        {"input": a_key, "output": a_key, 'type': 'production', 'amount': 1},
        {"input": b_key, "output": a_key, 'type': 'substitution', 'amount': 1},
    ],
}

act_b_def = {
    'name': 'b',
    'unit': 'kilogram',
    'comment': 'it depends',
    'exchanges': [
        {"input": b_key, "output": a_key, 'type': 'production', 'amount': 1},
        {"input": ch4.key, "type": "biosphere", "amount": 1},
    ],
}

db = bd.Database("testdb")
db.write(
    {
        a_key: act_a_def,
        b_key: act_b_def,
    }
)
I can read the metadata of the associated datapackage with db.datapackage().metadata and even perform a calculation (as done here)... but data such as the names of the activities or the comments seem to be missing. Where are they stored, or what needs to be done to store them?

By default, datapackages don't have metadata of this nature - their whole point is to just have the numeric values for matrix creation. You would normally query the Database object to get such metadata.
However, you can ask for this metadata to be written as well. You will need to process the database again, with (in your example) db.process(csv=True). In this case, you will get an additional resource, <database name>_activity_metadata, which is loaded as a Pandas DataFrame. In the example, this would be retrieved with dp.get_resource('testdb_activity_metadata')[0].
Note that there was a bug, so this functionality requires bw2data version 4.0.dev16 or higher (released 2022-06-09).
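For example, a minimal sketch that pulls the activity metadata back out (the resource name follows the <database name>_activity_metadata pattern described above; the exact columns may vary by bw2data version):

import bw2data as bd

db = bd.Database("testdb")
db.process(csv=True)  # re-process so activity metadata is written alongside the numeric resources

dp = db.datapackage()
# the metadata resource is named <database name>_activity_metadata
meta_df = dp.get_resource("testdb_activity_metadata")[0]  # a pandas DataFrame
print(meta_df.head())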

Related

umongo, pymongo, Python 3: how do I load data from reference field(s)?

I'm trying to understand how and why it's so hard to load my referenced data in umongo/pymongo.
@instance.register
class MyEntity(Document):
    account = fields.ReferenceField('Account', required=True)
    date = fields.DateTimeField(
        default=lambda: datetime.utcnow(),
        allow_none=False
    )
    positions = fields.ListField(fields.ReferenceField('Position'))
    targets = fields.ListField(fields.ReferenceField('Target'))

    class Meta:
        collection = db.myentity
When I retrieve this with:
def find_all(self):
    items = self._repo.find_all(
        {
            'user_id': self._user_id
        }
    )
    return items
and then dump it like so:
from bson.json_util import dumps

all_items = []
for item in items:
    all_items.append(item.dump())
return dumps(all_items)
I get the following JSON object:
[
  {
    "account": "5e990db75f22b6b45d3ce814",
    "positions": [
      "5e9a594373e07613b358bdbb",
      "5e9a594373e07613b358bdbe",
      "5e9a594373e07613b358bdc1"
    ],
    "date": "2020-04-18T01:34:59.919000+00:00",
    "id": "5e9a594373e07613b358bdcb",
    "targets": [
      "5e9a594373e07613b358bdc4",
      "5e9a594373e07613b358bdc7",
      "5e9a594373e07613b358bdca"
    ]
  }
]
and without dump
<object Document models.myentity.schema.MyEntity({
    'targets': <object umongo.data_objects.List([
        <object umongo.frameworks.pymongo.PyMongoReference(document=Target, pk=ObjectId('5e9a594373e07613b358bdc4'))>,
        <object umongo.frameworks.pymongo.PyMongoReference(document=Target, pk=ObjectId('5e9a594373e07613b358bdc7'))>,
        <object umongo.frameworks.pymongo.PyMongoReference(document=Target, pk=ObjectId('5e9a594373e07613b358bdca'))>
    ])>,
    'id': ObjectId('5e9a594373e07613b358bdcb'),
    'positions': <object umongo.data_objects.List([
        <object umongo.frameworks.pymongo.PyMongoReference(document=Position, pk=ObjectId('5e9a594373e07613b358bdbb'))>,
        <object umongo.frameworks.pymongo.PyMongoReference(document=Position, pk=ObjectId('5e9a594373e07613b358bdbe'))>,
        <object umongo.frameworks.pymongo.PyMongoReference(document=Position, pk=ObjectId('5e9a594373e07613b358bdc1'))>
    ])>,
    'date': datetime.datetime(2020, 4, 18, 1, 34, 59, 919000),
    'account': <object umongo.frameworks.pymongo.PyMongoReference(document=Account, pk=ObjectId('5e990db75f22b6b45d3ce814'))>
})>
I'm really struggling with how to dereference this. I'd like all loaded fields to be dereferenced recursively if I specify them in the umongo schema. Is this not in the umongo API?
For example, what if there's a reference field in 'target' as well? I understand this can be expensive on the DB, but is there some way to specify this in the schema definition itself, i.e. in the Meta class, that I always want the full, dereferenced object for a particular field?
The fact that I'm finding very little documentation / commentary on this, that it's not even mentioned in the umongo docs, and that the solutions I've found for other ODMs (like MongoEngine) painfully write out recursive, manual functions per field / per query suggests to me there's a reason this is not a popular question. Might it be an anti-pattern? If so, why?
I'm not that new to MongoDB, but I am new to the Python / Mongo stack. I feel like I'm missing something fundamental here.
EDIT: So right after posting, I did find this issue:
https://github.com/Scille/umongo/issues/42
which provides a way forward.
Is this still the best approach? I'm still trying to understand why this is treated like an edge case.
EDIT 2: progress
class MyEntity(Document):
    account = fields.ReferenceField('Account', required=True, dump=lambda: 'fetch_account')
    date = fields.DateTimeField(
        default=lambda: datetime.utcnow(),
        allow_none=False
    )
    # trade = fields.DictField()
    positions = fields.ListField(fields.ReferenceField('Position'))
    targets = fields.ListField(fields.ReferenceField('Target'))

    class Meta:
        collection = db.trade

    @property
    def fetch_account(self):
        return self.account.fetch()
So with the newly defined property decorator, I can do:
items = MyEntityService().find_all()
allItems = []
for item in items:
    account = item.fetch_account
    log(account.dump())
    allItems.append(item.dump())
When I dump account, all is good. But I don't want to have to do this explicitly/manually. It still means I have to recursively unpack and then repack each referenced doc, and any child references, each time I make a query. It also means the schema's source of truth is no longer contained just in the umongo class, i.e. if a field changes, I'll have to refactor every query that uses that field.
I'm still looking for a way to decorate/flag this on the schema itself.
e.g.
account = fields.ReferenceField('Account', required=True, dump=lambda: 'fetch_account')
I just made up dump=lambda: 'fetch_account'; it doesn't do anything, but that's more or less the pattern I'm going for. I'm not sure if this is possible (or even smart: pointers to why I'm totally wrong in my approach are welcome)...
EDIT 3:
So this is where I've landed:
@property
def fetch_account(self):
    return self.account.fetch().dump()

@property
def fetch_targets(self):
    targets_list = []
    for target in self.targets:
        doc = target.fetch().dump()
        targets_list.append(doc)
    return targets_list

@property
def fetch_positions(self):
    positions_list = []
    for position in self.positions:
        doc = position.fetch().dump()
        positions_list.append(doc)
    return positions_list
and then to access:
allItems = []
for item in items:
    account = item.fetch_account
    positions = item.fetch_positions
    targets = item.fetch_targets
    item = item.dump()
    item['account'] = account
    item['positions'] = positions
    item['targets'] = targets
    # del item['targets']
    allItems.append(item)
I could clean it up / abstract it some, but I don't see how I could really reduce the general verbosity at this point. It does seem to give me the result I'm looking for, though:
[
  {
    "date": "2020-04-18T01:34:59.919000+00:00",
    "targets": [
      {
        "con_id": 331641614,
        "value": 106,
        "date": "2020-04-18T01:34:59.834000+00:00",
        "account": "5e990db75f22b6b45d3ce814",
        "id": "5e9a594373e07613b358bdc4"
      },
      {
        "con_id": 303019419,
        "value": 0,
        "date": "2020-04-18T01:34:59.867000+00:00",
        "account": "5e990db75f22b6b45d3ce814",
        "id": "5e9a594373e07613b358bdc7"
      },
      {
        "con_id": 15547841,
        "value": 9,
        "date": "2020-04-18T01:34:59.912000+00:00",
        "account": "5e990db75f22b6b45d3ce814",
        "id": "5e9a594373e07613b358bdca"
      }
    ],
    "account": {
      "user_name": "hello",
      "account_type": "LIVE",
      "id": "5e990db75f22b6b45d3ce814",
      "user_id": "U3621607"
    },
    "positions": [
      {
        "con_id": 331641614,
        "value": 104,
        "date": "2020-04-18T01:34:59.728000+00:00",
        "account": "5e990db75f22b6b45d3ce814",
        "id": "5e9a594373e07613b358bdbb"
      },
      {
        "con_id": 303019419,
        "value": 0,
        "date": "2020-04-18T01:34:59.764000+00:00",
        "account": "5e990db75f22b6b45d3ce814",
        "id": "5e9a594373e07613b358bdbe"
      },
      {
        "con_id": 15547841,
        "value": 8,
        "date": "2020-04-18T01:34:59.797000+00:00",
        "account": "5e990db75f22b6b45d3ce814",
        "id": "5e9a594373e07613b358bdc1"
      }
    ],
    "id": "5e9a594373e07613b358bdcb"
  }
]
It seems like this is a design choice in umongo.
In Mongoid for example (the Ruby ODM for MongoDB), when an object is referenced it is fetched from the database automatically through associations as needed.
As an aside, in an ODM the features of "define a field structure" and "seamlessly access data through application objects" are quite separate. For example, my experience with Hibernate in Java suggests it is similar to what you are discovering with umongo - once the data is loaded, it provides a way of accessing the data using an application-defined field structure with types etc., but it doesn't really help with loading the data transparently into application domain objects.
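For what it's worth, the manual fetching in EDIT 3 can be wrapped in a small helper so each query site stays short. This is only a sketch built on the fetch()/dump() calls already used above, not part of the umongo API; the field names to dereference are passed in explicitly:

def dump_with_refs(doc, ref_fields=("account",), ref_list_fields=("positions", "targets")):
    data = doc.dump()
    # dereference single-reference fields
    for name in ref_fields:
        ref = getattr(doc, name)
        if ref is not None:
            data[name] = ref.fetch().dump()
    # dereference lists of references
    for name in ref_list_fields:
        refs = getattr(doc, name) or []
        data[name] = [ref.fetch().dump() for ref in refs]
    return data

# usage
all_items = [dump_with_refs(item) for item in MyEntityService().find_all()]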

Azure Data Factory complex JSON source (nested arrays) to Azure Sql Database?

I have a JSON source document that will be uploaded to Azure Blob Storage regularly. The customer wants this input written to Azure SQL Database using Azure Data Factory. The JSON is, however, complex, with many nested arrays, and so far I have not been able to find a way to flatten the document. Perhaps this is not supported/possible?
[
  {
    "ActivityId": 1,
    "Header": {},
    "Body": [
      {
        "1stSubArray": [
          {
            "Id": 456,
            "2ndSubArray": [
              {
                "Id": "abc",
                "Descript": "text",
                "3rdSubArray": [
                  { "Id": "def", "morefields": "text" },
                  { "Id": "ghi", "morefields": "sample" }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
]
I need to flatten it:
ActivityId, Id, Id, Descript, Id, morefields
1, 456, abc, text1, def, text
1, 456, abc, text2, ghi, sample
1, 456, xyz, text3, jkl, textother
1, 456, xyz, text4, mno, moretext
There could be 8+ flat records per ActivityId. Has anyone out there seen this and found a way to resolve it using Azure Data Factory Copy Data?
Azure SQL Database has some capable JSON shredding abilities, including OPENJSON, which shreds JSON, and JSON_VALUE, which returns scalar values from JSON. Since you already have Azure SQL DB in your architecture, it would make sense to use it rather than add additional components.
So why not adopt an ELT pattern where you use Data Factory to insert the JSON into a table in Azure SQL DB and then call a stored procedure task to shred it? Some sample SQL based on your example:
DECLARE @json NVARCHAR(MAX) = '[
  {
    "ActivityId": 1,
    "Header": {},
    "Body": [
      {
        "1stSubArray": [
          {
            "Id": 456,
            "2ndSubArray": [
              {
                "Id": "abc",
                "Descript": "text",
                "3rdSubArray": [
                  { "Id": "def", "morefields": "text" },
                  { "Id": "ghi", "morefields": "sample" }
                ]
              },
              {
                "Id": "xyz",
                "Descript": "text",
                "3rdSubArray": [
                  { "Id": "jkl", "morefields": "textother" },
                  { "Id": "mno", "morefields": "moretext" }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
]'
--SELECT @json j

-- INSERT INTO yourTable ( ...
SELECT
    JSON_VALUE ( j.[value], '$.ActivityId' ) AS ActivityId,
    JSON_VALUE ( a1.[value], '$.Id' ) AS Id1,
    JSON_VALUE ( a2.[value], '$.Id' ) AS Id2,
    JSON_VALUE ( a2.[value], '$.Descript' ) AS Descript,
    JSON_VALUE ( a3.[value], '$.Id' ) AS Id3,
    JSON_VALUE ( a3.[value], '$.morefields' ) AS morefields
FROM OPENJSON( @json ) j
    CROSS APPLY OPENJSON ( j.[value], '$."Body"' ) AS m
    CROSS APPLY OPENJSON ( m.[value], '$."1stSubArray"' ) AS a1
    CROSS APPLY OPENJSON ( a1.[value], '$."2ndSubArray"' ) AS a2
    CROSS APPLY OPENJSON ( a2.[value], '$."3rdSubArray"' ) AS a3;
As you can see, I've used CROSS APPLY to navigate multiple levels; the result is one flat row per 3rdSubArray element, in the shape shown in the question.
In the past, you could follow this blog and my previous case (Loosing data from Source to Sink in Copy Data) to set the Cross-apply nested JSON array option in the Blob Storage dataset. However, that option has since disappeared.
Instead, Collection Reference is applied for array-item schema mapping in the copy activity.
But based on my test, only one array can be flattened in a schema. Multiple arrays can be referenced (returned as one row containing all of the elements in the array), but only one array can have each of its elements returned as individual rows. This is the current limitation of the jsonPath settings.
As a workaround, you can first convert the JSON file with nested objects into a CSV file using Logic App, and then use the CSV file as input for Azure Data Factory. Please refer to this doc to understand how Logic App can be used to convert nested objects in a JSON file to CSV. Surely, you could also make some effort on the SQL database side, such as the stored procedure mentioned in the comment by @GregGalloway.
Just to summarize: unfortunately, "Collection reference" only works one level down in the array structure, which is not suitable for @Emrikol. In the end, @Emrikol abandoned Data Factory and built an app to do the work.

How to iterate through indexed field to add field from another index

I'm rather new to Elasticsearch, so I'm coming here in the hope of finding advice.
I have two indices in Elasticsearch, created from two different CSV files.
Index_1 has this mapping:
{'settings': {
'number_of_shards' : 3
},
'mappings': {
'properties': {
'place': {'type': 'keyword' },
'address': {'type': 'keyword' },
}
}
}
The file contains about 400,000 documents.
Index_2, built from a much smaller file (about 50 documents), has this mapping:
{'settings': {
"number_of_shards" : 1
},
'mappings': {
'properties': {
'place': {'type': 'text' },
'address': {'type': 'keyword' },
}
}
}
The "place" field in index_2 contains all of the unique values of the "place" field in index_1.
In both indices the "address" fields are postcodes of datatype keyword with the structure 0000AZ.
Based on the "place" keyword field in index_1, I want to assign the corresponding "address" value from index_2.
I have tried using the pandas library, but the index_1 file is too large. I have also tried creating modules based on pandas and elasticsearch, quite unsuccessfully, although I believe this is a promising direction. A good solution would stay within the elasticsearch library as much as possible, as these indices will later be used for further analysis.
If I understand correctly, it sounds like you want to use updateByQuery.
The request body should look a little like this:
{
'query': {'term': {'place': "placeToMatch"}},
'script': 'ctx._source.address = "updatedZipCode"'
}
This will update the address field of all documents with the matched place.
EDIT:
So what we want to do is use updateByQuery while iterating over all the documents in index2.
First step: get all the documents from index2; we'll just do this using the basic search feature:
{
  "index": 'index2',
  "size": 100,  // get all documents; once size is over 10,000 you'll have to paginate
  "body": {"query": {"match_all": {}}}
}
Now we iterate over all the results and use updateByQuery for each of the results:
// pseudo-code
doc = response[i]
// update-by-query request
{
  index: 'index1',
  body: {
    'query': {'term': {'place': doc._source.place}},
    'script': `ctx._source.address = "${doc._source.address}"`
  }
}
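If you are driving this from Python (as the question suggests), a rough sketch along the same lines using the official elasticsearch client could look like this; the index and field names come from the question, while the connection details, client version specifics, and page size are assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed connection details

# step 1: fetch the ~50 reference documents from index_2
resp = es.search(index="index_2", body={"query": {"match_all": {}}, "size": 100})

# step 2: for each reference document, update every matching document in index_1
for hit in resp["hits"]["hits"]:
    place = hit["_source"]["place"]
    address = hit["_source"]["address"]
    es.update_by_query(
        index="index_1",
        body={
            "query": {"term": {"place": place}},
            "script": {
                "source": "ctx._source.address = params.address",
                "params": {"address": address},
            },
        },
    )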

ArangoDB AQL Updating Strange Attribute Names

In ArangoDB I have a lookup table as per below:
{
  '49DD3A82-2B49-44F5-A0B2-BD88A32EDB13': 'Human readable value 1',
  'B015E210-27BE-4AA7-83EE-9F754F8E469A': 'Human readable value 2',
  'BC54CF8A-BB18-4E2C-B333-EA7086764819': 'Human readable value 3',
  '8DE15947-E49B-4FDC-89EE-235A330B7FEB': 'Human readable value n'
}
I have documents in a separate collection, such as the one below, which have non-human-readable attribute and value pairs as per "details":
{
  "ptype": {
    "name": "BC54CF8A-BB18-4E2C-B333-EA7086764819",
    "accuracy": 9.6,
    "details": {
      "49DD3A82-2B49-44F5-A0B2-BD88A32EDB13": "B015E210-27BE-4AA7-83EE-9F754F8E469A",
      "8DE15947-E49B-4FDC-89EE-235A330B7FEB": true
    }
  }
}
I need to update the above document by looking up the human-readable values in the lookup table, and I also need to replace the non-human-readable attribute names with the readable attribute names, which are also found in the lookup table.
The result should look like this:
{
  "ptype": {
    "name": "Human readable value 3",
    "accuracy": 9.6,
    "details": {
      "Human readable value 1": "Human readable value 2",
      "Human readable value n": true
    }
  }
}
So ptype.name and ptype.details are updated with values from the lookup table.
This query should help you see how a LUT (Look Up Table) can be used.
One cool feature of AQL is that you can do a LUT query and assign its result to a variable with the LET command, and then access the contents of that LUT later.
See if this example helps:
LET lut = {
  'aaa': 'Apples',
  'bbb': 'Bananas',
  'ccc': 'Carrots'
}

LET garden = [
  { 'size': 'Large', 'plant_code': 'aaa' },
  { 'size': 'Medium', 'plant_code': 'bbb' },
  { 'size': 'Small', 'plant_code': 'ccc' }
]

FOR doc IN garden
  RETURN {
    'size': doc.size,
    'vegetable': lut[doc.plant_code]
  }
The result of this query is:
[
  { "size": "Large", "vegetable": "Apples" },
  { "size": "Medium", "vegetable": "Bananas" },
  { "size": "Small", "vegetable": "Carrots" }
]
You'll notice that the bottom query, which actually returns data, refers to the LUT using doc.plant_code as the lookup key.
This is much more performant than performing subqueries there, because if you had 100,000 garden documents you wouldn't want to perform a supporting query 100,000 times to work out the name behind each plant_code.
If you wanted to confirm that you could find a value in the LUT, you could optionally have your final query in this format:
FOR doc IN garden
  RETURN {
    'size': doc.size,
    'vegetable': (lut[doc.plant_code] ? lut[doc.plant_code] : 'Unknown')
  }
This optional way of returning the value for vegetable uses an inline if/then/else: if the value is not found in the LUT, it returns 'Unknown'.
Hope this helps you with your particular use case.

Creating a 'SS' item in DynamoDB using boto3

I'm trying to create an item in AWS DynamoDB using boto3, and regardless of what I try I can't manage to get an attribute of type 'SS' created. Here's my code:
import boto3

client = boto3.resource('dynamodb', region_name=region)
table = client.Table(config[region]['table'])

sched = {
    "begintime": begintime,
    "description": description,
    "endtime": endtime,
    "name": name,
    "type": "period",
    "weekdays": [weekdays]
}

table.put_item(Item=sched)
The other columns work fine, but regardless of what I try, weekdays always ends up as an 'S' type. For reference, this is what one of the other items in the same table looks like:
{'begintime': '09:00', 'endtime': '18:00', 'description': 'Office hours', 'weekdays': {'mon-fri'}, 'name': 'office-hours', 'type': 'period'}
Trying to convert this to a Python structure obviously fails so I'm not sure how it's possible to insert a new item.
To indicate an attribute of type SS (String Set) using the boto3 DynamoDB resource-level methods, you need to supply a set rather than a simple list. For example:
import boto3

res = boto3.resource('dynamodb', region_name=region)
table = res.Table(config[region]['table'])

sched = {
    "begintime": '09:00',
    "description": 'Hello there',
    "endtime": '14:00',
    "name": 'james',
    "type": "period",
    "weekdays": set(['mon', 'wed', 'fri'])
}

table.put_item(Item=sched)
As a follow-up on @jarmod's answer:
If you want to call update_item with a String Set, you insert a set via the ExpressionAttributeValues property, as shown below:
entry = table.update_item(
    # Key identifies the item to update; the attribute name used here is an
    # assumption and depends on your table's key schema. Key attributes cannot
    # be updated, which is why "name" is not part of the UpdateExpression.
    Key={"name": 'james'},
    ExpressionAttributeNames={
        "#begintime": "begintime",
        "#description": "description",
        "#endtime": "endtime",
        "#type": "type",
        "#weekdays": "weekdays"
    },
    ExpressionAttributeValues={
        ":begintime": '09:00',
        ":description": 'Hello there',
        ":endtime": '14:00',
        ":type": "period",
        ":weekdays": set(['mon', 'wed', 'fri'])
    },
    UpdateExpression="""
        SET #begintime = :begintime,
            #description = :description,
            #endtime = :endtime,
            #type = :type,
            #weekdays = :weekdays
    """
)
(Hint: use of AttributeUpdates (the update_item counterpart of the Item parameter in put_item calls) is deprecated, so I recommend using ExpressionAttributeNames, ExpressionAttributeValues and UpdateExpression.)
