When using Brightway, should I delete exchanges linked to an activity that I have erased?

This question is related to a previous question I raised about creating activities with Brightway 2 from a proxy activity. The question is: should I modify the table of exchanges if I decide to erase my proxy activity?
Let's say I decide to create a heat pump activity for Quebec, using a Swiss heat pump as the proxy but adapting the origin of the electricity:
from brightway2 import Database

# identify the activity supplying electricity from Quebec
for ds in Database('ei_33cutoff'):
    if ('market for electricity, low voltage' in ds['name']) and (ds['location'] == 'CA-QC'):
        print(ds['name'])
        print(ds['code'])
elw_qc = Database('ei_33cutoff').get('44389eae7d62fa9d4ea9ea2b9fc2f609')

# identify the proxy activity
for ds in Database('ei_33cutoff'):
    if ('heat production, air-water' in ds['name']) and (ds['location'] == 'CH'):
        print(ds['name'], ds['location'], ds['code'])
hp_proxy = Database('ei_33cutoff').get('694d03f60920c0f7d964c08db1c67226')

# create a copy of the proxy
hp_qc = hp_proxy.copy()

# update the location
hp_qc['location'] = 'CA-QC'

# update the electricity exchange
elect_to_hp = [exc for exc in hp_qc.technosphere() if 'electricity, low voltage' in exc['name']][0]
elect_to_hp['input'] = elw_qc
elect_to_hp.save()

# store my new activity in the database
hp_qc.save()
However, suppose that during this procedure I create a proxy activity that contains mistakes, or that I no longer want for other reasons. How should I "clean" the database of these faulty activities? Would hp_qc.delete() suffice? Activities and exchanges are stored in different tables in the SQLite database, so I am wondering whether I am "polluting" the exchange table with exchanges linked to activities that no longer exist, which could cause problems in the future.

Calling Activity.delete() will delete all exchanges where your activity is the consumer, i.e. all incoming exchanges. It won't delete exchanges where other activities consume your reference product, but as far as I can tell you don't have any exchanges like that in this example.
There are a number of ways to "clean" the database, although such cleaning isn't really necessary in this case. The easiest is probably to learn how to use the Exchanges object; then you can delete whatever you want.
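For illustration, here is a minimal sketch of such a cleanup, assuming the bw2data API (Activity.exchanges(), Activity.upstream(), Exchange.delete() and Activity.delete()); hp_qc is the copied activity from the question, and you should check these method names against your Brightway version:
# delete the exchanges the copy consumes (Activity.delete() would remove these anyway)
for exc in list(hp_qc.exchanges()):
    exc.delete()
# delete exchanges where other activities consume the copy's reference product, if any
for exc in list(hp_qc.upstream()):
    exc.delete()
# finally remove the activity itself
hp_qc.delete()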

Related

Prevent DELETES from bypassing versioning in Amazon QLDB

Amazon QLDB allows querying the version history of a specific object by its ID. However, it also allows deleting objects. It seems like this can be used to bypass versioning by deleting and creating a new object instead of updating the object.
For example, let's say we need to track vehicle registrations by VIN.
INSERT INTO VehicleRegistration
<< {
'VIN' : '1N4AL11D75C109151',
'LicensePlateNumber' : 'LEWISR261LL'
} >>
Then our application can get a history of all LicensePlateNumber assignments for a VIN by querying:
SELECT * FROM _ql_committed_VehicleRegistration AS r
WHERE r.data.VIN = '1N4AL11D75C109151';
This will return all non-deleted document revisions, giving us an unforgeable history. The history function can be used similarly if you remember the document ID from the insert. However, if I wanted to maliciously bypass the history, I would simply delete the object and reinsert it:
DELETE FROM VehicleRegistration AS r WHERE VIN = '1N4AL11D75C109151';
INSERT INTO VehicleRegistration
<< {
'VIN' : '1N4AL11D75C109151',
'LicensePlateNumber' : 'ABC123'
} >>
Now there is no record that I have modified this vehicle registration, defeating the whole purpose of QLDB. The document ID of the new record will be different from the old, but QLDB won't be able to tell us that it has changed. We could use a separate system to track document IDs, but now that other system would be the authoritative one instead of QLDB. We're supposed to use QLDB to build these types of authoritative records, but the other system would have the exact same problem!
How can QLDB be used to reliably detect modifications to data?
There would be a record of both the original document and its deletion in the ledger, available through the history() function, as you pointed out. So the bad behavior cannot truly be hidden; the attacker is simply hoping that nobody knows to look for it.
You have a couple of options here. First, QLDB rolled out fine-grained access control last week (announcement here). This would let you, say, prohibit deletes on a given table. See the documentation.
Another thing you can do is look for deletions or other suspicious activity in real-time using streaming. You can associate your ledger with a Kinesis Data Stream. QLDB will push every committed transaction into the stream where you can react to it using a Lambda function.
If you don't need real-time detection, you can do something with QLDB's export feature. This feature dumps ledger blocks into S3, where you can extract and process the data. The blocks contain not just your revision data but also the PartiQL statements used to create the transaction. You can set up an EventBridge scheduler to kick off a periodic export (say, of the day's transactions) and then churn through it to look for suspicious deletes, etc. This lab might be helpful for that.
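Whichever route you take (stream or export), the per-record check itself is simple. Below is a hedged Python sketch of that classification step, assuming the documented QLDB stream record layout (recordType of REVISION_DETAILS, and a revision that carries metadata but no data field when the document was deleted); decoding the Kinesis records themselves (base64, Ion, de-aggregation) is omitted:

def is_deletion(record: dict) -> bool:
    # A deleted revision is a REVISION_DETAILS record whose revision carries
    # metadata but no user data (assumed stream record layout).
    if record.get("recordType") != "REVISION_DETAILS":
        return False
    revision = record.get("payload", {}).get("revision", {})
    return "metadata" in revision and "data" not in revision

# Example with a hand-built record; the shape is illustrative, not captured output.
sample = {
    "recordType": "REVISION_DETAILS",
    "payload": {
        "tableInfo": {"tableName": "VehicleRegistration"},
        "revision": {"metadata": {"id": "some-doc-id", "version": 3}},  # no "data" key
    },
}
if is_deletion(sample):
    print("Suspicious delete in table", sample["payload"]["tableInfo"]["tableName"])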
I think the best approach is to manage it with permissions. Keep developers out of production or make them assume a temporary role to get limited access.

Possibility of GUID collision in MS CRM Data migration

We are doing a CRM data migration in order to keep two CRM systems in sync, and removing historical data from the primary CRM. The target CRM was created using the source as its base. When we migrate the data, we keep the GUIDs of the records the same in order to maintain data integrity. This solution expects that, in the target system, each GUID is still available to assign to the new record. No new records are created directly in the target system except emails, and those are very low in number. Apart from that, however, there are ways in which the system creates its own GUIDs: for example, when we move a newly created entity to the target using a solution, it will not maintain the GUIDs of the entity and attributes and will create its own, since we have no control over this. Some records created internally by the platform will also be assigned new GUIDs. So if we do not have control over GUID creation in the target system (although the number is very small), I fear the situation where the source system has a GUID that the target has already consumed, and at migration time this will produce errors.
My question: is there any possibility that the above can happen? Because if it does, the whole migration solution will lose its value.
SQL Server's NEWID() generates a 128-bit ID. All IDs generated on the same machine are guaranteed to be unique but because yours have been generated across multiple machines, there's no guarantee.
That being said, from this source on GUIDs:
...for there to be a one in a billion chance of duplication, 103 trillion version 4 UUIDs must be generated.
So the answer is yes, there is a chance of collision, but it's so astronomically low that most consider the answer to effectively be no.
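For a rough sense of the scale, here is a quick birthday-bound estimate in Python: a version 4 UUID carries 122 random bits, so for n IDs the collision probability is approximately n^2 / 2^123.

# Birthday-bound approximation for version 4 UUIDs (122 random bits): p ~= n**2 / 2**123
def uuid4_collision_probability(n: int) -> float:
    return n * n / 2.0 ** 123

print(uuid4_collision_probability(103_000_000_000_000))  # the quoted figure: ~1e-9
print(uuid4_collision_probability(100_000_000))          # a 100-million-record migration: ~1e-21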

How do I process data that isn't sliced by time in Azure Data Factory?

So I am trying to use Azure Data Factory to replace the SSIS system we have in place, and I am having some trouble...
The process I want to follow is to take a list of projects and a list of clients and create a report of the clients and projects we have. These lists update frequently, so I want to update this report every hour. To combine the data, I will be using Power BI Pro, so Data Factory just needs to load the data into a usable format.
My source right now is a call to an API that returns a list of projects. However, this data isn't separated by time at all. I don't see any sort of history. Same goes for the list of clients.
What should the availability for my dataset be?
You may use the custom activity in ADF to call the API that returns the list of projects. The custom activity can then write that data in the right format to the destination.
Example of a custom activity in ADF: https://azure.microsoft.com/en-us/documentation/articles/data-factory-use-custom-activities/
The frequency will be the cadence at which you wish to run this operation.
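To make that concrete, here is a hedged Python sketch of the work such a custom activity would perform: call the projects API and write the rows to a flat file that Power BI can consume. The endpoint URL, column names, and output path are placeholders, not part of any real API.

import csv
import requests

API_URL = "https://example.com/api/projects"   # placeholder endpoint
OUTPUT_PATH = "projects.csv"                   # in ADF this would target blob storage

def fetch_projects():
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    return response.json()  # assumed: the API returns a JSON list of project objects

def write_csv(projects, path):
    fields = ["id", "name", "client_id"]  # placeholder columns
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(projects)

if __name__ == "__main__":
    write_csv(fetch_projects(), OUTPUT_PATH)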

Cassandra - multiple counters based on timeframe

I am building an application and using Cassandra as my datastore. In the app, I need to track event counts per user, per event source, and need to query the counts for different windows of time. For example, some possible queries could be:
Get all events for user A for the last week.
Get all events for all users for yesterday where the event source is source S.
Get all events for the last month.
Low latency reads are my biggest concern here. From my research, the best way I can think of to implement this is a separate counter table for each permutation of source, user, and predefined time window. For example, create a count_by_source_and_user table, where the partition key is a combination of source and user ID, and then create a count_by_user table for just the user counts.
This seems messy. What's the best way to do this, or could you point towards some good examples of modeling these types of problems in Cassandra?
You are right. If latency is your main concern (and it should be, since you have already chosen Cassandra), you need to create a table for each of your queries. This is the recommended way to use Cassandra: optimize for reads and don't worry about redundant storage. Since the data within every table is stored sequentially according to its primary key, you cannot index a table in more than one way (as you would with a relational DB). I hope this helps. Look for the "Data Modeling" presentation that is usually given at "Cassandra Day" events; you may find it on "Planet Cassandra" or John Haddad's blog.
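As a sketch of what one-table-per-query could look like for the counts above, using the DataStax Python driver (the keyspace, table, and column names here are made up; note that in a counter table every non-key column must be a counter):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("events_ks")  # keyspace assumed to exist

# One table per query: counts per (source, user) per day, and counts per user per day.
session.execute("""
    CREATE TABLE IF NOT EXISTS counts_by_source_and_user (
        source  text,
        user_id text,
        day     text,
        n       counter,
        PRIMARY KEY ((source, user_id), day)
    )""")
session.execute("""
    CREATE TABLE IF NOT EXISTS counts_by_user (
        user_id text,
        day     text,
        n       counter,
        PRIMARY KEY (user_id, day)
    )""")

# On each event, bump every table that has to answer a query about it.
def record_event(source, user_id, day):
    session.execute(
        "UPDATE counts_by_source_and_user SET n = n + 1 "
        "WHERE source = %s AND user_id = %s AND day = %s",
        (source, user_id, day))
    session.execute(
        "UPDATE counts_by_user SET n = n + 1 WHERE user_id = %s AND day = %s",
        (user_id, day))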

Sequential numbering in the cloud

Ok so a simple task such as generating a sequential number has caused us an issue in the cloud.
Where you have more than one server, it gets harder and harder to guarantee that the numbers allocated by different servers do not clash.
We are using Azure servers if it helps.
We thought about using the app cache but you cannot guarantee it will be updated between servers.
We are limited to using:
a SQL table with an identity column
or
some peer to peer method between servers
or
use a blob store and utilise its locks to store the most up-to-date number (this could have scaling issues)
I just wondered if anyone has an idea of a solution to resolve this?
Surely it's a simple problem and must have been solved by now.
If you can live with a use case where the numbers you get from this central location are not always sequential (but are guaranteed to be unique), I would suggest considering the following pattern. I've helped a large e-commerce client implement this since they needed unique int PKs to synchronize back to on-premise systems:
Create a queue and a small always-running process that populates this queue with sequential integers (this process should remember which number it generated last and keep replenishing the pool with more numbers once the queue gets close to empty).
Now, you can have your code first poll the next number from the queue, delete it from the queue, and then attempt to save it into the SQL Azure database. In case of failure, all you'll have is a "hole" in your sequential numbers. In scenarios with frequent inserts, you may be saving things out of order to the database (two processes poll from the queue; one polls first but saves last, so the PKs saved to the database are no longer sequential).
The biggest downside is that you now have to maintain/monitor the process that replenishes the pool of PKs.
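For illustration, here is a hedged Python sketch of the consuming side against an Azure Storage queue (azure-storage-queue v12 client; the connection string and queue name are placeholders, and the replenishing process is not shown):

from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string(
    conn_str="<storage-connection-string>",  # placeholder
    queue_name="sequence-numbers")           # placeholder

def next_number() -> int:
    for message in queue.receive_messages(messages_per_page=1):
        queue.delete_message(message)  # claim it so no other worker can reuse it
        return int(message.content)
    raise RuntimeError("number pool is empty; the replenisher has fallen behind")

# new_id = next_number()  # then use it as the PK when inserting into SQL Azure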
After reading this, I would not trust the identity column.
I think the best way is, before the insert, to get the last stored id and increment it by one (programmatically). Another option is to create a trigger, but that could be a mess if you receive a lot of concurrent requests on the DB or if your table has millions of records.
create trigger trigger_name
on table_name
after insert
as
declare @seq int
set @seq = (select max(id) + 1 from table_name)
update table_name
set table_name.id = @seq
from table_name
inner join inserted
on table_name.id = inserted.id
More info:
http://msdn.microsoft.com/en-us/library/windowsazure/ee336242.aspx
If you're worried about scaling the number generation when using blobs, then you can use the SnowMaker library, which is available on GitHub and NuGet. It gets around the scale problem by retrieving blocks of IDs into a local cache. This guarantees that the IDs are unique, but not necessarily sequential if you have more than one server. I'm not sure if that would achieve what you're after.
