More unlinked exchanges when switching background database from ecoinvent cut-off to consequential - brightway

I have an LCI inventory in Excel that is originally linked to the ecoinvent 3.4 cut-off database. When I import it, I get no unlinked exchanges.
Now I would like to switch and link it to the ecoinvent 3.4 consequential database.
from brightway2 import *  # provides ExcelImporter, Database, databases

for k, fp in {"LCI": "lci.xlsx"}.items():
    if k not in databases:
        imp = ExcelImporter(fp)
        imp.apply_strategies()
        # Link exchanges internal to the imported inventory
        imp.match_database(fields=["name", "unit", "location"])
        # Link against the consequential background database
        imp.match_database('ecoinvent_conseq', fields=["reference product", "name", "unit", "location"])
        imp.match_database('ecoinvent_conseq', fields=["name", "unit", "location"])
        imp.statistics()
        imp.write_excel()  # export the (possibly partially linked) data for inspection
        imp.write_database()

database = Database('LCI')
For multi-output processes where there is a change in the reference product, I know why they do not get linked.
In some cases, however, the matching does not work even though the name, location, and unit of the dataset are provided.
The matching works with these fields when I use the cut-off database, but it does not with the consequential database.
What could be the reasons why these exchanges remain unlinked when switching to the consequential database?
Thank you!

This is a long shot, but when I installed the consequential version of ecoinvent 3.4, two flows were erased in the process; they are stored in the log: "venting of nitrogen, liquid" and "residual wood, dry".

The additional "unexplained" unlinked exchanges resulted from exchanges left unmatched because of differences between the attributional and the consequential database (i.e. one unmatched exchange trickled down and caused further mismatches).
The main difference concerns processes in the inventory whose reference product differs between the two databases.
A typical example is a CHP process, with heat as the determining product, used in the inventory to provide the reference product "medium voltage, electricity". In CLCA this cannot happen, so the strategy is simply to look for an alternative provider in the consequential database that can supply "medium voltage, electricity" as its own reference product.
Once these cases are resolved, the matching works perfectly with fields=["name", "unit", "location"].
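For cases like the CHP example, one possible approach (sketched below with purely illustrative provider and product names, not taken from the original inventory) is to redirect the affected exchanges to a provider whose own reference product is the one needed, and then re-run the matching:

# Minimal sketch, assuming the CHP case above; all names are illustrative.
for ds in imp.data:
    for exc in ds.get("exchanges", []):
        if (exc.get("reference product") == "medium voltage, electricity"
                and "heat and power co-generation" in exc.get("name", "")):
            # Point the exchange at a provider that has the electricity
            # as its own reference product, then force re-linking.
            exc["name"] = "market for electricity, medium voltage"  # hypothetical provider
            exc.pop("input", None)

imp.match_database('ecoinvent_conseq', fields=["reference product", "name", "unit", "location"])
imp.statistics()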

Related

Delta live tables data quality checks

I'm using Delta Live Tables from Databricks and I was trying to implement a complex data quality check (a so-called expectation) by following this guide. After I tested my implementation, I realized that even though the expectation fails, the tables downstream of the source table are still loaded.
To illustrate what I mean, here is an image describing the situation.
Image of the pipeline lineage and the incorrect behaviour
I would assume that if the report_table fails due to the expectation not being met (in my case, it was validating for correct primary keys), then the Customer_s table would not be loaded. However, as can be seen in the photo, this is not quite what happened.
Do you have any idea on how to achieve the desired result? How can I define a complex validation with SQL that would cause the future nodes to not be loaded (or it would make the pipeline fail)?
The default behavior when an expectation violation occurs in Delta Live Tables is to load the data but track the data quality metrics (retain invalid records). The other options are ON VIOLATION DROP ROW and ON VIOLATION FAIL UPDATE. Choose "ON VIOLATION DROP ROW" if that is the behavior you want in your pipeline.
https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-expectations.html#drop-invalid-records
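For reference, a minimal sketch of the failing variant using the Python API (the table, source, and column names are assumptions); the SQL form would use CONSTRAINT ... EXPECT (...) ON VIOLATION FAIL UPDATE:

import dlt  # Delta Live Tables Python API, available inside a DLT pipeline

@dlt.table
@dlt.expect_or_fail("valid_pk", "customer_id IS NOT NULL")  # fail the update on violation
def report_table():
    return dlt.read("source_table")  # hypothetical upstream table in the same pipeline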

ArangoDB comparison of documents in different databases

I'm interested in whether it's possible to compare two documents with the same "_id"s (same collection names and "_keys") that are stored in different databases.
My use case is a custom "map / layout engine" that is "mainly" fed by "automatic import / conversion jobs" from an external geo-data system.
So far that works fine.
In some cases, however, it's necessary to manually adjust e.g. the "x/y" coordinates of some objects to make them more usable.
By running the import job again (e.g. to fetch the latest data), all manual adjustments are lost, as they're simply overwritten by the "auto" data.
Therefore I'm thinking of a system setup consisting of several identically structured ArangoDB databases, used for different "stages" of the data lifecycle, like:
"staging" - newly "auto imported" data is placed here.
"production" - the "final" data that's presented to the user, including all the latest manual adjustments, is stored here.
The corresponding (simplified) lifecycle would look like this:
Auto-import into "staging"
Compare and import all manual adjustments from "production" into "staging"
Deploy "merged" contents from 1. and 2. as the new "production" version.
So, this topic is all about step 2's "comparison phase" between the "production" and the "staging" data values.
In SQL I'd express it with something like this:
SELECT
    x, y
FROM databaseA.layout AS layoutA
JOIN databaseB.layout AS layoutB ON (layoutA.id = layoutB.id)
WHERE
    ...
Thanks for any hints on how to solve this in ArangoDB using an AQL query or a FOXX service!
Hypothetically, if you had a versioning graph database handy, you could do the following:
On first import, insert new data creating a fresh revision R0 for each inserted node.
Manually change some fields of a node, say N in this data, giving rise to a new revision of N, say R1. Your previous version R0 is not lost though.
Repeat steps 1 and 2 as many times as you like.
Finally, when you need to show this data to the end user, use custom application logic to merge as many previous versions as you want with the current version, doing an n-way merge rather than a 2-way merge.
If you think this could be a potential solution, you can take a look at CivicGraph, which is a version control layer built on top of ArangoDB.
Note: I am the creator of CivicGraph, and this answer could qualify as a promotion for the product, but I also believe it could help solve your problem.
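Setting the versioning approach aside: a single AQL query runs against one database, so the comparison in step 2 can also be done client-side. A minimal sketch using the python-arango driver (host, credentials, and the "layout" collection name are assumptions):

from arango import ArangoClient  # python-arango driver

client = ArangoClient(hosts="http://localhost:8529")                  # illustrative host
staging = client.db("staging", username="root", password="")          # illustrative credentials
production = client.db("production", username="root", password="")

stag_layout = staging.collection("layout")
prod_layout = production.collection("layout")

# Compare the x/y of every production document with its staging twin
# (same collection name and _key) and carry manual adjustments over.
for prod_doc in prod_layout.all():
    stag_doc = stag_layout.get(prod_doc["_key"])
    if stag_doc and (prod_doc.get("x"), prod_doc.get("y")) != (stag_doc.get("x"), stag_doc.get("y")):
        stag_layout.update({"_key": prod_doc["_key"], "x": prod_doc["x"], "y": prod_doc["y"]})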

Possibility of GUID collision in MS CRM Data migration

We are doing a CRM data migration in order to keep two CRM systems in sync, and removing historical data from the primary CRM. The target CRM has been created with the source as its base. While we migrate the data, we keep the GUIDs of records the same in order to maintain data integrity. This solution expects that each GUID is still available in the target system to assign to the new record.
No new records are created directly in the target system except emails, and those are very low in number. However, there are other ways in which the system creates its own GUIDs: for example, when we move a newly created entity to the target using a solution, it will not maintain the GUIDs of the entity and attributes and will create its own, since we have no control over this. Some records that are created internally by the platform will also be assigned new GUIDs.
So, if we do not have control over GUID creation in the target system (although the number is very small), I fear the situation where the source system has a GUID that the target has already consumed, which at data migration time would give errors.
My question is: is there any possibility that the above can happen? Because if that happens to us, the whole migration solution will lose its value.
SQL Server's NEWID() generates a 128-bit ID. All IDs generated on the same machine are guaranteed to be unique but because yours have been generated across multiple machines, there's no guarantee.
That being said, from this source on GUIDs:
...for there to be a one in a billion chance of duplication, 103 trillion version 4 UUIDs must be generated.
So the answer is: yes, there is a chance of collision, but it's so astronomically low that most consider the answer to effectively be no.
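As a rough sanity check of that quoted figure, the birthday approximation p ≈ n^2 / (2 * 2^122) over the 122 random bits of a version 4 UUID can be inverted for p = one in a billion (a back-of-the-envelope sketch, not an exact collision model):

import math

RANDOM_BITS = 122   # random bits in a version 4 UUID
p = 1e-9            # target collision probability: one in a billion

# Birthday approximation: p ~= n^2 / (2 * 2**RANDOM_BITS)  =>  n ~= sqrt(2 * p * 2**RANDOM_BITS)
n = math.sqrt(2 * p * 2 ** RANDOM_BITS)
print(f"{n:.3e}")   # ~1.03e+14, i.e. roughly 103 trillion UUIDs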

Does CQL3 "IF" make my update not idempotent?

It seems to me that using IF would make the statement possibly fail if retried; therefore, the statement is not idempotent. For instance, given the CQL below, if it fails because of a timeout or system problem and I retry it, then it may not work because another person may have updated the version between retries.
UPDATE users
SET name = 'foo', version = 4
WHERE userid = 1
IF version = 3
Best practices for updates in Cassandra are to make updates idempotent, yet the IF operator is in direct opposition to this. Am I missing something?
If your application is idempotent, then generally you wouldn't need to use the expensive IF clause, since all your clients would be trying to set the same value.
For example, suppose your clients were aggregating some values and writing the result to a roll up table. Each client would calculate the same total and write the same value, so it wouldn't matter if multiple clients wrote to it, or what order they wrote to it, since it would be the same value.
If what you are actually looking for is mutual exclusion, such as keeping a bank balance, then the IF clause could be used. You might read a row to get the current balance, then subtract some money and update the balance only if the balance hadn't changed since you read it. If another client was trying to add a deposit at the same time, then it would fail and would have to try again.
But another way to do that without mutual exclusion is to write each withdrawal and deposit as a separate clustered transaction row, and then calculate the balance as an idempotent result of applying all the transaction rows.
You can use the IF clause for idempotent writes, but it seems pointless. The first client to do the write would succeed and Cassandra would return the value "applied=True". And the next client to try the same write would get back "applied=False, version=4", indicating that the row had already been updated to version 4 so nothing was changed.
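A minimal sketch with the DataStax Python driver (contact point, keyspace, and table are assumptions) of how a client might retry such a conditional update and treat "the row is already at the target version" as success:

from cassandra.cluster import Cluster  # DataStax Python driver

session = Cluster(["127.0.0.1"]).connect("my_ks")  # illustrative contact point and keyspace

def update_with_retry(userid, new_name, expected_version):
    res = session.execute(
        "UPDATE users SET name = %s, version = %s WHERE userid = %s IF version = %s",
        (new_name, expected_version + 1, userid, expected_version),
    )
    if res.was_applied:
        return True
    # A retried timeout may mean an earlier attempt already applied;
    # re-read and treat "already at the target version" as success.
    row = session.execute("SELECT version FROM users WHERE userid = %s", (userid,)).one()
    return row is not None and row.version == expected_version + 1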
This question is more about linearizability (ordering) than idempotency, I think. This query uses Paxos to try to determine the state of the system before applying a change. If the state of the system is identical, then the query can be retried many times without a change in the results. This provides a weak form of ordering (and is expensive), unlike most Cassandra writes. Generally you should only use CAS operations if you are attempting to record the state of a system (rather than a history or log).
Do not use many of these queries if you can help it; the guidelines suggest having only a small percentage of your queries rely on this behavior.

Is it possible to make conditional inserts with Azure Table Storage

Is it possible to make a conditional insert with the Windows Azure Table Storage Service?
Basically, what I'd like to do is to insert a new row/entity into a partition of the Table Storage Service if and only if nothing changed in that partition since I last looked.
In case you are wondering, I have Event Sourcing in mind, but I think that the question is more general than that.
Basically I'd like to read part of, or an entire, partition and make a decision based on the content of the data. In order to ensure that nothing changed in the partition since the data was loaded, an insert should behave like normal optimistic concurrency: the insert should only succeed if nothing changed in the partition - no rows were added, updated or deleted.
Normally in a REST service, I'd expect to use ETags to control concurrency, but as far as I can tell, there's no ETag for a partition.
The best solution I can come up with is to maintain a single row/entity for each partition in the table which contains a timestamp/ETag and then make all inserts part of a batch consisting of the insert as well as a conditional update of this 'timestamp entity'. However, this sounds a little cumbersome and brittle.
Is this possible with the Azure Table Storage Service?
The view from a thousand feet
Might I share a small tale with you...
Once upon a time someone wanted to persist events for an aggregate (from Domain Driven Design fame) in response to a given command. This person wanted to ensure that an aggregate would only be created once and that any form of optimistic concurrency could be detected.
To tackle the first problem - that an aggregate should only be created once - he did an insert into a transactional medium that threw when a duplicate aggregate (or more accurately the primary key thereof) was detected. The thing he inserted was the aggregate identifier as primary key and a unique identifier for a changeset. A collection of events produced by the aggregate while processing the command is what is meant by "changeset" here. If someone or something else beat him to it, he would consider the aggregate already created and leave it at that. The changeset would be stored beforehand in a medium of his choice. The only promise this medium must make is to return what has been stored as-is when asked. Any failure to store the changeset would be considered a failure of the whole operation.
To tackle the second problem - detection of optimistic concurrency in the further life-cycle of the aggregate - he would, after having written yet another changeset, update the aggregate record in the transactional medium if and only if nobody had updated it behind his back (i.e. compared to what he last read just before executing the command). The transactional medium would notify him if such a thing happened. This would cause him to restart the whole operation, rereading the aggregate (or changesets thereof) to make the command succeed this time.
Of course, now that he had solved the writing problems, along came the reading problems. How would one be able to read all the changesets of an aggregate that made up its history? After all, he only had the last committed changeset associated with the aggregate identifier in that transactional medium. And so he decided to embed some metadata as part of each changeset. Among the metadata - which is not so uncommon to have as part of a changeset - would be the identifier of the previous last committed changeset. This way he could "walk the line" of changesets of his aggregate, like a linked list so to speak.
As an additional perk, he would also store the command message identifier as part of the metadata of a changeset. This way, when reading changesets, he could know in advance if the command he was about to execute on the aggregate was already part of its history.
All's well that ends well ...
P.S.
1. The transactional medium and changeset storage medium can be the same.
2. The changeset identifier MUST NOT be the command identifier.
3. Feel free to punch holes in the tale :-)
4. Although not directly related to Azure Table Storage, I've implemented the above tale successfully using AWS DynamoDB and AWS S3.
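A rough sketch of the record shapes the tale above implies (all field names and values are illustrative, not tied to any particular store):

# Illustrative shapes only; any transactional medium / blob store could hold these.
aggregate_record = {
    "aggregate_id": "order-42",          # primary key in the transactional medium
    "last_changeset_id": "cs-0003",      # updated only if unchanged since the last read
}

changeset = {
    "changeset_id": "cs-0003",
    "previous_changeset_id": "cs-0002",  # lets a reader walk the history like a linked list
    "command_id": "cmd-9001",            # detects commands already part of the history
    "events": [{"type": "ItemAdded", "sku": "abc"}],
}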
How about storing each event at a "PartitionKey/RowKey" built from AggregateId/AggregateVersion, where AggregateVersion is a sequential number based on how many events the aggregate already has?
This is very deterministic, so when adding a new event to the aggregate you can be sure you were working from its latest version, because otherwise you'll get an error saying that the row for that partition already exists. At that point you can drop the current operation and retry, or try to figure out whether you could merge the operations anyway, if the new updates to the aggregate do not conflict with the operation you just did.
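A minimal sketch of that idea with the azure-data-tables Python SDK (connection string, table name, and entity fields are assumptions); the insert fails if another writer already claimed that aggregate version:

from azure.core.exceptions import ResourceExistsError
from azure.data.tables import TableClient

table = TableClient.from_connection_string("<connection string>", table_name="events")  # illustrative

def append_event(aggregate_id, expected_version, payload):
    entity = {
        "PartitionKey": aggregate_id,
        "RowKey": f"{expected_version:010d}",  # zero-padded so rows sort by version
        "Payload": payload,
    }
    try:
        table.create_entity(entity)
        return True
    except ResourceExistsError:
        # Someone else appended this version first: reload the aggregate and retry.
        return False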
