Idempotency of cronjobs - "workflow" tables in database - cron

I'm currently working on a backend system and am faced with porting cron job functionality from a legacy system to the new backend. A bunch of these jobs are not idempotent, and I want to make them idempotent when porting them.
As I understand it, for a job to be idempotent, its state (whether it has been completed, or possibly whether it is currently being performed) should somehow be represented in the database / entity model. That way, a task can always conditionally opt out of running if the data shows that it's already done / being handled.
I can imagine simple scenarios where you can just add an extra field (column) to entities (tables) for certain tasks specifically related to that entity, for example:
entity Reservation {
id
user_id
...
reminder_sent(_at) <- whether the "reminder" task has been performed yet
}
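For that simple case, the conditional opt-out could look something like the sketch below. This is only an illustration (not part of the original question), assuming a Postgres reservation table shaped like the entity above, the node-postgres client, and a hypothetical sendReminderEmail helper:
import { Pool } from "pg";

const pool = new Pool(); // connection settings taken from the environment

// Assumed to exist elsewhere; only its signature matters for the sketch.
declare function sendReminderEmail(userId: string): Promise<void>;

async function sendReservationReminders(): Promise<void> {
  // Only pick up reservations whose reminder has not been sent yet.
  const { rows } = await pool.query(
    "SELECT id, user_id FROM reservation WHERE reminder_sent_at IS NULL"
  );
  for (const reservation of rows) {
    // Claim the row first; the extra IS NULL guard makes the claim atomic,
    // so a repeated or concurrent run cannot pick up the same reservation.
    const claim = await pool.query(
      "UPDATE reservation SET reminder_sent_at = now() WHERE id = $1 AND reminder_sent_at IS NULL",
      [reservation.id]
    );
    if (claim.rowCount === 1) {
      await sendReminderEmail(reservation.user_id);
    }
  }
}
Marking before sending gives at-most-once delivery; marking after sending would give at-least-once. Which trade-off is acceptable depends on the job.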
But more generally, I feel like tasks often involve a bunch of different entities, and it would "pollute" the entities if they needed to "know about" the tasks that operate on them. Also, in more complex cases the "state" of a job can be more complicated than just "done or not done yet". Here are some examples from our business:
If a user has more than a certain amount in total unpaid invoices, we send three consecutive email reminders at certain intervals until it is resolved, and if not, we end up sending the data to an external party for collection. If the user pays the said invoices but then acquires new ones, the workflow should restart instead of continue.
Once every month, certain users get rewarded with vouchers. The description of the voucher notes the details, e.g. "Campaign bla bla, Jul 2022", but that's all we have "in" the data of the voucher to tell that it relates to this job.
I feel like there must be a general known engineering concept here, but I can't seem to find the right resources on the internet. For the time being, I'm calling these things "workflows", and think it makes sense for them to have their own entity/table, e.g.
entity Workflow_UnpaidInvoicesReminder {
id
# "key" by which the job is idempotent
user_id
invoice_id / invoice_ids
# workflow status/result fields
created_at
paid_at
first_reminder_sent_at
second_reminder_sent_at
third_reminder_sent_at
sent_externally_at
}
entity Workflow_CampaignVouchers {
id
# "key" by which the job is idempotent
user_id
campaign_key
# workflow status/result fields
created_at
voucher_id
}
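As an illustration of how a job could use such a workflow table as its idempotency guard (my own sketch, not from the question; it assumes Postgres, node-postgres, a UNIQUE constraint on (user_id, campaign_key), and a hypothetical createVoucher helper):
import { Pool } from "pg";

const pool = new Pool();

// Hypothetical helper that creates the voucher and returns its id.
declare function createVoucher(userId: string, campaignKey: string): Promise<string>;

async function grantCampaignVoucher(userId: string, campaignKey: string): Promise<void> {
  // The UNIQUE (user_id, campaign_key) constraint is the idempotency key:
  // ON CONFLICT DO NOTHING turns a duplicate run into a no-op.
  const claimed = await pool.query(
    `INSERT INTO workflow_campaign_vouchers (user_id, campaign_key, created_at)
     VALUES ($1, $2, now())
     ON CONFLICT (user_id, campaign_key) DO NOTHING
     RETURNING id`,
    [userId, campaignKey]
  );
  if (claimed.rowCount === 0) {
    return; // another run already handled this user/campaign pair
  }
  const voucherId = await createVoucher(userId, campaignKey);
  await pool.query(
    "UPDATE workflow_campaign_vouchers SET voucher_id = $1 WHERE id = $2",
    [voucherId, claimed.rows[0].id]
  );
}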
Can someone help me find the appropriate terms and resources to describe the above? I can't seem to find good information on the internet about the general idea of "workflows" like these.

Related

Prevent DELETES from bypassing versioning in Amazon QLDB

Amazon QLDB allows querying the version history of a specific object by its ID. However, it also allows deleting objects. It seems like this can be used to bypass versioning by deleting and creating a new object instead of updating the object.
For example, let's say we need to track vehicle registrations by VIN.
INSERT INTO VehicleRegistration
<< {
'VIN' : '1N4AL11D75C109151',
'LicensePlateNumber' : 'LEWISR261LL'
} >>
Then our application can get a history of all LicensePlateNumber assignments for a VIN by querying:
SELECT * FROM _ql_committed_VehicleRegistration AS r
WHERE r.data.VIN = '1N4AL11D75C109151';
This will return all non-deleted document revisions, giving us an unforgeable history. The history function can be used similarly if you remember the document ID from the insert. However, if I wanted to maliciously bypass the history, I would simply delete the object and reinsert it:
DELETE FROM VehicleRegistration AS r WHERE r.VIN = '1N4AL11D75C109151';
INSERT INTO VehicleRegistration
<< {
'VIN' : '1N4AL11D75C109151',
'LicensePlateNumber' : 'ABC123'
} >>
Now there is no record that I have modified this vehicle registration, defeating the whole purpose of QLDB. The document ID of the new record will be different from the old, but QLDB won't be able to tell us that it has changed. We could use a separate system to track document IDs, but now that other system would be the authoritative one instead of QLDB. We're supposed to use QLDB to build these types of authoritative records, but the other system would have the exact same problem!
How can QLDB be used to reliably detect modifications to data?
There would be a record of the original document and its deletion in the ledger, which would be available through the history() function, as you pointed out. So there's no way to hide the bad behavior; it's a matter of hoping nobody knows to look for it. Again, as you pointed out.
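For completeness, here is a rough sketch of what such an audit query might look like with the Node.js driver (amazon-qldb-driver-nodejs); the ledger and table names are just the ones from the question:
import { QldbDriver } from "amazon-qldb-driver-nodejs";

const driver = new QldbDriver("vehicle-registration"); // example ledger name

async function auditVin(vin: string) {
  return driver.executeLambda(async (txn) => {
    // history() returns every committed revision. Filtering on the VIN brings
    // back revisions from both the old and the new document, so the break in
    // document ids exposes the delete-and-reinsert; filtering on h.metadata.id
    // instead would also return the delete revision itself.
    const result = await txn.execute(
      "SELECT * FROM history(VehicleRegistration) AS h WHERE h.data.VIN = ?",
      vin
    );
    return result.getResultList();
  });
}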
You have a couple of options here. First, QLDB rolled out fine-grained access control last week (announcement here). This would let you, say, prohibit deletes on a given table. See the documentation.
Another thing you can do is look for deletions or other suspicious activity in real-time using streaming. You can associate your ledger with a Kinesis Data Stream. QLDB will push every committed transaction into the stream where you can react to it using a Lambda function.
If you don't need real-time detection, you can do something with QLDB's export feature. This feature dumps ledger blocks into S3, where you can extract and process data. The blocks contain not just your revision data but also the PartiQL statements used to create the transaction. You can set up an EventBridge scheduler to kick off a periodic export (say, of the day's transactions) and then churn through it to look for suspicious deletes, etc. This lab might be helpful for that.
I think the best approach is to manage it with permissions. Keep developers out of production or make them assume a temporary role to get limited access.

Sum on collection invariant prevents Aggregate root creation/update

I am aware there are a lot of topics on set validation, and I won't say I have read every single one of them, but I've read a lot and still don't feel I've seen a definitive answer that doesn't smell hackish.
Consider this:
we have a concept of Customer
Customer has some general details data
Customer can make Transaction (buying things from the store)
if Customer is in credit mode then he has a limit of how much he can spend in a year
number of Transactions per Customer per year can be huge (thousands+)
it is critical that Customer never spends a cent over the limit (there is no human delivering goods who would check the limit manually)
Customer can either create new Transaction or add items to existing ones and for both the limit must be checked
Customer can actually be a Company behind which there are many Users making actual transactions, meaning Transactions can be created/updated concurrently
Obviously, I want to avoid loading all Transactions for a Customer when creating a new or editing an existing Transaction, as it doesn't scale well for a huge number of Transactions.
If I introduce an aggregate dedicated to checking currentLimitSpent before creating/updating a Transaction, then I have a non-transactional create/update (one step to check currentLimitSpent and another to create/update the Transaction).
I know how to implement this if I don't care about all the DDD rules (or with an eventual consistency approach), but I am wondering if there is some idiomatic DDD way of solving these kinds of problems with strict consistency that doesn't involve loading all Transactions for every Transaction create/update?
it is critical that Customer never spends a cent over the limit (there
is no human delivering goods who would check the limit manually)
Please read this couple of posts: RC Don't Exist and Eventual Consistency
If the system's owners still think that the condition must be honored, then, to avoid concurrency issues, you could use a precomputed currentLimitSpent stored in persistence (since there is no event-sourcing tag in your question) to check the invariant, and use it as an optimistic concurrency flag.
Hydrate your aggregate with currentLimitSpent and any other data you need from persistence.
Check the rule (currentLimitSpent + newTransactionValue <= customerMaxCredit).
Persist (currentLimitSpent + newTransactionValue) as the new currentLimitSpent.
If currentLimitSpent has changed in persistence while the aggregate was working (many Users in the same Company making transactions), you should get an optimistic concurrency error from persistence.
You could stop on the exception or rehydrate the aggregate and try again.
This is an overview. It cannot be more detailed without going into tech-stack details and architectural design.
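Purely as an illustration of the steps above, here is my own sketch assuming one particular stack: Postgres via node-postgres and a customer_credit table with current_limit_spent and max_credit columns (all names are hypothetical):
import { Pool } from "pg";

const pool = new Pool();

async function registerTransaction(customerId: string, newTransactionValue: number): Promise<void> {
  // 1. Hydrate: read the precomputed spend and the limit.
  const { rows } = await pool.query(
    "SELECT current_limit_spent, max_credit FROM customer_credit WHERE customer_id = $1",
    [customerId]
  );
  const { current_limit_spent, max_credit } = rows[0];

  // 2. Check the invariant in memory.
  if (Number(current_limit_spent) + newTransactionValue > Number(max_credit)) {
    throw new Error("Credit limit exceeded");
  }

  // 3. Persist the new spend, using the old value as the optimistic concurrency flag:
  //    the UPDATE matches nothing if someone else changed current_limit_spent meanwhile.
  const update = await pool.query(
    `UPDATE customer_credit
        SET current_limit_spent = current_limit_spent + $1
      WHERE customer_id = $2 AND current_limit_spent = $3`,
    [newTransactionValue, customerId, current_limit_spent]
  );
  if (update.rowCount === 0) {
    throw new Error("OptimisticConcurrencyError"); // rehydrate and retry, or stop
  }
  // The Transaction row itself would be inserted in the same database transaction.
}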

How to handle (partially) dependent aggregate roots?

I have the domain concept of a Product.
Product has some GeneralDetails, let's say: sku, name, description.
At the same time, Product has a ProductCalculations part where accountants can put different values like purchasePrice, stockLevelExpenses, wholeSalesPrice, retailPrice.
So, so far, Product would look something like:
class Product {
  GeneralDetails Details;
  ProductCalculations Calculations;
  ChangeDetails(GeneralDetails details) {}
  Recalculate(ProductCalculations calculations) {}
}
This setup would make Product an aggregate root. But now, I want to split it in a way that a product manager can input/update product details and an accountant can then step in and independently change the calculations for a given product, without concurrency issues.
That would suggest splitting it into 2 separate aggregate roots.
But then, deleting the ProductDetails aggregate must mean deleting ProductCalculations too, and it should happen in a transactional way.
Assuming they are 2 separate aggregate roots, meaning they have 2 separate repositories with corresponding Delete methods, how do I implement this as an atomic transaction?
The only thing I can think of is to raise an event when ProductDetails gets deleted and have a handler (a domain service) that uses some special repository that handles transactions over multiple aggregate roots.
Is there some problem with that approach and/or is there some better way to handle it?
PS.
I cannot allow eventual consistency when ProductDetails is deleted.
PS2.
Based on comments from @Jon, Details and Calculations creation & deletion should be synced in a way that when Details are created/deleted, Calculations are also created/deleted.
On the other hand, their updates should be completely independent.
I think the answer to your question depends somewhat on what data storage technology you're using and your data storage model, because if you can push operation transactionality to the data layer, things get much easier.
If you're using a document-oriented database (Cosmos DB, MongoDB, etc...), I would model and store your Product aggregate (including Details and Calculations) as a single document and you get the atomic transaction and concurrency checking for free from the database.
If you must store these as separate documents/records in your data store, then providing atomic transactions and concurrency checking becomes your concern. For years folks (especially those using Entity Framework) have been using the Unit of Work pattern to batch up multiple repository operations and submit them to the database as a single operation (EF-specific UoW implementation). Rob Conery suggests here that a better option is to use Command objects to encapsulate a multi-part operation that needs to be executed as a single transaction.
In any event, I would encourage you to keep the management of this operation within Product, so that consumers of Product are unaware of what's going on during the save - they just blissfully call product.SaveAsync() and they don't need to know whether that's causing one record update or ten. As long as Product is injected with the repositories it needs to get the job done, there's no need to have a separate domain service to coordinate this operation. There's nothing wrong with Product listening for events that its children raise and responding appropriately.
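To make the document-oriented option concrete, here is a minimal sketch assuming MongoDB and the official Node.js driver; the document shape and collection name are mine, not from the question:
import { MongoClient } from "mongodb";

type ProductDocument = {
  _id: string; // the product id
  details: { sku: string; name: string; description: string };
  calculations: { purchasePrice: number; stockLevelExpenses: number; wholeSalesPrice: number; retailPrice: number };
};

async function deleteProduct(client: MongoClient, productId: string): Promise<void> {
  const products = client.db("catalog").collection<ProductDocument>("products");
  // Details and Calculations live in one document, so this single operation
  // removes both atomically; no cross-aggregate coordination is needed.
  await products.deleteOne({ _id: productId });
}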
Hope this helps!
" I cannot allow eventual consistency when ProductDetails is deleted"
Why not? What would be the business cost of having Inventory.Product exist while Finance.Product doesn't or vice-versa?
"but then that accountant can step in and intependently change calculations for given product"
That's pretty much what eventual consistency is, no?
If you really can't have eventual consistency then use a domain service to create/delete two distinct aggregate roots in a single transaction, but ask yourself how you are going to do that if the information is not entirely provided by the same end user?
I agree with @plalx on almost every point. However, I want to add my bit to the discussion.
I've found that there is usually very little cost in creating two or more related aggregates inside a single transaction (inside a single bounded context). After all, if those aggregates don't exist yet there cannot be a concurrency conflict, so there is no contention and not much difference. Further, you don't need to deal with partially created state (given that the state is split between aggregates). It is possible to do this with eventual consistency, and there are situations where that is a better approach, but most of the time there is no great benefit. Even Vernon, in his book Implementing Domain-Driven Design, mentions this use case as a "valid reason to break the rules".
Deleting more than one aggregate is a different story. What should happen if you delete an aggregate that another user is updating at the same time? The probability of such a conflict increases with the number of aggregates you try to modify/delete in the same transaction. Is there always an upstream/downstream relationship between those aggregates? I mean, if a user deletes A and B must also be deleted, does the user that is updating B have no "power" or "voice" to cancel that deletion, given that she is providing more information to the state of the aggregate?
Those are very tricky questions, and most of the time this is something you need to discuss with a domain expert; there are very few real scenarios where the answer is something you can't afford to handle with eventual consistency. I discovered that in many cases it is preferable to put a "flag" marking the aggregate as "inactive", notifying users that it will be deleted after some period of time. If no user with enough permission requests that the aggregate become active again, then it gets deleted. That helped users to not kill themselves when they deleted some aggregate by mistake.
You've mentioned that you don't want a user to spend hours modifying one aggregate if there is a deletion, but that is something a transaction doesn't help much with. This is very dependent on the whole architecture, though. That user could have loaded the aggregate into her own memory space before the deletion occurs. It doesn't matter whether you delete inside a transaction; the user is still wasting time. A better solution could be to publish a domain event that triggers some sort of push notification to the user, so she knows that a deletion happened and can stop working (or request a cancellation of that deletion, if you follow such an approach).
For reports and calculations, there are many cases where those "scripts" can skip records whose sibling aggregate is gone, so users don't notice that a part is missing or that there is no complete consistency yet.
If for some reason you still need to delete several aggregates in the same transaction you just start a transaction in an application service and use repositories to perform the deletion, analogous to the creation case.
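A bare-bones sketch of that last option, with hypothetical interfaces (the exact unit-of-work shape depends on your stack):
interface UnitOfWork {
  begin(): Promise<void>;
  commit(): Promise<void>;
  rollback(): Promise<void>;
}
interface ProductDetailsRepository {
  delete(productId: string, uow: UnitOfWork): Promise<void>;
}
interface ProductCalculationsRepository {
  delete(productId: string, uow: UnitOfWork): Promise<void>;
}

// Application service: it owns the transaction boundary, while the repositories
// and aggregates stay unaware of how atomicity is achieved.
class DeleteProductService {
  constructor(
    private readonly uow: UnitOfWork,
    private readonly detailsRepo: ProductDetailsRepository,
    private readonly calcRepo: ProductCalculationsRepository
  ) {}

  async execute(productId: string): Promise<void> {
    await this.uow.begin();
    try {
      await this.detailsRepo.delete(productId, this.uow);
      await this.calcRepo.delete(productId, this.uow);
      await this.uow.commit();
    } catch (e) {
      await this.uow.rollback();
      throw e;
    }
  }
}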
So, to summarize:
The rule of "modify one aggregate per transaction" is not that important when there is a creation of aggregates.
Deletion of many aggregates works quite well (most of the time) with eventual consistency, and very often just disabling those aggregates, one at a time, is better than performing the deletion immediately.
Preventing a user from wasting time is better achieved with proper notifications than with transactions.
If there is a real need to perform those actions inside a single transaction, then manage that transaction in the application and be explicit. Using a domain service to perform all the required operations (except for the transaction, which is mostly an application concern) brings that logic back to the domain layer.

Rebuild queries from domain events by multiple aggregates

I'm using a DDD/CQRS/ES approach and I have some questions about modeling my aggregate(s) and queries. As an example consider the following scenario:
A User can create a WorkItem, change its title and associate other users to it. A WorkItem has participants (associated users) and a participant can add Actions to a WorkItem. Participants can execute Actions.
Let's just assume that Users are already created and I only need userIds.
I have the following WorkItem commands:
CreateWorkItem
ChangeTitle
AddParticipant
AddAction
ExecuteAction
These commands must be idempotent, so I can't add the same user or action twice.
And the following query:
WorkItemDetails (all info for a work item)
Queries are updated by handlers that handle domain events raised by WorkItem aggregate(s) (after they're persisted in the EventStore). All these events contain the WorkItemId. I would like to be able to rebuild the queries on the fly, if needed, by loading all the relevant events and processing them in sequence. This is because my users usually won't access WorkItems created one year ago, so I don't need to have these queries processed. So when I fetch a query that doesn't exist, I could rebuild it and store it in a key/value store with a TTL.
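A rough sketch of that fetch-or-rebuild read path (all interfaces and names below are hypothetical, and it assumes the events for a WorkItem can be loaded by WorkItemId, as in the single-aggregate design discussed further down):
type DomainEvent = { type: string; workItemId: string; payload: unknown };
type WorkItemDetails = { workItemId: string; title: string; participants: string[] };

interface EventStore {
  loadEvents(streamId: string): Promise<DomainEvent[]>;
}
interface DetailsCache {
  get(key: string): Promise<WorkItemDetails | null>;
  set(key: string, value: WorkItemDetails, ttlSeconds: number): Promise<void>;
}

async function getWorkItemDetails(
  workItemId: string,
  cache: DetailsCache,
  store: EventStore
): Promise<WorkItemDetails> {
  const cached = await cache.get(workItemId);
  if (cached) return cached;

  // Not cached: rebuild the view by folding the event stream, then cache it with a TTL.
  const events = await store.loadEvents(workItemId);
  const details = events.reduce(applyEvent, { workItemId, title: "", participants: [] });
  await cache.set(workItemId, details, 24 * 60 * 60); // e.g. one-day TTL
  return details;
}

// The same projection handlers a live projector would use.
function applyEvent(view: WorkItemDetails, event: DomainEvent): WorkItemDetails {
  switch (event.type) {
    case "TitleChanged":
      return { ...view, title: (event.payload as { title: string }).title };
    case "ParticipantAdded":
      return { ...view, participants: [...view.participants, (event.payload as { userId: string }).userId] };
    default:
      return view;
  }
}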
Domain events have an aggregateId (used as the event streamId and shard key) and a sequenceId (used as the eventId within an event stream).
So my first attempt was to create a large Aggregate called WorkItem that had a collection of participants and a collection of actions. Participant and Actions are entities that live only within a WorkItem. A participant references a userId and an action references a participantId. They can have more information, but it's not relevant for this exercise. With this solution my large WorkItem aggregate can ensure that the commands are idempotent because I can validate that I don't add duplicate participants or actions, and if I want to rebuild the WorkItemDetails query, I just load/process all the events for a given WorkItemId.
This works fine: since I only have one aggregate, the WorkItemId can be the aggregateId, so when I rebuild the query I just load all events for a given WorkItemId.
However, this solution has the performance issues of a large Aggregate (why load all participants and actions to process a ChangeTitle command?).
So my next attempt is to have different aggregates, all with the same WorkItemId as a property, but only the WorkItem aggregate has it as its aggregateId. This fixes the performance issues, and I can still update the query because all events contain the WorkItemId, but now my problem is that I can't rebuild it from scratch, because I don't know the aggregateIds of the other aggregates, so I can't load their event streams and process them. They have a WorkItemId property, but that's not their real aggregateId. Also, I can't guarantee that I process events sequentially, because each aggregate has its own event stream, but I'm not sure if that's a real problem.
Another solution I can think of is to have a dedicated event stream to consolidate all WorkItem events raised by the multiple aggregates. So I could have event handlers that simply append the events fired by the Participants and Actions to an event stream whose id would be something like "{workItemId}:allevents". This would be used only to rebuild the WorkItemDetails query. This sounds like a hack... basically I'm creating an "aggregate" that has no business operations.
What other solutions do I have? Is it uncommon to rebuild queries on the fly? Can it be done when events from multiple aggregates (multiple event streams) are used to build the same query? I've searched for this scenario and haven't found anything useful. I feel like I'm missing something that should be very obvious, but I haven't figured out what.
Any help on this is very much appreciated.
Thanks
I don't think you should design your aggregates with querying concerns in mind. The Read side is here for that.
On the domain side, focus on consistency concerns (how small can the aggregate be and the domain still remain consistent in a single transaction), concurrency (how big can it be and not suffer concurrent access problems / race conditions ?) and performance (would we load thousands of objects in memory just to perform a simple command ? -- exactly what you were asking).
I don't see anything wrong with on-demand read models. It's basically the same as reading from a live stream, except you re-create the stream when you need it. However, this might be quite a lot of work for not an extraordinary gain, because most of the time entities are queried just after they are modified. If on-demand becomes "basically every time the entity changes", you might as well subscribe to live changes. As for "old" views, the definition of "old" is that they are not modified any more, so they don't need to be recalculated anyway, regardless of whether you have an on-demand or continuous system.
If you go the multiple small aggregates route and your Read Model needs information from several sources to update itself, you have a couple of options:
Enrich emitted events with additional data
Read from multiple event streams and consolidate their data to build the read model. No magic here, the Read side needs to know which aggregates are involved in a particular projection. You could also query other Read Models if you know they are up-to-date and will give you just the data you need.
See CQRS events do not contain details needed for updating read model
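A sketch of the second option, with hypothetical names: the read side keeps a small index of which streams belong to a WorkItem, loads them all, and merges the events before folding them into the view.
type StoredEvent = { streamId: string; sequenceId: number; timestamp: number; type: string; payload: unknown };

interface EventStore {
  loadStream(streamId: string): Promise<StoredEvent[]>;
}
interface WorkItemStreamIndex {
  // Maintained by event handlers: maps a WorkItemId to the aggregate ids
  // (participants, actions) that belong to it.
  streamsFor(workItemId: string): Promise<string[]>;
}

async function loadAllWorkItemEvents(
  workItemId: string,
  index: WorkItemStreamIndex,
  store: EventStore
): Promise<StoredEvent[]> {
  const streamIds = [workItemId, ...(await index.streamsFor(workItemId))];
  const streams = await Promise.all(streamIds.map((id) => store.loadStream(id)));

  // Merge the streams into one sequence. Ordering across streams is only as
  // good as the timestamp (or global position) the store provides, which is
  // usually good enough for a projection but is not a strict total order.
  return streams.flat().sort((a, b) => a.timestamp - b.timestamp);
}

// The merged events are then folded into the WorkItemDetails view exactly as
// they would be when projecting from a single stream.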

In DDD, a UoW per Repository or Bounded Context or Transaction?

In DDD, an aggregate root can have a repository. Let us take an Order aggregate and its non-persistent counterpart OrderRepository and persistent counterpart OrderUoW. We also have a ProductVariant aggregate which tracks the inventory of the products in the order. It can have a ProductVariantRepository and a ProductVariantUoW.
The way the Order and the ProductVariant work is that before the order is persisted, the inventory is checked. If there is inventory, the order will be persisted by calling OrderUoW.Commit(). ProductVariantUoW.Commit() will then be called to update the inventory of the products.
Unfortunately, things can go bad: another user bought the same products in that short time (consider this as a web app where two users are buying the same products). Now the whole transaction for the second user should fail by reverting the order that was just created. Should I call the OrderUoW to roll back the changes (the order should be deleted from the db)? Or should I put both UoW.Commit() operations in a transaction scope, so that the failure of one Commit() rolls back the changes? Or should both repositories (Order, ProductVariant) share a single UoW with only one transaction scope?
To make the story short: how is the transaction handled when there are multiple repositories involved?
A question we could ask is who is doing the following:
The way the Order and the ProductVariant work is that before the order
is persisted, the inventory is checked. If there is inventory, the
order will be persisted by calling OrderUoW.Commit(). Yes, the
ProductVariantUoW.Commit() will be called next to update the inventory
of the products.
Some argue that this kind of work belongs in the service layer, which allows the service layer to put things crossing aggregate objects into a single transaction.
According to http://www.infoq.com/articles/ddd-in-practice:
Some developers prefer managing the transactions in the DAO classes
which is a poor design. This results in too fine-grained transaction
control which doesn't give the flexibility of managing the use cases
where the transactions span multiple domain objects. Service classes
should handle transactions; this way even if the transaction spans
multiple domain objects, the service class can manage the transaction
since in most of the use cases the Service class handles the control
flow.
I think that, as an alternative to using a single transaction, you can claim the inventory using ProductVariant and, if all the necessary inventory items are available, then commit the order. Otherwise (i.e. you can't claim all the products you need for the order) you have to return the inventory that was successfully claimed using compensating transactions. The result is that in the case of an unsuccessful commit of an order, some of the inventory will temporarily appear unavailable for other orders, but the advantage is that you can work without a distributed transaction.
Nonetheless, this logic still belongs in the service layer, not the DAO classes.
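A simplified sketch of that claim-then-compensate flow, with hypothetical interfaces (the real claim/release would map to stock updates with their own guards):
type Order = { id: string; lines: { variantId: string; quantity: number }[] };

interface InventoryService {
  claim(variantId: string, quantity: number): Promise<boolean>; // false = not enough stock
  release(variantId: string, quantity: number): Promise<void>;  // compensating action
}
interface OrderRepository {
  add(order: Order): Promise<void>;
}

// Service-layer use case: claim every line first; if any claim fails, release
// what was already claimed instead of relying on a distributed transaction.
async function placeOrder(order: Order, inventory: InventoryService, orders: OrderRepository): Promise<void> {
  const claimed: { variantId: string; quantity: number }[] = [];
  for (const line of order.lines) {
    const ok = await inventory.claim(line.variantId, line.quantity);
    if (!ok) {
      for (const c of claimed) {
        await inventory.release(c.variantId, c.quantity); // give back earlier claims
      }
      throw new Error(`Insufficient stock for ${line.variantId}`);
    }
    claimed.push(line);
  }
  // All claims succeeded. Until this point other orders may have seen the stock
  // as temporarily unavailable, which is the trade-off described above.
  await orders.add(order);
}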
The way you are using unit of work seems a bit fine-grained. Just in case you haven't read Martin Fowler's take: http://martinfowler.com/eaaCatalog/unitOfWork.html
That being said you want to handle the transaction at the use-case level. The fact that the inventory is checked up-front is simply a convenience (UX) and the stock level should be checked when persisting the various bits also. An exception can be raised for insufficient stock.
The transaction isolation level should be set such that the two 'simultaneous' parts are performed serially. So whichever one gets to update the stock levels first is going to 'win'. The second will then raise the exception.
If you can use a single UoW then do so, because it's easier.
If your repositories are on different DBs (or maybe one is file-based and the others are not) then you may be forced to use multiple UoWs, but then you're writing rollback commands too, because if UoW1 saves changes to SqlRepo OK but UoW2 then fails to save changes to FileRepo, you need to roll back SqlRepo. Don't bother writing all that rollback command stuff if you can avoid it!
